Next: Migrating a Job Instance
Up: Job Management
Previous: Stopping a Job Instance
  Contents
Having tired of writing endless reams of your own code, you've decided to outsource
some of this work. While perusing some internet message groups, you heard of this
setup with monkeys and typewriters trying to reproduce the works of Shakespeare. What
a great idea, you think. Why not do the same with code?
So, after many years of trekking through deepest, darkest Africa to scout out the
best of the best of the baboons, you're off to work. The only problem is that the code
these baboons produce, well, isn't of the highest quality. It keeps on crashing. Whoever
wrote that post in that internet group was an idiot! But here you are with code that may
be a bit unstable, but which you need to run. One way of lessening the burden in such
a situation is to use checkpointing.
Checkpointing is a procedure in which a job's state is saved to long-term storage.
Hence, if a process is in the midst of a critical calculation, but may crash at any
moment, one can save the state of this job by checkpointing it. After a job has been
checkpointed, it can crash, but one will lose only the results calculated after the
job was checkpointed and before it crashed. So, you don't lose the whole run,
just the little niggling bits around the edges.
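The general idea can be illustrated with a small, self-contained sketch in plain C. It does not use GAT at all; the CalcState structure, the function names, and the file format are all hypothetical and chosen only for illustration. A toy calculation periodically writes its state to a file, and a later run can pick up from the last saved state instead of starting over.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical state of a long-running calculation (not a GAT type). */
typedef struct {
    long next_step;   /* first step that has not been executed yet */
    double partial;   /* running result of the calculation so far */
} CalcState;

/* Write the state to `path`. Returns 0 on success, -1 on failure. */
int save_checkpoint(const char *path, const CalcState *state)
{
    assert(path != NULL && state != NULL);
    FILE *fp = fopen(path, "wb");
    if (fp == NULL)
        return -1;
    size_t written = fwrite(state, sizeof *state, 1, fp);
    int closed = fclose(fp);
    return (written == 1 && closed == 0) ? 0 : -1;
}

/* Read a previously saved state. Returns 0 on success, -1 on failure. */
int load_checkpoint(const char *path, CalcState *state)
{
    assert(path != NULL && state != NULL);
    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;
    size_t got = fread(state, sizeof *state, 1, fp);
    fclose(fp);
    return (got == 1) ? 0 : -1;
}

/* Execute steps from state->next_step up to (not including) `stop`,
 * saving a checkpoint after every `interval` completed steps. */
void run_steps(CalcState *state, long stop, long interval, const char *path)
{
    assert(state != NULL && interval > 0);
    while (state->next_step < stop) {
        state->partial += (double)state->next_step;  /* the "work" */
        state->next_step++;
        if (state->next_step % interval == 0)
            save_checkpoint(path, state);  /* errors ignored in this sketch */
    }
}
```

If the calculation stops unexpectedly at step 25 with an interval of 10, loading the checkpoint yields the state as of step 20, so only the five steps after the last checkpoint have to be redone. A real job would also have to guard against crashing in the middle of writing the checkpoint file, which this sketch does not attempt.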
GAT allows one to checkpoint properly instrumented jobs. (One can determine whether
a job is properly instrumented by calling the GATJob_GetInfo function and
looking for the value of the key checkpointable.) However, a GATJob
must be in the proper state before it can be checkpointed. As a GATJob
must be running to save its state, it is only possible to checkpoint
a GATJob in the running state. In particular, the checkpointing state
is a substate of the running state, as illustrated in the figure below.
Figure: Detail of the GATJob state diagram.
One can checkpoint a properly instrumented job by making a call to the function
GATResult GATJob_Checkpoint(GATJob_const object)
The only argument to this function is a GATJob instance identifying the
GATJob to checkpoint. The function returns a GATResult,
covered in the Appendix, which indicates its completion
status. This is a non-blocking call; in other words, it does not wait until the
checkpoint is actually completed. It simply delivers the checkpointing request
to the job, then returns immediately. One should also note that this function will
not complete successfully unless the passed GATJob is in the running
state. As mentioned previously, the operation of checkpointing a GATJob
is not even defined for a job not in the running state; so, it should come as no
surprise that this function will fail when acting upon a non-running GATJob.
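The rules just described can be summarized in a toy model. The sketch below is plain C and deliberately does not call GAT: ToyJob, ToyJobState, the state names, and toy_job_checkpoint are hypothetical stand-ins, and the int return value merely plays the role of a GATResult. It models two points only: a request succeeds solely for a checkpointable job in the running state, and a successful request is merely recorded, not carried out, before the function returns.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical job states; these names are not taken from GAT. */
typedef enum { TOY_SUBMITTED, TOY_RUNNING, TOY_STOPPED } ToyJobState;

/* Hypothetical stand-in for a job; not the real GATJob type. */
typedef struct {
    ToyJobState state;        /* current state of the job */
    bool checkpointable;      /* is the job properly instrumented? */
    bool checkpoint_pending;  /* has a request been delivered but not yet served? */
} ToyJob;

/* Deliver a checkpoint request to the job.
 * Returns 0 if the request was delivered, -1 if it was rejected.
 * Like the non-blocking call described above, it only records the
 * request and returns; it does not perform or await the checkpoint. */
int toy_job_checkpoint(ToyJob *job)
{
    assert(job != NULL);
    if (!job->checkpointable)
        return -1;  /* job is not instrumented for checkpointing */
    if (job->state != TOY_RUNNING)
        return -1;  /* checkpointing is undefined outside the running state */
    job->checkpoint_pending = true;
    return 0;
}
```

With the real API, the analogous pattern would be to inspect the GATResult returned by GATJob_Checkpoint rather than assume the request was accepted, since a job that has already left the running state causes the call to fail.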
Andre Merzky
2004-05-13