GridLab
Grid Application Toolkit

A simple API for Grid Applications
GAT


Checkpointing a Job Instance

As you've tired of writing endless reams of your own code, you've decided to outsource some of this work. While perusing some internet message groups, you heard of this setup with monkeys and typewriters trying to reproduce the works of Shakespeare. What a great idea, you think. Why not do the same with code?

So, after many years of trekking through deepest, darkest Africa to scout out the best of the best of the baboons, you're off to work. The only problem is that the code these baboons produce, well, isn't of the highest quality. It keeps on crashing. Whoever wrote that post in that internet group was an idiot! But here you are with code that may be a bit unstable, but which you need to run. One way of lessening the burden in such a situation is to use checkpointing.

Checkpointing is a procedure in which a job's state is saved to long-term storage. Hence, if a process is in the midst of a critical calculation, but may crash at any moment, one can save the state of this job by checkpointing it. A checkpointed job can still crash, but one will only lose the results calculated after the job was checkpointed and before it crashed. So, you don't lose the whole run, just the little niggling bits around the edges.
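
To make the idea concrete, here is a minimal sketch of checkpointing done by hand, inside an application, in plain C. It is not GAT code and says nothing about how GAT or an instrumented job stores its state; the RunState structure, the file name used in the usage note, and the functions save_checkpoint, load_checkpoint and run_steps are all hypothetical names invented for this illustration. The "work" is a running sum, standing in for a real calculation.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical application state: everything needed to resume the run. */
typedef struct {
    long next_step;  /* first step that has not been completed yet */
    double partial;  /* result accumulated so far */
} RunState;

/* Save the state to long-term storage. Returns 0 on success, -1 on failure. */
int save_checkpoint(const char *path, const RunState *state)
{
    assert(path != NULL && state != NULL);
    FILE *fp = fopen(path, "wb");
    if (fp == NULL) {
        return -1;
    }
    size_t written = fwrite(state, sizeof *state, 1, fp);
    if (fclose(fp) != 0 || written != 1) {
        return -1;
    }
    return 0;
}

/* Restore the last saved state. Returns 0 on success, -1 if there is no
 * usable checkpoint; in that case *state is left untouched. */
int load_checkpoint(const char *path, RunState *state)
{
    assert(path != NULL && state != NULL);
    FILE *fp = fopen(path, "rb");
    if (fp == NULL) {
        return -1;
    }
    RunState tmp;
    size_t got = fread(&tmp, sizeof tmp, 1, fp);
    fclose(fp);
    if (got != 1) {
        return -1;
    }
    *state = tmp;
    return 0;
}

/* Run steps state->next_step .. last_step-1, writing a checkpoint each
 * time the number of completed steps is a multiple of `interval`. */
int run_steps(const char *path, RunState *state, long last_step, long interval)
{
    assert(state != NULL && interval > 0);
    while (state->next_step < last_step) {
        state->partial += (double)state->next_step;  /* stand-in for real work */
        state->next_step++;
        if (state->next_step % interval == 0 &&
            save_checkpoint(path, state) != 0) {
            return -1;
        }
    }
    return 0;
}
```

A program built on this sketch would call load_checkpoint at startup, fall back to a zeroed RunState if that fails, and then call run_steps. If it crashes after step 24 with an interval of 10, the next start resumes from step 20, so only the work done since the last checkpoint is repeated.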

GAT allows one to checkpoint properly instrumented jobs. (One can determine whether a job is properly instrumented by calling the GATJob_GetInfo function and looking up the value of the key checkpointable.) However, a GATJob must be in the proper state before it can be checkpointed. Because a job must be running for its state to be saved, a GATJob can only be checkpointed while it is in the running state. In particular, the checkpointing state is a substate of the running state, as illustrated in the figure below.

Figure: Detail of the GATJob state diagram.

One can checkpoint a properly instrumented job by making a call to the function

GATResult GATJob_Checkpoint(GATJob_const object)

The only argument to this function is a GATJob instance identifying the GATJob to checkpoint. The function returns a GATResult, covered in Appendix [*], which indicates its completion status. This is a non-blocking call; in other words, it does not wait until the checkpoint is actually completed. It simply delivers the checkpointing request to the job, then returns immediately. One should also note that this function will not complete successfully unless the passed GATJob is in the running state. As mentioned previously, the operation of checkpointing a GATJob is not even defined for a job that is not in the running state; so, it should come as no surprise that this function will fail when acting upon a non-running GATJob.
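
The rules just described (the job must be checkpointable, it must be running, and the call only delivers a request) can be modelled in a few lines of C. The sketch below is a toy model, not the GAT implementation and not the GAT API: every name beginning with Sketch, SKETCH_ or sketch_ is a hypothetical stand-in, the set of states is reduced to three for illustration, and the checkpointable flag merely mimics the checkpointable key mentioned earlier.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified job states (not the full GATJob state diagram). */
typedef enum {
    SKETCH_JOB_SCHEDULED,
    SKETCH_JOB_RUNNING,
    SKETCH_JOB_STOPPED
} SketchJobState;

/* Hypothetical completion status, standing in for a GATResult. */
typedef enum {
    SKETCH_OK = 0,
    SKETCH_ERR_NOT_CHECKPOINTABLE,
    SKETCH_ERR_NOT_RUNNING
} SketchResult;

/* Hypothetical job record. */
typedef struct {
    SketchJobState state;
    bool checkpointable;      /* mimics the "checkpointable" info key */
    bool checkpoint_pending;  /* request delivered, not yet carried out */
    int checkpoints_done;     /* checkpoints the job has actually written */
} SketchJob;

/* Non-blocking request: it only records that a checkpoint was asked for
 * and returns at once. It fails for jobs that are not instrumented or
 * not in the running state. */
SketchResult sketch_job_checkpoint(SketchJob *job)
{
    assert(job != NULL);
    if (!job->checkpointable) {
        return SKETCH_ERR_NOT_CHECKPOINTABLE;
    }
    if (job->state != SKETCH_JOB_RUNNING) {
        return SKETCH_ERR_NOT_RUNNING;
    }
    job->checkpoint_pending = true;
    return SKETCH_OK;
}

/* Models the job acting on a delivered request later, on its own time. */
void sketch_job_service_requests(SketchJob *job)
{
    assert(job != NULL);
    if (job->state == SKETCH_JOB_RUNNING && job->checkpoint_pending) {
        job->checkpoints_done++;
        job->checkpoint_pending = false;
    }
}
```

In this model a successful return from sketch_job_checkpoint means only that the request was accepted; checkpoints_done does not change until sketch_job_service_requests runs. The same caution applies to the real call: a successful GATResult from GATJob_Checkpoint indicates that the request was delivered, not that the checkpoint has been written.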


Andre Merzky 2004-05-13