Condor User Checkpoints
by Matt Terry
Thanks to Peter Keller for answering my questions. Quotations used with permission.
If you've used condor for job scheduling, you've probably been introduced to condor checkpoints. If you compile your program with condor_compile, condor will be able to pause your job and restart it later, possibly on different hardware. The ability to checkpoint and restart your program automatically frees you from the need to have dedicated hardware. You can scavenge cycles wherever you find them. It also ensures you don't lose your work if someone more important comes along and boots you from your machine.
Sometimes you will be unable or unwilling to use condor_compile to compile your executable. One of my programs mixes C++ and Fortran and I've had difficulty building it with condor_compile. Perhaps you're porting existing software with existing restart capabilities and you don't want to circumvent that mechanism. Thankfully, condor is capable of handling user checkpoints.
Before working with condor, your program must be able to write checkpoint files and restart from a checkpoint file without additional command line arguments. Next, your program must have a signal handler that causes it to write a checkpoint and quit when a particular signal is sent. Say your program is named bucky and it watches for SIGTSTP (20). The following sequence of shell commands should then work and make sense. You can find a listing of standard linux signals here.

    $ bucky &
    # Get the PID from the second column
    $ ps ux | grep bucky
    # Wait for the program to get started up and do stuff
    $ sleep 300
    # Send signal 20 to tell your program to checkpoint and quit
    $ kill -s 20 PID
    # Wait for bucky to clean up and quit
    $ sleep 60
    # bucky wrote a checkpoint
    $ ls *.ckpt
    1234.ckpt
    # Run bucky again, automatically restarting from 1234.ckpt
    $ bucky

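The requirements above (write a checkpoint and quit when signaled, restart with no extra arguments) can be sketched with a small stand-in for bucky. This is only an illustration in bash, not the real program: the checkpoint file name bucky.ckpt, the MAX_STEPS counter, and the do_step and bucky_main functions are all hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for bucky: "works" for MAX_STEPS steps and
# checkpoints its step counter when it receives SIGTSTP.
CKPT_FILE="${CKPT_FILE:-bucky.ckpt}"   # assumed checkpoint file name
MAX_STEPS="${MAX_STEPS:-3}"            # assumed amount of work

stop_requested=0
# Keep the handler tiny: record the request, checkpoint at a safe point later.
trap 'stop_requested=1' TSTP

# One unit of "real" work; replace with the actual computation.
do_step() { sleep 1; }

bucky_main() {
    step=0
    # Restart from the checkpoint automatically, with no extra arguments.
    if [ -f "$CKPT_FILE" ]; then
        step=$(cat "$CKPT_FILE")
    fi
    while [ "$step" -lt "$MAX_STEPS" ]; do
        if [ "$stop_requested" -eq 1 ]; then
            echo "$step" > "$CKPT_FILE"   # write the checkpoint...
            return 0                      # ...and quit cleanly
        fi
        do_step
        step=$((step + 1))
    done
    rm -f "$CKPT_FILE"                    # finished, nothing left to restart
    echo "done after $step steps"
}

bucky_main
```

Saved as an executable named bucky (with a larger MAX_STEPS), this should behave like the session above, except that the checkpoint is called bucky.ckpt. Bash runs the trap only after the command in progress finishes, so the checkpoint is written at the end of the current step. A compiled program would typically do much the same thing: set a flag in its signal handler and check that flag in its main loop.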
Now you're ready to start running your program on condor. Compile your program as normal. You're going to be running vanilla universe jobs.
When a condor job is preempted it is politely asked to quit by sending a signal (default is SIGTERM). The job is given some amount of time (policy is 10 minutes on CHTC) to clear out before being un-politely killed via SIGKILL. User checkpointing works by catching the eviction signal, writing a checkpoint, and automatically restarting from that checkpoint when the program is re-executed.
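To rehearse this sequence on your own machine before involving condor, the polite-then-forceful shutdown can be imitated with a few lines of shell. This is a sketch of the behavior as described above, not of condor's actual implementation. The evict function and its arguments are made up for illustration, and the 600 second default simply mirrors the 10 minute CHTC policy.

```shell
# Hypothetical helper: imitate an eviction of a running process.
# usage: evict PID [SOFT_SIGNAL] [GRACE_SECONDS]
evict() {
    pid=$1
    soft_sig=${2:-TERM}   # the default eviction signal is SIGTERM
    grace=${3:-600}       # 10 minutes, the CHTC policy quoted above
    waited=0

    # Ask politely first; if the process is already gone there is nothing to do.
    kill -s "$soft_sig" "$pid" 2>/dev/null || return 0

    # Poll once per second until the process exits or the grace period ends.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$grace" ]; then
            kill -s KILL "$pid" 2>/dev/null   # out of patience
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

For example, `evict PID TSTP 60` sends SIGTSTP and allows one minute for the checkpoint to be written. A return status of 1 means the program did not quit in time and had to be killed, which on a real pool would mean losing the work since the last checkpoint.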
Here's an example condor_submit file to enable user checkpointing.
| Line | Submit file entry |
|---|---|
| 1 | universe = vanilla |
| 2 | executable = bucky |
| 3 | kill_sig = 20 |
| 4 | should_transfer_files = yes |
| 5 | when_to_transfer_output = on_exit_or_evict |
| 6 | log = bucky.log |
| 7 | queue |
- Line 1: This is a vanilla universe job (not standard universe!).
- Line 2: The executable is named bucky.
- Line 3: Condor will send signal 20 (SIGTSTP on linux) to our program on eviction. Our program should treat it as the request to checkpoint and quit.
- Line 4: You should transfer files. If you're on a shared filesystem, if_needed might work, but I don't have experience here.
- Line 5: Make sure that files get transferred on eviction. If they don't, your checkpoint file won't follow your job.
- Line 6: Log condor events in bucky.log.
Caveats
Checkpoint logging is a standard universe feature
Confusingly, your log (bucky.log in the example) will always report that your "Job was not checkpointed.", even when it actually was. This is because condor knows nothing about your checkpointing. Condor only knows that it sent a signal and your program quit. "Job successfully checkpointed" is strictly a standard universe feature.
hold_kill_sig is not applicable to vanilla jobs
You might have found hold_kill_sig in the condor documentation. hold_kill_sig only applies to scheduler and local universes. We're in the vanilla universe, so it is not helpful for us.
Use condor_vacate_job rather than condor_hold to test checkpointing
According to condor developer Peter Keller:
"You have to know a very non-obvious fact: condor_hold in the vanilla universe causes a *fast* removal of a job from the execution machine and isn't treated like a normal eviction."
condor_vacate_job, however, will send eviction signals and do what you expect.
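A test along these lines can be wrapped in a small helper. This is a hedged sketch rather than a recipe from the condor developers: the vacate_and_check function, its arguments, and the 60 second default wait are assumptions, and only condor_vacate_job itself is a real condor command. Since the log reports "Job was not checkpointed." on every vanilla eviction, finding that line only shows that the eviction was recorded; whether your program really wrote its checkpoint still has to be judged from how it restarts.

```shell
# Hypothetical helper: vacate a job, then look for the eviction record.
# usage: vacate_and_check JOB_ID [LOG_FILE] [WAIT_SECONDS]
vacate_and_check() {
    job_id=$1                  # cluster.proc, for example as listed by condor_q
    log_file=${2:-bucky.log}   # the log named in the submit file
    wait_s=${3:-60}            # assumed time for the job to checkpoint and quit

    # Trigger a normal eviction (unlike condor_hold, which is a fast removal).
    condor_vacate_job "$job_id" || return 1

    sleep "$wait_s"

    # Use a fresh log file, or this may match an earlier eviction.
    grep -q "Job was not checkpointed" "$log_file"
}
```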
File synchronization may not happen until your job completes
Say you start a job and then use condor_vacate_job to induce writing a checkpoint. You might expect that checkpoint file to show up in your program's run directory. Depending on your condor pool configuration, this may not happen. In response to my question as to where my checkpoint files went, Peter Keller replied:
"Condor transferred these files to a spool directory under its control and will transfer them back to the job once it starts running again. You don't see these files until the job completes and all of them are brought back to the initialdir of the job."