srun_cr(1)			Slurm Commands			    srun_cr(1)

       srun_cr - run parallel jobs with	checkpoint/restart support

       srun_cr [OPTIONS...]

       The design of srun_cr is inspired by mpiexec_cr from MVAPICH2 and
       cr_restart from BLCR.  It is a wrapper around the srun command that
       enables batch job checkpoint/restart support when used with Slurm's
       checkpoint/blcr plugin.

       The srun_cr command-line options are identical to those of the srun
       command.  See "man srun" for details.

       After initialization, srun_cr registers a thread context callback
       function.  It then forks a process that executes "cr_run --omit srun"
       with srun_cr's own arguments; cr_run is used to exclude the srun
       process from being dumped at checkpoint time.  All catchable signals
       sent to srun_cr, except SIGCHLD, are forwarded to the child srun
       process.  SIGCHLD is captured so that srun_cr can mimic the exit
       status of srun when it exits.  srun_cr then loops, waiting for
       termination of the tasks launched from srun.
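
       The fork-and-forward behavior described above can be sketched in
       Python.  This is only an illustration under stated assumptions, not
       the srun_cr implementation: run_and_forward is a hypothetical helper,
       any command may stand in for "cr_run --omit srun", and reporting a
       signal death as 128 plus the signal number is the usual shell
       convention, assumed here rather than taken from srun_cr.

```python
import signal
import subprocess
import sys


def run_and_forward(argv):
    """Run argv as a child, forward catchable signals to it, and return
    an exit status that mirrors the child's (hypothetical helper)."""
    child = subprocess.Popen(argv)

    def forward(signum, _frame):
        # Relay the signal only while the child is still running.
        if child.poll() is None:
            child.send_signal(signum)

    # SIGKILL and SIGSTOP cannot be caught; SIGCHLD is not forwarded.
    skipped = {signal.SIGKILL, signal.SIGSTOP, signal.SIGCHLD}
    previous = {}
    for sig in signal.valid_signals() - skipped:
        try:
            previous[sig] = signal.signal(sig, forward)
        except (OSError, ValueError):
            continue  # handler cannot be installed for this signal

    try:
        status = child.wait()
    finally:
        # Put back the handlers that were installed from Python before.
        for sig, handler in previous.items():
            if handler is not None:
                signal.signal(sig, handler)

    # Assumption: a child killed by signal N is reported as 128 + N.
    return status if status >= 0 else 128 - status


if __name__ == "__main__":
    sys.exit(run_and_forward(sys.argv[1:] or [sys.executable, "-c", "pass"]))
```

       The sketch waits on the child directly instead of installing a
       SIGCHLD handler, which keeps it short; srun_cr itself captures
       SIGCHLD as described above.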

       The step	launch logic of	Slurm is augmented to check if srun is running
       under  srun_cr.	If true, the environment variable SLURM_SRUN_CR_SOCKET
       should be present, the value of which is	the address of a  Unix	domain
       socket created and listened on by srun_cr.  After launching the tasks,
       srun tries to connect to	the socket and sends the job ID, step  ID  and
       the nodes allocated to the step to srun_cr.
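
       The handshake over the Unix domain socket can be sketched as follows.
       Only the environment variable name SLURM_SRUN_CR_SOCKET and the three
       fields (job ID, step ID, node list) come from the text above.  The
       function names and the one-line text message format are assumptions
       made for illustration; the format Slurm actually uses is not
       described here.

```python
import os
import socket
import tempfile


def open_listener(path):
    """Wrapper side: create a Unix domain socket and listen on it."""
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    server.bind(path)
    server.listen(1)
    return server


def report_step(env, job_id, step_id, nodes):
    """Launcher side: if the socket variable is present, connect and send
    the step information.  Returns False when not run under the wrapper."""
    path = env.get("SLURM_SRUN_CR_SOCKET")
    if path is None:
        return False
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as client:
        client.connect(path)
        # Assumed message format: "<job_id> <step_id> <nodes>\n".
        client.sendall(f"{job_id} {step_id} {nodes}\n".encode())
    return True


def receive_step(server):
    """Wrapper side: accept one connection and parse the step information."""
    conn, _ = server.accept()
    with conn, conn.makefile("r") as stream:
        job_id, step_id, nodes = stream.readline().split()
    return int(job_id), int(step_id), nodes


if __name__ == "__main__":
    sock_path = os.path.join(tempfile.mkdtemp(), "srun_cr.sock")
    listener = open_listener(sock_path)
    child_env = dict(os.environ, SLURM_SRUN_CR_SOCKET=sock_path)
    report_step(child_env, 1234, 0, "node[01-04]")
    print(receive_step(listener))
    listener.close()
```

       Passing the socket address through the environment lets the launcher
       detect the wrapper without any extra command-line option: when the
       variable is absent, the launcher simply skips the report.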

       Upon checkpoint,	srun_cr	checks to see if the tasks have	been launched.
       If so, srun_cr first forwards the checkpoint request to the tasks by
       calling	the  Slurm  API	 slurm_checkpoint_tasks()  before  dumping its
       process context.

       Upon restart, srun_cr checks to see if the tasks	have  been  previously
       launched	  and	checkpointed.	 If  true,  the	 environment  variable
       SLURM_RESTART_DIR is set	to the directory of the	checkpoint image files
       of the tasks.  Then srun	is forked and executed again.  The environment
       variable	will be	used by	the srun command to restart execution  of  the
       tasks from the previous checkpoint.
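
       The restart step can be sketched as follows.  Only the variable name
       SLURM_RESTART_DIR comes from the text above; the helper names, the
       example directory and the stand-in command are assumptions made for
       illustration.

```python
import os
import subprocess
import sys


def restart_environment(base_env, image_dir):
    """Return a copy of base_env for the relaunched command.  When image_dir
    is given (tasks were launched and checkpointed), SLURM_RESTART_DIR
    points at the directory holding the task checkpoint images."""
    env = dict(base_env)
    if image_dir is not None:
        env["SLURM_RESTART_DIR"] = image_dir
    return env


def relaunch(argv, env):
    """Run the command again with the prepared environment and return its
    exit status."""
    return subprocess.run(argv, env=env, check=False).returncode


if __name__ == "__main__":
    # Hypothetical image directory; a real one is site and job specific.
    env = restart_environment(os.environ, "/tmp/ckpt/1234.0")
    # A Python one-liner stands in for srun and prints the variable it sees.
    relaunch([sys.executable, "-c",
              "import os; print(os.environ['SLURM_RESTART_DIR'])"], env)
```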

       Copyright  (C)  2009  National University of Defense Technology,	China.
       Produced	at National University of Defense Technology, China (cf,  DIS-

       This  file  is  part  of	Slurm, a resource management program.  For de-
       tails, see <>.

       Slurm is	free software; you can redistribute it and/or modify it	 under
       the  terms  of  the GNU General Public License as published by the Free
       Software	Foundation; either version 2 of	the License, or	(at  your  op-
       tion) any later version.

       Slurm  is  distributed  in the hope that	it will	be useful, but WITHOUT
       ANY WARRANTY; without even the implied warranty of  MERCHANTABILITY  or
       FITNESS	FOR  A PARTICULAR PURPOSE.  See	the GNU	General	Public License
       for more	details.


April 2015			Slurm Commands			    srun_cr(1)

