Skip site navigation (1)Skip section navigation (2)

FreeBSD Manual Pages


home | help
OPAL_CRS(7)			   Open	MPI			   OPAL_CRS(7)

       OPAL_CRS	 -  Open PAL MCA Checkpoint/Restart Service (CRS): Overview of
       Open PAL's CRS framework, and selected modules.	Open MPI 4.0.5.

       Open PAL	can involuntarily checkpoint and restart sequential  programs.
       Doing  so  requires  that Open PAL was compiled with thread support and
       that the	back-end checkpointing systems are available at	run-time.

   Phases of Checkpoint	/ Restart
       Open PAL	defines	three phases for checkpoint /  restart	support	 in  a

	   When	 the  checkpoint  request arrives, the procress	is notified of
	   the request before the checkpoint is	taken.

	   After a checkpoint has successfully completed, the same process  as
	   the checkpoint is notified of its successful	continuation of	execu-

	   After a checkpoint has successfully completed, a  new  /  restarted
	   process is notified of its successful restart.

       The Continue and	Restart	phases are identical except for	the process in
       which they are invoked. The Continue  phase  is	invoked	 in  the  same
       process	as the Checkpoint phase	was invoked. The Restart phase is only
       invoked in newly	restarted processes.

       In order	for a process to use the Open PAL CRS components it  must  ad-
       hear to a few programmatic requirements.

       First,  the  program  must  call	OPAL_INIT early	in its execution. This
       should only be called once, and it is not possible  to  checkpoint  the
       process without it first	having called this function.

       The  program  must  call	 OPAL_FINALIZE before termination. This	does a
       significant amount of cleanup. If it is not called,  then  it  is  very
       likely that remnants are	left in	the filesystem.

       To  checkpoint and restart a process you	must use the Open PAL tools to
       do so. Using the	backend	checkpointer's checkpoint  and	restart	 tools
       will   lead  to	undefined  behavior.   To  checkpoint  a  process  use
       opal_checkpoint	(opal_checkpoint(1)).	To  restart  a	 process   use
       opal_restart (opal_restart(1)).

       Open PAL	ships with two CRS components: self and	blcr.

       The following MCA parameters apply to all components:

	   Set the verbosity level for all components. Default is 0, or	silent
	   except on error.

   self	CRS Component
       The self	component invokes user-defined functions to save  and  restore
       checkpoints.  It	is simply a mechanism for user-defined functions to be
       invoked at Open PAL's Checkpoint, Continue, and Restart phases.	Hence,
       the only	data that is saved during the checkpoint is what is written in
       the user's checkpoint function. No libary state is saved	at all.

       As such,	the model for the self component is slightly differnt than for
       other  components. Specifically,	the Restart function is	not invoked in
       the same	process	image  of  the	process	 that  was  checkpointed.  The
       Restart	phase  is  invoked during OPAL_INIT of the new instance	of the
       applicaiton (i.e., it starts over from main()).

       The self	component has the following MCA	parameters:

	   Speficy a string prefix for the name	of the	checkpoint,  continue,
	   and	restart	functions that Open PAL	will invoke during the respec-
	   tive	stages.	That is,  by  specifying  "-mca	 crs_self_prefix  foo"
	   means that Open PAL expects to find three functions at run-time:

	      int foo_checkpoint()

	      int foo_continue()

	      int foo_restart()

	   By default, the prefix is set to "opal_crs_self_user".

	   Set the self	components default priority

	   Set the verbosity level. Default is 0, or silent except on error.

	   This	is mostly internally used. A general user should never need to
	   set this value. This	is set to non-0	when a the new process	should
	   invoke  the	restart	callback in OPAL_INIT. Default is 0, or	normal

   blcr	CRS Component
       The Berkeley Lab	Checkpoint/Restart (BLCR) single-process checkpoint is
       a  software  system developed at	Lawrence Berkeley National Laboratory.
       See the project website for more	details:

       The blcr	component has the following MCA	parameters:

	   Set the blcr	components default priority.

	   Set the verbosity level. Default is 0, or silent except on error.

   none	CRS Component
       The none	component simply selects no CRS	 component.  All  of  the  CRS
       function	calls return immediately with OPAL_SUCCESS.

       This  component	is  the	last component to be selected by default. This
       means that if another component is available, and  the  none  component
       was  not	 explicity requested then OPAL will attempt to activate	all of
       the available components	before falling back to this component.

	 opal_checkpoint(1), opal_restart(1)

4.0.5				 Aug 26, 2020			   OPAL_CRS(7)


Want to link to this manual page? Use this URL:

home | help