MPIRUN(1)			   Open	MPI			     MPIRUN(1)

       orterun, mpirun, mpiexec - Execute serial and parallel jobs in Open
       MPI.  oshrun, shmemrun - Execute serial and parallel jobs in Open
       SHMEM.

       Note:  mpirun,  mpiexec,	and orterun are	all synonyms for each other as
       well as oshrun, shmemrun	in case	Open SHMEM is installed.  Using	any of
       the names will produce the same behavior.

       Single Process Multiple Data (SPMD) Model:

       mpirun [	options	] <program> [ <args> ]

       Multiple	Instruction Multiple Data (MIMD) Model:

       mpirun [	global_options ]
	      [	local_options1 ] <program1> [ <args1> ]	:
	      [	local_options2 ] <program2> [ <args2> ]	:
	      ... :
	      [	local_optionsN ] <programN> [ <argsN> ]

       Note  that in both models, invoking mpirun via an absolute path name is
       equivalent to specifying	the --prefix option with a _dir_ value equiva-
       lent  to	 the  directory	where mpirun resides, minus its	last subdirec-
       tory.  For example:

	   % /usr/local/bin/mpirun ...

       is equivalent to

	   % mpirun --prefix /usr/local
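
       The rule above (the directory where mpirun resides, minus its last
       subdirectory) can be sketched as a small path computation.  This is
       only a model of the documented equivalence, not Open MPI's own code;
       the function name implied_prefix is hypothetical.

```python
import posixpath


def implied_prefix(mpirun_path):
    """Return the --prefix value implied by an absolute mpirun path.

    Hypothetical helper: models the documented rule only.
    """
    if not posixpath.isabs(mpirun_path):
        # The documented equivalence is stated for absolute paths only.
        return None
    bin_dir = posixpath.dirname(mpirun_path)  # directory where mpirun resides
    return posixpath.dirname(bin_dir)         # minus its last subdirectory


print(implied_prefix("/usr/local/bin/mpirun"))
```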

       If you are simply looking for how to run	an MPI application, you	proba-
       bly want	to use a command line of the following form:

	   % mpirun [ -np X ] [	--hostfile <filename> ]	 <program>

       This  will  run X copies	of _program_ in	your current run-time environ-
       ment (if	running	under a	supported resource manager, Open MPI's	mpirun
       will  usually  automatically  use  the  corresponding  resource manager
       process starter,	as opposed to, for example, rsh	or ssh,	which  require
       the  use	 of a hostfile,	or will	default	to running all X copies	on the
       localhost), scheduling (by default) in a	 round-robin  fashion  by  CPU
       slot.  See the rest of this page	for more details.

       Please  note  that mpirun automatically binds processes as of the start
       of the v1.8 series. Two binding patterns	are used in the	absence	of any
       further directives:

       Bind to core:	 when the number of processes is <= 2

       Bind to socket:	 when the number of processes is > 2

       If your application uses	threads, then you probably want	to ensure that
       you are either not bound	at all	(by  specifying	 --bind-to  none),  or
       bound  to multiple cores	using an appropriate binding level or specific
       number of processing elements per application process.

       mpirun will send	the name of the	directory where	it was invoked on  the
       local  node  to each of the remote nodes, and attempt to	change to that
       directory.  See the "Current Working Directory" section below for  fur-
       ther details.

       <program> The  program executable. This is identified as	the first non-
		 recognized argument to	mpirun.

       <args>	 Pass these run-time arguments to every	 new  process.	 These
		 must  always  be the last arguments to	mpirun.	If an app con-
		 text file is used, _args_ will	be ignored.

       -h, --help
		 Display help for this command

       -q, --quiet
                 Suppress informative messages from orterun during application
                 execution.

       -v, --verbose
		 Be verbose

       -V, --version
		 Print	version	number.	 If no other arguments are given, this
		 will also cause orterun to exit.

       -display-map, --display-map
		 Display a table showing the mapped location of	 each  process
		 prior to launch.

       -display-devel-map, --display-devel-map
                 Display a more detailed table showing the mapped location of
                 each process prior to launch (usually of interest to
                 developers).

       -display-allocation, --display-allocation
		 Display the detected resource allocation.

       Use  one	of the following options to specify which hosts	(nodes)	of the
       cluster to run on. Note that as of  the	start  of  the	v1.8  release,
       mpirun  will launch a daemon onto each host in the allocation (as modi-
       fied by the following options) at the very beginning of execution,  re-
       gardless	 of  whether  or  not application processes will eventually be
       mapped to execute there.	This is	done to	allow collection  of  hardware
       topology	 information  from  the	 remote	nodes, thus allowing us	to map
       processes against known topology. However, it is	a change from the  be-
       havior in prior releases	where daemons were only	launched after mapping
       was complete, and thus only occurred on nodes  where  application  pro-
       cesses would actually be	executing.

       -H, -host, --host <host1,host2,...,hostN>
	      List of hosts on which to	invoke processes.

       -hostfile, --hostfile <hostfile>
	      Provide a	hostfile to use.

       -machinefile, --machinefile <machinefile>
	      Synonym for -hostfile.

       -cpu-set, --cpu-set
	      Restrict	launched  processes  to	 the specified logical cpus on
	      each node. Note that the binding options will still apply	within
	      the  specified  envelope	-  e.g.,  you  can  elect to bind each
	      process to only one cpu within the specified cpu set.

       The following options specify the number	of processes to	 launch.  Note
       that  none of the options imply a particular binding policy - e.g., re-
       questing	N processes for	each socket does not imply that	the  processes
       will be bound to	the socket.

       -c, -n, --n, -np	<#>
	      Run  this	 many  copies of the program on	the given nodes.  This
	      option indicates that the	specified file is an  executable  pro-
	      gram and not an application context. If no value is provided for
	      the number of copies to execute (i.e., neither the "-np" nor its
	      synonyms	are provided on	the command line), Open	MPI will auto-
	      matically	execute	a copy of the program  on  each	 process  slot
	      (see  below  for description of a	"process slot"). This feature,
              however, can only be used in the SPMD model and will return an
              error (without beginning execution of the application)
              otherwise.

       --map-by ppr:N:<object>
	      Launch N times the number	of objects of the  specified  type  on
	      each node.

       -npersocket, --npersocket <#persocket>
	      On  each	node,  launch  this many processes times the number of
	      processor	sockets	on the	node.	The  -npersocket  option  also
	      turns  on	 the  -bind-to-socket option.  (deprecated in favor of
	      --map-by ppr:n:socket)

       -npernode, --npernode <#pernode>
	      On each node, launch this	many processes.	 (deprecated in	 favor
	      of --map-by ppr:n:node)

       -pernode, --pernode
	      On  each	node, launch one process -- equivalent to -npernode 1.
	      (deprecated in favor of --map-by ppr:1:node)

       To map processes:

       --map-by	<foo>
	      Map to the specified object, defaults to socket.	Supported  op-
	      tions  include  slot, hwthread, core, L1cache, L2cache, L3cache,
	      socket, numa, board, node, sequential, distance,	and  ppr.  Any
	      object  can  include modifiers by	adding a : and any combination
	      of PE=n (bind n processing elements to each  proc),  SPAN	 (load
	      balance the processes across the allocation), OVERSUBSCRIBE (al-
	      low more processes on a  node  than  processing  elements),  and
	      NOOVERSUBSCRIBE.	 This includes PPR, where the pattern would be
	      terminated by another colon to separate it from the modifiers.

       -bycore,	--bycore
	      Map processes by core (deprecated	in favor of --map-by core)

       -bysocket, --bysocket
	      Map processes by socket (deprecated in favor of --map-by socket)

       -nolocal, --nolocal
	      Do not run any copies of the launched application	 on  the  same
              node as orterun is running.  This option will override listing
              the localhost with --host or any other host-specifying
              mechanism.

       -nooversubscribe, --nooversubscribe
	      Do not oversubscribe any nodes; error (without starting any pro-
	      cesses) if the requested number of processes would  cause	 over-
	      subscription.   This option implicitly sets "max_slots" equal to
	      the "slots" value	for each node.

       -bynode,	--bynode
	      Launch processes one per node, cycling by	node in	a  round-robin
	      fashion.	 This spreads processes	evenly among nodes and assigns
	      MPI_COMM_WORLD ranks in a	round-robin, "by node" manner.

       To order	processes' ranks in MPI_COMM_WORLD:

       --rank-by <foo>
	      Rank in round-robin fashion according to the  specified  object,
	      defaults	to  slot.  Supported  options  include slot, hwthread,
	      core, L1cache, L2cache, L3cache, socket, numa, board, and	node.

       For process binding:

       --bind-to <foo>
	      Bind processes to	the specified object, defaults to  core.  Sup-
	      ported  options  include slot, hwthread, core, l1cache, l2cache,
	      l3cache, socket, numa, board, and	none.

       -cpus-per-proc, --cpus-per-proc <#perproc>
	      Bind each	process	to the specified number	of cpus.   (deprecated
	      in favor of --map-by <obj>:PE=n)

       -cpus-per-rank, --cpus-per-rank <#perrank>
              Alias for -cpus-per-proc.  (deprecated in favor of --map-by
              <obj>:PE=n)

       -bind-to-core, --bind-to-core
	      Bind processes to	cores (deprecated in favor of --bind-to	core)

       -bind-to-socket,	--bind-to-socket
	      Bind processes to	processor sockets   (deprecated	 in  favor  of
	      --bind-to	socket)

       -bind-to-none, --bind-to-none
	      Do not bind processes  (deprecated in favor of --bind-to none)

       -report-bindings, --report-bindings
	      Report any bindings for launched processes.

       -slot-list, --slot-list <slots>
	      List  of processor IDs to	be used	for binding MPI	processes. The
	      specified	bindings will be applied to all	MPI processes. See ex-
	      planation	below for syntax.

       For rankfiles:

       -rf, --rankfile <rankfile>
	      Provide a	rankfile file.

       To manage standard I/O:

       -output-filename, --output-filename <filename>
              Redirect the stdout, stderr, and stddiag of all processes to a
              process-unique version of the specified filename.  Any
              directories in the filename will automatically be created.
              Each output file will consist of filename.id, where the id will
              be the process's rank in MPI_COMM_WORLD, left-filled with zeros
              for correct ordering in listings.

       -stdin, --stdin <rank>
	      The MPI_COMM_WORLD rank of the process that is to	receive	stdin.
	      The  default  is	to forward stdin to MPI_COMM_WORLD rank	0, but
	      this option can be used to forward stdin to any process.	It  is
	      also  acceptable	to  specify none, indicating that no processes
	      are to receive stdin.

       -tag-output, --tag-output
	      Tag each line of output to stdout, stderr, and stddiag with [jo-
	      bid,   MCW_rank]<stdxxx>	 indicating   the  process  jobid  and
	      MPI_COMM_WORLD rank of the process that  generated  the  output,
	      and the channel which generated it.

       -timestamp-output, --timestamp-output
	      Timestamp	each line of output to stdout, stderr, and stddiag.

       -xml, --xml
              Provide all output to stdout, stderr, and stddiag in an XML
              format.

       -xterm, --xterm <ranks>
	      Display the  output  from	 the  processes	 identified  by	 their
	      MPI_COMM_WORLD  ranks  in	 separate xterm	windows. The ranks are
	      specified	as a comma-separated list of ranges, with a  -1	 indi-
	      cating all. A separate window will be created for	each specified
	      process.	Note: xterm will normally terminate  the  window  upon
	      termination of the process running within	it. However, by	adding
	      a	"!" to the end of the list of specified	ranks, the proper  op-
	      tions  will  be  provided	 to ensure that	xterm keeps the	window
	      open after the process terminates, thus allowing you to see  the
	      process' output.	Each xterm window will subsequently need to be
	      manually closed.	Note: In some environments, xterm may  require
	      that  the	 executable  be	in the user's path, or be specified in
	      absolute or relative terms. Thus,	it may be necessary to specify
	      a	 local	executable  as "./foo" instead of just "foo". If xterm
	      fails to find the	executable, mpirun will	hang,  but  still  re-
	      spond correctly to a ctrl-c.  If this happens, please check that
	      the executable is	being specified	correctly and try again.

       To manage files and runtime environment:

       -path, --path <path>
              <path> that will be used when attempting to locate the requested
              executables.  This is used prior to using the local PATH
              setting.

       --prefix	<dir>
	      Prefix directory that will be used to set	the  PATH  and	LD_LI-
	      BRARY_PATH  on  the  remote node before invoking Open MPI	or the
	      target process.  See the "Remote Execution" section, below.

       -s, --preload-binary
              Copy the specified executable(s) to remote machines prior to
	      starting remote processes. The executables will be copied	to the
	      Open MPI session directory and will be deleted  upon  completion
	      of the job.

       --preload-files <files>
	      Preload the comma	separated list of files	to the current working
	      directory	 of  the  remote  machines  where  processes  will  be
	      launched prior to	starting those processes.

       --preload-files-dest-dir	<path>
	      The destination directory	to be used for preload-files, if other
	      than the current working directory. By default, the absolute and
	      relative paths provided by --preload-files are used.

       --tmpdir	<dir>
	      Set the root for the session directory tree for mpirun only.

       -wd <dir>
	      Synonym for -wdir.

       -wdir <dir>
	      Change  to  the  directory  <dir>	before the user's program exe-
	      cutes.  See the "Current Working Directory" section for notes on
	      relative	paths.	 Note: If the -wdir option appears both	on the
	      command line and in an application  context,  the	 context  will
	      take  precedence over the	command	line. Thus, if the path	to the
	      desired wdir is different	on the backend nodes, then it must  be
              specified as an absolute path that is correct for the backend
              node.

       -x <env>
	      Export the specified environment variables to the	 remote	 nodes
	      before executing the program.  Only one environment variable can
	      be specified per -x option.  Existing environment	variables  can
	      be  specified or new variable names specified with corresponding
	      values.  For example:
		  % mpirun -x DISPLAY -x OFILE=/tmp/out	...

	      The parser for the -x option is not very sophisticated; it  does
	      not  even	 understand  quoted  values.  Users are	advised	to set
	      variables	in the environment, and	then use -x to export (not de-
	      fine) them.

       Setting MCA parameters:

       -gmca, --gmca <key> <value>
	      Pass  global MCA parameters that are applicable to all contexts.
	      _key_ is the parameter name; _value_ is the parameter value.

       -mca, --mca <key> <value>
              Send arguments to various MCA modules.  See the "MCA" section,
              below.

       For debugging:

       -debug, --debug
	      Invoke	the    user-level    debugger	 indicated    by   the
	      orte_base_user_debugger MCA parameter.

       -debugger, --debugger
	      Sequence of debuggers to search for when --debug is  used	 (i.e.
	      a	synonym	for orte_base_user_debugger MCA	parameter).

       -tv, --tv
	      Launch processes under the TotalView debugger.  Deprecated back-
	      wards compatibility flag.	Synonym	for --debug.

       There are also other options:

       --allow-run-as-root
              Allow mpirun to run when executed by the root user (mpirun
              defaults to aborting when launched as the root user).

       -aborted, --aborted <#>
	      Set the maximum number of	aborted	processes to display.

       --app <appfile>
	      Provide an appfile, ignoring all other command line options.

       -cf, --cartofile	<cartofile>
	      Provide a	cartography file.

       -hetero
              Indicates that multiple app_contexts are being provided that are
              a mix of 32/64-bit binaries.

       -leave-session-attached,	--leave-session-attached
              Do not detach OmpiRTE daemons used by this application. This
              allows error messages from the daemons as well as the underlying
              environment (e.g., when failing to launch a daemon) to be
              output.

       -ompi-server, --ompi-server <uri	or file>
              Specify the URI of the Open MPI server (or the mpirun to be used
              as the server), the name of the file (specified as
              file:filename) that contains that info, or the PID (specified as
              pid:#) of the mpirun to be used as the server.  The Open MPI
              server is used to support multi-application data exchange via
              the MPI-2 MPI_Publish_name and MPI_Lookup_name functions.

       -report-pid, --report-pid <channel>
              Print out mpirun's PID during startup. The channel must be
              either a '-' to indicate that the PID is to be output to stdout,
              a '+' to indicate that the PID is to be output to stderr, or a
              filename to which the PID is to be written.

       -report-uri, --report-uri <channel>
              Print out mpirun's URI during startup. The channel must be
              either a '-' to indicate that the URI is to be output to stdout,
              a '+' to indicate that the URI is to be output to stderr, or a
              filename to which the URI is to be written.

       -wait-for-server, --wait-for-server
	      Pause mpirun before launching the	job until ompi-server  is  de-
	      tected.  This  is	 useful	 in  scripts  where ompi-server	may be
	      started in the background, followed  immediately	by  an	mpirun
	      command  that  wishes  to	connect	to it. Mpirun will pause until
	      either the specified ompi-server is  contacted  or  the  server-
	      wait-time	is exceeded.

       -server-wait-time, --server-wait-time <secs>
	      The  max	amount of time (in seconds) mpirun should wait for the
	      ompi-server to start. The	default	is 10 seconds.

       The following options are useful	for developers;	they are not generally
       useful to most ORTE and/or MPI users:

       -d, --debug-devel
	      Enable  debugging	 of  the  OmpiRTE  (the	run-time layer in Open
	      MPI).  This is not generally useful for most users.

       --debug-daemons
              Enable debugging of any OmpiRTE daemons used by this
              application.

       --debug-daemons-file
              Enable debugging of any OmpiRTE daemons used by this
              application, storing output in files.

       -launch-agent, --launch-agent
	      Name of the executable that is to	be used	to start processes  on
	      the  remote  nodes.  The	default	is "orted". This option	can be
	      used to test new daemon concepts,	or to pass options back	to the
	      daemons  without	having	mpirun	itself	see them. For example,
	      specifying a launch agent	of orted -mca odls_base_verbose	5  al-
	      lows the developer to ask	the orted for debugging	output without
	      clutter from mpirun itself.

       --noprefix
              Disable the automatic --prefix behavior.

       There may be other options listed with mpirun --help.

   Environment Variables
       MPIEXEC_TIMEOUT
              The maximum number of seconds that mpirun (mpiexec) will run.
              After this many seconds, mpirun will abort the launched job and
              exit.
       One invocation of mpirun	starts an MPI application running  under  Open
       MPI. If the application is single process multiple data (SPMD), the ap-
       plication can be	specified on the mpirun	command	line.

       If the application is multiple instruction multiple data (MIMD),
       comprising multiple programs, the set of programs and arguments can be
       specified in one of two ways: Extended Command Line Arguments, and
       Application Context.

       An application context describes	the MIMD program set including all ar-
       guments in a separate file.  This file  essentially  contains  multiple
       mpirun  command	lines,	less  the command name itself.	The ability to
       specify different options for different instantiations of a program  is
       another reason to use an	application context.

       Extended	command	line arguments allow for the description of the	appli-
       cation layout on	the command line using	colons	(:)  to	 separate  the
       specification  of programs and arguments. Some options are globally set
       across all specified programs (e.g. --hostfile),	while others are  spe-
       cific to	a single program (e.g. -np).

   Specifying Host Nodes
       Host  nodes can be identified on	the mpirun command line	with the -host
       option or in a hostfile.

       For example,

       mpirun -H aa,aa,bb ./a.out
	   launches two	processes on node aa and one on	bb.

       Or, consider the	hostfile

	  % cat	myhostfile
	  aa slots=2
	  bb slots=2
	  cc slots=2

       Here, we list not only the host names (aa, bb, and cc) but also how many
       "slots"	there are for each.  Slots indicate how	many processes can po-
       tentially execute on a node.  For best performance, the number of slots
       may  be	chosen	to be the number of cores on the node or the number of
       processor sockets.  If the hostfile does	not provide slots information,
       a default of 1 is assumed.  When	running	under resource managers	(e.g.,
       SLURM, Torque, etc.), Open MPI will obtain both the hostnames  and  the
       number of slots directly from the resource manager.

       mpirun -hostfile	myhostfile ./a.out
	   will	launch two processes on	each of	the three nodes.

       mpirun -hostfile	myhostfile -host aa ./a.out
	   will	launch two processes, both on node aa.

       mpirun -hostfile	myhostfile -host dd ./a.out
	   will	find no	hosts to run on	and abort with an error.  That is, the
	   specified host dd is	not in the specified hostfile.
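
       The interaction between a hostfile and the -host option in the three
       examples above can be modeled with a short sketch.  It is a
       simplified illustration, not Open MPI's parser: it reads only the
       host name and the slots field, and the names parse_hostfile and
       select_hosts are hypothetical.

```python
def parse_hostfile(text):
    """Return an ordered {host: slots} dict; slots defaults to 1."""
    hosts = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        slots = 1  # documented default when no slots information is given
        for field in fields[1:]:
            if field.startswith("slots="):
                slots = int(field.split("=", 1)[1])
        hosts[fields[0]] = slots
    return hosts


def select_hosts(hosts, host_option=None):
    """Restrict the hostfile entries to the hosts named by -host, if any."""
    if host_option is None:
        return dict(hosts)
    wanted = host_option.split(",")
    selected = {h: s for h, s in hosts.items() if h in wanted}
    if not selected:
        # Models "will find no hosts to run on and abort with an error".
        raise ValueError("no hosts to run on")
    return selected


MYHOSTFILE = "aa slots=2\nbb slots=2\ncc slots=2\n"
hosts = parse_hostfile(MYHOSTFILE)
print(select_hosts(hosts))        # two slots on each of the three nodes
print(select_hosts(hosts, "aa"))  # two slots, both on node aa
```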

   Specifying Number of	Processes
       As we have just seen, the number	of processes to	run can	be  set	 using
       the hostfile.  Other mechanisms exist.

       The  number of processes	launched can be	specified as a multiple	of the
       number of nodes or processor sockets available.	For example,

       mpirun -H aa,bb -npersocket 2 ./a.out
           launches processes 0-3 on node aa and processes 4-7 on node bb,
           where aa and bb are both dual-socket nodes.  The -npersocket option
           also turns on the -bind-to-socket option, which is discussed in a
           later section.

       mpirun -H aa,bb -npernode 2 ./a.out
	   launches processes 0-1 on node aa and processes 2-3 on node bb.

       mpirun -H aa,bb -npernode 1 ./a.out
	   launches one	process	per host node.

       mpirun -H aa,bb -pernode	./a.out
	   is the same as -npernode 1.

       Another	alternative is to specify the number of	processes with the -np
       option.	Consider now the hostfile

	  % cat	myhostfile
	  aa slots=4
	  bb slots=4
	  cc slots=4


       mpirun -hostfile	myhostfile -np 6 ./a.out
	   will	launch processes 0-3 on	node aa	and processes 4-5 on node  bb.
	   The	remaining slots	in the hostfile	will not be used since the -np
	   option indicated that only 6	processes should be launched.

   Mapping Processes to	Nodes: Using Policies
       The examples above illustrate the default mapping of processes to
       nodes.  This mapping can also be controlled with various mpirun
       options that describe mapping policies.

       Consider	the same hostfile as above, again with -np 6:

				 node aa      node bb	   node	cc

	 mpirun			 0 1 2 3      4	5

	 mpirun	--map-by node	 0 3	      1	4	   2 5

	 mpirun	-nolocal		      0	1 2 3	   4 5

       The --map-by node option	will load balance  the	processes  across  the
       available nodes,	numbering each process in a round-robin	fashion.

       The  -nolocal  option prevents any processes from being mapped onto the
       local host (in this case	node aa).  While mpirun	typically consumes few
       system resources, -nolocal can be helpful for launching very large jobs
       where mpirun may	actually need to  use  noticeable  amounts  of	memory
       and/or processing time.
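
       The three rows of the table above can be reproduced with a small
       model of the two placement policies.  This is a sketch of the
       documented behavior for this example only (three nodes with four
       slots each, mpirun running on aa); the function names are
       hypothetical and oversubscription is not modeled.

```python
def map_by_slot(hosts, np):
    """Fill each node's slots in order before moving to the next node."""
    placement = {h: [] for h in hosts}
    rank = 0
    for host, slots in hosts.items():
        for _ in range(slots):
            if rank == np:
                return placement
            placement[host].append(rank)
            rank += 1
    return placement


def map_by_node(hosts, np):
    """Deal ranks out one per node, round-robin, within each slot count."""
    placement = {h: [] for h in hosts}
    rank = 0
    while rank < np:
        progressed = False
        for host, slots in hosts.items():
            if rank < np and len(placement[host]) < slots:
                placement[host].append(rank)
                rank += 1
                progressed = True
        if not progressed:
            break  # out of slots; oversubscription is not modeled here
    return placement


def without_local(hosts, local):
    """Model -nolocal by dropping the node on which mpirun runs."""
    return {h: s for h, s in hosts.items() if h != local}


hosts = {"aa": 4, "bb": 4, "cc": 4}
print(map_by_slot(hosts, 6))                       # default
print(map_by_node(hosts, 6))                       # --map-by node
print(map_by_slot(without_local(hosts, "aa"), 6))  # -nolocal
```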

       Just  as	 -np  can specify fewer	processes than there are slots,	it can
       also oversubscribe the slots.  For example, with	the same hostfile:

       mpirun -hostfile	myhostfile -np 14 ./a.out
           will launch processes 0-3 on node aa, 4-7 on bb, and 8-11 on cc.
           It will then add the remaining two processes to whichever nodes it
           chooses.

       One can also specify limits to oversubscription.	 For example, with the
       same hostfile:

       mpirun -hostfile	myhostfile -np 14 -nooversubscribe ./a.out
           will produce an error since -nooversubscribe prevents
           oversubscription.

       Limits to oversubscription can also be specified in the hostfile
       itself:

	% cat myhostfile
	aa slots=4 max_slots=4
	bb	   max_slots=4
	cc slots=4

       The  max_slots  field  specifies	such a limit.  When it does, the slots
       value defaults to the limit.  Now:

       mpirun -hostfile	myhostfile -np 14 ./a.out
	   causes the first 12 processes to be launched	as before, but the re-
	   maining  two	 processes will	be forced onto node cc.	 The other two
	   nodes are protected by the  hostfile	 against  oversubscription  by
	   this	job.
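
       The effect of max_slots and -nooversubscribe in the examples above
       can be modeled as a two-pass fill.  This is a sketch under stated
       assumptions, not Open MPI's algorithm: place_with_limits is a
       hypothetical name, and when several nodes have room for extra
       processes the model simply takes the first one, whereas the text
       leaves that choice to mpirun.

```python
def place_with_limits(nodes, np, no_oversubscribe=False):
    """Count processes per node.

    nodes is a list of (host, slots, max_slots); None marks a missing field.
    """
    norm = []
    for host, slots, max_slots in nodes:
        if slots is None:
            # slots defaults to the limit, or to 1 if there is no limit
            slots = max_slots if max_slots is not None else 1
        if no_oversubscribe:
            max_slots = slots  # -nooversubscribe sets max_slots equal to slots
        norm.append((host, slots, max_slots))

    counts = {host: 0 for host, _, _ in norm}
    remaining = np
    # Pass 1: use the declared slots on each node in order.
    for host, slots, _ in norm:
        take = min(slots, remaining)
        counts[host] += take
        remaining -= take
    # Pass 2: oversubscribe nodes that still have room under max_slots.
    for host, _, max_slots in norm:
        if remaining == 0:
            break
        room = remaining if max_slots is None else max_slots - counts[host]
        take = min(room, remaining)
        counts[host] += take
        remaining -= take
    if remaining:
        raise RuntimeError("request would exceed the oversubscription limits")
    return counts


# aa slots=4 max_slots=4 / bb max_slots=4 / cc slots=4
limited = [("aa", 4, 4), ("bb", None, 4), ("cc", 4, None)]
print(place_with_limits(limited, 14))
```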

       Using  the  --nooversubscribe option can	be helpful since Open MPI cur-
       rently does not get "max_slots" values from the resource	manager.

       Of course, -np can also be used with the -H or -host option.  For
       example,

       mpirun -H aa,bb -np 8 ./a.out
	   launches  8	processes.   Since only	two hosts are specified, after
	   the first two processes are mapped, one to aa and one  to  bb,  the
	   remaining processes oversubscribe the specified hosts.

       And here	is a MIMD example:

       mpirun -H aa -np	1 hostname : -H	bb,cc -np 2 uptime
	   will	 launch	 process 0 running hostname on node aa and processes 1
	   and 2 each running uptime on	nodes bb and cc, respectively.
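
       The colon-separated form used in the MIMD example can be illustrated
       by splitting an argument list into one group per program.  This is a
       minimal sketch, not mpirun's parser: split_app_contexts is a
       hypothetical name, and the sketch does not distinguish global
       options from local ones.

```python
def split_app_contexts(args):
    """Split an mpirun argument list on ':' into per-program groups."""
    contexts, current = [], []
    for arg in args:
        if arg == ":":
            contexts.append(current)
            current = []
        else:
            current.append(arg)
    contexts.append(current)
    return contexts


argv = "-H aa -np 1 hostname : -H bb,cc -np 2 uptime".split()
for context in split_app_contexts(argv):
    print(context)
```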

   Mapping, Ranking, and Binding: Oh My!
       Open MPI	employs	a three-phase procedure	for  assigning	process	 loca-
       tions and ranks:

       mapping	 Assigns a default location to each process

       ranking	 Assigns an MPI_COMM_WORLD rank	value to each process

       binding	 Constrains each process to run	on specific processors

       The mapping step is used to assign a default location to each process
       based on the mapper being employed. Mapping by slot, by node, and
       sequentially results in the assignment of the processes to the node
       level. In contrast, mapping by object allows the mapper to assign the
       process to an actual object on each node.

       Note: the location assigned to the process is independent of where it
       will be bound - the assignment is used solely as input to the binding
       algorithm.

       The mapping of processes to nodes can be defined not just with
       general policies	but also, if necessary,	using arbitrary	mappings  that
       cannot  be  described  by a simple policy.  One can use the "sequential
       mapper,"	which reads the	hostfile line by line, assigning processes  to
       nodes in	whatever order the hostfile specifies.	Use the	-mca rmaps seq
       option.	For example, using the same hostfile as	before:

       mpirun -hostfile	myhostfile -mca	rmaps seq ./a.out

       will launch three processes, one	on each	of nodes aa, bb, and  cc,  re-
       spectively.  The	slot counts don't matter;  one process is launched per
       line on whatever	node is	listed on the line.
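
       The sequential mapper's rule (one process per hostfile line, slot
       counts ignored) can be sketched in a few lines.  The function name
       map_sequential is hypothetical, and the sketch assumes every
       non-empty line starts with a host name.

```python
def map_sequential(hostfile_text):
    """Return [(rank, host)], launching one process per hostfile line."""
    placement = []
    for line in hostfile_text.splitlines():
        fields = line.split()
        if fields:
            # The slot count on the line is deliberately ignored.
            placement.append((len(placement), fields[0]))
    return placement


MYHOSTFILE = "aa slots=4\nbb slots=4\ncc slots=4\n"
print(map_sequential(MYHOSTFILE))
```

       Listing a host on several lines places one process on it per line,
       which is how arbitrary orderings can be expressed.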

       Another way to specify arbitrary	mappings is  with  a  rankfile,	 which
       gives you detailed control over process binding as well.	 Rankfiles are
       discussed below.

       The second phase	focuses	on the ranking of the process within the job's
       MPI_COMM_WORLD.	 Open MPI separates this from the mapping procedure to
       allow more flexibility in the relative placement	of MPI processes. This
       is  best	 illustrated  by  considering the following two	cases where we
       used the --map-by ppr:2:socket option:

                                 node aa       node bb

          rank-by core           0 1 ! 2 3     4 5 ! 6 7

          rank-by socket         0 2 ! 1 3     4 6 ! 5 7

          rank-by socket:span    0 4 ! 1 5     2 6 ! 3 7

       Ranking by core and by slot provide the identical  result  -  a	simple
       progression of MPI_COMM_WORLD ranks across each node. Ranking by	socket
       does a round-robin ranking within each node until  all  processes  have
       been assigned an	MCW rank, and then progresses to the next node.	Adding
       the span	modifier to the	ranking	directive causes the ranking algorithm
       to treat the entire allocation as a single entity - thus, the MCW ranks
       are assigned across all sockets before circling back around to the
       beginning.
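
       The three ranking patterns in the table above differ only in the
       order in which nodes, sockets, and per-socket positions are visited.
       The following sketch reproduces the table for two dual-socket nodes
       with two processes per socket; it is a model of this example, not
       Open MPI's ranking code, and rank_layout is a hypothetical name.

```python
def rank_layout(nodes, sockets_per_node, ppr, policy):
    """Return layout[node][socket] = list of MPI_COMM_WORLD ranks."""
    layout = [[[None] * ppr for _ in range(sockets_per_node)]
              for _ in range(nodes)]
    if policy == "core":
        # Simple progression across each node.
        order = [(n, s, p) for n in range(nodes)
                 for s in range(sockets_per_node) for p in range(ppr)]
    elif policy == "socket":
        # Round-robin over the sockets of one node, then the next node.
        order = [(n, s, p) for n in range(nodes)
                 for p in range(ppr) for s in range(sockets_per_node)]
    elif policy == "socket:span":
        # Round-robin over every socket in the allocation.
        order = [(n, s, p) for p in range(ppr)
                 for n in range(nodes) for s in range(sockets_per_node)]
    else:
        raise ValueError("unsupported policy: " + policy)
    for rank, (n, s, p) in enumerate(order):
        layout[n][s][p] = rank
    return layout


for policy in ("core", "socket", "socket:span"):
    print(policy, rank_layout(2, 2, 2, policy))
```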

       The binding phase actually binds	each process to	a given	set of proces-
       sors. This can improve performance if the operating system  is  placing
       processes  suboptimally.	  For  example,	 it  might  oversubscribe some
       multi-core processor sockets, leaving other  sockets  idle;   this  can
       lead  processes	to contend unnecessarily for common resources.	Or, it
       might spread processes out too widely;  this can	be suboptimal  if  ap-
       plication performance is	sensitive to interprocess communication	costs.
       Binding can also	keep the operating system from migrating processes ex-
       cessively,  regardless  of how optimally	those processes	were placed to
       begin with.

       The processors to be used for binding can be  identified	 in  terms  of
       topological  groupings  -  e.g.,	 binding  to an	l3cache	will bind each
       process to all processors within	the scope of a single L3 cache	within
       their  assigned	location. Thus,	if a process is	assigned by the	mapper
       to a certain socket, then a --bind-to l3cache directive will cause the
       process	to  be	bound  to  the processors that share a single L3 cache
       within that socket.

       To help balance loads, the binding directive uses a round-robin	method
       when binding to levels lower than used in the mapper. For example, con-
       sider the case where a job is mapped to	the  socket  level,  and  then
       bound  to  core.	 Each  socket will have	multiple cores,	so if multiple
       processes are mapped to a given socket, the binding algorithm will
       assign each process located to a socket to a unique core in a
       round-robin manner.

       Alternatively, processes	mapped by l2cache and  then  bound  to	socket
       will simply be bound to all the processors in the socket	where they are
       located.	In this	manner,	users can exert	detailed control over relative
       MCW rank	location and binding.

       Finally,	--report-bindings can be used to report	bindings.

       As  an  example,	 consider a node with two processor sockets, each com-
       prising four cores.  We run mpirun with -np 4 --report-bindings and the
       following additional options:

	% mpirun ... --map-by core --bind-to core
	[...] ... binding child	[...,0]	to cpus	0001
	[...] ... binding child	[...,1]	to cpus	0002
	[...] ... binding child	[...,2]	to cpus	0004
	[...] ... binding child	[...,3]	to cpus	0008

	% mpirun ... --map-by socket --bind-to socket
	[...] ... binding child	[...,0]	to socket 0 cpus 000f
	[...] ... binding child	[...,1]	to socket 1 cpus 00f0
	[...] ... binding child	[...,2]	to socket 0 cpus 000f
	[...] ... binding child	[...,3]	to socket 1 cpus 00f0

	% mpirun ... --map-by core:PE=2	--bind-to core
	[...] ... binding child	[...,0]	to cpus	0003
	[...] ... binding child	[...,1]	to cpus	000c
	[...] ... binding child	[...,2]	to cpus	0030
	[...] ... binding child	[...,3]	to cpus	00c0

	% mpirun ... --bind-to none

       Here,  --report-bindings	 shows	the binding of each process as a mask.
       In the first case, the processes	bind to	successive cores as  indicated
       by the masks 0001, 0002,	0004, and 0008.	 In the	second case, processes
       bind to all cores on successive sockets as indicated by the masks  000f
       and  00f0.   The	 processes  cycle  through  the	processor sockets in a
       round-robin fashion as many times as are	needed.	 In  the  third	 case,
       the  masks  show	 us  that 2 cores have been bound per process.	In the
       fourth case, binding is turned off and no bindings are reported.
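
       To read the masks above, it can help to expand them into CPU
       indexes.  The following sketch assumes that the mask is hexadecimal
       and that bit 0 (the least significant bit) stands for CPU 0, which
       matches the masks 0001, 0002, 0004, and 0008 shown for successive
       cores.  The helper name mask_to_cpus is hypothetical.

```python
def mask_to_cpus(mask_hex):
    """Return the sorted list of CPU indexes whose bits are set in a hex mask."""
    value = int(mask_hex, 16)
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


for mask in ("0001", "0008", "000f", "00f0", "0030"):
    print(mask, mask_to_cpus(mask))
```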

       Open MPI's support for process binding depends on the underlying	 oper-
       ating  system.	Therefore,  certain process binding options may	not be
       available on every system.

       Process binding can also	be set with MCA	parameters.   Their  usage  is
       less  convenient	 than  that of mpirun options.	On the other hand, MCA
       parameters can be set not only on the mpirun command line, but alterna-
       tively in a system or user mca-params.conf file or as environment vari-
       ables, as described in the MCA section below.  Some examples include:

	   mpirun option	  MCA parameter	key	    value

	 --map-by core		rmaps_base_mapping_policy   core
	 --map-by socket	rmaps_base_mapping_policy   socket
	 --rank-by core		rmaps_base_ranking_policy   core
	 --bind-to core		hwloc_base_binding_policy   core
	 --bind-to socket	hwloc_base_binding_policy   socket
	 --bind-to none		hwloc_base_binding_policy   none

       Rankfiles are text files	that specify detailed  information  about  how
       individual  processes  should  be mapped	to nodes, and to which proces-
       sor(s) they should be bound.  Each line of a rankfile specifies the lo-
       cation  of one process (for MPI jobs, the process' "rank" refers	to its
       rank in MPI_COMM_WORLD).  The general form of each line in the rankfile is:

	   rank	<N>=<hostname> slot=<slot list>

       For example:

	   $ cat myrankfile
	   rank	0=aa slot=1:0-2
	   rank	1=bb slot=0:0,1
	   rank	2=cc slot=1-2
	   $ mpirun -H aa,bb,cc,dd -rf myrankfile ./a.out

       This means that

	 Rank 0	runs on	node aa, bound to logical socket 1, cores 0-2.
	 Rank 1	runs on	node bb, bound to logical socket 0, cores 0 and	1.
	 Rank 2	runs on	node cc, bound to logical cores	1 and 2.
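
       The interpretation of the lines above can be sketched with a small
       parser.  It handles only the two slot forms used in this example
       ("<socket>:<cores>" and a bare core list); the full rankfile syntax
       accepted by Open MPI may be richer.  The names parse_rank_line and
       expand are hypothetical.

```python
import re

# rank <N>=<hostname> slot=<slot list>
RANK_LINE = re.compile(r"^rank\s+(\d+)\s*=\s*(\S+)\s+slot=(\S+)$")


def expand(spec):
    """Expand a list such as '0-2' or '0,1' into a list of integers."""
    out = []
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            out.extend(range(int(lo), int(hi) + 1))
        else:
            out.append(int(part))
    return out


def parse_rank_line(line):
    """Parse one logical rankfile line into a dictionary."""
    match = RANK_LINE.match(line.strip())
    if match is None:
        raise ValueError("not a rankfile line: %r" % line)
    rank, host, slot = match.groups()
    socket, sep, cores = slot.partition(":")
    if sep:  # form "<socket>:<cores>"
        return {"rank": int(rank), "host": host,
                "socket": int(socket), "cores": expand(cores)}
    # form "<cores>" with no socket given
    return {"rank": int(rank), "host": host,
            "socket": None, "cores": expand(slot)}


for text in ("rank 0=aa slot=1:0-2", "rank 1=bb slot=0:0,1", "rank 2=cc slot=1-2"):
    print(parse_rank_line(text))
```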

       Rankfiles can alternatively be used to specify physical processor loca-
       tions. In this case, the	syntax is somewhat different. Sockets  are  no
       longer  recognized, and the slot	number given must be the number	of the
       physical	PU as most OS's	do not assign a	unique physical	identifier  to
       each core in the	node. Thus, a proper physical rankfile looks something
       like the	following:

	   $ cat myphysicalrankfile
	   rank	0=aa slot=1
	   rank	1=bb slot=8
	   rank	2=cc slot=6

       This means that

         Rank 0 will run on node aa, bound to the core that contains physical PU 1.
         Rank 1 will run on node bb, bound to the core that contains physical PU 8.
         Rank 2 will run on node cc, bound to the core that contains physical PU 6.

       Rankfiles  are  treated	as  logical  by	default, and the MCA parameter
       rmaps_rank_file_physical	must be	set to 1 to indicate that the rankfile
       is to be	considered as physical.

       The hostnames listed above are "absolute," meaning that actual resolve-
       able hostnames are specified.  However, hostnames can also be specified
       as "relative," meaning that they	are specified in relation to an	exter-
       nally-specified list of hostnames (e.g.,	by mpirun's --host argument, a
       hostfile, or a job scheduler).

       The  "relative" specification is	of the form "+n<X>", where X is	an in-
       teger specifying	the Xth	hostname in the	set  of	 all  available	 host-
       names, indexed from 0.  For example:

	   $ cat myrankfile
	   rank	0=+n0 slot=1:0-2
	   rank	1=+n1 slot=0:0,1
	   rank	2=+n2 slot=1-2
	   $ mpirun -H aa,bb,cc,dd -rf myrankfile ./a.out
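
       The resolution of relative hostnames can be sketched as follows,
       assuming the available hostnames are known as an ordered list
       indexed from 0.  The function name resolve_host is hypothetical.

```python
def resolve_host(name, available):
    """Map a '+n<X>' relative name to the Xth available hostname.

    Absolute hostnames are returned unchanged.
    """
    if name.startswith("+n"):
        return available[int(name[2:])]
    return name


hosts = ["aa", "bb", "cc", "dd"]  # e.g., from -H aa,bb,cc,dd
for name in ("+n0", "+n1", "+n2", "dd"):
    print(name, "->", resolve_host(name, hosts))
```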

       Starting with Open MPI v1.7, all socket/core slot locations are specified
       as logical indexes (the Open MPI v1.6 series used physical indexes).  You
       can use tools such as HWLOC's "lstopo" to find the logical indexes of
       sockets and cores.

   Application Context or Executable Program?
       To distinguish the two different forms, mpirun looks on the command
       line for the --app option.  If it is specified, then the file named on the
       command line is assumed to be an	application context.   If  it  is  not
       specified, then the file	is assumed to be an executable program.

   Locating Files
       If  no relative or absolute path	is specified for a file, Open MPI will
       first look for files by searching  the  directories  specified  by  the
       --path  option.	If there is no --path option set or if the file	is not
       found at	the --path location, then Open MPI will	search the user's PATH
       environment variable as defined on the source node(s).
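
       The search order described above can be sketched as follows.  This
       is an approximation only: the function name locate is hypothetical,
       and the sketch assumes a file qualifies when it exists and is
       executable.

```python
import os


def locate(program, path_option_dirs, env_path):
    """Return the first match for program, or None if it is not found.

    path_option_dirs: directories given with --path (searched first).
    env_path: the value of the user's PATH environment variable.
    """
    if os.sep in program:
        return program  # a relative or absolute path was given; no search
    search = list(path_option_dirs)
    search += [d for d in env_path.split(os.pathsep) if d]
    for directory in search:
        candidate = os.path.join(directory, program)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


print(locate("sh", [], os.environ.get("PATH", "")))
```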

       If a relative directory is specified, it must be relative to the initial
       working directory determined by the specific starter used.  For example,
       when using the rsh or ssh starters, the initial directory is
       $HOME by	default. Other starters	may set	the initial directory  to  the
       current working directory from the invocation of	mpirun.

   Current Working Directory
       The  -wdir  mpirun  option  (and	 its  synonym, -wd) allows the user to
       change to an arbitrary directory	before the program is invoked.	It can
       also  be	 used in application context files to specify working directo-
       ries on specific	nodes and/or for specific applications.

       If the -wdir option appears both	in a context file and on  the  command
       line, the context file directory	will override the command line value.

       If  the	-wdir  option is specified, Open MPI will attempt to change to
       the specified directory on all of the  remote  nodes.  If  this	fails,
       mpirun will abort.

       If  the -wdir option is not specified, Open MPI will send the directory
       name where mpirun was invoked to	each of	the remote nodes.  The	remote
       nodes  will  try	to change to that directory. If	they are unable	(e.g.,
       if the directory	does not exist on that node), then Open	MPI  will  use
       the default directory determined	by the starter.

       All  directory changing occurs before the user's	program	is invoked; it
       does not	wait until MPI_INIT is called.

   Standard I/O
       Open MPI	directs	UNIX standard input to /dev/null on all	processes  ex-
       cept  the  MPI_COMM_WORLD  rank	0  process.  The MPI_COMM_WORLD	rank 0
       process inherits	standard input from mpirun.  Note: The node  that  in-
       voked  mpirun need not be the same as the node where the	MPI_COMM_WORLD
       rank 0 process resides. Open MPI	handles	the  redirection  of  mpirun's
       standard	input to the rank 0 process.

       Open  MPI  directs  UNIX	standard output	and error from remote nodes to
       the node	that invoked mpirun and	prints it on the standard output/error
       of mpirun.  Local processes inherit the standard	output/error of	mpirun
       and transfer to it directly.

       Thus it is possible to redirect standard	I/O for	Open MPI  applications
       by using	the typical shell redirection procedure	on mpirun.

	     % mpirun -np 2 my_app < my_input >	my_output

       Note  that  in this example only	the MPI_COMM_WORLD rank	0 process will
       receive the stream from my_input	on stdin.  The stdin on	all the	 other
       nodes  will  be	tied to	/dev/null.  However, the stdout	from all nodes
       will be collected into the my_output file.

   Signal Propagation
       When orterun receives a SIGTERM or SIGINT, it will attempt to kill the
       entire job by sending all processes in the job a SIGTERM, waiting a
       small number of seconds, then sending all processes in the job a
       SIGKILL.

       SIGUSR1	and  SIGUSR2 signals received by orterun are propagated	to all
       processes in the	job.

       One can turn on forwarding of SIGSTOP and SIGCONT to the program
       executed by mpirun by setting the MCA parameter orte_forward_job_control
       to 1.  A SIGTSTP signal to mpirun will then cause a SIGSTOP signal to
       be sent to all of the programs started by mpirun, and likewise a SIGCONT
       signal to mpirun will cause a SIGCONT to be sent.

       Other signals are not currently propagated by orterun.

   Process Termination / Signal	Handling
       During the run of an MPI	application, if	any  process  dies  abnormally
       (either exiting before invoking MPI_FINALIZE, or	dying as the result of
       a signal), mpirun will print out	an error message and kill the rest  of
       the MPI application.

       User signal handlers should probably avoid trying to clean up MPI state
       (Open MPI is currently not async-signal-safe; see MPI_Init_thread(3)
       for details about MPI_THREAD_MULTIPLE and thread safety).  For example,
       if a segmentation fault occurs in MPI_SEND (perhaps because a bad buffer
       was passed in) and a user signal handler is invoked, if this user
       handler attempts to invoke MPI_FINALIZE, Bad Things could happen since
       Open MPI was already "in" MPI when the error occurred.  Since mpirun
       will notice that the process died due to a signal, cleaning up MPI
       state is probably not necessary, and it is safest for the user to
       clean up only non-MPI state.

   Process Environment
       Processes  in  the  MPI	application inherit their environment from the
       Open RTE	daemon upon the	node on	which they are running.	 The  environ-
       ment  is	 typically  inherited from the user's shell.  On remote	nodes,
       the exact environment is	determined by the boot MCA module  used.   The
       rsh  launch module, for example,	uses either rsh/ssh to launch the Open
       RTE daemon on remote nodes, and typically executes one or more  of  the
       user's  shell-setup  files  before launching the	Open RTE daemon.  When
       running	dynamically  linked  applications  which  require  the	LD_LI-
       BRARY_PATH environment variable to be set, care must be taken to	ensure
       that it is correctly set	when booting Open MPI.

       See the "Remote Execution" section for more details.

   Remote Execution
       Open MPI	requires that the PATH environment variable be set to find ex-
       ecutables  on remote nodes (this	is typically only necessary in rsh- or
       ssh-based environments -- batch/scheduled environments  typically  copy
       the current environment to the execution	of remote jobs,	so if the cur-
       rent environment	has PATH and/or	LD_LIBRARY_PATH	set properly, the  re-
       mote  nodes  will also have it set properly).  If Open MPI was compiled
       with shared library support, it may  also  be  necessary	 to  have  the
       LD_LIBRARY_PATH environment variable set on remote nodes as well
       (especially to find the shared libraries required to run user MPI
       applications).

       However,	 it  is	not always desirable or	possible to edit shell startup
       files to	set PATH and/or	LD_LIBRARY_PATH.  The --prefix option is  pro-
       vided for some simple configurations where this is not possible.

       The  --prefix option takes a single argument: the base directory	on the
       remote node where Open MPI is installed.	 Open MPI will use this	direc-
       tory  to	 set  the remote PATH and LD_LIBRARY_PATH before executing any
       Open MPI or user applications.  This allows running Open MPI jobs without
       having pre-configured the PATH and LD_LIBRARY_PATH on the remote nodes.

       Open MPI	adds the basename of the current node's	"bindir"  (the	direc-
       tory where Open MPI's executables are installed)	to the prefix and uses
       that to set the PATH on the remote node.	 Similarly, Open MPI adds  the
       basename	of the current node's "libdir" (the directory where Open MPI's
       libraries are installed)	to the prefix and uses that to set the	LD_LI-
       BRARY_PATH on the remote	node.  For example:

       Local bindir:  /local/node/directory/bin

       Local libdir:  /local/node/directory/lib64

       If the following	command	line is	used:

	   % mpirun --prefix /remote/node/directory

       Open MPI will add "/remote/node/directory/bin" to the PATH and
       "/remote/node/directory/lib64" to the LD_LIBRARY_PATH on the remote node
       before attempting to execute anything.
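
       The way the remote directories are derived from the prefix can be
       sketched as follows.  The function name remote_paths is hypothetical;
       the sketch only joins the prefix with the basename of the local
       bindir and libdir, as described above.

```python
import posixpath


def remote_paths(prefix, local_bindir, local_libdir):
    """Return the (PATH entry, LD_LIBRARY_PATH entry) used on the remote node."""
    remote_bin = posixpath.join(prefix, posixpath.basename(local_bindir))
    remote_lib = posixpath.join(prefix, posixpath.basename(local_libdir))
    return remote_bin, remote_lib


print(remote_paths("/remote/node/directory",
                   "/local/node/directory/bin",
                   "/local/node/directory/lib64"))
```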

       The  --prefix option is not sufficient if the installation paths	on the
       remote node are different than the local	node (e.g., if "/lib" is  used
       on  the local node, but "/lib64"	is used	on the remote node), or	if the
       installation paths are something	other than a subdirectory under	a com-
       mon prefix.

       Note  that  executing  mpirun via an absolute pathname is equivalent to
       specifying --prefix without the last subdirectory in the	absolute path-
       name to mpirun.	For example:

	   % /usr/local/bin/mpirun ...

       is equivalent to

	   % mpirun --prefix /usr/local

   Exported Environment	Variables
       All  environment	variables that are named in the	form OMPI_* will auto-
       matically be exported to	new processes on the local and	remote	nodes.
       Environmental parameters	can also be set/forwarded to the new processes
       using the MCA parameter mca_base_env_list.  The -x option to mpirun has
       been deprecated, but the syntax of the MCA parameter follows that of the
       -x option.  While the syntax of the -x option and the MCA parameter allows
       the definition of new variables, note that the parser for these options
       is currently not very sophisticated; it does not even understand quoted
       values.  Users are advised to set variables in the environment and use
       the option to export them, not to define them.

   Setting MCA Parameters
       The -mca	switch allows the passing of parameters	to various MCA	(Modu-
       lar Component Architecture) modules.  MCA modules have direct impact on
       MPI programs because they allow tunable parameters to  be  set  at  run
       time (such as which BTL communication device driver to use, what	param-
       eters to	pass to	that BTL, etc.).

       The -mca	switch takes two arguments: _key_ and _value_.	The _key_  ar-
       gument  generally  specifies  which  MCA	module will receive the	value.
       For example, the	_key_ "btl" is used to select which BTL	to be used for
       transporting  MPI  messages.  The _value_ argument is the value that is
       passed.	For example:

       mpirun -mca btl tcp,self	-np 1 foo
           Tells Open MPI to use the "tcp" and "self" BTLs, and to run a single
           copy of "foo" on an allocated node.

       mpirun -mca btl self -np	1 foo
           Tells Open MPI to use the "self" BTL, and to run a single copy of
           "foo" on an allocated node.

       The -mca	switch can be used multiple times to specify  different	 _key_
       and/or  _value_	arguments.   If	 the same _key_	is specified more than
       once, the _value_s are concatenated with	a comma	(",") separating them.
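
       The concatenation rule for repeated keys can be sketched as
       follows.  The function name merge_mca is hypothetical, and the
       sketch assumes the values are joined in the order in which they
       appear on the command line.

```python
def merge_mca(pairs):
    """Merge (key, value) pairs from repeated -mca switches.

    Values given for the same key are joined with a comma.
    """
    merged = {}
    for key, value in pairs:
        if key in merged:
            merged[key] = merged[key] + "," + value
        else:
            merged[key] = value
    return merged


# -mca btl tcp -mca btl_tcp_if_include eth0 -mca btl self
print(merge_mca([("btl", "tcp"), ("btl_tcp_if_include", "eth0"), ("btl", "self")]))
```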

       Note that the -mca switch is simply a shortcut for setting  environment
       variables.  The same effect may be accomplished by setting corresponding
       environment variables before running mpirun.  The form of the environment
       variables that Open MPI sets is:

           OMPI_MCA_<key>=<value>

       Thus,  the  -mca	 switch	overrides any previously set environment vari-
       ables.  The -mca	settings similarly override MCA	parameters set in  the
       $OPAL_PREFIX/etc/openmpi-mca-params.conf	    or	   $HOME/.openmpi/mca-
       params.conf file.
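
       The precedence described above can be sketched as a lookup.  The
       function names are hypothetical, and two points are assumptions
       rather than statements from this page: that the environment
       variable for a key is named OMPI_MCA_<key>, and that an environment
       variable takes precedence over a value from an mca-params.conf
       file.

```python
def mca_env_name(key):
    """Name of the environment variable for an MCA key (assumed form)."""
    return "OMPI_MCA_" + key


def effective_value(key, cmdline, environ, conf_file):
    """Return the value that wins for key, or None if it is unset.

    cmdline:   {key: value} from -mca switches (highest precedence).
    environ:   environment variables, e.g. a copy of os.environ.
    conf_file: {key: value} read from an mca-params.conf file.
    """
    if key in cmdline:
        return cmdline[key]
    name = mca_env_name(key)
    if name in environ:
        return environ[name]  # assumed to outrank the file
    return conf_file.get(key)


print(effective_value("btl", {"btl": "tcp,self"},
                      {"OMPI_MCA_btl": "sm,self"}, {"btl": "self"}))
```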

       Unknown _key_ arguments are still set as environment variables -- they
       are  not	 checked  (by  mpirun)	for correctness.  Illegal or incorrect
       _value_ arguments may or	may not	be reported -- it depends on the  spe-
       cific MCA module.

       To find the available component types under the MCA architecture, or to
       find the	 available  parameters	for  a	specific  component,  use  the
       ompi_info command.  See the ompi_info(1)	man page for detailed informa-
       tion on the command.

   Running as root
       The Open	MPI team strongly advises against executing mpirun as the root
       user.  MPI applications should be run as	regular	(non-root) users.

       Reflecting  this	 advice, mpirun	will refuse to run as root by default.
       To override this	default, you can add the --allow-run-as-root option to
       the mpirun command line.

   Exit	status
       There  is  no  standard	definition for what mpirun should return as an
       exit status. After considerable discussion, we settled on the following
       method for assigning the	mpirun exit status (note: in the following de-
       scription, the "primary"	job is	the  initial  application  started  by
       mpirun  -  all  jobs  that are spawned by that job are designated "sec-
       ondary" jobs):

       o if all	processes in the primary job normally terminate	with exit sta-
	 tus 0,	we return 0

       o if  one  or more processes in the primary job normally	terminate with
	 non-zero exit status, we return the exit status of the	 process  with
	 the lowest MPI_COMM_WORLD rank	to have	a non-zero status

       o if all	processes in the primary job normally terminate	with exit sta-
	 tus 0,	and one	or more	processes in a secondary job  normally	termi-
	 nate  with non-zero exit status, we (a) return	the exit status	of the
	 process with the lowest MPI_COMM_WORLD	rank in	the  lowest  jobid  to
	 have a	non-zero status, and (b) output	a message summarizing the exit
	 status	of the primary and all secondary jobs.

       o if the	cmd line option	--report-child-jobs-separately is set, we will
	 return	 -only-	 the exit status of the	primary	job. Any non-zero exit
         status in secondary jobs will be reported solely in a summary print
         statement.
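
       The rules in this list can be sketched as follows.  This is an
       illustration of the selection logic only, covering processes that
       terminate normally; the function names and the data layout (exit
       statuses listed in MPI_COMM_WORLD rank order, secondary jobs keyed
       by jobid) are hypothetical.

```python
def first_nonzero(statuses):
    """Return the non-zero exit status of the lowest rank, or 0 if none."""
    for status in statuses:  # statuses are listed in rank order
        if status != 0:
            return status
    return 0


def mpirun_exit_status(primary, secondaries, report_child_jobs_separately=False):
    """Pick the exit status reported for a finished run.

    primary:     exit statuses of the primary job, in rank order.
    secondaries: {jobid: [exit statuses in rank order]} for spawned jobs.
    """
    status = first_nonzero(primary)
    if status != 0 or report_child_jobs_separately:
        return status  # the primary job decides
    for jobid in sorted(secondaries):  # lowest jobid first
        status = first_nonzero(secondaries[jobid])
        if status != 0:
            return status
    return 0


print(mpirun_exit_status([0, 3, 5], {}))
print(mpirun_exit_status([0, 0], {2: [4], 1: [0, 9]}))
```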

       By  default, OMPI records and notes that	MPI processes exited with non-
       zero termination	status.	 This is generally not considered an "abnormal
       termination" - i.e., OMPI will not abort	an MPI job if one or more pro-
       cesses return a non-zero	status.	Instead, the default  behavior	simply
       reports	the  number of processes terminating with non-zero status upon
       completion of the job.

       However,	in some	cases it can be	desirable to have the job  abort  when
       any process terminates with non-zero status. For	example, a non-MPI job
       might detect a bad result from a	calculation and	 want  to  abort,  but
       doesn't want to generate	a core file. Or	an MPI job might continue past
       a call to MPI_Finalize, but indicate that all  processes	 should	 abort
       due to some post-MPI result.

       It  is  not anticipated that this situation will	occur frequently. How-
       ever, in	the interest of	serving	the broader community, OMPI now	has  a
       means  for  allowing  users  to	direct	that  jobs be aborted upon any
       process	exiting	 with  non-zero	 status.  Setting  the	MCA  parameter
       "orte_abort_on_non_zero_status" to 1 will cause OMPI to abort all
       processes once any process exits with non-zero status.

       Terminations caused in this manner will be reported on the  console  as
       an "abnormal termination", with the first process to so exit identified
       along with its exit status.

       Be sure also to see the examples	throughout the sections	above.

       mpirun -np 4 -mca btl ib,tcp,self prog1
	   Run 4 copies	of prog1 using the "ib", "tcp",	and "self"  BTL's  for
	   the transport of MPI	messages.

       mpirun -np 4 -mca btl tcp,sm,self
	   --mca btl_tcp_if_include eth0 prog1
	   Run 4 copies	of prog1 using the "tcp", "sm" and "self" BTLs for the
	   transport of	MPI messages, with TCP using only the  eth0  interface
           to communicate.  Note that other BTLs have similar if_include MCA
           parameters.

       mpirun returns 0	if all processes started by mpirun exit	after  calling
       MPI_FINALIZE.   A  non-zero  value is returned if an internal error oc-
       curred in mpirun, or  one  or  more  processes  exited  before  calling
       MPI_FINALIZE.  If an internal error occurred in mpirun, the correspond-
       ing error code is returned.  In the event that one  or  more  processes
       exit   before   calling	 MPI_FINALIZE,	 the   return	value  of  the
       MPI_COMM_WORLD rank of the process that mpirun first notices  died  be-
       fore  calling  MPI_FINALIZE  will  be returned.	Note that, in general,
       this will be the first process that died but is not guaranteed to be
       so.


1.10.6				 Feb 17, 2017			     MPIRUN(1)

