Skip site navigation (1)Skip section navigation (2)

FreeBSD Manual Pages


home | help
fi_collective(3)		   @VERSION@		      fi_collective(3)

	      Operation	where a	subset of peers	join a new collective group.

	      Collective operation that	does not complete until	all peers have
	      entered the barrier call.

	      A	single sender transmits	data to	all peers, including itself.

	      Each peer	distributes a slice of its local data to all peers.

	      Collective operation where all peers broadcast an	atomic	opera-
	      tion to all other	peers.

	      Each peer	sends a	complete copy of its local data	to all peers.

	      Collective  call	where  data  is	 collected  from all peers and
	      merged (reduced).	 The results of	the reduction  is  distributed
	      back  to	the peers, with	each peer receiving a slice of the re-

	      Collective call where data is collected from all peers to	a root
	      peer and merged (reduced).

	      A	single sender distributes (scatters) a slice of	its local data
	      to all peers.

	      All peers	send their data	to a root peer.

	      Returns information about	which collective operations  are  sup-
	      ported by	a provider, and	limitations on the collective.

	      #include <rdma/fi_collective.h>

	      int fi_join_collective(struct fid_ep *ep,	fi_addr_t coll_addr,
		  const	struct fid_av_set *set,
		  uint64_t flags, struct fid_mc	**mc, void *context);

	      ssize_t fi_barrier(struct	fid_ep *ep, fi_addr_t coll_addr,
		  void *context);

	      ssize_t fi_broadcast(struct fid_ep *ep, void *buf, size_t	count, void *desc,
		  fi_addr_t coll_addr, fi_addr_t root_addr, enum fi_datatype datatype,
		  uint64_t flags, void *context);

	      ssize_t fi_alltoall(struct fid_ep	*ep, const void	*buf, size_t count,
		  void *desc, void *result, void *result_desc,
		  fi_addr_t coll_addr, enum fi_datatype	datatype,
		  uint64_t flags, void *context);

	      ssize_t fi_allreduce(struct fid_ep *ep, const void *buf, size_t count,
		  void *desc, void *result, void *result_desc,
		  fi_addr_t coll_addr, enum fi_datatype	datatype, enum fi_op op,
		  uint64_t flags, void *context);

	      ssize_t fi_allgather(struct fid_ep *ep, const void *buf, size_t count,
		  void *desc, void *result, void *result_desc,
		  fi_addr_t coll_addr, enum fi_datatype	datatype,
		  uint64_t flags, void *context);

	      ssize_t fi_reduce_scatter(struct fid_ep *ep, const void *buf, size_t count,
		  void *desc, void *result, void *result_desc,
		  fi_addr_t coll_addr, enum fi_datatype	datatype, enum fi_op op,
		  uint64_t flags, void *context);

	      ssize_t fi_reduce(struct fid_ep *ep, const void *buf, size_t count,
		  void *desc, void *result, void *result_desc, fi_addr_t coll_addr,
		  fi_addr_t root_addr, enum fi_datatype	datatype, enum fi_op op,
		  uint64_t flags, void *context);

	      ssize_t fi_scatter(struct	fid_ep *ep, const void *buf, size_t count,
		  void *desc, void *result, void *result_desc, fi_addr_t coll_addr,
		  fi_addr_t root_addr, enum fi_datatype	datatype,
		  uint64_t flags, void *context);

	      ssize_t fi_gather(struct fid_ep *ep, const void *buf, size_t count,
		  void *desc, void *result, void *result_desc, fi_addr_t coll_addr,
		  fi_addr_t root_addr, enum fi_datatype	datatype,
		  uint64_t flags, void *context);

	      int fi_query_collective(struct fid_domain	*domain,
		  fi_collective_op coll, struct	fi_collective_attr *attr, uint64_t flags);

       ep     Fabric endpoint on which to initiate collective operation.

       set    Address vector set defining the collective membership.

       mc     Multicast	group associated with the collective.

       buf    Local data buffer	that specifies first operand of	collective op-

	      Datatype associated with atomic operands

       op     Atomic operation to perform

       result Local data buffer	to store the result of the  collective	opera-

       desc / result_desc
	      Data  descriptor associated with the local data buffer and local
	      result buffer, respectively.

	      Address referring	to the collective group	of endpoints.

	      Single endpoint that is the source or destination	of  collective

       flags  Additional flags to apply	for the	atomic operation

	      User  specified  pointer	to associate with the operation.  This
	      parameter	is ignored if the operation will not generate  a  suc-
	      cessful  completion, unless an op	flag specifies the context pa-
	      rameter be used for required input.

       The collective APIs are new to the 1.9  libfabric  release.   Although,
       efforts	have  been  made  to design the	APIs such that they align well
       with applications and are implementable	by  the	 providers,  the  APIs
       should  be  considered experimental and may be subject to change	in fu-
       ture versions of	the library until the experimental tag	has  been  re-

       In general collective operations	can be thought of as coordinated atom-
       ic operations between a set of peer endpoints.  Readers should refer to
       the  fi_atomic(3)  man  page  for  details on the atomic	operations and
       datatypes defined by libfabric.

       A collective operation is a group communication exchange.  It  involves
       multiple	 peers	exchanging  data with other peers participating	in the
       collective call.	 Collective operations require close  coordination  by
       all  participating members.  All	participants must invoke the same col-
       lective call before any single member can complete its operation	local-
       ly.   As	 a  result, collective calls can strain	the fabric, as well as
       local and remote	data buffers.

       Libfabric collective interfaces target fabrics that support  offloading
       portions	 of  the collective communication into network switches, NICs,
       and other devices.  However, no implementation requirement is placed on
       the provider.

       The  first step in using	a collective call is identifying the peer end-
       points that will	participate.  Collective membership follows one	of two
       models,	both supported by libfabric.  In the first model, the applica-
       tion manages the	membership.  This usually means	that  the  application
       is performing a collective operation itself using point to point	commu-
       nication	to identify the	members	who will  participate.	 Additionally,
       the  application	 may  be interacting with a fabric resource manager to
       reserve network resources needed	to execute collective operations.   In
       this  model,  the application will inform libfabric that	the membership
       has already been	established.

       A separate model	moves the membership management	 under	libfabric  and
       directly	 into the provider.  In	this model, the	application must iden-
       tify which peer addresses will be members.  That	 information  is  con-
       veyed  to the libfabric provider, which is then responsible for coordi-
       nating the creation of the collective group.  In	the  provider  managed
       model, the provider will	usually	perform	the necessary collective oper-
       ation to	establish the communication group and interact with any	fabric
       management agents.

       In  both	 models,  the  collective  membership  is  communicated	to the
       provider	by creating and	configuring an address vector  set  (AV	 set).
       An  AV set represents an	ordered	subset of addresses in an address vec-
       tor (AV).  Details on creating and configuring an AV set	are  available
       in fi_av_set(3).

       Once  an	 AV set	has been programmed with the collective	membership in-
       formation,  an  endpoint	 is  joined  to	 the  set.   This   uses   the
       fi_join_collective operation and	operates asynchronously.  This differs
       from how	an endpoint is associated synchronously	with an	AV  using  the
       fi_ep_bind()  call.   Upon  completion of the fi_join_collective	opera-
       tion, an	fi_addr	is provided that is used as the	 target	 address  when
       invoking	a collective operation.

       For  developer convenience, a set of collective APIs are	defined.  Col-
       lective APIs differ from	message	and RMA	interfaces in that the	format
       of the data is known to the provider, and the collective	may perform an
       operation on that data.	This aligns collective operations closely with
       the atomic interfaces.

   Join	Collective (fi_join_collective)
       This  call attaches an endpoint to a collective membership group.  Lib-
       fabric  treats  collective  members  as	a  multicast  group,  and  the
       fi_join_collective  call	attaches the endpoint to that multicast	group.
       By default, the endpoint	will join the group based on the data transfer
       capabilities  of	 the  endpoint.	 For example, if the endpoint has been
       configured to both send and receive data, then  the  endpoint  will  be
       able to initiate	and receive transfers to and from the collective.  The
       input flags may be used to restrict access  to  the  collective	group,
       subject to endpoint capability limitations.

       Join  collective	 operations  complete  asynchronously, and may involve
       fabric transfers, dependent on the provider  implementation.   An  end-
       point  must be bound to an event	queue prior to calling fi_join_collec-
       tive.  The result of the	join operation will be reported	to the	EQ  as
       an FI_JOIN_COMPLETE event.  Applications	cannot issue collective	trans-
       fers until receiving notification that the join operation has  complet-
       ed.   Note  that	an endpoint may	begin receiving	messages from the col-
       lective group as	soon as	the join completes, which can occur  prior  to
       the FI_JOIN_COMPLETE event being	generated.

       The  join  collective  operation	is itself a collective operation.  All
       participating peers must	call fi_join_collective	before any  individual
       peer will report	that the join has completed.  Application managed col-
       lective memberships are an exception.  With application managed member-
       ships,  the  fi_join_collective	call  may be completed locally without
       fabric communication.  For provider managed memberships,	the join  col-
       lective call requires as	input a	coll_addr that refers to either	an ad-
       dress associated	with an	AV set (see  fi_av_set_addr)  or  an  existing
       collective  group  (obtained through a previous call to fi_join_collec-
       tive).  The fi_join_collective call will	create a new  collective  sub-
       group.	If  application	managed	memberships are	used, coll_addr	should
       be set to FI_ADDR_UNAVAIL.

       Applications must call fi_close on the collective group	to  disconnect
       the endpoint from the group.  After a join operation has	completed, the
       fi_mc_addr call may be used to retrieve the address associated with the
       multicast group.	 See fi_cm(3) for additional details on	fi_mc_addr().

   Barrier (fi_barrier)
       The  fi_barrier	operation  provides  a mechanism to synchronize	peers.
       Barrier does not	result in any data being transferred at	 the  applica-
       tion  level.   A	barrier	does not complete locally until	all peers have
       invoked the barrier call.  This signifies to the	local application that
       work  by	 peers	that  completed	prior to them calling barrier has fin-

   Broadcast (fi_broadcast)
       fi_broadcast transfers an array of data from a  single  sender  to  all
       other  members  of  the	collective  group.  The	input buf parameter is
       treated as the transmit buffer if the local rank	is the root, otherwise
       it  is  the  receive buffer.  The broadcast operation acts as an	atomic
       write or	read to	a data array.  As a result, the	format of the data  in
       buf is specified	through	the datatype parameter.	 Any non-void datatype
       may be broadcast.

       The following diagram shows an  example	of  broadcast  being  used  to
       transfer	an array of integers to	a group	of peers.

	      [1]  [1]	[1]
	      [5]  [5]	[5]
	      [9]  [9]	[9]
	       |____^	 ^

   All to All (fi_alltoall)
       The  fi_alltoall	 collective involves distributing (or scattering) dif-
       ferent portions of an array of data to peers.  It is best explained us-
       ing  an	example.  Here three peers perform an all to all collective to
       exchange	different entries in an	integer	array.

	      [1]   [2]	  [3]
	      [5]   [6]	  [7]
	      [9]  [10]	 [11]
		 \   |	 /
		 All to	all
		 /   |	 \
	      [1]   [5]	  [9]
	      [2]   [6]	 [10]
	      [3]   [7]	 [11]

       Each peer sends a piece of its data to the other	peers.

       All to all operations may be performed on any non-void datatype.	  How-
       ever,  all  to all does not perform an operation	on the data itself, so
       no operation is specified.

   All Reduce (fi_allreduce)
       fi_allreduce can	be described as	all  peers  providing  input  into  an
       atomic  operation, with the result copied back to each peer.  Conceptu-
       ally, this can be viewed	as each	peer issuing a multicast atomic	opera-
       tion to all other peers,	fetching the results, and combining them.  The
       combining of  the  results  is  referred	 to  as	 the  reduction.   The
       fi_allreduce() operation	takes as input an array	of data	and the	speci-
       fied atomic operation to	perform.  The results  of  the	reduction  are
       written into the	result buffer.

       Any  non-void  datatype	may be specified.  Valid atomic	operations are
       listed below in the fi_query_collective call.   The  following  diagram
       shows  an example of an all reduce operation involving summing an array
       of integers between three peers.

	       [1]  [1]	 [1]
	       [5]  [5]	 [5]
	       [9]  [9]	 [9]
		 \   |	 /
		 /   |	 \
	       [3]  [3]	 [3]
	      [15] [15]	[15]
	      [27] [27]	[27]
		All Reduce

   All Gather (fi_allgather)
       Conceptually, all gather	can be viewed as the opposite of  the  scatter
       component from reduce-scatter.  All gather collects data	from all peers
       into a single array, then copies	that array back	to each	peer.

	      [1]  [5]	[9]
		\   |	/
	       All gather
		/   |	\
	      [1]  [1]	[1]
	      [5]  [5]	[5]
	      [9]  [9]	[9]

       All gather may be performed on any  non-void  datatype.	 However,  all
       gather  does  not perform an operation on the data itself, so no	opera-
       tion is specified.

   Reduce-Scatter (fi_reduce_scatter)
       The fi_reduce_scatter collective	is similar to an  fi_allreduce	opera-
       tion,  followed	by all to all.	With reduce scatter, all peers provide
       input into an atomic operation, similar to all reduce.  However,	rather
       than  the  full	result being copied to each peer, each participant re-
       ceives only a slice of the result.

       This is shown by	the following example:

	      [1]  [1]	[1]
	      [5]  [5]	[5]
	      [9]  [9]	[9]
		\   |	/
		   sum (reduce)
		/   |	\
	      [3] [15] [27]

       The reduce scatter call supports	the same datatype and atomic operation
       as fi_allreduce.

   Reduce (fi_reduce)
       The  fi_reduce  collective  is the first	half of	an fi_allreduce	opera-
       tion.  With reduce, all peers provide input into	an  atomic  operation,
       with the	the results collected by a single 'root' endpoint.

       This  is	shown by the following example,	with the leftmost peer identi-
       fied as the root:

	      [1]  [1]	[1]
	      [5]  [5]	[5]
	      [9]  [9]	[9]
		\   |	/
		   sum (reduce)

       The reduce call supports	the same  datatype  and	 atomic	 operation  as

   Scatter (fi_scatter)
       The  fi_scatter	collective  is the second half of an fi_reduce_scatter
       operation.  The data from a single 'root' endpoint is  split  and  dis-
       tributed	to all peers.

       This is shown by	the following example:

		/   |	\
	      [3] [15] [27]

       The  scatter  operation is used to distribute results to	the peers.  No
       atomic operation	is performed on	the data.

   Gather (fi_gather)
       The fi_gather operation is used to collect (gather)  the	 results  from
       all peers and store them	at a 'root' peer.

       This  is	shown by the following example,	with the leftmost peer identi-
       fied as the root.

	      [1]  [5]	[9]
		\   |	/

       The gather operation does not perform any operation on the data itself.

   Query Collective Attributes (fi_query_collective)
       The fi_query_collective call reports which  collective  operations  are
       supported  by  the  underlying  provider,  for suitably configured end-
       points.	Collective operations needed by	an application	that  are  not
       supported  by the provider must be implemented by the application.  The
       query call checks whether a provider supports a specific	collective op-
       eration for a given datatype and	operation, if applicable.

       The  name of the	collective, as well as the datatype and	associated op-
       eration,	if applicable, and are provided	as input into fi_query_collec-

       The  coll parameter may reference one of	these collectives: FI_BARRIER,
       TER,  FI_REDUCE,	 FI_SCATTER,  or FI_GATHER.  Additional	details	on the
       collective operation is specified through the struct fi_collective_attr
       parameter.  For collectives that	act on data, the operation and related
       data type must be specified through the given attributes.

	      struct fi_collective_attr	{
		  enum fi_op op;
		  enum fi_datatype datatype;
		  struct fi_atomic_attr	datatype_attr;
		  size_t max_members;
		    uint64_t mode;

       For a description of struct fi_atomic_attr, see fi_atomic(3).

       op     On input,	this specifies the atomic operation involved with  the
	      collective  call.	  This	should	be set to one of the following
	      values:  FI_MIN,	FI_MAX,	 FI_SUM,  FI_PROD,  FI_LOR,   FI_LAND,
	      IC_WRITE,	of FI_NOOP.  For collectives that do not exchange  ap-
	      plication	data (fi_barrier), this	should be set to FI_NOOP.

	      On  onput,  specifies the	datatype of the	data being modified by
	      the collective.  This should be set to one of the	following val-
	      ues:   FI_INT8,	FI_UINT8,   FI_INT16,	FI_UINT16,   FI_INT32,
	      FI_UINT32,    FI_INT64,	 FI_UINT64,    FI_FLOAT,    FI_DOUBLE,
	      FI_LONG_DOUBLE_COMPLEX, or FI_VOID.  For collectives that	do not
	      exchange	application  data  (fi_barrier), this should be	set to

	      The maximum number of elements that may be used with the collec-

	      The size of the datatype as supported by the provider.  Applica-
	      tions should validate the	size of	datatypes that differ based on
	      the platform, such as FI_LONG_DOUBLE.

	      The maximum number of peers that may participate in a collective

       mode   This field is reserved and should	be 0.

       If a collective operation is supported,	the  query  call  will	return
       FI_SUCCESS,  along with attributes on the limits	for using that collec-
       tive operation through the provider.

       Collective operations map to underlying fi_atomic  operations.	For  a
       discussion  of atomic completion	semantics, see fi_atomic(3).  The com-
       pletion,	ordering, and atomicity	of collective operations  match	 those
       defined for point to point atomic operations.

       The following flags are defined for the specified operations.

	      Applies  to  fi_query_collective.	  When set, requests attribute
	      information on the reduce-scatter	collective operation.

       Returns 0 on success.  On error,	a negative value corresponding to fab-
       ric  errno is returned.	Fabric errno values are	defined	in rdma/fi_er-

	      See fi_msg(3) for	a detailed description of handling FI_EAGAIN.

	      The requested atomic operation is	not  supported	on  this  end-

	      The  number of collective	operations in a	single request exceeds
	      that supported by	the underlying provider.

       Collective operations map to atomic operations.	As such,  they	follow
       most  of	the conventions	and restrictions as peer to peer atomic	opera-
       tions.  This includes data atomicity, data alignment, and  message  or-
       dering  semantics.   See	fi_atomic(3) for additional information	on the
       datatypes and operations	defined	for atomic and collective operations.

       fi_getinfo(3), fi_av(3),	fi_atomic(3), fi_cm(3)


Libfabric Programmer's Manual	  2020-04-13		      fi_collective(3)


Want to link to this manual page? Use this URL:

home | help