HPL_pdpanrlN(3)		     HPL Library Functions	       HPL_pdpanrlN(3)

       HPL_pdpanrlN - Right-looking panel factorization.

       #include	"hpl.h"

       void HPL_pdpanrlN( HPL_T_panel * PANEL, const int M, const int N,
       const int ICOFF, double * WORK );

       HPL_pdpanrlN factorizes	a panel	of columns  that is a sub-array	 of  a
       larger  one-dimensional	panel A	using the Right-looking	variant	of the
       usual one-dimensional algorithm.	 The lower triangular  N0-by-N0	 upper
       block  of  the panel is stored in no-transpose form (i.e. just like the
       input matrix itself).

       Bi-directional  exchange	 is  used  to  perform	 the   swap::broadcast
       operations   at	once  for one column in	the panel.  This  results in a
       lower number of slightly	larger	messages than usual.  On  P  processes
       and  assuming  bi-directional links,  the running time of this function
       can be approximated by (when N is equal to N0):

	  N0 * log_2( P	) * ( lat + ( 2*N0 + 4 ) / bdwth ) +
	  N0^2 * ( M - N0/3 ) *	gam2-3

       where M is the local number of rows of the panel, lat and bdwth are
       the latency and bandwidth of the network for double precision real
       words, and gam2-3 is an estimate of the Level 2 and Level 3 BLAS
       rate of execution. The recursive algorithm indeed makes it possible
       to achieve almost Level 3 BLAS performance in the panel
       factorization. On a large number of modern machines, however, this
       operation is latency bound, meaning that its cost can be estimated
       by the latency portion alone, N0 * log_2(P) * lat. Mono-directional
       links will double this communication cost.

       Note that one iteration of the main loop is unrolled. The local
       computation of the absolute value max of the next column is
       performed just after its update by the current column. This brings
       the current column through cache only once at each step. The
       current implementation does not perform any blocking for this
       sequence of BLAS operations; however, the design allows for plugging
       in an optimal (machine-specific) specialized BLAS-like kernel. This
       idea was suggested to us by Fred Gustavson, IBM T.J. Watson Research
       Center.
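
       The idea of searching for the next pivot immediately after the
       column update can be illustrated with a single fused loop. This is a
       sketch only, not HPL's implementation: the function name
       axpy_and_locmax is hypothetical, and plain C loops stand in for the
       BLAS calls.

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical fused kernel, not an HPL routine.
 * Updates the next column y by the current column x,
 *   y[i] -= alpha * x[i]  for 0 <= i < m,
 * and in the same pass finds the index of the entry of y with the
 * largest absolute value. Ties go to the lowest index; returns 0
 * when m is 0. */
size_t axpy_and_locmax(size_t m, double alpha, const double *x, double *y)
{
    size_t imax = 0;
    double vmax = -1.0;              /* below any absolute value */

    for (size_t i = 0; i < m; i++) {
        y[i] -= alpha * x[i];        /* update by the current column */
        double a = fabs(y[i]);       /* candidate for the next pivot */
        if (a > vmax) {
            vmax = a;
            imax = i;
        }
    }
    return imax;
}
```

       Because the update and the search share one loop, x and y are each
       read once per step rather than once per operation.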

       PANEL   (local input/output)    HPL_T_panel *
	       On  entry,   PANEL  points to the data structure	containing the
	       panel information.

       M       (local input)	       const int
	       On entry,  M specifies the local	number of rows of sub(A).

       N       (local input)	       const int
	       On entry,  N specifies the local	number of columns of sub(A).

       ICOFF   (global input)	       const int
	       On entry, ICOFF specifies the row and column offset  of	sub(A)
	       in A.

       WORK    (local workspace)       double *
	       On entry, WORK is a work array of size at least 2*(4+2*N0).
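
       The minimum WORK size can be computed from N0 before the call. The
       sketch below is illustrative only: panel_work_len and
       panel_work_alloc are hypothetical names, not HPL routines, and the
       caller is assumed to release the buffer with free().

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: minimum number of doubles in WORK, 2*(4+2*N0). */
size_t panel_work_len(int n0)
{
    assert(n0 >= 0);
    return 2 * (4 + 2 * (size_t)n0);
}

/* Hypothetical helper: allocates a WORK buffer of the minimum size.
 * Returns NULL if the allocation fails. */
double *panel_work_alloc(int n0)
{
    return malloc(panel_work_len(n0) * sizeof(double));
}
```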

       HPL_dlocmax (3),	 HPL_dlocswpN (3),  HPL_dlocswpT (3), HPL_pdmxswp (3),
       HPL_pdpancrN (3), HPL_pdpancrT (3), HPL_pdpanllN	(3), HPL_pdpanllT (3),
       HPL_pdpanrlT (3).

HPL 2.3			       December	2, 2018		       HPL_pdpanrlN(3)

