PAR_MEM(8)			    LMBENCH			    PAR_MEM(8)

       par_mem - memory	parallelism benchmark

       par_mem [ -L line size ] [ -M len ] [ -W warmups ] [ -N repetitions ]

       par_mem measures	the available parallelism in the memory	hierarchy,  up
       to  len bytes.  Modern processors can often service multiple memory re-
       quests in parallel, while older processors typically  blocked  on  LOAD
       instructions and	had no available parallelism (other than that provided
       by cache	prefetching).  par_mem measures	the available parallelism at a
       variety	of points, since the available parallelism is often a function
       of the data location in the memory hierarchy.

       In order to measure the available parallelism, par_mem conducts a
       series of experiments at each memory size, one for each level of
       parallelism.  It builds a pointer chain of the desired length.  It then
       creates an array of pointers which point to chain entries that are
       evenly spaced across the chain.  Then it runs the pointers forward
       through the chain in parallel.  It can then measure the average memory
       latency for each level of parallelism.  The available parallelism is
       the average memory latency for parallelism 1 divided by the minimum
       average memory latency across all levels of parallelism.
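
       The setup described above can be sketched roughly as follows.  This is
       a minimal illustration, not the par_mem source: the helper names
       build_chain, spread_starts and chase are hypothetical, and the chain is
       a simple sequential ring, whereas a real benchmark would likely order
       the entries so that hardware prefetching cannot hide the latency.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical helper: link n slots into one circular pointer chain.
 * Each slot holds the address of the next slot, so following the
 * pointers visits every slot and then wraps around.
 */
char **build_chain(size_t n)
{
	char **chain = malloc(n * sizeof(*chain));

	if (chain == NULL)
		return NULL;
	for (size_t i = 0; i < n; ++i)
		chain[i] = (char *)&chain[(i + 1) % n];
	return chain;
}

/*
 * Hypothetical helper: point k starting pointers at chain entries
 * that are evenly spaced across a chain of n slots.
 */
void spread_starts(char **chain, size_t n, char **starts[], size_t k)
{
	assert(k > 0 && k <= n);
	for (size_t j = 0; j < k; ++j)
		starts[j] = &chain[j * (n / k)];
}

/*
 * Advance all k pointers by the given number of steps.  The loads of
 * different pointers are independent of each other, so hardware that
 * supports it may overlap them.
 */
void chase(char **starts[], size_t k, size_t steps)
{
	for (size_t s = 0; s < steps; ++s)
		for (size_t j = 0; j < k; ++j)
			starts[j] = (char **)*starts[j];
}
```

       Timing chase() for k = 1, 2, 3, ... over the same chain would give one
       latency figure per level of parallelism.  A generic inner loop over j
       is used here for brevity; a loop written out with one named variable
       per pointer gives the compiler less to get in the way.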

       For example, the	inner loop which measures  parallelism	2  would  look
       something like:

           for (i = 0; i < N; ++i) {
                   p0 = (char **)*p0;
                   p1 = (char **)*p1;
           }

       The overhead of the for loop itself is not significant, because the
       actual loop is unrolled and 100 loads long.  In this case, if the
       hardware can process two LOAD operations in parallel, then the overall
       latency of the loop should be equivalent to that of a single pointer
       chain, so the measured parallelism would be roughly two.  If, however,
       the hardware can only process a single LOAD operation at once, or if
       there is significant resource contention between the two LOAD
       operations, then the loop will be much slower than a loop with a single
       pointer chain, so the measured parallelism will be less than two,
       though probably no smaller than one.
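
       As a worked illustration of how such a figure could be derived from
       the timings, the sketch below divides the single-chain latency by the
       best latency seen at any level.  It assumes that each latency is
       expressed as average time per load, and the function name
       measured_parallelism is hypothetical; it is not taken from the par_mem
       source.

```c
#include <assert.h>
#include <stddef.h>

/*
 * latency[i] is assumed to be the average time per load when i + 1
 * pointer chains are walked together, so latency[0] is the
 * single-chain baseline.  The result is the largest ratio of the
 * baseline to any measured latency, and never less than one.
 */
double measured_parallelism(const double *latency, size_t levels)
{
	double best = 1.0;

	assert(levels > 0 && latency[0] > 0.0);
	for (size_t i = 1; i < levels; ++i) {
		if (latency[i] > 0.0) {
			double par = latency[0] / latency[i];

			if (par > best)
				best = par;
		}
	}
	return best;
}
```

       With a baseline of 100 time units per load, a two-chain run averaging
       50 units per load yields a parallelism of 2, while a two-chain run
       that still averages 100 units per load yields 1.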

       The output format is intended as input to xgraph or a similar program
       (we use a perl script that produces pic input).  A set of data is
       produced for each stride.  The data set title is the stride size, and
       the data points are the array size in megabytes (a floating point
       value) and the load latency over all points in that array.

       lmbench(8), line(8), cache(8), tlb(8), par_ops(8).

       Carl Staelin and	Larry McVoy

       Comments, suggestions, and bug reports are always welcome.

(c)2000 Carl Staelin and Larry McVoy				    PAR_MEM(8)

