LLVM-MCA(1)			     LLVM			   LLVM-MCA(1)

NAME
       llvm-mca - LLVM Machine Code Analyzer

SYNOPSIS
       llvm-mca [options] [input]

DESCRIPTION
       llvm-mca is a performance analysis tool that uses information available
       in LLVM (e.g. scheduling	models)	to statically measure the  performance
       of machine code in a specific CPU.

       Performance is measured in terms	of throughput as well as processor re-
       source consumption. The tool currently works  for  processors  with  an
       out-of-order  backend,  for which there is a scheduling model available
       in LLVM.

       The main goal of this tool is not just to predict the performance of
       the code when run on the target, but also to help with diagnosing po-
       tential performance issues.

       Given an	assembly code sequence,	llvm-mca  estimates  the  Instructions
       Per  Cycle  (IPC),  as well as hardware resource	pressure. The analysis
       and reporting style were	inspired by the	IACA tool from Intel.

       llvm-mca allows the usage of special code comments to mark regions of
       the assembly code to be analyzed. A comment starting with substring
       LLVM-MCA-BEGIN marks the beginning of a code region. A comment starting
       with substring LLVM-MCA-END marks the end of a code region. For exam-
       ple:

	  # LLVM-MCA-BEGIN My Code Region
	    ...
	  # LLVM-MCA-END

       Multiple	regions	can be specified provided that they do not overlap.  A
       code region can have an optional	description. If	no user-defined	region
       is specified, then llvm-mca assumes a default region which contains ev-
       ery  instruction	in the input file.  Every region is analyzed in	isola-
       tion, and the final performance report is the union of all the  reports
       generated for every code	region.

       Inline assembly directives may be used from source code to annotate the
       assembly	text:

	  int foo(int a, int b) {
	    __asm volatile("# LLVM-MCA-BEGIN foo");
	    a += 42;
	    __asm volatile("# LLVM-MCA-END");
	    a *= b;
	    return a;
	  }
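
       Multiple named regions can be marked from source in the same way. A
       minimal sketch (the region names 'add' and 'mul' are hypothetical):

	  int bar(int x, int y) {
	    __asm volatile("# LLVM-MCA-BEGIN add");  /* first named region */
	    x += y;
	    __asm volatile("# LLVM-MCA-END");
	    __asm volatile("# LLVM-MCA-BEGIN mul");  /* second, non-overlapping region */
	    x *= y;
	    __asm volatile("# LLVM-MCA-END");
	    return x;
	  }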

       So for example, you can compile code with clang,	output	assembly,  and
       pipe it directly	into llvm-mca for analysis:

	  $ clang foo.c	-O2 -target x86_64-unknown-unknown -S -o - | llvm-mca -mcpu=btver2

       Or for Intel syntax:

	  $ clang foo.c	-O2 -target x86_64-unknown-unknown -mllvm -x86-asm-syntax=intel	-S -o -	| llvm-mca -mcpu=btver2

       If  input is "-"	or omitted, llvm-mca reads from	standard input.	Other-
       wise, it	will read from the specified filename.

       If the -o option	is omitted, then llvm-mca  will	 send  its  output  to
       standard	 output	if the input is	from standard input.  If the -o	option
       specifies "-", then the output will also	be sent	to standard output.

OPTIONS
       -help  Print a summary of command line options.

       -mtriple=<target	triple>
	      Specify a	target triple string.

       -march=<arch>
	      Specify the architecture for which to analyze the code. It de-
	      faults to the host default target.

       -mcpu=<cpuname>
	      Specify the processor for which to analyze the code. By de-
	      fault, the cpu name is autodetected from the host.

       -output-asm-variant=<variant id>
	      Specify the output assembly variant for the report generated by
	      the tool. On x86, possible values are [0, 1]. A value of 0
	      (resp. 1) selects the AT&T (resp. Intel) assembly format for the
	      code printed out by the tool in the analysis report.

       -dispatch=<width>
	      Specify a different dispatch width for the processor. The dis-
	      patch width defaults to field 'IssueWidth' in the processor
	      scheduling model. If width is zero, then the default dispatch
	      width is used.

       -register-file-size=<size>
	      Specify the size of the register file. When specified, this flag
	      limits how many physical registers are available for register
	      renaming purposes. A value of zero for this flag means "unlim-
	      ited number of physical registers".

       -iterations=<number of iterations>
	      Specify the number of iterations to run. If this flag is set  to
	      0,  then	the  tool  sets	 the number of iterations to a default
	      value (i.e. 100).

       -noalias=<bool>
	      If set, the tool assumes that loads and stores don't alias. This
	      is the default behavior.

       -lqueue=<load queue size>
	      Specify the size of the load queue in the load/store unit emu-
	      lated by the tool. By default, the tool assumes an unbounded
	      number of entries in the load queue. A value of zero for this
	      flag is ignored, and the default load queue size is used instead.

       -squeue=<store queue size>
	      Specify the size of the store queue in the load/store unit emu-
	      lated by the tool. By default, the tool assumes an unbounded
	      number of entries in the store queue. A value of zero for this
	      flag is ignored, and the default store queue size is used in-
	      stead.

       -timeline
	      Enable the timeline view.

       -timeline-max-iterations=<iterations>
	      Limit the number of iterations to print in the timeline view. By
	      default, the timeline view prints information for up to 10 iter-
	      ations.

       -timeline-max-cycles=<cycles>
	      Limit the number of cycles in the timeline view. By default, the
	      number of cycles is set to 80.

       -resource-pressure
	      Enable the resource pressure view. This is enabled by default.

       -register-file-stats
	      Enable register file usage statistics.

       -dispatch-stats
	      Enable extra dispatch statistics. This view collects and ana-
	      lyzes instruction dispatch events, as well as static/dynamic
	      dispatch stall events. This view is disabled by default.

       -scheduler-stats
	      Enable extra scheduler statistics. This view collects and ana-
	      lyzes instruction issue events. This view is disabled by de-
	      fault.

       -retire-stats
	      Enable extra retire control unit statistics. This view is dis-
	      abled by default.

       -instruction-info
	      Enable the instruction info view. This is enabled by default.

       -all-stats
	      Print all hardware statistics. This enables extra statistics re-
	      lated to the dispatch logic, the hardware schedulers, the regis-
	      ter file(s), and the retire control unit. This option is dis-
	      abled by default.

       -all-views
	      Enable all the views.

       -instruction-tables
	      Prints resource pressure information based on the static infor-
	      mation available from the processor model. This differs from the
	      resource pressure view because it doesn't require that the code
	      is simulated. It instead prints the theoretical uniform distri-
	      bution of resource pressure for every instruction in sequence.

EXIT STATUS
       llvm-mca returns 0 on success. Otherwise, an error message is printed
       to standard error, and the tool returns 1.

HOW MCA WORKS
       llvm-mca takes assembly code as input. The assembly code is parsed into
       a sequence of MCInst with the help of the existing LLVM target assembly
       parsers.	 The  parsed sequence of MCInst	is then	analyzed by a Pipeline
       module to generate a performance	report.

       The Pipeline module simulates the execution of  the  machine  code  se-
       quence  in  a loop of iterations	(default is 100). During this process,
       the pipeline collects a number of execution related statistics. At  the
       end  of	this  process, the pipeline generates and prints a report from
       the collected statistics.

       Here is an example of a performance report generated by the tool for a
       dot-product of two packed float vectors of four elements. The analysis
       is conducted for target x86, cpu btver2. The following result can be
       produced via the following command using the example located at
       test/tools/llvm-mca/X86/BtVer2/dot-product.s:

	  $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=300 dot-product.s

	  Iterations:	     300
	  Instructions:	     900
	  Total	Cycles:	     610
	  Dispatch Width:    2
	  IPC:		     1.48
	  Block	RThroughput: 2.0

	  Instruction Info:
	  [1]: #uOps
	  [2]: Latency
	  [3]: RThroughput
	  [4]: MayLoad
	  [5]: MayStore
	  [6]: HasSideEffects (U)

	  [1]	 [2]	[3]    [4]    [5]    [6]    Instructions:
	   1	  2	1.00			    vmulps	%xmm0, %xmm1, %xmm2
	   1	  3	1.00			    vhaddps	%xmm2, %xmm2, %xmm3
	   1	  3	1.00			    vhaddps	%xmm3, %xmm3, %xmm4

	  Resources:
	  [0]	 - JALU0
	  [1]	- JALU1
	  [2]	- JDiv
	  [3]	- JFPA
	  [4]	- JFPM
	  [5]	- JFPU0
	  [6]	- JFPU1
	  [7]	- JLAGU
	  [8]	- JMul
	  [9]	- JSAGU
	  [10]	- JSTC
	  [11]	- JVALU0
	  [12]	- JVALU1
	  [13]	- JVIMUL

	  Resource pressure per	iteration:
	  [0]	 [1]	[2]    [3]    [4]    [5]    [6]	   [7]	  [8]	 [9]	[10]   [11]   [12]   [13]
	   -	  -	 -     2.00   1.00   2.00   1.00    -	   -	  -	 -	-      -      -

	  Resource pressure by instruction:
	  [0]	 [1]	[2]    [3]    [4]    [5]    [6]	   [7]	  [8]	 [9]	[10]   [11]   [12]   [13]   Instructions:
	   -	  -	 -	-     1.00    -	    1.00    -	   -	  -	 -	-      -      -	    vmulps	%xmm0, %xmm1, %xmm2
	   -	  -	 -     1.00    -     1.00    -	    -	   -	  -	 -	-      -      -	    vhaddps	%xmm2, %xmm2, %xmm3
	   -	  -	 -     1.00    -     1.00    -	    -	   -	  -	 -	-      -      -	    vhaddps	%xmm3, %xmm3, %xmm4

       According to this report, the dot-product kernel	has been executed  300
       times, for a total of 900 dynamically executed instructions.

       The  report  is	structured  in three main sections.  The first section
       collects	a few performance numbers; the goal of this section is to give
       a  very	quick overview of the performance throughput. In this example,
       the two important performance indicators	are IPC	and Block  RThroughput
       (Block Reciprocal Throughput).

       IPC is computed by dividing the total number of simulated instructions
       by the total number of cycles. A delta between Dispatch Width and IPC
       is an indicator of a performance issue. In the absence of loop-carried
       data dependencies, the observed IPC tends to a theoretical maximum
       which can be computed by dividing the number of instructions of a sin-
       gle iteration by the Block RThroughput.
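
       Using the numbers from the report above, the two computations are:

	  IPC		      = 900 instructions / 610 cycles	       = 1.48
	  Theoretical max IPC = 3 instructions / 2.0 Block RThroughput = 1.50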

       IPC is bounded from above by the	dispatch width.	That  is  because  the
       dispatch	width limits the maximum size of a dispatch group. IPC is also
       limited by the amount of	 hardware  parallelism.	 The  availability  of
       hardware	 resources  affects the	resource pressure distribution,	and it
       limits the number of instructions that can be executed in parallel  ev-
       ery  cycle.  A delta between Dispatch Width and the theoretical maximum
       IPC is an indicator of a	performance bottleneck caused by the  lack  of
       hardware resources. In general, the lower the Block RThroughput, the
       better.

       In this example,	Instructions per iteration/Block RThroughput is	 1.50.
       Since  there  are no loop-carried dependencies, the observed IPC	is ex-
       pected to approach 1.50 when the	number of iterations tends  to	infin-
       ity.  The  delta	between	the Dispatch Width (2.00), and the theoretical
       maximum IPC (1.50) is an	indicator of a performance  bottleneck	caused
       by  the	lack of	hardware resources, and	the Resource pressure view can
       help to identify	the problematic	resource usage.

       The second section of the  report  shows	 the  latency  and  reciprocal
       throughput  of every instruction	in the sequence. That section also re-
       ports extra information related to the number of	micro opcodes, and op-
       code properties (i.e., 'MayLoad', 'MayStore', and 'HasSideEffects').

       The third section is the	Resource pressure view.	 This view reports the
       average number of resource cycles consumed every	iteration by  instruc-
       tions  for  every processor resource unit available on the target.  In-
       formation is structured in two tables. The first	table reports the num-
       ber of resource cycles spent on average every iteration.	The second ta-
       ble correlates the resource cycles to the machine  instruction  in  the
       sequence. For example, every iteration of the instruction vmulps	always
       executes	on resource unit [6] (JFPU1 -  floating	 point	pipeline  #1),
       consuming  an  average of 1 resource cycle per iteration.  Note that on
       AMD Jaguar, vector floating-point multiply can only be issued to	 pipe-
       line  JFPU1,  while horizontal floating-point additions can only	be is-
       sued to pipeline	JFPU0.

       The resource pressure view helps	with identifying bottlenecks caused by
       high  usage  of	specific hardware resources.  Situations with resource
       pressure	mainly concentrated on a few resources should, in general,  be
       avoided.	  Ideally,  pressure  should  be uniformly distributed between
       multiple	resources.

   Timeline View
       The timeline view produces a  detailed  report  of  each	 instruction's
       state  transitions  through  an instruction pipeline.  This view	is en-
       abled by	the command line option	-timeline.  As instructions transition
       through	the  various stages of the pipeline, their states are depicted
       in the view report.  These states  are  represented  by	the  following

       o D : Instruction dispatched.

       o e : Instruction executing.

       o E : Instruction executed.

       o R : Instruction retired.

       o = : Instruction already dispatched, waiting to	be executed.

       o - : Instruction executed, waiting to be retired.

       Below  is the timeline view for a subset	of the dot-product example lo-
       cated in	test/tools/llvm-mca/X86/BtVer2/dot-product.s and processed  by
       llvm-mca	using the following command:

	  $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=3	-timeline dot-product.s

	  Timeline view:
	  Index	    0123456789

	  [0,0]	    DeeER.    .	   .   vmulps	%xmm0, %xmm1, %xmm2
	  [0,1]	    D==eeeER  .	   .   vhaddps	%xmm2, %xmm2, %xmm3
	  [0,2]	    .D====eeeER	   .   vhaddps	%xmm3, %xmm3, %xmm4
	  [1,0]	    .DeeE-----R	   .   vmulps	%xmm0, %xmm1, %xmm2
	  [1,1]	    . D=eeeE---R   .   vhaddps	%xmm2, %xmm2, %xmm3
	  [1,2]	    . D====eeeER   .   vhaddps	%xmm3, %xmm3, %xmm4
	  [2,0]	    .  DeeE-----R  .   vmulps	%xmm0, %xmm1, %xmm2
	  [2,1]	    .  D====eeeER  .   vhaddps	%xmm2, %xmm2, %xmm3
	  [2,2]	    .	D======eeeER   vhaddps	%xmm3, %xmm3, %xmm4

	  Average Wait times (based on the timeline view):
	  [0]: Executions
	  [1]: Average time spent waiting in a scheduler's queue
	  [2]: Average time spent waiting in a scheduler's queue while ready
	  [3]: Average time elapsed from WB until retire stage

		[0]    [1]    [2]    [3]
	  0.	 3     1.0    1.0    3.3       vmulps	%xmm0, %xmm1, %xmm2
	  1.	 3     3.3    0.7    1.0       vhaddps	%xmm2, %xmm2, %xmm3
	  2.	 3     5.7    0.0    0.0       vhaddps	%xmm3, %xmm3, %xmm4

       The  timeline  view  is	interesting because it shows instruction state
       changes during execution.  It also gives	an idea	of how the  tool  pro-
       cesses instructions executed on the target, and how their timing	infor-
       mation might be calculated.

       The timeline view is structured in two tables.  The first  table	 shows
       instructions  changing state over time (measured	in cycles); the	second
       table (named Average Wait  times)  reports  useful  timing  statistics,
       which  should help diagnose performance bottlenecks caused by long data
       dependencies and	sub-optimal usage of hardware resources.

       An instruction in the timeline view is identified by a pair of indices,
       where  the first	index identifies an iteration, and the second index is
       the instruction index (i.e., where it appears in	 the  code  sequence).
       Since this example was generated using 3 iterations (-iterations=3),
       the iteration indices range from 0 to 2, inclusive.

       Excluding the first and last column, the	remaining columns are  in  cy-
       cles.  Cycles are numbered sequentially starting	from 0.

       From the	example	output above, we know the following:

       o Instruction [1,0] was dispatched at cycle 1.

       o Instruction [1,0] started executing at	cycle 2.

       o Instruction [1,0] reached the write back stage	at cycle 4.

       o Instruction [1,0] was retired at cycle	10.

       Instruction  [1,0]  (i.e.,  vmulps  from	iteration #1) does not have to
       wait in the scheduler's queue for the operands to become	available.  By
       the  time  vmulps  is  dispatched,  operands are	already	available, and
       pipeline	JFPU1 is ready to serve	another	instruction.  So the  instruc-
       tion  can  be  immediately issued on the	JFPU1 pipeline.	That is	demon-
       strated by the fact that	the instruction	only spent 1cy in  the	sched-
       uler's queue.

       There  is a gap of 5 cycles between the write-back stage	and the	retire
       event.  That is because instructions must retire	in program  order,  so
       [1,0]  has  to wait for [0,2] to	be retired first (i.e.,	it has to wait
       until cycle 10).

       In the example, all instructions	are in a RAW (Read After Write)	depen-
       dency  chain.   Register	%xmm2 written by vmulps	is immediately used by
       the first vhaddps, and register %xmm3 written by	the first  vhaddps  is
       used  by	 the second vhaddps.  Long data	dependencies negatively	impact
       the ILP (Instruction Level Parallelism).

       In the dot-product example, there are anti-dependencies	introduced  by
       instructions  from  different  iterations.  However, those dependencies
       can be removed at register renaming stage (at the  cost	of  allocating
       register	aliases, and therefore consuming physical registers).

       Table  Average  Wait  times  helps diagnose performance issues that are
       caused by the presence of long  latency	instructions  and  potentially
       long data dependencies which may	limit the ILP.	Note that llvm-mca, by
       default, assumes at least 1cy between the dispatch event and the issue
       event.

       When  the  performance  is limited by data dependencies and/or long la-
       tency instructions, the number of cycles	spent while in the ready state
       is expected to be very small when compared with the total number	of cy-
       cles spent in the scheduler's queue.  The difference  between  the  two
       counters	 is  a good indicator of how large of an impact	data dependen-
       cies had	on the execution of the	 instructions.	 When  performance  is
       mostly limited by the lack of hardware resources, the delta between the
       two counters is small.  However,	the number  of	cycles	spent  in  the
       queue  tends to be larger (i.e.,	more than 1-3cy), especially when com-
       pared to	other low latency instructions.
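
       For example, in the Average Wait times table above, the second vhaddps
       spends on average 5.7 cycles in the scheduler's queue but 0.0 cycles in
       the ready state: almost all of its wait is spent on the data dependency
       on %xmm3, not on resource pressure.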

   Extra Statistics to Further Diagnose	Performance Issues
       The -all-stats command line option enables extra	statistics and perfor-
       mance  counters	for the	dispatch logic,	the reorder buffer, the	retire
       control unit, and the register file.

       Below is	an example of -all-stats  output  generated  by	 MCA  for  the
       dot-product example discussed in	the previous sections.

	  Dynamic Dispatch Stall Cycles:
	  RAT	  - Register unavailable:		       0
	  RCU	  - Retire tokens unavailable:		       0
	  SCHEDQ  - Scheduler full:			       272
	  LQ	  - Load queue full:			       0
	  SQ	  - Store queue	full:			       0
	  GROUP	  - Static restrictions	on the dispatch	group: 0

	  Dispatch Logic - number of cycles where we saw N instructions	dispatched:
	  [# dispatched], [# cycles]
	   0,		   24  (3.9%)
	   1,		   272	(44.6%)
	   2,		   314	(51.5%)

	  Schedulers - number of cycles	where we saw N instructions issued:
	  [# issued], [# cycles]
	   0,	       7  (1.1%)
	   1,	       306  (50.2%)
	   2,	       297  (48.7%)

	  Scheduler's queue usage:
	  JALU01,  0/20
	  JFPU01,  18/18
	  JLSAGU,  0/12

	  Retire Control Unit -	number of cycles where we saw N	instructions retired:
	  [# retired], [# cycles]
	   0,		109  (17.9%)
	   1,		102  (16.7%)
	   2,		399  (65.4%)

	  Register File	statistics:
	  Total	number of mappings created:    900
	  Max number of	mappings used:	       35

	  *  Register File #1 -- JFpuPRF:
	     Number of physical	registers:     72
	     Total number of mappings created: 900
	     Max number	of mappings used:      35

	  *  Register File #2 -- JIntegerPRF:
	     Number of physical	registers:     64
	     Total number of mappings created: 0
	     Max number	of mappings used:      0

       If  we  look  at	 the  Dynamic  Dispatch	Stall Cycles table, we see the
       counter for SCHEDQ reports 272 cycles.  This counter is incremented ev-
       ery  time  the  dispatch	logic is unable	to dispatch a group of two in-
       structions because the scheduler's queue	is full.

       Looking at the Dispatch Logic table, we see that	the pipeline was  only
       able  to	 dispatch  two	instructions  51.5% of the time.  The dispatch
       group was limited to one	instruction 44.6% of the cycles, which	corre-
       sponds  to 272 cycles.  The dispatch statistics are displayed by	either
       using the command option	-all-stats or -dispatch-stats.

       The next table, Schedulers, presents a histogram showing, for each is-
       sue count, the number of cycles in which that many instructions were
       issued. In this case, of the 610 simulated cycles, single instructions
       were issued 306 times (50.2%) and there were 7 cycles where no instruc-
       tions were issued.

       The Scheduler's queue usage table shows the maximum number of buffer
       entries (i.e., scheduler queue entries) used at runtime. Resource
       JFPU01 reached its maximum (18 of 18 queue entries). Note that AMD
       Jaguar implements three schedulers:

       o JALU01	- A scheduler for ALU instructions.

       o JFPU01 - A scheduler for floating point operations.

       o JLSAGU	- A scheduler for address generation.

       The  dot-product	 is  a	kernel of three	floating point instructions (a
       vector multiply followed	by two horizontal adds).   That	 explains  why
       only the	floating point scheduler appears to be used.

       A full scheduler	queue is either	caused by data dependency chains or by
       a sub-optimal usage of hardware resources.  Sometimes,  resource	 pres-
       sure  can be mitigated by rewriting the kernel using different instruc-
       tions that consume different scheduler resources.   Schedulers  with  a
       small queue are less resilient to bottlenecks caused by the presence of
       long data dependencies.	The scheduler statistics are displayed by  us-
       ing the command option -all-stats or -scheduler-stats.

       The next table, Retire Control Unit, presents a histogram showing, for
       each retire count, the number of cycles in which that many instructions
       were retired. In this case, of the 610 simulated cycles, two instruc-
       tions were retired during the same cycle 399 times (65.4%) and there
       were 109 cycles where no instructions were retired. The retire statis-
       tics are displayed by using the command option -all-stats or
       -retire-stats.

       The last table presented is Register File statistics. Each physical
       register file (PRF) used by the pipeline is presented in this table.
       In the case of AMD Jaguar, there are two register files, one for float-
       ing-point registers (JFpuPRF) and one for integer registers (JInte-
       gerPRF). The table shows that of the 900 instructions processed, there
       were 900 mappings created. Since this dot-product example utilized
       only floating point registers, the JFpuPRF was responsible for creating
       the 900 mappings. However, we see that the pipeline only used a maxi-
       mum of 35 of 72 available register slots at any given time. We can con-
       clude that the floating point PRF was the only register file used for
       the example, and that it was never resource constrained. The register
       file statistics are displayed by using the command option -all-stats or
       -register-file-stats.
       In this example,	we can conclude	that the IPC is	mostly limited by data
       dependencies, and not by	resource pressure.

   Instruction Flow
       This section describes the instruction flow through MCA's default
       out-of-order pipeline, as well as the functional units involved in the
       process.

       The  default  pipeline implements the following sequence	of stages used
       to process instructions.

       o Dispatch (Instruction is dispatched to	the schedulers).

       o Issue (Instruction is issued to the processor pipelines).

       o Write Back (Instruction is executed, and results are written back).

       o Retire (Instruction is retired; writes are architecturally commit-
	 ted).

       The  default pipeline only models the out-of-order portion of a proces-
       sor.  Therefore,	the instruction	fetch and decode stages	are  not  mod-
       eled.  Performance  bottlenecks in the frontend are not diagnosed.  MCA
       assumes that instructions have all  been	 decoded  and  placed  into  a
       queue.  Also, MCA does not model	branch prediction.

   Instruction Dispatch
       During  the  dispatch  stage,  instructions are picked in program order
       from a queue of already decoded instructions, and dispatched in	groups
       to the simulated	hardware schedulers.

       The  size  of a dispatch	group depends on the availability of the simu-
       lated hardware resources.  The processor	dispatch width defaults	to the
       value of	the IssueWidth in LLVM's scheduling model.

       An instruction can be dispatched if (see the sketch after this list):

       o The size of the dispatch group is smaller than the processor's dis-
	 patch width.

       o There are enough entries in the reorder buffer.

       o There are enough physical registers to	do register renaming.

       o The schedulers	are not	full.
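
       Taken together, these conditions amount to a simple predicate. Below is
       a minimal sketch in C, under assumed bookkeeping fields (hypothetical
       names; not llvm-mca's actual code):

	  #include <stdbool.h>

	  /* Hypothetical per-cycle bookkeeping for the dispatch stage. */
	  struct DispatchState {
	    unsigned GroupSize;	     /* instructions already in this group */
	    unsigned DispatchWidth;  /* defaults to IssueWidth */
	    unsigned FreeROBEntries; /* reorder buffer entries left */
	    unsigned NeededUOps;     /* micro-opcodes of the instruction */
	    unsigned FreePhysRegs;   /* registers left for renaming */
	    unsigned NeededRegs;     /* destination registers to rename */
	    bool     SchedulerFull;
	  };

	  static bool canDispatch(const struct DispatchState *S) {
	    return S->GroupSize < S->DispatchWidth
		&& S->FreeROBEntries >= S->NeededUOps
		&& S->FreePhysRegs >= S->NeededRegs
		&& !S->SchedulerFull;
	  }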

       Scheduling models can  optionally  specify  which  register  files  are
       available  on  the  processor.  MCA uses	that information to initialize
       register	file descriptors.  Users can limit the number of physical reg-
       isters  that  are globally available for	register renaming by using the
       command option -register-file-size.  A value of zero  for  this	option
       means  unbounded.   By knowing how many registers are available for re-
       naming, MCA can predict dispatch stalls caused by the lack of regis-
       ters.

       The number of reorder buffer entries consumed by	an instruction depends
       on the number of	 micro-opcodes	specified  by  the  target  scheduling
       model.	MCA's reorder buffer's purpose is to track the progress	of in-
       structions that are "in-flight,"	and to retire instructions in  program
       order.  The number of entries in	the reorder buffer defaults to the Mi-
       croOpBufferSize provided	by the target scheduling model.

       Instructions that are dispatched	to the	schedulers  consume  scheduler
       buffer  entries.	llvm-mca queries the scheduling	model to determine the
       set of buffered resources consumed by  an  instruction.	 Buffered  re-
       sources are treated like	scheduler resources.

   Instruction Issue
       Each processor scheduler implements a buffer of instructions. An in-
       struction has to wait in the scheduler's buffer until input register
       operands become available. Only at that point does the instruction be-
       come eligible for execution and may be issued (potentially
       out-of-order). Instruction latencies are computed by llvm-mca with the
       help of the scheduling model.

       llvm-mca's scheduler is designed	to simulate multiple processor	sched-
       ulers.	The  scheduler	is responsible for tracking data dependencies,
       and dynamically selecting which processor resources are consumed	by in-
       structions.   It	 delegates  the	management of processor	resource units
       and resource groups to a	resource manager.  The resource	manager	is re-
       sponsible  for  selecting  resource units that are consumed by instruc-
       tions.  For example, if an  instruction	consumes  1cy  of  a  resource
       group, the resource manager selects one of the available	units from the
       group; by default, the resource manager uses a round-robin selector  to
       guarantee  that	resource  usage	 is  uniformly distributed between all
       units of	a group.
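
       As a simple illustration of that round-robin policy (a hedged sketch,
       not the resource manager's actual implementation):

	  /* Rotate through the units of a resource group so that usage is
	     spread uniformly; *RRIdx persists across selections. */
	  static unsigned selectUnit(unsigned NumUnits, unsigned *RRIdx) {
	    unsigned Unit = *RRIdx;
	    *RRIdx = (*RRIdx + 1) % NumUnits;
	    return Unit;
	  }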

       llvm-mca's scheduler implements three instruction queues:

       o WaitQueue: a queue of instructions whose operands are not ready.

       o ReadyQueue: a queue of	instructions ready to execute.

       o IssuedQueue: a	queue of instructions executing.

       Depending on the	operand	availability, instructions that	are dispatched
       to the scheduler are either placed into the WaitQueue or into the
       ReadyQueue.

       Every cycle, the	scheduler checks if instructions can be	moved from the
       WaitQueue  to  the  ReadyQueue, and if instructions from	the ReadyQueue
       can be issued to	the underlying pipelines.  The	algorithm  prioritizes
       older instructions over younger instructions.
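
       A minimal, self-contained sketch of those per-cycle transitions (hypo-
       thetical data layout; not llvm-mca's actual code):

	  #include <stdbool.h>

	  enum State { WAIT, READY, ISSUED };

	  struct Inst {
	    enum State St;
	    int  Age;		 /* lower value = older in program order */
	    bool OperandsReady;
	  };

	  /* One simulated cycle: wake instructions whose operands are now
	     available, then issue the oldest ready instructions while issue
	     slots remain. */
	  static void cycleEvent(struct Inst *Q, int N, int IssueSlots) {
	    for (int i = 0; i < N; ++i)
	      if (Q[i].St == WAIT && Q[i].OperandsReady)
		Q[i].St = READY;	   /* WaitQueue -> ReadyQueue */

	    while (IssueSlots-- > 0) {
	      int Oldest = -1;
	      for (int i = 0; i < N; ++i)
		if (Q[i].St == READY &&
		    (Oldest < 0 || Q[i].Age < Q[Oldest].Age))
		  Oldest = i;
	      if (Oldest < 0)
		break;			   /* nothing ready this cycle */
	      Q[Oldest].St = ISSUED;	   /* ReadyQueue -> IssuedQueue */
	    }
	  }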

   Write-Back and Retire Stage
       Issued  instructions  are moved from the	ReadyQueue to the IssuedQueue.
       There, instructions wait	until they reach  the  write-back  stage.   At
       that point, they	get removed from the queue and the retire control unit
       is notified.

       When instructions are executed, the retire control unit flags  the  in-
       struction as "ready to retire."

       Instructions  are retired in program order.  The	register file is noti-
       fied of the retirement so that it can free the physical registers  that
       were allocated for the instruction during the register renaming stage.
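
       A sketch of in-order retirement (hypothetical reorder buffer layout;
       not llvm-mca's actual code):

	  #include <stdbool.h>

	  struct ROBEntry {
	    bool ReadyToRetire;	 /* set when write-back completes */
	    int  PhysRegs;	 /* physical registers to free on retire */
	  };

	  /* Retire in program order: scan from the oldest entry and stop at
	     the first instruction that has not written back yet. Returns
	     the number of physical registers released this cycle. */
	  static int retireCycle(struct ROBEntry *ROB, int *Head, int Tail,
				 int RetireWidth) {
	    int Freed = 0;
	    while (*Head != Tail && RetireWidth-- > 0 &&
		   ROB[*Head].ReadyToRetire) {
	      Freed += ROB[*Head].PhysRegs;
	      ++*Head;
	    }
	    return Freed;
	  }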

   Load/Store Unit and Memory Consistency Model
       To  simulate  an	 out-of-order execution	of memory operations, llvm-mca
       utilizes	a simulated load/store unit (LSUnit) to	simulate the  specula-
       tive execution of loads and stores.

       Each  load  (or	store) consumes	an entry in the	load (or store)	queue.
       Users can specify flags -lqueue and -squeue to limit the	number of  en-
       tries  in  the  load  and store queues respectively. The	queues are un-
       bounded by default.

       The LSUnit implements a relaxed consistency model for memory loads  and
       stores.	The rules are:

       1. A younger load is allowed to pass an older load only if there	are no
	  intervening stores or	barriers between the two loads.

       2. A younger load is allowed to pass an older store provided  that  the
	  load does not	alias with the store.

       3. A younger store is not allowed to pass an older store.

       4. A younger store is not allowed to pass an older load.

       By default, the LSUnit optimistically assumes that loads do not alias
       with store operations (-noalias=true). Under this assumption, younger
       loads are always allowed to pass older stores. Essentially, the LSUnit
       not attempt to run any alias analysis to	predict	when loads and	stores
       do not alias with each other.

       Note  that,  in the case	of write-combining memory, rule	3 could	be re-
       laxed to	allow reordering of non-aliasing store operations.  That being
       said,  at the moment, there is no way to	further	relax the memory model
       (-noalias is the	only option).  Essentially,  there  is	no  option  to
       specify	a  different  memory  type (e.g., write-back, write-combining,
       write-through; etc.) and	consequently to	 weaken,  or  strengthen,  the
       memory model.

       Other limitations are:

       o The LSUnit does not know when store-to-load forwarding	may occur.

       o The LSUnit does not know anything about cache hierarchy and memory
	 types.

       o The LSUnit does not know how to identify serializing  operations  and
	 memory	fences.

       The  LSUnit  does  not  attempt	to  predict if a load or store hits or
       misses the L1 cache.  It	only knows if an instruction "MayLoad"	and/or
       "MayStore."   For  loads, the scheduling	model provides an "optimistic"
       load-to-use latency (which usually matches the load-to-use latency  for
       when there is a hit in the L1D).

       llvm-mca	 does  not know	about serializing operations or	memory-barrier
       like instructions.  The LSUnit conservatively assumes that an  instruc-
       tion which has both "MayLoad" and unmodeled side	effects	behaves	like a
       "soft" load-barrier.  That means, it serializes loads without forcing a
       flush  of  the load queue.  Similarly, instructions that	"MayStore" and
       have unmodeled side effects are treated like store  barriers.   A  full
       memory barrier is a "MayLoad" and "MayStore" instruction	with unmodeled
       side effects.  This is inaccurate, but it is the	best that we can do at
       the moment with the current information available in LLVM.

       A  load/store  barrier  consumes	 one entry of the load/store queue.  A
       load/store barrier enforces ordering of loads/stores.  A	 younger  load
       cannot  pass a load barrier.  Also, a younger store cannot pass a store
       barrier.	 A younger load	has to wait for	the memory/load	barrier	to ex-
       ecute.	A  load/store barrier is "executed" when it becomes the	oldest
       entry in	the load/store queue(s). That also means, by construction, all
       of the older loads/stores have been executed.

       In conclusion, the full set of load/store consistency rules are:

       1. A store may not pass a previous store.

       2. A store may not pass a previous load (regardless of -noalias).

       3. A store has to wait until an older store barrier is fully executed.

       4. A load may pass a previous load.

       5. A load may not pass a	previous store unless -noalias is set.

       6. A load has to	wait until an older load barrier is fully executed.
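
       These rules can be condensed into a single predicate. Below is a sketch
       under a simplified representation (barrier kinds collapsed into one
       flag; not llvm-mca's actual code):

	  #include <stdbool.h>

	  struct MemOp {
	    bool IsStore;    /* false means load */
	    bool IsBarrier;  /* load or store barrier */
	    bool Executed;   /* barrier reached the head of its queue */
	  };

	  /* May Younger execute before (pass) Older? NoAlias mirrors the
	     -noalias flag. */
	  static bool mayPass(struct MemOp Younger, struct MemOp Older,
			      bool NoAlias) {
	    if (Younger.IsStore)
	      return false;	/* rules 1-3: a store passes nothing older */
	    if (Older.IsBarrier && !Older.Executed)
	      return false;	/* rule 6: wait for older barriers */
	    if (Older.IsStore)
	      return NoAlias;	/* rule 5: pass stores only under -noalias */
	    return true;	/* rule 4: a load may pass a previous load */
	  }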

AUTHOR
       Maintained by The LLVM Team (https://llvm.org/).

COPYRIGHT
       2003-2020, LLVM Project

7				  2020-08-23			   LLVM-MCA(1)

