sdiag(1)			Slurm Commands			      sdiag(1)

NAME
       sdiag - Scheduling diagnostic tool for Slurm

SYNOPSIS
       sdiag

DESCRIPTION
       sdiag reports statistics about the execution of slurmctld: threads,
       agents, jobs, and scheduling algorithms. The goal is to collect data on
       slurmctld behaviour that helps in tuning configuration parameters and
       queue policies. It is mainly intended for understanding how Slurm
       behaves on high-throughput systems.

       It has two execution modes. The default mode, --all, shows the counters
       and statistics explained below. The other mode, --reset, resets those
       values.

       Values are reset	at midnight UTC	time by	default.
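
       Because the counters restart at midnight UTC, it can help to know how
       much of the current statistics window is left when reading them. A
       minimal sketch, assuming a timezone-aware timestamp; the function name
       next_midnight_utc is hypothetical and not part of Slurm:

```python
from datetime import datetime, timedelta, timezone


def next_midnight_utc(now: datetime) -> datetime:
    """Return the first midnight UTC strictly after `now` (timezone-aware)."""
    now_utc = now.astimezone(timezone.utc)
    start_of_day = now_utc.replace(hour=0, minute=0, second=0, microsecond=0)
    return start_of_day + timedelta(days=1)


# Example timestamp chosen for illustration only.
now = datetime(2015, 4, 15, 22, 30, tzinfo=timezone.utc)
print(next_midnight_utc(now) - now)  # prints 1:30:00
```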

       The  first  block  of information is related to global slurmctld	execu-
       tion:

       Server thread count
	      The number of currently active slurmctld threads. A high number
	      means a heavy load of events to process, such as job
	      submissions, job dispatching, and job completions. If this value
	      is often close to MAX_SERVER_THREADS, it may point to a
	      bottleneck.

       Agent queue size
	      Slurm is designed with scalability in mind, and sending messages
	      to thousands of nodes is not a trivial task. The agent mechanism
	      helps to control communication between the Slurm daemons and the
	      controller on a best-effort basis. If this value is close to
	      MAX_AGENT_CNT, there may be delays affecting job management.

       Jobs submitted
	      Number of jobs submitted since last reset.

       Jobs started
	      Number of	jobs started since last	 reset.	 This  includes	 back-
	      filled jobs.

       Jobs completed
	      Number of	jobs completed since last reset.

       Jobs canceled
	      Number of	jobs canceled since last reset.

       Jobs failed
	      Number  of  jobs	failed	due to slurmd or other internal	issues
	      since last reset.
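
       The advice above about Server thread count and Agent queue size
       amounts to comparing a counter with its compile-time limit. A minimal
       sketch of such a check; the helper name near_limit, the 90% threshold,
       and the numbers used are assumptions for illustration, so substitute
       the limits built into your own slurmctld:

```python
def near_limit(value: int, limit: int, fraction: float = 0.9) -> bool:
    """Return True when `value` has reached at least `fraction` of `limit`."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return value >= fraction * limit


# Illustrative values only; they are not taken from a real controller.
hypothetical_limit = 256
print(near_limit(240, hypothetical_limit))  # prints True
print(near_limit(12, hypothetical_limit))   # prints False
```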

       The second block of information is related to the main scheduling
       algorithm, which is based on job priorities. A scheduling cycle
       acquires the job_write_lock lock and then tries to get resources for
       pending jobs, starting with the highest-priority job and proceeding in
       descending order. Once a job cannot get resources, the loop keeps
       going, but only for jobs requesting other partitions. Jobs with
       dependencies or affected by account limits are not processed.
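
       The loop described above can be sketched as a toy model. This is not
       slurmctld code: it assumes each partition owns a separate pool of free
       nodes, ignores locking, and uses hypothetical names (Job,
       schedule_cycle). The returned depth is the number of jobs the loop
       actually tried to place.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: int
    priority: int
    partition: str
    nodes: int
    eligible: bool = True  # False: waiting on a dependency or an account limit


def schedule_cycle(pending, free_nodes):
    """Run one simplified cycle; return (started job ids, depth)."""
    free = dict(free_nodes)  # copy so the caller's dict is not modified
    blocked = set()          # partitions where a job already failed to start
    started, depth = [], 0
    for job in sorted(pending, key=lambda j: j.priority, reverse=True):
        if not job.eligible or job.partition in blocked:
            continue         # skipped: not processed in this cycle
        depth += 1
        if job.nodes <= free.get(job.partition, 0):
            free[job.partition] -= job.nodes
            started.append(job.job_id)
        else:
            blocked.add(job.partition)
    return started, depth


pending = [
    Job(1, 100, "batch", 4),                  # too large: blocks "batch"
    Job(2, 90, "batch", 1),                   # skipped, "batch" is blocked
    Job(3, 80, "debug", 1),                   # starts
    Job(4, 70, "debug", 1, eligible=False),   # not processed
]
print(schedule_cycle(pending, {"batch": 2, "debug": 1}))  # prints ([3], 2)
```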

       Last cycle
	      Time in microseconds for last scheduling cycle.

       Max cycle
	      Time in microseconds for the maximum scheduling cycle since last
	      reset.

       Total cycles
	      Number of scheduling cycles since last reset. Scheduling is done
	      periodically and whenever a job is submitted or completed.

       Mean cycle
	      Mean duration of scheduling cycles since last reset.

       Mean depth cycle
	      Mean cycle depth, where depth is the number of jobs processed in
	      a scheduling cycle.

       Cycles per minute
	      Number of scheduling cycles executed per minute.

       Last queue length
	      Length of the pending job queue.
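
       The mean and per-minute values in this block are plain ratios of
       accumulated counters. A minimal sketch of that arithmetic, assuming you
       have the accumulated cycle time, the cycle count, and the seconds
       elapsed since the last reset; the function names are hypothetical and
       the rounding sdiag applies is not specified here:

```python
def mean_cycle_usec(total_cycle_usec: int, total_cycles: int) -> float:
    """Mean scheduling cycle time in microseconds (0.0 if no cycles ran)."""
    return total_cycle_usec / total_cycles if total_cycles else 0.0


def cycles_per_minute(total_cycles: int, elapsed_seconds: float) -> float:
    """Scheduling cycles per minute over the statistics window."""
    minutes = elapsed_seconds / 60
    return total_cycles / minutes if minutes > 0 else 0.0


# Illustrative numbers: 300 cycles taking 1.5 s in total over one hour.
print(mean_cycle_usec(1_500_000, 300))  # prints 5000.0
print(cycles_per_minute(300, 3600))     # prints 5.0
```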

       The third block of information is related to the backfilling scheduling
       algorithm. A backfilling scheduling cycle acquires locks on the job,
       node, and partition objects and then tries to get resources for pending
       jobs. Jobs are processed in priority order. If a job cannot get
       resources, the algorithm calculates when it could get them, which
       yields a future start time for the job. The next job is then processed:
       the algorithm tries to get resources for it without delaying the
       previous jobs, and again calculates a future start time if resources
       are not currently available. Each additional job takes longer to
       process, because more higher-priority jobs must remain unaffected. The
       algorithm itself takes measures to avoid long execution cycles and to
       avoid holding all the locks for too long.
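
       The idea of giving each job a future start time without delaying
       higher-priority jobs can be illustrated with a toy model. The sketch
       below assumes a single pool of identical nodes, integer time units, and
       exact run times; the names (Job, earliest_start, backfill_cycle) are
       hypothetical and the real scheduler is considerably more involved. Note
       how the reservation list grows with every job, which is why later jobs
       cost more to place.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: int
    priority: int
    nodes: int
    runtime: int  # time limit, in arbitrary time units


def used_at(reservations, t):
    """Nodes held at instant t by (start, end, nodes) reservations."""
    return sum(n for start, end, n in reservations if start <= t < end)


def earliest_start(reservations, job, total_nodes, now):
    """Earliest t >= now at which the job fits for its whole runtime."""
    # Usage only drops when a reservation ends, so `now` and the end
    # times are the only candidate start times worth testing.
    candidates = sorted({now} | {end for _, end, _ in reservations if end > now})
    for t in candidates:
        # Usage only rises at reservation starts inside the job's window.
        checkpoints = [t] + [s for s, _, _ in reservations if t < s < t + job.runtime]
        if all(used_at(reservations, c) + job.nodes <= total_nodes for c in checkpoints):
            return t
    raise ValueError("job requests more nodes than the cluster has")


def backfill_cycle(pending, running, total_nodes, now=0):
    """Return {job_id: planned start time}, in priority order."""
    reservations = list(running)  # running jobs as (start, end, nodes)
    plan = {}
    for job in sorted(pending, key=lambda j: j.priority, reverse=True):
        start = earliest_start(reservations, job, total_nodes, now)
        reservations.append((start, start + job.runtime, job.nodes))
        plan[job.job_id] = start
    return plan


running = [(0, 10, 3)]            # 3 of 4 nodes busy until t=10
pending = [
    Job(1, 100, nodes=4, runtime=5),   # must wait for all nodes
    Job(2, 50, nodes=1, runtime=10),   # fits now without delaying job 1
    Job(3, 10, nodes=1, runtime=20),   # would delay job 1 if started now
]
print(backfill_cycle(pending, running, total_nodes=4))
# prints {1: 10, 2: 0, 3: 15}
```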

       Total backfilled	jobs (since last slurm start)
	      Number  of  jobs	started	thanks to backfilling since last slurm
	      start.

       Total backfilled	jobs (since last stats cycle start)
	      Number of jobs started thanks to backfilling since the last time
	      stats were reset. By default these values are reset at midnight
	      UTC time.

       Total cycles
	      Number of backfilling scheduling cycles since last reset.

       Last cycle when
	      Time when the last execution cycle happened, in the format
	      "weekday Month MonthDay hour:minute.seconds year".

       Last cycle
	      Time  in microseconds of last backfilling	cycle.	It counts only
	      execution	time removing sleep time  inside  a  scheduling	 cycle
	      when  it takes too much time.  Note that locks are released dur-
	      ing the sleep time so that other work can	proceed.

       Max cycle
	      Time in microseconds  of	maximum	 backfilling  cycle  execution
	      since  last reset.  It counts only execution time	removing sleep
	      time inside a scheduling cycle when  it  takes  too  much	 time.
	      Note that	locks are released during the sleep time so that other
	      work can proceed.

       Mean cycle
	      Mean time in microseconds of backfilling scheduling cycles since
	      last reset.

       Last depth cycle
	      Number of jobs processed during the last backfilling scheduling
	      cycle. It counts every job, even one that has no chance to run
	      because of dependencies or limits.

       Last depth cycle	(try sched)
	      Number of jobs processed during the last backfilling scheduling
	      cycle. It counts only jobs with a chance to run that are waiting
	      for available resources. These are the jobs that make the
	      backfilling algorithm expensive.

       Depth Mean
	      Mean of processed	 jobs  during  backfilling  scheduling	cycles
	      since  last reset.  Jobs which are found to be ineligible	to run
	      when examined by the backfill scheduler are  not	counted	 (e.g.
	      jobs  submitted to multiple partitions and already started, jobs
	      which have reached a QOS or account limit	such as	 maximum  run-
	      ning jobs	for an account,	etc).

       Depth Mean (try sched)
	      Mean  of	processed  jobs	 during	 backfilling scheduling	cycles
	      since last reset.	 Jobs which are	found to be ineligible to  run
	      when  examined  by  the backfill scheduler are not counted (e.g.
	      jobs submitted to	multiple partitions and	already	started,  jobs
	      which  have  reached a QOS or account limit such as maximum run-
	      ning jobs	for an account,	etc).

       Last queue length
	      Number of jobs pending to be processed by the backfilling
	      algorithm. A job is counted once for each partition it
	      requested. A pending job array will normally be counted as one
	      job (tasks of a job array which have already been
	      started/requeued or individually modified will already have
	      individual job records and are each counted as a separate job).

       Queue length Mean
	      Mean number of jobs pending to be processed by the backfilling
	      algorithm. A job is counted once for each partition it
	      requested. A pending job array will normally be counted as one
	      job (tasks of a job array which have already been
	      started/requeued or individually modified will already have
	      individual job records and are each counted as a separate job).
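
       The counting rule for the backfill queue length can be written down
       directly: one count per job record per requested partition, where a
       pending job array is a single record until tasks are split off. A
       minimal sketch with made-up records; PendingRecord and queue_length are
       hypothetical names, and the job IDs are illustrative labels only:

```python
from dataclasses import dataclass


@dataclass
class PendingRecord:
    job_id: str
    partitions: list


def queue_length(records):
    """Each record counts once for every partition it requested."""
    return sum(len(r.partitions) for r in records)


records = [
    PendingRecord("100", ["batch"]),
    PendingRecord("101", ["batch", "debug"]),       # counted twice
    PendingRecord("102_[0-6,8-99]", ["batch"]),     # pending array: one record
    PendingRecord("102_7", ["batch"]),              # modified task: own record
]
print(queue_length(records))  # prints 5
```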

       The fourth and fifth blocks of information report the most frequently
       issued remote procedure calls (RPCs), that is, requests made to the
       slurmctld daemon to perform some action. The fourth block reports the
       RPCs by message type. The RPC codes can be looked up in the Slurm
       source code, in the file src/common/slurm_protocol_defs.h. The report
       includes the number of times each RPC was invoked, the total time
       consumed by all of those RPCs, and the average time consumed by each
       RPC, in microseconds. The fifth block reports the RPCs by user ID: the
       total number of RPCs each user issued, the total time consumed by all
       of those RPCs, and the average time consumed by each RPC, in
       microseconds.
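
       The relationship between the three RPC figures, and the effect of
       ordering by total or by average time, can be sketched as follows. The
       record type RpcStat, the labels, and the numbers are invented for
       illustration and do not come from a real controller:

```python
from dataclasses import dataclass


@dataclass
class RpcStat:
    label: str       # message type or user, used here only as a label
    count: int       # number of times the RPC was invoked
    total_usec: int  # total time consumed, in microseconds

    @property
    def ave_usec(self) -> float:
        """Average time per call in microseconds."""
        return self.total_usec / self.count if self.count else 0.0


stats = [
    RpcStat("TYPE_A", 1000, 50_000),
    RpcStat("TYPE_B", 10, 40_000),
    RpcStat("TYPE_C", 200, 60_000),
]

# Ordering by total time highlights the busiest RPCs overall.
by_total = sorted(stats, key=lambda s: s.total_usec, reverse=True)
# Ordering by average time highlights the slowest individual calls.
by_average = sorted(stats, key=lambda s: s.ave_usec, reverse=True)

print([s.label for s in by_total])    # prints ['TYPE_C', 'TYPE_A', 'TYPE_B']
print([s.label for s in by_average])  # prints ['TYPE_B', 'TYPE_C', 'TYPE_A']
```

       A frequent but cheap RPC ranks high by total time, while a rare but
       slow one ranks high by average time, so the two orderings answer
       different questions.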

OPTIONS
       -a, --all
	      Get and report information. This is the default mode  of	opera-
	      tion.

       -h, --help
	      Print description	of options and exit.

       -i, --sort-by-id
	      Sort  Remote  Procedure  Call  (RPC) data	by message type	ID and
	      user ID.

       -r, --reset
	      Reset counters. Only supported for Slurm operators and  adminis-
	      trators.

       -t, --sort-by-time
	      Sort Remote Procedure Call (RPC) data by total run time.

       -T, --sort-by-time2
	      Sort Remote Procedure Call (RPC) data by average run time.

       --usage
	      Print list of options and	exit.

       -V, --version
	      Print current version number and exit.
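
       When sdiag is called from a monitoring script, the options above map
       naturally onto a small wrapper. A minimal sketch using only the long
       options documented here; the helper names sdiag_argv and run_sdiag are
       hypothetical, and run_sdiag assumes sdiag is installed and on PATH:

```python
import subprocess

# Long options as documented in OPTIONS above.
SORT_FLAGS = {
    "id": "--sort-by-id",
    "time": "--sort-by-time",
    "time2": "--sort-by-time2",
}


def sdiag_argv(reset=False, sort=None):
    """Build the argument list for one sdiag invocation."""
    argv = ["sdiag"]
    if reset:
        argv.append("--reset")  # operators and administrators only
        return argv
    if sort is not None:
        argv.append(SORT_FLAGS[sort])
    return argv                 # no mode flag: reporting is the default


def run_sdiag(**kwargs) -> str:
    """Run sdiag and return its standard output as text."""
    result = subprocess.run(
        sdiag_argv(**kwargs), capture_output=True, text=True, check=True
    )
    return result.stdout


print(sdiag_argv(sort="time"))  # prints ['sdiag', '--sort-by-time']
```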

ENVIRONMENT VARIABLES
       Some sdiag options may be set via environment variables. These
       environment variables, along with their corresponding options, are
       listed below. (Note: command-line options always override these
       settings.)

       SLURM_CONF	   The location	of the Slurm configuration file.

COPYING
       Copyright (C) 2010-2011 Barcelona Supercomputing	Center.
       Copyright (C) 2010-2014 SchedMD LLC.

       Slurm  is free software;	you can	redistribute it	and/or modify it under
       the terms of the	GNU General Public License as published	 by  the  Free
       Software	 Foundation;  either version 2 of the License, or (at your op-
       tion) any later version.

       Slurm is	distributed in the hope	that it	will be	 useful,  but  WITHOUT
       ANY  WARRANTY;  without even the	implied	warranty of MERCHANTABILITY or
       FITNESS FOR A PARTICULAR	PURPOSE.  See the GNU General  Public  License
       for more	details.

SEE ALSO
       sinfo(1), squeue(1), scontrol(1), slurm.conf(5)

April 2015			Slurm Commands			      sdiag(1)
