samtools-stats(1)	     Bioinformatics tools	     samtools-stats(1)

       samtools stats - produces comprehensive statistics from an alignment
       file

       samtools	stats [options]	in.sam|in.bam|in.cram [region...]

       samtools	stats collects statistics from BAM files and outputs in	a text
       format.	The output can be visualized graphically using plot-bamstats.

       A summary of output sections is listed below, followed by more detailed
       descriptions.

       CHK   Checksum
       SN    Summary numbers
       FFQ   First fragment qualities
       LFQ   Last fragment qualities
       GCF   GC	content	of first fragments
       GCL   GC	content	of last	fragments
       GCC   ACGT content per cycle
       GCT   ACGT content per cycle, read oriented
       FBC   ACGT content per cycle for	first fragments	only
       FTC   ACGT raw counters for first fragments
       LBC   ACGT content per cycle for	last fragments only
       LTC   ACGT raw counters for last	fragments
       BCC   ACGT content per cycle for	BC barcode
       CRC   ACGT content per cycle for	CR barcode
       OXC   ACGT content per cycle for	OX barcode
       RXC   ACGT content per cycle for	RX barcode
       QTQ   Quality distribution for BC barcode
       CYQ   Quality distribution for CR barcode
       BZQ   Quality distribution for OX barcode
       QXQ   Quality distribution for RX barcode
       IS    Insert sizes
       RL    Read lengths
       FRL   Read lengths for first fragments only
       LRL   Read lengths for last fragments only
       ID    Indel size	distribution
       IC    Indels per	cycle
       COV   Coverage (depth) distribution
       GCD   GC-depth

       Not  all	sections will be reported as some depend on the	data being co-
       ordinate	sorted while others are	only  present  when  specific  barcode
       tags are	in use.

       Some  of	 the statistics	are collected for "first" or "last" fragments.
       Records are put into these categories using  the	 PAIRED	 (0x1),	 READ1
       (0x40) and READ2	(0x80) flag bits, as follows:

       o   Unpaired  reads (i.e. PAIRED	is not set) are	all "first" fragments.
	   For these records, the READ1	and READ2 flags	are ignored.

       o   Reads where PAIRED and READ1	are set, and  READ2  is	 not  set  are
	   "first" fragments.

       o   Reads  where	 PAIRED	 and  READ2  are set, and READ1	is not set are
	   "last" fragments.

       o   Reads where PAIRED is set and either	both READ1 and READ2  are  set
	   or neither is set are not counted in	either category.
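
       The four rules above can be sketched in a few lines of Python.  This is
       an illustration of the rules as stated here, not the samtools source;
       the function name fragment_category is hypothetical.

```python
PAIRED = 0x1   # read is paired
READ1 = 0x40   # first read of the pair
READ2 = 0x80   # second read of the pair


def fragment_category(flag):
    """Return "first", "last" or None for a SAM FLAG value."""
    if not flag & PAIRED:
        # Unpaired reads are always "first"; READ1/READ2 are ignored.
        return "first"
    read1 = bool(flag & READ1)
    read2 = bool(flag & READ2)
    if read1 and not read2:
        return "first"
    if read2 and not read1:
        return "last"
    # PAIRED with both or neither of READ1/READ2: not counted.
    return None
```

       For example, fragment_category(0x81) gives "last", while a paired read
       with both READ1 and READ2 set (0xC1) gives None.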

       The  CHK	row contains distinct CRC32 checksums of read names, sequences
       and quality values.  The	checksums are computed	per  alignment	record
       and  summed, meaning the	checksum does not change if the	input file has
       the sort-order changed.
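
       The order independence follows from addition being commutative.  The
       sketch below shows the idea with zlib.crc32; wrapping the sum at 32
       bits and the sample read names are assumptions made for illustration,
       and the exact samtools computation may differ.

```python
import zlib


def summed_crc32(values):
    """Sum the CRC32 of each value, wrapping the total at 32 bits."""
    total = 0
    for value in values:
        total = (total + zlib.crc32(value)) & 0xFFFFFFFF
    return total


# Hypothetical read names in two different orders.
names = [b"read1", b"read2", b"read3", b"read4"]
resorted = list(reversed(names))
same = summed_crc32(names) == summed_crc32(resorted)  # True
```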

       The SN section contains a series	of counts, percentages,	and  averages,
       in a similar style to samtools flagstat,	but more comprehensive.

	      raw total	sequences - total number of reads in a file. Same num-
	      ber reported by samtools view -c.

	      filtered sequences - number of discarded reads when using	-f  or
	      -F option.

	      sequences	- number of processed reads.

	      is  sorted  -  flag  indicating  whether	the file is coordinate
	      sorted (1) or not	(0).

	      1st fragments - number of	first fragment reads (flags  0x01  not
	      set; or flags 0x01 and 0x40 set, 0x80 not	set).

	      last  fragments  - number	of last	fragment reads (flags 0x01 and
	      0x80 set,	0x40 not set).

	      reads mapped - number of	reads,	paired	or  single,  that  are
	      mapped (flag 0x4 or 0x8 not set).

	      reads  mapped  and  paired - number of mapped paired reads (flag
	      0x1 is set and flags 0x4 and 0x8 are not set).

	      reads unmapped - number of unmapped reads	(flag 0x4 is set).

	      reads properly paired - number of	mapped paired reads with  flag
	      0x2 set.

	      paired  -	 number	 of paired reads, mapped or unmapped, that are
	      neither secondary	nor supplementary (flag	0x1 is set  and	 flags
	      0x100 (256) and 0x800 (2048) are not set).

	      reads  duplicated	- number of duplicate reads (flag 0x400	(1024)
	      is set).

	      reads MQ0	- number of mapped reads with mapping quality 0.

	      reads QC failed -	number of reads	that failed the	quality	checks
	      (flag 0x200 (512)	is set).

	      non-primary  alignments  - number	of secondary reads (flag 0x100
	      (256) set).

	      total length - number of processed bases	from  reads  that  are
	      neither secondary	nor supplementary (flags 0x100 (256) and 0x800
	      (2048) are not set).

	      total first fragment length - number of processed	bases that be-
	      long to first fragments.

	      total  last fragment length - number of processed	bases that be-
	      long to last fragments.

              bases mapped - number of processed bases that belong to reads
              mapped.

	      bases  mapped  (cigar)  -	number of mapped bases filtered	by the
	      CIGAR string corresponding to the	 read  they  belong  to.  Only
	      alignment	 matches(M),  inserts(I),  sequence matches(=) and se-
	      quence mismatches(X) are counted.

	      bases trimmed - number of	bases trimmed by bwa, that  belong  to
	      non secondary and	non supplementary reads. Enabled by -q option.

              bases duplicated - number of bases that belong to reads
              duplicated.

	      mismatches - number of mismatched	bases, as reported by  the  NM
	      tag associated with a read, if present.

	      error rate - ratio between mismatches and	bases mapped (cigar).

	      average length - ratio between total length and sequences.

	      average  first fragment length - ratio between total first frag-
	      ment length and 1st fragments.

	      average last fragment length - ratio between total last fragment
	      length and last fragments.

	      maximum  length  -  length  of  the longest read (includes hard-
	      clipped bases).

	      maximum first fragment length -  length  of  the	longest	 first
	      fragment read (includes hard-clipped bases).

	      maximum  last fragment length - length of	the longest last frag-
	      ment read	(includes hard-clipped bases).

	      average quality -	ratio between the sum of  base	qualities  and
	      total length.

	      insert  size  average - the average absolute template length for
	      paired and mapped	reads.

	      insert size standard deviation - standard	deviation for the  av-
	      erage template length distribution.

	      inward  oriented	pairs  - number	of paired reads	with flag 0x40
	      (64) set and flag	0x10 (16) not set or with flag 0x80 (128)  set
	      and flag 0x10 (16) set.

	      outward  oriented	 pairs - number	of paired reads	with flag 0x40
	      (64) set and flag	0x10 (16) set or with flag 0x80	(128) set  and
	      flag 0x10	(16) not set.

	      pairs with other orientation - number of paired reads that don't
	      fall in any of the above two categories.

              pairs on different chromosomes - number of pairs where one read
              is on one chromosome and its mate is on a different chromosome.

	      percentage of properly paired reads - percentage of reads	 prop-
	      erly paired out of sequences.

	      bases  inside the	target - number	of bases inside	the target re-
	      gion(s) (when a target file is specified with -t option).

              percentage of target genome with coverage > VAL - percentage of
              target bases with a coverage larger than VAL. By default, VAL is
              0, but a custom value can be supplied by the user with the -g
              option.
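
       Several SN values are simple ratios of other SN values, so they can be
       recomputed from the counts.  The sketch below assumes tab-separated SN
       rows of the form "SN<TAB>name:<TAB>value"; the helper name parse_sn and
       the sample numbers are hypothetical.

```python
def parse_sn(lines):
    """Collect SN rows into a dict mapping name to float value."""
    summary = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3 or fields[0] != "SN":
            continue  # not an SN row
        summary[fields[1].rstrip(":")] = float(fields[2])
    return summary


# Hypothetical values, for illustration only.
sample = [
    "SN\tsequences:\t1000",
    "SN\ttotal length:\t100000",
    "SN\tbases mapped (cigar):\t95000",
    "SN\tmismatches:\t475",
]
sn = parse_sn(sample)
# error rate = mismatches / bases mapped (cigar)
error_rate = sn["mismatches"] / sn["bases mapped (cigar)"]
# average length = total length / sequences
average_length = sn["total length"] / sn["sequences"]
```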

       The FFQ and LFQ sections	report the quality distribution	per first/last
       fragment	and per	cycle number.  They have one row per  cycle  (reported
       as the first column after the FFQ/LFQ key) with remaining columns being
       the observed integer counts per quality value, starting at quality 0 in
       the left-most column and ending at the largest observed quality.  Thus
       each row	forms its own quality  distribution  and  any  cycle  specific
       quality artefacts can be	observed.
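
       As an illustration, one FFQ or LFQ row can be turned into a mean
       quality for its cycle.  The sketch assumes tab-separated fields; the
       function names and the sample row are hypothetical.

```python
def parse_quality_row(line):
    """Split an FFQ/LFQ row into (cycle, counts); counts[q] is the count at quality q."""
    fields = line.rstrip("\n").split("\t")
    cycle = int(fields[1])
    counts = [int(x) for x in fields[2:]]
    return cycle, counts


def mean_quality(counts):
    """Mean quality of one cycle, weighting each quality by its count."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return sum(q * n for q, n in enumerate(counts)) / total


# Hypothetical row: cycle 1, counts for qualities 0 to 4.
cycle, counts = parse_quality_row("FFQ\t1\t0\t0\t10\t30\t60")
```

       Here mean_quality(counts) is (2*10 + 3*30 + 4*60) / 100 = 3.5.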

       GCF  and	 GCL  report  the total	GC content of each fragment, separated
       into first and last fragments.  The columns show	the GC percentile (be-
       tween 0 and 100)	and an integer count of	fragments at that percentile.

       GCC,  FBC  and  LBC report the nucleotide content per cycle either com-
       bined (GCC) or split into first (FBC) and last  (LBC)  fragments.   The
       columns	are cycle number (integer), and	percentage counts for A, C, G,
       T, N  and  other	 (typically  containing	 ambiguity  codes)  normalised
       against the total counts	of A, C, G and T only (excluding N and other).
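
       Because N and other are normalised against the A, C, G and T total, the
       six percentages in a row can add up to more than 100.  A minimal sketch
       of that normalisation, with a hypothetical function name and input:

```python
from collections import Counter


def base_percentages(bases):
    """Percentages of A, C, G, T, N and other, relative to the A+C+G+T total."""
    counts = Counter(b.upper() for b in bases)
    acgt_total = sum(counts[b] for b in "ACGT")
    if acgt_total == 0:
        return None  # nothing to normalise against
    n_count = counts["N"]
    other = sum(counts.values()) - acgt_total - n_count
    result = {b: 100.0 * counts[b] / acgt_total for b in "ACGT"}
    result["N"] = 100.0 * n_count / acgt_total
    result["other"] = 100.0 * other / acgt_total
    return result


# Bases seen at one cycle across ten hypothetical reads.
pct = base_percentages("AACCGGTTNR")
```

       With this input A, C, G and T are each 25.0, while N and other are each
       12.5.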

       GCT  offers a similar report to GCC, but	whereas	GCC counts nucleotides
       as they appear in the SAM output	(in reference orientation), GCT	 takes
       into  account  whether  a  nucleotide belongs to	a reverse complemented
       read and	counts it in the original read orientation.  If	there  are  no
       reverse complemented reads in a file, the GCC and GCT reports will be
       identical.
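
       The difference between the two orientations can be sketched as
       follows, assuming the reverse-strand flag bit 0x10 marks a reverse
       complemented read.  The function name is hypothetical and ambiguity
       codes are left untranslated.

```python
REVERSE = 0x10  # read is stored reverse complemented
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def original_orientation(seq, flag):
    """Return the sequence as it was read, undoing any reverse complement."""
    if flag & REVERSE:
        return seq.translate(COMPLEMENT)[::-1]
    return seq
```

       A GCC-style count would use seq as stored, whereas a GCT-style count
       would use original_orientation(seq, flag); for example, "AACG" with
       flag 0x10 becomes "CGTT".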

       FTC and LTC report the total numbers of nucleotides for first and  last
       fragments,  respectively. The columns are the raw counters for A, C, G,
       T and N bases.

       BCC, CRC, OXC and RXC are the barcode equivalent	of  GCC,  showing  nu-
       cleotide	 content  for the barcode tags BC, CR, OX and RX respectively.
       Their quality values distributions are in the QTQ,  CYQ,	 BZQ  and  QXQ
       sections, corresponding to the BC/QT, CR/CY, OX/BZ and RX/QX SAM	format
       sequence/quality	tags.  These quality value  distributions  follow  the
       same  format  used in the FFQ and LFQ sections. All these section names
       are followed by a number	(1 or 2), indicating that  the	stats  figures
       below  them  correspond	to the first or	second barcode (in the case of
       dual indexing). Thus, these sections will appear	as  BCC1,  CRC1,  OXC1
       and  RXC1, accompanied by their quality correspondents QTQ1, CYQ1, BZQ1
       and QXQ1. If a separator	is present in the barcode sequence (usually  a
       hyphen),	 indicating  dual  indexing,  then sections ending in "2" will
       also be reported	to show	the second tag statistics (e.g.	both BCC1  and
       BCC2 are	present).
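
       Splitting a dual-index barcode into its two parts might look like the
       sketch below.  It handles only a hyphen separator, which is an
       assumption; the function name is hypothetical.

```python
def split_barcode(barcode, separator="-"):
    """Return (first, second); second is None when no separator is present."""
    first, sep, second = barcode.partition(separator)
    return first, (second if sep else None)
```

       A barcode such as "ACGT-TTGA" would contribute "ACGT" to the sections
       ending in "1" and "TTGA" to those ending in "2".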

       IS reports insert size distributions with one row per size, reported in
       the first column, with subsequent columns for the  frequency  of	 total
       pairs, inward oriented pairs, outward oriented pairs and other
       orientation pairs.  The -i option specifies the maximum insert size
       reported.

       RL reports the distribution for all read	lengths, with one row per  ob-
       served  length (up to the maximum specified by the -l option).  Columns
       are read length and frequency.  FRL and LRL contain the same
       information separated into first and last fragments.

       ID  reports  the	distribution of	indel sizes, with one row per observed
       size. The columns are size, frequency of	insertions at  that  size  and
       frequency of deletions at that size.

       IC  reports the frequency of indels occurring per cycle,	broken down by
       both insertion /	deletion and by	first /	last read.   Note  for	multi-
       base  indels this only counts the first base location.  Columns are cy-
       cle, number of insertions in first fragments, number of	insertions  in
       last  fragments,	 number	of deletions in	first fragments, and number of
       deletions in last fragments.

       COV reports a distribution of the alignment depth per covered reference
       site.   For  example  an	 average depth of 50 would ideally result in a
       normal distribution centred on 50, but the presence of repeats or copy-
       number  variation may reveal multiple peaks at approximate multiples of
       50.  The	first column is	an inclusive coverage range  in	 the  form  of
       [min-max].  The next columns are	a repeat of the	maximum	portion	of the
       depth range (now	as a single integer)  and  the	frequency  that	 depth
       range  was observed.  The minimum, maximum and range step size are con-
       trolled by the -c option.  Depths above and below the minimum and maxi-
       mum are reported	with ranges [<min] and [max<].
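
       One way to picture the range labels is the sketch below, which maps a
       depth to a label given the MIN, MAX and STEP values of -c.  The bin
       boundaries are an assumption made for illustration and may not match
       samtools exactly; the function name is hypothetical.

```python
def coverage_bin(depth, cov_min=1, cov_max=1000, step=1):
    """Return a COV-style range label for a depth (assumed binning)."""
    if depth < cov_min:
        return "[<%d]" % cov_min
    if depth > cov_max:
        return "[%d<]" % cov_max
    # Inclusive range of width `step`, starting at cov_min.
    low = cov_min + ((depth - cov_min) // step) * step
    high = min(low + step - 1, cov_max)
    return "[%d-%d]" % (low, high)
```

       With the default range, a depth of 50 falls in "[50-50]" and a depth of
       1001 in "[1000<]".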

       GCD  reports  the  GC content of	the reference data aligned against per
       alignment record, with one row per observed GC percentage  reported  as
       the first column	and sorted on this column.  The	second column is a to-
       tal sequence percentile,	as a running  total  (ending  at  100%).   The
       first  and  second columns may be used to produce a simple distribution
       of GC content.  Subsequent columns list the  coverage  depth  at	 10th,
       25th,  50th, 75th and 90th GC percentiles for this specific GC percent-
       age, revealing any GC bias in  mapping.	 These	columns	 are  averaged
       depths, so are floating point with no maximum value.

       -c, --coverage MIN,MAX,STEP
	       Set  coverage  distribution  to	the specified range (MIN, MAX,
	       STEP all	given as integers) [1,1000,1]

       -d, --remove-dups
	       Exclude from statistics reads marked as duplicates

       -f, --required-flag STR|INT
	       Required	flag, 0	for unset. See also `samtools flags` [0]

       -F, --filtering-flag STR|INT
	       Filtering flag, 0 for unset. See	also `samtools flags` [0]

       --GC-depth FLOAT
	       the size	of GC-depth bins (decreasing bin size increases	memory
	       requirement) [2e4]

       -h, --help
	       This help message

       -i, --insert-size INT
	       Maximum insert size [8000]

       -I, --id	STR
	       Include only listed read	group or sample	name []

       -l, --read-length INT
	       Include in the statistics only reads with the given read	length

       -m, --most-inserts FLOAT
	       Report only the main part of inserts [0.99]

       -P, --split-prefix STR
	       A path or string	prefix to prepend  to  filenames  output  when
               creating categorised statistics files with -S/--split.
               [input filename]

       -q, --trim-quality INT
	       The BWA trimming	parameter [0]

       -r, --ref-seq FILE
	       Reference sequence (required for	GC-depth  and  mismatches-per-
	       cycle calculation).  []

       -S, --split TAG
	       In addition to the complete statistics, also output categorised
	       statistics based	on the tagged field TAG	(e.g., use --split  RG
	       to split	into read groups).

	       Categorised   statistics	 are  written  to  files  named	 <pre-
	       fix>_<value>.bamstat, where prefix is as	given by  --split-pre-
	       fix  (or	 the input filename by default)	and value has been en-
	       countered as the	specified tagged field's value in one or  more
	       alignment records.

       -t, --target-regions FILE
	       Do stats	in these regions only. Tab-delimited file chr,from,to,
	       1-based,	inclusive.  []

       -x, --sparse
	       Suppress	outputting IS rows where there are no insertions.

       -p, --remove-overlaps
	       Remove overlaps of paired-end  reads  from  coverage  and  base
	       count computations.

       -g, --cov-threshold INT
	       Only  bases  with coverage above	this value will	be included in
	       the target percentage computation [0]

       -X      If this option is set, it allows the user to specify customized
               index file location(s) if the data folder does not contain any
               index file.  Example usage: samtools stats [options] -X
               /data_folder/data.bam /index_folder/data.bai chrM:1-10

       -@, --threads INT
	       Number  of  input/output	compression threads to use in addition
	       to main thread [0].
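
       The target file read by -t uses 1-based, inclusive coordinates, whereas
       many tools use 0-based, half-open intervals.  A minimal sketch of
       reading such a file and converting the coordinates follows; the
       function name is hypothetical, and skipping blank lines and lines
       starting with "#" is an assumption.

```python
def parse_target_regions(lines):
    """Convert chr/from/to rows (1-based, inclusive) to 0-based half-open tuples."""
    regions = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # assumed: blank and comment lines are ignored
        chrom, start, end = line.split("\t")[:3]
        # 1-based inclusive [start, end] becomes 0-based half-open [start-1, end)
        regions.append((chrom, int(start) - 1, int(end)))
    return regions
```

       For instance, the row "chrM<TAB>10<TAB>20" covers eleven bases and
       becomes ("chrM", 9, 20).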

       Written by Petr Danacek with major modifications by Nicholas Clarke,
       Martin Pollard, Josh Randall, and Valeriu Ohan, all from the Sanger
       Institute.

       samtools(1), samtools-flagstat(1), samtools-idxstats(1)

       Samtools	website: <>

samtools-1.11		       22 September 2020	     samtools-stats(1)

