samtools(1)		     Bioinformatics tools		   samtools(1)

NAME
       samtools	- Utilities for	the Sequence Alignment/Map (SAM) format

SYNOPSIS
       samtools	view -bt ref_list.txt -o aln.bam aln.sam.gz

       samtools	tview aln.sorted.bam ref.fasta

       samtools	quickcheck in1.bam in2.cram

       samtools	index aln.sorted.bam

       samtools	sort -T	/tmp/aln.sorted	-o aln.sorted.bam aln.bam

       samtools	collate	-o aln.name_collated.bam aln.sorted.bam

       samtools	idxstats aln.sorted.bam

       samtools	flagstat aln.sorted.bam

       samtools	flags PAIRED,UNMAP,MUNMAP

       samtools	stats aln.sorted.bam

       samtools	bedcov aln.sorted.bam

       samtools	depth aln.sorted.bam

       samtools	ampliconstats primers.bed in.bam

       samtools	mpileup	-C50 -f	ref.fasta -r chr3:1,000-2,000 in1.bam in2.bam

       samtools	coverage aln.sorted.bam

       samtools	merge out.bam in1.bam in2.bam in3.bam

       samtools	split merged.bam

       samtools	cat out.bam in1.bam in2.bam in3.bam

       samtools	fastq input.bam	> output.fastq

       samtools	fasta input.bam	> output.fasta

       samtools	faidx ref.fasta

       samtools	fqidx ref.fastq

       samtools	dict -a	GRCh38 -s "Homo	sapiens" ref.fasta

       samtools	calmd in.sorted.bam ref.fasta

       samtools	fixmate	in.namesorted.sam out.bam

       samtools	markdup	in.algnsorted.bam out.bam

       samtools addreplacerg -r 'ID:fish' -r 'LB:1334' -r 'SM:alpha' -o
       output.bam input.bam

       samtools	reheader in.header.sam in.bam >	out.bam

       samtools	targetcut input.bam

       samtools	phase input.bam

       samtools	depad input.bam

       samtools	ampliconclip -b	bed.file input.bam

DESCRIPTION
       Samtools is a set of utilities that manipulate alignments in the SAM
       (Sequence Alignment/Map), BAM, and CRAM formats.  It converts between
       the formats, does sorting, merging and indexing, and can quickly
       retrieve reads from any region.

       Samtools	 is designed to	work on	a stream. It regards an	input file `-'
       as the standard input (stdin) and an output file	`-'  as	 the  standard
       output (stdout).	Several	commands can thus be combined with Unix	pipes.
       Samtools always outputs warning and error messages to the standard error
       output (stderr).

       Samtools is also able to open files on remote FTP or HTTP(S) servers if
       the file name starts with `ftp://', `http://', etc.  Samtools checks
       the current working directory for the index file and downloads the
       index if it is not found there.  Samtools does not retrieve the entire
       alignment file unless it is asked to do so.

       If  an index is needed, samtools	looks for the index suffix appended to
       the filename, and if that isn't found it	tries again without the	 file-
       name suffix (for	example	in.bam.bai followed by in.bai).	 However if an
       index is	in a completely	different location or has  a  different	 name,
       both  the  main data filename and index filename	can be pasted together
       with ##idx##.  For example  /data/in.bam##idx##/indices/in.bam.bai  may
       be used to explicitly indicate where the	data and index files reside.
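
       The lookup rule above can be sketched in a few lines of Python.  This
       is only an illustration of the naming convention, not samtools' own
       code; the function name index_candidates and the default ".bai" suffix
       are assumptions made for the example.

```python
import os

IDX_DELIM = "##idx##"


def index_candidates(name, index_suffix=".bai"):
    """Return (data_file, [index names to try, in order])."""
    if IDX_DELIM in name:
        # Explicit form: data and index names pasted together with ##idx##.
        data, index = name.split(IDX_DELIM, 1)
        return data, [index]
    # Implicit form: try the suffix appended to the full name first, then
    # the same suffix with the data file's own suffix removed.
    stem, _ext = os.path.splitext(name)
    return name, [name + index_suffix, stem + index_suffix]


print(index_candidates("in.bam"))
print(index_candidates("/data/in.bam##idx##/indices/in.bam.bai"))
```

       The first call yields in.bam.bai followed by in.bai, as in the text;
       the second returns only the explicitly named index.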

COMMANDS
       Each command has its own man page, which can be viewed using e.g. man
       samtools-view or, with a recent GNU man, using man samtools view.  A
       brief summary of each sub-command's syntax and purpose follows.

       Options	common	to all sub-commands are	documented below in the	GLOBAL
       COMMAND OPTIONS section.

       view	 samtools view [options] in.sam|in.bam|in.cram [region...]

		 With no options or regions specified, prints  all  alignments
		 in  the  specified input alignment file (in SAM, BAM, or CRAM
		 format) to standard output in SAM format (with	no  header  by
		 default).

		 You may specify one or	more space-separated region specifica-
		 tions after the input filename	to  restrict  output  to  only
		 those	alignments  which overlap the specified	region(s). Use
		 of region specifications requires a coordinate-sorted and in-
		 dexed input file.

		 Options  exist	to change the output format from SAM to	BAM or
		 CRAM, so this command also acts as a file  format  conversion
		 utility.

       tview	 samtools   tview   [-p	  chr:pos]   [-s   STR]	 [-d  display]
		 <in.sorted.bam> [ref.fasta]

		 Text alignment viewer (based on the ncurses library).  In the
		 viewer, press `?' for help and press `g' to view the alignment
		 starting from a region given in a format like
		 `chr10:10,000,000', or `=10,000,000' when viewing the same
		 reference sequence.

       quickcheck
		 samtools quickcheck [options] in.sam|in.bam|in.cram [ ... ]

		 Quickly check that input files appear to be intact.  Checks
		 that the beginning of the file contains a valid header (all
		 formats) containing at least one target sequence, and then
		 seeks to the end of the file and checks that an end-of-file
		 (EOF) block is present and intact (BAM only).

		 Data in the middle of the file	is not read since  that	 would
		 be much more time consuming, so please	note that this command
		 will not detect internal corruption, but is useful for	 test-
		 ing  that  files are not truncated before performing more in-
		 tensive tasks on them.

		 This command will exit	with a non-zero	exit code if any input
		 files	don't have a valid header or are missing an EOF	block.
		 Otherwise it will exit	successfully (with a zero exit code).
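
		 The BAM end-of-file test can be approximated as follows.
		 This sketch covers only the EOF part of the check (not the
		 header validation) and is not the samtools implementation;
		 it relies on the 28-byte empty BGZF block that the SAM/BAM
		 specification defines as the EOF marker.  The helper name
		 has_bgzf_eof and the demo file names are hypothetical.

```python
import os
import tempfile

# Empty BGZF block used as the BAM end-of-file marker (28 bytes).
BGZF_EOF = bytes.fromhex(
    "1f8b0804" "00000000" "00ff" "0600" "4243" "0200"
    "1b00" "0300" "0000000000000000"
)


def has_bgzf_eof(path):
    """Return True if the file ends with the empty BGZF EOF block."""
    if os.path.getsize(path) < len(BGZF_EOF):
        return False
    with open(path, "rb") as fh:
        # Seek straight to the tail; the middle of the file is never read.
        fh.seek(-len(BGZF_EOF), os.SEEK_END)
        return fh.read() == BGZF_EOF


# Demo on two throwaway files (placeholder bytes, not real BAM content).
with tempfile.TemporaryDirectory() as tmp:
    intact = os.path.join(tmp, "intact.bam")
    with open(intact, "wb") as fh:
        fh.write(b"not real BAM data" + BGZF_EOF)
    truncated = os.path.join(tmp, "truncated.bam")
    with open(truncated, "wb") as fh:
        fh.write(b"not real BAM data")
    results = (has_bgzf_eof(intact), has_bgzf_eof(truncated))

print(results)
```

		 Because only the last few bytes are inspected, a check of
		 this kind finds truncation but not internal corruption.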

       index	 samtools index	 [-bc]	[-m  INT]  aln.sam.gz|aln.bam|aln.cram
		 [out.index]

		 Index a coordinate-sorted SAM,	BAM or CRAM file for fast ran-
		 dom access.  Note for SAM this	only works  if	the  file  has
		 been BGZF compressed first.

		 This  index is	needed when region arguments are used to limit
		 samtools view and similar commands to particular  regions  of
		 interest.

		 If  an	output filename	is given, the index file will be writ-
		 ten to	out.index.  Otherwise, for a CRAM file aln.cram, index
		 file  aln.cram.crai  will  be	created; for a BAM or SAM file
		 aln.bam, either aln.bam.bai or	aln.bam.csi will  be  created,
		 depending on the index	format selected.

       sort	 samtools sort [-l level] [-m maxMem] [-o out.bam] [-O format]
		 [-n] [-t tag] [-T tmpprefix] [-@ threads]
		 [in.sam|in.bam|in.cram]

		 Sort alignments by leftmost coordinates, or by	read name when
		 -n is used.  An appropriate @HD-SO sort order header tag will
		 be added or an	existing one updated if	necessary.

		 The  sorted  output is	written	to standard output by default,
		 or to the specified file (out.bam) when  -o  is  used.	  This
		 command  will also create temporary files tmpprefix.%d.bam as
		 needed	when the entire	alignment data cannot fit into	memory
		 (as controlled	via the	-m option).

		 Consider using	samtools collate instead if you	need name col-
		 lated data without a full lexicographical sort.

       collate	 samtools collate [options] in.sam|in.bam|in.cram [<prefix>]

		 Shuffles and groups reads together by their names.  A	faster
		 alternative  to  a full query name sort, collate ensures that
		 reads of the same name	are  grouped  together	in  contiguous
		 groups,  but  doesn't	make any guarantees about the order of
		 read names between groups.

		 The output from this command should be	suitable for any oper-
		 ation	that  requires	all reads from the same	template to be
		 grouped together.

       idxstats	 samtools idxstats in.sam|in.bam|in.cram

		 Retrieve and print stats in the index file  corresponding  to
		 the  input file.  Before calling idxstats, the	input BAM file
		 should	be indexed by samtools index.

		 If run	on a SAM or CRAM file or an unindexed BAM  file,  this
		 command  will	still produce the same summary statistics, but
		 does so by reading through the	 entire	 file.	 This  is  far
		 slower	than using the BAM indices.

		 The output is TAB-delimited with each line consisting of ref-
		 erence	sequence name, sequence	length,	# mapped reads	and  #
		 unmapped reads. It is written to stdout.

       flagstat	 samtools flagstat in.sam|in.bam|in.cram

		 Does  a  full	pass  through  the input file to calculate and
		 print statistics to stdout.

		 Provides counts for each of 13	categories based primarily  on
		 bit  flags  in	the FLAG field.	Each category in the output is
		 broken	down into QC pass and QC fail, which is	 presented  as
		 "#PASS	+ #FAIL" followed by a description of the category.

       flags	 samtools flags	INT|STR[,...]

		 Convert between textual and numeric flag representation.

		 FLAGS:

		   0x1	 PAIRED		 paired-end (or	multiple-segment) sequencing technology
		   0x2	 PROPER_PAIR	 each segment properly aligned according to the	aligner
		   0x4	 UNMAP		 segment unmapped
		   0x8	 MUNMAP		 next segment in the template unmapped
		  0x10	 REVERSE	 SEQ is	reverse	complemented
		  0x20	 MREVERSE	 SEQ of	the next segment in the	template is reverse complemented
		  0x40	 READ1		 the first segment in the template
		  0x80	 READ2		 the last segment in the template
		 0x100	 SECONDARY	 secondary alignment
		 0x200	 QCFAIL		 not passing quality controls
		 0x400	 DUP		 PCR or	optical	duplicate
		 0x800	 SUPPLEMENTARY	 supplementary alignment
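
		 The conversion amounts to OR-ing the bit values in the table
		 above.  A minimal Python sketch follows; the function names
		 flags_to_int and int_to_flags are made up for illustration
		 and say nothing about how samtools implements the command.

```python
# Bit values copied from the FLAGS table.
FLAGS = {
    "PAIRED": 0x1, "PROPER_PAIR": 0x2, "UNMAP": 0x4, "MUNMAP": 0x8,
    "REVERSE": 0x10, "MREVERSE": 0x20, "READ1": 0x40, "READ2": 0x80,
    "SECONDARY": 0x100, "QCFAIL": 0x200, "DUP": 0x400,
    "SUPPLEMENTARY": 0x800,
}


def flags_to_int(names):
    """Combine a comma-separated list of flag names into one integer."""
    value = 0
    for name in names.split(","):
        value |= FLAGS[name.strip()]
    return value


def int_to_flags(value):
    """List the names of all bits set in value, lowest bit first."""
    return ",".join(name for name, bit in FLAGS.items() if value & bit)


print(flags_to_int("PAIRED,UNMAP,MUNMAP"))
print(int_to_flags(99))
```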

       stats	 samtools stats	[options] in.sam|in.bam|in.cram	[region...]

		 samtools stats	collects statistics from BAM files and outputs
		 in a text format.  The	output can be  visualized  graphically
		 using plot-bamstats.

       bedcov	 samtools	   bedcov	  [options]	    region.bed
		 in1.sam|in1.bam|in1.cram[...]

		 Reports the total read	base count (i.e. the sum of  per  base
		 read  depths)	for  each genomic region specified in the sup-
		 plied BED file. The regions are output	as they	appear in  the
		 BED  file  and	 are  0-based.	Counts for each	alignment file
		 supplied are reported in separate columns.

       depth	 samtools    depth     [options]     [in1.sam|in1.bam|in1.cram
		 [in2.sam|in2.bam|in2.cram] [...]]

		 Computes the read depth at each position or region.

       ampliconstats
		 samtools	ampliconstats	    [options]	   primers.bed
		 in.sam|in.bam|in.cram[...]

		 samtools ampliconstats	collects statistics from one  or  more
		 input	alignment  files  and  produces	tables in text format.
		 The output can	be visualized graphically using	plot-amplicon-
		 stats.

		 The  alignment	 files	should have previously been clipped of
		 primer	sequence, for example by samtools ampliconclip and the
		 sites	of  these primers should be specified as a bed file in
		 the arguments.

       mpileup	 samtools mpileup [-EB]	[-C capQcoef] [-r reg] [-f in.fa]  [-l
		 list] [-Q minBaseQ] [-q minMapQ] in.bam [in2.bam [...]]

		 Generate  textual  pileup for one or multiple BAM files.  For
		 VCF and BCF output, please use	the bcftools  mpileup  command
		 instead.   Alignment records are grouped by sample (SM) iden-
		 tifiers in @RG	header lines.  If sample identifiers  are  ab-
		 sent, each input file is regarded as one sample.

		 See  the  samtools-mpileup  man page for a description	of the
		 pileup	format and options.

       coverage	 samtools   coverage	[options]    [in1.sam|in1.bam|in1.cram
		 [in2.sam|in2.bam|in2.cram] [...]]

		 Produces a histogram or table of coverage per chromosome.

       merge	 samtools  merge  [-nur1f]  [-h	inh.sam] [-t tag] [-R reg] [-b
		 list] out.bam in1.bam [in2.bam	in3.bam	... inN.bam]

		 Merge multiple	sorted alignment  files,  producing  a	single
		 sorted	 output	 file  that contains all the input records and
		 maintains the existing	sort order.

		 If -h is specified the	@SQ headers of	input  files  will  be
		 merged	 into  the  specified  header,	otherwise they will be
		 merged	into a composite header	created	from the  input	 head-
		 ers.  If the @SQ headers differ in order this may require the
		 output	file to	be re-sorted after merge.

		 The ordering of the records in	the input files	must match the
		 usage of the -n and -t	command-line options.  If they do not,
		 the output order will be undefined.  See sort for information
		 about record ordering.

       split	 samtools split	[options] merged.sam|merged.bam|merged.cram

		 Splits	 a  file  by  read group, producing one	or more	output
		 files matching	a common prefix	(by default based on the input
		 filename) each	containing one read-group.

       cat	 samtools  cat	[-b list] [-h header.sam] [-o out.bam] in1.bam
		 in2.bam [ ... ]

		 Concatenate BAMs or CRAMs. Although this works	on either  BAM
		 or  CRAM,  all	 input	files  must be the same	format as each
		 other.	The sequence dictionary	of each	 input	file  must  be
		 identical,  although  this  command does not check this. This
		 command uses a	similar	trick to reheader which	 enables  fast
		 BAM concatenation.

       fastq/a	 samtools fastq	[options] in.bam
		 samtools fasta	[options] in.bam

		 Converts  a BAM or CRAM into either FASTQ or FASTA format de-
		 pending on the	command	invoked. The files will	 be  automati-
		 cally compressed if the file names have a .gz or .bgzf	exten-
		 sion.

		 The input to this program must	be collated by name.  Use sam-
		 tools collate or samtools sort	-n to ensure this.

       faidx	 samtools faidx	<ref.fasta> [region1 [...]]

		 Index	reference sequence in the FASTA	format or extract sub-
		 sequence from indexed reference sequence.  If	no  region  is
		 specified,   faidx   will   index   the   file	  and	create
		 <ref.fasta>.fai on the disk. If regions are specified, the
		 subsequences  will  be	retrieved and printed to stdout	in the
		 FASTA format.

		 The input file	can be compressed in the BGZF format.

		 FASTQ files can be read and indexed by	this command.  Without
		 using --fastq any extracted subsequence will be in FASTA for-
		 mat.

       fqidx	 samtools fqidx	<ref.fastq> [region1 [...]]

		 Index reference sequence in the FASTQ format or extract  sub-
		 sequence  from	 indexed  reference  sequence. If no region is
		 specified,   fqidx   will   index   the   file	  and	create
		 <ref.fastq>.fai on the disk. If regions are specified, the
		 subsequences will be retrieved	and printed to stdout  in  the
		 FASTQ format.

		 The input file	can be compressed in the BGZF format.

		 samtools  fqidx  should  only	be  used on fastq files	with a
		 small number of entries.  Trying to use it on a file contain-
		 ing  millions of short	sequencing reads will produce an index
		 that is almost	as big as the original file, and searches  us-
		 ing the index will be very slow and use a lot of memory.

       dict	 samtools dict ref.fasta|ref.fasta.gz

		 Create	a sequence dictionary file from	a fasta	file.

       calmd	 samtools calmd	[-Eeubr] [-C capQcoef] aln.bam ref.fasta

		 Generate  the	MD tag.	If the MD tag is already present, this
		 command will give a warning if	the MD tag generated  is  dif-
		 ferent	from the existing tag. Output SAM by default.

		 Calmd	can  also  read	 and write CRAM	files although in most
		 cases it is pointless as CRAM recalculates MD and NM tags  on
		 the  fly.  The	one exception to this case is where both input
		 and output CRAM files have been / are being created with  the
		 no_ref	option.

       fixmate	 samtools fixmate [-rpcm] [-O format] in.nameSrt.bam out.bam

		 Fill in mate coordinates, ISIZE and mate related flags	from a
		 name-sorted alignment.

       markdup	 samtools markdup [-l length] [-r] [-s] [-T] [-S]
		 in.algsort.bam out.bam

		 Mark  duplicate alignments from a coordinate sorted file that
		 has been run through samtools fixmate	with  the  -m  option.
		 This  program	relies on the MC and ms	tags that fixmate pro-
		 vides.

       rmdup	 samtools rmdup	[-sS] <input.srt.bam> <out.bam>

		 This command is obsolete. Use markdup instead.

       addreplacerg
		 samtools addreplacerg [-r rg-line | -R	rg-ID] [-m  mode]  [-l
		 level]	[-o out.bam] in.bam

		 Adds or replaces read group tags in a file.

       reheader	 samtools reheader [-iP] in.header.sam in.bam

		 Replace   the	 header	  in   in.bam	with   the  header  in
		 in.header.sam.	 This command is much  faster  than  replacing
		 the header with a BAM->SAM->BAM conversion.

		 By default this command outputs the BAM or CRAM file to stan-
		 dard output (stdout), but for CRAM format files  it  has  the
		 option	 to perform an in-place	edit, both reading and writing
		 to the	same file.  No validity	checking is performed  on  the
		 header, nor that it is	suitable to use	with the sequence data
		 itself.

       targetcut samtools targetcut [-Q	minBaseQ] [-i inPenalty] [-0 em0]  [-1
		 em1] [-2 em2] [-f ref]	in.bam

		 This  command identifies target regions by examining the con-
		 tinuity of read depth,	computes haploid  consensus  sequences
		 of targets and	outputs	a SAM with each	sequence corresponding
		 to a target. When option -f is	in use,	BAQ will  be  applied.
		 This  command is only designed	for cutting fosmid clones from
		 fosmid	pool sequencing	[Ref. Kitzman et al. (2010)].

       phase	 samtools phase	[-AF] [-k len] [-b  prefix]  [-q  minLOD]  [-Q
		 minBaseQ] in.bam

		 Call and phase	heterozygous SNPs.

       depad	 samtools depad	[-SsCu1] [-T ref.fa] [-o output] in.bam

		 Converts  a  BAM  aligned against a padded reference to a BAM
		 aligned against the depadded reference.  The padded reference
		 may  contain verbatim "*" bases in it,	but "*"	bases are also
		 counted in the	reference numbering.  This means  that	a  se-
		 quence	 base-call  aligned against a reference	"*" is consid-
		 ered to be a cigar match ("M" or "X") operator	(if the	 base-
		 call is "A", "C", "G" or "T").	 After depadding the reference
		 "*" bases are deleted and such	 aligned  sequence  base-calls
		 become insertions.  Similar transformations apply for deletions
		 and padding cigar operations.

       ampliconclip
		 samtools ampliconclip [-o out.file] [-f  stat.file]  [--soft-
		 clip]	 [--hard-clip]	[--both-ends]  [--strand]  [--clipped]
		 [--fail] [--no-PG] -b bed.file	in.file

		 Clip reads in a SAM compatible	file based on data from	a  BED
		 file.

SAMTOOLS OPTIONS
       These  are  options  that are passed after the samtools command,	before
       any sub-command is specified.

       help, --help
	      Display a	brief usage  message  listing  the  samtools  commands
	      available.   If  the name	of a command is	also given, e.g., sam-
	      tools help view, the detailed usage message for that  particular
	      command is displayed.

       --version
	      Display  the  version numbers and	copyright information for sam-
	      tools and	the important libraries	used by	samtools.

       --version-only
	      Display the full samtools	version	number in  a  machine-readable
	      format.

GLOBAL COMMAND OPTIONS
       Several long-options are	shared between multiple	samtools sub-commands:
       --input-fmt,  --input-fmt-option,  --output-fmt,	  --output-fmt-option,
       --reference, --write-index, and --verbosity.  The input format is typi-
       cally auto-detected so specifying the format is usually unnecessary and
       the option is included for completeness.	 Note that not all subcommands
       have all	options.  Consult the subcommand help for more details.

       Format strings recognised are "sam", "sam.gz", "bam" and	"cram".	  They
       may  be	followed  by  a	 comma	separated  list	 of  options as	key or
       key=value. See below for	examples.

       The fmt-option arguments	accept either a	single option or option=value.
       Note  that some options only work on some file formats and only on read
       or write	streams.  If value is unspecified for a	 boolean  option,  the
       value is	assumed	to be 1.  The valid options are	as follows.

       level=INT
	   Output  only. Specifies the compression level from 1	to 9, or 0 for
	   uncompressed.  If the output	format is SAM, this also enables  BGZF
	   compression,	otherwise SAM defaults to uncompressed.

       nthreads=INT
	   Specifies  the  number of threads to	use during encoding and/or de-
	   coding.  For	BAM this will be encoding only.	 In CRAM  the  threads
	   are dynamically shared between encoder and decoder.

       filter=STRING
	   Apply  filter STRING	to all incoming	records, rejecting any that do
	   not satisfy the expression.	See the	FILTER EXPRESSIONS section be-
	   low for specifics.

       reference=fasta_file
	   Specifies a FASTA reference file for use in CRAM encoding or
	   decoding.  It is usually not required for decoding, except when the
	   MD5 cannot be obtained via the REF_PATH or REF_CACHE environment
	   variables.

       decode_md=0|1
	   CRAM	input only; defaults to	1 (on).	 CRAM does not typically store
	   MD  and NM tags, preferring to generate them	on the fly.  When this
	   option is 0 missing MD, NM tags will	not be generated.  It  can  be
	   particularly	 useful	 when  combined	 with  a  file	encoded	 using
	   store_md=1 and store_nm=1.

       store_md=0|1
	   CRAM	output only; defaults to 0 (off).  CRAM	normally  only	stores
	   MD tags when	the reference is unknown and lets the decoder generate
	   these values	on-the-fly (see	decode_md).

       store_nm=0|1
	   CRAM	output only; defaults to 0 (off).  CRAM	normally  only	stores
	   NM tags when	the reference is unknown and lets the decoder generate
	   these values	on-the-fly (see	decode_md).

       ignore_md5=0|1
	   CRAM	input only; defaults to	0 (off).  When enabled,	 md5  checksum
	   errors  on  the reference sequence and block	checksum errors	within
	   CRAM	are ignored.  Use of this option is strongly discouraged.

       required_fields=bit-field
	   CRAM	input only; specifies which SAM	columns	need to	be  populated.
	   By  default	all  fields are	used.  Limiting	the decode to specific
	   columns can have significant	performance gains.  The	bit-field is a
	   numerical value constructed from the	following table.

	      0x1   SAM_QNAME
	      0x2   SAM_FLAG
	      0x4   SAM_RNAME
	      0x8   SAM_POS
	     0x10   SAM_MAPQ
	     0x20   SAM_CIGAR
	     0x40   SAM_RNEXT
	     0x80   SAM_PNEXT
	    0x100   SAM_TLEN
	    0x200   SAM_SEQ
	    0x400   SAM_QUAL
	    0x800   SAM_AUX
	   0x1000   SAM_RGAUX

       name_prefix=string
	   CRAM	 input	only; defaults to output filename.  Any	sequences with
	   auto-generated read names will use string as	the name prefix.

       multi_seq_per_slice=0|1
	   CRAM	output only; defaults to 0 (off).  By default  CRAM  generates
	   one	container  per	reference sequence, except in the case of many
	   small references (such as a fragmented assembly).

       version=major.minor
	   CRAM	output only.  Specifies	the CRAM version  number.   Acceptable
	   values are "2.1" and	"3.0".

       seqs_per_slice=INT
	   CRAM	output only; defaults to 10000.

       slices_per_container=INT
	   CRAM	 output	 only;	defaults  to 1.	 The effect of having multiple
	   slices per container	is to share the	compression header  block  be-
	   tween  multiple  slices.   This is unlikely to have any significant
	   impact unless the number of sequences per slice is  reduced.	  (To-
	   gether these	two options control the	granularity of random access.)

       embed_ref=0|1
	   CRAM output only; defaults to 0 (off).  If 1, this will store
	   portions of the reference sequence in each slice, permitting decode
	   without requiring an external copy of the reference sequence.

       no_ref=0|1
	   CRAM	output only; defaults to 0 (off).  If  1,  sequences  will  be
	   stored  verbatim with no reference encoding.	 This can be useful if
	   no reference	is available for the file.

       use_bzip2=0|1
	   CRAM	output only; defaults to 0 (off).  Permits  use	 of  bzip2  in
	   CRAM	block compression.

       use_lzma=0|1
	   CRAM	output only; defaults to 0 (off).  Permits use of lzma in CRAM
	   block compression.

       lossy_names=0|1
	   CRAM	output only; defaults to 0 (off).  If 1,  templates  with  all
	   members  within  the	same CRAM slice	will have their	read names re-
	   moved.  New names will be automatically generated during  decoding.
	   Also	see the	name_prefix option.

       For example:

	   samtools view --input-fmt-option decode_md=0
	       --output-fmt cram,version=3.0 --output-fmt-option embed_ref
	       --output-fmt-option seqs_per_slice=2000 -o foo.cram foo.bam
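
       How such strings break down into a format name and options can be
       sketched in Python.  The helper names parse_format and parse_option are
       hypothetical, and for simplicity the sketch gives every bare key the
       value 1, whereas the text above states this only for boolean options.

```python
def parse_option(opt):
    """Split 'key' or 'key=value'; a bare key is given the value '1'."""
    key, sep, value = opt.partition("=")
    return key, (value if sep else "1")


def parse_format(spec):
    """Split 'fmt,key=value,key' into (fmt, {key: value})."""
    fmt, *opts = spec.split(",")
    return fmt, dict(parse_option(o) for o in opts)


print(parse_format("cram,version=3.0"))
print(parse_option("embed_ref"))
print(parse_option("seqs_per_slice=2000"))
```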

       The --write-index option	enables	automatic index	creation while writing
       out BAM, CRAM or bgzf SAM files.  Note that to get compressed SAM as
       the output format you need to manually request a compression level,
       otherwise all SAM files are uncompressed.  By default SAM and BAM will use
       CSI  indices  while  CRAM will use CRAI indices.	 If you	need to	create
       BAI indices note	that it	is possible to specify the name	of  the	 index
       being written to, and hence the format, by using	the filename##idx##in-
       dexname notation.

       For example: to convert a BAM to	a compressed SAM with CSI indexing:

	   samtools view -h -O sam,level=6 --write-index in.bam	-o out.sam.gz

       To convert a SAM	to a compressed	BAM using BAI indexing:

	   samtools view --write-index in.sam -o out.bam##idx##out.bam.bai

       The --verbosity INT option sets the verbosity level  for	 samtools  and
       HTSlib.	The default is 3 (HTS_LOG_WARNING); 2 reduces warning messages
       and 0 or	1 also reduces some error messages, while values greater  than
       3  produce  increasing  numbers of additional warnings and logging mes-
       sages.

REFERENCE SEQUENCES
       The CRAM	format requires	use of a reference sequence for	 both  reading
       and writing.

       When  reading  a	 CRAM the @SQ headers are interrogated to identify the
       reference sequence MD5sum (M5: tag) and the  local  reference  sequence
       filename	(UR: tag).  Note that http:// and ftp:// based URLs in the UR:
       field are not used, but local fasta filenames (with or without file://)
       can be used.

       To create a CRAM	the @SQ	headers	will also be read to identify the ref-
       erence sequences, but M5: and UR: tags may not be present. In this case
       the -T and -t options of	samtools view may be used to specify the fasta
       or fasta.fai filenames respectively (provided the  .fasta.fai  file  is
       also backed up by a .fasta file).

       The search order	to obtain a reference is:

       1. Use any local file specified by the command line options (e.g. -T).

       2. Look for MD5 via REF_CACHE environment variable.

       3. Look for MD5 in each element of the REF_PATH environment variable.

       4. Look for a local file	listed in the UR: header tag.

FILTER EXPRESSIONS
       Filter expressions are used for on-the-fly checking of incoming SAM,
       BAM or CRAM records, discarding records that do not match the specified
       expression.

       The  language  used is primarily	C style, but with a few	differences in
       the precedence rules for	bit operators and the inclusion	of regular ex-
       pression	matching.

       The operator precedence,	from strongest binding to weakest, is:

       Grouping	       (, )		E.g. "(1+2)*3"
       Values:	       literals, vars	Numbers, strings and variables
       Unary ops:      +, -, !,	~	E.g. -10 +10, !10 (not), ~5 (bit not)
       Math ops:       *, /, %		Multiply, division and (integer) modulo
       Math ops:       +, -		Addition / subtraction
       Bit-wise:       &		Integer	AND
       Bit-wise:       ^		Integer	XOR
       Bit-wise:       |		Integer	OR
       Conditionals:   >, >=, <, <=
       Equality:       ==, !=, =~, !~	=~ and !~ match	regular	expressions
       Boolean:	       &&, ||		Logical	AND / OR
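
       One practical consequence of this table: the bit-wise operators bind
       more tightly than the comparison and equality operators, whereas in C
       == binds more tightly than &.  The Python sketch below spells out both
       groupings of the text "flag & 4 == 4" with explicit parentheses; the
       flag value 20 is an arbitrary example.

```python
flag = 20  # 0x10 | 0x4, i.e. REVERSE and UNMAP set

# Grouping implied by the precedence table: (flag & 4) == 4
filter_style = int((flag & 4) == 4)

# Grouping a C compiler would use for the same text: flag & (4 == 4)
c_style = flag & int(4 == 4)

print(filter_style, c_style)
```

       The two groupings disagree for this value, so parentheses are worth
       adding whenever an expression is shared between the two languages.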

       Expressions  are	computed using floating	point mathematics, so "10 / 4"
       evaluates to 2.5	rather than 2.	They may be  written  as  integers  in
       decimal	or  "0x"  plus hexadecimal, and	floating point with or without
       exponents.  However, operations that require integers first do an implicit
       type  conversion, so "7.9 % 5" is 2 and "7.9 & 4.1" is equivalent to "7
       & 4", which is 4.  Strings are always specified	using  double  quotes.
       To  get	a double quote in a string, use	backslash.  Similarly a	double
       backslash is used to get	a literal backslash.  For example ab\"c\\d  is
       the string ab"c\d.

       Comparison  operators  are  evaluated as	a match	being 1	and a mismatch
       being 0,	thus "(2 > 1) +	(3 < 5)" evaluates as 2.
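
       The numeric examples above can be reproduced in Python.  The sketch
       assumes the implicit integer conversion truncates towards zero, which
       is consistent with the examples given but is not stated explicitly.

```python
# Floating point division: "10 / 4"
division = 10 / 4

# Integer-only operators convert their operands first.
modulo = int(7.9) % int(5)       # "7.9 % 5"
bit_and = int(7.9) & int(4.1)    # "7.9 & 4.1", i.e. "7 & 4"

# Comparisons yield 1 for a match and 0 for a mismatch.
comparisons = int(2 > 1) + int(3 < 5)

print(division, modulo, bit_and, comparisons)
```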

       Variables are how an expression accesses the file format specifics.
       They correspond to SAM fields; for example, to find paired alignments
       with high mapping quality and a very large insert size, we may use the
       expression "mapq >= 30 && (tlen >= 100000 || tlen <= -100000)".  Valid
       variable names and their data types are:

       flag		    int		   Combined FLAG field
       flag.paired	    int		   Single bit, 0 or 1
       flag.proper_pair	    int		   Single bit, 0 or 2
       flag.unmap	    int		   Single bit, 0 or 4
       flag.munmap	    int		   Single bit, 0 or 8
       flag.reverse	    int		   Single bit, 0 or 16
       flag.mreverse	    int		   Single bit, 0 or 32
       flag.read1	    int		   Single bit, 0 or 64
       flag.read2	    int		   Single bit, 0 or 128
       flag.secondary	    int		   Single bit, 0 or 256
       flag.qcfail	    int		   Single bit, 0 or 512
       flag.dup		    int		   Single bit, 0 or 1024
       flag.supplementary   int		   Single bit, 0 or 2048
       library		    string	   Library (LB header via RG)
       mapq		    int		   Mapping quality
       mpos		    int		   Synonym for pnext
       mrefid		    int		   Mate	reference number (0 based)
       mrname		    string	   Synonym for rnext
       ncigar		    int		   Number of cigar operations
       pnext		    int		   Mate's alignment position (1-based)
       pos		    int		   Alignment position (1-based)
       qlen		    int		   Alignment length: no. query bases
       qname		    string	   Query name
       qual		    string	   Quality values (raw,	0 based)
       refid		    int		   Integer reference number (0 based)
       rlen		    int		   Alignment length: no. reference bases
       rname		    string	   Reference name
       rnext		    string	   Mate's reference name
       seq		    string	   Sequence
       tlen		    int		   Template length (insert size)
       [XX]		    int	/ string   XX tag value

       Flags are returned either as the	whole flag value or by checking	for  a
       single bit.  Hence the filter expression	flag.dup is equivalent to flag
       & 1024.
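
       As an illustration of what these expressions test, here is a Python
       sketch that applies the large-insert example and the flag.dup test to
       records held as plain dictionaries.  The record layout, the function
       names and the sample values are all made up for the example.

```python
def large_insert(rec):
    """mapq >= 30 && (tlen >= 100000 || tlen <= -100000)"""
    return rec["mapq"] >= 30 and (
        rec["tlen"] >= 100000 or rec["tlen"] <= -100000
    )


def flag_dup(rec):
    """flag.dup, i.e. flag & 1024: yields 0 or 1024."""
    return rec["flag"] & 1024


records = [
    {"qname": "r1", "flag": 99, "mapq": 60, "tlen": 250},
    {"qname": "r2", "flag": 1123, "mapq": 42, "tlen": -150000},
    {"qname": "r3", "flag": 83, "mapq": 10, "tlen": 200000},
]

kept = [r["qname"] for r in records if large_insert(r)]
dups = [r["qname"] for r in records if flag_dup(r)]
print(kept, dups)
```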

       "qlen" and "rlen" are measured using the	CIGAR string to	count the num-
       ber  of query (sequence)	and reference bases consumed.  Note "qlen" may
       not exactly match the length of the "seq" field if the sequence is "*".

       Reference names may be matched either by their string forms ("rname"
       and "mrname") or as the Nth @SQ line (counting from zero) as stored in
       BAM using "refid" and "mrefid" respectively.

       Auxiliary tags are described in square brackets and these expand	to ei-
       ther  integer  or  string  as defined by	the tag	itself (XX:Z:string or
       XX:i:int).  For example [NM]>=10	can be used  to	 look  for  alignments
       with  many  mismatches  and [RG]=~"grp[ABC]-" will match	the read-group
       string.

       If no comparison	is used	with an	auxiliary tag it is taken simply to be
       a test for the existence of that tag.  So "[NM]" will return any record
       containing an NM	tag, even if that tag is zero (NM:i:0).

       If you need to check specifically for a non-zero	value then use [NM] &&
       [NM]!=0.
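
       The difference between the two tests can be modelled in a few lines of
       Python.  Here a record's auxiliary tags are represented as a plain
       dictionary, which is an assumption made for illustration only; the
       helper names are hypothetical.

```python
def has_tag(tags: dict, name: str) -> bool:
    """Model of a bare [XX] test: true whenever the tag is present."""
    return name in tags


def has_nonzero_tag(tags: dict, name: str) -> bool:
    """Model of [XX] && [XX]!=0: present and not equal to zero."""
    return name in tags and tags[name] != 0


perfect = {"NM": 0}     # tag present with value zero (NM:i:0)
mismatched = {"NM": 3}  # tag present with a non-zero value
untagged = {}           # no NM tag at all

for tags in (perfect, mismatched, untagged):
    print(has_tag(tags, "NM"), has_nonzero_tag(tags, "NM"))
```

       Only the record with NM:i:0 is treated differently by the two tests:
       it passes the existence check but fails the non-zero check.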

       Some simple functions are available to operate on strings.  These treat
       the strings as arrays of	bytes, permitting their	length,	minimum, maxi-
       mum and average values to be computed.

       length	Length of the string (excluding	nul char)
       min	Minimum	byte value in the string
       max	Maximum	byte value in the string
       avg	Average	byte value in the string

       Note  that  "avg" is a floating point value and it may be NAN for empty
       strings.	 This means that "avg(qual)" does not  produce	an  error  for
       records	that  have both	seq and	qual of	"*".  This value will fail any
       conditional checks, so e.g. "avg(qual) >	20" works and will not	report
       these records.
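
       The NAN behaviour can be reproduced with ordinary IEEE floating point
       arithmetic.  The Python sketch below is a model rather than the
       samtools implementation: it assumes an absent quality string is seen
       as an empty byte array, and the function name avg simply mirrors the
       filter function.

```python
import math


def avg(values: bytes) -> float:
    """Average byte value, or NaN for an empty string."""
    if not values:
        return math.nan
    return sum(values) / len(values)


qual = bytes([30, 40])  # raw (0 based) quality values
missing = b""           # stands in for a record whose qual is "*"

print(avg(qual) > 20)      # True
print(avg(missing) > 20)   # False: comparisons against NaN are false
print(avg(missing) <= 20)  # False as well
```

       Since every ordered comparison against NaN is false, such records fail
       both "avg(qual) > 20" and "avg(qual) <= 20" instead of raising an
       error.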

ENVIRONMENT VARIABLES
       HTS_PATH
	      A colon-separated list of directories in which to search for
	      HTSlib plugins.  If $HTS_PATH starts or ends with a colon or
	      contains a double colon (::), the built-in list of directories
	      is searched at that point in the search.

	      If no HTS_PATH variable is defined, the built-in list of	direc-
	      tories  specified	when HTSlib was	built is used, which typically
	      includes /usr/local/libexec/htslib and similar directories.

       REF_PATH
	      A colon-separated (semicolon on Windows) list of locations in
	      which to look for sequences identified by their MD5sums.  This
	      can be either a list of directories or URLs.  Note that if a
	      URL is included then the colon in http:// and ftp:// and the
	      optional port number will be treated as part of the URL and not
	      a PATH field separator.  For URLs, the text %s will be replaced
	      by the MD5sum being read.

	      If REF_PATH has not been specified it defaults to
	      http://www.ebi.ac.uk/ena/cram/md5/%s and, if REF_CACHE is also
	      unset, REF_CACHE is set to $XDG_CACHE_HOME/hts-ref/%2s/%2s/%s.
	      If $XDG_CACHE_HOME is unset, $HOME/.cache (or a local system
	      temporary directory if no home directory is found) is used in
	      its place.

       REF_CACHE
	      This  can	 be defined to a single	location housing a local cache
	      of references.  Upon downloading a reference it will  be	stored
	      in  the  location	 pointed  to  by REF_CACHE.  REF_CACHE will be
	      searched before attempting to load via the REF_PATH search list.
	      If  no  REF_PATH is defined, both	REF_PATH and REF_CACHE will be
	      automatically set	(see above), but if REF_PATH  is  defined  and
	      REF_CACHE	not then no local cache	is used.

	      To avoid many files being stored in the same directory,
	      REF_CACHE may be defined as a pattern using %nums to consume
	      num characters of the MD5sum and %s to consume all remaining
	      characters.  If REF_CACHE lacks %s then it will get an implicit
	      /%s appended.

	      To   aid	 population   of  the  REF_CACHE  directory  a	script
	      misc/seq_cache_populate.pl is provided in	the Samtools distribu-
	      tion.  This takes	a fasta	file or	a directory of fasta files and
	      generates	the MD5sum named files.

	      For example if you use seq_cache_populate -subdirs 2 -root
	      /local/ref_cache to create 2 nested subdirectories (the
	      default), each consuming 2 characters of the MD5sum, then
	      REF_CACHE must be set to /local/ref_cache/%2s/%2s/%s.
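
	      The pattern expansion described above can be sketched in Python
	      as follows.  This is a simplified model of the documented rules
	      (%nums, %s and the implicit /%s), not the HTSlib code, and it
	      may differ from the library in edge cases; the function name
	      expand_cache_pattern is hypothetical.

```python
import re


def expand_cache_pattern(pattern: str, md5: str) -> str:
    """Expand %<num>s and %s in a REF_CACHE-style pattern using an MD5sum."""
    if "%s" not in pattern:
        pattern += "/%s"  # implicit /%s when the pattern lacks %s
    remaining = md5
    pieces = []
    pos = 0
    for match in re.finditer(r"%(\d*)s", pattern):
        pieces.append(pattern[pos:match.start()])
        if match.group(1):
            n = int(match.group(1))  # %<num>s consumes num characters
            pieces.append(remaining[:n])
            remaining = remaining[n:]
        else:
            pieces.append(remaining)  # %s consumes everything left
            remaining = ""
        pos = match.end()
    pieces.append(pattern[pos:])
    return "".join(pieces)


md5 = "d41d8cd98f00b204e9800998ecf8427e"  # MD5 of empty input, as a stand-in
print(expand_cache_pattern("/local/ref_cache/%2s/%2s/%s", md5))
# /local/ref_cache/d4/1d/8cd98f00b204e9800998ecf8427e
```

	      With two leading %2s fields the first four characters of the
	      MD5sum select the nested subdirectories and the remaining
	      characters form the file name.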

EXAMPLES
       o Import	SAM to BAM when	@SQ lines are present in the header:

	   samtools view -b aln.sam > aln.bam

	 If @SQ	lines are absent:

	   samtools faidx ref.fa
	   samtools view -bt ref.fa.fai	aln.sam	> aln.bam

	 where ref.fa.fai is generated automatically by	the faidx command.

       o Convert a BAM file to a CRAM file using a local reference sequence:

	   samtools view -C -T ref.fa aln.bam >	aln.cram

LIMITATIONS
       o Unaligned words used in bam_endian.h, bam.c and bam_aux.c.

AUTHOR
       Heng  Li	from the Sanger	Institute wrote	the original C version of sam-
       tools.  Bob Handsaker from the Broad Institute implemented the BGZF li-
       brary.	Petr  Danecek  and  Heng  Li wrote the VCF/BCF implementation.
       James Bonfield from the Sanger Institute	developed the CRAM implementa-
       tion.   Other large code	contributions have been	made by	John Marshall,
       Rob Davies, Martin Pollard, Andrew Whitwham, Valeriu  Ohan  (all	 while
       primarily  at  the  Sanger  Institute), with numerous other smaller but
       valuable	contributions.	See the	per-command manual pages  for  further
       authorship.

SEE ALSO
       samtools-addreplacerg(1), samtools-ampliconclip(1),
       samtools-ampliconstats(1), samtools-bedcov(1), samtools-calmd(1),
       samtools-cat(1), samtools-collate(1), samtools-coverage(1),
       samtools-depad(1), samtools-depth(1), samtools-dict(1),
       samtools-faidx(1), samtools-fasta(1), samtools-fastq(1),
       samtools-fixmate(1), samtools-flags(1), samtools-flagstat(1),
       samtools-fqidx(1), samtools-idxstats(1), samtools-index(1),
       samtools-markdup(1), samtools-merge(1), samtools-mpileup(1),
       samtools-phase(1), samtools-quickcheck(1), samtools-reheader(1),
       samtools-rmdup(1), samtools-sort(1), samtools-split(1),
       samtools-stats(1), samtools-targetcut(1), samtools-tview(1),
       samtools-view(1), bcftools(1), sam(5), tabix(1)

       Samtools	website: <http://www.htslib.org/>
       File format specification of SAM/BAM, CRAM, VCF/BCF:
       <http://samtools.github.io/hts-specs>
       Samtools	latest source: <https://github.com/samtools/samtools>
       HTSlib latest source: <https://github.com/samtools/htslib>
       Bcftools	website: <http://samtools.github.io/bcftools>

samtools-1.12			 17 March 2021			   samtools(1)
