samtools-markdup(1)	     Bioinformatics tools	   samtools-markdup(1)

       samtools markdup - mark duplicate alignments in a coordinate sorted file

       samtools markdup [-l length] [-r] [-s] [-T] [-S] [-f file] [-d distance]
       [-c] [-t] [-m] [--mode] [--include-fails] [--no-PG] [-u] [--no-multi-dup]
       in.algsort.bam out.bam

       Mark duplicate alignments from a	coordinate sorted file that  has  been
       run  through  samtools fixmate with the -m option.  This	program	relies
       on the MC and ms	tags that fixmate provides.

       -l INT	  Expected maximum read	length of INT bases.  [300]

       -r	  Remove duplicate reads.

       -s	  Print	some basic stats. See STATISTICS.

       -T PREFIX  Write	temporary files	to PREFIX.samtools.nnnn.mmmm.tmp

       -S	  Mark supplementary reads of duplicates as duplicates.

       -f file	  Write	stats to named file.

       -d distance
		  The optical duplicate	distance.  Suggested settings  of  100
		  for  HiSeq  style  platforms or about	2500 for NovaSeq ones.
		  Default is 0 to not look for optical duplicates.  When  set,
		  duplicate  reads  are	tagged with dt:Z:SQ for	optical	dupli-
		  cates	and dt:Z:LB otherwise.	Calculation  of	 distance  de-
		  pends	on coordinate data embedded in the read	names produced
                  by the Illumina sequencing machines.  Optical duplicate
                  detection will not work on non-standard names.

       -c	  Clear	previous duplicate settings and	tags.

       -t	  Mark duplicates with the name	of the original	in a do	tag.

       -m, --mode TYPE
		  Duplicate decision method for	paired reads.  Values are t or
		  s.  Mode t measures positions	based  on  template  start/end
		  (default).   Mode  s	measures  positions  based on sequence
		  start.  While	the two	methods	identify mostly	the same reads
		  as  duplicates,  mode	 s  tends to return more results.  Un-
		  paired reads are treated identically by both modes.

       -u	  Output uncompressed SAM, BAM or CRAM.

       --include-fails
                  Include quality checked failed reads.

       --no-multi-dup
                  Stop checking duplicates of duplicates for correctness.
                  While reads are still marked as duplicates, the further
                  checks that make sure all optical duplicates are found are
                  not carried out.  This also affects -t tagging, where reads
                  may be tagged with a better quality read but not necessarily
                  the best one.  Using this option can speed up duplicate
                  marking when there are a great many duplicates for each
                  original read.

       --no-PG	  Do not add a PG line to the output file.

       -@, --threads INT
		  Number of input/output compression threads to	use  in	 addi-
		  tion to main thread [0].
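
       The -d option above depends on coordinates embedded in Illumina read
       names.  The following is a minimal sketch of that idea, not the code
       markdup itself runs.  It assumes the common seven-field name layout
       instrument:run:flowcell:lane:tile:x:y, and it assumes that two reads
       count as optically close when they share lane and tile and differ by at
       most the given distance in both x and y.  The function names are
       hypothetical.

```shell
# Print "lane tile x y" for a seven-field, colon-separated read name.
# Returns non-zero when the name does not have exactly seven fields.
illumina_coords() {
    printf '%s\n' "$1" | awk -F: '
        NF == 7 { print $4, $5, $6, $7; found = 1 }
        END     { exit (found ? 0 : 1) }'
}

# Usage: within_optical_distance NAME1 NAME2 DISTANCE
# Returns 0 if close, 1 if not, 2 if either name cannot be parsed.
within_optical_distance() {
    a=$(illumina_coords "$1") || return 2
    b=$(illumina_coords "$2") || return 2
    # Fields: 1-4 = lane tile x y of NAME1, 5-8 = same for NAME2, 9 = distance
    printf '%s %s %s\n' "$a" "$b" "$3" | awk '
        function abs(v) { return v < 0 ? -v : v }
        {
            close_enough = ($1 == $5 && $2 == $6 &&
                            abs($3 - $7) <= $9 && abs($4 - $8) <= $9)
            exit (close_enough ? 0 : 1)
        }'
}
```

       Names that do not follow this layout make illumina_coords fail, which
       mirrors the note that optical duplicate detection will not work on
       non-standard names.  With NovaSeq style data the option itself would be
       given as:

       samtools markdup -d 2500 positionsort.bam markdup.bam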

       The statistics printed by -s, or written to a file by -f, have the
       following entries:
       COMMAND:	the command line.
       READ: number of reads read in.
       WRITTEN:	reads written out.
       EXCLUDED: reads ignored.	 See below.
       EXAMINED: reads examined	for duplication.
       PAIRED: reads that are part of a	pair.
       SINGLE: reads that are not part of a pair.
       DUPLICATE PAIR: reads in	a duplicate pair.
       DUPLICATE SINGLE: single	read duplicates.
       DUPLICATE PAIR OPTICAL: optical duplicate paired	reads.
       DUPLICATE SINGLE	OPTICAL: optical duplicate single reads.
       DUPLICATE NON PRIMARY: supplementary/secondary duplicate	reads.
       DUPLICATE NON PRIMARY OPTICAL: supplementary/secondary optical duplicate
       reads.
       DUPLICATE PRIMARY TOTAL:	number of primary duplicate reads.
       DUPLICATE TOTAL:	total number of	duplicate reads.
       ESTIMATED LIBRARY SIZE: estimate	of the number of unique	 fragments  in
       the sequencing library.
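
       The entries above lend themselves to simple post-processing.  As a
       sketch, the function below reads a stats file written with -f and
       reports the fraction of examined reads that are primary duplicates.  It
       assumes each entry appears on its own line as "NAME: value", and the
       function name is hypothetical.

```shell
# Usage: duplicate_fraction STATS_FILE
# Prints DUPLICATE PRIMARY TOTAL / EXAMINED to four decimal places,
# or "NA" when EXAMINED is missing or zero.
duplicate_fraction() {
    awk -F': ' '
        $1 == "EXAMINED"                { examined = $2 }
        $1 == "DUPLICATE PRIMARY TOTAL" { dups = $2 }
        END {
            if (examined > 0) printf "%.4f\n", dups / examined
            else print "NA"
        }' "$1"
}
```

       A stats file for this could be produced with, for example:

       samtools markdup -f markdup.stats positionsort.bam markdup.bam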

       The estimated library size rests on several assumptions, e.g. that the
       library consists of unique fragments that are randomly selected (with
       replacement) with equal probability.  This is unlikely to be true in
       practice.  However, it can provide a useful guide to how many unique
       read pairs are likely to be available.  In particular, it can be used
       to determine how much more data might be obtained by further sequencing
       of the library.
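
       To make the assumptions concrete: if N read pairs are drawn at random
       with replacement from X equally likely fragments, the expected number
       of distinct fragments seen is C = X * (1 - exp(-N/X)).  The sketch
       below solves that equation for X by bisection.  This is the estimator
       used by Picard's MarkDuplicates; that markdup performs exactly the same
       calculation is an assumption here, and the function name is
       hypothetical.

```shell
# Usage: estimate_library_size PAIRS_EXAMINED DUPLICATE_PAIRS
# Prints the estimated number of unique fragments, or "NA" when the
# inputs give no usable estimate (for example, no duplicates at all).
estimate_library_size() {
    awk -v total="$1" -v dups="$2" '
    # Zero of g in x is the library size for uniq distinct pairs out of all.
    function g(x, uniq, all) { return uniq / x - 1 + exp(-all / x) }
    BEGIN {
        uniq = total - dups
        if (total <= 0 || dups <= 0 || uniq <= 0) { print "NA"; exit }
        lo = 1; hi = 100                      # bounds as multiples of uniq
        while (g(hi * uniq, uniq, total) > 0) hi *= 10
        for (i = 0; i < 60; i++) {            # bisection on the multiplier
            mid = (lo + hi) / 2
            if (g(mid * uniq, uniq, total) > 0) lo = mid
            else hi = mid
        }
        printf "%.0f\n", uniq * (lo + hi) / 2
    }'
}
```

       The estimate is always larger than the number of distinct pairs
       observed, and it grows quickly as the duplicate count approaches zero,
       which is why the sketch reports NA when there are no duplicates.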

       Excluded reads are those marked as secondary, supplementary or
       unmapped.  By default QC failed reads are also excluded, but they can
       be included with --include-fails.  Excluded reads are not used for
       calculating duplicates.  They can optionally be marked as duplicates if
       they have a primary that is also a duplicate.
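
       The exclusion rule can be expressed in terms of SAM FLAG bits.  The
       sketch below uses the standard bit values (0x4 unmapped, 0x100
       secondary, 0x200 QC fail, 0x800 supplementary).  The function name and
       its second argument are hypothetical and only illustrate the rule
       described above.

```shell
# Usage: is_excluded FLAG [INCLUDE_FAILS]
# Returns 0 if a read with this FLAG would be excluded from duplicate
# calculation.  INCLUDE_FAILS=1 mimics --include-fails (default 0).
is_excluded() {
    flag=$1
    include_fails=${2:-0}
    mask=$(( 0x4 | 0x100 | 0x800 ))        # unmapped, secondary, supplementary
    if [ "$include_fails" -eq 0 ]; then
        mask=$(( mask | 0x200 ))           # QC fail is excluded by default
    fi
    [ $(( flag & mask )) -ne 0 ]
}
```

       For example, FLAG 99 (a mapped, primary, first-in-pair read) is not
       excluded, while FLAG 2147 (which has the supplementary bit set) is.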

       A typical workflow has four steps.  The first, collate, can be omitted
       if the file is already name ordered or collated:

       samtools	collate	-o namecollate.bam example.bam

       Add ms and MC tags for markdup to use later:

       samtools	fixmate	-m namecollate.bam fixmate.bam

       Markdup needs position order:

       samtools	sort -o	positionsort.bam fixmate.bam

       Finally mark duplicates:

       samtools	markdup	positionsort.bam markdup.bam

       Typically  the fixmate step would be applied immediately	after sequence
       alignment and the markdup step after sorting by	chromosome  and	 posi-
       tion.  Thus no additional sort steps are	normally needed.
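
       The four steps can also be chained with pipes so that no intermediate
       files are written.  This is a sketch rather than a recommended recipe.
       It assumes a samtools version in which collate accepts -O (write to
       standard output) and in which "-" names standard input or output.  The
       function name is hypothetical.

```shell
# Usage: markdup_pipeline IN.bam OUT.bam
# collate -> fixmate -m -> sort -> markdup, streamed through pipes.
markdup_pipeline() {
    samtools collate -O "$1" \
        | samtools fixmate -m - - \
        | samtools sort - \
        | samtools markdup - "$2"
}
```

       Called as "markdup_pipeline example.bam markdup.bam", this should give
       the same result as the separate commands above, at the cost of making
       a failure in an individual step harder to spot.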

       Written by Andrew Whitwham from the Sanger Institute.

       samtools(1), samtools-sort(1), samtools-collate(1), samtools-fixmate(1)

       Samtools	website: <>

samtools-1.13			  7 July 2021		   samtools-markdup(1)

