bwa(1)			     Bioinformatics tools			bwa(1)

       bwa - Burrows-Wheeler Alignment Tool

       bwa index ref.fa

       bwa mem ref.fa reads.fq > aln-se.sam

       bwa mem ref.fa read1.fq read2.fq	> aln-pe.sam

       bwa aln ref.fa short_read.fq > aln_sa.sai

       bwa samse ref.fa	aln_sa.sai short_read.fq > aln-se.sam

       bwa sampe ref.fa	aln_sa1.sai aln_sa2.sai	read1.fq read2.fq > aln-pe.sam

       bwa bwasw ref.fa	long_read.fq > aln.sam

       BWA is a software package for mapping low-divergent sequences against a
       large reference genome, such as the human genome. It consists of three
       algorithms: BWA-backtrack, BWA-SW and BWA-MEM. The first algorithm is
       designed for Illumina sequence reads up to 100bp, while the other two
       are for longer sequences ranging from 70bp to 1Mbp. BWA-MEM and BWA-SW
       share similar features such as long-read support and split alignment,
       but BWA-MEM, which is the latest, is generally recommended for
       high-quality queries as it is faster and more accurate. BWA-MEM also
       has better performance than BWA-backtrack for 70-100bp Illumina reads.

       For  all	 the algorithms, BWA first needs to construct the FM-index for
       the reference genome (the index command). Alignment algorithms are  in-
       voked  with  different sub-commands: aln/samse/sampe for	BWA-backtrack,
       bwasw for BWA-SW	and mem	for the	BWA-MEM	algorithm.

       index  bwa index	[-p prefix] [-a	algoType] db.fa

	      Index database sequences in the FASTA format.


	      -p STR	Prefix of the output database [same as db filename]

	      -a STR	Algorithm for constructing BWT index.  BWA  implements
			three  algorithms  for BWT construction: is, bwtsw and
                        rb2. The first algorithm is a little faster for small
                        databases but requires large RAM and does not work for
                        databases with a total length longer than 2GB. The
                        second algorithm is adapted from the BWT-SW source
                        code. In theory it works with databases of trillions
                        of bases. When this option is not specified, the
                        appropriate algorithm will be chosen automatically.

       mem    bwa mem [-aCHjMpP] [-t nThreads] [-k minSeedLen] [-w  bandWidth]
	      [-d  zDropoff]  [-r seedSplitRatio] [-c maxOcc] [-D chainShadow]
	      [-m maxMateSW] [-W minSeedMatch] [-A matchScore] [-B  mmPenalty]
	      [-O  gapOpenPen]	[-E gapExtPen] [-L clipPen] [-U	unpairPen] [-x
	      readType]	[-R RGline] [-H	HDlines] [-v  verboseLevel]  db.prefix
	      reads.fq [mates.fq]

	      Align  70bp-1Mbp	query  sequences  with	the BWA-MEM algorithm.
	      Briefly, the algorithm works by seeding alignments with  maximal
	      exact  matches  (MEMs) and then extending	seeds with the affine-
	      gap Smith-Waterman algorithm (SW).

              If the mates.fq file is absent and option -p is not set, this
              command treats the input reads as single-end. If mates.fq is
              present, this command assumes the i-th read in reads.fq and the
              i-th read in mates.fq constitute a read pair. If -p is used, the
              command assumes the 2i-th and the (2i+1)-th read in reads.fq
              constitute a read pair (such an input file is said to be
              interleaved). In this case, mates.fq is ignored. In the
              paired-end mode, the mem command will infer the read orientation
              and the insert size distribution from a batch of reads.
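
              As an illustration of the pairing rules above, the following
              Python sketch groups read names the way the paragraph describes.
              The function names are hypothetical and not part of BWA, and
              plain name strings stand in for FASTQ records.

```python
def pair_two_files(reads, mates):
    """Pair the i-th record of reads.fq with the i-th record of mates.fq."""
    if len(reads) != len(mates):
        raise ValueError("reads.fq and mates.fq must hold the same number of records")
    return list(zip(reads, mates))


def pair_interleaved(reads):
    """Pair record 2i with record 2i+1 of one interleaved file (0-based i)."""
    if len(reads) % 2 != 0:
        raise ValueError("an interleaved file must hold an even number of records")
    return [(reads[i], reads[i + 1]) for i in range(0, len(reads), 2)]


# Both layouts describe the same two read pairs.
print(pair_two_files(["a/1", "b/1"], ["a/2", "b/2"]))
print(pair_interleaved(["a/1", "a/2", "b/1", "b/2"]))
```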

              The BWA-MEM algorithm performs local alignment. It may produce
              multiple primary alignments for different parts of a query
              sequence. This is a crucial feature for long sequences. However,
              some tools such as Picard's markDuplicates do not work with
              split alignments. One may consider using option -M to flag
              shorter split hits as secondary.


	      -t INT	Number of threads [1]

              -k INT    Minimum seed length. Matches shorter than INT will be
                        missed. The alignment speed is usually insensitive to
                        this value unless it significantly deviates from 20.
                        [19]

	      -w INT	Band width. Essentially, gaps longer than INT will not
			be found. Note that the	maximum	gap length is also af-
			fected by the scoring matrix and the hit  length,  not
			solely determined by this option. [100]

	      -d INT	Off-diagonal  X-dropoff	 (Z-dropoff).  Stop  extension
			when the difference between the	best and  the  current
			extension  score  is  above |i-j|*A+INT, where i and j
			are the	current	positions of the query and  reference,
			respectively,  and  A is the matching score. Z-dropoff
			is similar to BLAST's X-dropoff	except that it doesn't
			penalize  gaps	in  one	of the sequences in the	align-
			ment. Z-dropoff	not only avoids	unnecessary extension,
			but  also  reduces  poor alignments inside a long good
			alignment. [100]

	      -r FLOAT	Trigger	 re-seeding  for  a  MEM  longer   than	  min-
			SeedLen*FLOAT.	 This is a key heuristic parameter for
			tuning the  performance.  Larger  value	 yields	 fewer
			seeds, which leads to faster alignment speed but lower
			accuracy. [1.5]

              -c INT    Discard a MEM if it has more than INT occurrences in
                        the genome. This is an insensitive parameter. [500]

	      -D FLOAT	Drop chains shorter than FLOAT fraction	of the longest
			overlapping chain [0.5]

	      -m INT	Perform	at most	INT rounds of mate-SW [50]

	      -W INT	Drop a chain if	 the  number  of  bases	 in  seeds  is
			smaller	 than  INT.  This option is primarily used for
			longer contigs/reads. When positive, it	 also  affects
			seed filtering.	[0]

	      -P	In  the	 paired-end mode, perform SW to	rescue missing
			hits only but do not try  to  find  hits  that	fit  a
			proper pair.


	      -A INT	Matching score.	[1]

	      -B INT	Mismatch  penalty. The sequence	error rate is approxi-
			mately:	{.75 * exp[-log(4) * B/A]}. [4]

	      -O INT[,INT]
                        Gap open penalty. If two numbers are specified, the
                        first is the penalty of opening a deletion and the
                        second for opening an insertion. [6]

	      -E INT[,INT]
			Gap extension penalty. If two numbers  are  specified,
			the  first  is the penalty of extending	a deletion and
			second for extending an	insertion. A gap of  length  k
			costs  O  + k*E	(i.e.  -O is for opening a zero-length
			gap). [1]

	      -L INT[,INT]
			Clipping penalty. When performing SW  extension,  BWA-
			MEM  keeps track of the	best score reaching the	end of
			query. If this score is	larger than the	best SW	 score
			minus  the  clipping penalty, clipping will not	be ap-
			plied. Note that in this case, the SAM AS tag  reports
                        the best SW score; clipping penalty is not deducted. If
			two numbers are	provided,  the	first  is  for	5'-end
			clipping and second for	3'-end clipping. [5]

              -U INT    Penalty for an unpaired read pair. BWA-MEM scores an
                        unpaired read pair as scoreRead1+scoreRead2-INT and
                        scores a paired one as
                        scoreRead1+scoreRead2-insertPenalty. It compares these
                        two scores to determine whether to force pairing. A
                        larger value leads to more aggressive read pairing.
                        [17]

              -x STR    Read type. Changes multiple parameters unless
                        overridden [null]

			pacbio:	  -k17	-W40  -r10 -A1 -B1 -O1 -E1 -L0 (PacBio
				  reads	to ref)

			ont2d:	  -k14 -W20 -r10 -A1 -B1 -O1 -E1  -L0  (Oxford
				  Nanopore 2D-reads to ref)

			intractg: -B9 -O16 -L5 (intra-species contigs to ref)


	      -p	Smart  pairing.	 If  two  adjacent reads have the same
			name, they are considered to form a  read  pair.  This
			way, paired-end	and single-end reads can be mixed in a
			single FASTA/Q stream.

	      -R STR	Complete read group header line. '\t' can be  used  in
			STR  and will be converted to a	TAB in the output SAM.
			The read group ID will be attached to  every  read  in
			the   output.  An  example  is	'@RG\tID:foo\tSM:bar'.

	      -H ARG	If ARG starts with @, it is interpreted	 as  a	string
			and  gets  inserted into the output SAM	header;	other-
			wise, ARG is interpreted as  a	file  with  all	 lines
			starting  with	@  in  the  file inserted into the SAM
			header.	[null]

              -o FILE   Write the output SAM file to FILE. For compatibility
                        with other BWA commands, this option may also be given
                        as -f FILE. [standard output]

              -q        Don't reduce the mapping quality of split alignment
                        of lower alignment score.

	      -5	For  split alignment, mark the segment with the	small-
			est coordinate as the primary.	It  automatically  ap-
			plies option -q	as well. This option may help some Hi-
			C pipelines. By	default, BWA-MEM marks highest scoring
			segment	as primary.

	      -K  INT	Process	 INT  input  bases in each batch regardless of
			the number of threads in use [10000000*nThreads].   By
			default,  the batch size is proportional to the	number
			of threads in use.  Because the	inferred  insert  size
			distribution slightly depends on the batch size, using
			different number of threads may	produce	different out-
			put.  Specifying this option helps reproducibility.

	      -T INT	Don't  output  alignment  with	score  lower than INT.
			This option affects output and occasionally  SAM  flag
			2. [30]

	      -j	Treat  ALT  contigs  as	 part  of the primary assembly
			(i.e. ignore the db.prefix.alt file).

	      -h INT[,INT2]
			If a query has not  more  than	INT  hits  with	 score
			higher	than  80%  of the best hit, output them	all in
			the XA tag.  If	INT2 is	specified, BWA-MEM outputs  up
			to INT2	hits if	the list contains a hit	to an ALT con-
			tig. [5,200]

	      -a	Output all found alignments for	single-end or unpaired
			paired-end  reads. These alignments will be flagged as
			secondary alignments.

              -C        Append FASTA/Q comment to SAM output. This option can
                        be used to transfer read meta information (e.g.
                        barcode) to the SAM output. Note that the FASTA/Q
                        comment (the string after a space in the header line)
                        must conform to the SAM spec (e.g. BC:Z:CGTAC).
                        Malformed comments lead to incorrect SAM output.

	      -Y	Use  soft  clipping  CIGAR operation for supplementary
			alignments. By default,	BWA-MEM	uses soft clipping for
			the primary alignment and hard clipping	for supplemen-
			tary alignments.

              -M        Mark shorter split hits as secondary (for Picard
                        compatibility).

	      -v INT	Control	the verbosity level of the output. This	option
			has not	been fully supported throughout	BWA.  Ideally,
			a  value  0  for disabling all the output to stderr; 1
			for outputting errors only; 2 for warnings and errors;
			3  for all normal messages; 4 or higher	for debugging.
			When this option takes value 4,	the output is not SAM.

	      -I FLOAT[,FLOAT[,INT[,INT]]]
			Specify	 the mean, standard deviation (10% of the mean
			if absent), max	(4 sigma from the mean if absent)  and
			min  (4	 sigma if absent) of the insert	size distribu-
			tion. Only applicable to the FR	 orientation.  By  de-
			fault,	BWA-MEM	infers these numbers and the pair ori-
			entations given	enough reads. [inferred]
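
              The scoring formulas quoted under -B, -E and -U above can be
              checked numerically. The Python sketch below is only an
              illustration: the function names are hypothetical, the defaults
              are the ones listed in this manual, and treating a tie between
              the paired and unpaired scores as "do not pair" is an
              assumption, since the text does not say how ties are handled.

```python
import math


def approx_error_rate(match_score=1, mismatch_penalty=4):
    """Sequence error rate implied by -A and -B: .75 * exp[-log(4) * B/A]."""
    return 0.75 * math.exp(-math.log(4) * mismatch_penalty / match_score)


def gap_cost(length, gap_open=6, gap_extend=1):
    """Cost of a gap of length k under -O and -E: O + k*E."""
    return gap_open + length * gap_extend


def prefer_pairing(score1, score2, insert_penalty, unpaired_penalty=17):
    """Compare the paired and unpaired scores described under -U."""
    paired = score1 + score2 - insert_penalty
    unpaired = score1 + score2 - unpaired_penalty
    return paired > unpaired  # tie handling is an assumption


print(approx_error_rate())         # 0.75 / 4**4 with the default A=1, B=4
print(gap_cost(3))                 # 6 + 3*1
print(prefer_pairing(60, 60, 10))  # insert penalty below -U: pairing preferred
```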

       aln    bwa aln [-n maxDiff] [-o maxGapO]	[-e maxGapE] [-d nDelTail] [-i
	      nIndelEnd]  [-k maxSeedDiff] [-l seedLen]	[-t nThrds] [-cRN] [-M
	      misMsc] [-O gapOsc]  [-E	gapEsc]	 [-q  trimQual]	 <in.db.fasta>
	      <in.query.fq> > <out.sai>

	      Find  the	SA coordinates of the input reads. Maximum maxSeedDiff
	      differences are allowed in the  first  seedLen  subsequence  and
	      maximum maxDiff differences are allowed in the whole sequence.


	      -n NUM	Maximum	 edit  distance	 if  the  value	is INT,	or the
			fraction of missing alignments given 2%	 uniform  base
			error  rate  if	FLOAT. In the latter case, the maximum
			edit distance is automatically	chosen	for  different
			read lengths. [0.04]

	      -o INT	Maximum	number of gap opens [1]

	      -e INT	Maximum	 number	of gap extensions, -1 for k-difference
			mode (disallowing long gaps) [-1]

	      -d INT	Disallow a long	deletion within	 INT  bp  towards  the
			3'-end [16]

	      -i INT	Disallow an indel within INT bp	towards	the ends [5]

              -l INT    Take the first INT subsequence as seed. If INT is
                        larger than the query sequence, seeding will be
                        disabled. For long reads, this option typically ranges
                        from 25 to 35 for `-k 2'. [inf]

	      -k INT	Maximum	edit distance in the seed [2]

	      -t INT	Number of threads (multi-threading mode) [1]

	      -M INT	Mismatch penalty. BWA will not search  for  suboptimal
			hits with a score lower	than (bestScore-misMsc). [3]

	      -O INT	Gap open penalty [11]

	      -E INT	Gap extension penalty [4]

	      -R INT	Proceed	 with  suboptimal  alignments  if there	are no
			more than INT equally best hits. This option only  af-
			fects  paired-end  mapping.  Increasing	this threshold
			helps to improve the pairing accuracy at the  cost  of
			speed, especially for short reads (~32bp).

	      -c	Reverse	query but not complement it, which is required
			for alignment in  the  color  space.  (Disabled	 since

	      -N	Disable	 iterative  search. All	hits with no more than
			maxDiff	differences will be found. This	mode  is  much
			slower than the	default.

	      -q INT	Parameter  for read trimming. BWA trims	a read down to
			argmax_x{\sum_{i=x+1}^l(INT-q_i)} if q_l<INT  where  l
			is the original	read length. [0]

	      -I	The input is in	the Illumina 1.3+ read format (quality
			equals ASCII-64).

              -B INT    Length of barcode starting from the 5'-end. When INT
                        is positive, the barcode of each read will be trimmed
                        before mapping and will be written to the BC SAM tag.
                        For paired-end reads, the barcodes from both ends are
                        concatenated. [0]

              -b        Specify that the input read sequence file is in the
                        BAM format. For paired-end data, two ends in a pair
                        must be grouped together and options -1 or -2 are
                        usually applied to specify which end should be mapped.
                        Typical command lines for mapping paired-end data in
                        the BAM format are:

			    bwa	aln ref.fa -b1 reads.bam > 1.sai
			    bwa	aln ref.fa -b2 reads.bam > 2.sai
			    bwa	sampe ref.fa 1.sai 2.sai reads.bam reads.bam >

              -0        When -b is specified, only use single-end reads in
                        mapping.

	      -1	When  -b  is  specified,  only use the first read in a
			read pair in mapping (skip single-end  reads  and  the
			second reads).

	      -2	When  -b  is  specified, only use the second read in a
			read pair in mapping.
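
              The trimming rule given under -q can be read as follows: if the
              last base quality is below INT, keep the prefix length x that
              maximises the sum of (INT - q_i) over the removed bases. The
              Python sketch below implements the formula exactly as written;
              the function name is hypothetical, and any extra safeguards in
              the real implementation (for example a minimum read length) are
              not modelled.

```python
def trimmed_length(quals, threshold):
    """Return the read length after trimming per the -q formula.

    quals holds Phred base qualities q_1..q_l; threshold is the -q INT value.
    """
    l = len(quals)
    # No trimming when disabled, or when the last base is already good enough.
    if threshold <= 0 or l == 0 or quals[-1] >= threshold:
        return l
    best_x, best_sum, running = l, 0, 0
    # Walk x from l-1 down to 0; running equals sum_{i=x+1}^{l} (INT - q_i).
    for x in range(l - 1, -1, -1):
        running += threshold - quals[x]  # quals[x] is q_{x+1} in 1-based terms
        if running > best_sum:
            best_sum, best_x = running, x
    return best_x


# The two low-quality bases at the 3'-end are removed.
print(trimmed_length([30, 30, 30, 10, 5], 20))
```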

       samse  bwa samse	[-n maxOcc] <in.db.fasta> <in.sai> <in.fq> > <out.sam>

	      Generate alignments in the SAM format  given  single-end	reads.
	      Repetitive hits will be randomly chosen.


	      -n INT	Maximum	 number	 of alignments to output in the	XA tag
			for reads paired properly. If a	read has more than INT
			hits, the XA tag will not be written. [3]

	      -r STR	Specify	   the	 read	group	in   a	 format	  like
			`@RG\tID:foo\tSM:bar'. [null]

       sampe  bwa sampe	[-a maxInsSize]	[-o maxOcc] [-n	maxHitPaired] [-N max-
	      HitDis] [-P] <in.db.fasta> <in1.sai> <in2.sai> <in1.fq> <in2.fq>
	      >	<out.sam>

	      Generate alignments in the SAM format  given  paired-end	reads.
	      Repetitive read pairs will be placed randomly.


              -a INT  Maximum insert size for a read pair to be considered
                      being mapped properly. Since 0.4.5, this option is only
                      used when there are not enough good alignments to infer
                      the distribution of insert sizes. [500]

              -o INT  Maximum occurrences of a read for pairing. A read with
                      more occurrences will be treated as a single-end read.
                      Reducing this parameter speeds up pairing. [100000]

	      -P      Load the entire FM-index into memory to reduce disk  op-
		      erations	(base-space  reads only). With this option, at
		      least 1.25N bytes	of memory are required,	where N	is the
		      length of	the genome.

	      -n INT  Maximum number of	alignments to output in	the XA tag for
		      reads paired properly. If	a read has more	than INT hits,
		      the XA tag will not be written. [3]

              -N INT  Maximum number of alignments to output in the XA tag for
                      discordant read pairs (excluding singletons). If a read
                      has more than INT hits, the XA tag will not be written.
                      [10]

	      -r STR  Specify	the   read   group   in	   a	format	  like
		      `@RG\tID:foo\tSM:bar'. [null]

       bwasw  bwa  bwasw  [-a  matchScore]  [-b	 mmPen]	 [-q  gapOpenPen]  [-r
	      gapExtPen] [-t nThreads] [-w bandWidth] [-T thres] [-s  hspIntv]
              [-z zBest] [-N nHspRev] [-c thresCoef] <in.db.fasta> <in.fq>
              [mate.fq]

              Align query sequences in the in.fq file. When mate.fq is
              present, perform paired-end alignment. The paired-end mode only
              works for reads from Illumina short-insert libraries. In the
              paired-end mode, BWA-SW may still output split alignments but
              they are all marked as not properly paired; the mate positions
              will not be written if the mate has multiple local hits.


	      -a INT	Score of a match [1]

	      -b INT	Mismatch penalty [3]

	      -q INT	Gap open penalty [5]

	      -r INT	Gap  extension	penalty.  The penalty for a contiguous
			gap of size k is q+k*r.	[2]

	      -t INT	Number of threads in the multi-threading mode [1]

	      -w INT	Band width in the banded alignment [33]

	      -T INT	Minimum	score threshold	divided	by a [37]

	      -c FLOAT	Coefficient  for  threshold  adjustment	 according  to
			query length. Given an l-long query, the threshold for
			a hit to be retained is	a*max{T,c*log(l)}. [5.5]

	      -z INT	Z-best heuristics. Higher -z increases accuracy	at the
			cost of	speed. [1]

	      -s INT	Maximum	SA interval size for initiating	a seed.	Higher
			-s increases accuracy at the cost of speed. [3]

	      -N INT	Minimum	 number	 of  seeds  supporting	the  resultant
			alignment to skip reverse alignment. [5]
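
              The two formulas quoted under -r and -c above can be evaluated
              directly. This Python sketch uses the defaults listed in this
              manual; the function names are hypothetical, and reading log()
              as the natural logarithm is an assumption.

```python
import math


def hit_threshold(query_len, a=1, T=37, c=5.5):
    """Score a hit must reach to be retained: a*max{T, c*log(l)}."""
    return a * max(T, c * math.log(query_len))  # natural log assumed


def gap_penalty(k, q=5, r=2):
    """Penalty for a contiguous gap of size k: q + k*r."""
    return q + k * r


print(hit_threshold(100))   # the fixed threshold T dominates for short queries
print(hit_threshold(1000))  # the length-dependent term takes over
print(gap_penalty(3))       # 5 + 3*2
```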

       The  output  of	the  `aln'  command is binary and designed for BWA use
       only. BWA outputs the final  alignment  in  the	SAM  (Sequence	Align-
       ment/Map) format. Each line consists of:

       |Col | Field |			    Description			       |
       | 1  | QNAME | Query (pair) NAME					       |
       | 2  | FLAG  | bitwise FLAG					       |
       | 3  | RNAME | Reference	sequence NAME				       |
       | 4  | POS   | 1-based leftmost POSition/coordinate of clipped sequence |
       | 5  | MAPQ  | MAPping Quality (Phred-scaled)			       |
       | 6  | CIGAR | extended CIGAR string                                    |
       | 7  | MRNM  | Mate Reference sequence NaMe (`='	if same	as RNAME)      |
       | 8  | MPOS  | 1-based Mate POSition                                    |
       | 9  | ISIZE | Inferred insert SIZE				       |
       |10  | SEQ   | query SEQuence on	the same strand	as the reference       |
       |11  | QUAL  | query QUALity (ASCII-33 gives the	Phred base quality)    |
       |12  | OPT   | variable OPTional	fields in the format TAG:VTYPE:VALUE   |

       Each bit	in the FLAG field is defined as:

	       |Chr |  Flag  |		    Description		     |
	       | p  | 0x0001 | the read	is paired in sequencing	     |
	       | P  | 0x0002 | the read	is mapped in a proper pair   |
	       | u  | 0x0004 | the query sequence itself is unmapped |
	       | U  | 0x0008 | the mate	is unmapped		     |
	       | r  | 0x0010 | strand of the query (1 for reverse)   |
	       | R  | 0x0020 | strand of the mate		     |
	       | 1  | 0x0040 | the read	is the first read in a pair  |
	       | 2  | 0x0080 | the read	is the second read in a	pair |
	       | s  | 0x0100 | the alignment is	not primary	     |
	       | f  | 0x0200 | QC failure			     |
	       | d  | 0x0400 | optical or PCR duplicate		     |
	       | S  | 0x0800 | supplementary alignment		     |
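
       A FLAG value is the bitwise OR of the bits in the table above. The
       Python sketch below decodes a value into the one-character codes from
       the Chr column; the helper name is hypothetical and the table is simply
       transcribed from this manual.

```python
FLAG_BITS = [
    (0x0001, "p", "the read is paired in sequencing"),
    (0x0002, "P", "the read is mapped in a proper pair"),
    (0x0004, "u", "the query sequence itself is unmapped"),
    (0x0008, "U", "the mate is unmapped"),
    (0x0010, "r", "strand of the query (1 for reverse)"),
    (0x0020, "R", "strand of the mate"),
    (0x0040, "1", "the read is the first read in a pair"),
    (0x0080, "2", "the read is the second read in a pair"),
    (0x0100, "s", "the alignment is not primary"),
    (0x0200, "f", "QC failure"),
    (0x0400, "d", "optical or PCR duplicate"),
    (0x0800, "S", "supplementary alignment"),
]


def decode_flag(flag):
    """Return (char, description) for every bit set in a SAM FLAG value."""
    return [(char, text) for bit, char, text in FLAG_BITS if flag & bit]


# 99 = 0x1 + 0x2 + 0x20 + 0x40
for char, text in decode_flag(99):
    print(char, text)
```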

       Please check the Samtools website for the format specification and the
       tools for post-processing the alignment.

       BWA generates the following optional fields. Tags starting with `X' are
       specific	to BWA.

	      |Tag |			 Meaning		      |
	      |NM  | Edit distance				      |
	      |MD  | Mismatching positions/bases		      |
	      |AS  | Alignment score				      |
	      |BC  | Barcode sequence				      |
	      |SA  | Supplementary alignments			      |
	      |X0  | Number of best hits			      |
	      |X1  | Number of suboptimal hits found by	BWA	      |
              |XN  | Number of ambiguous bases in the reference      |
	      |XM  | Number of mismatches in the alignment	      |
	      |XO  | Number of gap opens			      |
              |XG  | Number of gap extensions                         |
	      |XT  | Type: Unique/Repeat/N/Mate-sw		      |
	      |XA  | Alternative hits; format: /(chr,pos,CIGAR,NM;)*/ |
	      |XS  | Suboptimal	alignment score			      |
	      |XF  | Support from forward/reverse alignment	      |
	      |XE  | Number of supporting seeds			      |

       Note  that XO and XG are	generated by BWT search	while the CIGAR	string
       by Smith-Waterman alignment. These two tags may	be  inconsistent  with
       the CIGAR string. This is not a bug.
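
       The XA value is a semicolon-terminated list of comma-separated hits, as
       the format string in the table indicates. A minimal Python sketch for
       splitting it is shown below. The function name is hypothetical, and the
       example value is made up for illustration; the sketch treats a leading
       '+' or '-' on the position as the strand of the hit.

```python
def parse_xa(value):
    """Split an XA value into (chr, strand, pos, CIGAR, NM) tuples."""
    hits = []
    for entry in value.split(";"):
        if not entry:  # the list ends with a trailing ';'
            continue
        chrom, pos, cigar, nm = entry.split(",")
        strand = "-" if pos.startswith("-") else "+"
        hits.append((chrom, strand, abs(int(pos)), cigar, int(nm)))
    return hits


# Illustrative value only, not taken from real output.
for hit in parse_xa("chr1,+100,50M,1;chr2,-2000,48M2S,2;"):
    print(hit)
```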

   Alignment Accuracy
       When  seeding is	disabled, BWA guarantees to find an alignment contain-
       ing maximum maxDiff differences including maxGapO gap  opens  which  do
       not  occur  within nIndelEnd bp towards either end of the query.	Longer
       gaps may	be found if maxGapE is positive, but it	is not	guaranteed  to
       find  all  hits.	When seeding is	enabled, BWA further requires that the
       first seedLen subsequence contains no more than maxSeedDiff differences.

       When gapped alignment is	disabled, BWA is expected to generate the same
       alignment as Eland version 1, the Illumina alignment program.  However,
       as BWA changes `N' in the database sequence to random nucleotides, hits
       to these random sequences will also be counted. As a consequence, BWA
       may mark a unique hit as a repeat, if the random sequences happen to be
       identical to the sequences which should be unique in the database.

       By default, if the best hit is not highly repetitive (controlled by
       -R), BWA also finds all hits containing one more mismatch; otherwise,
       BWA finds all equally best hits only. Base quality is NOT considered in
       evaluating hits. In the paired-end mode, BWA pairs all hits it found.
       It further performs Smith-Waterman alignment for unmapped reads to
       rescue reads with a high error rate, and for high-quality anomalous
       pairs to fix potential alignment errors.

   Estimating Insert Size Distribution
       BWA estimates the insert	size distribution per 256*1024 read pairs.  It
       first  collects	pairs of reads with both ends mapped with a single-end
       quality 20 or higher and	then calculates	median (Q2), lower and	higher
       quartile	(Q1 and	Q3). It	estimates the mean and the variance of the in-
       sert size distribution from pairs whose insert sizes are	within	inter-
       val  [Q1-2(Q3-Q1), Q3+2(Q3-Q1)].	The maximum distance x for a pair con-
       sidered to be properly paired (SAM flag 0x2) is calculated  by  solving
       equation	 Phi((x-mu)/sigma)=x/L*p0,  where mu is	the mean, sigma	is the
       standard	error of the insert size distribution, L is the	length of  the
       genome, p0 is prior of anomalous	pair and Phi() is the standard cumula-
       tive distribution function. For mapping Illumina	short-insert reads  to
       the  human  genome, x is	about 6-7 sigma	away from the mean. Quartiles,
       mean, variance and x will be printed to the standard error output.
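
       The quartile-based filtering step described above can be sketched in
       Python as follows. This is an illustration under stated assumptions,
       not BWA's implementation: the function name is hypothetical, quartiles
       are taken by a simple rounded index into the sorted sizes, the sample
       variance is used, and the final step of solving for the maximum
       distance x is not covered.

```python
import statistics


def estimate_insert_size(sizes):
    """Estimate mean and variance from sizes within [Q1-2(Q3-Q1), Q3+2(Q3-Q1)]."""
    ordered = sorted(sizes)
    n = len(ordered)

    def quantile(frac):
        # Rounded-index quantile; the exact rule is an assumption.
        return ordered[min(n - 1, int(n * frac + 0.5))]

    q1, q2, q3 = quantile(0.25), quantile(0.50), quantile(0.75)
    low, high = q1 - 2 * (q3 - q1), q3 + 2 * (q3 - q1)
    kept = [s for s in ordered if low <= s <= high]
    return {
        "q1": q1, "q2": q2, "q3": q3,
        "low": low, "high": high,
        "mean": statistics.mean(kept),
        "variance": statistics.variance(kept),  # needs at least two kept sizes
    }


# The outlying pair with insert size 1000 falls outside the interval.
print(estimate_insert_size([200, 210, 190, 205, 195, 1000, 198, 202, 207, 193]))
```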

   Memory Requirement
       With bwtsw algorithm, 5GB memory	is required for	indexing the  complete
       human  genome  sequences.  For short reads, the aln command uses	~3.2GB
       memory and the sampe command uses ~5.4GB.

   Speed
       Indexing the human genome sequences takes 3 hours with the bwtsw
       algorithm. Indexing smaller genomes with the IS algorithm is faster,
       but requires more memory.

       The speed of alignment is largely determined by the error rate of the
       query sequences (r). Firstly, BWA runs much faster for near perfect
       hits than for hits with many differences, and it stops searching for a
       hit with l+2 differences if an l-difference hit is found. This means
       BWA will be very slow if r is high because in this case BWA has to
       visit hits with many differences and looking for these hits is
       expensive. Secondly, the underlying alignment algorithm makes the speed
       sensitive to [k log(N)/m], where k is the maximum allowed differences,
       N the size of the database and m the length of a query. In practice, we
       choose k w.r.t. r and therefore r is the leading factor. I would not
       recommend using BWA on data with r>0.02.

       Pairing is slower for shorter reads. This is mainly because shorter
       reads have more spurious hits and converting SA coordinates to
       chromosomal coordinates is very costly.

       Since version 0.6, BWA has been able to work with  a  reference	genome
       longer  than 4GB.  This feature makes it	possible to integrate the for-
       ward and	reverse	complemented genome in one FM-index, which  speeds  up
       both  BWA-short and BWA-SW. As a	tradeoff, BWA uses more	memory because
       it has to keep all positions and ranks in 64-bit integers, twice as
       large as the 32-bit integers used in the previous versions.

       The latest BWA-SW also works for	paired-end reads longer	than 100bp. In
       comparison to BWA-short,	BWA-SW tends to	be more	 accurate  for	highly
       unique reads and more robust to relatively long INDELs and structural
       variants.  Nonetheless, BWA-short usually has higher power  to  distin-
       guish the optimal hit from many suboptimal hits.	The choice of the map-
       ping algorithm may depend on the	application.

       BWA website, Samtools website

       Heng Li at the Sanger Institute wrote the key source code and
       integrated the following code for BWT construction: bwtsw, implemented
       by Chi-Kwong Wong at the University of Hong Kong, and IS, originally
       proposed by Nong Ge at the Sun Yat-Sen University and implemented by
       Yuta Mori.

       The full	BWA package is distributed under GPLv3 as it uses source codes
       from BWT-SW which is covered by GPL. Sorting, hash table,  BWT  and  IS
       libraries are distributed under the MIT license.

       If you use the BWA-backtrack algorithm, please cite the following paper:

       Li H. and Durbin	R. (2009) Fast and accurate short read alignment  with
       Burrows-Wheeler	 transform.   Bioinformatics,  25,  1754-1760.	[PMID:

       If you use the BWA-SW algorithm,	please cite:

       Li H. and Durbin	R. (2010) Fast and accurate long-read  alignment  with
       Burrows-Wheeler	 transform.   Bioinformatics,	26,   589-595.	[PMID:

       If you use BWA-MEM or the fastmap component of BWA, please cite:

       Li H. (2013) Aligning sequence reads, clone sequences and assembly con-
       tigs with BWA-MEM. arXiv:1303.3997v1 [q-bio.GN].

       It  is likely that the BWA-MEM manuscript will not appear in a peer-re-
       viewed journal.

       BWA is largely influenced by BWT-SW. It uses source codes  from	BWT-SW
       and  mimics its binary file formats; BWA-SW resembles BWT-SW in several
       ways. The initial idea about BWT-based alignment	 also  came  from  the
       group  who  developed BWT-SW. At	the same time, BWA is different	enough
       from BWT-SW. The short-read alignment algorithm bears no similarity to
       the Smith-Waterman algorithm any more. While BWA-SW learns from BWT-SW,
       it introduces heuristics that can hardly be applied to the original
       algorithm. In all, BWA does not guarantee to find all local hits as
       BWT-SW is designed to do, but it is much faster than BWT-SW on both
       short and long query sequences.

       I started to write the first piece of code on 24 May 2008 and got the
       initial stable version on 02 June 2008. During this period, I learned
       that Professor Tak-Wah Lam, the first author of the BWT-SW paper, was
       collaborating with Beijing Genomics Institute on SOAP2, the successor
       to SOAP (Short Oligonucleotide Analysis Package). SOAP2 came out in
       November 2008. According to the SourceForge download page, the third
       BWT-based short read aligner, bowtie, was first released in August
       2008. At the time of writing this manual, at least three more
       BWT-based short-read aligners are being implemented.

       The BWA-SW algorithm is a new component of BWA. It was conceived	in No-
       vember 2008 and implemented ten months later.

       The BWA-MEM algorithm is	based on an  algorithm	finding	 super-maximal
       exact  matches (SMEMs), which was first published with the fermi	assem-
       bler paper in 2012. I first implemented the basic SMEM algorithm	in the
       fastmap command for an experiment and then extended the basic algorithm
       and added the extension part in February 2013 to make BWA-MEM a fully
       featured	mapper.

bwa-0.7.17-r1188		23 October 2017				bwa(1)

