cmalign(1)			Infernal Manual			    cmalign(1)

       cmalign - align sequences to a covariance model

       cmalign [options] _cmfile_ _seqfile_

       cmalign	aligns	the RNA	sequences in _seqfile_ to the covariance model
       (CM) in _cmfile_.  The new alignment is output to stdout	 in  Stockholm
       format, but can be redirected to	a file _f_ with	the -o _f_ option.

       Either  _cmfile_	 or  _seqfile_ (but not	both) may be '-' (dash), which
       means reading this input	from stdin rather than a file.

       The sequence file _seqfile_ must	be in FASTA or Genbank format.

       cmalign uses an HMM banding technique to	accelerate  alignment  by  de-
       fault  as  described below for the --hbanded option. HMM	banding	can be
       turned off with the --nonbanded option.

       By default, cmalign computes the	alignment with maximum expected	 accu-
       racy  that  is consistent with constraints (bands) derived from an HMM,
       using a banded version of the Durbin/Holmes optimal accuracy algorithm.
       This behavior can be changed with the --cyk or --sample options.

       cmalign	takes  special	care  to  correctly align truncated sequences,
       where some nucleotides from the beginning (5') and/or end (3')  of  the
       actual full length biological sequence are not present in the input se-
       quence (see DL Kolbe and	SR Eddy, Bioinformatics, 25:1236-1243,	2009).
       This  behavior  is on by	default, but can be turned off with --notrunc.
       In previous versions of cmalign the --sub option	was required to	appro-
       priately	 handle	 truncated sequences. The --sub	option is still	avail-
       able in this version, but the new default method	for handling truncated
       sequences should be as good or superior to the sub method in nearly all
       cases.

       The --mapali _s_ option allows inclusion of the fixed training
       alignment used to build the CM from file _s_ within the output
       alignment of cmalign.

       It is possible to merge two or more alignments created by the  same  CM
       using  the  Easel miniapp esl-alimerge (included	in the easel/miniapps/
       subdirectory of Infernal). Previous versions of	cmalign	 included  op-
       tions  to merge alignments but they were	deprecated upon	development of
       esl-alimerge, which is significantly more memory	efficient.

       By default, cmalign will	output the alignment to	stdout.	 The alignment
       can  be	redirected  to an output file _f_ with the -o _f_ option. With
       -o, information on each aligned sequence, including score and model
       alignment boundaries, will be printed to stdout (more on this below).

       The  output  alignment will be in Stockholm format by default. This can
       be changed to Pfam, aligned FASTA (AFA),	A2M, Clustal, or Phylip	format
       using  the --outformat _s_ option, where	_s_ is the name	of the desired
       format. As a special case, if the output alignment is large (more than
       10,000 sequences or more than 10,000,000 total nucleotides), then the
       output format will be Pfam format, with each sequence appearing on a
       single line, for reasons of memory efficiency. For alignments larger
       than this, using	--ileaved will force interleaved Stockholm format, but
       the  user  should  be  aware  that  this	 may  require a	lot of memory.
       --ileaved will only work	for alignments	up  to	100,000	 sequences  or
       100,000,000 total nucleotides.
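
       The size rules above can be summarized in a short Python sketch. This
       is only an illustration of the thresholds quoted in the text, not code
       from Infernal; the function name default_output_layout and its
       arguments are hypothetical.

```python
# Thresholds quoted from the paragraph above.
PFAM_SWITCH_SEQS = 10_000
PFAM_SWITCH_NT = 10_000_000
ILEAVED_MAX_SEQS = 100_000
ILEAVED_MAX_NT = 100_000_000


def default_output_layout(n_seqs, n_nucleotides, ileaved=False):
    """Return the output layout the text describes for an alignment size.

    Hypothetical helper: it mirrors the documented limits only and does not
    model --outformat.
    """
    if ileaved:
        # --ileaved forces interleaved Stockholm, but only up to these limits.
        if n_seqs > ILEAVED_MAX_SEQS or n_nucleotides > ILEAVED_MAX_NT:
            raise ValueError("alignment too large for --ileaved")
        return "Stockholm (interleaved)"
    # Large alignments fall back to one-line-per-sequence Pfam format.
    if n_seqs > PFAM_SWITCH_SEQS or n_nucleotides > PFAM_SWITCH_NT:
        return "Pfam"
    return "Stockholm"


print(default_output_layout(500, 40_000))           # Stockholm
print(default_output_layout(25_000, 2_000_000))     # Pfam
print(default_output_layout(25_000, 2_000_000, ileaved=True))
```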

       If  the output alignment	format is Stockholm or Pfam, the output	align-
       ment will be annotated with posterior probabilities which estimate  the
       confidence  level  of each aligned nucleotide.  This annotation appears
       as lines	beginning with "#=GR <seq name>	PP", one  per  sequence,  each
       immediately  below  the	corresponding  aligned	sequence "<seq name>".
       Characters in PP	lines have 12 possible values: "0-9", "*", or ".".  If
       ".",  the position corresponds to a gap in the sequence.	A value	of "0"
       indicates a posterior probability of between 0.0	and  0.05,  "1"	 indi-
       cates between 0.05 and 0.15, "2"	indicates between 0.15 and 0.25	and so
       on up to	"9" which indicates between 0.85 and 0.95. A value of "*"  in-
       dicates	a posterior probability	of between 0.95	and 1.0. Higher	poste-
       rior probabilities correspond to	greater	confidence  that  the  aligned
       nucleotide  belongs  where  it  appears	in the alignment.  With	--non-
       banded, the calculation of the posterior	 probabilities	considers  all
       possible	 alignments  of	 the target sequence to	the CM.	Without	--non-
       banded (i.e. in default mode), the calculation considers	only  possible
       alignments  within  the HMM bands. Further, the posterior probabilities
       are conditional on the truncation mode of the alignment.	 For  example,
       if the sequence alignment is truncated 5', a PP value of "9" indicates
       that the given nucleotide appears at the given position with a
       posterior probability between 0.85 and 0.95, considering only 5'
       truncated alignments. The posterior annotation can be
       turned off with the --noprob option. If --small is  enabled,  posterior
       annotation must also be turned off using	--noprob.
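
       As an illustration of the PP encoding described above, the following
       Python sketch maps a single PP character to the probability range it
       stands for. The function name pp_range is hypothetical; the ranges are
       taken directly from the text.

```python
def pp_range(ch):
    """Map one PP annotation character to its (low, high) probability range.

    Returns None for ".", which marks a gap in the sequence.
    """
    if ch == ".":
        return None
    if ch == "*":
        return (0.95, 1.0)
    if ch == "0":
        return (0.0, 0.05)
    if ch in "123456789":
        d = int(ch)
        # "1" covers 0.05-0.15, "2" covers 0.15-0.25, ..., "9" covers 0.85-0.95.
        return ((10 * d - 5) / 100, (10 * d + 5) / 100)
    raise ValueError(f"not a PP character: {ch!r}")


def low_confidence_columns(pp_line, threshold="5"):
    """Return 0-based columns whose PP digit is below the threshold digit."""
    return [i for i, ch in enumerate(pp_line) if ch.isdigit() and ch < threshold]


print(pp_range("9"))                       # (0.85, 0.95)
print(low_confidence_columns("**97.31*"))  # [5, 6]
```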

       The  tabular  output that is printed to stdout if the -o	option is used
       includes one line per sequence and twelve fields per line: "idx": the
       index of the sequence in the input file; "seq name": the sequence name;
       "length": the length of the sequence; "cm from" and "cm to": the model
       start and end positions of the alignment; "trunc": "no" if the sequence
       is not truncated, "5'" if the beginning of the sequence is truncated,
       "3'" if the end of the sequence is truncated, and "5'&3'" if both the
       beginning and the end are truncated; "bit sc": the bit score of the
       alignment; "avg pp": the average posterior probability of all aligned
       nucleotides in the alignment; "band calc", "alignment" and "total": the
       time  in	 seconds  required  for	 calculating  HMM bands, computing the
       alignment, and complete processing of the sequence, respectively;  "mem
       (Mb)":  the size	in Mb of all dynamic programming matrices required for
       aligning	the sequence.  This tabular data can be	saved to file _f_ with
       the --sfile _f_ option.
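
       A minimal Python sketch for reading one line of this tabular output
       follows. It assumes the twelve fields are whitespace separated, that
       sequence names contain no whitespace, and that lines beginning with "#"
       are comments; none of these layout details is stated above, so check
       them against real output. The sample line is invented for illustration.

```python
from typing import NamedTuple, Optional


class ScoreRow(NamedTuple):
    idx: int
    seq_name: str
    length: int
    cm_from: int
    cm_to: int
    trunc: str          # "no", "5'", "3'" or "5'&3'"
    bit_sc: float
    avg_pp: float
    band_calc: float    # seconds
    alignment: float    # seconds
    total: float        # seconds
    mem_mb: float


def parse_score_line(line: str) -> Optional[ScoreRow]:
    """Parse one tabular line; return None for blank or comment lines."""
    line = line.strip()
    if not line or line.startswith("#"):   # assumed comment convention
        return None
    f = line.split()
    if len(f) != 12:
        raise ValueError(f"expected 12 fields, found {len(f)}")
    return ScoreRow(int(f[0]), f[1], int(f[2]), int(f[3]), int(f[4]), f[5],
                    *map(float, f[6:]))


# Invented example line, not real cmalign output.
row = parse_score_line("1 example-seq 72 1 71 no 57.31 0.981 0.00 0.01 0.01 0.12")
print(row.seq_name, row.trunc, row.bit_sc)
```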

       -h     Help; print a brief reminder of command line usage and available
              options.

       -o _f_ Save the alignment in Stockholm format to	a file _f_.   The  de-
	      fault is to write	it to standard output.

       -g     Configure the model for global alignment of the query model to
              the target sequences. By default, the model is configured for
              local alignment. Local alignments can contain large insertions
              and deletions in the structure, called "local ends", which are
              penalized differently than normal indels. These are annotated as
              "~" columns in the RF line of the output alignment. The -g option
              can be used to disallow these local ends. The -g option is
              required if the --sub option is also used.

	      Align sequences using the	Durbin/Holmes optimal  accuracy	 algo-
	      rithm. This is the default.  The optimal accuracy	alignment will
	      be constrained by	HMM bands for acceleration unless  the	--non-
	      banded option is enabled.	 The optimal accuracy algorithm	deter-
	      mines the	alignment that maximizes the  posterior	 probabilities
              of the aligned nucleotides within it. The posterior probabilities
              are determined using (possibly HMM banded) variants of
	      the Inside and Outside algorithms.

       --cyk  Do not use the Durbin/Holmes optimal accuracy algorithm to align
              the sequences; instead, use the CYK algorithm, which determines
              the optimally scoring (maximum likelihood) alignment of the
              sequence to the model, given the HMM bands (unless --nonbanded is
              also enabled).

       --sample
              Sample an alignment from the posterior distribution of
              alignments. The posterior distribution is determined using an HMM
              banded (unless --nonbanded) variant of the Inside algorithm.

       --seed _n_
	      Seed  the	 random	 number	 generator  with _n_, an integer >= 0.
	      This option can only be used in combination with	--sample.   If
	      _n_ is nonzero, stochastic sampling of alignments	will be	repro-
	      ducible; the same	command	will give the same results.  If	_n_ is
	      0,  the  random number generator is seeded arbitrarily, and sto-
	      chastic samplings	may vary from run to run of the	same  command.
	      The default seed is 181.

       --notrunc
              Turn off truncated alignment algorithms. All sequences in the
              input file will be assumed to be full length, unless --sub is
              also used, in which case the program can still handle truncated
              sequences but will use an alternative strategy for their
              alignment.

       --sub  Turn  on the sub model construction and alignment	procedure. For
	      each sequence, an	HMM is first used to predict the  model	 start
	      and  end consensus columns, and a	new sub	CM is constructed that
	      only models consensus columns from start to end. The sequence is
	      then  aligned  to	this sub CM.  Sub alignment is an older	method
	      than the default one for aligning	sequences  that	 are  possibly
	      truncated.  By  default,	cmalign	 uses special DP algorithms to
	      handle truncated sequences which should be  more	accurate  than
	      the sub method in	most cases.  --sub is still included as	an op-
	      tion mainly for testing against this default truncated  sequence
	      handling.	  This	"sub CM" procedure is not the same as the "sub
	      CMs" described by	Weinberg and Ruzzo.

       --hbanded
              This option is turned on by default. Accelerate alignment by
	      pruning  away regions of the CM DP matrix	that are deemed	negli-
	      gible by an HMM.	First, each sequence is	scored with a CM  plan
	      9	HMM derived from the CM	using the Forward and Backward HMM al-
	      gorithms to calculate posterior probabilities that each  nucleo-
	      tide aligns to each state	of the HMM. These posterior probabili-
	      ties are used to derive constraints (bands) on the CM DP matrix.
	      Finally,	the  target  sequence  is  aligned to the CM using the
	      banded DP	matrix,	during which cells outside the bands  are  ig-
	      nored. Usually most of the full DP matrix	lies outside the bands
	      (often more than 95%),  making  this  technique  faster  because
	      fewer  DP	 calculations  are required, and more memory efficient
	      because only cells within	the bands need be allocated.

	      Importantly, HMM banding sacrifices the guarantee	of determining
              the optimally accurate or optimally scoring alignment, which will
              be missed if it lies outside the bands. The tau parameter is the
	      amount of	probability mass considered negligible during HMM band
              calculation; higher values of tau yield greater speedups but also
	      a	 greater  chance of missing the	optimal	alignment. The default
	      tau is 1E-7, determined empirically as a good  tradeoff  between
	      sensitivity and speed, though this value can be changed with the
	      --tau  <x> option. The level of acceleration increases with both
	      the  length  and primary sequence	conservation level of the fam-
	      ily. For example,	with the default tau of	1E-7, tRNA models (low
	      primary  sequence	 conservation  with length of about 75 nucleo-
	      tides) show about	10X acceleration, and SSU bacterial rRNA  mod-
	      els  (high  primary  sequence  conservation with length of about
	      1500 nucleotides)	show about 700X.  HMM banding  can  be	turned
	      off with the --nonbanded option.
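
              To make the role of tau concrete, here is a simplified Python
              sketch of deriving a band from one HMM state's posterior
              probabilities. It is not Infernal's implementation: it assumes
              that up to tau of probability mass is trimmed from each tail
              separately, and the function name band_from_posteriors is
              hypothetical.

```python
def band_from_posteriors(posteriors, tau=1e-7):
    """Return (first, last) sequence positions kept for one state.

    posteriors[i] is the posterior probability that position i aligns to the
    state. Positions are dropped from each end while the dropped mass stays
    at or below tau (an assumed, simplified trimming rule).
    """
    n = len(posteriors)
    if n == 0:
        raise ValueError("empty posterior vector")
    lo, mass = 0, 0.0
    while lo < n - 1 and mass + posteriors[lo] <= tau:
        mass += posteriors[lo]
        lo += 1
    hi, mass = n - 1, 0.0
    while hi > lo and mass + posteriors[hi] <= tau:
        mass += posteriors[hi]
        hi -= 1
    return lo, hi


# Toy posterior vector, invented for illustration.
post = [1e-9, 1e-8, 0.2, 0.6, 0.2, 1e-8, 1e-9]
print(band_from_posteriors(post, tau=1e-7))   # (2, 4)
print(band_from_posteriors(post, tau=0.25))   # (3, 3)
```

              In this toy case the larger tau gives the narrower band, which
              means fewer DP cells to fill but a higher risk of excluding the
              best alignment.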

       --tau _x_
	      Set  the	tail loss probability used during HMM band calculation
	      to _x_.  This is the amount of probability mass within  the  HMM
	      posterior	 probabilities	that is	considered negligible. The de-
	      fault value is 1E-7.  In general,	higher values will  result  in
	      greater acceleration, but	increase the chance of missing the op-
	      timal alignment due to the HMM bands.

       --mxsize	_x_
	      Set the maximum allowable	total DP matrix	size to	_x_ megabytes.
	      By  default  this	 size is 1028 Mb.  This	should be large	enough
              for the vast majority of alignments. If it is not, cmalign will
              attempt to iteratively tighten the HMM bands it uses to
              constrain the alignment by raising the tau parameter and
              recalculating the bands until the total matrix size needed falls
              below _x_ megabytes or the maximum allowable tau value (0.05 by
              default, but changeable with --maxtau) is reached. At each
              iteration of band tightening, tau is multiplied by 2.0. The band
              tightening strategy can be turned off with the --fixedtau
              option. If the maximum tau is reached and the required matrix
              size still exceeds _x_, or if HMM banding is not being used and
              the required matrix size exceeds _x_, then cmalign will exit
              prematurely and report an error message that the matrix exceeded
              its maximum allowable size. In this case, the --mxsize option can
              be used to raise the size limit or the maximum tau can be raised
	      with --maxtau.  The limit	will commonly  be  exceeded  when  the
	      --nonbanded  option  is used without the --small option, but can
	      still occur when --nonbanded is not used.	Note that  if  cmalign
	      is being run in _n_ multiple threads on a	multicore machine then
	      each thread may have an allocated	matrix of up to	size _x_ Mb at
	      any given	time.
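
              The band tightening loop described for --mxsize can be sketched
              in Python as follows. This is an illustration of the documented
              behavior, not Infernal source code. The callable required_mb
              (mapping a tau value to the matrix size in Mb) is hypothetical,
              and stopping when the next doubling would exceed the maximum tau
              is an assumption about how the limit is applied.

```python
def tighten_bands(required_mb, mxsize=1028.0, tau=1e-7, maxtau=0.05,
                  factor=2.0):
    """Raise tau until the matrix fits in mxsize Mb or maxtau is reached.

    Returns the (tau, size_mb) pair that was accepted, or raises MemoryError
    if the matrix still exceeds mxsize at the largest tau tried.
    """
    size = required_mb(tau)
    while size > mxsize and tau * factor <= maxtau:
        tau *= factor                 # tau is multiplied by 2.0 per iteration
        size = required_mb(tau)       # recalculate bands with the new tau
    if size > mxsize:
        raise MemoryError(
            f"matrix needs {size:.0f} Mb, limit is {mxsize:.0f} Mb")
    return tau, size


# Toy size model, invented for illustration: size halves as tau doubles.
tau, size = tighten_bands(lambda t: 4000.0 * 1e-7 / t)
print(tau, round(size))
```

              With the default starting tau of 1E-7 and a maximum of 0.05,
              this sketch allows at most 18 doublings before giving up.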

       --fixedtau
              Turn off the HMM band tightening strategy described in the
              explanation of the --mxsize option above.

       --maxtau	_x_
	      Set the maximum allowed value for	tau  during  band  tightening,
	      described	 in the	explanation of --mxsize	above, to _x_.	By de-
	      fault this value is 0.05.

       --nonbanded
              Turn off HMM banding. The returned alignment is guaranteed to
              be the globally optimally accurate one (by default) or the
              globally optimally scoring one (if --cyk is enabled). The --small
              option is recommended in combination with this option, because
              standard alignment without HMM banding requires a lot of memory
              (see --small).

       --small
              Use the divide and conquer CYK alignment algorithm described in
              SR Eddy, BMC Bioinformatics 3:18, 2002. The --nonbanded option
              must be used in combination with this option. Also, it is
              recommended whenever --nonbanded is used that --small is also
              used, because standard CM alignment without HMM banding requires
              a lot of memory, especially for large RNAs. --small allows CM
              alignment within practical memory limits, reducing the memory
              required for aligning LSU rRNA, the largest known RNAs, from 150
              Gb to less than 300 Mb. This option can only be used in
              combination with --nonbanded, --notrunc, and --cyk.
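
              Several options on this page only work in combination with
              others. The Python sketch below collects the combinations stated
              in the text so that a wrapper script could check them before
              launching a run. The function name check_cmalign_options is
              hypothetical, and the list covers only the rules quoted in its
              comments; cmalign itself may enforce more.

```python
def check_cmalign_options(opts):
    """Return a list of problems for a set of option names.

    opts is a set such as {"--small", "--cyk"}; option arguments are ignored.
    """
    problems = []
    if "--small" in opts:
        # --small needs --nonbanded, --notrunc and --cyk, and posterior
        # annotation must be turned off with --noprob.
        for needed in ("--nonbanded", "--notrunc", "--cyk", "--noprob"):
            if needed not in opts:
                problems.append(f"--small requires {needed}")
    if "--seed" in opts and "--sample" not in opts:
        # --seed can only be used in combination with --sample.
        problems.append("--seed requires --sample")
    if "--sub" in opts and "-g" not in opts:
        # -g is required if --sub is used.
        problems.append("--sub requires -g")
    return problems


print(check_cmalign_options({"--small", "--nonbanded", "--cyk"}))
print(check_cmalign_options({"--sample", "--seed"}))   # []
```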

       --sfile _f_
              Dump per-sequence alignment score and timing information to file
	      _f_.   The format	of this	file is	described above	(it's the same
	      data in the same format as the tabular stdout output when	the -o
	      option is	used).

       --tfile _f_
	      Dump tabular sequence tracebacks for each	individual sequence to
	      a	file _f_.  Primarily useful for	debugging.

       --ifile _f_
	      Dump per-sequence	insert information to file _f_.	 The format of
	      the  file	is described by	"#"-prefixed comment lines included at
	      the top of the file _f_.	The insert information is  valid  even
	      when the --matchonly option is used.

       --elfile	_f_
	      Dump  per-sequence  EL  state  (local end) insert	information to
	      file _f_.	 The format of the file	is described  by  "#"-prefixed
              comment lines included at the top of the file _f_. The EL
              insert information is valid even when the --matchonly option is
              used.

       --mapali	_f_
              Read the alignment from file _f_ that was used to build the
              model and align it as a single object to the CM; i.e., the
              alignment in _f_ is held fixed. This allows you to align
              sequences to a model with cmalign and view them in the context
              of an existing trusted multiple alignment. _f_ must be the
              alignment file that the CM was built from. The program verifies
              that the checksum of the file matches that of the file used to
              construct the CM. A similar option to this one was called
              --withali in previous versions of cmalign.

              Must be used in combination with --mapali _f_. Propagate
              structural information for any pseudoknots that exist in _f_ to the
	      output  alignment.  A  similar  option  to  this	one was	called
	      --withstr	in previous versions of	cmalign.

       --informat _s_
	      Assert that the input _seqfile_ is in format _s_.	  Do  not  run
              Babelfish format autodetection. This increases the reliability of
	      the program somewhat, because the	Babelfish can  make  mistakes;
	      particularly recommended for unattended, high-throughput runs of
	      Infernal.	 Acceptable formats are:  FASTA,  GENBANK,  and	 DDBJ.
	      _s_ is case-insensitive.

       --outformat _s_
	      Specify  the output alignment format as _s_.  Acceptable formats
	      are: Pfam, AFA, A2M, Clustal, and	Phylip.	 AFA is	aligned	fasta.
	      Only Pfam	and Stockholm alignment	formats	will include consensus
	      structure	annotation and	posterior  probability	annotation  of
	      aligned residues.

              Output the alignments as DNA sequence alignments, instead of RNA
              ones.

       --noprob
              Do not annotate the output alignment with posterior
              probabilities.

       --matchonly
              Only include match columns in the output alignment; do not
              include any insertions relative to the consensus model. This
              option may be useful when creating very large alignments that
              require a lot of memory and disk space, most of which is
              necessary only to deal with insert columns that are gaps in most
              sequences.

       --ileaved
              Output the alignment in interleaved Stockholm format of a fixed
              width that may be more convenient for examination. This was the
              default output alignment format of previous versions of cmalign.
              Note that cmalign requires more memory when this option is used.
              For this reason, --ileaved will only work for alignments of up
              to 100,000 sequences or a total of 100,000,000 aligned
              nucleotides.

       --regress _s_
	      Save an additional copy of the output alignment with  no	author
	      information to file _s_.

	      Output additional	information in the tabular scores output (out-
	      put to stdout if -o is used, or to _f_ if	--sfile	_f_ is	used).
	      These are	mainly useful for testing and debugging.

       --cpu _n_
	      Specify  that _n_	parallel CPU workers be	used. If _n_ is	set as
	      "0", then	the program will be run	in serial mode,	without	 using
	      threads.	 You  can also control this number by setting an envi-
	      ronment variable,	 INFERNAL_NCPU.	  This	option	will  only  be
	      available	 if the	machine	on which Infernal was built is capable
	      of using POSIX threading (see the	Installation  section  of  the
	      user guide for more information).

       --mpi  Run  as an MPI parallel program. This option will	only be	avail-
	      able if Infernal has been	configured and built with  the	"--en-
	      able-mpi"	 flag  (see the	Installation section of	the user guide
	      for more information).

       See infernal(1) for a master man	page with a list of all	the individual
       man pages for programs in the Infernal package.

       For  complete documentation, see	the user guide that came with your In-
       fernal distribution (Userguide.pdf); or see the Infernal	web page ().

       Copyright (C) 2019 Howard Hughes	Medical	Institute.
       Freely distributed under	the BSD	open source license.

       For additional information on copyright and  licensing,	see  the  file
       called  COPYRIGHT  in your Infernal source distribution,	or see the In-
       fernal web page ().

       The Eddy/Rivas Laboratory
       Janelia Farm Research Campus
       19700 Helix Drive
       Ashburn VA 20147	USA

Infernal 1.1.3			   Nov 2019			    cmalign(1)

