Skip site navigation (1)Skip section navigation (2)

FreeBSD Manual Pages


home | help
cmsearch(1)			Infernal Manual			   cmsearch(1)

       cmsearch	- search covariance model(s) against a sequence	database

       cmsearch	[options] _cmfile_ _seqdb_

       cmsearch	 is used to search one or more covariance models (CMs) against
       a sequence database.  For each CM in _cmfile_, use  that	 query	CM  to
       search  the  target database of sequences in _seqdb_, and output	ranked
       lists of	the sequences with the most significant	matches	to the CM.  To
       build CMs from multiple alignments, see cmbuild.

       The  query _cmfile_ must	have been calibrated for E-values with cmcali-
       brate.  As a special exception, any models in _cmfile_ that  have  zero
       basepairs  need not be calibrated. For these models, profile HMM	search
       algorithms will be used instead of CM ones, as discussed	further	below.

       The query _cmfile_ may be '-' (a	dash character),  in  which  case  the
       query CM	input will be read from	a <stdin> pipe instead of from a file.
       The _seqdb_ may not be '-' because the current implementation needs  to
       be able to rewind the database, which is	not possible with stdin	input.

       The output format is designed to	be human-readable, but is often	so vo-
       luminous	that reading it	is impractical,	and parsing it is a pain.  The
       --tblout	option saves output in a simple	tabular	format that is concise
       and easier to parse.  The -o option allows redirecting the main output,
       including throwing it away in /dev/null.

       cmsearch	 reexamines  the  5'  and 3' termini of	target sequences using
       specialized algorithms for detection of truncated hits, in  which  part
       of  the	5' and/or 3' end of the	actual full length homologous sequence
       is missing in the target	sequence file. These types  of	hits  will  be
       most  common  in	 sequence  files  consisting of	unassembled sequencing
       reads. By default, any 5' truncated hit	is  required  to  include  the
       first  residue  of  the target sequence it derives from in _seqdb_, and
       any 3' truncated	hit is required	to include the final  residue  of  the
       target  sequence	 it derives from. Any 5' and 3'	truncated hit must in-
       clude the first and final residue of the	 target	 sequence  it  derives
       from.  The --anytrunc option will relax the requirements	for hit	inclu-
       sion of sequence	endpoints, and truncated hits are allowed to start and
       stop  at	 any  positions	 of target sequences. Importantly though, with
       --anytrunc, hit E-values	will be	less accurate because  model  calibra-
       tion  does  not	consider  the possibility of truncated hits, so	use it
       with caution.  The --notrunc option can be used to turn	off  truncated
       hit  detection.	 --notrunc  will  reduce the running time of cmsearch,
       most significantly for target _seqdb_ files that	include	many short se-

       Truncated  hit  detection  is  automatically turned off when the	--max,
       --nohmm,	--qdb, or --nonbanded options are used because	it  relies  on
       the  use	of an accelerated HMM banded alignment strategy	that is	turned
       off by any of those options.

       -h     Help; print a brief reminder  of	command	 line  usage  and  all
	      available	options.

       -g     Turn  on	the glocal alignment algorithm,	global with respect to
	      the query	model and local	with respect to	the  target  database.
	      By default, the local alignment algorithm	is used	which is local
	      with respect to both the target sequence and the model. In local
	      mode,  the  alignment to span two	or more	subsequences if	neces-
	      sary (e.g. if the	structures of the query	model and  target  se-
	      quence are only partially	shared), allowing certain large	inser-
	      tions and	deletions in the structure to be penalized differently
	      than  normal  indels.  Local  mode  performs better on empirical
	      benchmarks and is	significantly more sensitive for remote	homol-
	      ogy  detection.  Empirically,  glocal searches return many fewer
	      hits than	local searches,	so glocal may be desired for some  ap-
	      plications.  With	 -g, all models	must be	calibrated, even those
	      with zero	basepairs.

       -Z _x_ Calculate	E-values as if the search space	size was _x_ megabases
	      (Mb).  Without  the use of this option, the search space size is
	      defined as the total number of nucleotides in _seqdb_  times  2,
	      because both strands of each target sequence will	be searched.

	      Print  help,  as	with -h	, but also include expert options that
	      are not displayed	with -h	.  These expert	options	 are  not  ex-
	      pected  to be relevant for the vast majority of users and	so are
	      not described in the manual page.	 The only resources for	under-
	      standing	what  they actually do are the brief one-line descrip-
	      tions output when	--devhelp is enabled, and the source code.

       -o _f_ Direct the main human-readable output to a file _f_  instead  of
	      the default stdout.

       -A _f_ Save  a multiple alignment of all	significant hits (those	satis-
	      fying inclusion thresholds) to the file _f_.

       --tblout	_f_
	      Save a simple tabular  (space-delimited)	file  summarizing  the
	      hits  found, with	one data line per hit. The format of this file
	      is described in section 6	of the Infernal	user guide.

       --acc  Use accessions instead of	names in the main output, where	avail-
	      able for profiles	and/or sequences.

	      Omit  the	 alignment  section  from  the	main  output. This can
	      greatly reduce the output	volume.

	      Unlimit the length of each line in the main output. The  default
	      is a limit of 120	characters per line, which helps in displaying
	      the output cleanly on terminals and in editors, but can truncate
	      target profile description lines.

       --textw _n_
	      Set  the	main  output's line length limit to _n_	characters per
	      line. The	default	is 120.

	      Include extra search pipeline statistics in the main output, in-
	      cluding  filter  survival	statistics for truncated hit detection
	      and number of envelopes discarded	due to matrix size overflows.

       Reporting thresholds control which hits are reported  in	 output	 files
       (the  main output and --tblout) Hits are	ranked by statistical signifi-
       cance (E-value).	 By default, all hits with an E-value <=  10  are  re-
       ported.	 The following options allow you to change the default E-value
       reporting thresholds, or	to use bit score thresholds instead.

       -E _x_ In the per-target	output,	report target  sequences  with	an  E-
	      value  of	<= _x_.	 The default is	10.0, meaning that on average,
	      about 10 false positives will be reported	per query, so you  can
	      see  the top of the noise	and decide for yourself	if it's	really

       -T _x_ Instead of thresholding per-CM output on E-value,	report	target
	      sequences	with a bit score of >= _x_.

       Inclusion thresholds are	stricter than reporting	thresholds.  Inclusion
       thresholds control which	hits are considered to be reliable  enough  to
       be  included  in	an output alignment or in a possible subsequent	search
       round, or marked	as significant ("!") as	opposed	to questionable	 ("?")
       in hit output.

       --incE _x_
	      Use  an  E-value	of <= _x_ as the hit inclusion threshold.  The
	      default is 0.01, meaning that on average,	about 1	false positive
	      would be expected	in every 100 searches with different query se-

       --incT _x_
	      Instead of using E-values	for setting the	 inclusion  threshold,
	      instead  use  a bit score	of >= _x_ as the hit inclusion thresh-
	      old.  By default this option is unset.

       Curated CM databases may	define specific	bit score thresholds for  each
       CM,  superseding	 any  thresholding  based  on statistical significance

       To use these options, the profile must contain the appropriate (GA, TC,
       and/or  NC)  optional  score threshold annotation; this is picked up by
       cmbuild from Stockholm format alignment files. Each thresholding	option
       has  a score of _x_ bits, and acts as if	-T _x_ --incT _x_ has been ap-
       plied specifically using	each model's curated thresholds.

	      Use the GA (gathering) bit scores	in the model to	 set  hit  re-
	      porting  and  inclusion  thresholds. GA thresholds are generally
	      considered to be the reliable curated thresholds defining	family
	      membership;  for	example, in Rfam, these	thresholds define what
	      gets included in Rfam Full alignments  based  on	searches  with
	      Rfam Seed	models.

	      Use  the	NC (noise cutoff) bit score thresholds in the model to
	      set hit reporting	and inclusion thresholds.  NC  thresholds  are
	      generally	 considered  to	 be  the  score	of the highest-scoring
	      known false positive.

	      Use the TC (trusted cutoff) bit score thresholds in the model to
	      set  hit	reporting  and inclusion thresholds. TC	thresholds are
	      generally	considered to be the score of the lowest-scoring known
	      true positive that is above all known false positives.

       Infernal	 1.1  searches are accelerated in a six-stage filter pipeline.
       The first five stages use a profile HMM to define  envelopes  that  are
       passed  to  the stage six CM CYK	filter.	Any envelopes that survive all
       filters are assigned final scores using the the	CM  Inside  algorithm.
       (See the	user guide for more information.)

       The profile HMM filter is built by the cmbuild program and is stored in

       Each successive filter is slower	than the previous one, but better than
       it  at disciminating between subsequences that may contain high-scoring
       CM hits and those that do not. The first	three HMM  filter  stages  are
       the  same  as  those used in HMMER3.  Stage 1 (F1) is the local HMM SSV
       filter modified for long	sequences. Stage  2  (F2)  is  the  local  HMM
       Viterbi	filter.	 Stage 3 (F3) is the local HMM Forward filter. Each of
       the first three stages uses the profile HMM in local mode, which	allows
       a target	subsequence to align to	any region of the HMM. Stage 4 (F4) is
       a glocal	HMM filter, which requires a target subsequence	 to  align  to
       the  full-length	 profile  HMM. Stage 5 (F5) is the glocal HMM envelope
       definition filter, which	uses HMMER3's domain identification heursitics
       to define envelope boundaries. After each stage from 2 to 5 a bias fil-
       ter step	(F2b, F3b, F4b,	and F5b) is used to remove sequences that  ap-
       pear to have passed the filter due to biased composition	alone. Any en-
       velopes that survive stages F1 through F5b are then passed with the lo-
       cal CM CYK filter. The CYK filter uses constraints (bands) derived from
       an HMM alignment	of the envelope	to reduce the number of	required  cal-
       culations  and  save time.  Any envelopes that pass CYK are scored with
       the local CM Inside algorithm, again using HMM bands for	acceleration.

       The default filter thresholds that define the  minimum  score  required
       for  a  subsequence to survive each stage are defined based on the size
       of the database in _seqdb_ (or the size _x_ in megabases	(Mb) specified
       by  the -Z _x_ or --FZ _x_ options).  For larger	databases, the filters
       are more	strict leading to more acceleration but	potentially a  greater
       loss  of	 sensitivity. The rationale is that for	larger databases, hits
       must  have  higher  scores  to  achieve	statistical  significance,  so
       stricter	filtering that removes lower scoring insignificant hits	is ac-

       The P-value thresholds for all possible search space sizes and all fil-
       ter  stages  are	 listed	 next. (A P-value threshold of 0.01 means that
       roughly 1% of the highest scoring  nonhomologous	 subsequence  are  ex-
       pected  to  pass	the filter.) Z is defined as the number	of nucleotides
       in the complete target sequence file times 2 because both strands  will
       be searched with	each model.

       If  Z  is  less than 2 Mb: F1 is	0.35; F2 and F2b are off; F3, F3b, F4,
       F4b and F5 are 0.02; F6 is 0.0001.

       If Z is between 2 Mb and	20 Mb: F1 is 0.35; F2 and  F2b	are  off;  F3,
       F3b, F4,	F4b and	F5 are 0.005; F6 is 0.0001.

       If  Z is	between	20 Mb and 200 Mb: F1 is	0.35; F2 and F2b are 0.15; F3,
       F3b, F4,	F4b and	F5 are 0.003; F6 is 0.0001.

       If Z is between 200 Mb and 2 Gb:	F1 is 0.15; F2 and F2b are  0.15;  F3,
       F3b, F4,	F4b, F5, and F5b are 0.0008; and F6 is 0.0001.

       If  Z  is  between 2 Gb and 20 Gb: F1 is	0.15; F2 and F2b are 0.15; F3,
       F3b, F4,	F4b, F5, and F5b are 0.0002; and F6 is 0.0001.

       If Z is more than 20 Gb:	F1 is 0.06; F2 and F2b are 0.02; F3, F3b,  F4,
       F4b, F5,	and F5b	are 0.0002; and	F6 is 0.0001.

       These thresholds	were chosen based on performance on an internal	bench-
       mark testing many different possible settings.

       There are five options for controlling  the  general  filtering	level.
       These  options are, in order from least strict (slowest but most	sensi-
       tive) to	most strict (fastest but  least	 sensitive):  --max,  --nohmm,
       --mid,  --default,  (this  is  the default setting), --rfam.  and --hm-
       monly.  With --default the filter thresholds will be database-size  de-
       pendent.	 See the explanation of	each of	these individual options below
       for more	information.

       Additionally, an	expert user can	precisely control  each	 filter	 stage
       score  threshold	 with the --F1,	--F1b, --F2, --F2b, --F3, --F3b, --F4,
       --F4b, --F5, --F5b, and --F6 options. As	well as	turn each stage	on  or
       off with	the --noF1, --doF1b, --noF2, --noF2b, --noF3, --noF3b, --noF4,
       --noF4b,	--noF5,	and --noF6.  options.  These  options  are  only  dis-
       played  if the --devhelp	option is used to keep the number of displayed
       options with -h reasonable, and because they are	only  expected	to  be
       useful to a small minority of users.

       As  a  special  case,  for any models in	_cmfile_ which have zero base-
       pairs, profile HMM searches are run instead of CM searches.  HMM	 algo-
       rithms are more efficient than CM algorithms, and the benefit of	CM al-
       gorithms	is lost	for models with	no  secondary  structure  (zero	 base-
       pairs).	These  profile HMM searches will run significantly faster than
       the CM searches.	You can	force HMM-only searches	with the --hmmonly op-
       tion.  For more information on HMM-only searches	see the	description of
       the --hmmonly option below, and the user	guide.

       --max  Turn off all filters, and	run non-banded Inside on  every	 full-
	      length  target sequence. This increases sensitivity somewhat, at
	      an extremely large cost in speed.

	      Turn off all HMM filter stages (F1 through F5b). The CYK filter,
	      using QDBs, will be run on every full-length target sequence and
	      will enforce a P-value threshold	of  0.0001.  Each  subsequence
	      that  survives CYK will be passed	to Inside, which will also use
	      QDBs (but	a looser set). This increases sensitivity somewhat, at
	      a	very large cost	in speed.

       --mid  Turn off the HMM SSV and Viterbi filter stages (F1 through F2b).
	      Set remaining HMM	filter thresholds (F3 through F5b) to 0.02  by
	      default,	but  changeable	 to _x_	with --Fmid _x_	sequence. This
	      may increase sensitivity,	at a significant cost in speed.

	      Use the default filtering	strategy. This option  is  on  by  de-
	      fault.  The  filter thresholds are determined based on the data-
	      base size.

       --rfam Use a strict filtering  strategy	devised	 for  large  databases
	      (more  than  20 Gb). This	will accelerate	the search at a	poten-
	      tial cost	to sensitivity.	It will	have no	effect if the database
	      is larger	than 20	Gb.

	      Only use the filter profile HMM for searches, do not use the CM.
	      Only filter stages F1 through F3 will be executed, using	strict
	      P-value  thresholds  (0.02  for F1, 0.001	for F2 and 0.00001 for
	      F3).  Additionally a bias	composition filter is used  after  the
	      F1  stage	 (with	P=0.02 survival	threshold).  Any hit that sur-
	      vives all	stages and has an HMM E-value or bit score  above  the
	      reporting	 threshold  will  be  output.  The user	can change the
	      HMM-only filter thresholds and options  with  --hmmF1,  --hmmF2,
	      --hmmF3,	--hmmnobias,  --hmmnonull2, and	--hmmmax.  By default,
	      searches for any model with zero basepairs will be run  in  HMM-
	      only mode. This can be turned off, forcing CM searches for these
	      models with the --nohmmonly option.  These options are only dis-
	      played if	the --devhelp option is	used.

       --FZ _x_
	      Set  filter thresholds as	the defaults used if the database were
	      <x> megabases (Mb). If used with <x> greater than	20000 (20  Gb)
	      this option has the same effect as --rfam.

       --Fmid _x_
	      With  the	--mid option set the HMM filter	thresholds (F3 through
	      F5b) to _x_.  By default,	_x_ is 0.02.

	      Turn off truncated hit detection.

	      Allow truncated hits to begin and	end at any position in a  tar-
	      get  sequence.  By  default,  5' truncated hits must include the
	      first residue of their target sequence  and  3'  truncated  hits
	      must  include  the  final	residue	of their target	sequence. With
	      this option you may observe fewer	full length hits  that	extend
	      to the beginning and end of the query CM.

	      Turn  off	the null3 CM score corrections for biased composition.
	      This correction is not used during the HMM filter	stages.

       --mxsize	_x_
	      Set the maximum allowable	CM DP matrix size to _x_ megabytes. By
	      default  this  size  is 128 Mb.  This should be large enough for
	      the vast majority	of searches, especially	with  smaller  models.
	      If  cmsearch  encounters	an envelope in the CYK or Inside stage
	      that requires a larger matrix, the envelope will	be  discounted
	      from  consideration.  This behavior is like an additional	filter
	      that prevents expensive (slow) CM	DP calculations, but at	a  po-
	      tential cost to sensitivity.  Note that if cmsearch is being run
	      in _n_ multiple threads on a multicore machine then each	thread
	      may  have	 an allocated matrix of	up to size _x_ Mb at any given

       --smxsize _x_
	      Set the maximum allowable	 CM  search  DP	 matrix	 size  to  _x_
	      megabytes.  By default this size is 128 Mb.  This	option is only
	      relevant if the CM will not use HMM banded matrices, i.e.	if the
	      --max,  --nohmm, --qdb, --fqdb, --nonbanded, or --fnonbanded op-
	      tions are	also used. Note	that if	cmsearch is being run  in  _n_
	      multiple	threads	 on  a	multicore machine then each thread may
	      have an allocated	matrix of up to	size _x_ Mb at any given time.

       --cyk  Use the CYK algorithm, not Inside, to determine the final	 score
	      of all hits.

       --acyk Use   the	  CYK	algorithm  to  align  hits.  By	 default,  the
	      Durbin/Holmes optimal accuracy algorithm is  used,  which	 finds
	      the  alignment  that  maximizes  the  expected  accuracy	of all
	      aligned residues.

       --wcx _x_
	      For each CM, set the W parameter,	the expected maximum length of
	      a	 hit,  to  _x_ times the consensus length of the model.	By de-
	      fault, the W parameter is	read from the CM file and  was	calcu-
	      lated  based on the transition probabilities of the model	by cm-
	      build.  You can find out what the	default	W is for a model using
	      cmstat.	This  option should be used with caution as it impacts
	      the filtering pipeline at	several	different stages in nonobvious
	      ways. It is only recommended for expert users searching for hits
	      that are much longer than	any of the homologs used to build  the
	      model  in	 cmbuild,  e.g.	ones with large	introns	or other large
	      insertions.  This	option cannot be used in combination with  the
	      --nohmm,	--fqdb	or  --qdb  options because in those cases W is
	      limited by query-dependent bands.

	      Only search the top (Watson) strand of target sequences in  _se-
	      qdb_.   By  default,  both strands are searched. This will halve
	      the database size	(Z).

	      Only search the bottom (Crick) strand  of	 target	 sequences  in
	      _seqdb_.	 By  default,  both  strands  are searched.  This will
	      halve the	database size (Z).

       --tformat _s_
	      Assert that the target sequence database file is in format  _s_.
	      Accepted	formats	include	fasta, embl, genbank, ddbj, stockholm,
	      pfam, a2m, afa, clustal, and phylip The default is to autodetect
	      the format of the	file.

       --cpu _n_
	      Set  the	number of parallel worker threads to _n_.  By default,
	      Infernal sets this to the	number of CPU cores it detects in your
	      machine  -  that is, it tries to maximize	the use	of your	avail-
	      able processor cores. Setting _n_	 higher	 than  the  number  of
	      available	 cores	is of little if	any value, but you may want to
	      set it to	something less.	You can	also control  this  number  by
	      setting  an environment variable,	INFERNAL_NCPU.	This option is
	      only available if	Infernal was compiled with POSIX threads  sup-
	      port.  This  is  the default, but	it may have been turned	off at
	      compile-time for your site or machine for	some reason.

	      For debugging the	MPI master/worker version: pause after	start,
	      to  enable the developer to attach debuggers to the running mas-
	      ter and worker(s)	processes. Send	SIGCONT	signal to release  the
	      pause.  (Under gdb: (gdb)	signal SIGCONT)	(Only available	if op-
	      tional MPI support was enabled at	compile-time.)

       --mpi  Run in MPI master/worker mode, using mpirun.  To use --mpi,  the
	      sequence	file  must  have  first	 been 'indexed'	using the esl-
	      sfetch  program,	which  is  included  with  Infernal,  in   the
	      easel/miniapps/  subdirectory.   (Only available if optional MPI
	      support was enabled at compile-time.)

       See infernal(1) for a master man	page with a list of all	the individual
       man pages for programs in the Infernal package.

       For  complete documentation, see	the user guide that came with your In-
       fernal distribution (Userguide.pdf); or see the Infernal	web page ().

       Copyright (C) 2019 Howard Hughes	Medical	Institute.
       Freely distributed under	the BSD	open source license.

       For additional information on copyright and  licensing,	see  the  file
       called  COPYRIGHT  in your Infernal source distribution,	or see the In-
       fernal web page ().

       The Eddy/Rivas Laboratory
       Janelia Farm Research Campus
       19700 Helix Drive
       Ashburn VA 20147	USA

Infernal 1.1.3			   Nov 2019			   cmsearch(1)


Want to link to this manual page? Use this URL:

home | help