FreeBSD Manual Pages

exonerate(1)		   sequence comparison tool		  exonerate(1)

NAME
       exonerate - a generic tool for sequence comparison

SYNOPSIS
       exonerate [ options ] <query path> <target path>

DESCRIPTION
       exonerate is a general tool for sequence	comparison.

       It  uses	the C4 dynamic programming library.  It	is designed to be both
       general and fast.  It can produce either	gapped or ungapped alignments,
       according  to  a	variety	of different alignment models.	The C4 library
       allows sequence alignment using a reduced space full  dynamic  program-
       ming implementation, but	also allows automated generation of heuristics
       from the	alignment models, using	bounded	sparse dynamic programming, so
       that these alignments may also be rapidly generated.  Alignments	gener-
       ated using these	heuristics will	represent a  valid  path  through  the
       alignment  model,  yet  (unlike the exhaustive alignments), the results
       are not guaranteed to be	optimal.

CONVENTIONS
       A number of conventions (and idiosyncrasies) are used within exonerate.
       An understanding	of them	facilitates interpretation of the output.

       Coordinates
	      An in-between coordinate system is used, where the positions are
	      counted between the symbols, rather than on the  symbols.	  This
	      numbering	 scheme	starts from zero.  This	numbering is shown be-
	      low for the sequence "ACGT":

	       A C G T
	      0	1 2 3 4

	      Hence the	 subsequence  "CG"  would  have	 start=1,  end=3,  and
	      length=2.	  This coordinate system is used internally in exoner-
	      ate, and for all the output formats produced with	the  exception
	      of  the  "human  readable"  alignment display and	the GFF	output
	      where convention and standards dictate otherwise.
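
       Python slices happen to use the same in-between convention, so the
       "ACGT" example above can be checked directly.  The following is a
       minimal sketch; the helper name to_one_based is hypothetical and only
       illustrates the conversion to the 1-based, inclusive positions used
       by GFF.

```python
def to_one_based(start, end):
    """Convert an in-between (start, end) pair to 1-based inclusive positions."""
    return start + 1, end


seq = "ACGT"
start, end = 1, 3            # in-between coordinates of "CG"
subseq = seq[start:end]      # Python slicing counts between symbols too
length = end - start

print(subseq, length, to_one_based(start, end))
```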

       Reverse Complements
	      When an alignment	is reported on the reverse complement of a se-
	      quence,  the coordinates are simply given	on the reverse comple-
	      ment copy	of the sequence.  Hence	positions on the sequences are
	      never  negative.	 Generally, the	forward	strand is indicated by
	      '+', the reverse strand by '-', and an unknown or	not-applicable
	      strand  (as  in  the case	of a protein sequence) is indicated by
	      '.'
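
       Because positions are counted between symbols, a region reported on
       the reverse complement copy can be mapped back to the forward strand
       by subtracting from the sequence length.  The sketch below is only an
       illustration of that arithmetic; the function names are hypothetical
       and are not part of exonerate.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def revcomp(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def to_forward(start, end, seq_length):
    """Map in-between coordinates on the reverse-complement copy
    to in-between coordinates on the forward strand."""
    return seq_length - end, seq_length - start


seq = "AAACGT"
rc_start, rc_end = 0, 3                       # region on the revcomp copy
fw_start, fw_end = to_forward(rc_start, rc_end, len(seq))

# The same bases are selected either way, just read on opposite strands.
print(revcomp(seq)[rc_start:rc_end], revcomp(seq[fw_start:fw_end]))
```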

       Alignment Scores
	      Currently, only the raw alignment scores are displayed.  This
	      score is simply the sum of the transition scores used in the
	      dynamic programming.  For example, in the case of a
	      Smith-Waterman alignment, this will be the sum of the
	      substitution matrix scores and the gap penalties.
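
       As a rough illustration of how such a raw score adds up, the sketch
       below sums substitution scores and affine gap penalties over a
       pairwise alignment.  The score values and the convention that the
       first gap column pays the open penalty and later columns pay the
       extension penalty are assumptions made for this example, not a
       statement of how exonerate scores gaps internally.

```python
MATCH, MISMATCH = 5, -4          # illustrative substitution scores
GAP_OPEN, GAP_EXTEND = -12, -4   # illustrative affine gap penalties


def raw_score(query_row, target_row):
    """Sum substitution scores and gap penalties over two aligned rows.

    Both rows must have equal length; '-' marks a gap column.
    """
    if len(query_row) != len(target_row):
        raise ValueError("aligned rows must have the same length")
    score = 0
    in_gap = False
    for q, t in zip(query_row, target_row):
        if q == "-" or t == "-":
            # First column of a gap pays the open penalty, the rest extend.
            score += GAP_EXTEND if in_gap else GAP_OPEN
            in_gap = True
        else:
            score += MATCH if q == t else MISMATCH
            in_gap = False
    return score


print(raw_score("ACGTTACG", "ACG--ACC"))
```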

GENERAL	OPTIONS
       Most arguments have short and long forms.   The	long  forms  are  more
       likely  to  be  stable  over  time, and hence should be used in scripts
       which call exonerate.

       -h | --shorthelp	<boolean>
	      Show help.  This will display a concise summary of the available
	      options, defaults	and values currently set.

       --help <boolean>
	      This  shows  all	the  help  options including the defaults, the
	      value currently set, and the environment variable	which  may  be
	      used  to	set  each  parameter.	There will be an indication of
	      which options are	mandatory.  Mandatory options have no default,
	      and  must	have a value supplied for exonerate to run.  If	manda-
	      tory options are used in order, their flags may be skipped  from
	      the  command  line  (see examples	below).	 Unlike	this man page,
	      the information from this	option will always be up to date  with
	      the latest version of the	program.

       -v | --version <boolean>
	      Display  the  version  number.   Also displays other information
	      such as the build	date and glib version used.

SEQUENCE INPUT OPTIONS
       Pairwise	comparisons will be performed between all query	sequences  and
       all target sequences.  Generally, for the best performance, shorter se-
       quences (eg. ESTs, shotgun reads, proteins) should be used as the query
       sequences,  and longer sequences	(eg. genomic sequences)	should be used
       as the target sequences.

       -q | --query  <paths>
	      Specify the query sequences required.  These must be in a FASTA
	      format file.  Single or multiple query sequences may be
	      supplied.  Additionally, multiple FASTA files may be supplied
	      following a --query flag, or by using multiple --query flags.

       -t | --target <paths>
	      Specify the target sequences required.  These must also be in a
	      FASTA format file.  As with the query sequences, single or
	      multiple target sequences and files may be supplied.  The target
	      filename may be replaced by a server name and port number in the
	      form hostname:port when using exonerate-server.  See the man
	      page for
	      exonerate-server	for  more  information on running exonerate in
	      client:server mode.  NEW(v2.4.0):	multiple servers  may  now  be
	      used.   These  will  be  queried in parallel if you have set the
	      --cores option.  NEW(v2.4.0): If an input	file is	 not  a	 FASTA
	      format  file,  it	 is  assumed  to contain a list	of other fasta
	      files, directories or servers (one per line).

       -Q | --querytype	<dna | protein>
	      Specify the query	type to	use.  If this  is  not	supplied,  the
	      query  type  is assumed to be DNA	when the first sequence	in the
	      file contains more than 85% [ACGTN] bases.  Otherwise, it	is as-
	      sumed  to	be peptide.  This option forces	the query type as some
	      nucleotide and peptide sequences can fall	either	side  of  this
	      threshold.
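
       The guessing rule described above can be sketched as follows.  This
       is a minimal illustration only: the function name is hypothetical,
       and treating lower-case symbols the same as upper-case ones is an
       assumption.

```python
def guess_type(first_sequence, threshold=0.85):
    """Guess 'dna' when more than 85% of the symbols are in [ACGTN],
    otherwise 'protein'."""
    if not first_sequence:
        raise ValueError("cannot guess the type of an empty sequence")
    nucleotide_like = sum(1 for c in first_sequence.upper() if c in "ACGTN")
    fraction = nucleotide_like / len(first_sequence)
    return "dna" if fraction > threshold else "protein"


print(guess_type("ACGTACGTNN"))        # all symbols are in [ACGTN]
print(guess_type("MKTAYIAKQR"))        # few symbols are in [ACGTN]
# A peptide rich in A, C, G and T residues lands on the DNA side of the
# threshold, which is why forcing the type can be necessary.
print(guess_type("GATTACAGATTACAP"))
```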

       -T | --targettype <dna |	protein>
	      Specify the target type to use.  This is the same as --querytype
	      (above), except that it applies to the target.  Specifying the
	      sequence type will avoid the overhead of having to read the
	      first sequence in the database twice (which may be significant
	      with chromosome-sized sequences).

       --querychunkid <id>

       --querychunktotal <total>

       --targetchunkid <id>

       --targetchunktotal <total>
	      These options facilitate running exonerate on compute farms,
	      and avoid having to split up sequence databases into small
	      chunks to run on different nodes.  If, for example, you wished
	      to split the target database into three parts, you would run
	      three exonerate jobs on different nodes, including the options:

	      --targetchunkid 1	--targetchunktotal 3
	      --targetchunkid 2	--targetchunktotal 3
	      --targetchunkid 3	--targetchunktotal 3
	      NB.  The	granularity offered by this option only	goes down to a
	      single sequence, so when there are more chunks than sequences in
	      the database, some processes will	do nothing.
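
       When many chunks are needed, the per-node command lines can be
       generated rather than typed.  The following sketch only builds the
       argument lists and does not run anything; the file names are
       placeholders and the helper name is hypothetical.

```python
def chunk_commands(query, target, total):
    """Return one exonerate argument list per target chunk (ids 1..total)."""
    if total < 1:
        raise ValueError("total must be at least 1")
    return [
        ["exonerate", "--query", query, "--target", target,
         "--targetchunkid", str(chunk_id),
         "--targetchunktotal", str(total)]
        for chunk_id in range(1, total + 1)
    ]


for argv in chunk_commands("query.fasta", "target.fasta", 3):
    # Each list could be handed to a job scheduler or to subprocess.run().
    print(" ".join(argv))
```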

       -V | --verbose <int>
	      Be verbose - show information about what is going on during the
	      analysis.  The default is 1 (little information); the higher the
	      number given, the more information is printed.  To silence all
	      the default output from exonerate, use:

	      --verbose 0 --showalignment no --showvulgar no

ANALYSIS OPTIONS
       -E | --exhaustive <boolean>
	      Specify  whether or not exhaustive alignment should be used.  By
	      default, this is FALSE, and alignment heuristics will  be	 used.
	      If  it  is  set  to TRUE,	an exhaustive alignment	will be	calcu-
	      lated.  This requires quadratic time, and	 will  be  much,  much
	      slower, but will provide the optimal result for the given	model.
       -B | --bigseq <int>
	      Perform alignment of large (multi-megabase) sequences.  This is
	      very memory efficient and fast when both sequences are
	      chromosome-sized, but does not currently permit the use of a
	      word neighbourhood (ie. exactly matching seeds only).
       --revcomp <boolean>
	      Include comparison of the reverse complement of the query and
	      target where possible.  By default, this option is enabled, but
	      when you know the gene is definitely on the forward strand of
	      the query and target, disabling this option can halve the time
	      taken to compute alignments.
       --forcescan <none | query | target>
	      Force the	FSM to scan the	query sequence rather than the target.
	      This  option  is useful, for example, if you have	a single piece
	      of genomic sequence and you wish to compare it to the whole of
	      dbEST.   By  scanning  the  database, rather than	the query, the
	      analysis will be completed much more quickly, as	the  overheads
	      of  multiple query FSM construction, multiple target reading and
	      splice site predictions will be removed.	By default,  exonerate
	      will  guess  the	optimal	 strategy  based  on database sequence
	      sizes.
       --saturatethreshold <number>
	      When set to zero,	this option  does  nothing.   Otherwise,  once
	      more than	this number of words (in addition to the expected num-
	      ber of words by chance) have matched a position  on  the	query,
	      the  position  on	 the  query  will  be 'numbed' (ignore further
	      matches) for the current pairwise	comparison.
       --customserver <command>
	      When using exonerate in client:server mode with  a  non-standard
	      server,  this command allows you to send a custom	command	to the
	      server.  This command is sent by the client  (exonerate)	before
	      any  other commands, and is provided as a	way of passing parame-
	      ters or other commands specific to the custom server.   See  the
	      exonerate-server	man page for more information on running exon-
	      erate in client:server mode.
       --cores <number>
	      The number of cores/CPUs/threads that  should  be	 used.	 On  a
	      multi-core or multi-CPU machine, increasing this number allows
	      alignment	 computations  to  run	 in   parallel	 on   separate
	      CPUs/cores.   NB.	  Generally,  it  is better to parallelise the
	      analysis by splitting it up into separate	jobs, but this	option
	      may  prove  useful  for problems such as interactive single-gene
	      queries.

FASTA DATABASE OPTIONS
       --fastasuffix <extension>
	      If any of the inputs given with --query or --target are
	      directories, then exonerate will recursively descend these
	      directories, reading all files ending with this suffix as fasta
	      format input.

GAPPED ALIGNMENT OPTIONS
       -m | --model <alignment model>
	      Specify the alignment model to use.  The models  currently  sup-
	      ported are:
	      ungapped
		     The simplest type of model, used by default.  An
		     appropriate model will be selected automatically for the
		     type of input sequences provided.
	      ungapped:trans
		     This ungapped model includes translation of all frames of
		     both the query and	target sequences.  This	is similar  to
		     an	ungapped tblastx type search.
	      affine:global
		     This  performs  gapped  global  alignment,	similar	to the
		     Needleman-Wunsch  algorithm,  except  with	 affine	 gaps.
		     Global  alignment	requires  that	both  the sequences in
		     their entirety are	included in the	alignment.
	      affine:bestfit
		     This performs a best fit or best  location	 alignment  of
		     the query onto the	target sequence.  The entire query se-
		     quence will be included in	the alignment,	but  only  the
		     best location for its alignment on	the target sequence.
	      affine:local
		     This  is local alignment with affine gaps,	similar	to the
		     Smith-Waterman-Gotoh algorithm.  A	general-purpose	align-
		     ment  algorithm.	As this	is local alignment, any	subse-
		     quence of the query and target sequence may appear	in the
		     alignment.
	      affine:overlap
		     This type of alignment finds the best overlap between the
		     query and target.	The overlap alignment must include the
		     start  of the query or target and the end of the query or
		     the target	sequence, to align sequences which overlap  at
		     the ends, or in the mid-section of a longer sequence.
		     This is the type of alignment frequently used in assembly
		     algorithms.
	      est2genome
		     This  model  is similar to	the affine:local model,	but it
		     also includes intron modelling on the target sequence  to
		     allow  alignment of spliced to unspliced coding sequences
		     for both forward and reversed genes.  This	is similar  to
		     the  alignment models used	in programs such as EST_GENOME
		     and sim4.
	      ner    NERs are non-equivalenced regions - large regions in both
		     the  query	 and target which are not aligned.  This model
		     can be used for protein alignments	 where	strongly  con-
		     served  helix  regions  will  be aligned, but weakly con-
		     served loop regions are not.  Similarly, this model could
		     be	used to	look for co-linearly conserved regions in com-
		     parison of	genomic	sequences.
	      protein2dna
		     This model	compares a protein sequence to a DNA sequence,
		     incorporating all the appropriate gaps and	frameshifts.
	      protein2dna:bestfit
		     This  is a	bestfit	version	of the protein2dna model, with
		     which the entire protein is included  in  the  alignment.
		     It	 is  currently	only  available	 when using exhaustive
		     alignment.
	      protein2genome
		     This model allows alignment of a protein sequence to
		     genomic DNA.  This is similar to the protein2dna model,
		     with the addition of modelling of introns and intron
		     phases.  This model is similar to those used by genewise.
	      protein2genome:bestfit
		     This  is  a  bestfit version of the protein2genome	model,
		     with which	the entire protein is included in  the	align-
		     ment.   It	is currently only available when using exhaus-
		     tive alignment.
	      coding2coding
		     This model	is similar to the ungapped:trans model,	except
		     that  gaps	and frameshifts	are allowed.  It is similar to
		     a gapped tblastx search.
	      coding2genome
		     This is similar to	the est2genome model, except that  the
		     query  sequence is	translated during comparison, allowing
		     a more sensitive comparison.
	      cdna2genome
		     This combines properties of the est2genome and
		     coding2genome models, to allow modelling of a whole cDNA
		     where a central coding region can be flanked by non-coding
		     UTRs.  When the CDS start and end are known, they may be
		     specified using the --annotation option (see below) to
		     permit only the correct coding region to appear in the
		     alignment.
	      genome2genome
		     This model	is similar to the coding2coding	model,	except
		     introns  are  modelled  on	 both sequences.  (not working
		     well yet)

       The short names u, u:t, a:g, a:b, a:l, a:o, e2g, ner, p2d, p2d:b,
       p2g, p2g:b, c2c, c2g, cd2g and g2g can also be used for specifying
       models.

       -s | --score <threshold>
	      This is the overall score	threshold.  Alignments will not	be re-
	      ported below this	 threshold.   For  heuristic  alignments,  the
	      higher this threshold, the less time the analysis	will take.

       --percent <percentage>
	      Report only alignments scoring at least this percentage of the
	      maximal score for each query.  For example, use --percent 90 to
	      report alignments with at least 90% of the maximal score
	      obtainable for that query.  This option is useful not only
	      because it reduces the spurious matches in the output, but also
	      because it generates query-specific thresholds (unlike --score)
	      for a set of queries of differing lengths, and it will also
	      speed up the search considerably.  NB. with this option, it is
	      possible to have a cDNA match its corresponding gene exactly,
	      yet still score less than 100%, due to the addition of the
	      intron penalty scores, hence this option must be used with
	      caution.

       --showalignment <boolean>
	      Show the alignments in a human-readable form.

       --showsugar <boolean>
	      Display "sugar" output for ungapped alignments.  Sugar is Simple
	      UnGapped Alignment Report, which displays ungapped alignments
	      one-per-line.  The sugar line starts with the string "sugar:"
	      for easy extraction from the output, and is followed by the
	      following 9 fields in the order below:

	      query_id	      Query identifier
	      query_start     Query position at	alignment start
	      query_end	      Query position at alignment end
	      query_strand    Strand of	query matched
	      target_id	      |
	      target_start    |	the same 4 fields
	      target_end      |	for the	target sequence
	      target_strand   |
	      score	      The raw alignment	score
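
       A sugar line can be split into these nine fields with a few lines of
       Python.  This is a minimal sketch that assumes the fields are
       separated by whitespace and that identifiers contain no spaces; the
       sample line is invented for illustration and is not real exonerate
       output.

```python
from typing import NamedTuple


class Sugar(NamedTuple):
    query_id: str
    query_start: int
    query_end: int
    query_strand: str
    target_id: str
    target_start: int
    target_end: int
    target_strand: str
    score: int


def parse_sugar(line):
    """Parse one 'sugar:' line into its nine fields."""
    prefix = "sugar:"
    if not line.startswith(prefix):
        raise ValueError("not a sugar line")
    f = line[len(prefix):].split()
    if len(f) != 9:
        raise ValueError("expected 9 fields, found %d" % len(f))
    return Sugar(f[0], int(f[1]), int(f[2]), f[3],
                 f[4], int(f[5]), int(f[6]), f[7], int(f[8]))


hit = parse_sugar("sugar: q1 0 120 + t1 500 620 + 540")
print(hit.query_id, hit.target_start, hit.score)
```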

       --showcigar <boolean>
	      Show the alignments in "cigar" format.  Cigar is a Compact Idio-
	      syncratic	Gapped Alignment Report, which displays	gapped	align-
	      ments one-per-line.  The format starts with the same 9 fields as
	      sugar output (see	above),	and is followed	by a series of <opera-
	      tion,  length>  pairs where operation is one of match, insert or
	      delete, and the length describes the number of times this	opera-
	      tion is repeated.
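
       The operation and length pairs that follow the sugar fields can be
       collected as shown in this sketch.  It assumes whitespace-separated
       fields and single-letter operation codes (M, I and D for match,
       insert and delete); the sample line is invented for illustration.

```python
def parse_cigar(line):
    """Split a 'cigar:' line into its 9 sugar fields and a list of
    (operation, length) pairs."""
    fields = line.split()
    if not fields or fields[0] != "cigar:":
        raise ValueError("not a cigar line")
    sugar, ops = fields[1:10], fields[10:]
    if len(sugar) != 9 or len(ops) % 2 != 0:
        raise ValueError("malformed cigar line")
    pairs = [(ops[i], int(ops[i + 1])) for i in range(0, len(ops), 2)]
    return sugar, pairs


sugar, pairs = parse_cigar(
    "cigar: q1 0 15 + t1 100 115 + 42 M 5 I 2 M 3 D 2 M 5")
for operation, length in pairs:
    print(operation, length)
```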

       --showvulgar <boolean>
	      Show the alignments in "vulgar" format.  Vulgar is Verbose
	      Useful Labelled Gapped Alignment Report.  This format also starts
	      with the same 9 fields as sugar output (see above), and is
	      followed by a series of <label, query_length, target_length>
	      triplets.  The label may be one of the following:

	      M	     Match
	      C	     Codon
	      G	     Gap
	      N	     Non-equivalenced region
	      5	     5'	splice site
	      3	     3'	splice site
	      I	     Intron
	      S	     Split codon
	      F	     Frameshift
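
       Since each triplet records how far the alignment advances on the
       query and on the target, the triplet lengths can be summed and
       compared with the spans given in the sugar fields.  The sketch below
       assumes whitespace-separated fields and a forward-strand alignment;
       the sample line is invented for illustration.

```python
def parse_vulgar(line):
    """Split a 'vulgar:' line into its 9 sugar fields and a list of
    (label, query_length, target_length) triplets."""
    fields = line.split()
    if not fields or fields[0] != "vulgar:":
        raise ValueError("not a vulgar line")
    sugar, rest = fields[1:10], fields[10:]
    if len(sugar) != 9 or len(rest) % 3 != 0:
        raise ValueError("malformed vulgar line")
    triplets = [(rest[i], int(rest[i + 1]), int(rest[i + 2]))
                for i in range(0, len(rest), 3)]
    return sugar, triplets


sugar, triplets = parse_vulgar(
    "vulgar: est1 0 20 + chr1 1000 1120 + 87 "
    "M 10 10 5 0 2 I 0 96 3 0 2 M 10 10")
query_span = sum(q for _, q, _ in triplets)
target_span = sum(t for _, _, t in triplets)
print(query_span, target_span)
```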

       --showquerygff <boolean>
	      Report  GFF  output  for	features  on  the query	sequence.  See
	      http://www.sanger.ac.uk/Software/formats/GFF for	more  informa-
	      tion.

       --showtargetgff <boolean>
	      Report GFF output	for features on	the target sequence.

       --ryo <format>
	      Roll-your-own  output  format.   This  allows specification of a
	      printf-esque format line which is	used to	specify	which informa-
	      tion  to	include	in the output, and how it is to	be shown.  The
	      format field may contain the following fields:

	      %[qt][idlsSt]
		     For either {query,target}, report the
		     {id,definition,length,sequence,Strand,type}.  Sequences
		     are reported in a fasta-format like block (no headers).
	      %[qt]a[bels]
		     For either	{query,target}	region	which  occurs  in  the
		     alignment,	report the {begin,end,length,sequence}
	      %[qt]c[bels]
		     For either {query,target} region which occurs in the
		     coding sequence in the alignment, report the
		     {begin,end,length,sequence}
	      %s     The raw score
	      %r     The rank (in results from a bestn search)
	      %m     Model name
	      %e[tism]
		     Equivalenced {total,id,similarity,mismatches} (ie.	%em ==
		     (%et - %ei))
	      %p[isS]
		     Percent {id,similarity,Self} over the  equivalenced  por-
		     tions  of	the  alignment.	 (ie. %pi == 100*(%ei /	%et)).
		     Percent Self is the score over the	equivalenced  portions
		     of	 the  alignment	as a percentage	of the self comparison
		     score of the query	sequence.
	      %g     Gene orientation ('+' = forward, '-' = reverse, '.' = un-
		     known)
	      %S     Sugar block (the 9 fields used in sugar output; see
		     above)
	      %C     Cigar block (the fields of	a cigar	line after  the	 sugar
		     portion)
	      %V     Vulgar block (the fields of a vulgar line after the sugar
		     portion)
	      %%     Expands to	a percentage sign (%)
	      \n     Newline
	      \t     Tab
	      \\     Expands to	a backslash (\)
	      \{     Open curly	brace
	      \}     Close curly brace
	      {	     Begin per-transition output section
	      }	     End per-transition	output section
	      %P[qt][sabe]
		     Per-transition output  for	 {query,target}	 {sequence,ad-
		     vance,begin,end}
	      %P[nsl]
		     Per-transition output for {name,score,label}

       This  option  is	 very useful and flexible.  For	example, to report all
       the sections of query sequences which feature in	 alignments  in	 fasta
       format, use:

       --ryo ">%qi %qd\n%qas\n"

       To  output  all	the  symbols and scores	in an alignment, try something
       like:

       --ryo "%V{%Pqs %Pts %Ps\n}"
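
       When exonerate is driven from a script, the format string should
       reach exonerate with its \t and \n escapes intact, which in Python
       means using a raw string.  The sketch below builds such a command
       line and checks the documented relationships %em == %et - %ei and
       %pi == 100*(%ei / %et) on one output line.  The file names and the
       sample line are invented, and the tolerance used for %pi is an
       assumption about how the value is rounded.

```python
# Raw string: exonerate, not Python, should expand the \t and \n escapes.
RYO = r"%qi\t%ti\t%et\t%ei\t%em\t%pi\n"

argv = ["exonerate", "--query", "query.fasta", "--target", "target.fasta",
        "--showalignment", "no", "--showvulgar", "no", "--ryo", RYO]


def check_ryo_line(line):
    """Parse one line produced with RYO and verify the documented
    relationships between the equivalenced counts."""
    qi, ti, et, ei, em, pi = line.rstrip("\n").split("\t")
    et, ei, em, pi = int(et), int(ei), int(em), float(pi)
    if em != et - ei:
        raise ValueError("mismatches should equal total minus identities")
    if abs(pi - 100 * ei / et) > 0.01:
        raise ValueError("percent id should equal 100 * ei / et")
    return qi, ti, pi


print(check_ryo_line("q1\tt1\t200\t190\t10\t95.00\n"))
```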

       -n | --bestn <number>
	      Report the best N results for each query.  (Only results scoring
	      better than the score threshold will be reported.)  This option
	      reduces the amount of output generated, and also allows
	      exonerate to speed up the search.

       -S | --subopt <boolean>
	      This option allows for the reporting of (Waterman-Eggert	style)
	      suboptimal  alignments.	(It is on by default.)	All suboptimal
	      (ie. non-intersecting) alignments	will be	reported for each pair
	      of sequences scoring at least the	threshold provided by --score.

	      When  this  option  is  used with	exhaustive alignments, several
	      full quadratic time passes will be required, so the running time
	      will be considerably increased.

       -g | --gappedextension <boolean>
	      Causes a gapped extension	stage to be performed ie. dynamic pro-
	      gramming is applied in arbitrarily shaped	and dynamically	 sized
	      regions  surrounding HSP seeds.  The extension threshold is con-
	      trolled by the --extensionthreshold option.

	      Although sometimes slower	than BSDP, gapped  extension  improves
	      sensitivity with weak, gap-rich alignments such as during	cross-
	      species comparison.

	      NB. This option is now the default. Set it to false to revert
	      to the old BSDP type alignments.	This option may	be slower than
	      BSDP for some large scale	analyses with simple alignment models.

       --refine	<strategy>
	      Force exonerate to refine	alignments generated by	heuristics us-
	      ing  dynamic  programming	 over larger regions.  This takes more
	      time, but	improves the quality of	the final alignments.

	      The strategies available for refinement are:

	      none   The default - no refinement is used.
	      full   An	exhaustive alignment is	calculated from	 the  pair  of
		     sequences in their	entirety.
	      region DP	is applied just	to the region of the sequences covered
		     by	the heuristic alignment.

       --refineboundary	<size>
	      Specify an extra boundary	to be included in the  region  subject
	      to alignment during refinement by	region.

VITERBI ALGORITHM OPTIONS
       -D | --dpmemory <Mb>
	      The  exhaustive  alignment traceback routines use	a Hughey-style
	      reduced memory technique.	 This option specifies how much	memory
	      will  be used for	this.  Generally, the more memory is permitted
	      here, the	faster the alignments will be produced.

CODE GENERATION	OPTIONS
       -C | --compiled <boolean>
	      This option allows disabling of generated	code for dynamic  pro-
	      gramming.	  It  is  mainly used during development of exonerate.
	      When set to FALSE, an "interpreted" version of the dynamic  pro-
	      gramming implementation is used, which is	much slower.

HEURISTIC OPTIONS
       --terminalrangeint
       --terminalrangeext
       --joinrangeint
       --joinrangeext
       --spanrangeint
       --spanrangeext
	      These  options are used to specify the size of the sub-alignment
	      regions to which DP is applied around  the  ends	of  the	 HSPs.
	      This can be at the HSP ends (terminal range), between HSPs (join
	      range), or between HSPs which may	be connected by	a large	region
	      such  as	an  intron  or	non-equivalenced  region (span range).
	      These ranges can be specified for	a number of matches back  onto
	      the HSP (internal	range) or out from the HSP (external range).

SEEDED DYNAMIC PROGRAMMING OPTIONS
       -x | --extensionthreshold <score>
	      This is the amount by which the score will be allowed to degrade
	      during SDP.  This	is the equivalent of the hspdropoff penalties,
	      except  it is applied during dynamic programming,	not HSP	exten-
	      sion.  Decreasing	this parameter will increase the speed of  the
	      SDP, and increasing it will increase the sensitivity.

       --singlepass  <boolean>
	      By default, the suboptimal SDP alignments are reported by a
	      single-pass algorithm, which may miss some suboptimal alignments
	      that are close together.  This option can be used to force the
	      use of a multipass suboptimal alignment algorithm for SDP,
	      resulting in higher quality suboptimal alignments.

BSDP OPTIONS
       --joinfilter <limit>
	      (experimental)

	      Only consider this number of SARs for joining HSPs together.
	      The SARs with the highest potential for appearing in a
	      high-scoring alignment are considered.  This option is useful
	      for limiting time and memory usage when searching unmasked data
	      with repetitive sequences, but should not be set too low, as
	      valid matches may be ignored.  Something like --joinfilter 32
	      seems to work well.

SEQUENCE OPTIONS
       --annotation <path>
	      Specify  basic  sequence	annotation  information.  This is most
	      useful with the cdna2genome model, but will work with other mod-
	      els.  The	annotation file	contains four fields per line:

	      <id> <strand> <cds_start>	<cds_length>

	      Here is a	simple example of such a file for 4 cDNAs:

	      dhh.human.cdna + 308 1191
	      dhh.mouse.cdna + 250 1191
	      csn7a.human.cdna + 178 828
	      csn7a.mouse.cdna + 126 828
	      These  annotation	 lines	will also work when only the first two
	      fields are used.	This can be used when specifying which	strand
	      of a specific sequence should be included	in a comparison.
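
       An annotation file in this layout can be read with a short parser
       such as the sketch below.  It assumes whitespace-separated fields and
       accepts either the two-field or the four-field form; the function
       name is hypothetical.

```python
def parse_annotation(text):
    """Parse annotation lines into {id: (strand, cds)} where cds is
    (cds_start, cds_length) or None when only two fields are present."""
    records = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue                       # skip blank lines
        if len(fields) == 2:
            cds = None
        elif len(fields) == 4:
            cds = (int(fields[2]), int(fields[3]))
        else:
            raise ValueError("expected 2 or 4 fields: %r" % line)
        records[fields[0]] = (fields[1], cds)
    return records


example = """\
dhh.human.cdna + 308 1191
dhh.mouse.cdna + 250 1191
csn7a.human.cdna +
"""
for seq_id, (strand, cds) in parse_annotation(example).items():
    print(seq_id, strand, cds)
```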

SYMBOL COMPARISON OPTIONS
       --softmaskquery <boolean>
	      Indicate	that  the  query is softmasked.	 See description below
	      for --softmasktarget
       --softmasktarget	<boolean>
	      Indicate that the	target is softmasked.	In  a  softmasked  se-
	      quence  file,  instead  of  masking regions by Ns	or Xs they are
	      masked by	putting	those regions in lower case (and with unmasked
	      regions  in  upper  case).  This option allows the masking to be
	      ignored by some parts of the program,  combining	the  speed  of
	      searching	 masked	 data  with  sensitivity of searching unmasked
	      data.  The utility fastasoftmask, which is supplied with
	      exonerate, can be used for producing softmasked sequence from
	      conventionally masked sequence.
       -d | --dnasubmat	<name>
	      Specify the substitution matrix to be used for DNA comparison.
	      This should be a path to a substitution matrix in the same
	      format as that which is used by blast.
       -p | --proteinsubmat <name>
	      Specify the substitution matrix to be used for protein
	      comparison.  (Both DNA and protein substitution matrices are
	      required for some types of analysis.)  The use of the special
	      names nucleic, blosum62, pam250, edit or identity will cause
	      built-in substitution matrices to be used.
ALIGNMENT SEEDING OPTIONS
       -M | --fsmmemory	<Mb>
	      Specify the amount of memory to use for  the  FSM	 in  heuristic
	      analyses.	  exonerate multiplexes	the query to accelerate	large-
	      throughput database queries.  This figure	should always be  less
	      than  the	 physical  memory  on  the machine, but	when searching
	      large databases, generally, the more memory  it  is  allowed  to
	      use, the faster it will go.
       --forcefsm <none	| normal | compact>
	      Force the	use of more compact finite state machines for analyses
	      involving	big sequences and large	word neighbourhoods.   By  de-
	      fault,  exonerate	 will pick a sensible strategy,	so this	option
	      will rarely need to be set.
       --wordjump <int>
	      The jump between query words used	to yield the  word  neighbour-
	      hood.  If	set to 1, every	word is	used, if set to	2, every other
	      word is used, and	if set to the wordlength, only non-overlapping
	      words will be used.  This	option reduces the memory requirements
	      when using very large query sequences, and makes the search  run
	      faster,  but it also damages search sensitivity when high	values
	      are set.
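
	      The effect of the jump on the set of query words can be
	      pictured with the sketch below.  It only illustrates which
	      word start positions are kept for a given jump and is not
	      taken from the exonerate source; the function name and the
	      word length are chosen for the example.

```python
def query_words(seq, wordlen, wordjump=1):
    """Return the words of length wordlen starting every wordjump symbols."""
    if wordlen < 1 or wordjump < 1:
        raise ValueError("wordlen and wordjump must be positive")
    return [seq[i:i + wordlen]
            for i in range(0, len(seq) - wordlen + 1, wordjump)]


seq = "ACGTACGT"
print(query_words(seq, 4, 1))   # every word
print(query_words(seq, 4, 2))   # every other word
print(query_words(seq, 4, 4))   # non-overlapping words only
```
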
       --wordambiguity <limit>
	      This option may be used to allow alignment seeds containing  IU-
	      PAC  ambiguity  symbols.	The limit is the maximum number	of am-
	      biguous words allowed at a single	position.  If  this  limit  is
	      reached  then  the  position  is not used	for alignment seeding.
	      Using this option	may slow down a	search.	 For  large  datasets,
	      it  is  recommended  to  use esd2esi --wordambiguity instead, as
	      then the speed overhead is only incurred during indexing,	rather
	      than during the database searching itself.  NB. This option only
	      works for	IUPAC symbols in the  target  sequence.	  Query	 words
	      containing IUPAC symbols are (currently) excluded	from seeding.
AFFINE MODEL OPTIONS
       -o | --gapopen <penalty>
	      This is the gap open penalty.
       -e | --gapextend	<penalty>
	      This is the gap extension	penalty.
       --codongapopen <penalty>
	      This is the codon	gap open penalty.
       --codongapextend	<penalty>
	      This is the codon	gap extension penalty.
NER OPTIONS
       --minner <length>
	      Minimum NER length allowed.
       --maxner	<length>
	      Maximum  NER  length  allowed.   NB.  this  option  only affects
	      heuristic	alignments.
       --neropen <penalty>
	      Penalty for opening a non-equivalenced region.
INTRON MODELLING OPTIONS
       --minintron <length>
	      Minimum intron length  limit.   NB.  this	 option	 only  affects
	      heuristic	 alignments.   This  is	not a hard limit - it only af-
	      fects size of introns which are sought during  heuristic	align-
	      ment.
       --maxintron <length>
	      Maximum intron length limit.  See	notes above for	--minintron
       -i | --intronpenalty <penalty>
	      Penalty for introduction of an intron.
FRAMESHIFT MODELLING OPTIONS
       -f | --frameshift <penalty>
	      The penalty for the inclusion of a frameshift in an alignment.
ALPHABET OPTIONS
       --useaatla <boolean>
	      Use  three-letter	abbreviations for AA names.  ie. when display-
	      ing alignment "Met" is used instead of " M "
TRANSLATION OPTIONS
       --geneticcode <code>
	      Specify an alternative genetic code.  The	default	 code  (1)  is
	      the standard genetic code.  Other genetic codes may be specified
	      in shorthand or longhand form.

	      In shorthand form, a number between 1 and	23 is used to  specify
	      one  of  17  built-in  genetic code variants.  These are genetic
	      code variants taken from:

	      http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi

	      These are:
	      1	     The Standard Code
	      2	     The Vertebrate Mitochondrial Code
	      3	     The Yeast Mitochondrial Code
	      4	     The Mold, Protozoan, and Coelenterate Mitochondrial  Code
		     and the Mycoplasma/Spiroplasma Code
	      5	     The Invertebrate Mitochondrial Code
	      6	     The Ciliate, Dasycladacean	and Hexamita Nuclear Code
	      9	     The Echinoderm and	Flatworm Mitochondrial Code
	      10     The Euplotid Nuclear Code
	      11     The Bacterial and Plant Plastid Code
	      12     The Alternative Yeast Nuclear Code
	      13     The Ascidian Mitochondrial	Code
	      14     The Alternative Flatworm Mitochondrial Code
	      15     Blepharisma Nuclear Code
	      16     Chlorophycean Mitochondrial Code
	      21     Trematode Mitochondrial Code
	      22     Scenedesmus obliquus mitochondrial	Code
	      23     Thraustochytrium Mitochondrial Code

	      In longhand form,	a genetic code variant may be provided as a 64
	      byte string in TCAG order, eg. the standard genetic code in this
	      form would be:

	      FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG
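
	      As a minimal sketch (not part of exonerate itself), the Python
	      below shows how a longhand string can be read as a codon table.
	      It assumes that "TCAG order" means the first base of the codon
	      varies slowest and the third base fastest.  The function name
	      codon_table is hypothetical.

```python
from itertools import product

# Longhand form of the standard genetic code, copied from the text above.
STANDARD_CODE = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"


def codon_table(longhand):
    """Map each codon to an amino acid, assuming TCAG order at all three bases."""
    if len(longhand) != 64:
        raise ValueError("a longhand genetic code must be exactly 64 characters")
    # product() varies the last position fastest: TTT, TTC, TTA, TTG, TCT, ...
    codons = ("".join(bases) for bases in product("TCAG", repeat=3))
    return dict(zip(codons, longhand))


table = codon_table(STANDARD_CODE)
print(table["ATG"], table["TGG"], table["TAA"])  # M W *
```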

HSP CREATION OPTIONS
       --hspfilter <threshold>
	      Use  aggressive  HSP  filtering  to speed	up heuristic searches.
	      The threshold specifies the number of HSPs centred about a point
	      in  the query which will be stored.  Any lower scoring HSPs will
	      be discarded.  This is an	experimental option  to	 handle	 speed
	      problems	caused	by some	sequences.  A value of about 100 seems
	      to work well.
       --useworddropoff	<boolean>
	      When this	is TRUE, the score threshold for admitting words  into
	      the word neighbourhood is	set to be the initial word score minus
	      the word threshold (see below).  This strategy is designed to
	      prevent restricting the word [...]  When this is FALSE, the word
	      threshold is taken to be an absolute value.
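
	      The two behaviours can be sketched as follows.  This shows only
	      the arithmetic stated above, not exonerate's implementation;
	      the function and parameter names are hypothetical.

```python
def word_admission_threshold(initial_word_score, word_threshold, use_word_dropoff):
    """Return the minimum score a word needs to enter the word neighbourhood."""
    if use_word_dropoff:
        # Threshold is relative: a dropoff below the initial word's own score.
        return initial_word_score - word_threshold
    # Threshold is absolute: the same cut-off for every initial word.
    return word_threshold


# Hypothetical numbers: an initial word scoring 30 and a word threshold of 5.
print(word_admission_threshold(30, 5, use_word_dropoff=True))   # 25
print(word_admission_threshold(30, 5, use_word_dropoff=False))  # 5
```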
       --seedrepeat <count>
	      The  seedrepeat parameter	sets the number	of seeds which must be
	      found on the same	diagonal or reading frame before HSP extension
	      will occur.  Increasing the value	for --seedrepeat will speed up
	      searches,	and is usually a better	option than using longer  word
	      lengths, particularly when using the exonerate-server, where
	      increasing word lengths requires recomputing the index and
	      greatly increases memory requirements.
       -w | --dnawordlen <bases>
       -W | --proteinwordlen <residues>
       --codonwordlen <bases>
	      The word length used for DNA, protein or codon words.  When
	      performing DNA vs protein comparisons, the DNA wordlength will
	      always (automatically) be triple the protein wordlength.
       --dnahspdropoff <score>
       --proteinhspdropoff <score>
       --codonhspdropoff <score>
	      The amount by which an HSP score will be allowed to degrade
	      during HSP extension.  Separate thresholds can be set for DNA,
	      protein or codon comparisons.
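
	      One common way to apply such a dropoff is to stop an ungapped
	      extension once the running score falls more than the dropoff
	      below the best score seen so far.  The sketch below assumes that
	      scheme and a simple match/mismatch scoring, both chosen for
	      illustration; exonerate's own extension code and scoring are not
	      described here and may differ.

```python
def extend_right(query, target, q_start, t_start, dropoff, match=5, mismatch=-4):
    """Extend an ungapped HSP to the right; return (best_score, best_length)."""
    score = best_score = 0
    length = best_length = 0
    while q_start + length < len(query) and t_start + length < len(target):
        same = query[q_start + length] == target[t_start + length]
        score += match if same else mismatch
        length += 1
        if score > best_score:
            best_score, best_length = score, length
        elif best_score - score > dropoff:
            break  # the score has degraded by more than the allowed amount
    return best_score, best_length


# A generous dropoff lets the extension cross a single mismatch;
# a small dropoff stops the extension at the mismatch.
print(extend_right("AAAATAAAA", "AAAACAAAA", 0, 0, dropoff=10))  # (36, 9)
print(extend_right("AAAATAAAA", "AAAACAAAA", 0, 0, dropoff=3))   # (20, 4)
```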
       --dnahspthreshold <score>
       --proteinhspthreshold <score>
       --codonhspthreshold <score>
	      The  HSP score thresholds.  An HSP must score at least this much
	      before it	will be	reported  or  be  used	in  preparation	 of  a
	      heuristic	alignment.
       --dnawordlimit  <score>
       --proteinwordlimit  <score>
       --codonwordlimit	 <score>
	      The  threshold  for admitting DNA	or protein words into the word
	      neighbourhood.  The behaviour of this option is altered  by  the
	      --useworddropoff option (see above).

       --geneseed <threshold>
	      Exclude HSPs from gapped alignment computation which cannot
	      feature in an alignment containing at least one HSP scoring at
	      least this threshold.

	      This  option provides considerable speed up for gapped alignment
	      computation, but may cause some very gap-rich alignments	to  be
	      missed.

	      It is useful when aligning similar sequences back onto a genome
	      quickly, e.g. try --geneseed 250.
       --geneseedrepeat	<count>
	      The geneseedrepeat parameter is like the seedrepeat parameter,
	      but is only applied when looking for the geneseed HSPs.  Using a
	      larger value for --geneseedrepeat will speed up searches when
	      the --geneseed parameter is also used.  (Experimental;
	      implementation incomplete.)
ALIGNMENT OPTIONS
       --alignmentwidth	<width>
	      Width of alignment display.  The default is 80.
       --forwardcoordinates <boolean>
	      By default, all coordinates are reported on the forward  strand.
	      Setting  this  option  to	 false	reverts	 to  the old behaviour
	      (pre-0.8.3) whereby alignments on	the reverse  complement	 of  a
	      sequence	are  reported using coordinates	on the reverse comple-
	      ment.
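
	      With the in-between coordinate system, a position p on the
	      reverse complement of a sequence of length L lies at position
	      L - p on the forward strand.  The sketch below illustrates only
	      this mapping; the helper names are hypothetical, and the order
	      in which exonerate reports the two converted endpoints is not
	      covered here.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def to_forward(position, seq_length):
    """Map an in-between position on the reverse complement to the forward strand."""
    return seq_length - position


seq = "AACGT"
rc = reverse_complement(seq)            # "ACGTT"
start_rc, end_rc = 0, 2                 # the subsequence "AC" on the reverse complement

# Convert both endpoints, then sort them so they can be used as a slice.
lo, hi = sorted(to_forward(p, len(seq)) for p in (start_rc, end_rc))
print(lo, hi, seq[lo:hi])               # 3 5 GT

# The forward-strand region is the reverse complement of the original region.
assert reverse_complement(seq[lo:hi]) == rc[start_rc:end_rc]
```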
SUB-ALIGNMENT REGION OPTIONS
       --quality <percent>
	      This option excludes HSPs	from BSDP when their  components  out-
	      side of the SARs fall below this quality threshold.
SPLICE SITE PREDICTION OPTIONS
       --splice3 <path>
       --splice5 <path>
	      Provide a	file containing	a custom PSSM (position	specific score
	      matrix) for prediction of	the intron splice sites.

	      The file format for splice data is simple: lines beginning  with
	      '#'  are	comments, a line containing just the word 'splice' de-
	      notes the	position of the	splice site, and the other lines  show
	      the  observed  relative  frequencies  of	the bases flanking the
	      splice sites in the chosen organism (in ACGT order).

	      Example 5' splice	data file:

	       # start of example 5' splice data
	       # A C G T
	       28 40  17  14
	       59 14  13  14
		8  5  81   6
	       splice
		0  0 100   0
		0  0   0 100
	       54  2  42   2
	       74  8  11   8
		5  6  85   4
	       16 18  21  45
	       # end of example 5' splice data

	      Example 3' splice	data file:

	       # start of example 3' splice data
	       # A C G T
		10  31	14  44
		 8  36	14  43
		 6  34	12  48
		 6  34	 8  52
		 9  37	 9  45
		 9  38	10  44
		 8  44	 9  40
		 9  41	 8  41
		 6  44	 6  45
		 6  40	 6  48
		23  28	26  23
		 2  79	 1  18
	       100   0	 0   0
		 0   0 100   0
	       splice
		28  14	47  11
	       # end of	example	3' splice data
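
	      The file format described above can be read with a few lines of
	      Python.  This is a sketch for checking a custom file before
	      passing it to exonerate, not exonerate's own parser; the
	      function name parse_splice_data is hypothetical, and values are
	      assumed to be whitespace-separated numbers in ACGT order.

```python
EXAMPLE_5_PRIME = """\
# start of example 5' splice data
# A C G T
28 40  17  14
59 14  13  14
 8  5  81   6
splice
 0  0 100   0
 0  0   0 100
54  2  42   2
74  8  11   8
 5  6  85   4
16 18  21  45
# end of example 5' splice data
"""


def parse_splice_data(text):
    """Return (rows, splice_index): one dict per position, and the number of
    rows that precede the 'splice' line."""
    rows = []
    splice_index = None
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line == "splice":
            splice_index = len(rows)
            continue
        a, c, g, t = (float(value) for value in line.split())
        rows.append({"A": a, "C": c, "G": g, "T": t})
    if splice_index is None:
        raise ValueError("no 'splice' line found")
    return rows, splice_index


rows, splice_index = parse_splice_data(EXAMPLE_5_PRIME)
# The two positions just after the splice site carry the G and T of the intron start.
print(splice_index, rows[splice_index]["G"], rows[splice_index + 1]["T"])
```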

       --forcegtag <boolean>
	      Only allow splice sites at gt....ag sites (or ct....ac sites
	      when the gene is reversed).  With this restriction in place,
	      the splice site prediction scores are still used and allow tie
	      breaking when there is more than one possible splice site.
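
	      The dinucleotide restriction itself can be sketched as below.
	      The helper name is hypothetical, and the sketch covers only the
	      gt....ag / ct....ac test, not the PSSM scoring used for tie
	      breaking.

```python
def is_gtag_intron(intron, gene_reversed=False):
    """Check whether an intron sequence has the permitted boundary dinucleotides."""
    intron = intron.lower()
    if len(intron) < 4:
        return False  # too short to hold both dinucleotides
    if gene_reversed:
        # Reverse complement of gt....ag reads ct....ac on the forward strand.
        return intron.startswith("ct") and intron.endswith("ac")
    return intron.startswith("gt") and intron.endswith("ag")


print(is_gtag_intron("GTAAGTTTTCAG"))                      # True
print(is_gtag_intron("CTGAAAACTTAC", gene_reversed=True))  # True
print(is_gtag_intron("GCAAGTTTTCAG"))                      # False
```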

STRATEGIES FOR SPEED
       Keep all	data on	local disks.

       Apply  the  highest  acceptable score thresholds	using a	combination of
       --score,	--percent and --bestn.

       Repeat mask and dust the	genomic	(target)  sequence.   (Softmask	 these
       sequences and use --softmasktarget).

       Increase	the --fsmmemory	option to allow	more query multiplexing.

       Increase the value for --seedrepeat.

       When  using  an	alignment  model containing introns, set --geneseed as
       high as possible.

       If you are compiling exonerate yourself,	see the	README	file  supplied
       with the	source code for	details	of compile-time	optimisations.

STRATEGIES FOR SENSITIVITY
       Not documented yet.

       Increase	the word neighbourhood.	 Decrease the HSP threshold.  Increase
       the SAR ranges.	Run exhaustively.

ENVIRONMENT
       Not documented yet.

EXAMPLES
       exonerate cdna.fasta genomic.fasta
	      This is the simplest way in which exonerate may be used.  By
	      default, an ungapped alignment model will be used.

       exonerate --exhaustive y --model est2genome cdna.fasta genomic.masked.fasta
	      Exhaustively align cDNAs to genomic sequence.  This will be
	      much, much slower, but more accurate.  This option causes
	      exonerate to behave like EST_GENOME.

       exonerate --exhaustive --model affine:local query.fasta target.fasta
	      If the affine:local model	is used	with exhaustive	alignment, you
	      have the Smith-Waterman algorithm.

       exonerate --exhaustive --model affine:global protein.fasta protein.fasta
	      Switch to	a global model,	and you	have Needleman-Wunsch.

       exonerate --wordthreshold 1 --gapped no --showhsp yes protein.fasta genome.fasta
	      Generate ungapped Protein:DNA alignments.

       exonerate --model coding2coding --score 1000 --bigseq yes --proteinhspthreshold 90 chr21.fa chr22.fa
	      Perform quick-and-dirty translated  pairwise  alignment  of  two
	      very large DNA sequences.

       Many similar combinations should	work.  Try them	out.

VERSION
       This documentation accompanies version 2.2.0 of the exonerate package.
AUTHOR
       Guy  St.C. Slater.  <guy@ebi.ac.uk>.  See the AUTHORS file accompanying
       the source code for a list of contributors.
AVAILABILITY
       The source code for the exonerate package is available under the terms
       of the GNU General Public Licence.

       Please see the file COPYING which was distributed with this package, or
       http://www.gnu.org/licenses/gpl.txt for details.

       This package has been developed as part of the Ensembl project.  Please
       see http://www.ensembl.org/ for more information.
SEE ALSO
       exonerate-server(1), ipcress(1),	blast(1L).

exonerate			 November 2002			  exonerate(1)
