fasta36/ssearch36/[t]fast[x,y]36/lalign36    1(local)

NAME
       fasta36 - scan a	protein	or DNA sequence	library	for similar sequences

       fastx36	 - compare a DNA sequence to a protein sequence	database, com-
       paring the translated DNA sequence in forward and reverse frames.

       tfastx36	 - compare a protein sequence to a DNA sequence	database, cal-
       culating	 similarities with frameshifts to the forward and reverse ori-
       entations.

       fasty36	- compare a DNA	sequence to a protein sequence database,  com-
       paring the translated DNA sequence in forward and reverse frames.

       tfasty36	 - compare a protein sequence to a DNA sequence	database, cal-
       culating	similarities with frameshifts to the forward and reverse  ori-
       entations.

       fasts36 - compare unordered peptides to a protein sequence database

       fastm36	-  compare ordered peptides (or	short DNA sequences) to	a pro-
       tein (DNA) sequence database

       tfasts36	- compare unordered peptides  to  a  translated	 DNA  sequence
       database

       fastf36 - compare mixed peptides	to a protein sequence database

       tfastf36	- compare mixed	peptides to a translated DNA sequence database

       ssearch36  -  compare  a	protein	or DNA sequence	to a sequence database
       using the Smith-Waterman	algorithm.

       ggsearch36 - compare a protein or DNA sequence to a  sequence  database
       using a global alignment	(Needleman-Wunsch)

       glsearch36  -  compare a	protein	or DNA sequence	to a sequence database
       with alignments that are	global in the query and	local in the  database
       sequence	(global-local).

       lalign36 - produce multiple non-overlapping alignments for protein and
       DNA sequences using the Huang and Miller sim implementation of the
       Waterman-Eggert algorithm.

       prss36,	prfx36	-  discontinued;  all the FASTA	programs will estimate
       statistical significance	using 500 shuffled sequence scores if two  se-
       quences are compared.

DESCRIPTION
       Release	3.6  of	 the  FASTA package provides a modular set of sequence
       comparison programs that	can run	on conventional	single processor  com-
       puters  or  in  parallel	on multiprocessor computers. More than a dozen
       programs	    -	  fasta36,     fastx36/tfastx36,     fasty36/tfasty36,
       fasts36/tfasts36, fastm36, fastf36/tfastf36, ssearch36, ggsearch36, and
       glsearch36 - are	currently available.

       All the comparison programs share a set of basic	command	line  options;
       additional options are available	for individual comparison functions.

       Threaded versions of the FASTA programs (built by default under
       Unix/Linux/MacOS) run in parallel on modern Linux and Unix multi-core
       or multi-processor computers.  Accelerated versions of the Smith-
       Waterman algorithm are available for the Intel SSE2 and Altivec
       PowerPC architectures, which can speed up Smith-Waterman calculations
       10- to 20-fold.

       In addition to the serial and threaded versions of the FASTA  programs,
       MPI  parallel  versions	are  available	as fasta36_mpi,	ssearch36_mpi,
       fastx36_mpi, etc. The MPI parallel versions use the same	 command  line
       options as the serial and threaded versions.

Running	the FASTA programs
       By default, the FASTA programs are no longer interactive; they are run
       from the command line by specifying the program, query.file, and
       library.file.  Program options must precede the query.file and
       library.file arguments:

     fasta36 -option1 -option2 -option3	query.file library.file	> fasta.output

       The "classic" interactive mode, which prompts for a query.file and  li-
       brary.file,  is	available  with	 the -I	option.	 Typing	a program name
       without any arguments (ssearch36) provides a short help	message;  pro-
       gram_name -help provides	a complete set of program options.

       Program options MUST precede the query.file and library.file arguments.
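
       The ordering rule above can be sketched in Python.  This is a minimal
       illustration, not part of the FASTA package: the helper name
       build_fasta_command and the file names are hypothetical, and the
       function only assembles an argument list without running anything.

```python
def build_fasta_command(program, query_file, library_file, options=()):
    """Return an argv list with every option placed before the two file arguments."""
    argv = [program]
    argv.extend(options)                      # options come first ...
    argv.extend([query_file, library_file])   # ... then query.file and library.file
    return argv


# Hypothetical file names, used only for illustration.
cmd = build_fasta_command("fasta36", "query.file", "library.file",
                          options=["-E", "0.001", "-m", "8"])
print(" ".join(cmd))
```

       The resulting list can be handed to subprocess.run() with stdout
       redirected to a file, which mirrors the "> fasta.output" redirection
       shown earlier.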

FASTA program options
       The  default  scoring matrix and	gap penalties used by each of the pro-
       grams have been selected	for high sensitivity searches with the various
       algorithms.   The default program behavior can be modified by providing
       command line options before the query.file and library.file  arguments.
       Command line options can	also be	used in	interactive mode.

       Command line arguments come in several classes.

       (1) Commands that specify the comparison type. FASTA, FASTS, FASTM,
       SSEARCH, GGSEARCH, and GLSEARCH can compare either protein or DNA
       sequences, and attempt to recognize the comparison type by looking at
       the residue composition. -n and -p specify DNA (nucleotide) or protein
       comparison, respectively. -U specifies RNA comparison.

       (2) Commands that limit the set of sequences compared: -1, -3, -M.

       (3) Commands that modify the scoring parameters: -f gap-open penalty,
       -g gap-extend penalty, -j inter-codon frameshift, within-codon
       frameshift, -s scoring-matrix, -r match/mismatch score, -x X:X score.

       (4) Commands that modify the algorithm (mostly FASTA and [T]FASTX/Y):
       -c, -w, -y, -o. The -S option can be used to ignore lower-case (low
       complexity) residues during the initial score calculation.

       (5) Commands that modify the output: -A, -b number, -C width, -d
       number, -L, -m 0-11,B, -w line-width, -W context-width, -o
       offset1,offset2

       (6) Commands that affect	statistical estimates: -Z, -k.

Option summary:
       -1     Sort by "init1" score (obsolete)

       -3     ([t]fast[x,y] only) use only forward frame translations

       -a     Displays the full length (including unaligned regions) of both
	      sequences with fasta36, ssearch36, glsearch36, and fasts36.

       -A     (fasta36 only) For DNA:DNA, force Smith-Waterman alignment for
	      output.  Smith-Waterman is the default for FASTA protein
	      alignment and [t]fast[x,y], but not for DNA comparisons with
	      FASTA.  For protein:protein, use band-alignment algorithm.

       -b #   number  of  best scores/descriptions to show (must be < expecta-
	      tion cutoff if -E	is given).  By	default,  this	option	is  no
	      longer used; all scores better than the expectation (E())	cutoff
	      are listed. To guarantee the display of  #  descriptions/scores,
	      use  -b  =#,  i.e.  -b =100 ensures that 100 descriptions/scores
	      will be displayed.  To guarantee at  least  1  description,  but
	      possibly many more (limited by -E	e_cut),	use -b >1.

       -c "E-opt E-join"
	      threshold	for gap	joining	(E-join) and band optimization (E-opt)
	      in FASTA and [T]FASTX/Y.	FASTA36	now uses BLAST-like  statisti-
	      cal  thresholds  for joining and band optimization.  The default
	      statistical thresholds for protein  and  translated  comparisons
	      are  E-opt=0.2,  E-join=0.5;  for	 DNA,  E-join =	0.1 and	E-opt=
	      0.02. The	actual number of joins and optimizations  is  reported
	      after the E-join and E-opt scoring parameters.  Statistical
	      thresholds improve search speed 2 - 3X, and provide much more
	      accurate statistical estimates for matrices other than BLOSUM50.
	      The "classic" joining/optimization thresholds that were the
	      default in fasta35 and earlier programs are available using -c O
	      (upper case O), possibly followed by a value > 1.0 to set the opt-
	      cut optimization threshold.

       -C #   length of	name abbreviation in alignments, default = 6.  Must be
	      less than	20.

       -d #   number of best alignments to show (must be < expectation (-E)
	      cutoff and <= the -b description limit).

       -D     turn  on	debugging  mode.   Enables checks on sequence alphabet
	      that cause problems with tfastx36, tfasty36 (only	available  af-
	      ter compile time option).	 Also preserves	temp files with	-e ex-
	      pand_script.sh option.

       -e expand_script.sh
	      Run a script to expand the set of sequences displayed/aligned
	      based on the results of the initial search.  When the -e
	      expand_script.sh option is used, after the initial scan and
	      statistics calculation, but before the "Best scores" are shown,
	      expand_script.sh is run with a single argument, the name of a
	      file that contains the accession information (the text on the
	      fasta description line between > and the first space) and the
	      E()-value for the sequence.  expand_script.sh then uses this
	      information to send a library of additional sequences to stdout.
	      These additional sequences are included in the list of
	      high-scoring sequences (if their scores are significant) and
	      aligned.  The additional sequences do not change the statistics
	      or database size.

       -E e_cut	e_cut_r
	      expectation value upper limit for score and alignment display.
	      Defaults are 10.0 for FASTA36 and SSEARCH36 protein searches,
	      5.0 for translated DNA/protein comparisons, and 2.0 for DNA/DNA
	      searches.  FASTA version 36 now reports additional alignments
	      between the query and the library sequence; the second value
	      (e_cut_r) sets the threshold for these subsequent alignments.
	      If it is not given, the threshold is e_cut/10.0.  If it is given
	      and value > 1.0, e_cut_r = e_cut / value; for value < 1.0,
	      e_cut_r = value.  If e_cut_r < 0, the additional alignment
	      option is disabled.
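
	      The rule for the second -E value can be written out as a short
	      Python sketch.  The function name repeat_alignment_cutoff is
	      hypothetical, and the handling of a value of exactly 1.0 is an
	      assumption, since the description above does not cover that
	      case.

```python
def repeat_alignment_cutoff(e_cut, value=None):
    """Return e_cut_r, or None when additional alignments are disabled."""
    if value is None:
        e_cut_r = e_cut / 10.0      # default: one tenth of e_cut
    elif value > 1.0:
        e_cut_r = e_cut / value     # values above 1.0 act as a divisor
    else:
        # value < 1.0 is used directly as the threshold.  A value of exactly
        # 1.0 is not described in the text; treating it the same way here is
        # an assumption made for this sketch.
        e_cut_r = value
    return None if e_cut_r < 0 else e_cut_r


print(repeat_alignment_cutoff(10.0))          # second value omitted
print(repeat_alignment_cutoff(10.0, 100.0))   # divisor form
print(repeat_alignment_cutoff(10.0, 0.001))   # direct threshold
print(repeat_alignment_cutoff(10.0, -1.0))    # negative: option disabled
```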

       -f #   penalty for opening a gap.

       -F #   expectation value lower limit for score and alignment display.
	      -F 1e-6 prevents library sequences with E()-values lower than
	      1e-6 from being displayed.  This allows the user to focus on
	      more distant relationships.

       -g #   penalty for additional residues in a gap

       -h     Show short help message.

       -help  Show long	help message, with all options.

       -H     show histogram (with fasta-36.3.4, the histogram is not shown by
	      default).

       -i     (fasta DNA, [t]fast[x,y]) compare against only the reverse
	      complement of the library sequence.

       -I     interactive mode;	prompt for query filename, library.

       -j # # ([t]fast[x,y] only) penalty for a	frameshift between two codons,
	      ([t]fasty	only) penalty for a frameshift within a	codon.

       -J     (lalign36	only) show identity alignment.

       -k     specify  number of shuffles for statistical parameter estimation
	      (default=500).

       -l str specify FASTLIBS file

       -L     report long sequence description in alignments (up to 200	 char-
	      acters).

       -m 0,1,2,3,4,5,6,8,9,10,11,B,BB,"F# out.file" alignment display
	      options.	 -m  0,	1, 2, 3	display	different types	of alignments.
	      -m 4 provides an alignment "map" on the query. -m	5 combines the
	      alignment	 map and a -m 0	alignment.  -m 6 provides an HTML out-
	      put.

       -m 8 seeks to mimic BLAST -m 8 tabular output.  Only query and
	      library sequence names, and  identity,  mismatch,	 starts/stops,
	      E()-values,  and	bit  scores are	displayed.  -m 8C mimics BLAST
	      tabular format with comment lines.  -m 8	formats	 do  not  show
	      alignments.
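
	      A minimal Python sketch for reading -m 8 or -m 8C output.  It
	      assumes the twelve tab-separated columns of BLAST tabular
	      output (query, subject, percent identity, alignment length,
	      mismatches, gap opens, query start/end, subject start/end,
	      E()-value, bit score).  The column names and the sample line
	      are invented for illustration and are not produced by any real
	      search.

```python
# Labels chosen for this sketch; the FASTA programs do not print these names.
COLUMNS = ["query", "subject", "percent_identity", "alignment_length",
           "mismatches", "gap_opens", "q_start", "q_end",
           "s_start", "s_end", "evalue", "bit_score"]
INT_FIELDS = ["alignment_length", "mismatches", "gap_opens",
              "q_start", "q_end", "s_start", "s_end"]
FLOAT_FIELDS = ["percent_identity", "evalue", "bit_score"]


def parse_tabular(text):
    """Parse tab-separated hit lines, skipping blank and '#' comment lines."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue                      # comment lines appear with -m 8C
        fields = line.split("\t")
        if len(fields) != len(COLUMNS):
            raise ValueError("expected %d columns, got %d"
                             % (len(COLUMNS), len(fields)))
        hit = dict(zip(COLUMNS, fields))
        for name in INT_FIELDS:
            hit[name] = int(hit[name])
        for name in FLOAT_FIELDS:
            hit[name] = float(hit[name])
        hits.append(hit)
    return hits


# Made-up sample data, not the result of a real search.
sample = "# example comment line\nq1\ts1\t95.00\t100\t5\t0\t1\t100\t11\t110\t1.2e-30\t120.5\n"
hits = parse_tabular(sample)
print(hits[0]["subject"], hits[0]["evalue"])
```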

       -m 9 does not change the	alignment output, but provides
	      alignment	 coordinate  and percent identity information with the
	      best scores report.  -m 9c adds encoded alignment	information to
	      the  -m  9;  -m 9C adds encoded alignment	information as a CIGAR
	      formatted string.  To accommodate frameshifts, the CIGAR format
	      has  been	 supplemented with F (forward) and R (reverse).	 -m 9i
	      provides only percent identity and alignment length  information
	      with  the	 best scores.  With current versions of	the FASTA pro-
	      grams, independent -m options can	be combined; e.g. -m 1	-m  9c
	      -m 6.

       -m 11 provides lav format output	from lalign36.	It does	not
	      currently	 affect	 other	alignment  algorithms.	The lav2ps and
	      lav2svg programs can be used to convert  lav  format  output  to
	      postscript/SVG alignment "dot-plots".

       -m B provides BLAST-like	alignments.  Alignments	are labeled as
	      "Query"  and  "Sbjct",  with coordinates on the same line	as the
	      sequences, and BLAST-like	symbols	for matches and	mismatches. -m
	      BB extends BLAST similarity to all the output, providing an out-
	      put that closely mimics BLAST output.

       -m "F# out.file" allows one search to write different alignment
	      formats to different files.  The 'F' indicates separate file
	      output; the '#' is the output format (1-6,8,9,10,11,B,BB).
	      Multiple compatible formats can be combined, separated by
	      commas (',').

       -M #-# molecular	 weight	(residue) cutoffs.  -M "101-200" examines only
	      library sequences	that are 101-200 residues long.

       -n     force query to nucleotide	sequence

       -N #   break long library sequences into	blocks of # residues.	Useful
	      for  bacterial  genomes, which have only one sequence entry.  -N
	      2000 works well for bacterial genomes. (This option was
	      required	when  FASTA  only  provided  one alignment between the
	      query and	library	sequence.  It is not as	useful,	now that  mul-
	      tiple alignments are available.)

       -o "#,#"
	      offsets query, library sequence for numbering alignments

       -O file
	      send output to file.

       -p     force query to protein alphabet.

       -P pssm_file
	      (ssearch36,  ggsearch36,	glsearch36  only).   Provide  blastpgp
	      checkpoint file as the PSSM for searching. Two PSSM file formats
	      are  available,  which  must  be	provided  with	the  filename.
	      'pssm_file 0' uses a binary format  that	is  machine  specific;
	      'pssm_file 1' uses the "blastpgp -u 1 -C pssm_file" ASN.1	binary
	      format (preferred).

       -q/-Q  quiet option; do not prompt for input (on	by default)

       -r "+n/-m"
	      (DNA only) values for match/mismatch for DNA comparisons.  +n is
	      used for the maximum positive value and -m is used for the
	      maximum negative value.  Values between max and min are rescaled,
	      but residue pairs having the value -1 continue to be -1.

       -R file
	      save all scores to statistics file (previously -r	file)

       -s name
	      specify substitution matrix.  BLOSUM50 is used by default;
	      PAM120, PAM250, and BLOSUM62 can be specified by setting -s
	      P120, P250, or BL62.  Additional scoring matrices include:
	      BLOSUM80 (BL80), and MDM10, MDM20, MDM40 (Jones, Taylor, and
	      Thornton, 1992 CABIOS 8:275-282; specified as -s MD10, -s MD20,
	      -s MD40), OPTIMA5 (-s OPT5, Kann and Goldstein, (2002) Proteins
	      48:367-376), and VTML160 (-s VT160, Mueller and Vingron (2002)
	      J. Comp. Biol. 19:8-13).  Each scoring matrix has associated
	      default gap penalties.  The BLOSUM62 scoring matrix and -11/-1
	      gap penalties can be specified with -s BP62.

	      Alternatively, a BLASTP format scoring matrix file can be	speci-
	      fied, e.g. -s matrix.filename.  DNA scoring matrices can also be
	      specified	with the "-r" option.

	      With fasta36.3, variable scoring matrices	can  be	 specified  by
	      preceding the scoring matrix abbreviation with '?', e.g. -s
	      '?BP62'. Variable	scoring	matrices allow the FASTA  programs  to
	      choose  an  alternative  scoring	matrix with higher information
	      content (bit score/position) when	short queries are  used.   For
	      example,	a  90  nucleotide  FASTX  query	 can produce only a 30
	      amino-acid alignment, so a scoring matrix	with  1.33  bits/posi-
	      tion  is	required to produce a 40 bit score. The	FASTA programs
	      include BLOSUM50 (0.49 bits/pos) and  BLOSUM62  (0.58  bits/pos)
	      but can range to MD10 (3.44 bits/position). The variable scoring
	      matrix option searches down the list of scoring matrices to find
	      one  with	 information  content  high enough to produce a	40 bit
	      alignment	score.
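
	      The arithmetic behind the variable scoring matrix option can be
	      sketched in Python.  Only the three bits/position values quoted
	      above are used; the list searched by the real programs is
	      longer, and the function names and the selection loop are a
	      simplified illustration rather than the programs' actual code.

```python
# Information content (bits/position) for the three matrices quoted above,
# ordered from lowest to highest.
MATRIX_BITS = [("BLOSUM50", 0.49), ("BLOSUM62", 0.58), ("MD10", 3.44)]


def required_bits_per_position(aligned_residues, target_bits=40.0):
    """Bits/position needed for an alignment of this length to reach target_bits."""
    return target_bits / aligned_residues


def pick_matrix(aligned_residues, matrices=MATRIX_BITS, target_bits=40.0):
    """Return the first matrix in the list with enough information content."""
    needed = required_bits_per_position(aligned_residues, target_bits)
    for name, bits in matrices:
        if bits >= needed:
            return name
    return None  # no listed matrix can reach the target score


aa_length = 90 // 3    # a 90 nucleotide FASTX query aligns at most 30 amino acids
print(round(required_bits_per_position(aa_length), 2))
print(pick_matrix(aa_length))
print(pick_matrix(100))
```

	      For the 30 residue case this needs about 1.33 bits/position, so
	      neither BLOSUM50 nor BLOSUM62 is sufficient and the sketch falls
	      through to MD10; a 100 residue alignment needs only 0.4
	      bits/position, which BLOSUM50 already provides.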

       -S     treat lower case letters in the query or database	 as  low  com-
	      plexity  regions	that  are equivalent to	'X' during the initial
	      database scan, but are treated as	normal residues	for the	 final
	      alignment	display.  Statistical estimates	are based on the 'X'ed
	      out sequence used	during the initial search.  Protein  databases
	      (and query sequences) can	be generated in	the appropriate	format
	      using John Wootton's "pseg" program, available from
	      ftp://ftp.ncbi.nih.gov/pub/seg/pseg.  Once you have compiled the
	      "pseg" program, use the command:

	      pseg database.fasta -z 1 -q  > database.lc_seg

       -t #   Translation table - [t]fastx36 and [t]fasty36 support the BLAST
	      translation tables.  See
	      http://www.ncbi.nih.gov/htbin-post/Taxonomy/wprintgc?mode=c/.

       -T #   (threaded, parallel only)	number of threads or  workers  to  use
	      (on  Linux/MacOS/Unix,  the default is to	use as many processors
	      as are available;	on Windows systems, 2 processors are used).

       -U     Do RNA sequence comparisons: treat 'T' as	'U',  allow  G:U  base
	      pairs (by	scoring	"G-A" and "T-C"	as score(G:G)-3).  Search only
	      one strand.

       -V "?$%*"
	      Allow special annotation characters in  query  sequence.	 These
	      characters will be displayed in the alignments on	the coordinate
	      number line.

       -w #   line width for similarity score, sequence alignment, output.

       -W #   context length (default is 1/2 of line width -w) for alignment
	      programs, like fasta and ssearch, that provide additional
	      sequence context.

       -X     extended, less frequently used options.  These include -XB,
	      -XM4G, -Xo, -Xx, and -Xy; see fasta_guide.pdf.

       -z 1, 2,	3, 4, 5, 6
	      Specify  the  statistical	calculation. Default is	-z 1 for local
	      similarity searches, which uses regression against the length of
	      the library sequence. -z -1 disables statistics.	-z 0 estimates
	      significance without normalizing for sequence length. -z 2  pro-
	      vides  maximum  likelihood estimates for lambda and K, censoring
	      the 250 lowest and 250 highest scores. -z	3  uses	 Altschul  and
	      Gish's statistical estimates for specific	protein	BLOSUM scoring
	      matrices and gap penalties.  -z  4,5:  an	 alternate  regression
	      method.	-z 6 uses a composition	based maximum likelihood esti-
	      mate based on the	 method	 of  Mott  (1992)  Bull.  Math.	 Biol.
	      54:59-75.

       -z 11,12,14,15,16
	      compute  the  regression	against	 scores	 of  randomly shuffled
	      copies of	the library sequences.	Twice as many comparisons  are
	      performed,  but  accurate	 estimates can be generated from data-
	      bases of related sequences. -z  11  uses	the  -z	 1  regression
	      strategy,	etc.

       -z 21, 22, 24, 25, 26
	      compute  two E()-values.	The standard (library-based) E()-value
	      is calculated in the standard way	(-z 1, 2, etc),	but  a	second
	      E2() value is calculated by shuffling the	high-scoring sequences
	      (those with E()-values less than the threshold).	For  "average"
	      composition  proteins,  these  two  estimates  will  be  similar
	      (though the best-shuffle estimates  are  always  more  conserva-
	      tive).   For  biased composition proteins, the two estimates may
	      differ by	100-fold or more.  A second -z option, e.g. -z "21 2",
	      specifies	 the  estimation method	for the	best-shuffle E2()-val-
	      ues. Best-shuffle	E2()-values approximate	the estimates given by
	      PRSS (or in a pairwise SSEARCH).
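
	      The numbering of the -z codes listed above follows a pattern:
	      the tens digit selects where the scores come from and the units
	      digit selects the estimation method.  A small Python sketch of
	      that reading follows; the function name describe_z and the
	      wording of the labels are invented for this example.

```python
# Codes that appear in the option descriptions above.
LISTED_CODES = {-1, 0, 1, 2, 3, 4, 5, 6,
                11, 12, 14, 15, 16,
                21, 22, 24, 25, 26}

# Labels are this sketch's paraphrase of the three groups of -z options.
SCORE_SOURCE = {
    0: "library scores",
    1: "scores of shuffled library sequences",
    2: "library scores plus best-shuffle E2()",
}


def describe_z(code):
    """Split a listed -z code into (score source, base method number)."""
    if code not in LISTED_CODES:
        raise ValueError("-z %d is not listed in the option summary" % code)
    if code == -1:
        return ("statistics disabled", None)
    group, method = divmod(code, 10)
    return (SCORE_SOURCE[group], method)


print(describe_z(1))
print(describe_z(11))
print(describe_z(21))
```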

       -Z db_size
	      Set the apparent database	size used for expectation value	calcu-
	      lations (used for	protein/protein	FASTA  and  SSEARCH,  and  for
	      [T]FASTX/Y).

Reading	sequences from STDIN
       The FASTA programs can accept a query sequence from the unix "stdin"
       data stream.  This makes it much easier to use fasta36 and its
       relatives as part of a WWW page.  To indicate that stdin is to be used,
       use "@" as the query sequence file name.  "@" can also be used to
       specify a subset of the query sequence to be used, e.g.:

     cat query.aa | fasta36 @:50-150 s

       would search the 's' database with residues 50-150 of query.aa.  FASTA
       cannot automatically detect the sequence type (protein vs DNA) when
       "stdin" is used and assumes protein comparisons by default; the '-n'
       option is required for DNA queries read from stdin.
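
       A minimal Python sketch of how a calling program might assemble such a
       command.  The helper names stdin_query and build_stdin_command are
       hypothetical; the sketch only builds the argument list and does not
       run a search.

```python
def stdin_query(start=None, end=None):
    """Return the query argument that tells a FASTA program to read stdin."""
    if start is None and end is None:
        return "@"
    if start is None or end is None:
        raise ValueError("give both start and end, or neither")
    return "@:%d-%d" % (start, end)


def build_stdin_command(program, library, start=None, end=None, dna=False):
    """Build an argv list for a search whose query arrives on stdin."""
    argv = [program]
    if dna:
        argv.append("-n")   # the sequence type is not auto-detected on stdin
    argv.extend([stdin_query(start, end), library])
    return argv


print(build_stdin_command("fasta36", "s", 50, 150))
print(build_stdin_command("fasta36", "s", dna=True))
```

       The list can then be passed to subprocess.run() with the query text
       supplied through its input argument, which plays the role of the
       "cat query.aa |" pipe in the example above.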

Environment variables:
       FASTLIBS
	      location of library choice file (-l FASTLIBS)

       SRCH_URL1, SRCH_URL2
	      format strings used to define options to re-search the database.

       REF_URL
	      the format string used to define the option to look up the
	      library sequence in Entrez, or some other database.

AUTHOR
       Bill Pearson
       wrp@virginia.EDU

       Version:	$ Id: $	Revision: $Revision: 210 $
