FA2HTGS(1)		   NCBI	Tools User's Manual		    FA2HTGS(1)

       fa2htgs	- formatter for	high throughput	genome sequencing project sub-

       fa2htgs [-] [-6 str] [-7	str] [-A filename] [-C str] [-D] [-L filename]
       [-M str]	 [-N]  [-O filename] [-P str] [-Q filename] [-S	str] [-T file-
       name] [-X] [-a str] [-b N] [-c str] [-d str] [-e	filename] [-f]	-g str
       [-h str]	 [-i filename]	[-k str]  [-l N]  [-m]	[-n str] [-o filename]
       [-p N] [-q] [-r str] -s str [-t filename] [-u] [-v] [-w]	[-x str]

       fa2htgs is a program used to generate Seq-submits (ASN.1 sequence
       submission files) for high throughput genome sequencing projects.

       fa2htgs reads a FASTA file (or an Ace Contig file with Phrap sequence
       quality values), a Sequin submission template file (to get contact and
       citation information for the submission), and a series of command line
       arguments (see below). The program then combines this information to
       make a submission suitable for GenBank. Once you have generated your
       submission file, you need to follow the submission protocol (see the
       README present on your FTP account or mailed out to your

       fa2htgs is intended for scripted, automated bulk submission of
       unannotated genome sequence. It can easily be extended from its current
       simple form to allow more complicated processing. A submission prepared
       with fa2htgs can also be read into Psequin(1) and then annotated more
       extensively.
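
As an illustration of scripted bulk submission, the following Python sketch builds one fa2htgs argument list per FASTA file, using only flags documented under the options below (-i, -t, -g, -s, -p, -o). The helper names, the use of the file stem as the sequence name, and the ".sqn" output extension are assumptions made for this example, not conventions of the tool.

```python
from pathlib import Path


def build_fa2htgs_argv(fasta, center_tag, phase=1,
                       template="template.sub", out_dir="."):
    """Return the argument list for one fa2htgs run on a FASTA file."""
    fasta = Path(fasta)
    # Assumption: the file stem doubles as the unique sequence name (-s).
    seq_name = fasta.stem
    return [
        "fa2htgs",
        "-i", str(fasta),        # FASTA input
        "-t", template,          # Seq-submit template
        "-g", center_tag,        # genome center tag (required)
        "-s", seq_name,          # sequence name (required)
        "-p", str(phase),        # HTGS phase
        # Assumption: ".sqn" is only a naming choice for this sketch.
        "-o", str(Path(out_dir) / (seq_name + ".sqn")),
    ]


def build_batch(fasta_files, center_tag, **kwargs):
    """Build argument lists for many files, refusing duplicate -s names."""
    names = [Path(f).stem for f in fasta_files]
    if len(set(names)) != len(names):
        raise ValueError("sequence names (-s) must be unique within the center")
    return [build_fa2htgs_argv(f, center_tag, **kwargs) for f in fasta_files]
```

Each returned list could then be handed to subprocess.run(argv, check=True) on a machine where fa2htgs is installed.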

       Questions and concerns about this processing protocol, or  how  to  use
       this tool should	be forwarded to	<>.

       A summary of options is included	below.

       -      Print usage message

       -6 str SP6 clone	(e.g., Contig1,left)

       -7 str T7 clone (e.g., Contig2,right)

       -A filename
	      Filename	for  accession	list input (mutually exclusive with -T
	      and -i).	The input file contains	 a  tab-delimited  table  with
	      three  to	 five columns, which are accession number, start posi-
	      tion, stop position, and (optionally)  length  and  strand.   If
	      start  >	stop,  the minus strand	on the referenced accession is
	      used.  A gap is indicated	by the word "gap" instead of an	acces-
	      sion,  0	for the	start and stop positions, and a	number for the

       -C str Clone library name  (will	 appear	 as  /clone-lib="str"  on  the
	      source feature)

       -D     HTGS_DRAFT sequence

       -L filename
	      Read  phrap contig order from filename.  This is a tab-delimited
	      file that	can be used to drive the order	of  contigs  (normally
	      specified	by -P),	as well	as indicating the SP6 and T7 ends.  It
	      can also be used when contigs are	known to be in opposite	orien-
	      tation.  For example:

		  Contig2     +	      1	      SP6     left
		  Contig3     +	      1
		  Contig1     -		      T7      right

	      The  first column	is the contig name, the	second is the orienta-
	      tion, the	third is the fragment_group, the fourth	indicates  the
	      SP6  or  T7  end,	and the	fifth says which side of SP6 or	T7 end
	      had vector removed.
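
A file in this layout can be produced by a short script. The Python sketch below writes the five tab-delimited columns described above and reproduces the example rows; the function name is hypothetical, and dropping trailing empty columns is an assumption based on the shorter rows in the example.

```python
def format_contig_order(rows):
    """Format (name, orientation, fragment_group, end, side) rows as
    tab-delimited lines for a contig order file."""
    lines = []
    for name, orientation, group, end, side in rows:
        if orientation not in ("+", "-"):
            raise ValueError("orientation must be '+' or '-'")
        if end not in ("", "SP6", "T7"):
            raise ValueError("end must be 'SP6', 'T7', or empty")
        fields = [name, orientation, group, end, side]
        # Assumption: trailing empty columns may be omitted, as in the
        # second row of the example above.
        while fields and fields[-1] == "":
            fields.pop()
        lines.append("\t".join(fields))
    return "\n".join(lines) + "\n"


rows = [
    ("Contig2", "+", "1", "SP6", "left"),
    ("Contig3", "+", "1", "", ""),
    ("Contig1", "-", "", "T7", "right"),
]
contig_order_text = format_contig_order(rows)
```

The resulting text can be written to a file and passed with -L.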

       -M str Map name (will appear as /map="str" on the source	feature)

       -N     Annotate assembly_fragments

       -O filename
	      Read comment from filename (100-character-per-line maximum; ~ is
	      a linebreak and `~ is a literal ~). You can check the format
	      with Psequin(1).
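
The tilde convention can be applied mechanically when preparing a comment file. The Python sketch below escapes literal tildes, joins paragraphs with ~, and reports lines over the 100-character limit. Both function names are hypothetical, and the sketch assumes that no other characters need escaping, which the text above does not confirm.

```python
def encode_comment(paragraphs):
    """Join paragraphs with '~' line breaks, writing literal tildes as '`~'."""
    escaped = [p.replace("~", "`~") for p in paragraphs]
    return "~".join(escaped)


def overlong_lines(text, limit=100):
    """Return 1-based numbers of lines in text longer than limit characters."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if len(line) > limit
    ]
```

For example, encode_comment(["Draft of clone ~40 kb", "Second line"]) yields a single string in which the first tilde is escaped and the paragraph boundary becomes a bare ~.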

       -P str Contigs to use, separated by commas. If -P is not given with
	      the -T option, the fragments are placed in the order in which
	      they appear in the ace file (which is appropriate for a phase 1
	      record, but not for a phase 2 or 3). To set the order of the
	      segments of the ace file, you need to set it with
	      the -P flag, like	this: -P "Contig1,Contig4,Contig3,Contig2,Con-

       -Q filename
	      Read quality scores from filename

       -S str Strain name

       -T filename
	      Filename for phrap input (mutually exclusive with	-A and -i)

       -X     The coordinates in the input file	are on the resulting segmented
	      sequence.	 (Bases	1 through n of each accession are used.)  Oth-
	      erwise, the coordinates are on the individual accessions,	 which
	      need not start at	base 1 of the record.
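
The accession list read by -A is a tab-delimited table of accession number, start, stop, and optionally length and strand, with start > stop selecting the minus strand. The Python sketch below formats such rows; the function names are hypothetical and the accession numbers are placeholders, not real records.

```python
def uses_minus_strand(start, stop):
    """A row with start > stop refers to the minus strand."""
    return start > stop


def format_accession_list(segments):
    """Format (accession, start, stop[, length[, strand]]) tuples as
    tab-delimited lines with three to five columns."""
    lines = []
    for accession, start, stop, *optional in segments:
        if start < 1 or stop < 1:
            raise ValueError("positions must be 1 or greater")
        fields = [accession, str(start), str(stop)] + [str(x) for x in optional]
        if len(fields) > 5:
            raise ValueError("a row has at most five columns")
        lines.append("\t".join(fields))
    return "\n".join(lines) + "\n"


# Placeholder accessions, for illustration only.
table = format_accession_list([
    ("AC000001", 1, 5000),     # plus strand
    ("AC000002", 3000, 1),     # start > stop: minus strand
])
```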

       -a str GenBank accession; use if	and only if updating a sequence.

       -b N   Gap  length (default = 100; anything from	0 to 1000000000	is le-

       -c str Clone name (will appear as /clone	in the source feature; can  be
	      the same as -s)

       -d str Title for	sequence (will appear in GenBank DEFINITION line)

       -e filename
	      Log errors to filename

       -f     htgs_fulltop keyword

       -g str Genome  Center  tag (probably the	same as	your login name	on the
	      NCBI FTP server)

       -h str Chromosome (will appear as /chromosome in	the source feature)

       -i filename
	      Filename for fasta input (default	is stdin;  mutually  exclusive
	      with -A and -T)
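
Because -A, -T, and -i are mutually exclusive, a wrapper script may want to reject conflicting inputs before invoking the program. A minimal Python sketch, with a hypothetical helper name:

```python
def pick_input_args(accession_list=None, phrap_file=None, fasta_file=None):
    """Return the single input flag and filename to pass to fa2htgs.

    An empty list means no input flag is given, in which case FASTA
    is read from stdin (the documented default for -i).
    """
    candidates = (("-A", accession_list), ("-T", phrap_file), ("-i", fasta_file))
    chosen = [(flag, value) for flag, value in candidates if value is not None]
    if len(chosen) > 1:
        flags = ", ".join(flag for flag, _ in chosen)
        raise ValueError("mutually exclusive options given together: " + flags)
    return list(chosen[0]) if chosen else []
```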

       -k str Add the supplied string as a keyword.

       -l N   Length of sequence in bp (default = 0). The length is checked
	      against the actual number of bases received. For phase 1 and 2
	      sequence it is also used to estimate gap lengths. For phase 1
	      and 2 records, it is important to use a number GREATER than the
	      amount of nucleotide provided; otherwise this will generate
	      false `gaps'. It is assumed here that the putative full length
	      of the BAC or cosmid will be used. There should be at least 20
	      to 30 `n' between the segments (you can check for these in
	      Sequin), as this ensures proper behavior when the sequence is
	      used with BLAST. Otherwise `artifactual' unrelated segment
	      neighbors may be brought into proximity of each other.
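
The requirement that -l exceed the amount of sequence provided for phase 1 and 2 records can be checked before submission. The Python sketch below counts every sequence character in a FASTA string and compares it with the declared length. The function names are hypothetical, and how fa2htgs itself counts bases or distributes the difference among gaps is not specified here.

```python
def count_bases(fasta_text):
    """Count sequence characters in FASTA text, skipping '>' header lines."""
    total = 0
    for line in fasta_text.splitlines():
        line = line.strip()
        if line and not line.startswith(">"):
            total += len(line)
    return total


def check_declared_length(fasta_text, declared_length, phase):
    """Return declared_length minus the bases provided.

    For phase 1 and 2 the declared length must be greater than the
    amount of sequence provided; otherwise ValueError is raised.
    """
    provided = count_bases(fasta_text)
    if phase in (1, 2) and declared_length <= provided:
        raise ValueError(
            "-l must be greater than the %d bases provided" % provided
        )
    return declared_length - provided
```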

       -m     Take comment from	template

       -n str Organism name (default = Homo sapiens)

       -o filename
	      Filename for asn.1 output	(default = stdout)

       -p N   HTGS phase:
	      1	     A	collection  of	unordered contigs with gaps of unknown
		     length.  A	Phase 1	record must at the very	least have two
		     segments with one gap.  (default)
	      2	     A	series	of  ordered  contigs,  possibly	with known gap
		     lengths.  This could be a single sequence	without	 gaps,
		     if	the sequence has ambiguities to	resolve.
	      3	     A single contiguous sequence.  This sequence is finished,
		     but not necessarily annotated.
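
The phase definitions above imply simple constraints on the number of segments, which a script can check before choosing -p. A Python sketch under that reading, with a hypothetical function name:

```python
def phase_problem(phase, n_segments):
    """Return a description of the mismatch between an HTGS phase and
    the number of segments, or None if the two are consistent."""
    if phase not in (1, 2, 3):
        raise ValueError("phase must be 1, 2, or 3")
    if n_segments < 1:
        raise ValueError("at least one segment is required")
    if phase == 1 and n_segments < 2:
        return "phase 1 needs at least two segments with one gap"
    if phase == 3 and n_segments != 1:
        return "phase 3 is a single contiguous sequence"
    # Phase 2 may be several ordered contigs or one sequence without gaps.
    return None
```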

       -q     htgs_cancelled keyword

       -r str Remark for update	(brief comment describing the  nature  of  the
	      update, such as "new sequence", "new citation", or "updated fea-

       -s str Sequence name.  The sequence must	have a	name  that  is	unique
	      within  the  genome center. We use the combination of the	genome
	      center name (-g argument)	and the	sequence name  (-s)  to	 track
	      this  sequence  and  to talk to you about	it.  The name can have
	      any form you like	but must be unique within your center.

       -t filename
	      Filename for Seq-submit template (default	= template.sub)

       -u     Take biosource from template

       -v     htgs_activefin keyword

       -w     Whole Genome Shotgun flag

       -x str Secondary	 accession  numbers,   separated   by	commas,	  s.t.

	      In some cases a large segment will supersede one or more other
	      accession numbers (records). Records that are no longer wanted
	      in GenBank should be made secondary. With the -x argument you
	      can list the accession numbers you want to make secondary. This
	      instructs us to remove the accession number(s) from GenBank, and
	      they will no longer be part of the GenBank release. They will
	      nonetheless be available from Entrez.

	      GREAT CARE should be taken when using this argument!!! Improper
	      use of accession numbers here will result in the inappropriate
	      withdrawal of records from GenBank, EMBL and DDBJ. We provide
	      this parameter as a convenience to submitting centers, but it
	      may need to be removed if it is not used carefully.

       The National Center for Biotechnology Information.

       Psequin(1), fa2htgs/README

NCBI				  2006-05-29			    FA2HTGS(1)

