tabix(1)		     Bioinformatics tools		      tabix(1)

       tabix - Generic indexer for TAB-delimited genome	position files

       tabix  [-0lf]  [-p gff|bed|sam|vcf] [-s seqCol] [-b begCol] [-e endCol]
       [-S lineSkip] [-c metaChar] [region1 [region2	[...]]]

       Tabix indexes a TAB-delimited genome position file and creates an index
       file (.tbi or .csi) when no region is given on the command line. The
       input data file must be position sorted and compressed by bgzip, which
       has a gzip(1)-like interface.

       After  indexing,	 tabix is able to quickly retrieve data	lines overlap-
       ping regions specified in the format  "chr:beginPos-endPos".   (Coordi-
       nates specified in this region format are 1-based and inclusive.)
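
       The region format described above can be illustrated with a short
       Python sketch. It is not part of tabix: the helper name parse_region
       and the regular expression are assumptions made for illustration, and
       sequence names containing ":" are not handled. It splits a region
       string, strips thousands separators, and converts the 1-based
       inclusive coordinates to a 0-based half-open interval.

```python
import re

# Hypothetical helper, not part of tabix: parses "chr:beginPos-endPos".
# Assumption: the sequence name itself contains no ':' character.
_REGION = re.compile(r"^(?P<chrom>[^:]+):(?P<beg>[\d,]+)-(?P<end>[\d,]+)$")


def parse_region(region):
    """Return (chrom, start0, end0) as a 0-based, half-open interval."""
    m = _REGION.match(region)
    if m is None:
        raise ValueError(f"not in chr:beginPos-endPos form: {region!r}")
    # Commas are accepted as thousands separators, e.g. 10,000,000.
    beg = int(m.group("beg").replace(",", ""))
    end = int(m.group("end").replace(",", ""))
    if beg < 1 or end < beg:
        raise ValueError(f"invalid coordinates in {region!r}")
    # 1-based inclusive [beg, end] becomes 0-based half-open [beg - 1, end).
    return m.group("chrom"), beg - 1, end
```

       For example, parse_region("chr1:1-100") covers the first 100 bases of
       chr1 and yields the interval [0, 100).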

       Fast data retrieval also works over a network if a URI is given as the
       file name; in this case the index file is downloaded if it is not
       present locally.

       The  tabix  (.tbi)  and BAI index formats can handle individual chromo-
       somes up	to 512 Mbp (2^29 bases)	in length.  If your input  file	 might
       contain	data  lines with begin or end positions	greater	than that, you
       will need to use	a CSI index.
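
       As a rough illustration of the limit above, the following Python
       sketch checks whether any position exceeds 2^29. The function name
       needs_csi is hypothetical, and the comparison follows the wording
       "greater than" used above.

```python
# 2^29 bases = 536,870,912 (512 Mbp), the .tbi/BAI limit quoted above.
TBI_MAX_POS = 1 << 29


def needs_csi(positions):
    """Return True if any begin or end position exceeds the .tbi/BAI limit.

    Hypothetical helper for illustration; `positions` is any iterable of
    integer coordinates taken from the data lines.
    """
    return any(pos > TBI_MAX_POS for pos in positions)
```

       If this returns True for the begin and end columns of a file, indexing
       with -C (CSI format) would be the appropriate choice.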

       -0, --zero-based
		 Specify that the position in the data file is	0-based	 (e.g.
		 UCSC files) rather than 1-based.

       -b, --begin INT
		 Column	of start chromosomal position. [4]

       -c, --comment CHAR
		 Skip lines starting with character CHAR. [#]

       -C, --csi Produce  CSI  format  index instead of	classical tabix	or BAI
		 style indices.

       -e, --end INT
		 Column	of end chromosomal position. The end column can	be the
		 same as the start column. [5]

       -f, --force
		 Overwrite the index file if it already exists.

       -m, --min-shift INT
		 Set the minimal interval size for CSI indices to 2^INT. [14]

       -p, --preset STR
		 Input	format	for indexing. Valid values are:	gff, bed, sam,
		 vcf.  This option should not be applied together with any  of
		 -s,  -b, -e, -c and -0; it is not used	for data retrieval be-
		 cause this setting is stored in the index file. [gff]

       -s, --sequence INT
		 Column of sequence name. Options -s, -b, -e, -S, -c and -0
		 are all stored in the index file and thus not used in data
		 retrieval. [1]

       -S, --skip-lines	INT
		 Skip first INT	lines in the data file.	[0]
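
       To give a feel for what the -m value controls, here is a Python sketch
       of a binning calculation in the style published with the BAM and CSI
       index descriptions. It is an illustration under assumptions, not
       tabix's own code: the depth parameter is not a tabix option, and the
       default of 5 is chosen here only because, together with a minimum
       shift of 14, it reproduces the classical scheme that covers 2^29
       bases.

```python
def reg2bin(beg, end, min_shift=14, depth=5):
    """Return the smallest bin that fully contains [beg, end).

    Coordinates are 0-based and half-open. `min_shift` corresponds to the
    -m value (finest bins span 2^min_shift bases); `depth` is the number of
    levels below the root and is an assumption of this sketch.
    """
    end -= 1  # work with the last base covered by the interval
    shift = min_shift
    # Number of the first bin on the finest level: (8^depth - 1) / 7.
    offset = ((1 << (depth * 3)) - 1) // 7
    for level in range(depth, 0, -1):
        if beg >> shift == end >> shift:
            return offset + (beg >> shift)
        # Move one level up: bins become 8 times wider.
        shift += 3
        offset -= 1 << ((level - 1) * 3)
    return 0  # only the root bin spans the interval
```

       With the defaults, an interval that fits inside one 16384-base window
       lands in a finest-level bin, while an interval crossing a window
       boundary is assigned to a wider bin further up.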

       -h, --print-header
	      Print also the header/meta lines.

       -H, --only-header
	      Print only the header/meta lines.

       -l, --list-chroms
	      List the sequence	names stored in	the index file.

       -r, --reheader FILE
	      Replace the header with the content of FILE.

       -R, --regions FILE
	      Restrict to regions listed in FILE. FILE can be a BED file
	      (requires a .bed, .bed.gz or .bed.bgz file name extension) or a
	      TAB-delimited file with CHROM, POS and, optionally, POS_TO
	      columns, where positions are 1-based and inclusive. When this
	      option is in use, the input file may not be sorted.

       -T, --targets FILE
	      Similar to -R but	the entire input will be read sequentially and
	      regions not listed in FILE will be skipped.

       -D     Do  not download the index file before opening it. Valid for re-
	      mote files only.

       --cache INT
	      Set the BGZF block cache size to INT megabytes. [10]

	      This is of most benefit when the -R option is  used,  which  can
	      cause  blocks  to	be read	more than once.	 Setting the size to 0
	      will disable the cache.

       --separate-regions
	      This option can be used when multiple regions are supplied on
	      the command line and the user needs to quickly see which file
	      records belong to which region. For this, a line with the name
	      of the region, preceded by the file-specific comment symbol, is
	      inserted in the output before its corresponding group of
	      records.

       --verbosity INT
	      Set  verbosity  of  logging messages printed to stderr.  The de-
	      fault is 3, which	turns on error and warning messages; 2 reduces
	      warning  messages;  1 prints only	error messages and 0 is	mostly
	      silent.  Values higher than 3 produce  additional	 informational
	      and debugging messages.
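
       The two region-file layouts accepted by -R and -T can be sketched in
       Python as follows. This is an illustration under assumptions, not how
       tabix itself reads the file: the function names are hypothetical, the
       input is taken as already-decompressed text lines, skipping of "#"
       lines is an assumption, and BED coordinates are treated as 0-based
       and half-open, as is conventional for that format.

```python
BED_SUFFIXES = (".bed", ".bed.gz", ".bed.bgz")


def is_bed_name(filename):
    """True if the name carries one of the BED extensions listed above."""
    return filename.endswith(BED_SUFFIXES)


def parse_regions(lines, bed=False):
    """Return a list of (chrom, begin, end), 1-based and inclusive.

    Hypothetical helper: `lines` is an iterable of TAB-delimited text lines.
    """
    regions = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):  # assumption: skip comments
            continue
        fields = line.split("\t")
        chrom = fields[0]
        if bed:
            # BED start is 0-based, end is exclusive.
            beg, end = int(fields[1]) + 1, int(fields[2])
        else:
            # CHROM, POS and optional POS_TO, already 1-based inclusive.
            beg = int(fields[1])
            end = int(fields[2]) if len(fields) > 2 and fields[2] else beg
        regions.append((chrom, beg, end))
    return regions
```

       Both layouts end up in the same 1-based inclusive form, so the BED
       line "chr1 99 200" and the plain line "chr1 100 200" describe the
       same region.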

       (grep ^"#" in.gff; grep -v ^"#" in.gff | sort -k1,1 -k4,4n) | bgzip > sorted.gff.gz;

       tabix -p	gff sorted.gff.gz;

       tabix sorted.gff.gz chr1:10,000,000-20,000,000;
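
       For readers who prefer to see the sorting step of the first command
       spelled out, here is a minimal Python sketch of the same idea: header
       lines first, then data lines ordered by sequence name and numeric
       start column. The function name sort_gff_lines is hypothetical.
       Python compares names by code point and keeps tied lines in input
       order, which may differ from sort(1) under some locales.

```python
def sort_gff_lines(lines):
    """Return header lines followed by position-sorted data lines.

    Mirrors the intent of `sort -k1,1 -k4,4n` on TAB-delimited GFF text:
    column 1 is the sequence name, column 4 the start position.
    """
    header = [ln for ln in lines if ln.startswith("#")]
    body = [ln for ln in lines if not ln.startswith("#")]

    def key(ln):
        fields = ln.rstrip("\n").split("\t")
        return (fields[0], int(fields[3]))

    return header + sorted(body, key=key)
```

       The result would still need to be compressed with bgzip before
       running tabix -p gff on it.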

       It is straightforward to	achieve	overlap	queries	using the standard  B-
       tree  index (with or without binning) implemented in all	SQL databases,
       or the R-tree index in PostgreSQL and Oracle. But there are still  many
       reasons	to  use	 tabix.	 Firstly,  tabix  directly works with a	lot of
       widely used TAB-delimited formats such as GFF/GTF and BED.  We  do  not
       need  to	 design	database schema	or specialized binary formats. Data do
       not need	to be duplicated in different formats, either. Secondly, tabix
       works  on  compressed  data  files while	most SQL databases do not. The
       GenCode annotation GTF can be compressed	down to	4%.  Thirdly, tabix is
       fast.  The  same	indexing algorithm is known to work efficiently	for an
       alignment with a	few billion short reads. SQL databases probably	cannot
       easily handle data at this scale. Last but not least, tabix supports
       remote data retrieval. One can put the data file and the index on an
       FTP or HTTP server, and other users or even web services will be able
       to get a slice without downloading the entire file.

       Tabix was written by Heng Li. The BGZF library  was  originally	imple-
       mented  by Bob Handsaker	and modified by	Heng Li	for remote file	access
       and in-memory caching.

       bgzip(1), samtools(1)

htslib-1.11		       22 September 2020		      tabix(1)

