faidx(5)		    Bioinformatics formats		      faidx(5)

NAME
       faidx - an index enabling random access to FASTA and FASTQ files

SYNOPSIS
       file.fa.fai, file.fasta.fai, file.fq.fai, file.fastq.fai

DESCRIPTION
       Using an	fai index file in conjunction with a FASTA/FASTQ file contain-
       ing reference sequences enables efficient access	to  arbitrary  regions
       within  those  reference	 sequences.   The index	file typically has the
       same filename as	the corresponding  FASTA/FASTQ	file,  with  .fai  ap-
       pended.

       An  fai	index  file  is	a text file consisting of lines	each with five
       TAB-delimited columns for a FASTA file and six for FASTQ:

       NAME         Name of this reference sequence
       LENGTH       Total length of this reference sequence, in bases
       OFFSET       Offset in the FASTA/FASTQ file of this sequence's first base
       LINEBASES    The number of bases on each line
       LINEWIDTH    The number of bytes in each line, including the newline
       QUALOFFSET   Offset of sequence's first quality within the FASTQ file
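
       The column layout can be read into a small record type.  The following
       is a minimal sketch, not the samtools implementation; the names
       FaiRecord and parse_fai_line are hypothetical and chosen only for
       illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FaiRecord:
    """One line of an fai index (hypothetical helper type)."""
    name: str
    length: int
    offset: int
    linebases: int
    linewidth: int
    qualoffset: Optional[int] = None  # sixth column, FASTQ indexes only


def parse_fai_line(line: str) -> FaiRecord:
    """Split one TAB-delimited index line into its five or six columns."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) not in (5, 6):
        raise ValueError(
            f"expected 5 or 6 TAB-delimited columns, got {len(fields)}"
        )
    # The first column is the name; all remaining columns are integers.
    numbers = [int(field) for field in fields[1:]]
    return FaiRecord(fields[0], *numbers)
```

       A five-column line leaves qualoffset as None, which is how this sketch
       distinguishes a FASTA index entry from a FASTQ one.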

       The NAME	and LENGTH columns contain the same data as  would  appear  in
       the  SN	and  LN	 fields	of a SAM @SQ header for	the same reference se-
       quence.

       The OFFSET column contains the offset within the FASTA/FASTQ file, in
       bytes starting from zero, of the first base of this reference sequence,
       i.e., of the character following the newline at the end of the header
       line (the ">" line in FASTA, "@" in FASTQ).  The lines of an fai index
       file usually appear in the order in which the reference sequences
       appear in the FASTA/FASTQ file, so .fai files are typically sorted
       according to this column.

       The LINEBASES column contains the number	of bases in each  of  the  se-
       quence  lines that form the body	of this	reference sequence, apart from
       the final line which may	be shorter.  The LINEWIDTH column contains the
       number of bytes in each of the sequence lines (except perhaps the final
       line), thus differing from LINEBASES in that it also counts  the	 bytes
       forming the line	terminator.
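
       Together with OFFSET, these two columns are enough to locate any base
       without scanning the file.  The sketch below shows the arithmetic,
       assuming a zero-based position within the sequence; the function name
       byte_offset is hypothetical.

```python
def byte_offset(offset: int, linebases: int, linewidth: int, pos: int) -> int:
    """Return the file byte position of the base at zero-based index `pos`.

    `offset`, `linebases` and `linewidth` are the OFFSET, LINEBASES and
    LINEWIDTH columns of the sequence's index line.
    """
    # Number of complete sequence lines before `pos`, and the position
    # of the base within its own line.
    full_lines, within_line = divmod(pos, linebases)
    # Each complete line occupies LINEWIDTH bytes (bases plus terminator).
    return offset + full_lines * linewidth + within_line
```

       Because every line except possibly the last has the same width, the
       result does not depend on whether the file uses LF or CR-LF line
       termination; that difference is already captured by LINEWIDTH.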

       The QUALOFFSET column works the same way as OFFSET, but for the first
       quality score of this reference sequence, i.e., the first character
       following the newline at the end of the "+" line.  This column is
       present only in indexes of FASTQ files.

   FASTA Files
       In order	to be indexed with samtools faidx, a FASTA file	must be	a text
       file of the form

	      >name [description...]
	      ATGCATGCATGCATGCATGCATGCATGCAT
	      GCATGCATGCATGCATGCATGCATGCATGC
	      ATGCAT
	      >name [description...]
	      ATGCATGCATGCAT
	      GCATGCATGCATGC
	      [...]

       In  particular, each reference sequence must be "well-formatted", i.e.,
       all of its sequence lines must be the same length, apart	from the final
       sequence	 line  which may be shorter.  (While this sequence line	length
       must be the same	within each sequence, it may  vary  between  different
       reference sequences in the same FASTA file.)

       This also means that although the FASTA file may	have Unix- or Windows-
       style or	other line termination,	the newline characters present must be
       consistent, at least within each	reference sequence.
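
       The "well-formatted" requirement can be checked one sequence at a
       time.  The following is a rough sketch under stated assumptions: the
       function name is hypothetical, the caller has already collected the
       raw sequence lines of a single record with their terminators intact,
       and a final line with no terminator at all (end of file) is
       tolerated.  It is not the check performed by samtools.

```python
from typing import List


def is_well_formatted(seq_lines: List[bytes]) -> bool:
    """Check the line-length rule for the body of ONE reference sequence.

    `seq_lines` holds the raw sequence lines, terminators included.
    """
    if not seq_lines:
        return True

    def terminator(line: bytes) -> bytes:
        # Whatever trailing CR/LF bytes the line ends with.
        return line[len(line.rstrip(b"\r\n")):]

    first = seq_lines[0]
    # Every line except the last must have the same byte width and the
    # same terminator as the first line.
    for line in seq_lines[:-1]:
        if len(line) != len(first) or terminator(line) != terminator(first):
            return False

    # The final line may be shorter, but never longer, than the others.
    last = seq_lines[-1]
    if len(last.rstrip(b"\r\n")) > len(first.rstrip(b"\r\n")):
        return False
    # Assumption: the last line may lack a terminator at end of file.
    return terminator(last) in (terminator(first), b"")
```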

       The  samtools implementation uses the first word	of the ">" header line
       text (i.e., up to the first whitespace character,  having  skipped  any
       initial whitespace after	the ">") as the	NAME column.
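
       The rule for deriving NAME from a header line can be sketched as
       follows.  This is an illustration of the rule as described above, not
       code taken from samtools; the function name is hypothetical.

```python
def name_from_header(header: str) -> str:
    """Return the NAME column value for a FASTA ">" header line."""
    text = header.rstrip("\r\n")
    if not text.startswith(">"):
        raise ValueError("not a FASTA header line")
    # str.split() with no arguments skips leading whitespace and splits
    # on any run of whitespace, so the first item is the first word.
    words = text[1:].split()
    return words[0] if words else ""
```

       Everything after the first word is treated as free-form description
       and does not appear in the index.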

   FASTQ Files
       FASTQ files are indexed in the same way as FASTA files.  A FASTQ file
       to be indexed must be a text file of the form

	      @name [description...]
	      ATGCATGCATGCATGCATGCATGCATGCAT
	      GCATGCATGCATGCATGCATGCATGCATGC
	      ATGCAT
	      +
	      FFFA@@FFFFFFFFFFHHB:::@BFFFFGG
	      HIHIIIIIIIIIIIIIIIIIIIIIIIFFFF
	      8011<<
	      @name [description...]
	      ATGCATGCATGCAT
	      GCATGCATGCATGC
	      +
	      IIA94445EEII==
	      =>IIIIIIIIICCC
	      [...]

       Quality	lines  must be wrapped at the same length as the corresponding
       sequence	lines.

EXAMPLE
       For example, given this FASTA file

	      >one
	      ATGCATGCATGCATGCATGCATGCATGCAT
	      GCATGCATGCATGCATGCATGCATGCATGC
	      ATGCAT
	      >two another chromosome
	      ATGCATGCATGCAT
	      GCATGCATGCATGC

       formatted with Unix-style (LF) line termination,	the corresponding  fai
       index would be

	      one   66	  5   30   31
	      two   28	 98   14   15

       If the FASTA file were formatted	with Windows-style (CR-LF) line	termi-
       nation, the fai index would be

	      one   66	   6   30   32
	      two   28	 103   14   16
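
       The two example indexes above can be reproduced with a short script.
       This is a simplified sketch, assuming the whole FASTA file fits in
       memory and is already well-formatted (no validation is done); the
       function name build_fai is hypothetical and this is not how samtools
       is implemented.

```python
from typing import List, Tuple


def build_fai(data: bytes) -> List[Tuple[str, int, int, int, int]]:
    """Return (NAME, LENGTH, OFFSET, LINEBASES, LINEWIDTH) per sequence."""
    records = []
    current = None  # [name, length, offset, linebases, linewidth]
    pos = 0         # byte position of the start of the current line

    # keepends=True keeps the terminator, so len(line) is the byte width.
    for line in data.splitlines(keepends=True):
        if line.startswith(b">"):
            if current is not None:
                records.append(tuple(current))
            words = line[1:].split()
            name = words[0].decode("ascii") if words else ""
            # The first base follows the header line's terminator.
            current = [name, 0, pos + len(line), 0, 0]
        elif current is not None:
            bases = len(line.rstrip(b"\r\n"))
            if current[3] == 0:
                # The first sequence line fixes LINEBASES and LINEWIDTH.
                current[3], current[4] = bases, len(line)
            current[1] += bases
        pos += len(line)

    if current is not None:
        records.append(tuple(current))
    return records
```

       Running this on the example file with LF and then with CR-LF line
       termination shows that only the OFFSET and LINEWIDTH columns change
       between the two.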

       The following is an example FASTQ file

	      @fastq1
	      ATGCATGCATGCATGCATGCATGCATGCAT
	      GCATGCATGCATGCATGCATGCATGCATGC
	      ATGCAT
	      +
	      FFFA@@FFFFFFFFFFHHB:::@BFFFFGG
	      HIHIIIIIIIIIIIIIIIIIIIIIIIFFFF
	      8011<<
	      @fastq2
	      ATGCATGCATGCAT
	      GCATGCATGCATGC
	      +
	      IIA94445EEII==
	      =>IIIIIIIIICCC

       With Unix-style (LF) line termination, this file would give the fai index

	      fastq1   66     8	  30   31    79
	      fastq2   28   156	  14   15   188
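
       Since quality lines are wrapped like sequence lines, the same
       arithmetic retrieves either bases or qualities; only the starting
       offset differs.  The sketch below assumes a zero-based, half-open
       range and a file held in memory; the function name fetch_range and
       its parameters are hypothetical.

```python
def fetch_range(data: bytes, start_offset: int, linebases: int,
                linewidth: int, begin: int, end: int) -> str:
    """Return items [begin, end) of one record as text.

    Pass the OFFSET column as `start_offset` to get bases, or the
    QUALOFFSET column to get the matching quality scores.
    """
    first = start_offset + (begin // linebases) * linewidth + begin % linebases
    last = start_offset + (end // linebases) * linewidth + end % linebases
    raw = data[first:last]
    # Drop any line terminators that fall inside the requested range.
    return raw.replace(b"\n", b"").replace(b"\r", b"").decode("ascii")
```

       For the fastq1 entry above, calling this once with start_offset 8 and
       once with start_offset 79, keeping the other arguments the same,
       yields a stretch of bases and the quality scores for those same
       positions.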

SEE ALSO
       samtools(1)

       https://en.wikipedia.org/wiki/FASTA_format

       https://en.wikipedia.org/wiki/FASTQ_format

	      Further description of the FASTA and FASTQ formats

htslib				   June	2018			      faidx(5)
