vcftools(man)			27 August 2014			 vcftools(man)

NAME
       vcftools	 v0.1.13 - Utilities for the variant call format (VCF) and bi-
       nary variant call format	(BCF)

SYNOPSIS
       vcftools	[ --vcf	FILE | --gzvcf FILE | --bcf FILE] [ --out OUTPUT  PRE-
       FIX ] [ FILTERING OPTIONS ]  [ OUTPUT OPTIONS ]

DESCRIPTION
       vcftools	 is  a suite of	functions for use on genetic variation data in
       the form	of VCF and BCF files. The tools	provided will be  used	mainly
       to  summarize data, run calculations on data, filter out	data, and con-
       vert data into other useful file	formats.

EXAMPLES
       Output allele frequency for all sites in	the input vcf file from	 chro-
       mosome 1
	 vcftools --gzvcf input_file.vcf.gz --freq --chr 1 --out chr1_analysis

       Output  a  new  vcf file	from the input vcf file	that removes any indel
       sites
	 vcftools --vcf input_file.vcf --remove-indels --recode --recode-INFO-all --out SNPs_only

       Output file comparing the sites in two vcf files
	 vcftools --gzvcf input_file1.vcf.gz --gzdiff input_file2.vcf.gz --diff-site --out in1_v_in2

       Output a	new vcf	file to	standard out without any  sites	 that  have  a
       filter tag, then	compress it with gzip
	 vcftools --gzvcf input_file.vcf.gz --remove-filtered-all --recode --stdout | gzip -c > output_PASS_only.vcf.gz

       Output a	Hardy-Weinberg p-value for every site in  the  bcf  file  that
       does not	have any missing genotypes
	 vcftools --bcf input_file.bcf --hardy --max-missing 1.0 --out output_noMissing

       Output nucleotide diversity at a	list of	positions
	 zcat input_file.vcf.gz | vcftools --vcf - --site-pi --positions SNP_list.txt --out nucleotide_diversity

BASIC OPTIONS
       These options are used to specify the input and output files.

   INPUT FILE OPTIONS
	 --vcf _input_filename_
	   This	 option	defines	the VCF	file to	be processed. VCFtools expects
	   files in VCF	format v4.0, v4.1 or v4.2. The	latter	two  are  sup-
	   ported  with	 some  small  limitations. If the user provides	a dash
	   character '-' as a file name, the program expects a VCF file	to  be
	   piped in through standard in.

	 --gzvcf _input_filename_
	   This	 option	 can be	used in	place of the --vcf option to read com-
	   pressed (gzipped) VCF files directly.

	 --bcf _input_filename_
	   This	option can be used in place of the --vcf option	to  read  BCF2
	   files  directly.  You  do  not need to specify if this file is com-
	   pressed with	BGZF encoding. If the user provides a  dash  character
	   '-'	as a file name,	the program expects a BCF2 file	to be piped in
	   through standard in.

   OUTPUT FILE OPTIONS
	 --out _output_prefix_
	   This option defines the output filename prefix for all files
	   generated by vcftools. For example, if <prefix> is set to
	   output_filename, then all output files will be of the form
	   output_filename.***. If this option is omitted, all output files
	   will have the prefix "out." in the current working directory.

	 --stdout
	 -c
	   These options direct	the vcftools output to standard	out so it  can
	   be  piped into another program or written directly to a filename of
	   choice. However, a select few output	functions cannot be written to
	   standard out.

	 --temp	_temporary_directory_
	   This	 option	 can  be  used	to  redirect  any temporary files that
	   vcftools creates into a specified directory.

SITE FILTERING OPTIONS
       These options are used to include or exclude  certain  sites  from  any
       analysis	being performed	by the program.

   POSITION FILTERING
	 --chr _chromosome_
	 --not-chr _chromosome_
	   Includes or excludes sites with identifiers matching <chromosome>.
	   These options may be	used multiple times to include or exclude more
	   than	one chromosome.

	 --from-bp _integer_
	 --to-bp _integer_
	   These  options specify a lower bound	and upper bound	for a range of
	   sites to be processed. Sites	with positions less  than  or  greater
	   than	 these values will be excluded.	These options can only be used
	   in conjunction with a single	usage of --chr.	 Using	one  of	 these
	   does	not require use	of the other.

	 --positions _filename_
	 --exclude-positions _filename_
	   Include or exclude a set of sites on the basis of a list of
	   positions in a file. Each line of the input file should contain a
	   (tab-separated) chromosome and position. The file can have comment
	   lines that start with a "#"; these are ignored.
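	   A minimal sketch of a positions file in the format described
	   above. The file names (SNP_list.txt, input_file.vcf) and the
	   output prefix are placeholders, and the coordinates are made up
	   for illustration. The vcftools call is guarded so that it only
	   runs when the program and the input file are present.

```shell
# Write a positions file: one tab-separated chromosome and position per line.
# Lines starting with "#" are comments.
printf '# chromosome\tposition\n' >  SNP_list.txt
printf '1\t10583\n'               >> SNP_list.txt
printf '1\t13302\n'               >> SNP_list.txt
printf '2\t20144\n'               >> SNP_list.txt

# Keep only the listed sites and write them to a new VCF file.
# (input_file.vcf and positions_subset are placeholder names.)
if command -v vcftools >/dev/null 2>&1 && [ -f input_file.vcf ]; then
    vcftools --vcf input_file.vcf --positions SNP_list.txt \
             --recode --out positions_subset
fi
```

	   Swapping --positions for --exclude-positions would drop the
	   listed sites instead of keeping them.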

	 --positions-overlap _filename_
	 --exclude-positions-overlap _filename_
	   Include or exclude a set of sites on the basis of the reference
	   allele overlapping with a list of positions in a file. Each line of
	   the input file should contain a (tab-separated) chromosome and
	   position. The file can have comment lines that start with a "#";
	   these are ignored.

	 --bed _filename_
	 --exclude-bed _filename_
	   Include or exclude a	set of sites on	the basis of a BED file.  Only
	   the	first  three  columns (chrom, chromStart and chromEnd) are re-
	   quired. The BED file	is expected to have a header line. A site will
	   be  kept  or	 excluded  if any part of any allele (REF or ALT) at a
	   site	is within the range of one of the BED entries.
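	   As an illustration, the following sketch writes a small BED file
	   with the header line that vcftools expects and then filters on
	   it. The region coordinates, regions.bed, input_file.vcf and the
	   output prefix are all hypothetical.

```shell
# Write a BED file: a header line, then tab-separated
# chrom, chromStart and chromEnd columns.
printf 'chrom\tchromStart\tchromEnd\n' >  regions.bed
printf '1\t10000\t20000\n'             >> regions.bed
printf '2\t50000\t75000\n'             >> regions.bed

# Keep only sites that fall within the BED entries.
# (input_file.vcf and in_regions are placeholder names.)
if command -v vcftools >/dev/null 2>&1 && [ -f input_file.vcf ]; then
    vcftools --vcf input_file.vcf --bed regions.bed --recode --out in_regions
fi
```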

	 --thin	_integer_
	   Thin	sites so that no two sites are within the  specified  distance
	   from	one another.
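	   The idea of thinning can be sketched with awk. This is not the
	   vcftools implementation: it assumes a single greedy pass over
	   position-sorted input that keeps a site when it lies at least
	   the given distance past the last kept site on the same
	   chromosome. The sample sites and thinned_sites.txt are
	   illustrative.

```shell
thin=100   # minimum spacing in bases, analogous to the --thin argument

# Input: tab-separated chromosome and position, sorted by position.
printf '1\t100\n1\t150\n1\t200\n1\t320\n2\t110\n' |
awk -F'\t' -v thin="$thin" '
    # Keep the first site on each chromosome, then every site that is
    # at least "thin" bases past the last kept site.
    NR == 1 || $1 != chrom || $2 - last >= thin {
        print
        chrom = $1
        last = $2
    }
' > thinned_sites.txt

cat thinned_sites.txt
```

	   With the sample input, the sites at 100, 200 and 320 on
	   chromosome 1 and the site at 110 on chromosome 2 remain; the
	   site at 150 is dropped because it is only 50 bases from the
	   previous kept site.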

	 --mask	_filename_
	 --invert-mask _filename_
	 --mask-min _integer_
	   These  options are used to specify a	FASTA-like mask	file to	filter
	   with. The mask file contains	a sequence of integer digits  (between
	   0  and  9) for each position	on a chromosome	that specify if	a site
	   at that position should be filtered or not.
	   An example mask file would look like:
	     >1
	     0000011111222...
	     >2
	     2222211111000...
	   In this example, sites in the VCF file located within the first 5
	   bases of the start of chromosome 1 would be kept, whereas sites at
	   position 6 onwards would be filtered out. Sites before the 11th
	   position on chromosome 2 would be filtered out as well.
	   The	"--invert-mask"	 option	takes the same format mask file	as the
	   "--mask" option, however it inverts the mask	file before  filtering
	   with	it.
	   And	the  "--mask-min"  option specifies a threshold	mask value be-
	   tween 0 and 9 to filter positions by. The default threshold	is  0,
	   meaning only	sites with that	value or lower will be kept.
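	   The mask rule can be sketched with awk. This is an illustration
	   rather than the vcftools code: it assumes FASTA-style ">"
	   header lines, 1-based positions, and that a site is kept when
	   its mask digit is at or below the threshold. Sites that fall
	   beyond the end of the mask are simply skipped here, which is a
	   choice made for the sketch. All file names are hypothetical.

```shell
# Mask file: one header line per chromosome, then one digit per position.
printf '>1\n0000011111222\n>2\n2222211111000\n' > mask.txt

# Sites to test: tab-separated chromosome and 1-based position.
printf '1\t3\n1\t8\n2\t4\n2\t12\n' > sites.txt

mask_min=0   # analogous to --mask-min: keep digits at or below this value

awk -F'\t' -v min="$mask_min" '
    NR == FNR {                        # first file: load the mask
        if ($0 ~ /^>/) chrom = substr($0, 2)
        else           mask[chrom] = mask[chrom] $0
        next
    }
    {                                  # second file: test each site
        digit = substr(mask[$1], $2, 1)
        if (digit != "" && digit + 0 <= min) print
    }
' mask.txt sites.txt > kept_sites.txt

cat kept_sites.txt
```

	   With the sample mask, position 3 on chromosome 1 and position 12
	   on chromosome 2 are kept, because their mask digit is 0.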

   SITE	ID FILTERING
	 --snp _string_
	   Include  SNP(s)  with matching ID (e.g. a dbSNP rsID). This command
	   can be used multiple	times in order to include more than one	SNP.

	 --snps	_filename_
	 --exclude _filename_
	   Include or exclude a	list of	SNPs given in a	file. The file	should
	   contain a list of SNP IDs (e.g. dbSNP rsIDs), with one ID per line.
	   No header line is expected.

   VARIANT TYPE	FILTERING
	 --keep-only-indels
	 --remove-indels
	   Include or exclude sites that contain an indel. For	these  options
	   "indel" means any variant that alters the length of the REF allele.

   FILTER FLAG FILTERING
	 --remove-filtered-all
	   Removes all sites with a FILTER flag	other than PASS.

	 --keep-filtered _string_
	 --remove-filtered _string_
	   Includes  or	excludes all sites marked with a specific FILTER flag.
	   These options may be	used more than once to specify multiple	FILTER
	   flags.

   INFO	FIELD FILTERING
	 --keep-INFO _string_
	 --remove-INFO _string_
	   Includes or excludes	all sites with a specific INFO flag. These op-
	   tions only filter on	the presence of	the flag and  not  its	value.
	   These  options  can be used multiple	times to specify multiple INFO
	   flags.

   ALLELE FILTERING
	 --maf _float_
	 --max-maf _float_
	   Include only	sites with a Minor Allele Frequency  greater  than  or
	   equal  to  the  "--maf" value and less than or equal	to the "--max-
	   maf"	value. One of these options may	be used	without	the other. Al-
	   lele	 frequency is defined as the number of times an	allele appears
	   over	all individuals	at that	site, divided by the total  number  of
	   non-missing alleles at that site.
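	   The allele frequency definition above can be worked through
	   for a single site. The sketch below assumes a bi-allelic site
	   and a hand-written list of genotypes; it only illustrates the
	   arithmetic and is not how vcftools parses a VCF file.

```shell
# Genotypes for one bi-allelic site; "." marks a missing allele.
genotypes='0/0 0/1 0/0 ./. 1/1 0|0'

# Split into one allele per line, count each allele, and divide the
# minor allele count by the number of non-missing alleles.
maf=$(printf '%s\n' "$genotypes" | tr ' /|' '\n\n\n' | awk '
    $1 == "0" { ref++ }
    $1 == "1" { alt++ }
    END {
        total = ref + alt              # non-missing alleles only
        minor = (ref < alt) ? ref : alt
        printf "%.2f", minor / total
    }')

echo "MAF = $maf"
```

	   Here the reference allele appears 7 times and the alternate
	   allele 3 times out of 10 non-missing alleles, so the minor
	   allele frequency is 0.30. Such a site would pass "--maf 0.05"
	   but not "--max-maf 0.2".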

	 --non-ref-af _float_
	 --max-non-ref-af _float_
	 --non-ref-ac _integer_
	 --max-non-ref-ac _integer_

	 --non-ref-af-any _float_
	 --max-non-ref-af-any _float_
	 --non-ref-ac-any _integer_
	 --max-non-ref-ac-any _integer_
	   Include  only sites with all	Non-Reference (ALT) Allele Frequencies
	   (af)	or Counts (ac) within the range	specified, and	including  the
	   specified  value.  The  default options require all alleles to meet
	   the specified criteria, whereas the options appended	with "any" re-
	   quire only one allele to meet the criteria. The Allele frequency is
	   defined as the number of times an allele appears over all individu-
	   als	at that	site, divided by the total number of non-missing alle-
	   les at that site.

	 --mac _integer_
	 --max-mac _integer_
	   Include only	sites with Minor Allele	Count greater than or equal to
	   the	"--mac"	value and less than or equal to	the "--max-mac"	value.
	   One of these	options	may be used without the	other. Allele count is
	   simply the number of	times that allele appears over all individuals
	   at that site.

	 --min-alleles _integer_
	 --max-alleles _integer_
	   Include only	sites with a number of alleles greater than  or	 equal
	   to  the "--min-alleles" value and less than or equal	to the "--max-
	   alleles" value. One of these	options	may be used without the	other.
	   For example,	to include only	bi-allelic sites, one could use:
	     vcftools --vcf file1.vcf --min-alleles 2 --max-alleles 2

   GENOTYPE VALUE FILTERING
	 --min-meanDP _float_
	 --max-meanDP _float_
	   Includes only sites with mean depth values (over all	included indi-
	   viduals) greater than or equal to the "--min-meanDP"	value and less
	   than	or equal to the	"--max-meanDP" value. One of these options may
	   be used without the other. These options require that the "DP" FOR-
	   MAT tag is included for each	site.

	 --hwe _float_
	   Assesses sites for Hardy-Weinberg Equilibrium using an exact	 test,
	   as  defined	by Wigginton, Cutler and Abecasis (2005). Sites	with a
	   p-value below the threshold defined by this option are taken	to  be
	   out of HWE, and therefore excluded.

	 --max-missing _float_
	   Exclude  sites  on the basis	of the proportion of missing data (de-
	   fined to be between 0 and 1,	where 0	allows	sites  that  are  com-
	   pletely missing and 1 indicates no missing data allowed).
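	   Because 1 means that no missing data are allowed, the
	   threshold behaves like a minimum proportion of called
	   genotypes. The sketch below illustrates that reading for one
	   site. It assumes that a genotype containing any "." counts as
	   missing and that the comparison is "greater than or equal to";
	   both are assumptions rather than documented behavior.

```shell
max_missing=0.8                      # analogous to the --max-missing argument
genotypes='0/0 0/1 ./. 1/1 0/0'      # hand-written genotypes for one site

# Proportion of genotypes that are not missing.
call_rate=$(printf '%s\n' "$genotypes" | tr ' ' '\n' | awk '
    { total++ }
    $1 !~ /\./ { called++ }
    END { printf "%.2f", called / total }')

# Keep the site when the called proportion reaches the threshold.
if awk -v r="$call_rate" -v t="$max_missing" 'BEGIN { exit !(r >= t) }'; then
    decision=keep
else
    decision=exclude
fi

echo "call rate = $call_rate, decision = $decision"
```

	   Four of the five genotypes are called, giving a proportion of
	   0.80, so under these assumptions the site is kept at a
	   threshold of 0.8 and would be excluded at 1.0.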

	 --max-missing-count _integer_
	   Exclude  sites with more than this number of	missing	genotypes over
	   all individuals.

	 --phased
	   Excludes all	sites that contain unphased genotypes.

   MISCELLANEOUS FILTERING
	 --minQ	_float_
	   Includes only sites with Quality value above	this threshold.

INDIVIDUAL FILTERING OPTIONS
       These options are used to include or exclude certain  individuals  from
       any analysis being performed by the program.
	 --indv	_string_
	 --remove-indv _string_
	   Specify an individual to be kept or removed from the	analysis. This
	   option can be used multiple times to	specify	multiple  individuals.
	   If both options are specified, then the "--indv" option is executed
	   before the "--remove-indv" option.

	 --keep	_filename_
	 --remove _filename_
	   Provide files containing a list of individuals to either include or
	   exclude in subsequent analysis. Each individual ID (as defined in
	   the VCF header line) should be included on a separate line. If both
	   options are used, then the "--keep" option is executed before the
	   "--remove" option. When multiple files are provided, the individuals
	   kept are the union of all keep files minus the union of all remove
	   files. No header line is expected.
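	   The set arithmetic for multiple keep and remove files can be
	   previewed with standard tools before running vcftools. The
	   individual IDs and file names below are made up for
	   illustration.

```shell
# One individual ID per line, no header line.
printf 'NA001\nNA002\n' > keep_a.txt
printf 'NA002\nNA003\n' > keep_b.txt
printf 'NA003\n'        > remove_a.txt

# Union of the keep files, minus the union of the remove files.
sort -u keep_a.txt keep_b.txt > keep_union.txt
sort -u remove_a.txt          > remove_union.txt
grep -vxFf remove_union.txt keep_union.txt > retained.txt

cat retained.txt

# The equivalent vcftools call (input_file.vcf and subset are placeholders).
if command -v vcftools >/dev/null 2>&1 && [ -f input_file.vcf ]; then
    vcftools --vcf input_file.vcf --keep keep_a.txt --keep keep_b.txt \
             --remove remove_a.txt --recode --out subset
fi
```

	   In this example NA001 and NA002 are retained; NA003 appears in
	   a keep file but is taken out again by the remove file.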

	 --max-indv _integer_
	   Randomly thins individuals so that only the	specified  number  are
	   retained.

GENOTYPE FILTERING OPTIONS
       These  options  are  used  to exclude genotypes from any	analysis being
       performed by the	program. If excluded, these values will	be treated  as
       missing.
	 --remove-filtered-geno-all
	   Excludes all	genotypes with a FILTER	flag not equal to "." (a miss-
	   ing value) or PASS.

	 --remove-filtered-geno	_string_
	   Excludes genotypes with a specific FILTER flag.

	 --minGQ _float_
	   Exclude all genotypes with a	quality	below the threshold specified.
	   This	 option	requires that the "GQ" FORMAT tag is specified for all
	   sites.

	 --minDP _float_
	 --maxDP _float_
	   Includes only genotypes with a depth greater than or equal to the
	   "--minDP" value and less than or equal to the "--maxDP" value. These
	   options require that the "DP" FORMAT tag is specified for all sites.

OUTPUT OPTIONS
       These options specify which analyses or conversions to perform  on  the
       data that passed	through	all specified filters.

   OUTPUT ALLELE STATISTICS
	 --freq
	 --freq2
	   Outputs  the	allele frequency for each site in a file with the suf-
	   fix ".frq". The second option is used to suppress output of any in-
	   formation about the alleles.

	 --counts
	 --counts2
	   Outputs the raw allele counts for each site in a file with the suf-
	   fix ".frq.count". The second	option is used to suppress  output  of
	   any information about the alleles.

	 --derived
	   For	use  with  the previous	four frequency and count options only.
	   Re-orders the output	file columns so	that the ancestral allele  ap-
	   pears first.	This option relies on the ancestral allele being spec-
	   ified in the	VCF file using the AA tag in the INFO field.

   OUTPUT DEPTH	STATISTICS
	 --depth
	   Generates a file containing the mean	 depth	per  individual.  This
	   file	has the	suffix ".idepth".

	 --site-depth
	   Generates  a	 file  containing the depth per	site summed across all
	   individuals.	This output file has the suffix	".ldepth".

	 --site-mean-depth
	   Generates a file containing the mean	depth per site averaged	across
	   all individuals. This output	file has the suffix ".ldepth.mean".

	 --geno-depth
	   Generates  a	 (possibly  very  large) file containing the depth for
	   each	genotype in the	VCF file. Missing entries are given the	 value
	   -1. The file	has the	suffix ".gdepth".

   OUTPUT LD STATISTICS
	 --hap-r2
	   Outputs  a file reporting the r2, D,	and D' statistics using	phased
	   haplotypes. These are the traditional measures of LD	often reported
	   in the population genetics literature. The output file has the suf-
	   fix ".hap.ld". This option assumes that  the	 VCF  input  file  has
	   phased haplotypes.

	 --geno-r2
	   Calculates  the  squared  correlation coefficient between genotypes
	   encoded as 0, 1 and 2 to represent the number of non-reference  al-
	   leles  in  each  individual.	This is	the same as the	LD measure re-
	   ported by PLINK. The	D and D' statistics  are  only	available  for
	   phased genotypes. The output	file has the suffix ".geno.ld".

	 --geno-chisq
	   If  your  data contains sites with more than	two alleles, then this
	   option can be used to test for genotype independence	via  the  chi-
	   squared statistic. The output file has the suffix ".geno.chisq".

	 --hap-r2-positions _positions list file_
	 --geno-r2-positions _positions	list file_
	   Outputs  a  file reporting the r2 statistics	of the sites contained
	   in the provided file versus all other sites. The output files have
	   the	suffix	".list.hap.ld"	or ".list.geno.ld", depending on which
	   option is used.

	 --ld-window _integer_
	   This	optional parameter defines the maximum number of SNPs  between
	   the	SNPs  being  tested for	LD in the "--hap-r2", "--geno-r2", and
	   "--geno-chisq" functions.

	 --ld-window-bp	_integer_
	   This	optional parameter defines  the	 maximum  number  of  physical
	   bases  between  the	SNPs  being  tested  for LD in the "--hap-r2",
	   "--geno-r2",	and "--geno-chisq" functions.

	 --ld-window-min _integer_
	   This	optional parameter defines the minimum number of SNPs  between
	   the	SNPs  being  tested for	LD in the "--hap-r2", "--geno-r2", and
	   "--geno-chisq" functions.

	 --ld-window-bp-min _integer_
	   This	optional parameter defines  the	 minimum  number  of  physical
	   bases  between  the	SNPs  being  tested  for LD in the "--hap-r2",
	   "--geno-r2",	and "--geno-chisq" functions.

	 --min-r2 _float_
	   This	optional parameter sets	a minimum value	for  r2,  below	 which
	   the	LD  statistic  is not reported by the "--hap-r2", "--geno-r2",
	   and "--geno-chisq" functions.

	 --interchrom-hap-r2
	 --interchrom-geno-r2
	   Outputs a file reporting the	r2 statistics for sites	 on  different
	   chromosomes.	 The output files have the suffix ".interchrom.hap.ld"
	   or ".interchrom.geno.ld", depending on the option used.

   OUTPUT TRANSITION/TRANSVERSION STATISTICS
	 --TsTv	_integer_
	   Calculates the Transition / Transversion ratio in bins of size  de-
	   fined by this option. Only uses bi-allelic SNPs. The	resulting out-
	   put file has	the suffix ".TsTv".

	 --TsTv-summary
	   Calculates a	simple summary of all Transitions  and	Transversions.
	   The output file has the suffix ".TsTv.summary".

	 --TsTv-by-count
	   Calculates the Transition / Transversion ratio as a function	of al-
	   ternative allele count. Only	uses bi-allelic	 SNPs.	The  resulting
	   output file has the suffix ".TsTv.count".

	 --TsTv-by-qual
	   Calculates the Transition / Transversion ratio as a function	of SNP
	   quality threshold. Only uses	bi-allelic SNPs. The resulting	output
	   file	has the	suffix ".TsTv.qual".

	 --FILTER-summary
	   Generates  a	summary	of the number of SNPs and Ts/Tv	ratio for each
	   FILTER category. The	output file has	the suffix ".FILTER.summary".

   OUTPUT NUCLEOTIDE DIVERSITY STATISTICS
	 --site-pi
	   Measures nucleotide diversity on a per-site basis. The output file
	   has the suffix ".sites.pi".

	 --window-pi _integer_
	 --window-pi-step _integer_
	   Measures  the nucleotide diversity in windows, with the number pro-
	   vided as the	window size. The output	file  has  the	suffix	".win-
	   dowed.pi".  The  latter is an optional argument used	to specify the
	   step	size in	between	windows.

   OUTPUT FST STATISTICS
	 --weir-fst-pop	_filename_
	   This	option is used to calculate an	Fst  estimate  from  Weir  and
	   Cockerham's	1984  paper. This is the preferred calculation of Fst.
	   The provided	file must contain a list of individuals	(one  individ-
	   ual	per line) from the VCF file that correspond to one population.
	   This	option can be used multiple times to calculate	Fst  for  more
	   than	two populations. These files will also be included as "--keep"
	   options. By default,	calculations are done on a per-site basis. The
	   output file has the suffix ".weir.fst".

	 --fst-window-size _integer_
	 --fst-window-step _integer_
	   These  options can be used with "--weir-fst-pop" to do the Fst cal-
	   culations on	a windowed basis instead of a  per-site	 basis.	 These
	   arguments specify the desired window	size and the desired step size
	   between windows.
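	   A sketch of a two-population Fst run, first per site and then
	   in windows. The population files, individual IDs, window
	   sizes, input_file.vcf and the output prefixes are hypothetical
	   choices made for the example.

```shell
# One file per population, one individual ID per line.
printf 'NA001\nNA002\nNA003\n' > population_1.txt
printf 'NA101\nNA102\nNA103\n' > population_2.txt

if command -v vcftools >/dev/null 2>&1 && [ -f input_file.vcf ]; then
    # Per-site estimate (the default).
    vcftools --vcf input_file.vcf \
             --weir-fst-pop population_1.txt \
             --weir-fst-pop population_2.txt \
             --out pop1_vs_pop2

    # Windowed estimate: 10000-base windows advanced in 5000-base steps.
    vcftools --vcf input_file.vcf \
             --weir-fst-pop population_1.txt \
             --weir-fst-pop population_2.txt \
             --fst-window-size 10000 --fst-window-step 5000 \
             --out pop1_vs_pop2_windowed
fi
```

	   Since the population files are also applied as "--keep"
	   options, individuals listed in neither file are left out of the
	   calculation.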

   OUTPUT OTHER	STATISTICS
	 --het
	   Calculates a	measure	of heterozygosity on a	per-individual	basis.
	   Specifically, the inbreeding coefficient, F, is estimated for each
	   individual using a method of	moments. The resulting	file  has  the
	   suffix ".het".

	 --hardy
	   Reports  a  p-value for each	site from a Hardy-Weinberg Equilibrium
	   test	(as defined by Wigginton, Cutler and Abecasis (2005)). The re-
	   sulting  file  (with	suffix ".hwe") also contains the Observed num-
	   bers	of Homozygotes and Heterozygotes  and  the  corresponding  Ex-
	   pected numbers under	HWE.

	 --TajimaD _integer_
	   Outputs  Tajima's  D	 statistic  in bins with size of the specified
	   number. The output file has the suffix ".Tajima.D".

	 --indv-freq-burden
	   This	option calculates the number of	variants within	each  individ-
	   ual	of  a  specific	 frequency.  The resulting file	has the	suffix
	   ".ifreqburden".

	 --LROH
	   This	option will identify and output	Long Runs of Homozygosity. The
	   output  file	has the	suffix ".LROH".	This function is experimental,
	   and will use	a lot of memory	if applied to large datasets.

	 --relatedness
	   This option is used to calculate and output a relatedness statistic
	   based on the method of Yang et al., Nature Genetics 2010
	   (doi:10.1038/ng.608). Specifically, it calculates the unadjusted
	   Ajk statistic. The expectation of Ajk is zero for individuals within
	   a population, and one for an individual with themselves. The output
	   file has the suffix ".relatedness".

	 --relatedness2
	   This	option is used to calculate and	output a relatedness statistic
	   based on the	method of  Manichaikul	et  al.,  BIOINFORMATICS  2010
	   (doi:10.1093/bioinformatics/btq559).	The output file	has the	suffix
	   ".relatedness2".

	 --site-quality
	   Generates a file containing the per-site SNP	quality, as  found  in
	   the QUAL column of the VCF file. This file has the suffix ".lqual".

	 --missing-indv
	   Generates  a	file reporting the missingness on a per-individual ba-
	   sis.	The file has the suffix	".imiss".

	 --missing-site
	   Generates a file reporting the missingness on a per-site basis. The
	   file	has the	suffix ".lmiss".

	 --SNPdensity _integer_
	   Calculates  the  number and density of SNPs in bins of size defined
	   by this option. The resulting output	file has the suffix ".snpden".

	 --kept-sites
	   Creates a file listing all sites that have been kept	after  filter-
	   ing.	The file has the suffix	".kept.sites".

	 --removed-sites
	   Creates  a file listing all sites that have been removed after fil-
	   tering. The file has	the suffix ".removed.sites".

	 --singletons
	   This option will generate a file detailing the location of
	   singletons, and the individual they occur in. The file reports both
	   true singletons and private doubletons (i.e. SNPs where the minor
	   allele only occurs in a single individual and that individual is
	   homozygous for that allele). The output file has the suffix
	   ".singletons".

	 --hist-indel-len
	   This	option will generate a histogram file of the length of all in-
	   dels	(including SNPs). It shows both	the count and  the  percentage
	   of all indels for indel lengths that	occur at least once in the in-
	   put file. SNPs are considered indels	with length zero.  The	output
	   file	has the	suffix ".indel.hist".

	 --hapcount _BED file_
	   This	option will output the number of unique	haplotypes within user
	   specified bins, as defined by the BED file. The output file has the
	   suffix ".hapcount".

	 --mendel _PED file_
	   This option is used to report Mendelian errors identified in trios.
	   The command requires a PLINK-style PED file, with the first four
	   columns specifying a family ID, the child ID, the father ID, and the
	   mother ID. The output of this command has the suffix ".mendel".

	 --extract-FORMAT-info _string_
	   Extract information from the genotype fields in the VCF file
	   relating to a specified FORMAT identifier. The resulting output file
	   has the suffix ".<FORMAT_ID>.FORMAT". For example, the following
	   command would extract all of the GT (i.e. Genotype) entries:
	     vcftools --vcf file1.vcf --extract-FORMAT-info GT

	 --get-INFO _string_
	   This	 option	 is used to extract information	from the INFO field in
	   the VCF file. The <string> argument specifies the INFO  tag	to  be
	   extracted,  and  the	 option	can be used multiple times in order to
	   extract multiple INFO entries.  The	resulting  file,  with	suffix
	   ".INFO",  contains the required INFO	information in a tab-separated
	   table. For example, to extract the NS and DB	flags, one  would  use
	   the command:
	     vcftools --vcf file1.vcf --get-INFO NS --get-INFO DB

   OUTPUT VCF FORMAT
	 --recode
	 --recode-bcf
	   These  options are used to generate a new file in either VCF	or BCF
	   from	the input VCF or BCF file after	applying the filtering options
	   specified by	the user. The output file has the suffix ".recode.vcf"
	   or ".recode.bcf". By	default, the INFO fields are removed from  the
	   output  file, as the	INFO values may	be invalidated by the recoding
	   (e.g. the total depth may need to be	 recalculated  if  individuals
	   are removed). This behavior may be overridden by the following
	   options. By default, BCF files are written out as BGZF compressed
	   files.

	 --recode-INFO _string_
	 --recode-INFO-all
	   These  options  can be used with the	above recode options to	define
	   an INFO key name to keep in the output file.	 This  option  can  be
	   used	multiple times to keep more of the INFO	fields.	The second op-
	   tion	is used	to keep	all INFO values	in the original	file.

	 --contigs _string_
	   This option can be used in conjunction with the --recode-bcf option when
	   the	input  file does not have any contig declarations. This	option
	   expects a file name with one	contig header per  line.  These	 lines
	   are included	in the output file.

   OUTPUT OTHER	FORMATS
	 --012
	   This	 option	 outputs  the genotypes	as a large matrix. Three files
	   are produced. The first, with suffix	".012",	contains the genotypes
	   of each individual on a separate line. Genotypes are	represented as
	   0, 1 and 2, where the number represents the count of non-reference
	   alleles.  Missing genotypes are represented by -1. The second file,
	   with	suffix ".012.indv" details the	individuals  included  in  the
	   main	 file. The third file, with suffix ".012.pos" details the site
	   locations included in the main file.
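	   The genotype encoding can be illustrated with a small awk
	   helper. The function name encode_012 is made up for this
	   sketch, and the rule that a genotype with any missing allele
	   is reported as -1 is an assumption; the helper is not part of
	   vcftools.

```shell
# Read one genotype per line (e.g. 0/1, 1|1, ./.) and print its code:
# the number of non-reference alleles, or -1 for a missing genotype.
encode_012() {
    awk '{
        n = split($0, allele, /[\/|]/)
        code = 0
        for (i = 1; i <= n; i++) {
            if (allele[i] == ".") { code = -1; break }
            if (allele[i] != "0") code++
        }
        print code
    }'
}

printf '0/0\n0/1\n1|1\n./.\n' | encode_012
```

	   For the four sample genotypes this prints 0, 1, 2 and -1, one
	   value per line.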

	 --IMPUTE
	   This	option outputs phased  haplotypes  in  IMPUTE  reference-panel
	   format.  As IMPUTE requires phased data, using this option also im-
	   plies --phased. Unphased individuals	and  genotypes	are  therefore
	   excluded.  Only  bi-allelic sites are included in the output. Using
	   this	option generates three files. The IMPUTE  haplotype  file  has
	   the suffix ".impute.hap", and the IMPUTE legend file	has the	suffix
	   ".impute.hap.legend".   The	 third	 file,	 with	suffix	 ".im-
	   pute.hap.indv",  details  the individuals included in the haplotype
	   file, although this file is not needed by IMPUTE.

	 --ldhat
	 --ldhat-geno
	   These options output data in LDhat format. Both options require the
	   "--chr" filter option to also be used. The first option outputs
	   phased data only, and therefore also	implies	 "--phased"  be	 used,
	   leading  to	unphased individuals and genotypes being excluded. The
	   second option treats	all of the data	 as  unphased,	and  therefore
	   outputs  LDhat  files in genotype/unphased format. Two output files
	   are generated with the suffixes ".ldhat.sites"  and	".ldhat.locs",
	   which  correspond  to  the LDhat "sites" and	"locs" input files re-
	   spectively.

	 --BEAGLE-GL
	 --BEAGLE-PL
	   These options output	genotype likelihood information	for input into
	   the	BEAGLE	program.  The  VCF  file is required to	contain	FORMAT
	   fields with "GL" or "PL" tags, which	can generally be output	by SNP
	   callers  such as the	GATK. Use of this option requires a chromosome
	   to be specified via the "--chr" option. The resulting  output  file
	   has	the  suffix ".BEAGLE.GL" or ".BEAGLE.PL" and contains genotype
	   likelihoods for biallelic sites. This file is  suitable  for	 input
	   into	BEAGLE via the "like=" argument.

	 --plink
	 --plink-tped
	 --chrom-map
	   These  options  output  the genotype	data in	PLINK PED format. With
	   the first option, two files are generated, with suffixes ".ped" and
	   ".map".  Note that only bi-allelic loci will	be output. Further de-
	   tails of these files	can be found in	the PLINK documentation.
	   Note: The first option can be very slow on  large  datasets.	 Using
	   the	--chr  option to divide	up the dataset is advised, or alterna-
	   tively use the --plink-tped option which outputs the	files  in  the
	   PLINK transposed format with	suffixes ".tped" and ".tfam".
	   For	usage  with  variant  sites  in	species	other than humans, the
	   --chrom-map option may be used to specify a file name  that	has  a
	   tab-delimited mapping of chromosome name to a desired integer value
	   with	one line per chromosome. This file must	contain	a mapping  for
	   every chromosome value found	in the file.
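	   A sketch of a chromosome map for a non-human data set. The
	   chromosome names, chrom_map.txt, input_file.vcf and the output
	   prefix are hypothetical.

```shell
# Map each chromosome name to an integer: one tab-delimited line per
# chromosome, covering every chromosome found in the input file.
printf 'chrI\t1\n'   >  chrom_map.txt
printf 'chrII\t2\n'  >> chrom_map.txt
printf 'chrIII\t3\n' >> chrom_map.txt

# Export to PLINK PED format using the mapping.
if command -v vcftools >/dev/null 2>&1 && [ -f input_file.vcf ]; then
    vcftools --vcf input_file.vcf --plink --chrom-map chrom_map.txt \
             --out plink_export
fi
```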

COMPARISON OPTIONS
       These  options are used to compare the original variant file to another
       variant file and	output the results. All	of the diff functions  require
       both files to contain the same chromosomes and that the files be	sorted
       in the same order. If one of the	files contains	chromosomes  that  the
       other  file  does not, use the --not-chr	filter to remove them from the
       analysis.

   DIFF	VCF FILE
	 --diff	_filename_
	 --gzdiff _filename_
	 --diff-bcf _filename_
	   These options compare the original input  file  to  this  specified
	   VCF,	gzipped	VCF, or	BCF file. These	options	must be	specified with
	   one additional option described below in order to specify what type
	   of comparison is to be performed. See the examples section for typ-
	   ical	usage.

   DIFF	OPTIONS
	 --diff-site
	   Outputs the sites that are common / unique to each file. The	output
	   file	has the	suffix ".diff.sites_in_files".

	 --diff-indv
	   Outputs  the	individuals that are common / unique to	each file. The
	   output file has the suffix ".diff.indv_in_files".

	 --diff-site-discordance
	   This	option calculates discordance on a site	by site	basis. The re-
	   sulting output file has the suffix ".diff.sites".

	 --diff-indv-discordance
	   This	 option	 calculates discordance	on a per-individual basis. The
	   resulting output file has the suffix	".diff.indv".

	 --diff-indv-map _filename_
	   This	option allows the user to specify a mapping of individual  IDs
	   in  the second file to those	in the first file. The program expects
	   the file to contain a tab-delimited line containing an individual's
	   name	 in  file  one followed	by that	same individual's name in file
	   two with one	mapping	per line.
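	   A sketch of an individual mapping file used together with a
	   per-individual discordance comparison. The sample names,
	   indv_map.txt, both input files and the output prefix are
	   placeholders.

```shell
# One mapping per line: the name in file one, a tab, then the name of
# the same individual in file two.
printf 'sample_A\tNA001\n' >  indv_map.txt
printf 'sample_B\tNA002\n' >> indv_map.txt

if command -v vcftools >/dev/null 2>&1 &&
   [ -f input_file1.vcf ] && [ -f input_file2.vcf ]; then
    vcftools --vcf input_file1.vcf --diff input_file2.vcf \
             --diff-indv-map indv_map.txt --diff-indv-discordance \
             --out in1_v_in2
fi
```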

	 --diff-discordance-matrix
	   This	option calculates a discordance	matrix.	This option only works
	   with	bi-allelic loci	with matching alleles that are present in both
	   files. The resulting	output	file  has  the	suffix	".diff.discor-
	   dance.matrix".

	 --diff-switch-error
	   This	 option	 calculates  phasing  errors (specifically "switch er-
	   rors"). This	option creates an output file describing switch	errors
	   found between sites,	with suffix ".diff.switch".

AUTHORS
       Adam Auton (adam.auton@einstein.yu.edu)
       Anthony Marcketta (anthony.marcketta@einstein.yu.edu)

