hmmsim(1)			 HMMER Manual			     hmmsim(1)

NAME
       hmmsim -	collect	profile	score distributions on random sequences

SYNOPSIS
       hmmsim [options]	hmmfile

DESCRIPTION
       The  hmmsim  program  generates	random sequences, scores them with the
       model(s)	in hmmfile, and	outputs	various	sorts  of  histograms,	plots,
       and fitted distributions	for the	resulting scores.

       hmmsim  is  not	a  mainstream part of the HMMER	package	and most users
       would have no reason to use it. It is used to develop and test the sta-
       tistical	methods	used to	determine P-values and E-values	in HMMER3. For
       example,	it was used to generate	most of	the results in a 2008 paper on
       H3's  local  alignment  statistics  (PLoS  Comp	Bio  4:e1000069, 2008;
       http://www.ploscompbiol.org/doi/pcbi.1000069).

       Because it is a research	testbed, you should not	expect it to be	as ro-
       bust  as	other programs in the package. For example, options may	inter-
       act in weird ways; we haven't tested nor	tried to anticipate  all  dif-
       ferent possible combinations.

       The main task is to fit a maximum likelihood Gumbel distribution to
       Viterbi scores or a maximum likelihood exponential tail to high-scoring
       Forward scores, and to test that these fitted distributions obey the
       conjecture that lambda ~ log_2 (the natural log of 2, about 0.693) for
       both the Viterbi Gumbel and the Forward exponential tail.
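
       As a rough illustration of the Forward case, the sketch below fits an
       exponential tail by maximum likelihood in Python. It is not hmmsim's
       code: the function name fit_exponential_tail and its tailp argument
       are hypothetical, and it assumes the tail location tau is simply the
       lowest score kept in the tail. For an exponential with known location,
       the ML slope is the reciprocal of the mean excess over tau. The Gumbel
       fit needs an iterative solver and is not shown.

```python
def fit_exponential_tail(scores, tailp=0.02):
    """Fit an exponential tail to the top `tailp` fraction of scores.

    Returns (tau, lam). Assumption: tau is the lowest score in the tail.
    """
    ranked = sorted(scores, reverse=True)
    ntail = max(1, int(round(len(ranked) * tailp)))
    tail = ranked[:ntail]
    tau = tail[-1]
    # ML slope for an exponential with known location: 1 / mean excess.
    mean_excess = sum(x - tau for x in tail) / ntail
    if mean_excess <= 0.0:
        raise ValueError("tail scores are all identical; cannot fit a slope")
    return tau, 1.0 / mean_excess
```

       If the conjecture holds, the fitted slope for Forward scores should
       come out near 0.693.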

       The  output is a	table of numbers, one row for each model. Four differ-
       ent parametric fits to the score	data are tested: (1)  maximum  likeli-
       hood  fits to both location (mu/tau) and	slope (lambda) parameters; (2)
       assuming	lambda=log_2, maximum likelihood fit to	the location parameter
       only;  (3)  same	 but  assuming an edge-corrected lambda, using current
       procedures in H3	[Eddy, 2008]; and (4) using both parameters determined
       by H3's current procedures. The standard quick-and-dirty statistic
       for goodness-of-fit is 'E@10', the calculated E-value of the 10th
       ranked top hit, which we expect to be about 10.

       In detail, the columns of the output are:

       name   Name of the model.

       tailp  Fraction of the highest scores used to fit the distribution. For
	      Viterbi, MSV, and	Hybrid scores, this defaults to	1.0 (a	Gumbel
	      distribution  is	fitted	to  all	the data). For Forward scores,
	      this defaults to 0.02 (an	exponential  tail  is  fitted  to  the
	      highest 2% scores).

       mu/tau Location parameter for the maximum likelihood fit	to the data.

       lambda Slope parameter for the maximum likelihood fit to	the data.

       E@10   The E-value calculated for the 10th ranked high score ('E@10')
              using the ML mu/tau and lambda. By definition, this is expected
              to be about 10 if E-value estimation is accurate.

       mufix  Location	parameter,  for	 a maximum likelihood fit with a known
	      (fixed) slope parameter lambda of	log_2 (0.693).

       E@10fix
	      The E-value calculated for the 10th ranked score using mufix and
	      the expected lambda = log_2 = 0.693.

       mufix2 Location	parameter,  for	a maximum likelihood fit with an edge-
	      effect-corrected lambda.

       E@10fix2
	      The E-value calculated for the 10th ranked  score	 using	mufix2
	      and the edge-effect-corrected lambda.

       pmu    Location parameter as determined by H3's estimation procedures.

       plambda
	      Slope parameter as determined by H3's estimation procedures.

       pE@10  The  E-value  calculated	for  the  10th ranked score using pmu,
	      plambda.

       At the end of this table, one more line is printed, starting with # and
       summarizing the overall CPU time	used by	the simulations.
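
       The mufix and E@10fix columns can be mimicked with a few lines of
       Python. This is a sketch under stated assumptions, not hmmsim's
       implementation: it assumes the standard Gumbel survival function
       P(S>x) = 1 - exp(-exp(-lambda(x - mu))), the closed-form ML location
       estimate for a known lambda, and an E-value defined as the number of
       sequences times P(S>x). All function names are made up for
       illustration.

```python
import math

LOG2 = math.log(2.0)  # natural log of 2, about 0.693


def gumbel_surv(x, mu, lam):
    """Gumbel survival function P(S > x) = 1 - exp(-exp(-lam * (x - mu)))."""
    return -math.expm1(-math.exp(-lam * (x - mu)))


def fit_mu_fixed_lambda(scores, lam=LOG2):
    """ML estimate of the Gumbel location mu when lambda is known."""
    mean_term = sum(math.exp(-lam * x) for x in scores) / len(scores)
    return -math.log(mean_term) / lam


def e_at_rank(scores, mu, lam, rank=10):
    """E-value of the rank-th highest score: N * P(S > x_rank)."""
    ranked = sorted(scores, reverse=True)
    return len(scores) * gumbel_surv(ranked[rank - 1], mu, lam)
```

       With at least ten scores in a list xs, e_at_rank(xs,
       fit_mu_fixed_lambda(xs), LOG2) plays the role of E@10fix; a value far
       from 10 suggests the fixed-lambda fit is poor.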

       Some  of	the optional output files are in xmgrace xy format. xmgrace is
       powerful	and freely available graph-plotting software.

OPTIONS
       -h     Help; print a brief reminder  of	command	 line  usage  and  all
	      available	options.

       -a     Collect  expected	 Viterbi alignment length statistics from each
	      simulated	sequence. This only works with Viterbi scores (the de-
	      fault;  see  --vit).   Two  additional fields are	printed	in the
	      output table for each model: the mean length of  Viterbi	align-
	      ments, and the standard deviation.

       -v     (Verbose). Print the scores too, one score per line.

       -L _n_ Set the length of	the randomly sampled (nonhomologous) sequences
	      to _n_.  The default is 100.

       -N _n_ Set the number of	randomly sampled sequences to  _n_.   The  de-
	      fault is 1000.

       --mpi  Run  under MPI control with master/worker	parallelization	(using
	      mpirun, for example, or equivalent). Only	available if  optional
	      MPI support was enabled at compile-time.

	      It is parallelized at the	level of sending one profile at	a time
	      to an MPI	worker process,	so parallelization only	helps  if  you
	      have  more than one profile in the hmmfile, and you want to have
	      at least as many profiles	as MPI worker processes.

OPTIONS	CONTROLLING OUTPUT
       -o _f_ Save the main output table to a file _f_ rather than sending  it
	      to stdout.

       --afile _f_
	      When  collecting	Viterbi	 alignment statistics (the -a option),
	      for each sampled sequence, output	two fields per line to a  file
	      _f_:  the	 length	 of the	optimal	alignment, and the Viterbi bit
	      score.  Requires that the	-a option is also used.

       --efile _f_
	      Output a rank vs.	E-value	plot in	XMGRACE	xy format to file _f_.
	      The  x-axis  is the rank of this sequence, from highest score to
	      lowest; the y-axis is the	E-value	calculated for this  sequence.
	      E-values	are calculated using H3's default procedures (i.e. the
	      pmu, plambda parameters in the output table). You	expect a rough
	      match  between rank and E-value if E-values are accurately esti-
	      mated.

       --ffile _f_
	      Output a "filter power" file to _f_: for each model, a line with
	      three  fields:  model  name,  number of sequences	passing	the P-
	      value threshold, and fraction of sequences passing  the  P-value
	      threshold.  See  --pthresh  for  setting	the P-value threshold,
	      which defaults to	0.02 (the default MSV filter threshold in H3).
	      The  P-values  are as determined by H3's default procedures (the
	      pmu,plambda parameters in	the output table).  If	all  is	 well,
	      you  expect  to  see filter power	equal to the predicted P-value
	      setting of the threshold.
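
              The two numeric fields of a filter power line amount to a
              count and a fraction. The Python sketch below shows that
              arithmetic for a list of P-values; the function name is
              hypothetical, and whether hmmsim treats a P-value exactly
              equal to the threshold as passing is an assumption here.

```python
def filter_power(pvalues, pthresh=0.02):
    """Return (number passing, fraction passing) for a P-value threshold.

    Assumption: a sequence passes when its P-value is <= pthresh.
    """
    if not pvalues:
        raise ValueError("need at least one P-value")
    npass = sum(1 for p in pvalues if p <= pthresh)
    return npass, npass / len(pvalues)
```

              For well-calibrated P-values on random sequences, the returned
              fraction should be close to pthresh itself.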

       --pfile _f_
              Output cumulative survival plots (P(S>x)) to file _f_ in XMGRACE
              xy format. There are three plots: (1) the observed score
              distribution; (2) the maximum likelihood fitted distribution;
              (3) a maximum likelihood fit to the location parameter (mu/tau)
              while assuming lambda=log_2.

       --xfile _f_
	      Output  the  bit	scores	as  a binary array of double-precision
	      floats (8	bytes per score) to file _f_.  Programs	 like  Easel's
	      esl-histplot  can	 read  such  binary files. This	is useful when
	      generating extremely large sample	sizes.
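
              Such a file can also be read without Easel. The Python sketch
              below assumes the file is a bare sequence of 8-byte doubles in
              the byte order of the machine that wrote it, with no header;
              that layout is inferred from the description above, and the
              function name is hypothetical.

```python
import array


def read_xfile_scores(path):
    """Read a headerless file of native-endian 8-byte doubles into a list."""
    scores = array.array("d")
    with open(path, "rb") as fh:
        data = fh.read()
    if len(data) % scores.itemsize != 0:
        raise ValueError("file size is not a multiple of 8 bytes")
    scores.frombytes(data)
    return scores.tolist()
```

              If the file was written on a machine with the opposite byte
              order, call scores.byteswap() before converting to a list.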

OPTIONS	CONTROLLING MODEL CONFIGURATION	(MODE)
       H3 only uses multihit local alignment (--fs mode), and this is the
       mode for which we trust the statistical fits. Unihit local alignment
       scores (Smith/Waterman; --sw mode) also obey our statistical
       conjectures. Glocal alignment statistics (either multihit or unihit)
       are still not adequately understood nor adequately fitted.

       --fs   Collect multihit local alignment scores. This  is	 the  default.
	      "fs" comes from HMMER2's historical terminology for multihit lo-
	      cal alignment as 'fragment search	mode'.

       --sw   Collect unihit local alignment scores. The H3 J  state  is  dis-
	      abled.  "sw" comes from HMMER2's historical terminology for uni-
	      hit local	alignment as 'Smith/Waterman search mode'.

       --ls   Collect multihit glocal alignment scores. In glocal
              (global/local) alignment, the entire model must align to a
              subsequence of the target. The H3 local entry/exit transition
              probabilities are disabled. "ls" comes from HMMER2's historical
              terminology for multihit glocal alignment as 'local search
              mode'.

       --s    Collect unihit glocal alignment scores.  Both the	H3 J state and
	      local  entry/exit	 transition  probabilities  are	 disabled. 's'
	      comes from HMMER2's historical  terminology  for	unihit	glocal
	      alignment.

OPTIONS	CONTROLLING SCORING ALGORITHM
       --vit  Collect Viterbi maximum likelihood alignment scores. This	is the
	      default.

       --fwd  Collect Forward log-odds likelihood scores, summed  over	align-
	      ment ensemble.

       --hyb  Collect 'Hybrid' scores, as described in papers by Yu and Hwa
              (for instance, Bioinformatics 18:864, 2002). These involve
              calculating a Forward matrix and taking the maximum cell value.
              The number itself is statistically somewhat unmotivated, but
              the distribution is expected to be a well-behaved extreme value
              distribution (Gumbel).

       --msv  Collect MSV (multiple ungapped segment  Viterbi)	scores,	 using
	      H3's main	acceleration heuristic.

       --fast For  any of the above options, use H3's optimized	production im-
	      plementation (using SIMD vectorization). The default is  to  use
	      the  "generic" implementation (slow and non-vectorized). The op-
	      timized implementations sacrifice	a small	 amount	 of  numerical
	      precision. This can introduce confounding	noise into statistical
	      simulations and fits, so when one	gets super-concerned about ex-
	      act  details,  it's  better  to be able to factor	that source of
	      noise out.

OPTIONS	CONTROLLING FITTED TAIL	MASSES FOR FORWARD
       In some experiments, it was useful to fit Forward scores	to a range  of
       different  tail	masses,	 rather	than just one. These options provide a
       mechanism for fitting an	evenly-spaced range of different tail  masses.
       For each	different tail mass, a line is generated in the	output.

       --tmin _x_
	      Set  the lower bound on the tail mass distribution. (The default
	      is 0.02 for the default single tail mass.)

       --tmax _x_
	      Set the upper bound on the tail mass distribution. (The  default
	      is 0.02 for the default single tail mass.)

       --tpoints _n_
	      Set  the	number	of tail	masses to sample, starting from	--tmin
	      and ending at --tmax.  (The default is 1,	for the	 default  0.02
	      single tail mass.)

       --tlinear
	      Sample  a	 range of tail masses with uniform linear spacing. The
	      default is to use	uniform	logarithmic spacing.
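
              The difference between the two spacings can be sketched in
              Python as follows. This assumes both endpoints are included
              and that logarithmic spacing means a constant ratio between
              successive tail masses; the function name and signature are
              hypothetical and do not come from hmmsim.

```python
def tail_masses(tmin, tmax, npoints, linear=False):
    """Return npoints tail masses from tmin to tmax, endpoints included."""
    if npoints < 1:
        raise ValueError("npoints must be at least 1")
    if npoints == 1:
        return [tmin]
    if linear:
        # Constant difference between successive tail masses.
        step = (tmax - tmin) / (npoints - 1)
        return [tmin + i * step for i in range(npoints)]
    # Constant ratio between successive tail masses (logarithmic spacing).
    ratio = (tmax / tmin) ** (1.0 / (npoints - 1))
    return [tmin * ratio ** i for i in range(npoints)]
```

              For example, three points between 0.01 and 1.0 fall near 0.01,
              0.1, and 1.0 with logarithmic spacing, but at 0.01, 0.505, and
              1.0 with linear spacing.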

OPTIONS	CONTROLLING H3 PARAMETER ESTIMATION METHODS
       H3 uses three short random sequence simulations to estimate the
       location parameters for the expected score distributions for MSV
       scores, Viterbi scores, and Forward scores. These options allow these
       simulations to be modified.

       --EmL _n_
	      Sets  the	sequence length	in simulation that estimates the loca-
	      tion parameter mu	for MSV	E-values. Default is 200.

       --EmN _n_
	      Sets the number of sequences in simulation  that	estimates  the
	      location parameter mu for	MSV E-values. Default is 200.

       --EvL _n_
	      Sets  the	sequence length	in simulation that estimates the loca-
	      tion parameter mu	for Viterbi E-values. Default is 200.

       --EvN _n_
	      Sets the number of sequences in simulation  that	estimates  the
	      location parameter mu for	Viterbi	E-values. Default is 200.

       --EfL _n_
	      Sets  the	sequence length	in simulation that estimates the loca-
	      tion parameter tau for Forward E-values. Default is 100.

       --EfN _n_
	      Sets the number of sequences in simulation  that	estimates  the
	      location parameter tau for Forward E-values. Default is 200.

       --Eft _x_
              Sets the tail mass fraction to fit in the simulation that
              estimates the location parameter tau for Forward E-values.
              Default is 0.04.

DEBUGGING OPTIONS
       --stall
              For debugging the MPI master/worker version: pause after start,
              to enable the developer to attach debuggers to the running
              master and worker processes. Send a SIGCONT signal to release
              the pause. (Under gdb: (gdb) signal SIGCONT) (Only available if
              optional MPI support was enabled at compile-time.)

       --seed _n_
	      Set  the	random	number	seed  to _n_.  The default is 0, which
	      makes the	random number generator	use an arbitrary seed, so that
	      different	 runs  of hmmsim will almost certainly generate	a dif-
	      ferent statistical sample.  For debugging, it is useful to force
	      reproducible results, by fixing a	random number seed.

EXPERIMENTAL OPTIONS
       These options were used in a small variety of different exploratory ex-
       periments.

       --bgflat
	      Set the background residue distribution to a  uniform  distribu-
	      tion,  both  for	purposes of the	null model used	in calculating
	      scores, and for generating the random sequences. The default  is
	      to use a standard	amino acid background frequency	distribution.

       --bgcomp
	      Set  the background residue distribution to the mean composition
	      of the profile. This was used in exploring some of  the  effects
	      of biased	composition.

       --x-no-lengthmodel
	      Turn the H3 target sequence length model off. Set	the self-tran-
	      sitions for N,C,J	and the	null model to  350/351	instead;  this
	      emulates	HMMER2.	  Not a	good idea in general. This was used to
	      demonstrate one of the main H2 vs. H3 differences.

       --nu _x_
              Set the nu parameter for the MSV algorithm -- the expected
              number of ungapped local alignments per target sequence. The
              default is 2.0, corresponding to an E->J transition probability
              of 0.5. This was used to test whether varying nu has a
              significant effect on the results (it doesn't seem to, within
              reason). This option only works if --msv is selected (it only
              affects MSV), and it will not work with --fast (because the
              optimized implementations are hardwired to assume nu=2.0).
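
              The pairing of nu=2.0 with an E->J probability of 0.5 is what
              one gets if the number of ungapped segments per sequence is
              geometrically distributed with mean nu. That model is an
              assumption made for this Python sketch, not a statement taken
              from this page, and the function name is hypothetical.

```python
def ej_probability(nu):
    """E->J probability t such that a geometric segment count has mean nu.

    Assumption: P(k segments) = t**(k-1) * (1 - t), so the mean is
    1 / (1 - t) and therefore t = (nu - 1) / nu.
    """
    if nu < 1.0:
        raise ValueError("nu must be at least 1 (one segment per sequence)")
    return (nu - 1.0) / nu
```

              Under this assumption, nu=1.0 gives a probability of 0, meaning
              at most one ungapped segment per sequence.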

       --pthresh _x_
	      Set  the	filter	P-value	 threshold to use in generating	filter
	      power files with --ffile.	 The default is	0.02 (which  would  be
	      appropriate  for	testing	 MSV scores, since this	is the default
	      MSV filter threshold in H3's acceleration	pipeline.)  Other  ap-
	      propriate	 choices  (matching defaults in	the acceleration pipe-
	      line) would be 0.001 for Viterbi,	and 1e-5 for Forward.

SEE ALSO
       See hmmer(1) for	a master man page with a list of  all  the  individual
       man pages for programs in the HMMER package.

       For complete documentation, see the user guide that came with your
       HMMER distribution (Userguide.pdf); or see the HMMER web page
       (http://hmmer.org/).

COPYRIGHT
       Copyright (C) 2019 Howard Hughes	Medical	Institute.
       Freely distributed under	the BSD	open source license.

       For  additional	information  on	 copyright and licensing, see the file
       called COPYRIGHT	in your	HMMER source distribution, or  see  the	 HMMER
       web page	(http://hmmer.org/).

AUTHOR
       http://eddylab.org

HMMER 3.3			   Nov 2019			     hmmsim(1)
