# FreeBSD Manual Pages

cmcalibrate(1) Infernal Manual cmcalibrate(1)

## NAME

cmcalibrate - fit exponential tails for covariance model E-value determination

## SYNOPSIS

cmcalibrate [options] cmfile

## DESCRIPTION

cmcalibrate determines exponential tail parameters for E-value determination by generating random sequences, searching them with the CM and collecting the scores of the resulting hits. A histogram of the bit scores of the hits is fit to an exponential tail, and the parameters of the fitted tail are saved to the CM file. The exponential tail parameters are then used to estimate the statistical significance of hits found in cmsearch and cmscan.

A CM file must be calibrated with cmcalibrate before it can be used in cmsearch or cmscan, with a single exception: it is not necessary to calibrate CM files that include only models with zero basepairs before running cmsearch.

cmcalibrate is very slow. It takes a couple of hours to calibrate a single average-sized CM on a single CPU. cmcalibrate will run in parallel on all available cores if Infernal was built on a system that supports POSIX threading (see the Installation section of the user guide for more information). Using _n_ cores will result in roughly _n_-fold acceleration versus a single CPU. MPI (Message Passing Interface) can also be used for parallelization with the --mpi option if Infernal was built with MPI enabled, but using more than 161 processors is not recommended because increasing past 161 won't accelerate the calibration. See the Installation section of the user guide for more information.

The --forecast option can be used to estimate how long the program will take to run for a given cmfile on the current machine. To predict the running time on _n_ processors with MPI, additionally use the --nforecast _n_ option.

The random sequences searched in cmcalibrate are generated by an HMM that was trained on real genomic sequences with various GC contents. The goal is to have the GC distributions in the random sequences be similar to those in actual genomic sequences.

Four rounds of searches and subsequent exponential tail fits are performed, one each for the four different CM algorithms that can be used in cmsearch and cmscan: glocal CYK, glocal Inside, local CYK and local Inside.

The E-value parameters determined by cmcalibrate are only used by the cmsearch and cmscan programs. If you are not going to use these programs then do not waste time calibrating your models.

## OPTIONS

**-h**
    Help; print a brief reminder of command line usage and available options.

**-L** _x_
    Set the total length of random sequences to search to _x_ megabases (Mb). By default, _x_ is 1.6 Mb. Increasing _x_ will make the exponential tail fits more precise and E-values more accurate, but will take longer (doubling _x_ will roughly double the running time). Decreasing _x_ is not recommended as it will make the fits less precise and the E-values less accurate.

## OPTIONS FOR PREDICTING REQUIRED TIME AND MEMORY

**--forecast**
    Predict the running time of the calibration of cmfile (with provided options) on the current machine and exit. The calibration is not performed. The predictions should be considered rough estimates. If multithreading is enabled (see Installation section of user guide), the timing will take into account the number of available cores.

**--nforecast** _n_
    With --forecast, specify that _n_ processors will be used for the calibration. This might be useful for predicting the running time of an MPI run with _n_ processors.

**--memreq**
    Predict the amount of required memory for calibrating cmfile (with provided options) on the current machine and exit. The calibration is not performed.

## OPTIONS CONTROLLING EXPONENTIAL TAIL FITS

**--gtailn** _x_
    Fit the exponential tail for glocal Inside and glocal CYK to the _n_ highest scores in the histogram tail, where _n_ is _x_ times the number of Mb searched. The default value of _x_ is 250. The value 250 was chosen because it works well empirically relative to other values.

**--ltailn** _x_
    Fit the exponential tail for local Inside and local CYK to the _n_ highest scores in the histogram tail, where _n_ is _x_ times the number of Mb searched. The default value of _x_ is 750. The value 750 was chosen because it works well empirically relative to other values.

**--tailp** _x_
    Ignore the --gtailn and --ltailn options and fit the _x_ fraction tail of the histogram to an exponential tail, for all search modes.

## OPTIONAL OUTPUT FILES

**--hfile** _f_
    Save the histograms fit to file _f_. The format of this file is two space-delimited columns per line. The first column is the x-axis values of bit scores of each bin. The second column is the y-axis values of number of hits per bin. Each series is delimited by a line with a single character "&". The file will contain one series for each of the four exponential tail fits in the following order: glocal CYK, glocal Inside, local CYK, and local Inside.

**--sfile** _f_
    Save survival plot information to file _f_. The format of this file is two space-delimited columns per line. The first column is the x-axis values of bit scores of each bin. The second column is the y-axis values of fraction of hits that meet or exceed the score for each bin. Each series is delimited by a line with a single character "&". The file will contain three series of data for each of the four CM search modes in the following order: glocal CYK, glocal Inside, local CYK, and local Inside. The first series is the empirical survival plot from the histogram of hits to the random sequence. The second series is the exponential tail fit to the empirical distribution. The third series is the exponential tail fit if lambda were fixed and set as the natural log of 2 (0.693147181).

**--qqfile** _f_
    Save quantile-quantile plot information to file _f_. The format of this file is two space-delimited columns per line. The first column is the x-axis values, and the second column is the y-axis values. The distance of the points from the identity line (y=x) is a measure of how good the exponential tail fit is; the closer the points are to the identity line, the better the fit. Each series is delimited by a line with a single character "&". The file will contain one series of empirical data for each of the four exponential tail fits in the following order: glocal CYK, glocal Inside, local CYK and local Inside.

**--ffile** _f_
    Save space-delimited statistics of different exponential tail fits to file _f_. The file will contain the lambda and mu values for exponential tails fit to histogram tails of different sizes. The fields in the file are labelled informatively.

**--xfile** _f_
    Save a list of the scores in each fit histogram tail to file _f_. Each line of this file will have a different score indicating one hit existed in the tail with that score. Each series is delimited by a line with a single character "&". The file will contain one series for each of the four exponential tail fits in the following order: glocal CYK, glocal Inside, local CYK, and local Inside.

## OTHER OPTIONS

**--seed** _n_
    Seed the random number generator with _n_, an integer >= 0. If _n_ is nonzero, stochastic simulations will be reproducible; the same command will give the same results. If _n_ is 0, the random number generator is seeded arbitrarily, and stochastic simulations will vary from run to run of the same command. The default seed is 181.

**--beta** _x_
    By default query-dependent banding (QDB) is used to accelerate the CM search algorithms with a beta tail loss probability of 1E-15. This beta value can be changed to _x_ with --beta _x_. The beta parameter is the amount of probability mass excluded during band calculation; higher values of beta give greater speedups but sacrifice more accuracy than lower values. (For more information on QDB see Nawrocki and Eddy, PLoS Computational Biology 3(3): e56.)

**--nonbanded**
    Turn off QDB during E-value calibration. This will slow down calibration.

**--nonull3**
    Turn off the null3 post hoc additional null model. This is not recommended unless you plan on using the same option to cmsearch and/or cmscan.

**--random**
    Use the background null model of the CM to generate the random sequences, instead of the more realistic HMM. Unless the CM was built using the --null option to cmbuild, the background null model will be 25% each A, C, G and U.

**--gc** _f_
    Generate the random sequences using the nucleotide distribution from the sequence file _f_.

**--cpu** _n_
    Specify that _n_ parallel CPU workers be used. If _n_ is set as "0", then the program will be run in serial mode, without using threads. You can also control this number by setting an environment variable, INFERNAL_NCPU. This option will only be available if the machine on which Infernal was built is capable of using POSIX threading (see the Installation section of the user guide for more information).

**--mpi**
    Run as an MPI parallel program. This option will only be available if Infernal has been configured and built with the "--enable-mpi" flag (see the Installation section of the user guide for more information).

## SEE ALSO

See infernal(1) for a master man page with a list of all the individual man pages for programs in the Infernal package.

For complete documentation, see the user guide that came with your Infernal distribution (Userguide.pdf); or see the Infernal web page.

## COPYRIGHT

Copyright (C) 2019 Howard Hughes Medical Institute. Freely distributed under the BSD open source license.

For additional information on copyright and licensing, see the file called COPYRIGHT in your Infernal source distribution, or see the Infernal web page.

## AUTHOR

The Eddy/Rivas Laboratory
Janelia Farm Research Campus
19700 Helix Drive
Ashburn VA 20147 USA
http://eddylab.org

Infernal 1.1.3 Nov 2019 cmcalibrate(1)
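
The tail-size rule given for --gtailn and --ltailn (_n_ is _x_ times the number of Mb searched) can be worked through with a few lines of Python. This is only an arithmetic sketch, not part of Infernal: the function name `tail_hit_count` is hypothetical, and rounding to the nearest integer is an assumption, since the manual does not say how cmcalibrate handles a fractional product.

```python
# Sketch of the --gtailn / --ltailn arithmetic described in the manual:
# n = x * (Mb of random sequence searched).
# tail_hit_count is a hypothetical helper, not an Infernal function.

DEFAULT_MB = 1.6        # default for -L, per the manual
DEFAULT_GTAILN = 250.0  # default for --gtailn (glocal CYK / glocal Inside)
DEFAULT_LTAILN = 750.0  # default for --ltailn (local CYK / local Inside)


def tail_hit_count(x: float, mb_searched: float = DEFAULT_MB) -> int:
    """Return how many of the highest-scoring hits would enter the fit.

    Rounding to the nearest integer is an assumption made for this sketch.
    """
    if x <= 0 or mb_searched <= 0:
        raise ValueError("x and mb_searched must be positive")
    return round(x * mb_searched)


if __name__ == "__main__":
    for label, x in (("glocal", DEFAULT_GTAILN), ("local", DEFAULT_LTAILN)):
        at_default = tail_hit_count(x)
        at_double = tail_hit_count(x, 2 * DEFAULT_MB)
        print(f"{label}: {at_default} hits at 1.6 Mb, {at_double} hits at 3.2 Mb")
```

Under these assumptions, doubling -L doubles the number of hits entering each fit, which is consistent with the manual's note that a larger _x_ for -L makes the fits more precise at roughly proportional cost in running time.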

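The --hfile, --sfile and --qqfile outputs share one layout: two space-delimited columns per line, with each series delimited by a line holding a single "&". A minimal reader for that layout might look like the sketch below. The function name `parse_series` and the sample values are invented for illustration, and the code assumes the files contain nothing besides data lines and "&" lines; it accepts "&" either between series or after each one, because the manual does not specify which.

```python
from typing import List, Tuple

Series = List[Tuple[float, float]]

# Order of the series in an --hfile, as listed in the manual.
HFILE_ORDER = ["glocal CYK", "glocal Inside", "local CYK", "local Inside"]


def parse_series(text: str) -> List[Series]:
    """Split '&'-delimited two-column text into a list of (x, y) series.

    Hypothetical helper: assumes only data lines and '&' lines are present.
    Empty series are dropped, so '&' works as separator or terminator.
    """
    series: List[Series] = []
    current: Series = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "&":
            if current:
                series.append(current)
            current = []
            continue
        x, y = line.split()[:2]
        current.append((float(x), float(y)))
    if current:
        series.append(current)
    return series


# Invented sample data in the documented layout (not real cmcalibrate output).
SAMPLE = "10.0 120\n11.0 64\n&\n9.5 200\n10.5 90\n&\n"

if __name__ == "__main__":
    for name, points in zip(HFILE_ORDER, parse_series(SAMPLE)):
        print(name, points)
```

For an --sfile, the same reader would return twelve series (three per search mode), so the caller would group them in threes rather than pairing them one-to-one with `HFILE_ORDER`.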

Want to link to this manual page? Use this URL:

<https://www.freebsd.org/cgi/man.cgi?query=cmcalibrate&sektion=1&manpath=FreeBSD+13.0-RELEASE+and+Ports>