cmalign(1)                     Infernal Manual                     cmalign(1)

NAME
       cmalign - align sequences to a covariance model

SYNOPSIS
       cmalign [options] _cmfile_ _seqfile_

DESCRIPTION
       cmalign aligns the RNA sequences in _seqfile_ to the covariance model
       (CM) in _cmfile_. The new alignment is output to stdout in Stockholm
       format, but can be redirected to a file _f_ with the -o _f_ option.

       Either _cmfile_ or _seqfile_ (but not both) may be '-' (dash), which
       means reading this input from stdin rather than a file.

       The sequence file _seqfile_ must be in FASTA or Genbank format.

       cmalign uses an HMM banding technique to accelerate alignment by
       default as described below for the --hbanded option. HMM banding can
       be turned off with the --nonbanded option.

       By default, cmalign computes the alignment with maximum expected
       accuracy that is consistent with constraints (bands) derived from an
       HMM, using a banded version of the Durbin/Holmes optimal accuracy
       algorithm. This behavior can be changed with the --cyk or --sample
       options.

       cmalign takes special care to correctly align truncated sequences,
       where some nucleotides from the beginning (5') and/or end (3') of the
       actual full length biological sequence are not present in the input
       sequence (see DL Kolbe and SR Eddy, Bioinformatics, 25:1236-1243,
       2009). This behavior is on by default, but can be turned off with
       --notrunc. In previous versions of cmalign the --sub option was
       required to appropriately handle truncated sequences. The --sub
       option is still available in this version, but the new default method
       for handling truncated sequences should be as good or superior to the
       sub method in nearly all cases.

       The --mapali _s_ option allows inclusion of the fixed training
       alignment used to build the CM from file _s_ within the output
       alignment of cmalign.

       It is possible to merge two or more alignments created by the same CM
       using the Easel miniapp esl-alimerge (included in the easel/miniapps/
       subdirectory of Infernal).
       Previous versions of cmalign included options to merge alignments but
       they were deprecated upon development of esl-alimerge, which is
       significantly more memory efficient.

       By default, cmalign will output the alignment to stdout. The
       alignment can be redirected to an output file _f_ with the -o _f_
       option. With -o, information on each aligned sequence, including
       score and model alignment boundaries, will be printed to stdout (more
       on this below).

       The output alignment will be in Stockholm format by default. This can
       be changed to Pfam, aligned FASTA (AFA), A2M, Clustal, or Phylip
       format using the --outformat _s_ option, where _s_ is the name of the
       desired format. As a special case, if the output alignment is large
       (more than 10,000 sequences or more than 10,000,000 total
       nucleotides) then the output format will be Pfam format, with each
       sequence appearing on a single line, for reasons of memory
       efficiency. For alignments larger than this, using --ileaved will
       force interleaved Stockholm format, but the user should be aware that
       this may require a lot of memory. --ileaved will only work for
       alignments up to 100,000 sequences or 100,000,000 total nucleotides.

       If the output alignment format is Stockholm or Pfam, the output
       alignment will be annotated with posterior probabilities which
       estimate the confidence level of each aligned nucleotide. This
       annotation appears as lines beginning with "#=GR <seq name> PP", one
       per sequence, each immediately below the corresponding aligned
       sequence "<seq name>". Characters in PP lines have 12 possible
       values: "0-9", "*", or ".". If ".", the position corresponds to a gap
       in the sequence. A value of "0" indicates a posterior probability of
       between 0.0 and 0.05, "1" indicates between 0.05 and 0.15, "2"
       indicates between 0.15 and 0.25 and so on up to "9" which indicates
       between 0.85 and 0.95. A value of "*" indicates a posterior
       probability of between 0.95 and 1.0.
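
       The PP character encoding described above can be sketched in Python
       as follows. This is an illustrative helper, not part of Infernal; the
       function name pp_range is hypothetical, and the ranges are taken
       directly from the text above.

```python
def pp_range(ch):
    """Map one PP annotation character to its (low, high) posterior
    probability range, or None for a gap ('.').

    Hypothetical helper for illustration; the ranges follow the
    description in this man page.
    """
    if ch == ".":
        return None              # gap in the sequence
    if ch == "*":
        return (0.95, 1.0)
    if ch == "0":
        return (0.0, 0.05)
    if len(ch) == 1 and ch in "123456789":
        d = int(ch)
        # "1" -> 0.05..0.15, "2" -> 0.15..0.25, ..., "9" -> 0.85..0.95
        return (round(d / 10 - 0.05, 2), round(d / 10 + 0.05, 2))
    raise ValueError(f"not a PP character: {ch!r}")


def decode_pp(pp_string):
    """Decode every character of a PP annotation string."""
    return [pp_range(c) for c in pp_string]
```

       For example, decode_pp("*9.") would give the ranges for a
       high-confidence nucleotide, a "9" nucleotide, and a gap.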
       Higher posterior probabilities correspond to greater confidence that
       the aligned nucleotide belongs where it appears in the alignment.
       With --nonbanded, the calculation of the posterior probabilities
       considers all possible alignments of the target sequence to the CM.
       Without --nonbanded (i.e. in default mode), the calculation considers
       only possible alignments within the HMM bands. Further, the posterior
       probabilities are conditional on the truncation mode of the
       alignment. For example, if the sequence alignment is truncated 5', a
       PP value of "9" indicates that between 0.85 and 0.95 of all 5'
       truncated alignments include the given nucleotide at the given
       position. The posterior annotation can be turned off with the
       --noprob option. If --small is enabled, posterior annotation must
       also be turned off using --noprob.

       The tabular output that is printed to stdout if the -o option is used
       includes one line per sequence and twelve fields per line:

       "idx"        the index of the sequence in the input file
       "seq name"   the sequence name
       "length"     the length of the sequence
       "cm from", "cm to"
                    the model start and end positions of the alignment
       "trunc"      "no" if the sequence is not truncated, "5'" if the
                    beginning of the sequence is truncated, "3'" if the end
                    of the sequence is truncated, and "5'&3'" if both the
                    beginning and the end are truncated
       "bit sc"     the bit score of the alignment
       "avg pp"     the average posterior probability of all aligned
                    nucleotides in the alignment
       "band calc", "alignment", "total"
                    the time in seconds required for calculating HMM bands,
                    computing the alignment, and complete processing of the
                    sequence, respectively
       "mem (Mb)"   the size in Mb of all dynamic programming matrices
                    required for aligning the sequence

       This tabular data can be saved to file _f_ with the --sfile _f_
       option.

OPTIONS
       -h
              Help; print a brief reminder of command line usage and
              available options.

       -o _f_
              Save the alignment in Stockholm format to a file _f_.
              The default is to write it to standard output.

       -g
              Configure the model for global alignment of the query model to
              the target sequences. By default, the model is configured for
              local alignment. Local alignment allows large insertions and
              deletions in the structure, called "local ends", which are
              penalized differently than normal indels. These are annotated
              as "~" columns in the RF line of the output alignment. The -g
              option can be used to disallow these local ends. The -g option
              is required if the --sub option is also used.

OPTIONS FOR CONTROLLING THE ALIGNMENT ALGORITHM
       --optacc
              Align sequences using the Durbin/Holmes optimal accuracy
              algorithm. This is the default. The optimal accuracy alignment
              will be constrained by HMM bands for acceleration unless the
              --nonbanded option is enabled. The optimal accuracy algorithm
              determines the alignment that maximizes the posterior
              probabilities of the aligned nucleotides within it. The
              posterior probabilities are determined using (possibly HMM
              banded) variants of the Inside and Outside algorithms.

       --cyk
              Do not use the Durbin/Holmes optimal accuracy algorithm to
              align the sequences; instead use the CYK algorithm, which
              determines the optimally scoring (maximum likelihood)
              alignment of the sequence to the model, given the HMM bands
              (unless --nonbanded is also enabled).

       --sample
              Sample an alignment from the posterior distribution of
              alignments. The posterior distribution is determined using an
              HMM banded (unless --nonbanded) variant of the Inside
              algorithm.

       --seed _n_
              Seed the random number generator with _n_, an integer >= 0.
              This option can only be used in combination with --sample. If
              _n_ is nonzero, stochastic sampling of alignments will be
              reproducible; the same command will give the same results. If
              _n_ is 0, the random number generator is seeded arbitrarily,
              and stochastic samplings may vary from run to run of the same
              command. The default seed is 181.

       --notrunc
              Turn off truncated alignment algorithms.
              All sequences in the input file will be assumed to be full
              length, unless --sub is also used, in which case the program
              can still handle truncated sequences but will use an
              alternative strategy for their alignment.

       --sub
              Turn on the sub model construction and alignment procedure.
              For each sequence, an HMM is first used to predict the model
              start and end consensus columns, and a new sub CM is
              constructed that only models consensus columns from start to
              end. The sequence is then aligned to this sub CM. Sub
              alignment is an older method than the default one for aligning
              sequences that are possibly truncated. By default, cmalign
              uses special DP algorithms to handle truncated sequences which
              should be more accurate than the sub method in most cases.
              --sub is still included as an option mainly for testing
              against this default truncated sequence handling. This "sub
              CM" procedure is not the same as the "sub CMs" described by
              Weinberg and Ruzzo.

OPTIONS FOR CONTROLLING SPEED AND MEMORY REQUIREMENTS
       --hbanded
              This option is turned on by default. Accelerate alignment by
              pruning away regions of the CM DP matrix that are deemed
              negligible by an HMM. First, each sequence is scored with a CM
              plan 9 HMM derived from the CM using the Forward and Backward
              HMM algorithms to calculate posterior probabilities that each
              nucleotide aligns to each state of the HMM. These posterior
              probabilities are used to derive constraints (bands) on the CM
              DP matrix. Finally, the target sequence is aligned to the CM
              using the banded DP matrix, during which cells outside the
              bands are ignored. Usually most of the full DP matrix lies
              outside the bands (often more than 95%), making this technique
              faster because fewer DP calculations are required, and more
              memory efficient because only cells within the bands need be
              allocated.

              Importantly, HMM banding sacrifices the guarantee of
              determining the optimally accurate or optimal alignment, which
              will be missed if it lies outside the bands.
              The tau parameter is the amount of probability mass considered
              negligible during HMM band calculation; higher values of tau
              yield greater speedups but also a greater chance of missing
              the optimal alignment. The default tau is 1E-7, determined
              empirically as a good tradeoff between sensitivity and speed,
              though this value can be changed with the --tau <x> option.
              The level of acceleration increases with both the length and
              primary sequence conservation level of the family. For
              example, with the default tau of 1E-7, tRNA models (low
              primary sequence conservation with length of about 75
              nucleotides) show about 10X acceleration, and SSU bacterial
              rRNA models (high primary sequence conservation with length of
              about 1500 nucleotides) show about 700X. HMM banding can be
              turned off with the --nonbanded option.

       --tau _x_
              Set the tail loss probability used during HMM band calculation
              to _x_. This is the amount of probability mass within the HMM
              posterior probabilities that is considered negligible. The
              default value is 1E-7. In general, higher values will result
              in greater acceleration, but increase the chance of missing
              the optimal alignment due to the HMM bands.

       --mxsize _x_
              Set the maximum allowable total DP matrix size to _x_
              megabytes. By default this size is 1028 Mb. This should be
              large enough for the vast majority of alignments; however, if
              it is not, cmalign will attempt to iteratively tighten the HMM
              bands it uses to constrain the alignment by raising the tau
              parameter and recalculating the bands until the total matrix
              size needed falls below _x_ megabytes or the maximum allowable
              tau value (0.05 by default, but changeable with --maxtau) is
              reached. At each iteration of band tightening, tau is
              multiplied by 2.0. The band tightening strategy can be turned
              off with the --fixedtau option.
              If the maximum tau is reached and the required matrix size
              still exceeds _x_, or if HMM banding is not being used and the
              required matrix size exceeds _x_, then cmalign will exit
              prematurely and report an error message that the matrix
              exceeded its maximum allowable size. In this case, the
              --mxsize option can be used to raise the size limit or the
              maximum tau can be raised with --maxtau. The limit will
              commonly be exceeded when the --nonbanded option is used
              without the --small option, but can still occur when
              --nonbanded is not used. Note that if cmalign is being run in
              _n_ threads on a multicore machine then each thread may have
              an allocated matrix of up to size _x_ Mb at any given time.

       --fixedtau
              Turn off the HMM band tightening strategy described in the
              explanation of the --mxsize option above.

       --maxtau _x_
              Set the maximum allowed value for tau during band tightening,
              described in the explanation of --mxsize above, to _x_. By
              default this value is 0.05.

       --nonbanded
              Turn off HMM banding. The returned alignment is guaranteed to
              be the globally optimally accurate one (by default) or the
              globally optimally scoring one (if --cyk is enabled). The
              --small option is recommended in combination with this option,
              because standard alignment without HMM banding requires a lot
              of memory (see --small).

       --small
              Use the divide and conquer CYK alignment algorithm described
              in SR Eddy, BMC Bioinformatics 3:18, 2002. The --nonbanded
              option must be used in combination with this option. Also, it
              is recommended whenever --nonbanded is used that --small is
              also used because standard CM alignment without HMM banding
              requires a lot of memory, especially for large RNAs. --small
              allows CM alignment within practical memory limits, reducing
              the memory required for aligning LSU rRNA, the largest known
              RNAs, from 150 Gb to less than 300 Mb. This option can only be
              used in combination with --nonbanded, --notrunc, and --cyk.
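
       The band tightening loop described under --mxsize can be sketched in
       Python as follows. This is a minimal illustration under stated
       assumptions, not cmalign's actual code: the callable required_mb is a
       hypothetical stand-in for recalculating the bands and measuring the
       matrix size, and the sketch simply gives up when the next doubling
       would exceed the maximum tau (the man page does not say how the
       program treats that last step).

```python
def tighten_tau(required_mb, mxsize=1028.0, tau=1e-7, maxtau=0.05,
                factor=2.0):
    """Raise tau until the banded DP matrix fits in mxsize megabytes.

    required_mb is a hypothetical callable mapping tau to the matrix
    size in Mb. The defaults mirror the values quoted in this man page.
    Returns the first tau whose matrix fits, or raises RuntimeError.
    """
    while required_mb(tau) > mxsize:
        next_tau = tau * factor          # tau is multiplied by 2.0 per step
        if next_tau > maxtau:
            # Assumption: stop here rather than clamp to maxtau.
            raise RuntimeError(
                "matrix exceeds maximum allowable size; "
                "raise --mxsize or --maxtau")
        tau = next_tau
    return tau
```

       With a toy size function such as
       lambda tau: 2000.0 if tau < 3e-7 else 500.0, the loop doubles tau
       twice before the matrix fits.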

OPTIONAL OUTPUT FILES
       --sfile _f_
              Dump per-sequence alignment score and timing information to
              file _f_. The format of this file is described above (it's the
              same data in the same format as the tabular stdout output when
              the -o option is used).

       --tfile _f_
              Dump tabular sequence tracebacks for each individual sequence
              to a file _f_. Primarily useful for debugging.

       --ifile _f_
              Dump per-sequence insert information to file _f_. The format
              of the file is described by "#"-prefixed comment lines
              included at the top of the file _f_. The insert information is
              valid even when the --matchonly option is used.

       --elfile _f_
              Dump per-sequence EL state (local end) insert information to
              file _f_. The format of the file is described by "#"-prefixed
              comment lines included at the top of the file _f_. The EL
              insert information is valid even when the --matchonly option
              is used.

OTHER OPTIONS
       --mapali _f_
              Read the alignment used to build the model from file _f_ and
              align it as a single object to the CM; i.e. the alignment in
              _f_ is held fixed. This allows you to align sequences to a
              model with cmalign and view them in the context of an existing
              trusted multiple alignment. _f_ must be the alignment file
              that the CM was built from. The program verifies that the
              checksum of the file matches that of the file used to
              construct the CM. A similar option to this one was called
              --withali in previous versions of cmalign.

       --mapstr
              Must be used in combination with --mapali _f_. Propagate
              structural information for any pseudoknots that exist in _f_
              to the output alignment. A similar option to this one was
              called --withstr in previous versions of cmalign.

       --informat _s_
              Assert that the input _seqfile_ is in format _s_. Do not run
              Babelfish format autodetection. This increases the reliability
              of the program somewhat, because the Babelfish can make
              mistakes; particularly recommended for unattended,
              high-throughput runs of Infernal. Acceptable formats are:
              FASTA, GENBANK, and DDBJ.
              _s_ is case-insensitive.

       --outformat _s_
              Specify the output alignment format as _s_. Acceptable formats
              are: Pfam, AFA, A2M, Clustal, and Phylip. AFA is aligned
              fasta. Only Pfam and Stockholm alignment formats will include
              consensus structure annotation and posterior probability
              annotation of aligned residues.

       --dnaout
              Output the alignments as DNA sequence alignments, instead of
              RNA ones.

       --noprob
              Do not annotate the output alignment with posterior
              probabilities.

       --matchonly
              Only include match columns in the output alignment; do not
              include any insertions relative to the consensus model. This
              option may be useful when creating very large alignments that
              require a lot of memory and disk space, most of which is
              necessary only to deal with insert columns that are gaps in
              most sequences.

       --ileaved
              Output the alignment in interleaved Stockholm format of a
              fixed width that may be more convenient for examination. This
              was the default output alignment format of previous versions
              of cmalign. Note that cmalign requires more memory when this
              option is used. For this reason, --ileaved will only work for
              alignments of up to 100,000 sequences or a total of
              100,000,000 aligned nucleotides.

       --regress _s_
              Save an additional copy of the output alignment with no author
              information to file _s_.

       --verbose
              Output additional information in the tabular scores output
              (output to stdout if -o is used, or to _f_ if --sfile _f_ is
              used). These are mainly useful for testing and debugging.

       --cpu _n_
              Specify that _n_ parallel CPU workers be used. If _n_ is set
              as "0", then the program will be run in serial mode, without
              using threads. You can also control this number by setting an
              environment variable, INFERNAL_NCPU. This option will only be
              available if the machine on which Infernal was built is
              capable of using POSIX threading (see the Installation section
              of the user guide for more information).

       --mpi
              Run as an MPI parallel program.
              This option will only be available if Infernal has been
              configured and built with the "--enable-mpi" flag (see the
              Installation section of the user guide for more information).

SEE ALSO
       See infernal(1) for a master man page with a list of all the
       individual man pages for programs in the Infernal package.

       For complete documentation, see the user guide that came with your
       Infernal distribution (Userguide.pdf); or see the Infernal web page
       ().

COPYRIGHT
       Copyright (C) 2019 Howard Hughes Medical Institute.
       Freely distributed under the BSD open source license.

       For additional information on copyright and licensing, see the file
       called COPYRIGHT in your Infernal source distribution, or see the
       Infernal web page ().

AUTHOR
       The Eddy/Rivas Laboratory
       Janelia Farm Research Campus
       19700 Helix Drive
       Ashburn VA 20147 USA
       http://eddylab.org

Infernal 1.1.3                     Nov 2019                        cmalign(1)
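
       Several options above are documented as valid only in combination
       with others. As a convenience, those stated dependencies can be
       collected in a small Python check like the sketch below. The function
       check_options is hypothetical and not part of Infernal; it encodes
       only the combinations this page states explicitly and makes no claim
       to match every check cmalign itself performs.

```python
def check_options(opts):
    """Return a list of problems for a collection of cmalign option names.

    Hypothetical helper: it encodes only the dependencies stated in this
    man page, not cmalign's own argument validation.
    """
    opts = set(opts)
    problems = []
    if "--small" in opts:
        # --small requires these, plus --noprob to disable PP annotation.
        for needed in ("--nonbanded", "--notrunc", "--cyk", "--noprob"):
            if needed not in opts:
                problems.append(f"--small requires {needed}")
    if "--seed" in opts and "--sample" not in opts:
        problems.append("--seed requires --sample")
    if "--mapstr" in opts and "--mapali" not in opts:
        problems.append("--mapstr requires --mapali")
    if "--sub" in opts and "-g" not in opts:
        problems.append("--sub requires -g")
    return problems
```

       An empty list means none of the documented dependencies are violated;
       for instance, check_options(["--sub"]) reports that -g is missing.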