Bugs? Ideas? Need for improvement? soeding@genzentrum.lmu.de _______________________________________________________________________________ HH-suite 2.0.16 June 2012 * cstranslate: Understand now -i stdin and -o stdout. * hhblitsdb.pl: Some help list updates and stdin and stdout fixes. Added options for adding (-a) and removing (-r) files from the database. * hhblits: Fixed -i stdin, -oa3m stdout and -v 0. * hhsearch/hhalign: Fixed small bug that removed all sequences from output alignment when trying to use append HH-suite 2.0.15 June 2012 * Fix hhalign regression introduced in 2.0.14. * Fix a bug in the counting of the number of sequences found. * Code cleanup * Removed a regression in -nodiff option: sequences in query MSA were filtered by normal, active filters such as -diff and -id. Only db MSAs were not filtered. * Removed a regression that read sequences from query HHM or .a3m even when a sequence file query.seq was given. * Simplified code in hhblits. * Renamed option -nodiff to -all. The name -nodiff is very misleading as the diff option is not actually switched off on the query and db alignments. * Improved the filtering section of the help lists * Fixed problem for small -e thesholds by declaring par.e as double instead of float. _______________________________________________________________________________ HH-suite 2.0.14 May 2012 * All error messages are now written to stderr (not stdout) and are of the form "Error in : " * Improved error handling in hhblitsdb.pl, addss.pl, multithread.pl * Added more flexible format recognition in cstranslate binary which is called in hhblistdb.pl * Added -nodiff and -diff to standard help * In hhblitsdb.pl we force the MSA files to have extension a3m now. Same for hhm (HHsuite model) and hmm (HMMER model) files. _______________________________________________________________________________ HH-suite 2.0.13 February 2012 * Added script splitfasta.pl to split multiple-sequence FASTA file in multiple single-sequence FASTA files. These are needed to run hhblits in parallel using multithread.pl. * Implemented -maxmem option specifying maximum memory in GB. This changes the maximum length, up to which database sequences will be realigned using the memory-hungry maximum accuracy algorithm. This change will lead to more accurate results for longer HMMs (>5000 match states). * more sensible memory size defaults in init_prefilter(), fixes a segfault * Improved error messages if databases are too big (e.g. >2GB on 32bit systems). * Made code segfault-safe when name lengths and other strings exceed usual sizes. Now, all strings should be cut at maximum length. E.g., name lengths of sequences and HMMs at >=512 characters (const int NAMELEN in util.C) * Merged some Makefile changes for Debian from Laszlo Kajan * Corrected calculation of "Identities" for pairwise query-template alignments. Now, gaps '-' in aligned match states of query and template master sequences are not counted as identical residue anymore. (hhfullalignment.C, line 182) * Added detailed description in user guide of how Identities, Similarity, and Sum_probs are calculated * Fixed bug in hhblitsdb.pl (printf on line 272) introduced in 2.0.12 _______________________________________________________________________________ HH-suite 2.0.12 February 2012 * Reworked SSE flags: Support compilation without SSE3 (make NO_SSE3=1), without SSE2 (make NO_SSE2=1) with SSE4.1 (make WITH_SSE41=1) and enable all other instructions avalable on compilation host (make WITH_NATIVE=1). * hhblitsdb.pl: The requirement for identical file names and names of master sequences is dropped. Violation of this requirement manifested itself as file-not-found error during Viterbi-stage of hhblits. The code was streamlined. * Added progress dots ... for prefiltering stage * Cleaned up output for -v 1 and -v 0 options in hhblits and hhsearch: -v 1 should print out only warnings and errors * Fixed segfault occuring with very long sequence names * Minor improvements of documentation on installation _______________________________________________________________________________ HH-suite 2.0.11 February 2012 * Corrected wrong paths for calls of ffindex binaries in hhblitsdb.pl * Elminated rare segfault in realignment step that occurred in particular with repeat proteins as queries * Minor improvements of documentation on installatin, HHblits databases, FAQs etc. _______________________________________________________________________________ HH-suite 2.0.2 January 2012 * The iterative HMM-HMM search method HHblits has been added and the entire package is now called HH-suite. HHblits brings the power of HMM-HMM comparison to mainstream, general-purpose sequence searching and sequence analysis. * The speed of HHsearch was further increased through the use of SSE3 instructions. * A local amino acid compositional bias correction was introduced. Improvements are slight (<=1%) on a standard SCOP single domain benchmark. However, the improvement for more realistic sequences containing multiple domains, repeats, and regions of strong compositional bias will is probably more pronounced. The score shift parameter was changed from -0.01 to -0.03 bits and the mact parameter from 0.5 to 0.35. * The memory requirement for HMM-HMM aligment in hhsearch, hhblits, and hhalign was reduced by a factor of ~2 by a better implementation of the Forward- Backward algorithm * Removed hhalign's -sto option for generating stochastic alignments. Sorry! A better, deterministic replacement is planned for the future. _______________________________________________________________________________ HHsearch 1.6.1 April 2011 * Increased speed by using SSE3 instructions in some functions * Adding new option "-atab" for writing alignment as a table (with posteriors) to file * HHsearch is now able to read HMMER3 profiles (but should not be used due to a loss of sensitivity) * Removed a bug related with "neutralize HIS-tag" option * Removed a bug related with empty alignments (no match states) _______________________________________________________________________________ HHsearch 1.6.0 November 2010 * A new procedure for estimation of P- and E-values has been implemented that circumvents the need to calibrate HMMs. Calibration can still be done if desired. By default, however, HHsearch now estimates the lamda and mu parameters of the extreme value distribution (EVD) for each pair of query and database HMMs from the lengths of both HMMs and the diversities of their underlying alignments. Apart from saving the time for calibration, this procedure is more reliable and noise-resistant. This change only applies to the default local search mode. For global searches, nothing has changed. Note that E-values in global search mode are unreliable and that sensitivity is reduced. Old calibrations can still be used: -calm 0 : use empirical query HMM calibration (old default) -calm 1 : use empirical db HMM calibration -calm 2 : use both query and db HMM calibration -calm 3 : use neural network calibration (new default) * Previous versions of HHsearch sometimes showed non-homologous hits with high probabilities by matching long stretches of secondary structure states, in particular long helices, in the absence of any similarity in the amino acid profiles. Capping the SS score by a linear function of the profile score now effectively suppresses these spurious high-scoring false positives. * The output format for the query-template alignments has slightly changed. A 'Confidence' line at the bottom of each alignment block now reports the posterior probabilities for each alignment column when the -realign option is active (which it is by default). These probabilities are calculated in the Forward-Backward algorithm that is needed as input for the Maximium ACcuracy alignment algorithm. Also, the lines 'ss_conf' with the confidence values for the secondary structure prediction are omitted by default. (They can be displayed with option '-showssconf'). To compensate, secondary structure predictions with confidence values between 7 and 9 are given in capital letters, while for the predictions with values between 0 and 6 lower-case letters are used. * In the hhsearch output file in the header lines before each query-database alignment, the substitution matrix score (without gap penalties) of the query with the database sequence is now reported in bits per column. Also, the sum of probabilities for each pair of aligned residues from the MAC algorithm is reported here (0 if no MAC alignment is performed). * The buildali.pl script now uses context-specific iterative BLAST (CSI-BLAST) instead of PSI-BLAST. This considerably increases the sensitivity of buildali.pl/HHsearch. * Removed a bug which produced a segfault for input alignments with more than 15000 match columns. Now, the HHsearch binaries will issue a warning and will transform only the first 15000 match columns into an HMM. * Removed a bug in the multi-threading code that could lead to occasional hang-ups (race condition) in situations where slow file access was impeding program execution and inter-thread signaling was unreliable. * Removed a memory leak and optimized memory management. * Removed a bug in hhalign that could lead to unreasonably significant E-values and probabilities due to calibration problems. * HHsearch now performs realign with MAC-alignment only around Viterbi-hit _______________________________________________________________________________ HHsearch 1.5.0 August 2007 * By default, HHsearch realigns all displayed alignments in a second stage using the more accurate Maximum Accuracy (MAC) alignment algorithm (Durbin, Eddy, Krough, Mitchison: Biological sequence analysis, page 95; HMM-HMM version: J. S\"oding, unpublished). As before, the Viterbi algorithm is employed for searching and ranking the matches. The realignment step is parallelized (`-cpu `) and typically takes a few seconds only. You can switch off the MAC realignment with the -norealign option. The posterior probability threshold is controlled with the -mact [0,1[ option. This parameter controls the alignment algorithm's greediness. More precisely, the MAC algorithm finds the alignment that maximizes the sum of posterior probabilities minus mact for each aligned pair. Global alignments are generated with -mact 0, whereas -mact 0.5 will produce quite conservative local alignments. Default value is -mact 0.35, which produces alignments of roughly the same length as the Viterbi algorithm. The -global and -local (default) option now refer to both the Viterbi search stage as well as the MAC realignment stage. With -global (-local), the posterior probability matrix will be calculated for global (local) alignment. Note that '-local -mact 0' will produce global alignments from a local posterior probability matrix (which is not at all unreasonable). * An amino acid compositional bias correction is now performed by default. This increases the sensitivity by 25% at 0.01 errors per query and by 5% at 0.1 errors per query. By recalibrating the Probabilities, the increased selectivity of this new version allows to give higher probabilities for the same P-values. Also, the score offset could be increased from -0.1 bits to 0 as a consequence. * The algorithm that filters the set of the most diverse sequences (option -diff) has been improved. Before, it determined the set of the N most diverse sequences. In the case of multi-domain alignments, this could lead to severely underrepresented regions. E.g. when the first domain is only covered by a few fairly similar sequences and the second by hundreds of very diverse ones, most or all of the similar ones were removed. The '-diff N' option now filters the most diverse set of sequences, keeping at least N sequences in each block of >50 columns. This generally leads to a total number of sequences that is larger than N. Speed is similar. The default is '-diff 100' for hhmake and hhsearch. Speed is similar. Use -diff 0 to switch this filter off. * The sensitivity for the -global alignment option has been significantly increased by a more robust statistical treatment. The sensitivity in -global mode is now only 0-10% lower than for the default -local option on a SCOP benchmark, i.e. when the query or the templates represent single structural domains. The E-values are now more realistic, although still not as reliable as for -local. The Probabilities were recalibrated. * A new binary HHalign has been added. It is similar to hhsearch, but performs only pairwise comparisons. It can produce dot plots, tables of aligned residues, and it can sample alternative alignments stochastically. It uses the MAC algorithm by default. * HHsearch and hhalign can generate query-template multiple alignments in FASTA, A2M, or A3M format with the -ofas, -oa2m, -oa3m options * Return values were changed to comply with convention that 0 means no errors: 0: Finished successfully 1: Format error in input files 2: File access error 3: Out of memory 4: Syntax error on command line 6: Internal logic error (please report) 7: Internal numeric error (please report) 8: Other * Added script buildali.pl to automatically build PSI-BLAST multiple sequence alignments, including predicted and DSSP secondary structure. Buildali.pl prunes sequences individually from both ends to reduce corruption by non-homologous fragments (J. Soeding, unpublished). * Added script hhmakemodel.pl that parses hhsearch results files and can generate FASTA or PIR multiple alignments or build rough 3D models. * Removed a bug due to which pseudocounts where added to HMMer HMMs (which already have their own pseudocounts added). This bug severely reduced sensitivity for HMMs read in HMMer format. * Moved a lot of memory allocation from stack to heap to avoid segmentation faults under some Windows systems. * Removed a bug due to which the query-template alignments where not displayed on some platforms when output was directed to stdout * Removed a bug that caused occasional segfaults under SunOS when reading HMMer files * Added multi-threading (-cpu ) for Windows x86 platform * Cleaned up output formatting of summary list for Windows x86 * Stopped support for the Alpha/DEC platform * Anyone still interested in Mac OSX/PPC or SunOS support? _______________________________________________________________________________ HHsearch 1.2.0 January 2006 * Parallelized hhsearch for SMPs: using option -cpu , hhsearch can be run in parallel on several CPUs of a symmetric multiprocessor (SMP) machine. Speed with two CPUs is ~1.9x, with two Dual Core AMDs ~3.4x. Memory requirements for dynamic programming matrix is now ~100kb*(Query_length+2) * number of queue bins. Number of queue bins = floor(1.2*CPUs+1) (for cpu>1), 1 otherwise * HHsearch can now automatically detect and read HMMER format (ASCII). Predicted secondary structure (or consensus secondary structure) can be used in HMMER format in the same way as in HHsearch format. The script addpsipred.pl is able to add the information for predicted secondary structure to a HMMER file (as well as to a FASTA alignment). However, representative sequences for query-template alignments are not supported by HMMER's format. Performance is not as good HMMer-format as for hhm format, so use hhsearch's hhm format as far as possible. * Multiple databases can now be searched with -d 'db1.hhm db2.hhm ...' * Added option -alt for displaying up to alternative alignments that have no residue pairs in common (e.g. when aligning repeat proteins) * Added percent sequence identity to alignment information * Increased speed of hhsearch database searching by ~15% * Added possibility to read from standard input and write to standard output * Removed the limit 30 000 on maximum number of HMMs in database. * Increased maximum number of match states in HMM from 6000 to 15000. * Checked code sanity with Valgrind (a great tool! http://valgring.org) and removed a bug that was responsible for rare segmentation faults. * Removed a bug in the maximum sequence identity filtering routine. Before, when two sequences had no overlap, the shorter of the two was filtered out. This could have affected HMMs for multidomain proteins quite badly. * Changed options for minimum and maximum numbers of sequences in output file (options -p, -E, -B, -b, -Z, -z) * Impoved sensitivity somewhat (~10%) by using option '-diff 100' as default for hhmake and hhsearch (use '-diff 0' to turn this option off). Optimized internal pseudocount admixture parameter with -diff 100 setting. Results between HHsearch 1.2 and earlier versions may therefore differ slightly. * Perl scripts: Added -Q option to alignblast.pl that allows to include query sequence as first sequence of extracted alignment. Script addpsipred.pl can now add PSIPRED-predicted secondary structure also to HMMER model files. * Included binaries for Windows32 (a bit slow, no threads supported) _______________________________________________________________________________ HHsearch 1.1.4 March 2005 * Removed small memory allocation bug in function that reads .hhdefaults file. * Reserve dynamic programming memory dynamically. Memory requirement is now 6*MAXRES*(Query_length+2) bytes (+ space for ALL sequences in SEQ records of HMM database). * Changed maximum number of sequences in input alignments from 20 000 to 30 000 * Changed maximum length of sequences in input alignments from 20 000 to 30 000 * Changed maximum number of match states MAXRES in input alignments from 5000 to 6000 * In output alignments, changed the name of the consensus sequences from 'Cons-' to 'Consensus'. * Added secondary structure-dependent gap penalties ('-ssgap' option). Gap opening within SS elements costs X bits after 1st, 2X bits after 2nd and 3X bits after third SS residue, where X is 1.0 bit by default and can be changed with the '-ssgapd X' option. => A small benchmark on the CAFASP4 targets showed NO improvement! * Included binaries for MacOSX (Darwin) and OSF1/Tru64_UNIX (thanks to Paul Sarokas, GSK) _______________________________________________________________________________ HHsearch 1.1.3 November 2004 * Allowed up to 10000 sequences to be displayed in output alignments * Added warning message when not enough memory available * Corrected warning message when not enough different superfamilies are contained in the searched database to fit the score distribution. * Removed a bug in the part that builds the output alignments that could cause a segmentation fault on rare occasions. * Remove carriage return (CR) symbols in names of input sequences that could mess up output alignments when sequences were read in via a web form. * Corrected a bug in the calculation of transition probabilities for insert states. (Improvement in sensitivity by ~2%.) Need to recalculate old HMMs. * Changed format of hitlist in output. Now, the first 30 characters of the sequence name and description are shown, not only the ID and the family code: (No Hit Prob E-value...). This is more suitable for databases other than SCOP * Removed limit of 200 characters for sequence descriptions to be able to display the extensive annotation for Pfam and SMART alignments, for instance. * In the output alignment the numbering of the consensus sequence did not correspond to the match columns anymore. This has been corrected. _______________________________________________________________________________ HHsearch 1.1 August 2004 * Format of *.hhm files was changed. Transition pseudocounts are now calculated on the basis of the effective number of sequences that are in M states, I states and M states, respectively. These numbers are listed as Neff, Neff_I, Neff_D in seperate columns of the hhm files. (Improvement in sensitivity by a few percent.) * Reduced memory requirements by a factor of four. Memory requirement now is approx. 6*MAXRES^2 bytes ( + space for ALL sequences in SEQ records of HMM database). MAXRES is maximum number of query residues, 5000 at the moment _______________________________________________________________________________ HHsearch 1.0 May 2004 * Version used for benchmark published in Soding J, Bioinformatics (2005).