====================================================================================================
          Homology-Based Secondary Structure & Solvent Accessibility Prediction (HOMOLpro)
                           Method Description & Project Documentation
====================================================================================================

Author(s) : Christophe Magnan (cmagnan@ics.uci.edu)
Copyright : Institute for Genomics and Bioinformatics
            University of California, Irvine
Modified  : 2015/07/02

====================================================================================================
                                         Method Description
====================================================================================================

HOMOLpro is a standalone program developed to improve ab-initio predictions of protein secondary
structure and solvent accessibility in regions where homologs exist in the Protein Data Bank (PDB).
This tool is used by all the predictors in SCRATCH-1D (SSpro, SSpro8, ACCpro, ACCpro20) to improve
their initial ab-initio predictions. Note that HOMOLpro can also be used with any other secondary
structure or solvent accessibility predictor.

Currently, HOMOLpro supports:

 - Secondary structure predictions 3-class (e.g. SSpro)
 - Secondary structure predictions 8-class (e.g. SSpro8)
 - Solvent accessibility predictions at the 25% threshold (e.g. ACCpro)
 - Solvent accessibility predictions for thresholds 0% to 95% (e.g. ACCpro20)

HOMOLpro takes as input the protein amino-acid sequence and the corresponding ab-initio predictions.
It returns new secondary structure and/or solvent accessibility predictions in which the input
ab-initio predictions are replaced with homology-based predictions in regions where homologs can be
found in the PDB. Protein regions with no homologs in the PDB are returned as provided in input
(i.e. the input ab-initio predictions are reported unchanged).
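
The replacement rule described above can be sketched in a few lines of Python. This is only an
illustration and not HOMOLpro's actual implementation: the function name `merge_predictions` and
the convention of marking residues without homologs with `None` are assumptions made for this
example, and the sketch only covers the single-character classes (ss, ss8, acc).

```python
from typing import List, Optional


def merge_predictions(abinitio: str, homology: List[Optional[str]]) -> str:
    """Overlay homology-based classes on an ab-initio prediction string.

    `homology` holds one entry per residue: a class letter where a homolog
    provided a prediction, or None where no homolog was found. This layout is
    an assumed convention, not HOMOLpro's internal representation.
    """
    if len(abinitio) != len(homology):
        raise ValueError("both predictions must cover every residue")
    # Keep the ab-initio class wherever no homology-based class is available.
    return "".join(h if h is not None else a for a, h in zip(abinitio, homology))


abinitio = "CCHHHHCCEE"
homology = [None, None, "H", "H", "E", "E", None, None, None, None]
print(merge_predictions(abinitio, homology))  # CCHHEECCEE
```

Residues 5 and 6 take the homology-based class, while every residue without a homolog keeps its
ab-initio class.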

====================================================================================================
                                        Project Documentation
====================================================================================================

This section provides a description of the project folder and how to use HOMOLpro.

========================================= Project Folder =========================================

A brief description of the project folders is given below.

 - bin            Main scripts to run HOMOLpro
 - data           Reference protein datasets for HOMOLpro
 - data/pdb_full  Filtered PDB with DSSP classes
 - doc            Documentation of the software
 - env            Bash profiles for running and retraining HOMOLpro
 - lib            HOMOLpro scripts to combine ab-initio & homology predictions
 - tmp            Temporary work folder for the software
 - tools          Third-party tools used by HOMOLpro

========================================= Software Usage =========================================

HOMOLpro comes with only one script located in the 'bin' folder: add_homology_predictions.sh

Usage: ./add_homology_predictions.sh fasta predictors abinitio homology [num_threads]

With:

 - fasta        Input protein sequences in FASTA file format

 - predictors   Comma-separated list of predictions to compute among:
                 - ss    : secondary structure 3 class
                 - ss8   : secondary structure 8 class
                 - acc   : solvent accessibility at the 25% threshold
                 - acc20 : solvent accessibility for thresholds 0% to 95%

 - abinitio     Comma-separated list of input files containing the ab-initio predictions
                for the sequences in the input fasta file. The prediction type of each file
                (ss, ss8, acc, acc20) must match the corresponding entry in 'predictors'.

 - homology     Comma-separated list of output files containing the outputs of HOMOLpro
                for each input ab-initio file.

 - num_threads  Number of cores to use to process the dataset (default=1)

======================================== Detailed Example ========================================

Let's consider a dataset D of protein sequences where ab-initio predictions are available for:

 - secondary structure 3-class (using SSpro for instance)
 - solvent accessibility at the 25% threshold (using ACCpro for instance)

Then, to improve these predictions by adding homology-based predictions with HOMOLpro:

  ./add_homology_predictions.sh D.fa ss,acc D.ss.ab,D.acc.ab D.ss.hom,D.acc.hom 4

WHERE:

 - "D.fa" contains the protein sequences in D
 - "ss,acc" tells HOMOLpro the type of the two ab-initio predictions provided in the files
   "D.ss.ab" and "D.acc.ab" (next argument)
 - "D.ss.ab,D.acc.ab" are the ab-initio predictions for all the proteins in "D.fa"
   (see section "Input Files Format" for a description of the file formats)
 - "D.ss.hom,D.acc.hom" are the two files where HOMOLpro will write the output homology-based
   predictions corresponding to each input ab-initio file
 - "4" tells HOMOLpro to use 4 cores to process the dataset

NOTE 1 : the same results can be obtained in this example by running the two command lines:

  ./add_homology_predictions.sh D.fa ss D.ss.ab D.ss.hom 4
  ./add_homology_predictions.sh D.fa acc D.acc.ab D.acc.hom 4

but the computation will take roughly twice as long since BLAST will need to run twice instead
of once.

NOTE 2 : nothing prevents processing several ab-initio predictions of the same type:

  ./add_homology_predictions.sh D.fa ss,ss,ss D.ss.ab#1,D.ss.ab#2,D.ss.ab#3 ...

This can be useful if ab-initio predictions from several predictors of the same type are available.

======================================= Input Files Format =======================================

1 - Protein Sequences

Protein sequences must be provided in the standard FASTA file format.
There is no limit on the number of input sequences to process besides the amount of RAM available
on the machine running the program.

2 - AB-INITIO Predictions

HOMOLpro ab-initio input files must be in the same file format as the output files of SSpro,
SSpro8, ACCpro, and ACCpro20. They are basically FASTA files where the protein sequence is replaced
by the predicted secondary structure or solvent accessibility. Predictions must be reported on a
single line, with no spaces or extra characters. Predictions must be reported in the same order as
the protein sequences in the input fasta file, one after another and using the same sequence
headers. Classes must be noted as described below:

 - Secondary structure 3-class    : C E H
 - Secondary structure 8-class    : C E H B I G S T
 - Solvent Accessibility 25%      : e (exposed)  - (buried)
 - Solvent Accessibility 20-class : 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95
                                    (for acc20, classes must be space-separated)

Examples are provided in the 'doc' folder of the software for each type of predictor.

==================================== Output Files Description ====================================

Output files of HOMOLpro have the exact same file format as the input ab-initio files.

====================================================================================================
                                           Release Notes
====================================================================================================

Version 1.1 (2015)

  Author      : Christophe Magnan
  Description : Bug fixes for version 1.0
  Comments    : Non-standard amino acids replaced by X
                Sequences of length greater than 10,000 ignored

Version 1.0 (2013)

  Author      : Christophe Magnan
  Description : First release of the software
  Comments    : Shared tool for SCRATCH-1D, SSpro, SSpro8, ACCpro, ACCpro20.

====================================================================================================
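
As an illustration of the rules given in the "Input Files Format" section, the following Python
sketch checks an ab-initio prediction file against its FASTA file before HOMOLpro is run. It is not
part of HOMOLpro: the names `ALPHABETS`, `parse_records`, and `check_predictions` are hypothetical,
and the check that each prediction has one class per residue is an assumption drawn from the
statement that the prediction replaces the protein sequence.

```python
from typing import Dict, List, Set, Tuple

# Allowed classes per predictor type, as listed in "Input Files Format".
ALPHABETS: Dict[str, Set[str]] = {
    "ss": set("CEH"),
    "ss8": set("CEHBIGST"),
    "acc": set("e-"),
    "acc20": {str(t) for t in range(0, 100, 5)},
}


def parse_records(text: str) -> List[Tuple[str, List[str]]]:
    """Split FASTA-like text into (header, body lines) pairs, skipping blank lines."""
    records: List[Tuple[str, List[str]]] = []
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith(">"):
            records.append((line[1:].strip(), []))
        elif records:
            records[-1][1].append(line)
    return records


def check_predictions(fasta_text: str, pred_text: str, kind: str) -> List[str]:
    """Return a list of problems found; an empty list means no problem was detected."""
    seqs = parse_records(fasta_text)
    preds = parse_records(pred_text)
    if [h for h, _ in seqs] != [h for h, _ in preds]:
        return ["headers differ or are in a different order"]
    problems: List[str] = []
    allowed = ALPHABETS[kind]
    for (header, seq_lines), (_, pred_lines) in zip(seqs, preds):
        if len(pred_lines) != 1:
            problems.append(f"{header}: prediction must be on a single line")
            continue
        sequence = "".join(seq_lines)
        # acc20 classes are space-separated; all other types use one character per class.
        classes = pred_lines[0].split() if kind == "acc20" else list(pred_lines[0])
        if len(classes) != len(sequence):
            problems.append(f"{header}: {len(classes)} classes for {len(sequence)} residues")
        bad = sorted(set(classes) - allowed)
        if bad:
            problems.append(f"{header}: unexpected classes {bad}")
    return problems


fasta = ">p1\nMKV\n>p2\nGA\n"
ss = ">p1\nCHH\n>p2\nCE\n"
print(check_predictions(fasta, ss, "ss"))  # []
```

In practice the two text arguments would be read from files such as "D.fa" and "D.ss.ab" in the
detailed example, and any reported problem would be fixed before calling add_homology_predictions.sh.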