====================================================================================================

          Homology-Based Secondary Structure & Solvent Accessibility Prediction (HOMOLpro)

                             Method Description & Project Documentation

====================================================================================================

Author(s) :  Christophe Magnan (cmagnan@ics.uci.edu)
Copyright :  Institute for Genomics and Bioinformatics
             University of California, Irvine
Modified  :  2015/07/02

====================================================================================================
                                         Method Description
====================================================================================================

HOMOLpro is a standalone program developed to improve ab-initio predictions of protein secondary
structure and solvent accessibility in regions where homologs exist in the Protein Data Bank (PDB).
This tool is used by all the predictors in SCRATCH-1D (SSpro, SSpro8, ACCpro, ACCpro20) to improve
their initial ab-initio predictions. Note that HOMOLpro can also be used with any other secondary
structure or solvent accessibility predictor. Currently, HOMOLpro supports:

- 3-class secondary structure predictions (e.g. SSpro)
- 8-class secondary structure predictions (e.g. SSpro8)
- Solvent accessibility predictions at the 25% threshold (e.g. ACCpro)
- Solvent accessibility predictions for thresholds 0% to 95% (e.g. ACCpro20)

HOMOLpro takes as input the protein amino-acid sequence and the corresponding ab-initio
predictions, and returns new secondary structure and/or solvent accessibility predictions in which
the input ab-initio predictions are replaced with homology-based predictions in regions where
homologs can be found in the PDB. Protein regions with no homologs in the PDB are returned as
provided in the input (i.e. the input ab-initio predictions are reported unchanged).
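
The replacement rule described above can be illustrated with a minimal sketch. This is not the
actual HOMOLpro implementation: the function name merge_predictions and the use of '.' to mark
positions with no homolog are assumptions made only for this example.

```shell
# Hypothetical helper (not part of HOMOLpro): combine an ab-initio prediction
# string with a homology-based one of the same length. A '.' in the homology
# string is an assumed placeholder meaning "no homolog covers this position".
merge_predictions() {
    awk -v ab="$1" -v hom="$2" 'BEGIN {
        if (length(ab) != length(hom)) {
            print "error: length mismatch" > "/dev/stderr"
            exit 1
        }
        out = ""
        for (i = 1; i <= length(ab); i++) {
            h = substr(hom, i, 1)
            # Keep the ab-initio class wherever no homology class is available.
            if (h == ".") out = out substr(ab, i, 1)
            else          out = out h
        }
        print out
    }'
}

# Positions 3 to 5 are covered by a homolog, the rest are left unchanged.
merge_predictions "CCCHHHHCCC" "..EEE....."   # prints CCEEEHHCCC
```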

====================================================================================================
                                        Project Documentation
====================================================================================================

This section provides a description of the project folder and how to use HOMOLpro.

=========================================  Project Folder  =========================================

A brief description of the project folders is given below.

- bin             Main scripts to run HOMOLpro
- data            Reference protein datasets for HOMOLpro
- data/pdb_full   Filtered PDB with DSSP classes
- doc             Documentation of the software
- env             Bash profiles for running and retraining HOMOLpro
- lib             HOMOLpro scripts to combine ab-initio & homology predictions
- tmp             Temporary work folder for the software
- tools           Third-party tools used by HOMOLpro

=========================================  Software Usage  =========================================

HOMOLpro comes with only one script located in the 'bin' folder: add_homology_predictions.sh

   Usage:  ./add_homology_predictions.sh  fasta  predictors  abinitio  homology  [num_threads]

With:

- fasta              Input protein sequences in FASTA file format

- predictors         Comma-separated list of prediction types to compute, among:

                        - ss    : secondary structure, 3-class
                        - ss8   : secondary structure, 8-class
                        - acc   : solvent accessibility at the 25% threshold
                        - acc20 : solvent accessibility for thresholds 0% to 95%

- abinitio           Comma-separated list of input files containing the ab-initio predictions
                     for the sequences in the input fasta file. The prediction type of each
                     file (ss, ss8, acc, acc20) must match the corresponding entry in 'predictors'.

- homology           Comma-separated list of output files, one per input ab-initio file,
                     where HOMOLpro writes its predictions.

- num_threads        Number of cores to use to process the dataset (default=1)
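
The argument descriptions above imply that 'predictors', 'abinitio' and 'homology' list the same
number of entries, in the same order. A minimal pre-flight check could look like the sketch below.
The function check_arguments is hypothetical and not part of HOMOLpro; whether the script itself
performs a similar check is not documented here.

```shell
# Hypothetical pre-flight check (not part of HOMOLpro): verify that the three
# comma-separated lists have the same number of entries and that every
# predictor name is one of the four documented types.
check_arguments() {
    awk -v p="$1" -v a="$2" -v h="$3" 'BEGIN {
        np = split(p, preds, ",")
        na = split(a, tmp, ",")
        nh = split(h, tmp, ",")
        if (np != na || np != nh) {
            print "error: the three lists must have the same number of entries" > "/dev/stderr"
            exit 1
        }
        for (i = 1; i <= np; i++) {
            t = preds[i]
            if (t != "ss" && t != "ss8" && t != "acc" && t != "acc20") {
                print "error: unknown predictor: " t > "/dev/stderr"
                exit 1
            }
        }
        print "ok"
    }'
}

check_arguments ss,acc D.ss.ab,D.acc.ab D.ss.hom,D.acc.hom   # prints ok
```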

========================================  Detailed Example  ========================================

Let's consider a dataset D of protein sequences where ab-initio predictions are available for:

   - secondary structure 3-class (using SSpro for instance)
   - solvent accessibility at the 25% threshold (using ACCpro for instance)

Then, to improve these predictions by adding homology-based predictions with HOMOLpro, run:

   ./add_homology_predictions.sh   D.fa   ss,acc   D.ss.ab,D.acc.ab   D.ss.hom,D.acc.hom   4

Where:

- "D.fa" contains the protein sequences in D

- "ss,acc" tells HOMOLpro the types of the two ab-initio predictions provided
  in the files "D.ss.ab" and "D.acc.ab" (next argument)

- "D.ss.ab,D.acc.ab" are the files containing the ab-initio predictions for all the
  proteins in "D.fa"; see section "Input Files Format" for a description of the file formats.

- "D.ss.hom,D.acc.hom" are the two files where HOMOLpro will write the output
  homology-based predictions corresponding to each input ab-initio file.

- "4" tells HOMOLpro to use 4 cores to process the dataset
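
When files follow the naming scheme used in this example (PREFIX.fa, PREFIX.<type>.ab and
PREFIX.<type>.hom), the command line can be assembled mechanically. The sketch below only prints
the command without running it. The helper build_command is hypothetical, and the naming scheme
is just the one used in this example, not something HOMOLpro requires.

```shell
# Hypothetical helper (not part of HOMOLpro): print the command line for a
# dataset prefix, a list of predictors and an optional number of threads,
# assuming files are named PREFIX.fa, PREFIX.<type>.ab and PREFIX.<type>.hom.
build_command() {
    awk -v prefix="$1" -v preds="$2" -v threads="${3:-1}" 'BEGIN {
        n = split(preds, type, ",")
        ab = ""
        hom = ""
        for (i = 1; i <= n; i++) {
            sep = (i > 1) ? "," : ""
            ab  = ab  sep prefix "." type[i] ".ab"
            hom = hom sep prefix "." type[i] ".hom"
        }
        print "./add_homology_predictions.sh " prefix ".fa " preds " " ab " " hom " " threads
    }'
}

build_command D ss,acc 4
# prints ./add_homology_predictions.sh D.fa ss,acc D.ss.ab,D.acc.ab D.ss.hom,D.acc.hom 4
```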


NOTE 1 : the same results can be obtained in this example by running the two command lines:

   ./add_homology_predictions.sh   D.fa   ss    D.ss.ab    D.ss.hom    4
   ./add_homology_predictions.sh   D.fa   acc   D.acc.ab   D.acc.hom   4

but the computation will take roughly twice as long, since BLAST must run twice instead of once.


NOTE 2 : nothing prevents processing several ab-initio predictions of the same type:

   ./add_homology_predictions.sh  D.fa  ss,ss,ss  D.ss.ab#1,D.ss.ab#2,D.ss.ab#3  ...

This can be useful when ab-initio predictions from several predictors of the same type are
available.


=======================================  Input Files Format  =======================================

1 - Protein Sequences

Protein sequences must be provided in the standard FASTA file format.
There is no limit on the number of input sequences to process other than
the amount of RAM available on the machine running the program.

2 - AB-INITIO Predictions

HOMOLpro ab-initio input files must be in the same file format as the output files
of SSpro, SSpro8, ACCpro, and ACCpro20. They are basically FASTA files where the protein
sequence is replaced by the predicted secondary structure or solvent accessibility.
Predictions must be reported on a single line, with no spaces or extra characters (except
for acc20, whose classes are space-separated). Entries must be reported in the same order
as the protein sequences in the input fasta file, one after another and using the same
sequence headers. Classes must be noted as described below:

  - Secondary structure 3-class    :  C  E  H
  - Secondary structure 8-class    :  C  E  H  B  I  G  S  T
  - Solvent Accessibility 25%      :  e (exposed) - (buried)
  - Solvent Accessibility 20-class :  0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95
                                      (for acc20, classes must be space-separated)

Examples are provided in the 'doc' folder of the software for each type of predictor.
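
The format rules above can be checked before running HOMOLpro. The following is a minimal
sketch, not part of the software: the function check_abinitio is hypothetical, and it assumes
that headers must match the FASTA headers character for character and that each prediction
contains exactly one class per residue.

```shell
# Hypothetical validator (not part of HOMOLpro). Usage:
#   check_abinitio  fasta_file  prediction_file  type     (type: ss, ss8, acc, acc20)
# Prints one message per problem found and returns a non-zero status on failure.
check_abinitio() {
    awk -v type="$3" '
        # First file (FASTA): remember each header and the sequence length.
        FILENAME == ARGV[1] {
            if ($0 ~ /^>/) { n++; header[n] = $0 }
            else if (n > 0) { gsub(/[ \t\r]/, ""); len[n] += length($0) }
            next
        }
        # Second file (predictions): ignore blank lines.
        /^[ \t\r]*$/ { next }
        # Headers must appear in the same order as in the FASTA file.
        /^>/ {
            m++
            if ($0 != header[m]) { print "header mismatch at entry " m; errors++ }
            next
        }
        # Prediction line: count the classes and check the alphabet.
        m > 0 {
            lines[m]++
            if (type == "acc20") {
                # One space-separated class (0, 5, ..., 95) per residue.
                count = split($0, cls, " ")
                for (i = 1; i <= count; i++)
                    if (cls[i] !~ /^(0|5|[1-9][05])$/) { print "bad class in entry " m; errors++; break }
            } else {
                count = length($0)
                ok = 0
                if (type == "ss")  ok = ($0 ~ /^[CEH]+$/)
                if (type == "ss8") ok = ($0 ~ /^[CEHBIGST]+$/)
                if (type == "acc") ok = ($0 ~ /^[e-]+$/)
                if (!ok) { print "bad class in entry " m; errors++ }
            }
            if (count != len[m]) { print "length mismatch in entry " m; errors++ }
        }
        END {
            if (m != n) { print "entry count mismatch"; errors++ }
            for (i = 1; i <= m; i++)
                if (lines[i] != 1) { print "entry " i " is not on a single line"; errors++ }
            exit (errors > 0)
        }
    ' "$1" "$2"
}
```

For instance, "check_abinitio D.fa D.ss.ab ss" prints nothing and returns 0 when "D.ss.ab" is
consistent with "D.fa" under the assumptions stated above.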

====================================  Output Files Description  ====================================

Output files of HOMOLpro have the exact same file format as the input ab-initio files.

====================================================================================================
                                           Release Notes
====================================================================================================

Version 1.1 (2015)

Author      :  Christophe Magnan
Description :  Bug fixes for version 1.0
Comments    :  Non-standard amino acids replaced by X
               Sequences of length greater than 10,000 ignored

Version 1.0 (2013)

Author      :  Christophe Magnan
Description :  First release of the software
Comments    :  Shared tool for SCRATCH-1D, SSpro, SSpro8, ACCpro, ACCpro20.

====================================================================================================