====================================================================================================
      Sequence Profiles for Secondary Structure & Solvent Accessibility Prediction (PROFILpro)
                             Method Description & Project Documentation
====================================================================================================

   Author(s) : Christophe Magnan (cmagnan@ics.uci.edu)
   Copyright : Institute for Genomics and Bioinformatics
               University of California, Irvine
   Modified  : 2015/07/02

====================================================================================================
                                         Method Description
====================================================================================================

PROFILpro is a small utility that generates the input sequence profiles for the secondary structure
and solvent accessibility prediction tools SSpro, SSpro8, ACCpro, and ACCpro20.

PROFILpro is essentially a simple extension of the blastpgp program: it accepts large input FASTA
files and processes the sequences on several cores instead of a single thread. The outputs of
blastpgp are also post-processed so that the profile values sum to 1 for every amino acid of the
input sequence.

The number of iterations (or rounds) to perform on the reference protein database is an input of
the program and is limited to 2, 3, or 4. The database used to generate the sequence profiles is
UNIREF50, but other databases can be added by editing the project profile and a few bash scripts.

====================================================================================================
                                       Project Documentation
====================================================================================================

This section describes the project folder and explains how to use PROFILpro.

========================================= Project Folder =========================================

A brief description of the project folders is given below.
- bin      Main script to run PROFILpro
- data     Reference protein datasets for PROFILpro
- doc      Documentation of the software
- env      Bash profile for running PROFILpro
- lib      PROFILpro scripts to generate sequence profiles
- tmp      Temporary work folder for the software
- tools    Third-party tools used by PROFILpro

========================================= Software Usage =========================================

PROFILpro comes with a single script, located in the 'bin' folder: generate_profiles.sh

Usage: ./generate_profiles.sh input_fasta output_profiles num_rounds [num_threads]

With:

- input_fasta        Input protein sequences in FASTA file format
- output_profiles    Output file for the sequence profiles
- num_rounds         Number of blastpgp iterations on UNIREF50 (2, 3, or 4 only)
- num_threads        Number of cores to use to process the input dataset (default=1)

======================================= Input Files Format =======================================

Input files must be in the standard FASTA file format. There is no limit on the number of input
sequences to process other than the amount of RAM available on the machine running the program.

==================================== Output Files Description ====================================

Sequence profiles are written in the following file format for each input protein sequence:

>sequence header               // As provided in the input FASTA file
amino_acid_1 profile_values    // Profile for the first amino acid of the protein sequence
amino_acid_2 profile_values    // Profile for the second amino acid of the protein sequence
...                            // ...
amino_acid_n profile_values    // Profile for the last amino acid of the protein sequence

The 20 profile values reported after the input amino-acid character are space-separated and given
in the same order as in blastpgp output files:

   A R N D C Q E G H I L K M F P S T W Y V

The profiles are reported in the same order as the sequences in the input FASTA file, one after
another, each following the format described above.

====================================================================================================
                                           Release Notes
====================================================================================================

Version 1.1 (2015)

   Author      : Christophe Magnan
   Description : Bug fixes for version 1.0
   Comments    : Non-standard amino acids replaced by X
                 Sequences of length greater than 10,000 ignored

Version 1.0 (2013)

   Author      : Christophe Magnan
   Description : First release of the software
   Comments    : Shared tool for SCRATCH-1D, SSpro, SSpro8, ACCpro, ACCpro20, and HOMOLpro.

====================================================================================================
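
The output format described under "Output Files Description" can be read back with a few lines of
Python. The sketch below is not part of PROFILpro: the function names parse_profiles and
unnormalized_rows, the tolerance of 1e-3, and the sample values are all illustrative assumptions.
It also assumes that the "//" annotations shown in the format description are explanatory only and
do not appear in real output files.

```python
from typing import List, Tuple

# Column order stated in the documentation (same order as blastpgp output files).
AMINO_ACID_ORDER = "ARNDCQEGHILKMFPSTWYV"

Row = Tuple[str, List[float]]      # (amino-acid character, 20 profile values)
Profile = Tuple[str, List[Row]]    # (sequence header, one row per residue)


def parse_profiles(text: str) -> List[Profile]:
    """Parse profile text into a list of (header, rows), keeping input order."""
    profiles: List[Profile] = []
    for line_number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header line starts a new profile.
            profiles.append((line[1:].strip(), []))
            continue
        if not profiles:
            raise ValueError(f"line {line_number}: profile row before any '>' header")
        fields = line.split()
        residue, values = fields[0], [float(v) for v in fields[1:]]
        if len(values) != len(AMINO_ACID_ORDER):
            raise ValueError(
                f"line {line_number}: expected {len(AMINO_ACID_ORDER)} values, "
                f"got {len(values)}"
            )
        profiles[-1][1].append((residue, values))
    return profiles


def unnormalized_rows(
    profiles: List[Profile], tolerance: float = 1e-3
) -> List[Tuple[str, int, float]]:
    """Return (header, 1-based position, row sum) for rows whose sum is not close to 1.

    The tolerance is an assumption; the precision of the written values is not
    specified in the documentation.
    """
    bad = []
    for header, rows in profiles:
        for position, (_, values) in enumerate(rows, start=1):
            total = sum(values)
            if abs(total - 1.0) > tolerance:
                bad.append((header, position, total))
    return bad


# Synthetic two-residue example with made-up uniform values (0.05 for each column).
uniform = " ".join(["0.05"] * 20)
sample = f">example_sequence\nM {uniform}\nK {uniform}\n"
parsed = parse_profiles(sample)
print(parsed[0][0], len(parsed[0][1]), unnormalized_rows(parsed))
```

To look up the value for a given amino acid in a row, index the list of values with
AMINO_ACID_ORDER.index("G"), for example. For real output files, pass the file contents read with
open(path).read() to parse_profiles.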