DeepMSA2 

Deep multiple sequence alignment (version 2.2) generation for protein structure prediction.

==== Overview ====

DeepMSA2 is a composite approach for generating high-quality multiple sequence alignments (MSAs) for protein
monomers or protein complexes. It searches huge genomics and metagenomics databases and ranks the resulting MSAs
with a structure model-based multi-MSA ranking system (or a contact-map-based scoring system, for protein monomers only).
The monomer MSAs are produced by three iterative MSA generation pipelines, which achieve large alignment depth
and diverse sequence sources by merging sequences from whole-genome sequence databases (Uniclust30 and UniRef90)
and from metagenome databases (Metaclust, BFD, MGnify, Tara DB, MetaSource DB and IMG/M).
For protein multimers, the top N ranked MSAs for each constituent protein are selected for generating
potential paired MSAs. Each selected MSA for one constituent protein can be paired with the MSA of
another constituent.

Large-scale benchmark data show that several protein-related tasks, including
protein monomer and protein complex structure prediction, template detection, and contact or distance prediction,
can be significantly improved by using DeepMSA2's MSAs.

This package was developed by Wei Zheng. If you have any questions or find any bugs, please contact 
zhengwei@umich.edu or jlspzw139@sina.com.

==== Release notes ====
v2.2 (2024/03/22) 
     1. Fix an MSA combination scoring function issue in MSA_combination.py

v2.1 (2024/02/09)
     1. Fix a JGI search bug when dMSA does have enough sequences but qMSA does not
     2. Update Install_af2_env.sh with the numpy and tensorflow

v2.0 (2024/01/01)
     Original version of DeepMSA2

==== Installing the package ====
This package can be downloaded from https://zhanggroup.org/DeepMSA2/download/

This package is developed for 64-bit Linux only. If you use 64-bit Linux, you
should be able to use this package without compiling anything.

The package has been tested on CentOS Linux release 7.7.1908 with V100 GPU, python 3.8, 
CUDA 11.3.1, GCC 10.3.0, zip/unzip, tar, xz and all other required dependencies.

The package can be run in either GPU mode or CPU mode. If your cluster supports the SLURM
system, we recommend using it, but the package can also be run without SLURM.

In theory, the package should also work on macOS, provided that you compile
the respective components appropriately. In particular, this package contains source code 
and/or binary executables from the following packages: HHsuite, PSI-BLAST, AlphaFold2, and AlphaFold2-Multimer.

Before you use the DeepMSA2 package, you need to complete the following three steps:

1. Change the rootpath = "xxx" to the absolute path of your local DeepMSA2 package in config.py. 

2. Install the AlphaFold2 environments by running ./Install_af2_env.sh. You may need to type 'y' manually when prompted with 'Proceed ([y]/n)?'. This step takes around 30 minutes. All the required dependencies are listed in Install_af2_env.sh; if any dependency does not work on your system, you need to modify that script. If you run into problems, see https://github.com/kalininalab/alphafold_non_docker for advanced help. (If you get errors when running AlphaFold2, check the dependency versions and change them to ones that fit your system.)

3. Download the required databases by running "python3 Download_lib.py",
or go to https://zhanggroup.org/DeepMSA2/download/ to check the libraries.
In total, 3~4TB of hard disk space is required for storing the databases.
To uncompress the files, you need to have unzip, tar, and xz installed on your system.
The script downloads the DeepMSA2 sequence databases and the AlphaFold2 databases, including:

uniclust30
uniref90
metaclust
UniRef30
BFD
MGnify
JGIclust collected from IMG/M
MetaSourceDB
TaraDB

PDB70 and sequences for PDB70 (if you want to replace PDB70 with a different version, you need to change the sequence files accordingly)
MMCIF from PDB
parameters for AlphaFold2

Notice! If you want to use your own databases, please change config.py.
Then go to the alphafold and alphafold_multimer folders in the bin folder, and
change the corresponding databases in run_alphafold_*.sh

This step will take one to several days, depending on your internet bandwidth.
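
Before starting the multi-day download, it can help to confirm that the extraction tools are on your PATH and that the target disk has enough free space. The following is a minimal sketch and is not part of the package: the function names are hypothetical, and the 4 TB default is an assumption taken from the upper end of the 3~4TB estimate above.

```python
import shutil

REQUIRED_TOOLS = ("unzip", "tar", "xz")  # tools named above for uncompressing


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from `tools` that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def has_enough_space(path, required_tb=4.0):
    """Return True if the filesystem holding `path` has at least `required_tb` TB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_tb * 1000**4


if __name__ == "__main__":
    lib_dir = "."  # assumption: replace with the directory that will hold the databases
    print("Missing tools:", missing_tools() or "none")
    print("Enough free space:", has_enough_space(lib_dir))
```

If any tool is reported missing, install it with your system package manager before running Download_lib.py.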

==== Usage ====
Two examples are included in the package: example is a heteromeric protein complex containing two chains, and example2 is a protein monomer. The result files are also included in the example folders.
The run time depends on your system, the number of GPUs or CPUs, the IO speed, the settings of the modeling system (config.py), and the size of the protein (or protein complex).
If you have a SLURM system and all jobs can run in parallel on GPUs, the example should be completed in about ten hours.
If you want to run on a SLURM system, you need to change the sbatch parameters in config.py and utils.py.

The following Python programs require Python 3.8 or above.

==== Usage for protein monomer ====
For monomer MSA construction, put the single-chain FASTA-formatted sequence in the data folder (see example2).

Step 1: DeepMSA2_noIMG.py. This step runs dMSA and qMSA to collect MSAs from the uniclust30, UniRef30, uniref90, BFD, metaclust and mgnify databases.

  Usage: 
    DeepMSA2_noIMG.py [option]

    required options:
      -i=/home/simth/test/seq.fasta
      -o=/home/simth/test

    optional options:
      -run_type=local (default) or sbatch

After step 1, you will see a folder named MSA containing aMSA.*a3m/aln, dMSA.*a3m/aln, and qMSA.*a3m/aln.
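
The -key=value option style shown above can also be assembled from Python, for example when several targets are processed in a loop. This is only a sketch: build_command is a hypothetical helper, and launching the script with the current Python interpreter from the package directory is an assumption.

```python
import sys


def build_command(script, seq_fasta, out_dir, run_type="local"):
    """Build the argument list for a DeepMSA2 step using the -key=value option style."""
    if run_type not in ("local", "sbatch"):
        raise ValueError("run_type must be 'local' or 'sbatch'")
    return [
        sys.executable,  # assumption: run the script with the current interpreter
        script,
        f"-i={seq_fasta}",
        f"-o={out_dir}",
        f"-run_type={run_type}",
    ]


if __name__ == "__main__":
    cmd = build_command(
        "DeepMSA2_noIMG.py", "/home/simth/test/seq.fasta", "/home/simth/test"
    )
    print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) to start the step.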


Step 2: DeepMSA2_IMG.py. This step runs mMSA to collect MSAs from the JGIclust, MetaSourceDB and TaraDB databases.
You may need to run this step twice; repeat it until you see "xxx does not need additional JGI search" or "DeepMSA_IMG for xxx is finished." and no job is running.
(If you do not want to run the JGI step, you can skip this step and run step 3 directly after step 1.)

  Usage: DeepMSA2_IMG.py [option]

    required options:
      -i=/home/simth/test/seq.fasta
      -o=/home/simth/test (This should be the same output directory as DeepMSA2_noIMG.py step)

    optional options:
      -run_type=local (default) or sbatch

After this step, you will see a folder named JGI containing DB.fasta.*, DeepJGI.*a3m/aln, q3JGI.*a3m/aln, and q4JGI.*a3m/aln.
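
When this step is driven from a script, the two completion messages quoted above can be detected automatically. The sketch below assumes that the messages appear in the text you captured from the program (for example its standard output or a job log); img_step_done is a hypothetical helper, and it does not check whether jobs are still running.

```python
import re

# Patterns for the two completion messages quoted in step 2.
DONE_PATTERNS = (
    re.compile(r"does not need additional JGI search"),
    re.compile(r"DeepMSA_IMG for .+ is finished\."),
)


def img_step_done(output_text):
    """Return True if the captured text contains one of the completion messages."""
    return any(pattern.search(output_text) for pattern in DONE_PATTERNS)


if __name__ == "__main__":
    print(img_step_done("DeepMSA_IMG for seq is finished."))
    print(img_step_done("submitting JGI search jobs"))
```

If the function returns False, wait until no job is running and then run DeepMSA2_IMG.py again with the same options.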

Step 3: MSA_selection.py. This step applies the MSA ranking system to all MSAs collected in step 1 (and step 2).
Two MSA ranking methods are implemented: one is based on the DeepPotential contact map probability score, and the other is based on the AlphaFold2 pLDDT score.

  Usage: MSA_selection.py [option]

    required options:
      -i=/home/simth/seq.txt
      -o=/home/simth/test

    optional options:
      -run_type=local (default) or sbatch
      -method=deeppotential or alphafold2 (default)

After this step, you will find a folder named finalMSAs containing MSA_ranking.info, final.a3m/aln and *.*a3m/aln.
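
A quick sanity check on the selected MSA is to count its sequences. In the A3M format every sequence record starts with a '>' header line, so counting those lines gives the number of sequences (including the query). The helper below is a sketch with a hypothetical name; the location of final.a3m under the output directory in the usage note is an assumption.

```python
def count_a3m_sequences(a3m_path):
    """Count the records in an A3M file; each record starts with a '>' header line."""
    with open(a3m_path) as handle:
        return sum(1 for line in handle if line.startswith(">"))
```

For example, count_a3m_sequences("/home/simth/test/finalMSAs/final.a3m") would report the depth of the final MSA if the finalMSAs folder sits directly under the output directory.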

==== Usage for protein multimer ====
For a multimer, put all chains into one FASTA file. For example, for an A2B1 complex, you need
to put two A sequences and one B sequence into one FASTA file (see example for an A1B1 complex).
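
The input layout described above can be sketched as follows: one FASTA record per chain copy, so an A2B1 complex gives three records. The helper name, the header naming scheme and the toy sequences are assumptions for illustration only; check the example folder for the headers that the package actually expects.

```python
def complex_fasta(chains):
    """Return FASTA text with one record per chain copy.

    `chains` is a list of (name, sequence, copies) tuples.
    """
    records = []
    for name, sequence, copies in chains:
        for copy_index in range(1, copies + 1):
            # Hypothetical header scheme: chain name plus copy number.
            records.append(f">{name}_{copy_index}\n{sequence}\n")
    return "".join(records)


if __name__ == "__main__":
    # Toy A2B1 complex: two copies of chain A and one copy of chain B.
    print(complex_fasta([("A", "MKTAYIAK", 2), ("B", "GSHMLEDP", 1)]), end="")
```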

Step 1: Parse_sequence.py. This step parses the FASTA-format sequences contained in the data folder and checks all non-redundant sequences.

  Usage: Parse_sequence.py [option]

    required options:
    -o=/home/simth/test (/home/simth/test should contain the FASTA sequence file seq.fasta)

Step 2: Run_DeepMSA2.py. This step runs DeepMSA2 to collect MSAs from the uniclust30, UniRef30,
uniref90, BFD, metaclust, mgnify, JGIclust, MetaSourceDB and TaraDB databases for the component proteins.

  Usage: Run_DeepMSA2.py [option]

    required options:
    -o=/home/simth/test (/home/simth/test should contain the FASTA sequence file seq.fasta)

    optional options:
    -run_type=local (default) or sbatch

After step 2, for each component protein, you will see a folder named MSA containing aMSA.*a3m/aln, dMSA.*a3m/aln, and qMSA.*a3m/aln,
a folder named JGI containing DB.fasta.*, DeepJGI.*a3m/aln, q3JGI.*a3m/aln, and q4JGI.*a3m/aln,
and a folder named finalMSAs containing MSA_ranking.info, final.a3m/aln and *.*a3m/aln.
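
Before moving on to step 3, you may want to confirm that the three folders listed above exist for every component protein. The sketch below uses a hypothetical helper; where the per-component directories are located is not assumed here, so you pass each directory explicitly.

```python
from pathlib import Path

EXPECTED_FOLDERS = ("MSA", "JGI", "finalMSAs")  # folders named after step 2


def missing_folders(component_dir, expected=EXPECTED_FOLDERS):
    """Return the expected output folders that do not exist under `component_dir`."""
    base = Path(component_dir)
    return [name for name in expected if not (base / name).is_dir()]
```

An empty list means that all three folders are present for that component.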

Step 3: MSA_combination.py. This step combines the MSAs from the different components and ranks the combined MSAs with a scoring function that combines
the joint MSA Neff value and the pLDDT score of the MSA for each component protein. The MSA filter is only applied when 'joint_MSA_filter=True' in config.py.

  Usage: MSA_combination.py [option]

    required options:
    -o=/home/simth/test (/home/simth/test should contain the FASTA sequence file seq.fasta)

After step 3, you will see a folder named finalMSAs that contains all combined MSAs, and a ranking file MSA_ranking.info (if joint_MSA_filter=True in config.py).
In each ranked joint MSA folder, the individual chain MSAs are in the A/B/C/... folders and are named "alphafold2.a3m". There are three kinds of joint MSAs:

paired.aln

paired.aln is a linked MSA in which all chains in a row must come from the same species. For an A1B1C1 complex, the MSA should look like

AAAAAAAAAAAABBBBBBBBBBBBCCCCCCCCCCCCCC (linked by species 1)
AAAAAAAAAAAABBBBBBBBBBBBCCCCCCCCCCCCCC (linked by species 1)
AAAAAAAAAAAABBBBBBBBBBBBCCCCCCCCCCCCCC (linked by species 2)

full_paired.aln

full_paired.aln is a linked MSA in which at least two chains can be linked within the same species
(chains without a sequence from that species are filled with gaps). For an A1B1C1 complex, the MSA should look like

AAAAAAAAAAAABBBBBBBBBBBBCCCCCCCCCCCCCC (linked by species 1)
AAAAAAAAAAAABBBBBBBBBBBBCCCCCCCCCCCCCC (linked by species 1)
AAAAAAAAAAAABBBBBBBBBBBBCCCCCCCCCCCCCC (linked by species 2)
AAAAAAAAAAAA------------CCCCCCCCCCCCCC (linked by species 3)
AAAAAAAAAAAA------------CCCCCCCCCCCCCC (linked by species 4)
AAAAAAAAAAAABBBBBBBBBBBB-------------- (linked by species 5)

full_paired_pad.aln

full_paired_pad.aln is similar to full_paired.aln, but non-paired sequences are padded at the end of the MSA. For an A1B1C1 complex, the MSA should look like

AAAAAAAAAAAABBBBBBBBBBBBCCCCCCCCCCCCCC (linked by species 1)
AAAAAAAAAAAABBBBBBBBBBBBCCCCCCCCCCCCCC (linked by species 1)
AAAAAAAAAAAABBBBBBBBBBBBCCCCCCCCCCCCCC (linked by species 2)
AAAAAAAAAAAA------------CCCCCCCCCCCCCC (linked by species 3)
AAAAAAAAAAAA------------CCCCCCCCCCCCCC (linked by species 4)
AAAAAAAAAAAABBBBBBBBBBBB-------------- (linked by species 5)
AAAAAAAAAAAA--------------------------
AAAAAAAAAAAA--------------------------
------------BBBBBBBBBBBB--------------
------------BBBBBBBBBBBB--------------
------------------------CCCCCCCCCCCCCC
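
The three layouts above can be reproduced with a toy example. This is only an illustration of the row layouts and is not the pairing code used by MSA_combination.py. It assumes at most one sequence per species per chain, and the function name, the species labels and the row ordering are all hypothetical.

```python
def joint_rows(chain_hits, chain_lengths, min_chains, pad_unpaired=False):
    """Concatenate per-chain aligned sequences that share a species.

    chain_hits: one dict per chain mapping species -> aligned sequence.
    chain_lengths: aligned length of each chain, used for gap filling.
    min_chains: minimum number of chains that must contain the species.
    pad_unpaired: if True, append rows for the remaining species at the end.
    """
    # Collect species in order of first appearance (chain A first, then B, ...).
    species_order = []
    for hits in chain_hits:
        for species in hits:
            if species not in species_order:
                species_order.append(species)

    def row_for(species):
        # Chains lacking this species contribute a run of gap characters.
        return "".join(
            hits.get(species, "-" * length)
            for hits, length in zip(chain_hits, chain_lengths)
        )

    def n_chains(species):
        return sum(species in hits for hits in chain_hits)

    rows = [row_for(s) for s in species_order if n_chains(s) >= min_chains]
    if pad_unpaired:
        rows += [row_for(s) for s in species_order if n_chains(s) < min_chains]
    return rows


if __name__ == "__main__":
    chain_a = {"sp1": "AAAA", "sp2": "AAAA", "sp3": "AAAA", "sp4": "AAAA"}
    chain_b = {"sp1": "BBB", "sp3": "BBB", "sp5": "BBB"}
    chain_c = {"sp1": "CC", "sp2": "CC"}
    hits, lengths = [chain_a, chain_b, chain_c], [4, 3, 2]

    print("paired:", joint_rows(hits, lengths, min_chains=3))
    print("full_paired:", joint_rows(hits, lengths, min_chains=2))
    print("full_paired_pad:", joint_rows(hits, lengths, min_chains=2, pad_unpaired=True))
```

With min_chains equal to the number of chains, only fully linked rows are kept (as in paired.aln). With min_chains=2, partially linked rows are also kept (as in full_paired.aln), and pad_unpaired=True appends the single-chain rows at the end (as in full_paired_pad.aln).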

==== Usage of DeepMSA ====
If you want to use DeepMSA (version 1), go to the bin/dMSA folder; the dMSA pipeline is the same pipeline as DeepMSA (version 1).
To run DeepMSA, see the program ./bin/dMSA/scripts/build_MSA.py

==== Reference ====
Wei Zheng, Qiqige Wuyun, Yang Li, Chengxin Zhang, P Lydia Freddolino, Yang Zhang. 
Improving deep learning protein monomer and complex structure prediction using DeepMSA2 with huge metagenomics data. 
Nature Methods (2024). https://www.nature.com/articles/s41592-023-02130-4