The optE_parallel application

Metadata

Author: Andrew Leaver-Fay

This document was written by Andrew Leaver-Fay in October 2013. The optE_parallel code is maintained in the Kuhlman lab at UNC. Send questions to (aleaverfay@gmail.com).

Code and Demo

The optE application lives in Rosetta/main/source/src/apps/public/weight_optimization/optE_parallel.cc which serves merely as a wrapper for the code in Rosetta/main/source/src/protocols/optimize_weights/. In particular, the main driver class is IterativeOptEDriver, which is implemented in the .cc file with the same name.

A demo for the sequence-profile recovery procedure can be found in the Rosetta/demos/public/optE directory. This directory contains an "inputs" directory and a "run" directory. See the "launch_optE.scr" csh script for an example of how to set up and launch an optE job.

References

Andrew Leaver-Fay, Matthew J. O'Meara, Mike Tyka, Ron Jacak, Yifan Song, Elizabeth H. Kellogg, James Thompson, Ian W. Davis, Roland A. Pache, Sergey Lyskov, Jeffrey J. Gray, Tanja Kortemme, Jane S. Richardson, James J. Havranek, Jack Snoeyink, David Baker, Brian Kuhlman. (2013) "Scientific Benchmarks for Guiding Macromolecular Energy Function Improvement." Methods in Enzymology. Volume 523. Chapter 6. Pages 109-143.

Purpose

The optE_parallel application serves two purposes. a) To fit the reference energies to maximize sequence recovery and minimize the KL-Divergence of the amino-acid sequence profile by performing iterative rounds of (usually, complete-protein) redesign. b) To fit the weights themselves (as well as the reference energies) using a maximum-likelihood approach, where the weights are optimized to maximize the likelihood of various experimental observations (e.g. native sequences, native rotamers, ddGs of mutation). The first scenario produces high quality reference energies which, when fed into sequence-recovery benchmarks, yeild high sequence recovery rates. The second scenario is not worth very much and is not recommended for use. Its shortcomings are documented extensively in the Methods In Enzymology paper referenced above.

Algorithm

Sequence profile recovery proceedure

Given an input set of PDB files, the reference energy optimization procedure performs iterative redesign on these PDB files, changing the reference energies at each round. After each round of redesign, optE alters the reference energies, increasing the reference energies for amino acids that appeared in the designed structures more than they appeared in the input structures, and decreasing the reference energies for amino acids that appeared in the designed structures less often than they appeared in the input structures. For this reason, this proceedure is called the "amino acid profile recovery" or "profile recovery" procedure. Optionally, a RosettaScripts XML file which gives a set of TaskOperations can be used to restrict design to a subset of the positions on the input structures. This can be used to optimize reference energies for specific regions of proteins, e.g., at protein-protein interfaces.

There are 20 amino acids for which reference energies are fit, but since the only important property of reference energies is the difference between them for two amino acids, there are effectively only 19 free parameters. For this reason, the reference energies may all be scaled upwards or downwards by a constant value without affecting their use in design. We chose to shift the reference energies so that they sum to 0.

The code operates in an inner and an outer loop. In the inner loop, the reference energies are adjusted upwards or downwards based on the results from the previous round of design, then all input proteins are redesigned. If you are running within MPI, then this step will be parallelized across the available processors. Once design has completed, the sequence recovery rate and the cross entropy (the KL-divergence + the entropy) is measured. A score built from the weighted sum SRR + w*XE (SRR = sequence recovery rate, XE = cross entropy, w = -0.1) is computed for the inner loop. If this score is greater than the score from the previous iteration through the outer loop, then the inner loop exits. Whenever the inner loop exits (either because it has reached six iterations, or because the score has increased over the previous iteration through the outer loop) its weight set is writen to the weightdir/ directory, and its score is assigned as the score for the current iteration through the outer loop. Each weight file that is written to disk represents either a point at which the score improved, or a significant amount of effort has been expended. On weight file is output for each iteration in the outer loop. For the first two iterations of the outer loop, the inner loop proceeds through at least four iterations. Both the number of inner and outer-loop iterations may be controlled by the command line.

Maximum-likelihood weight and reference-energy optimization

OptE was originally designed to fit both weights and reference energies using an iterative procedure, and the structure of the reference-energy fitting routine takes its shape from the maximum-likelihood procedure. The structure of this procedure oscilates between objective-function optimization and complete-protein redesign. Each oscilation represents one iteration of the outer loop. (The "inner loop" describes the behavior of the complete-protein redesign phase). In the objective-function optimization phase, the weights and reference energies are optimized using a combination of particle-swarm and gradient-based minimization procedures. The nature of the objective function is described in detail in the Methods in Enzymology paper. The weight set that results from the objective-function-optimization phase is then mixed at varying levels with the weight set from the previous iteration through the outer loop, and the sequence-recovery rate for the resulting weight set is measured. This mixing procedure and the rationale behind it is described in greater detail in the Methods In Enzymology paper. The inner loop continues until the sequence recovery rate improves over the previous iteration through the outer loop, or until six iterations have completed.

When run in this manner, the user must provide a list of the weights that are to be held fixed (the fixed weights) and a list of the weights that are allowed to change (the free weights). OptE does not directly place any limitations on where the free and fixed weights may explore which can result in weights that get very large or that go negative. In particular, the weight on both the Ramachandran term (rama) and the repulsive component of the Lennard-Jones term (fa_rep), when allowed to float, often go negative. To avoid this problem, the user may provide arbitrary flat-bottom harmonic constraints to apply to the weights, which will be included in the objective function.

Limitations

Limitations of the reference-energy optimization procedure

OptE continues iterating even after it has begun to converge. Sometimes the best weight set (the one with the highest score) appears after the 10th iteration through the outer loop, sometimes after the 6th iteration. The recommended weight set is the one with the highest score. The iteration can be deduced from the log file by looping for the last occurrence of the word "accepting."

Independent executions of optE will produce very similar reference energies, and these reference energies may produce similar sequence recovery rates, but the differences between these rates are probably statistically significant. That is, the stochastic sequence recovery benchmark is surprisingly precise so that the standard deviation is often less than 0.05%. On the other hand, the sequence recovery rates resulting from independent runs of optE will have a standard deviation of about 0.5%. Triplicate optE runs are recommended.

Surprisingly, it turns out that sampling strategies are closely connected with the set of reference energies you need. If you want to use the min-packer, then optE has to be run using the min packer. If you want to use the discrete packer with the -ex1 -ex2 flags, then optE has to be run using the discrete packer -ex1 -ex2. OptE will use the discrete packer by default (extra rotamer flags have to be provided on the command line) and has to be instructed to use the min-packer.

Limitations of the maximum-likelihood weight and reference-energy optimization procedure

Though the maximum-likelihood optimization procedure selects weight sets based solely on sequence recovery, and the reference-energy optimization procedure selects weights based on both sequence recovery and sequence profile recovery, the reference-energy optimization procedure yeilds (dramatically) higher sequence recovery rates.

Options

Options for the sequence profile recovery optimization proceedure

The following options are recommended for the reference energy fitting procedure:

Other useful flags for the sequence-profile recovery procedure

The following options are sometimes useful when running the reference energy fitting procedure

Tips for running OptE

Here are a set of recommendations for managing optE job control and generally getting the most out of optE.

/path/to/Z.pdb
/path/to/A.pdb
/path/to/Y.pdb
/path/to/B.pdb
...
#!/bin/csh -f

@ num_procs = 27;
@ i = 0

#clean up
rm -rf workdir_* weightdir logdir

while ( $i < $num_procs )
   mkdir workdir_$i
   #echo "making directory workdir_" $i
   @ i = $i + 1
end
mkdir weightdir
mkdir logdir

bsub -n $num_procs -q week -o logdir/optE_log -a mvapich mpirun /nas02/home/l/e/leaverfa/GIT/main/source/bin/optE_parallel.mpi.linuxgccrelease @optE_seqprof.flags @score.flags > submission.log

Expected Outputs

OptE expects to be run in a directory that contains several subdirectories:

The weight files are all written to the weightdir directory.

Post Processing

The script find_best_weight_set.py identifies the weight set with the highest score that should be considered the output from optE. This script can be found in the Rosetta/demos/public/optE/scripts/ directory.

New things since last release

More coming soon.