Features Extracting

FeaturesReporter

Implementing a FeaturesReporter involves the following steps:

  1. Implement the FeaturesReporter class interface (see directly below).
  2. Add the FeaturesReporter to the FeaturesReporterFactory (rosetta/main/source/src/protocols/features/FeaturesReporterFactory.cc).
  3. Add the FeaturesReporter to the FeaturesReporterTests unit test (rosetta/main/source/test/protocols/features/FeaturesReporterTests.cxxtest.hh).
  4. Consider adding the FeaturesReporter to the features integration test (rosetta/main/tests/integration/tests/features).
  5. Document the FeaturesReporter in the Features Database Schema page.
  6. Add any new types to the organizational page.
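
Steps 1 and 2 can be sketched in isolation as follows. This is a minimal, standalone illustration and not the Rosetta code: ReporterSketch, PoseCommentsSketch, ReporterFactorySketch, and make_factory are hypothetical names, and the real FeaturesReporter interface additionally takes a Pose and a database session. The sketch only shows the shape of the pattern: a reporter exposes a unique type name and a schema, and a factory maps that name to a creator.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-in for the reporter base class (not the Rosetta class).
class ReporterSketch {
public:
  virtual ~ReporterSketch() = default;
  virtual std::string type_name() const = 0;  // unique name used for lookup
  virtual std::string schema() const = 0;     // SQL that creates the table(s)
};

// Step 1: a concrete reporter implements the interface.
class PoseCommentsSketch : public ReporterSketch {
public:
  std::string type_name() const override { return "PoseCommentsSketch"; }
  std::string schema() const override {
    return "CREATE TABLE IF NOT EXISTS pose_comments "
           "(struct_id INTEGER, key TEXT, value TEXT);";
  }
};

// Hypothetical name-to-creator registry (assumed design, for illustration).
class ReporterFactorySketch {
public:
  using Creator = std::function<std::unique_ptr<ReporterSketch>()>;

  void add(std::string const & name, Creator creator) {
    assert(creator && "creator must be callable");
    creators_[name] = std::move(creator);
  }

  std::unique_ptr<ReporterSketch> create(std::string const & name) const {
    auto it = creators_.find(name);
    if (it == creators_.end()) {
      throw std::invalid_argument("unknown reporter: " + name);
    }
    return it->second();
  }

private:
  std::map<std::string, Creator> creators_;
};

// Step 2: register the new reporter with the factory under its type name.
ReporterFactorySketch make_factory() {
  ReporterFactorySketch factory;
  factory.add("PoseCommentsSketch", [] {
    return std::unique_ptr<ReporterSketch>(new PoseCommentsSketch());
  });
  return factory;
}
```

With this sketch, make_factory().create("PoseCommentsSketch") returns a reporter whose schema() mentions pose_comments, while an unregistered name throws std::invalid_argument.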

FeaturesReporter Class Interface

The FeaturesReporter (rosetta/main/source/src/protocols/features/FeaturesReporter.hh) base class interface has the following components, which should be implemented by a new FeaturesReporter:

Required Methods

Optional Methods

As an example, consider the PoseCommentsFeatures (rosetta/main/source/src/protocols/features/PoseCommentsFeatures.hh) feature reporter. Arbitrary textual information may be associated with a pose in the form of (key, value) comments (see rosetta/main/source/src/core/pose/util.hh). The PoseCommentsFeatures FeaturesReporter extracts all defined comments to a table pose_comments, using struct_id and key as the primary key. The struct_id references the structures table, which identifies each of the structures in the database.

In the report_features function, db_session (of type sessionOP) is an owning pointer to the database session where the features should be written. See the database interface for how to obtain and interact with database sessions.

string
PoseCommentsFeatures::type_name() const { return "PoseCommentsFeatures"; }

string
PoseCommentsFeatures::schema() const {
  return
    "CREATE TABLE IF NOT EXISTS pose_comments (\n"
    " struct_id INTEGER,\n"
    " key TEXT,\n"
    " value TEXT,\n"
    " FOREIGN KEY (struct_id)\n"
    " REFERENCES structures (struct_id)\n"
    " DEFERRABLE INITIALLY DEFERRED,\n"
    " PRIMARY KEY(struct_id, key));";
}

Size
PoseCommentsFeatures::report_features(
  Pose const & pose,
  Size struct_id,
  sessionOP db_session
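){
  typedef map< string, string >::value_type kv_pair;
  // Write one row per (key, value) comment attached to the pose.
  foreach(kv_pair const & kv, get_all_comments(pose)){
    // Bind struct_id, key, and value to the three placeholders, in order.
    statement stmt = (*db_session) <<
      "INSERT INTO pose_comments VALUES (?,?,?)" <<
      struct_id << kv.first << kv.second;
    stmt.exec();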
  }

  return 0;
}
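
To make the shape of the written data concrete, here is a minimal standalone sketch of the rows that the loop above produces, with no Rosetta or database dependency. CommentRow and comment_rows are hypothetical names introduced for illustration, and a plain std::map stands in for the comments that get_all_comments(pose) would return.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <tuple>
#include <vector>

// One (struct_id, key, value) row, mirroring the pose_comments columns.
using CommentRow = std::tuple<std::size_t, std::string, std::string>;

// Build one row per comment. Because std::map keys are unique, each
// (struct_id, key) pair appears at most once, which is what the
// PRIMARY KEY(struct_id, key) constraint in the schema requires.
std::vector<CommentRow> comment_rows(
  std::size_t struct_id,
  std::map<std::string, std::string> const & comments
) {
  std::vector<CommentRow> rows;
  for (auto const & kv : comments) {
    rows.emplace_back(struct_id, kv.first, kv.second);
  }
  return rows;
}
```

Rows come out in key order because std::map iterates its keys in sorted order.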

ReportToDB

Use the ReportToDB mover in a Rosetta XML script to specify which features should be extracted to the features database.

    <ROSETTASCRIPTS>
        <SCOREFXNS>
            <s weights="score12_w_corrections"/>
        </SCOREFXNS>
        <MOVERS>
            <ReportToDB name="features" database_name="scores.db3">
                <feature name="ScoreTypeFeatures"/>
                <feature name="StructureScoresFeatures" scfxn="s"/>
            </ReportToDB>
        </MOVERS>
        <PROTOCOLS>
            <Add mover_name="features"/>
        </PROTOCOLS>
    </ROSETTASCRIPTS>

Since ReportToDB is simply a mover, it can be included in any Rosetta protocol. For example, to extract the features from a set of pdb files listed in structures.list, with the above script saved as parser_script.xml, execute the following command:

   rosetta_scripts.linuxgccrelease -output:nooutput -l structures.list -parser:protocol parser_script.xml

This will generate an SQLite3 database file, scores.db3, containing the features defined in each of the specified FeaturesReporters for each structure in structures.list. See the features integration test (rosetta/main/tests/integration/tests/features) for a working example.

Extracting Features In Parallel

Currently, the ReportToDB mover is not compatible with MPI runs. There is, however, support for partitioning a sample source into batches, generating a features database for each batch, and merging them together. See the features_parallel integration test (rosetta/main/tests/integration/tests/features_parallel) for a working example.

For example, if there are 1000 structures split into 4 batches, then the script for the run processing the first batch would contain:

   <ReportToDB name="features_reporter" db="features.db3_01" sample_source="batch1" protocol_id="1" first_struct_id="1">
      ...
   </ReportToDB>

and the script for the run processing the second batch would contain:

   <ReportToDB name="features_reporter" db="features.db3_02" sample_source="batch2" protocol_id="2" first_struct_id="251">
      ...
   </ReportToDB>
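
The first_struct_id for each batch follows from simple arithmetic. The sketch below is a hypothetical helper, not part of Rosetta. It assumes struct ids start at 1, that each batch is assigned a contiguous, non-overlapping range, and that any remainder is spread over the leading batches.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Return the first struct_id of each batch, assuming ids start at 1 and
// batches occupy contiguous, non-overlapping id ranges.
std::vector<std::size_t> first_struct_ids(
  std::size_t n_structures,
  std::size_t n_batches
) {
  assert(n_batches > 0);
  std::size_t const base = n_structures / n_batches;   // minimum batch size
  std::size_t const extra = n_structures % n_batches;  // batches with one more
  std::vector<std::size_t> firsts;
  std::size_t next = 1;
  for (std::size_t b = 0; b < n_batches; ++b) {
    firsts.push_back(next);
    next += base + (b < extra ? 1 : 0);
  }
  return firsts;
}
```

For 1000 structures in 4 batches, first_struct_ids(1000, 4) yields 1, 251, 501, 751.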

After the runs are complete, locate the merge_databases.py script (rosetta/main/tests/scientific/cluster/features/sample_sources/merge_databases.py) and run:

   python merge_databases.py features.db3 features.db3_*

This merges the features from each of the features.db3_xx databases into features.db3.