====================================================================================================

                  One-Dimensional Bidirectional Recurrent Neural Network (1D-BRNN)

                             Method Description & Project Documentation

====================================================================================================

Author(s) :  Christophe Magnan (cmagnan@ics.uci.edu)
Copyright :  Institute for Genomics and Bioinformatics
             University of California, Irvine
Modified  :  2015/07/01

====================================================================================================
                                         Method Description
====================================================================================================

1D-BRNN (One-Dimensional Bidirectional Recurrent Neural Network) is a recurrent neural network with
a structure specifically designed for machine learning problems where the examples of interest are
naturally organized in sequences and where the class to predict for each position in the sequence is
likely to depend on the adjacent positions as well.

A BRNN is a set of three neural networks with shared and specific properties: the forward network
(or left context), the backward network (or right context) and the output network (or main
network), denoted FWDnet, BWDnet and MAINnet respectively in the rest of this document. Each network
has exactly one hidden layer and takes as input, for each position t in a given sequence, the
features associated with position t plus a set of additional features described below. The numbers
of nodes in the hidden layers and in the output layers of FWDnet and BWDnet are options to set when
training a new model. The number of nodes in the output layer of MAINnet is the number of target
classes for the prediction problem. The specificities of each network are described below:

- FWDnet takes as additional inputs for a position t in a sequence its own outputs
  for the position t-1 (left context) of the same sequence (0 for the first position).
  The sequence is thus propagated through the network in the forward direction.

- BWDnet takes as additional inputs for a position t in a sequence its own outputs
  for the position t+1 (right context) of the same sequence (0 for the last position).
  The sequence is thus propagated through the network in the backward direction.

- MAINnet takes as additional inputs for a position t in a sequence all the outputs of FWDnet
  and BWDnet for the position t and for a given number of adjacent positions in the sequence
  (window centered on position t). The size of the window is one of the model parameters.
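
The two propagation directions described above can be sketched as follows. This is a minimal
illustration, not the code of the package: the names Vec, Step, propagate_forward and
propagate_backward are hypothetical, and the Step callback stands in for a whole context network
(hidden layer included). Only the recurrence and the zero context at the sequence ends come from
the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

using Vec = std::vector<double>;

// Hypothetical stand-in for FWDnet or BWDnet: maps the features of position t and the
// context coming from the neighbouring position to the outputs of the network at t.
using Step = std::function<Vec(const Vec& features, const Vec& context)>;

// FWDnet-style pass: position t receives the outputs computed for position t-1
// (a vector of zeros for the first position).
std::vector<Vec> propagate_forward(const std::vector<Vec>& seq, std::size_t ctx_size,
                                   const Step& step) {
    std::vector<Vec> out(seq.size());
    Vec prev(ctx_size, 0.0);
    for (std::size_t t = 0; t < seq.size(); ++t) {
        out[t] = step(seq[t], prev);
        assert(out[t].size() == ctx_size);
        prev = out[t];
    }
    return out;
}

// BWDnet-style pass: position t receives the outputs computed for position t+1
// (a vector of zeros for the last position).
std::vector<Vec> propagate_backward(const std::vector<Vec>& seq, std::size_t ctx_size,
                                    const Step& step) {
    std::vector<Vec> out(seq.size());
    Vec next(ctx_size, 0.0);
    for (std::size_t t = seq.size(); t-- > 0;) {
        out[t] = step(seq[t], next);
        assert(out[t].size() == ctx_size);
        next = out[t];
    }
    return out;
}
```

MAINnet would then read, for each position t, the entries of both result vectors that fall inside
the window centered on t.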

Note that the output layers of FWDnet and BWDnet are actually inner layers of the complete network
and are thus processed as such in the algorithms. BRNNs are trained using the back-propagation
method, with a tanh sigmoid function for the outputs of all the nodes in the inner layers and a
normalized exponential (softmax) function for the outputs of MAINnet. The outputs of the BRNN
(outputs of MAINnet) for a position t in a sequence are thus the predicted probabilities of each
possible target class for the prediction problem.
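
As an illustration of the two output functions named above, the sketch below computes a tanh
output for an inner node and a normalized exponential over the activations of the MAINnet output
nodes. The function names are hypothetical, and the package may compute these values differently
(for instance with a faster approximation of tanh). Picking the most probable class as the
predicted class is an assumption, although it is consistent with the prediction example given
later in this documentation.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Inner-layer node: tanh maps any activation into the interval (-1, 1).
double inner_output(double activation) { return std::tanh(activation); }

// MAINnet outputs: normalized exponential of the activations of the output nodes.
// The results are positive and sum to 1, so they can be read as class probabilities.
std::vector<double> normalized_exponential(const std::vector<double>& activations) {
    assert(!activations.empty());
    // Subtracting the maximum does not change the result and avoids overflow in exp().
    const double max_a = *std::max_element(activations.begin(), activations.end());
    std::vector<double> p(activations.size());
    double sum = 0.0;
    for (std::size_t k = 0; k < p.size(); ++k) {
        p[k] = std::exp(activations[k] - max_a);
        sum += p[k];
    }
    for (double& v : p) v /= sum;
    return p;
}

// Assumption: the predicted class is the index of the largest probability.
std::size_t most_probable_class(const std::vector<double>& p) {
    assert(!p.empty());
    return static_cast<std::size_t>(std::max_element(p.begin(), p.end()) - p.begin());
}
```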

====================================================================================================
                                        Project Documentation
====================================================================================================

This section describes the software and how to use the different programs available.

==========================================  Source Code  ===========================================

The source code of the project is located in the src folder of the package. A brief overview of the
different source files located in this folder is given below.

File                         Content

makefile                     Compiles the source code and generates the four binaries
Import.h                     Imports the necessary C/C++ libraries
Class Options                Options to train or retrain a BRNN model
Class Sequence               Sequence data and model predictions for the sequence
Class Dataset                Dataset of sequences given as input to the program
Class Layer                  Single layer of a neural network
Class Network                Single neural network with one hidden layer
Class Model                  BRNN model with three neural networks
Train_New_Model.cpp          Train a new BRNN model on a dataset of sequences
Train_Existing_Model.cpp     Retrain an existing BRNN model on a dataset of sequences
Predict_Single_Model.cpp     Predictions of a single BRNN model on a dataset of sequences
Predict_Multi_Models.cpp     Predictions of several BRNN models on a dataset of sequences

=======================================  Software Binaries  ========================================

Four different programs/binaries are available in the bin folder of the project:

1) train_model

Description : Train a new BRNN model on a dataset of sequences
Usage       : ./train_model  options_file  train_dataset  test_dataset  output_model

2) retrain_model

Description : Retrain an existing BRNN model on a dataset of sequences; may also be used to resume
            : the training of an existing BRNN model when more training periods are needed
Usage       : ./retrain_model  options_file  train_dataset  test_dataset  input_model  output_model

3) predict_single

Description : Predictions of a single BRNN model on a dataset of sequences
Usage       : ./predict_single  dataset  model  predictions

4) predict_multi

Description : Predictions of several BRNN models on a dataset of sequences
Usage       : ./predict_multi  dataset  models_list  predictions


The arguments are always file names and can be given as either absolute or relative paths.
The content of these files is described below:

a) options_file

Configuration file for training or retraining a BRNN model. The file format and the options are
detailed in the section "Training & Retraining Options" of this documentation.

b) train_dataset, test_dataset, dataset

Input datasets for the programs. Note that a test dataset must be provided to train or retrain a
BRNN model. Training stops when the maximum number of training epochs given in the options file has
been reached; the program does not check for possible overfitting. After each training period, the
current model is tested on both datasets and the results are displayed on screen, so that users can
easily detect overfitting and decide to stop the training procedure manually. The last trained
model is always written to the output file of the program.
The file format of the datasets is described in the section "Datasets" of this documentation.

c) input_model, output_model, model, models_list

BRNN model(s) read or written by the programs. The file format used for the models
is not detailed in this documentation and can be found directly in the source code.
"models_list" is a special case: it is not a file containing a model but a file containing
a list of files containing BRNN models. The first line must give the number of models in the
list and each subsequent line must give the path/file name of a BRNN model.
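
A reader for the "models_list" layout described above could look like the sketch below. It is an
illustration only: the function name read_models_list is hypothetical, and skipping blank lines
and trailing carriage returns is an assumption that the documentation does not state.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Reads a models list: the first line holds the number of models,
// each following line holds the path/file name of one model.
std::vector<std::string> read_models_list(std::istream& in) {
    std::string line;
    std::size_t count = 0;
    if (!std::getline(in, line)) throw std::runtime_error("models_list: empty file");
    std::istringstream header(line);
    if (!(header >> count)) throw std::runtime_error("models_list: first line must be a count");

    std::vector<std::string> paths;
    while (paths.size() < count && std::getline(in, line)) {
        if (!line.empty() && line.back() == '\r') line.pop_back();  // tolerate CRLF files
        if (!line.empty()) paths.push_back(line);                   // assumption: skip blanks
    }
    assert(paths.size() <= count);
    if (paths.size() != count) throw std::runtime_error("models_list: fewer models than announced");
    return paths;
}
```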

d) predictions

Predictions of a single BRNN model or combined predictions of several BRNN models on a dataset of
sequences. Predictions for the sequences are written in the same order as the sequences in the
dataset, with the following format: the first line gives the length of the sequence and each line
afterwards (until the next sequence) gives the model predictions for the next position in the
sequence, starting with the first position. An example for one position of a sequence for a 3-class
prediction problem is given below:

in 2 out 0 pb 0.45 0.15 0.40

In this example, the target class of the position was 2, the predicted class of the position is 0
and the predicted probabilities of the classes 0, 1 and 2 are respectively 0.45, 0.15 and 0.40.
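
For post-processing, a line in this format can be parsed as in the sketch below. The struct
Prediction and the function parse_prediction_line are hypothetical names introduced for this
example; only the "in ... out ... pb ..." layout comes from the example above.

```cpp
#include <cassert>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// One line of a predictions file: "in <target> out <predicted> pb <p0> <p1> ...".
struct Prediction {
    int target;                         // class given in the input dataset
    int predicted;                      // class predicted by the model(s)
    std::vector<double> probabilities;  // one predicted probability per class
};

Prediction parse_prediction_line(const std::string& line) {
    std::istringstream in(line);
    std::string tag_in, tag_out, tag_pb;
    Prediction p{};
    if (!(in >> tag_in >> p.target >> tag_out >> p.predicted >> tag_pb) ||
        tag_in != "in" || tag_out != "out" || tag_pb != "pb") {
        throw std::runtime_error("unexpected prediction line: " + line);
    }
    double value = 0.0;
    while (in >> value) p.probabilities.push_back(value);  // remaining fields are probabilities
    return p;
}
```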

==================================  Training & Retraining Options  =================================

Options to train or retrain a BRNN are divided in two sets:

  - options to configure the model
  - options to train the model

The first set of options is required to train a new model on a dataset. These options correspond
to the structure of the BRNN (number of inputs, outputs, hidden nodes, etc.). It is not necessary
to give these options to retrain an existing model since these parameters are stored in the file
containing the model to retrain. A description of these options is given below.

FEATURES      :  Number of data features in input of the BRNN
CLASSES       :  Number of target classes in output of the BRNN
HIDDEN        :  Number of hidden nodes in the main network of the BRNN
CONTEXT_FWD   :  Number of adjacent positions whose FWDnet outputs are additional inputs of MAINnet
CONTEXT_BWD   :  Number of adjacent positions whose BWDnet outputs are additional inputs of MAINnet
OUTPUTS_FWD   :  Number of output nodes in the forward network of the BRNN
OUTPUTS_BWD   :  Number of output nodes in the backward network of the BRNN
HIDDEN_FWD    :  Number of hidden nodes in the forward network of the BRNN
HIDDEN_BWD    :  Number of hidden nodes in the backward network of the BRNN

The second set of options is required to train a new model on a dataset or to retrain an existing
model using a new dataset. These options correspond to the configuration of the learning procedure
itself. A description of these options is given below.

LEARN_RATE    :  Learning rate - controls the weight updates during the maximization step
NUM_EPOCHS    :  Number of training periods to run with or without the adaptive procedure
NUM_BATCHS    :  Number of model updates per training period (dataset divided into batches)
ADAP_EPOCHS   :  Number of periods without improvement before decreasing the learning rate
ADAP_RELOAD   :  Reload the last model saved when the adaptive procedure starts
SHUFFLE       :  Shuffle the training dataset before each training period
SEED          :  Seed for randomization functions (0 to generate it automatically)
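
The role of ADAP_EPOCHS can be pictured with the sketch below. It is a rough illustration under
several assumptions and does not reproduce the code of the package: the struct AdaptiveRate is
hypothetical, improvement is assumed to mean a lower error than the best one seen so far, and the
decay factor is a made-up parameter (the amount by which the learning rate is decreased is not
documented here). Reloading the saved model (ADAP_RELOAD) is left out.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical bookkeeping for an adaptive learning rate.
struct AdaptiveRate {
    double rate;              // current learning rate (starts at LEARN_RATE)
    std::size_t adap_epochs;  // ADAP_EPOCHS: periods without improvement before a decrease
    double decay;             // assumption: multiplicative factor in (0, 1), not a documented option
    double best_error;        // lowest error seen so far (negative until the first period)
    std::size_t stalled;      // consecutive periods without improvement

    AdaptiveRate(double learn_rate, std::size_t adap_epochs_, double decay_)
        : rate(learn_rate), adap_epochs(adap_epochs_), decay(decay_),
          best_error(-1.0), stalled(0) {
        assert(learn_rate > 0.0 && adap_epochs_ > 0 && decay_ > 0.0 && decay_ < 1.0);
    }

    // Call once per training period with the error measured after that period.
    // Returns true when the learning rate has just been decreased.
    bool update(double error) {
        if (best_error < 0.0 || error < best_error) {
            best_error = error;
            stalled = 0;
            return false;
        }
        ++stalled;
        if (stalled < adap_epochs) return false;
        rate *= decay;
        stalled = 0;
        return true;
    }
};
```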

The options must be written in a single file and each option value must be separated from the
option name by a space. An example is given in the doc folder of the project.
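
A minimal reader for such a file is sketched below, assuming one "NAME value" pair per line. The
function names read_options and option_as_double are hypothetical, and the package may accept or
reject inputs that this sketch handles differently (blank lines, repeated names, extra fields).
The example file in the doc folder remains the reference for the exact layout.

```cpp
#include <cassert>
#include <istream>
#include <map>
#include <sstream>
#include <stdexcept>
#include <string>

// Reads "NAME value" pairs, one per line, into a name -> value map.
std::map<std::string, std::string> read_options(std::istream& in) {
    std::map<std::string, std::string> options;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string name, value;
        if (!(fields >> name)) continue;  // assumption: blank lines are ignored
        if (!(fields >> value)) throw std::runtime_error("option without a value: " + name);
        options[name] = value;            // a repeated name keeps its last value
    }
    return options;
}

// Looks up a numeric option such as LEARN_RATE and converts it to a double.
double option_as_double(const std::map<std::string, std::string>& options,
                        const std::string& name) {
    assert(!name.empty());
    const auto it = options.find(name);
    if (it == options.end()) throw std::runtime_error("missing option: " + name);
    return std::stod(it->second);
}
```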

============================================  Datasets  ============================================

Datasets of sequences must be written in a single file. The first line must be as follows:

   num_sequences num_features num_classes

num_sequences = total number of sequences in the dataset/file
num_features  = number of input data features for each position in the sequences
num_classes   = number of possible target classes

The next lines are the sequences in the dataset, written one after another in the following format:

   sequence_length                                     // Number of positions in the sequence
   class_p1 feature1_p1 feature2_p1 ... featureK_p1    // Class and features for position 1
   class_p2 feature1_p2 feature2_p2 ... featureK_p2    // Class and features for position 2
   ...
   class_pn feature1_pn feature2_pn ... featureK_pn    // Class and features for position n

The classes must be named using consecutive integers starting at 0. For instance, if there are 3
possible classes in the prediction problem, they must be denoted 0, 1 and 2. The class must be
provided for all the positions of all the sequences in a dataset. To get the predictions of a BRNN
model on a sequence where the classes are not known, placeholder classes must be provided in the
dataset. When using the programs "predict_single" or "predict_multi", the classes given as input
are ignored but are reported in the output file together with the predictions.
It is recommended to normalize the features to [0,1].
An example of a dataset is given in the 'doc' folder of the project.
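
The dataset layout described in this section can be read as in the sketch below. The types
Position, SequenceData and DatasetData and the function read_dataset are hypothetical names chosen
for this example and are unrelated to the classes of the package. The sketch assumes one position
per line and skips blank lines, which the format description does not mention.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

struct Position { int target; std::vector<double> features; };
using SequenceData = std::vector<Position>;
struct DatasetData {
    std::size_t num_features;
    std::size_t num_classes;
    std::vector<SequenceData> sequences;
};

// Returns the next non-blank line wrapped in a stream, or throws at end of input.
static std::istringstream next_line(std::istream& in) {
    std::string line;
    while (std::getline(in, line)) {
        if (line.find_first_not_of(" \t\r") != std::string::npos) return std::istringstream(line);
    }
    throw std::runtime_error("dataset: unexpected end of file");
}

DatasetData read_dataset(std::istream& in) {
    DatasetData d{};
    std::size_t num_sequences = 0;
    std::istringstream header = next_line(in);  // num_sequences num_features num_classes
    if (!(header >> num_sequences >> d.num_features >> d.num_classes))
        throw std::runtime_error("dataset: malformed first line");
    d.sequences.resize(num_sequences);
    for (SequenceData& seq : d.sequences) {
        std::size_t length = 0;
        std::istringstream length_line = next_line(in);  // sequence_length
        if (!(length_line >> length)) throw std::runtime_error("dataset: missing sequence length");
        seq.resize(length);
        for (Position& pos : seq) {
            std::istringstream row = next_line(in);      // class followed by the features
            if (!(row >> pos.target) || pos.target < 0 ||
                static_cast<std::size_t>(pos.target) >= d.num_classes)
                throw std::runtime_error("dataset: class must be an integer in [0, num_classes)");
            pos.features.resize(d.num_features);
            for (double& f : pos.features)
                if (!(row >> f)) throw std::runtime_error("dataset: missing feature value");
        }
    }
    return d;
}
```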

====================================================================================================
                                           Release Notes
====================================================================================================

Version 3.3 (2015)

Author      :  Christophe Magnan
Description :  Minor revision
Comments    :  Repackaged for SCRATCH-1D release 1.1

Version 3.2 (2013)

Author      :  Christophe Magnan
Description :  Bug fixes for version 3.1
Comments    :  Issue with the example dataset provided in the package corrected

Version 3.1 (2012)

Author      :  Christophe Magnan
Description :  Bug fixes for version 3.0
Comments    :  Improved compatibility with the models generated by versions < 3.0

Version 3.0 (2011)

Author      :  Christophe Magnan
Description :  New generic version
Comments    :  Source code entirely rewritten to fix the following issues:
               - code incompatible with new C++ compilers
               - large part of the code unused
               - memory usage not optimized
               - small errors in the algorithm
               - no checks performed on the inputs or options

Version 2.1 (2003)

Author      :  Jianlin Cheng
Description :  New custom version for SCRATCH
Comments    :  Source code updated for new generations of C++ compilers

Version 2.0 (2001)

Author      :  Gianluca Pollastri
Description :  Customized version for SCRATCH
Comments    :  Initial code customized for working on biological sequences

Version 1.0 (1997)

Author      :  Paolo Frasconi
Description :  Initial Generic Version
Comments    :  First version developed to train/test a BRNN

====================================================================================================