====================================================================================================
One-Dimensional Bidirectional Recurrent Neural Network (1D-BRNN)
Method Description & Project Documentation
====================================================================================================

Author(s) : Christophe Magnan (cmagnan@ics.uci.edu)
Copyright : Institute for Genomics and Bioinformatics
            University of California, Irvine
Modified  : 2015/07/01

====================================================================================================
Method Description
====================================================================================================

1D-BRNN (One-Dimensional Bidirectional Recurrent Neural Network) is a recurrent neural network
whose structure is specifically designed for machine learning problems where the examples of
interest are naturally organized in sequences and where the class to predict for each position
in the sequence is likely to also depend on the adjacent positions in the sequence.

A BRNN is a set of three neural networks with shared and specific properties, respectively
called forward network (or left context), backward network (or right context) and output network
(or main network), and referred to below as FWDnet, BWDnet and MAINnet. Each network has exactly
one hidden layer and takes as input, for each position t in a given sequence, the features
associated with position t plus a set of additional features described below. The number of
nodes in the hidden layers and in the output layers of FWDnet and BWDnet are options of the
model to set when training a new model. The number of nodes in the output layer of MAINnet is
the number of target classes for the prediction problem.
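
To make the three-network structure concrete, here is a minimal C++ sketch of one prediction
pass over a sequence. It is only an illustration under stated assumptions, not the code of this
package: the names (SketchNet, dense, contextStep, predictSketch) are hypothetical, biases are
omitted, a single symmetric window half-width is used for both context networks, window
positions falling outside the sequence are padded with zeros, and the order in which inputs are
concatenated is arbitrary. The recurrences and activations follow the description given in this
section (tanh for inner layers, normalized exponential for the outputs of MAINnet).

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;  // Mat[i] = weights of output node i (biases omitted for brevity)

// One dense layer without bias: y = W x.
Vec dense(const Mat& w, const Vec& x) {
    Vec y(w.size(), 0.0);
    for (std::size_t i = 0; i < w.size(); ++i) {
        assert(w[i].size() == x.size());
        for (std::size_t j = 0; j < x.size(); ++j) y[i] += w[i][j] * x[j];
    }
    return y;
}

Vec tanhAll(Vec v) {
    for (double& a : v) a = std::tanh(a);
    return v;
}

// Normalized exponential (softmax), shifted by the maximum for numerical stability.
Vec softmax(Vec v) {
    assert(!v.empty());
    const double m = *std::max_element(v.begin(), v.end());
    double sum = 0.0;
    for (double& a : v) { a = std::exp(a - m); sum += a; }
    for (double& a : v) a /= sum;
    return v;
}

Vec concat(Vec a, const Vec& b) {
    a.insert(a.end(), b.begin(), b.end());
    return a;
}

// Hypothetical network with exactly one hidden layer.
struct SketchNet { Mat hidden, output; };

// FWDnet/BWDnet step: input = features of position t + own outputs at the previous step.
// Both layers are inner layers of the complete network, hence tanh on both.
Vec contextStep(const SketchNet& net, const Vec& features, const Vec& previous) {
    return tanhAll(dense(net.output, tanhAll(dense(net.hidden, concat(features, previous)))));
}

// Returns probs[t][c]. `half` is the assumed number of window positions on each side of t.
std::vector<Vec> predictSketch(const SketchNet& fwd, const SketchNet& bwd,
                               const SketchNet& main_net, const std::vector<Vec>& x,
                               std::size_t half) {
    const std::size_t n = x.size();
    const std::size_t kf = fwd.output.size(), kb = bwd.output.size();
    std::vector<Vec> f(n), b(n), probs(n);
    Vec state(kf, 0.0);  // left context is 0 for the first position
    for (std::size_t t = 0; t < n; ++t) state = f[t] = contextStep(fwd, x[t], state);
    state.assign(kb, 0.0);  // right context is 0 for the last position
    for (std::size_t t = n; t-- > 0;) state = b[t] = contextStep(bwd, x[t], state);
    for (std::size_t t = 0; t < n; ++t) {
        Vec in = x[t];
        for (std::size_t k = 0; k <= 2 * half; ++k) {  // window offsets -half .. +half
            const bool inside = t + k >= half && t + k - half < n;
            in = concat(in, inside ? f[t + k - half] : Vec(kf, 0.0));
            in = concat(in, inside ? b[t + k - half] : Vec(kb, 0.0));
        }
        probs[t] = softmax(dense(main_net.output, tanhAll(dense(main_net.hidden, in))));
    }
    return probs;
}
```

In this sketch the hidden layer of MAINnet must accept F + (2*half + 1) * (kf + kb) inputs,
where F is the number of features per position and kf, kb are the numbers of output nodes of
FWDnet and BWDnet. Each row of the result sums to 1 and can be read as class probabilities.
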

Specificities of each network are described below:

- FWDnet takes as additional inputs for a position t in a sequence its own outputs for the
  position t-1 (left context) of the same sequence (0 for the first position). The sequence is
  thus propagated forward through the network.

- BWDnet takes as additional inputs for a position t in a sequence its own outputs for the
  position t+1 (right context) of the same sequence (0 for the last position). The sequence is
  thus propagated backward through the network.

- MAINnet takes as additional inputs for a position t in a sequence all the outputs of FWDnet
  and BWDnet for the position t and for a given number of adjacent positions in the sequence
  (window centered on position t). The size of the window is one of the model parameters.

Note that the output layers of FWDnet and BWDnet are actually inner layers of the complete
network and are thus processed as such in the algorithms.

The BRNNs are trained using the back-propagation method, with a tanh sigmoid activation for all
the nodes in the inner layers and a normalized exponential (softmax) for the outputs of MAINnet.
The outputs of the BRNN (outputs of MAINnet) for a position t in a sequence are thus the
predicted probabilities of each possible target class for the prediction problem.

====================================================================================================
Project Documentation
====================================================================================================

This section describes the software and how to use the different programs available.

========================================== Source Code ===========================================

The source code of the project is located in the src folder of the package. A brief overview of
the different source files located in this folder is given below.

File                        Content

makefile                    Compiles the source code and generates the four binaries
Import.h                    Import the necessary c/c++ libraries
Class Options               Options to train or retrain a BRNN model
Class Sequence              Sequence data and model predictions for the sequence
Class Dataset               Dataset of sequences in input of the program
Class Layer                 Single layer of a neural network
Class Network               Single neural network with one hidden layer
Class Model                 BRNN model with three neural networks
Train_New_Model.cpp         Train a new BRNN model on a dataset of sequences
Train_Existing_Model.cpp    Retrain an existing BRNN model on a dataset of sequences
Predict_Single_Model.cpp    Predictions of a single BRNN model on a dataset of sequences
Predict_Multi_Models.cpp    Predictions of several BRNN models on a dataset of sequences

======================================= Software Binaries ========================================

Four different programs/binaries are available in the bin folder of the project:

1) train_model

   Description : Train a new BRNN model on a dataset of sequences
   Usage       : ./train_model options_file train_dataset test_dataset output_model

2) retrain_model

   Description : Retrain an existing BRNN model on a dataset of sequences, may also be used to
               : restart training an existing BRNN model when more training periods are needed
   Usage       : ./retrain_model options_file train_dataset test_dataset input_model output_model

3) predict_single

   Description : Predictions of a single BRNN model on a dataset of sequences
   Usage       : ./predict_single dataset model predictions

4) predict_multi

   Description : Predictions of several BRNN models on a dataset of sequences
   Usage       : ./predict_multi dataset models_list predictions

The arguments are always file names and can be given in either absolute or relative format. The
content of these files is described below:

a) options_file

   Configuration file for training or retraining a BRNN model.
   The file format and the options are detailed in the section "Training & Retraining Options"
   of this documentation.

b) train_dataset, test_dataset, dataset

   Input datasets of the programs. Note that a test dataset must be provided to train or
   retrain a BRNN model. The training procedure stops once the maximum number of training
   epochs given in the options file has been performed; it does not check for possible
   overfitting. The successively trained models are tested on both datasets and the results are
   displayed on screen after each training period, so that users can easily detect overfitting
   and decide to stop the training procedure manually. The last trained model is always written
   to the output file of the program. The file format of the datasets is described in the
   section "Datasets" of this documentation.

c) input_model, output_model, model, models_list

   BRNN model(s) in input or output of the programs. The file format used for the models is not
   detailed in this documentation and can be found directly in the source code. "models_list"
   is a special case: it is not a file containing a model but a file containing a list of files
   containing BRNN models. The first line must give the number of models in the list and each
   line after the first must give the path/file name of a BRNN model.

d) predictions

   Predictions of a single BRNN model or combined predictions of several BRNN models on a
   dataset of sequences. Predictions for the sequences are written in the same order as the
   sequences in the dataset, with the following format: the first line gives the length of the
   sequence and each line afterwards (until the next sequence) gives the model predictions for
   the next position in the sequence, starting with the first position.
   An example for one position of a sequence for a 3-class prediction problem is given below:

      in 2 out 0 pb 0.45 0.15 0.40

   In this example, the target class of the position was 2, the predicted class of the position
   is 0 and the predicted probabilities of the classes 0, 1 and 2 are respectively 0.45, 0.15
   and 0.40.

================================== Training & Retraining Options =================================

Options to train or retrain a BRNN are divided in two sets:

- options to configure the model
- options to train the model

The first set of options is required to train a new model on a dataset. These options correspond
to the structure of the BRNN (number of inputs, outputs, hidden nodes, etc). It is not necessary
to give these options to retrain an existing model, since these parameters are written in the
file containing the model to retrain. A description of these options is given below.

FEATURES    : Number of data features in input of the BRNN
CLASSES     : Number of target classes in output of the BRNN
HIDDEN      : Number of hidden nodes in the main network of the BRNN
CONTEXT_FWD : Number of adjacent positions whose FWDnet outputs are additional inputs of MAINnet
CONTEXT_BWD : Number of adjacent positions whose BWDnet outputs are additional inputs of MAINnet
OUTPUTS_FWD : Number of output nodes in the forward network of the BRNN
OUTPUTS_BWD : Number of output nodes in the backward network of the BRNN
HIDDEN_FWD  : Number of hidden nodes in the forward network of the BRNN
HIDDEN_BWD  : Number of hidden nodes in the backward network of the BRNN

The second set of options is required to train a new model on a dataset or to retrain an
existing model using a new dataset. These options correspond to the configuration of the
learning procedure itself. A description of these options is given below.

LEARN_RATE  : Learning rate - controls the weights update during the maximization step
NUM_EPOCHS  : Number of training periods to run with or without the adaptive procedure
NUM_BATCHS  : Number of model updates per training period (dataset divided in batches)
ADAP_EPOCHS : Number of periods without improvement before decreasing the learning rate
ADAP_RELOAD : Reload the last model saved when the adaptive procedure starts
SHUFFLE     : Shuffle the training dataset before each training period
SEED        : Seed for randomization functions (0 to generate it automatically)

The options must be written in a single file and the option value must be space-separated from
the option name. An example is given in the doc folder of the project.

============================================ Datasets ============================================

Datasets of sequences must be written in a single file. The first line must be as follows:

   num_sequences num_features num_classes

   num_sequences = total number of sequences in the dataset/file
   num_features  = number of input data features for each position in the sequences
   num_classes   = number of possible target classes

The next lines are the sequences in the dataset, written one after the other with the following
format:

   sequence_length                                     // Number of positions in the sequence
   class_p1 feature1_p1 feature2_p1 ... featureK_p1    // Class and features for position 1
   class_p2 feature1_p2 feature2_p2 ... featureK_p2    // Class and features for position 2
   ...
   class_pn feature1_pn feature2_pn ... featureK_pn    // Class and features for position n

The classes must be named using consecutive integers starting from 0. For instance, if there are
3 possible classes in the prediction problem, they must be respectively noted 0, 1 and 2.

The class must be provided for all the positions of all the sequences in a dataset. To get the
predictions of a BRNN model on a sequence where the classes are not known, fake classes must be
provided in the dataset.
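
As an illustration of the file layout described in this section, the following minimal C++
sketch reads a dataset into memory and checks the class labels. It is not the Dataset class of
this package: the names SketchSequence, SketchDataset, readDatasetSketch and
readDatasetFromString are hypothetical, and the sketch assumes whitespace-separated values with
no "//" annotations in the actual file.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical containers for this sketch (not the package's Sequence/Dataset classes).
struct SketchSequence {
    std::vector<int> classes;                   // target class of each position
    std::vector<std::vector<double>> features;  // feature vector of each position
};

struct SketchDataset {
    int num_features = 0;
    int num_classes = 0;
    std::vector<SketchSequence> sequences;
};

// Reads the header line, then num_sequences blocks made of a length line followed by one
// "class feature1 ... featureK" line per position.
SketchDataset readDatasetSketch(std::istream& in) {
    SketchDataset data;
    int num_sequences = 0;
    if (!(in >> num_sequences >> data.num_features >> data.num_classes) ||
        num_sequences < 0 || data.num_features < 0 || data.num_classes < 1)
        throw std::runtime_error("invalid header line");
    for (int s = 0; s < num_sequences; ++s) {
        int length = 0;
        if (!(in >> length) || length < 0)
            throw std::runtime_error("invalid sequence length");
        SketchSequence seq;
        for (int p = 0; p < length; ++p) {
            int cls = 0;
            // Classes are consecutive integers starting from 0.
            if (!(in >> cls) || cls < 0 || cls >= data.num_classes)
                throw std::runtime_error("invalid class label");
            std::vector<double> feats(static_cast<std::size_t>(data.num_features));
            for (double& f : feats)
                if (!(in >> f)) throw std::runtime_error("missing feature value");
            seq.classes.push_back(cls);
            seq.features.push_back(feats);
        }
        assert(seq.classes.size() == seq.features.size());
        data.sequences.push_back(seq);
    }
    return data;
}

// Convenience wrapper for in-memory text, e.g. for quick checks.
SketchDataset readDatasetFromString(const std::string& text) {
    std::istringstream in(text);
    return readDatasetSketch(in);
}
```

For a file on disk, the same function can be called with a std::ifstream. A label outside
0 .. num_classes-1 makes the sketch throw, which is one possible way to catch malformed input
before running the programs.
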

When using the programs "predict_single" or "predict_multi", the classes given in input are
ignored but are reported in the output file together with the predictions.

Features are recommended to be normalized in [0,1]. An example of a dataset is given in the
'doc' folder of the project.

====================================================================================================
Release Notes
====================================================================================================

Version 3.3 (2015)

   Author      : Christophe Magnan
   Description : Minor revision
   Comments    : Repackaged for SCRATCH-1D release 1.1

Version 3.2 (2013)

   Author      : Christophe Magnan
   Description : Bug fixes for version 3.1
   Comments    : Issue with the example dataset provided in the package corrected

Version 3.1 (2012)

   Author      : Christophe Magnan
   Description : Bug fixes for version 3.0
   Comments    : Improved compatibility with the models generated by versions < 3.0

Version 3.0 (2011)

   Author      : Christophe Magnan
   Description : New generic version
   Comments    : Source code entirely rewritten to fix the following issues:
                 - code incompatible with new c++ compilers
                 - large part of the code unused
                 - memory usage not optimized
                 - small errors in the algorithm
                 - no checks performed on the inputs or options

Version 2.1 (2003)

   Author      : Jianlin Cheng
   Description : New custom version for SCRATCH
   Comments    : Source code updated for new generations of c++ compilers

Version 2.0 (2001)

   Author      : Gianluca Pollastri
   Description : Customized version for SCRATCH
   Comments    : Initial code customized for working on biological sequences

Version 1.0 (1997)

   Author      : Paolo Frasconi
   Description : Initial Generic Version
   Comments    : First version developed to train/test a BRNN

====================================================================================================