This repository contains files detailing the process of fitting the model of the enrichment array experiment described in Holmes et al. (2020). These files are intended to enable independent reproduction, exploration and extension of the analysis reported in the paper.
We have used the Jupyter
notebook environment to facilitate literate programming, and to encourage exploration of and experimentation with the code and model. These notebooks have sections of explanatory text that are punctuated by code snippets. In the Jupyter
notebook environment, all of these code snippets are editable and runnable.
data/
: directory containing the raw microarray data, and the genomic data used in the analysismodels/
: directory containingStan
models in plan text formatmultiplexing/
: directory containing scripts used to generate multiplexed data for k-fold cross-validation, and to fit the cross-validation modelsnotebooks/
: directory containingJupyter
notebooks describing and enabling reproduction of the data QA, model fitting and model validationrequirements.txt
: file describing the Python dependencies of the notebooks and scripts, which can be used to create a virtual environment for replication of the analysis from the paper.LICENCE
: a copy of the MIT licence that governs the code contained in this repositoryREADME.md
: this file
Please raise any issues at the GitHub issues page for this repository, by following the link below:
We have provided three ways to work with these analysis notebooks:
You do not need to start a MyBinder
instance, or start your own local Jupyter
instance, to read these notebooks. We provide static versions of the Jupyter
analysis notebooks at the following links:
- Notebook 01: Data QA
- Notebook 02: Full Model Fit
- Notebook 03: Crossvalidation
- Notebook 04: etpD Complementation
- Notebook 05: etpC/etpD fluorescence
To encourage exploration and reproduction, we have tried to make these notebooks compatible, so far as is possible, with MyBinder
, to enable you to run them in the cloud without having to install software on your own machine. To use these notebooks, click on this link, or the button below.
To reproduce our work on your own machine, we recommend using a conda
environment to ensure compatibility of dependencies and to replicate the environment used for the analysis. This environment separates installation of Python
packages from your system's Python
installation (if there is one), enabling the running of these analyses without interfering with any previously-installed setup.
NOTE: You will need to have installed conda
[*] for your system.
$ conda create -n holmes_et_al_2020 python=3.6
$ source activate holmes_et_al_2020
$ conda install --yes --file requirements.txt
From the top level directory of this repository, start the Jupyter
notebook server by issuing the command:
jupyter notebook
A new browser window or tab should open, containing the Jupyter
homepage, which will show a listing of files and directories in the top level of the repository.
Read more
jupyter notebook
: Quick-start guidejupyter notebook
: Tutorial
From the Jupyter
homepage in your browser window, click on the link to the notebooks/
subdirectory. Then click on the link for the notebook you wish to use. The selected notebook should then open in a new browser tab.
When they were committed to the repository, the notebooks contained output from the original runs, so they can be read and understood without needing to rerun the models. If you would like to rerun/reproduce/modify these outputs, we recommend restarting the kernel and clearing all output before beginning. This can be done by clicking on Kernel -> Restart & Clear Output
, in the notebook window.
To replicate the manuscript model from scratch: start the virtual environment (in MyBinder
, or on your own machine), then run the notebooks and scripts as described below (remembering to use Kernel -> Restart & Clear Output
in each notebook to remove the original/existing outputs):
01-data_qa.ipynb
: this will take the input data from thenotebooks/data/
directory, and process it into the output files:notebooks/datasets/normalised_array_data.tab
: used for the full model fit, and to produce the multiplexed output datasets.reduced_probe_data.tab
: a subset ofnotebooks/datasets/normalised_array_data.tab
, used for testing codereduced_locus_data.tab
: a subset ofnotebooks/datasets/normalised_array_data.tab
, used for testing code
02-full_model_fit.ipynb
: this will fit the Stan model described in the notebook to thenotebooks/datasets/normalised_array_data.tab
processed data file, and conduct analyses to produce the final set of genes for which the estimated effect on enrichment due to passage (treatment) was positive, and render the figures used in the paper.
NOTE: the complete fit takes between 5 and 9 hours on my laptop (2013 MacBook Pro, 2.8GHz i7 16GB RAM).
The crossvalidation dataset construction and model fit were conducted in the multiplexing
directory, using Python
scripts rather than Jupyter
notebooks. To reproduce the dataset construction and fits, first change directory to multiplexing
:
cd multiplexing
then build the input datasets with the multiplex_data.py
script:
./multiplex_data.py -v -d ../notebooks/datasets/normalised_array_data.tab \
-k 10 -o 10-fold_CV --seed 123456789 \
-l 10-fold_CV_multiplex.log
This will create a new directory called 10-fold_CV
, containing one new subdirectory for each training/test split of the input dataset.
Next, use the run_multiplex_models.py
script to fit the Stan model to each of the multiplexed training/test sets.
./run_multiplex_models.py -v -i 10-fold_CV --script ./run_model.py \
--seed 123456789 \
-l 10-fold_CV_run_models.log
NOTE: the run_multiplex_models.py
has a dependency on the pysge
module for submission of jobs to our local cluster. This is not included in the requirements.txt
file, so the script will fail at this point. The command-lines that this script produces in the log file can, however, be copied for execution on any system available to you. If you happen to be running on a cluster with SGE scheduling, then installation of pysge
in the virtual environment will enable use of the cluster to fit the multiplexed models.
Finally, use the join_multiplexed_data.py
script to combine prediction output from each of the 10 test sets into a single .tab
file. This will contain predictions for each of the probes from the input dataset, using the model fit to the remaining training data.
./join_multiplexed_data.py -v -i 10-fold_CV -o 10-fold_CV.tab \
-l 10-fold_CV_join_data.log
The combined data produced in this way can then be used as input for the notebook 03-model_validation.ipynb
.
03-model_validation.ipynb
: this will conduct analyses on the combined output from 10-fold crossvalidation on the input dataset innormalised_array_data.tab
. These analyses estimate the ability of the model to predict unseen 'output' array intensities by training it on 90% of the data at any one time, and testing it on the remaining 10% of the dataset.
All random processes in the model building and fitting can take a seed value for the pseudorandom number generator. For replication of the values in the paper, this seed should be set to 123456789
for all processes:
- the seed used for the main Stan fit (in notebook
02-full_model_fit
) - the seed for splitting the input data into multiplexed sets (
multiplex_data.py
) - the seed used for the multiplexing Stan fits (
run_multiplex_models.py
) - the seed used for fitting the etpD complementation data (in notebook
04-etpD
)