Computational Human Endogenous RetroViral Infection Labeler (CHERVIL) is a pipeline for the detection of endogenous retroviral expression patterns that correspond to current or previous viral infection.
This project was developed at the Rocky Mountain Genomics HackCon 2018 by Benjamin Lee (team lead), Jeremy Ash (team lead), Corinne Walsh, and Grant Vagle with support from Ben Busby and Michael Crusoe.
Human endogenous retroviral elements (HERVs) are retroviruses that have integrated themselves into the human germline. Usually, they remain latent in the human genome. However, previous work suggests that some HERVs become actively transcribed upon viral infection.
CHERVIL builds on RetroSpotter, an existing pipeline for HERV expression quantification, and adds a machine learning component that identifies patterns in HERV expression indicative of pre-symptomatic or historic viral infection.
At a high level, there are two major phases of the CHERVIL pipeline.
The first is the calculation of HERV expression in different populations. To do this, we use RetroSpotter and Magic-BLAST to align RNA-seq reads to known HERVs and quantify their expression.
The second phase is the automatic development of a machine learning pipeline that uses expression data to predict disease status. We accomplish this using TPOT to identify HERV expression patterns specific to viral infection.
We have provided a Docker image with our pipeline pre-installed. To download it (assuming you already have Docker installed), run:
$ docker pull benjamindlee/chervil
Alternatively, you can build the image yourself from our Dockerfile:
$ docker build -t chervil .
Before proceeding, ensure that you have the following installed and functional:
Next, clone a copy of the repository:
$ git clone https://github.com/NCBI-Hackathons/chervil.git
and then cd into it:
$ cd chervil
Odds are that you will want to run CHERVIL in a virtual environment. If you don't have virtualenv installed, run:
$ pip install virtualenv
And then to set up your shiny new virtual environment:
$ virtualenv env --python=python3.6
$ source env/bin/activate
Next, to install the Python components, run:
(env) $ pip install -r requirements.txt
- Create blast database with HERV elements

  Users will need to create a FASTA file containing the nucleotide sequence of each HERV element. For convenience, we have included a set of known HERV sequences. The makeblastdb.sh command creates a BLAST database:

  $ makeblastdb.sh reference_genome/her_reference.fasta

  This creates a directory, blastdb, containing a reference database called referencedb.

- Input accession numbers and their classifications

  This should be in the form of a CSV file that looks something like this:

  SRR123456, infected
  SRR789101, infected
  SRR112131, infected
  SRR415161, control
  SRR718192, control
  SRR021222, control

  (note: these are made-up accessions)

- Generate the HERV classification machine learning model

  Assuming you are in a directory with accessions and their classes, run:

  $ chervil.sh [path to SRR accession csv] [path to blast database] [number of cores] [output directory] [prefix for SAM files]

  Example usage:

  $ chervil.sh srr_inf_test.csv ../blast_dbs/referencedb 20 out "test"

  This command calls multiple scripts that execute the pipeline we have developed.

  - Uses the magicblast command to align RNA-seq reads to the reference BLAST database and generates a SAM file for each accession. (S1_make_acc_file.r, run_jobs.sh)
  - Takes the SAM files and counts the number of reads corresponding to each ERV gene. (count_hits.sh)
  - Organizes the counts into a dataframe that includes all of the sample numbers (by SRR accession), their class (infected, not infected, etc.), and their read count for each ERV gene, written to a CSV file. (S2_orgCountsScript.r)
  - Feeds this dataframe into TPOT, an automated machine learning pipeline. The model and an HTML file containing a confusion matrix and performance measures for an external data set are then saved for analysis. (S3_generate_classifier.py)

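The input format and the counting and organizing steps above can be illustrated with a minimal Python sketch. This is not the code in count_hits.sh or S2_orgCountsScript.r. It assumes standard SAM conventions (header lines start with `@`, the third tab-separated column is the reference name, and `*` marks an unaligned read), it counts alignment records rather than applying any filtering the real scripts may do, and the function names are hypothetical.

```python
import csv
from collections import Counter


def read_labels(csv_path):
    """Map each SRR accession to its class from a two-column CSV (hypothetical helper)."""
    labels = {}
    with open(csv_path, newline="") as handle:
        # skipinitialspace tolerates the space after the comma ("SRR123456, infected").
        for row in csv.reader(handle, skipinitialspace=True):
            if len(row) >= 2:
                labels[row[0].strip()] = row[1].strip()
    return labels


def count_hits(sam_lines):
    """Count alignment records per reference sequence in SAM-formatted lines."""
    counts = Counter()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3 or fields[2] == "*":
            continue  # skip malformed records and unaligned reads
        counts[fields[2]] += 1
    return counts


def build_table(labels, counts_by_accession):
    """Return a header plus one row per accession: accession, class, one count per reference."""
    references = sorted({ref for c in counts_by_accession.values() for ref in c})
    header = ["accession", "class"] + references
    rows = [
        [acc, labels[acc]]
        + [counts_by_accession.get(acc, {}).get(ref, 0) for ref in references]
        for acc in labels
    ]
    return header, rows
```

Under these assumptions, calling `count_hits` on each accession's open SAM file and passing the results to `build_table` gives a wide table (one row per sample, one column per HERV) of the kind a classifier can consume. Accessions with no counts get zeros.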
Bug reports should be submitted here.
If you run into any problems while using CHERVIL, feel free to email Benjamin Lee (GitHub) for support.
- PRJNA349748: Human Tracheobronchial Epithelial (HTBE) cells infected with Influenza
  - Data Type: RNA-seq
  - Samples:
    - 10 H1N1, H5N1, and H3N2 infected cells
    - 5 mock-infected controls
2-fold cross-validation accuracy: 0.917
Validation accuracy: 0.75 (3 of 4 validation samples classified correctly)
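Confusion matrix (rows: actual class, columns: predicted class; values taken from the TP, FN, FP, and TN counts in the class table):
Actual \ Predict | infected | not_infected |
infected | 3 | 0 |
not_infected | 1 | 0 |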
95% CI | (0.32565,1.17435) |
Bennett_S | 0.5 |
Chi-Squared | None |
Chi-Squared DF | 1 |
Conditional Entropy | None |
Cramer_V | None |
Cross Entropy | None |
Gwet_AC1 | 0.68 |
Joint Entropy | None |
KL Divergence | None |
Kappa | 0.0 |
Kappa 95% CI | (-1.69741,1.69741) |
Kappa No Prevalence | 0.5 |
Kappa Standard Error | 0.86603 |
Kappa Unbiased | -0.14286 |
Lambda A | None |
Lambda B | None |
Mutual Information | None |
Overall_ACC | 0.75 |
Overall_RACC | 0.75 |
Overall_RACCU | 0.78125 |
PPV_Macro | None |
PPV_Micro | 0.75 |
Phi-Squared | None |
Reference Entropy | 0.81128 |
Response Entropy | None |
Scott_PI | -0.14286 |
Standard Error | 0.21651 |
Strength_Of_Agreement(Altman) | Poor |
Strength_Of_Agreement(Cicchetti) | Poor |
Strength_Of_Agreement(Fleiss) | Poor |
Strength_Of_Agreement(Landis and Koch) | Slight |
TPR_Macro | 0.5 |
TPR_Micro | 0.75 |
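As a sanity check, the headline overall statistics can be recomputed from the four validation counts for the infected class (TP = 3, FN = 0, FP = 1, TN = 0). The sketch below uses the standard definitions of accuracy, chance agreement, Cohen's kappa, and a normal-approximation 95% confidence interval. The function name is hypothetical, and this is not the code that produced the report.

```python
import math


def overall_stats(tp, fn, fp, tn):
    """Accuracy, chance agreement, Cohen's kappa, standard error and 95% CI for a 2x2 matrix."""
    pop = tp + fn + fp + tn
    acc = (tp + tn) / pop
    # Chance agreement: for each class, (number predicted * number actual) / pop^2.
    racc = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / pop ** 2
    # Kappa is undefined when chance agreement is already perfect.
    kappa = (acc - racc) / (1 - racc) if racc != 1 else None
    se = math.sqrt(acc * (1 - acc) / pop)
    ci = (acc - 1.96 * se, acc + 1.96 * se)
    return {"acc": acc, "racc": racc, "kappa": kappa, "se": se, "ci": ci}


stats = overall_stats(tp=3, fn=0, fp=1, tn=0)
```

Here accuracy and chance agreement are both 0.75, so kappa is 0.0: predicting every sample as infected reaches 75% accuracy on this validation set without separating the classes. The upper confidence bound exceeds 1 because the normal approximation is unreliable with only four samples.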
Class | infected | not_infected | Description |
ACC | 0.75 | 0.75 | Accuracy |
BM | 0.0 | 0.0 | Informedness or bookmaker informedness |
DOR | None | None | Diagnostic odds ratio |
ERR | 0.25 | 0.25 | Error rate |
F0.5 | 0.78947 | 0.0 | F0.5 score |
F1 | 0.85714 | 0.0 | F1 score - harmonic mean of precision and sensitivity |
F2 | 0.9375 | 0.0 | F2 score |
FDR | 0.25 | None | False discovery rate |
FN | 0 | 1 | False negative/miss/type 2 error |
FNR | 0.0 | 1.0 | Miss rate or false negative rate |
FOR | None | 0.25 | False omission rate |
FP | 1 | 0 | False positive/type 1 error/false alarm |
FPR | 1.0 | 0.0 | Fall-out or false positive rate |
G | 0.86603 | None | G-measure geometric mean of precision and sensitivity |
LR+ | 1.0 | None | Positive likelihood ratio |
LR- | None | 1.0 | Negative likelihood ratio |
MCC | None | None | Matthews correlation coefficient |
MK | None | None | Markedness |
N | 1 | 3 | Condition negative |
NPV | None | 0.75 | Negative predictive value |
P | 3 | 1 | Condition positive |
POP | 4 | 4 | Population |
PPV | 0.75 | None | Precision or positive predictive value |
PRE | 0.75 | 0.25 | Prevalence |
RACC | 0.75 | 0.0 | Random accuracy |
RACCU | 0.76562 | 0.01562 | Random accuracy unbiased |
TN | 0 | 3 | True negative/correct rejection |
TNR | 0.0 | 1.0 | Specificity or true negative rate |
TON | 0 | 4 | Test outcome negative |
TOP | 4 | 0 | Test outcome positive |
TP | 3 | 0 | True positive/hit |
TPR | 1.0 | 0.0 | Sensitivity, recall, hit rate, or true positive rate |
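The per-class rates in the table follow directly from the TP, FN, FP, and TN rows. A minimal sketch using the standard definitions is shown below. The function name is hypothetical, and a rate is returned as None when its denominator is zero, which mirrors the None entries above.

```python
def class_stats(tp, fn, fp, tn):
    """Per-class rates from one-vs-rest counts; None when the denominator is zero."""

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else None

    return {
        "TPR": ratio(tp, tp + fn),               # sensitivity / recall
        "TNR": ratio(tn, tn + fp),               # specificity
        "PPV": ratio(tp, tp + fp),               # precision
        "NPV": ratio(tn, tn + fn),               # negative predictive value
        "F1": ratio(2 * tp, 2 * tp + fp + fn),   # harmonic mean of PPV and TPR
    }


# Counts copied from the class table above.
infected = class_stats(tp=3, fn=0, fp=1, tn=0)
not_infected = class_stats(tp=0, fn=1, fp=0, tn=3)
```

The not_infected class has a sensitivity of 0.0 and an undefined precision because no validation sample was predicted as not infected.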