The repository includes code and datasets pertaining to methods we developed for identifying true intrahost single nucleotide variants (iSNVs) in the human papillomavirus (HPV).
The code in this repository is written in Python (v3.8). You will need access to a Linux terminal (a Windows 10 machine with the Windows Subsystem for Linux enabled and Ubuntu installed also works) to execute the code. The instructions below work best on Ubuntu 18.04/Ubuntu 20.04. Follow the steps below to install the required packages and run the scripts.
- Clone the repo
git clone https://github.com/NCI-CGR/HPV_low_VAF_SNV_prediction.git
- Install miniconda
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
chmod +x Miniconda3-latest-Linux-x86_64.sh
./Miniconda3-latest-Linux-x86_64.sh
Follow the on-screen instructions to complete installation of miniconda and to initialize conda.
- Create a conda environment
Create a conda environment that includes all the packages necessary for running the repository code. We will use the hpv_triplicate_env.yaml file for this purpose.
conda env create -f hpv_triplicate_env.yaml
- Activate the conda environment
conda activate hpv_triplicate_env
- Install the vcf_parser package from https://github.com/moonso/vcf_parser
- Make sure you are good to go
cd HPV_low_VAF_SNV_prediction # go to the repo directory that you just cloned in Step 1
python scripts/split_multi_allelic_var.py --help
The following help menu should be displayed
Usage: split_multi_allelic_var.py [OPTIONS]
Options:
--vcf_dir TEXT The directory containing the vcf files (with
the headers) [required]
--output_dir TEXT The directory to which the parsed files will
be written [required]
--num_metadata_lines INTEGER The number of meta data lines in the vcf files
(excluding the header line). Meta data lines
begin with "##". Default is 70 for the raw
vcfs generated by our pipeline. However, for
the filtered VCF files generated by VCFgenie
it is 86. [required]
--help Show this message and exit.
The contents of the repository are described below.
- data/ - The data directory includes the raw and processed variant data required for training and testing the machine learning models. Within this directory, the vcf_files/raw_vcf and vcf_files/vcfgenie_vcf subdirectories include, respectively, the raw VCFs generated by our in-house variant calling pipeline and the filtered VCFs produced by post-filtering the raw VCFs with VCFgenie. SNV_data_wo_vcf_genie.csv contains all the raw iSNVs from our in-house pipeline and their features. SNV_data_with_vcf_genie.csv includes the iSNVs extracted from the processed VCF files generated by VCFgenie. The PERCENT_OVERLAP column indicates the replicate frequency of each iSNV (1 => 3/3, 0.67 => 2/3, 0.33 => 1/3), and AF is the VAF of an iSNV. The HPV18_not_padded.fasta file contains the reference genome sequence for HPV18. The files initial_iSNVs_p*VCFgenie_maskID.txt and HPV18REF_*mer_counts.tsv are raw input data used by the VCFgenie-spectra-analyses.R script to analyze VCFgenie results and the molecular spectra of iSNVs.
- scripts/ - The split_multi_allelic_var.py script parses a directory of VCF files, splits multi-allelic positions into separate rows, extracts only the iSNVs and their sequencing features, and writes them to a csv file. Note: to run this script, you must first download and install the vcf_parser package from https://github.com/moonso/vcf_parser . The data/SNV_data_wo_vcf_genie.csv and data/SNV_data_with_vcf_genie.csv files were generated with this script. Check out the command
python scripts/split_multi_allelic_var.py --help
to explore how to use it. Finally, VCFgenie-spectra-analyses.R is an R script, intended to be run manually line by line, that analyzes VCFgenie results and the molecular spectra of iSNVs; its input files are available in the data/ directory.
- machine_learning/ - This directory includes the machine learning code. xgb_model_training_and_performance_new.ipynb is an interactive Jupyter notebook that guides the user step by step through reproducing the study results. It demonstrates how training and testing were performed for the three scenarios: a. machine learning with VAF filters (FM models), b. machine learning with VCFgenie without any low-VAF filters (VM models), and c. machine learning with VCFgenie with low-VAF filters (FVM models). The predict_true_low_vaf_snvs.py script is a standalone script for predicting true/false iSNVs with the pre-trained models (it does not require prior model training); it is covered in more detail in the Usage section of this document. The xgb_hyper_param_tuning_parallelized_new.py script trains models with different hyperparameters (hyperparameter tuning). The merge_hyper_param_perf_data.py script merges the files generated by hyperparameter tuning into a single consolidated output. The get_feature_imp.ipynb notebook can be used to reproduce the feature importance calculated for the optimum parameter/feature combinations and hyperparameters for the VM strategy.
- FM_models/ - Includes trained XGB models for the FM scenario. Only includes models trained for the optimum parameter/feature combination.
- VM_models/ - Includes trained XGB models for the VM scenario. Only includes models trained for the optimum parameter/feature combination.
- FVM_models/ - Includes trained XGB models for the FVM scenario. Only includes models trained for the optimum parameter/feature combination.
- results/ - Includes performance files for models trained in each of the 3 scenarios: FM, VM and FVM. These files are generated by running the Jupyter notebook (machine_learning/xgb_model_training_and_performance_new.ipynb). For each scenario, performance metrics are reported for models trained on every parameter/feature combination. Note that for each combination, metrics are reported for 50 testing datasets generated by random sampling with 50 different seeds. When calculating the overall performance of a combination, we take the median performance across all 50 testing datasets.
- variant_calling_pipeline/ - Includes the Snakemake workflow for calling variants for HPV18
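As a small illustration of the PERCENT_OVERLAP encoding described under data/ above (the helper function is ours, for illustration only; the three-replicate design comes from the study):

```python
def percent_overlap(n_observed: int, n_total: int = 3) -> float:
    """Replicate frequency of an iSNV, as encoded in the PERCENT_OVERLAP column."""
    return round(n_observed / n_total, 2)

# 3/3 replicates -> 1.0, 2/3 -> 0.67, 1/3 -> 0.33
print(percent_overlap(3), percent_overlap(2), percent_overlap(1))  # 1.0 0.67 0.33
```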
Now we will go through some specific examples of how to use this repository.
To reproduce the results included in our study, simply run the Jupyter notebook machine_learning/xgb_model_training_and_performance_new.ipynb. For each scenario (FM, VM and FVM), the notebook will train and test XGBoost models on all possible parameter/feature combinations and write the performance metrics into the results directory. To reduce the execution time, you can skip the sections that save the models for the best performing parameter/feature combination in each scenario (i.e., the section titled "Save the best performing models for each prediction strategy" and all code that follows it). These sections have already been executed and the models saved in their respective directories.
The following is an example of the performance output: a snapshot of the performance results for the FM model as recorded in the performance_FM.csv file. Specifically, I am showing the performance for the optimal parameter/feature combination for FM, obtained by applying the following filters: True_Var_Def = 1 (i.e., 3/3 replicates), False_Var_Def = 0.33 (i.e., 1/3 replicates), VAF_Lower_Limit = 0.01, VAF_Upper_Limit = 0.5, Feature_Category = Strict. Note: only the first 10 rows are shown.
As you will observe, there are 50 rows after applying the above filters; each row reports performance on a separate testing dataset obtained by random sampling with a different seed. The overall performance for this combination is obtained by taking the median of each metric column (and is reported in the results/cumulative_perf_metrics directory).
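That aggregation can be sketched as follows, using the MCC values of the first five seeds shown in the table; the snippet is illustrative and not part of the repository scripts:

```python
import statistics

# Per-seed MCC values for one parameter/feature combination
# (first five seeds of the FM example in the table)
mcc_by_seed = {10: 0.438, 20: 0.467, 3247: 0.219, 24: 0.348, 4501: 0.376}

# Overall performance for the combination = median across per-seed testing datasets
overall_mcc = statistics.median(mcc_by_seed.values())
print(overall_mcc)  # 0.376
```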
Seed | True_Var_Def | False_Var_Def | VAF_Lower_Limit | VAF_Upper_Limit | Feature_Category | Num_True_Var_Testing | Num_False_Var_Testing | TP | FP | TN | FN | AUC | MCC | F1_score | Accuracy | MSE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
10 | [1] | [0.33] | 0.01 | 0.5 | Strict | 68 | 662 | 53 | 105 | 557 | 15 | 0.810401 | 0.438 | 0.469 | 0.836 | 0.164 |
20 | [1] | [0.33] | 0.01 | 0.5 | Strict | 66 | 603 | 44 | 60 | 543 | 22 | 0.783582 | 0.467 | 0.518 | 0.877 | 0.123 |
3247 | [1] | [0.33] | 0.01 | 0.5 | Strict | 205 | 1581 | 74 | 187 | 1394 | 131 | 0.621348 | 0.219 | 0.318 | 0.822 | 0.178 |
24 | [1] | [0.33] | 0.01 | 0.5 | Strict | 98 | 776 | 79 | 220 | 556 | 19 | 0.761309 | 0.348 | 0.398 | 0.727 | 0.273 |
4501 | [1] | [0.33] | 0.01 | 0.5 | Strict | 66 | 717 | 46 | 111 | 606 | 20 | 0.771079 | 0.376 | 0.413 | 0.833 | 0.167 |
25 | [1] | [0.33] | 0.01 | 0.5 | Strict | 76 | 1026 | 42 | 211 | 815 | 34 | 0.673489 | 0.209 | 0.255 | 0.778 | 0.222 |
79 | [1] | [0.33] | 0.01 | 0.5 | Strict | 63 | 748 | 38 | 232 | 516 | 25 | 0.646507 | 0.166 | 0.228 | 0.683 | 0.317 |
299 | [1] | [0.33] | 0.01 | 0.5 | Strict | 43 | 667 | 16 | 313 | 354 | 27 | 0.451414 | -0.046 | 0.086 | 0.521 | 0.479 |
1001 | [1] | [0.33] | 0.01 | 0.5 | Strict | 156 | 765 | 86 | 163 | 602 | 70 | 0.669105 | 0.286 | 0.425 | 0.747 | 0.253 |
287 | [1] | [0.33] | 0.01 | 0.5 | Strict | 33 | 157 | 33 | 35 | 122 | 0 | 0.888535 | 0.614 | 0.653 | 0.816 | 0.184 |
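As a sanity check, the MCC, F1_score and Accuracy columns can be reproduced from the TP/FP/TN/FN counts with the standard formulas; here for the seed-10 row above:

```python
import math

TP, FP, TN, FN = 53, 105, 557, 15  # seed 10 row of the table

accuracy = (TP + TN) / (TP + FP + TN + FN)
f1_score = 2 * TP / (2 * TP + FP + FN)
mcc = (TP * TN - FP * FN) / math.sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

print(round(accuracy, 3), round(f1_score, 3), round(mcc, 3))  # 0.836 0.469 0.438
```

(In this table, MSE equals 1 − Accuracy, i.e. the misclassification rate, since the labels are binary.)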
In this example, we will learn how to predict true/false iSNVs using the raw VCF files generated by our in-house pipeline. We will use the raw vcf files provided in the data/vcf_files/raw_vcf directory for this purpose.
As the first step, you will need to process the raw vcf files and convert them into a csv file using the scripts/split_multi_allelic_var.py script. Make sure your current directory is HPV_low_VAF_SNV_prediction and that you have activated the conda environment as described in Step 4 of the Installation section. Then use the following commands.
mkdir example2
cd example2
python ../scripts/split_multi_allelic_var.py --vcf_dir ../data/vcf_files/raw_vcf/ --output_dir raw_vcf_parsed --num_metadata_lines 70
You should see the following output (I have truncated the output as it was long!)
Processing file ../data/vcf_files/raw_vcf/sample10_D4.tvc.vcf (1/93)...Done!
Processing file ../data/vcf_files/raw_vcf/sample10_D5.tvc.vcf (2/93)...Done!
Processing file ../data/vcf_files/raw_vcf/sample10_D6.tvc.vcf (3/93)...Done!
Processing file ../data/vcf_files/raw_vcf/sample11_H1.tvc.vcf (4/93)...Done!
Processing file ../data/vcf_files/raw_vcf/sample11_H2.tvc.vcf (5/93)...Done!
...
...
...
Processing file ../data/vcf_files/raw_vcf/sample9_H7.tvc.vcf (91/93)...Done!
Processing file ../data/vcf_files/raw_vcf/sample9_H8.tvc.vcf (92/93)...Done!
Processing file ../data/vcf_files/raw_vcf/sample9_H9.tvc.vcf (93/93)...Done!
Splitting MNPs...Done!
Extracting only SNVs...Done!
Total SNVs = 9225. Wrote SNVs to raw_vcf_parsed/snvs.csv
The total number of iSNVs identified is 9,225, and the iSNVs and their features have been written to the file raw_vcf_parsed/snvs.csv
Each row in raw_vcf_parsed/snvs.csv is an iSNV. We will now treat the raw_vcf_parsed/snvs.csv file as a hypothetical independent test dataset on which our goal is to predict true and false low-VAF SNVs. We will achieve this using the machine_learning/predict_true_low_vaf_snvs.py script. Check out its usage as follows.
python ../machine_learning/predict_true_low_vaf_snvs.py --help
Usage: predict_true_low_vaf_snvs.py [OPTIONS]
Options:
--snv_file TEXT The csv file generated by the
split_multi_allelic_var.py script. For VM and FVM
models, remember to filter the vcf files with
VCFgenie before running the
split_multi_allelic_var.py script. [required]
--model_type [VM|FM|FVM] The model type to be used for making the
prediction [default: VM; required]
--hpv_ref_file TEXT The reference genome file for the HPV type. File
must contain the unpadded genome sequence in fasta
format (i.e., genome sequence should not include a
duplication of bases in the start region.)
[required]
--outfile TEXT Name of the output file that will contain the
prediction results [required]
--help Show this message and exit.
We will make predictions using the FM model. Run the following command and you will see the output that follows the command.
python ../machine_learning/predict_true_low_vaf_snvs.py --snv_file raw_vcf_parsed/snvs.csv --model_type FM --hpv_ref_file ../data/HPV18_not_padded.fasta --outfile FM_predictions.csv
Extracting SNVs...Done!
Calculating features ...Done!
Running predictions...Done!
Wrote predictions to FM_predictions.csv
The prediction output is written to FM_predictions.csv . As you will observe, the script only made predictions for 4,590/9,225 iSNVs. That is because predictions are only made for iSNVs with a VAF of 1% - 60% (the optimal VAF range for FM models), a minimum FDP of 100 and a minimum FAO of 1. The Predicted_SNV_Type column is a binary column that contains the prediction: TRUE => predicted true iSNV, FALSE => predicted false iSNV. The True_SNV_Probability column gives the probability that the iSNV is true; a higher value means a higher likelihood that the iSNV is a true iSNV. iSNVs with probabilities < 0.5 are labelled as false iSNVs.
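The decision rule described above can be sketched as follows (the function name is ours; the 0.5 threshold and the column semantics come from the script's output):

```python
def predicted_snv_type(true_snv_probability: float) -> bool:
    """Mimics the Predicted_SNV_Type call: probabilities below 0.5 are labelled false."""
    return true_snv_probability >= 0.5

print(predicted_snv_type(0.73), predicted_snv_type(0.41))  # True False
```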
In this example, we will learn how to predict true/false iSNVs using the filtered VCF files generated by VCFgenie. We will use the raw vcf files provided in the data/vcf_files/raw_vcf directory for this purpose.
The first step is to run VCFgenie on the raw vcf files. VCFgenie requires the user to provide a p value threshold as a runtime argument. The p value in this scenario is the Bonferroni-corrected p value, calculated by dividing 0.05 by the total number of iSNVs (in our case 9,225, as calculated by the split_multi_allelic_var.py script). The corrected p value is therefore 0.000005420054201. We will also need the platform error rate; for Ion Torrent it is 0.00012936. We will now use the following commands to install VCFgenie, execute it and generate the filtered VCF files. Before running these commands, make sure that you are in the HPV_low_VAF_SNV_prediction directory.
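The corrected p value is a single division; as a quick sanity check:

```python
alpha = 0.05
n_isnvs = 9225  # total iSNVs reported by split_multi_allelic_var.py

p_cutoff = alpha / n_isnvs  # Bonferroni-corrected p value threshold
print(f"{p_cutoff:.15f}")  # 0.000005420054201
```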
mkdir example3
cd example3
git clone https://github.com/chasewnelson/VCFgenie.git
# Copy the raw vcf files to the current dir
cp -rf ../data/vcf_files/raw_vcf/ .
cd raw_vcf
python ../VCFgenie/VCFgenie.py --error_rate=0.00012936 --p_cutoff=0.000005420054201 --AC_key=FAO --AC_key_new=FAO --AF_key=AF --AF_key_new=AF --DP_key=FDP --overwrite_INFO --VCF_files sample*.vcf > vcfgenie.out
You will notice that once VCFgenie has finished running, a directory VCFgenie_output is created that contains the filtered/processed vcf files. The log output from VCFgenie can be found in the file vcfgenie.out.
Next, we will compile a csv file containing only the iSNVs from the filtered vcf files. As in Example 2, we will use the split_multi_allelic_var.py script for this operation.
# Go back to the parent directory
cd ..
python ../scripts/split_multi_allelic_var.py --vcf_dir ./raw_vcf/VCFgenie_output --output_dir VCFgenie_SNVs --num_metadata_lines 86
Note that we changed num_metadata_lines to 86 (compared with 70 in Example 2), as the VCFgenie-processed VCF files contain 16 additional metadata lines.
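Instead of hard-coding 70 or 86, you can determine the right value by counting the leading "##" lines of one of the VCF files; a small helper for this (ours, not part of the repository):

```python
def count_metadata_lines(vcf_path: str) -> int:
    """Count leading '##' meta-data lines (excludes the '#CHROM' header line)."""
    count = 0
    with open(vcf_path) as handle:
        for line in handle:
            if line.startswith("##"):
                count += 1
            else:
                break
    return count
```

For the VCFgenie-processed files here, this should return 86.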
You should see the following output (I have truncated the output as it was long!)
Processing file VCFgenie_output/sample10_D4.tvc_filtered.vcf (1/93)...Done!
Processing file VCFgenie_output/sample10_D5.tvc_filtered.vcf (2/93)...Done!
Processing file VCFgenie_output/sample10_D6.tvc_filtered.vcf (3/93)...Done!
Processing file VCFgenie_output/sample11_H1.tvc_filtered.vcf (4/93)...Done!
Processing file VCFgenie_output/sample11_H2.tvc_filtered.vcf (5/93)...Done!
Processing file VCFgenie_output/sample11_H3.tvc_filtered.vcf (6/93)...Done!
...
...
...
Processing file VCFgenie_output/sample8_G7.tvc_filtered.vcf (88/93)...Done!
Processing file VCFgenie_output/sample8_G8.tvc_filtered.vcf (89/93)...Done!
Processing file VCFgenie_output/sample8_G9.tvc_filtered.vcf (90/93)...Done!
Processing file VCFgenie_output/sample9_H7.tvc_filtered.vcf (91/93)...Done!
Processing file VCFgenie_output/sample9_H8.tvc_filtered.vcf (92/93)...Done!
Processing file VCFgenie_output/sample9_H9.tvc_filtered.vcf (93/93)...Done!
Splitting MNPs...Done!
Extracting only SNVs...Done!
Total SNVs = 9225. Wrote SNVs to VCFgenie_SNVs/snvs.csv
Finally, we will predict true iSNVs with the VM models. Run the following command and observe the output below.
python ../machine_learning/predict_true_low_vaf_snvs.py --snv_file VCFgenie_SNVs/snvs.csv --model_type VM --hpv_ref_file ../data/HPV18_not_padded.fasta --outfile VM_predictions.csv
Extracting SNVs...Done!
Calculating features ...Done!
Running predictions...Done!
Wrote predictions to VM_predictions.csv
Distributed under the MIT License. See LICENSE for more information.
Sambit Mishra - sambit.mishra@nih.gov
Project Link: https://github.com/NCI-CGR/HPV_low_VAF_SNV_prediction
This work is an outcome of a collaborative effort between researchers at the Cancer Genomics Research Laboratory (CGR), Frederick National Laboratory, and the principal investigators and postdoctoral fellows at the Division of Cancer Epidemiology and Genetics (DCEG). I would like to gratefully acknowledge the scientific efforts and research guidance of Dr. Lisa Mirabello, Dr. Meredith Yeager, Dr. Laurie Burdette, Dr. Michael Dean, Dr. Bin Zhu, Dr. Chase Nelson, Dr. Hyo Jung Lee and Dr. Maisa Pinheiro, without which this work would not have been possible.