-
Notifications
You must be signed in to change notification settings - Fork 31
EpiNano wiki
Welcome to the EpiNano wiki!
The commands to use Epi1.2 to detect RNA base modifications with and without using machine learing can be found in test_data/make_predictions/run.sh and test_data/train_models/train_test.sh.
The commands for reads mapping are the same as shown for previous Epinano versions. But we encourage users to try out different mapping parameters and alternative mappers, for instance, using graphmap2 instead of minimap2.
Compared to previouse versions, we also intrduce new metrics, such as delta-errors and sum_errors. The former one capture the difference between paired modified and unmodified samples, while the later one combine mismatches ande indels (and transformed quality score as you wish). These two metrics are composed to overcom (supposedly but need extensive tests) biases in terms of types of errors introduced by different RNA modifications. We provided miscellaneous scripts Epinano_make_delta.py and Epinano_sumErr.py to compute them. We showcase the usage of these new metrics with pU modificaitons.
- Base-Calling
guppy_basecaller --device cuda:0 -c rna_r9.4.1_70bps_hac.cfg --compress_fastq -i path/to/fast5_directory -r -s /path/to/save/basecalling_out --fast5_out
--device will be swithed on or off dependent on gpu availability!
- Reads Mapping
# map to transcriptome
minimap2 --MD -t 6 -ax map-ont ref.fasta sample.reads.fastq | samtools view -hbS -F 3860 - | samtools sort -@ 6 - sample.reads
# map to genome
minimap2 --MD -t 6 -ax splice -k14 -uf ref.fasta sample.reads.fastq | samtools view -hbS -F 3844 - | samtools sort -@ 6 - sample.reads
- generate errors feature table
python $EPINANO_HOME/Epinano_Variants.py -n 4 -R ref.fasta -b sample.reads.bam -s $EPINANO_HOME/misc/sam2tsv.jar --type t
By default, --type t: bam file was derived from mapping reads to transcriptome.
The output will be sample.plus_strand.per.site.csv
Alternatively, --type g: bam file was produced by (splicing) mapping reads to genome; The output consists of two files: sample.plus_strand.per.site.csv and sample.minus_strand.per.site.csv
The results can be re-organized in per windows/kmers format.
python misc/Slide_Variants.py test_data/make_predictions/ko.plus_strand.per.site.csv 5
The output xmer.csv file contains the following columns:
- Kmer: kmer (K consecutive bases)
- Window: positions in the reference where the kmer was extracted
- Ref: Reference sequenc ID
- Strand: to which strand the reads were mapped and features were extracted and computed
- Coverage: how many reads' bases were mapped at the K consecutive positions
- q[i]: i=[123...], mean quality score of all reads' bases that are mapped to the same reference position and corresponding to the the ith position in the kmer
- mis[i]: similar to q[i] but the metric is mismatch frequency
- ins[i]: similar to q[i] but the metric is insertion fequency
- del[i]: similar to q[i] but the metric is deletion fequency
- extract electrical signals
sh $EPINANO_HOME/Epinano_Current.sh -b sample.reads.bam -r sample.reads.fastq -f reference.fasta -t 6 -m g -d fast5_folder/
The output is organized on single-reference-sequence basis and can be combined using $EPINANO_HOME/misc/concat_events.py, and can also be re-organized in kmer basis with $EPINANO_HOME/misc/Slide_Intensity.py. Last but not least, the per kmer electric current information can be combined with per kmer error features using $EPINANO_HOME/misc/Join_variants_currents.py, and can then be used to train models or for preferred statistical analyses.
- train SVM models Suppose we have a pair of samples and know that certain RNA nucleotides in one sample is modified but not in the other one. We can label the samples with modificaiton status.
bash $EPINANO_HOME/misc/Epinano_LabelSamples.sh -m mod.per_site.5mer.csv -u unm.per_site.5mer.csv -o combined.per_site_raw_feature.5mer.csv
Then, it is straightforward to train SVM models:
python $EPINANO_HOME/Epinano_Predict.py \
--train combined.per_site_raw_feature.5mer.csv \
--predict combined.per_site_raw_feature.5mer.csv \
--accuracy_estimation --out_prefix train_and_test \
--columns 8,13,23 --modification_status_column 26
This is an example using single site- quality, mismatch, and deletion to train SVM models. The output files names will depict the feature used for training. Moreover, the accuracy of all trained models are also listed to aid choosing the final most accurate one.
Note that, in this case, the same sample was used for both training and performance assessment, the Epinano_Predict will split the input data in halves for each purpose. It would be nice to have more than one samples.
- detect modificaitons with trained model(s) Given a sample feature table and to predict modificaitons, you only need to run:
python $EPINANO_HOME/Epinano_Predict.py \
--model $EPINANO_HOME/models/q3.mis3.del3.MODEL.linear.model.dump \
--predict sample.per_site.5mer.csv \
--columns 8,13,23 \
--out_prefix some_sample_mod_predictiiton
- detect modificaitons without using trained model(s)
Rscript $EPINANO_HOME/Epinano_DiffErr.R -k unm.per.site.var.csv -w mod.per.site.var.csv -d 0.1 -t 3 -p -o diffErr -f mis
Here, this script uses the mismatch feature (-f mis) to predict modifications.
It is also possible to use deletion and any other derived features for modification detection.
For example, we can compose a sum_err feature and then use it to detect modifications.
python $EPINANO_HOME/misc/Epinano_sumErr.py --file unm.per.site.var.csv --out unm.sum_err.csv --kmer 0
python $EPINANO_HOME/misc/Epinano_sumErr.py --file mod.per.site.var.csv --out mod.sum_err.csv --kmer 0
Rscript $EPINANO_HOME/Epinano_DiffErr.R -k unm.sum_err.csv -w mod.sum_err.csv -d 0.1 -t 3 -p -o sumErr -f sum_err
We also offered Epinano_make_delta.py and Epinano_delta_sumErr.py to compute difference between modified and un-modiifed samples in terms of the raw and summed errors. These new features capturing difference between samples can also be used for training SVM models or directly utilized to detect modifications. We have shown that SVM models trained with deviance features tend to be robust to modification variaties, which means these models, though trained with a specific type of modificaiton, can be used to predict any type of modifications. But we will keep testing this is more types of RNA modifications.
- Visulize results. We provided Epinano_Plot.R to visualize Epinnao predictions. All you need is to run this script on any prediciton result generated with the above mentioned method.
Rscript $EPINANO_HOME/misc/Epinano_Plot.R SVM_prediciton.csv
Rscript $EPINANO_HOME/Epinano_Plot.R diffErr.delta-mis.prediction.csv
Rscript $EPINANO_HOME/Epinano_Plot.R diffErr.linear-regression.prediction.csv
The latest version of epinano was re-written in python3 and re-organized to avoid typing multiple commands as required by the previous versions.
- prepare reference transcriptome
samtools faidx transcriptome.fasta
java -jar picard.jar CreateSequenceDictionary R=transcriptome.fasta
- mapping reads to reference transcriptome (No trimming)
minimap2 -ax map-ont transcriptome.fasta sample.fastq | samtools view -hSb - | samtools sort -@ 4 - sample.bam
samtools index sample.bam
- calling variants
samtools view -h -F 3844 sample.bam | java -jar sam2tsv.jar -r transcriptome.fasta > sample.tsv
- compute per site/read variants frequencies
python /path/to/TSV_to_Variants_Freq.py3 -f sample.tsv -t 10
More functional components will be added to add current intensity information from both albacore and guppy basecalled fast5 files.
The previous version of EpiNano was written in python2 and run in more steps than the new python3 version.
- To extract features from basecalled FASTQ files:
#1 trim the first and last few bad quality bases from raw fastq with NanoFilt (feel free to replace nanofilt with custome script)
cat MOD.fastq|./NanoFilt -q 0 --headcrop 5 --tailcrop 3 --readtype 1D --logfile mod.nanofilt.log > mod.h5t3.fastq
cat UNM.fastq|./NanoFilt -q 0 --headcrop 5 --tailcrop 3 --readtype 1D --logfile unm.nanofilt.log > unm.h5t3.fastq
#2 'U' to 'T' conversion
awk '{ if (NR%4 == 2) {gsub(/U/,"T",$1); print $1} else print }' mod.h5t3.fastq > mod.U2T.fastq
awk '{ if (NR%4 == 2) {gsub(/U/,"T",$1); print $1} else print }' unm.h5t3.fastq > unm.U2T.fastq
#3 mapping to reference using minimap2
minimap2 -ax map-ont ref.fasta mod.U2T.fastq | samtools view -bhS - | samtools sort -@ 6 - mod && samtools index mod.bam
minimap2 -ax map-ont ref.fasta unm.U2T.fastq | samtools view -bhS - | samtools sort -@ 6 - unm && samtools index unm.bam
#4 calling variants for each single read-to-reference alignment
## reads mapped to reverse strand of reference seqeucne will be flipped
samtools view -h mod.bam| java -jar sam2tsv.jar -r ref.fasta > mod.bam.tsv
samtools view -h unm.bam | java -jar sam2tsv.jar -r ref.fasta > unm.bam.tsv
#5 convert results from step 4 and generate per_read variants information; the input file can be splitted based on read into smaller files to speed this step up.
python2 per_read_var.py mod.bam.tsv > mod.per_read.var.csv
python2 per_read_var.py unm.bam.tsv > unm.per_read.var.csv
#6 sumarize results from step 4 and generate variants information according the reference sequences (i.e., per_site variants); the input file can be splitted based on ref into smaller ones to speed this step up.
python2 per_site_var.py mod.bam.tsv > mod.per_site.var.csv
python2 per_site_var.py unm.bam.tsv > unm.per_site.var.csv
#7 slide per_site variants with window size of 5, so that fast5 event table information can be combined
python2 slide_per_site_var.py mod.ref.per_site.var.csv > mod.per_site.var.sliding.win.csv
python2 slide_per_site_var.py unm.ref.per_site.var.csv > unm.per_site.var.sliding.win.csv
- To extract features from FAST5 files:
#1 extract event table from fast5 files; event table has 14 columns:
mean stdv start length model_state move weights p_model_state mp_state p_mp_state p_A p_C p_G p_T.
the meaning of these columns are explained at: https://community.nanoporetech.com/technical_documents/data-analysis/v/datd_5000_v1_revj_22aug2016/basecalled-fast5-files
python2 fast5ToEventTbl.py input.fast5 > output.event.tbl
#2 extract features needed (esp. current intensity) for downstream analyses.
The output contains the kmers and their centre base position (1-absed) in reads.
python2 event_tbl_feature_extraction.py output.event.tbl > output.event.tbl.features
#3 combine extracted features with per_read and per_site variants information
python2 fastq_len.py h5t3.fastq > h5t3.fastq.len
python2 adjust_read_base_positions.py h5t3.fastq.len output.event.tbl.features number_of_chopped_leading_bases number_of_chopped_end_bases > output.event.tbl.features.readposition.adj.csv
python2 assign_current_to_per_read_kmer.py output.event.tbl.features.readposition.adj.csv per_rd_var.del.adjust.csv.summed.oneKmer_oneLine > per_read.var.current.csv
python2 per_read_kmer_intensity_to_per_site_kmer_intensity.py per_read.var.current.csv per_site.varsliding.win.csv > per_site.var.current.csv