EpiNano wiki

Welcome to the EpiNano wiki!

Commands for Epinano 1.2 (hereafter referred to as Epi1.2)

The commands to use Epi1.2 to detect RNA base modifications with and without using machine learing can be found in test_data/make_predictions/run.sh and test_data/train_models/train_test.sh.

The commands for reads mapping are the same as shown for previous Epinano versions. But we encourage users to try out different mapping parameters and alternative mappers, for instance, using graphmap2 instead of minimap2.

Compared to previouse versions, we also intrduce new metrics, such as delta-errors and sum_errors. The former one capture the difference between paired modified and unmodified samples, while the later one combine mismatches ande indels (and transformed quality score as you wish). These two metrics are composed to overcom (supposedly but need extensive tests) biases in terms of types of errors introduced by different RNA modifications. We provided miscellaneous scripts Epinano_make_delta.py and Epinano_sumErr.py to compute them. We showcase the usage of these new metrics with pU modificaitons.

Base-Calling

guppy_basecaller --device cuda:0 -c rna_r9.4.1_70bps_hac.cfg --compress_fastq -i path/to/fast5_directory -r -s /path/to/save/basecalling_out --fast5_out

--device will be swithed on or off dependent on gpu availability!

Reads Mapping

# map to transcriptome
minimap2 --MD -t 6 -ax map-ont ref.fasta sample.reads.fastq | samtools view -hbS -F 3860 - | samtools sort -@ 6 - sample.reads

# map to genome
minimap2 --MD -t 6 -ax splice -k14 -uf  ref.fasta sample.reads.fastq | samtools view -hbS -F 3844 - | samtools sort -@ 6 - sample.reads

generate errors feature table

python $EPINANO_HOME/Epinano_Variants.py -n 4 -R ref.fasta -b sample.reads.bam -s $EPINANO_HOME/misc/sam2tsv.jar --type t

By default, --type t: bam file was derived from mapping reads to transcriptome.

The output will be sample.plus_strand.per.site.csv

Alternatively, --type g: bam file was produced by (splicing) mapping reads to genome; The output consists of two files: sample.plus_strand.per.site.csv and sample.minus_strand.per.site.csv

The results can be re-organized in per windows/kmers format.

python misc/Slide_Variants.py test_data/make_predictions/ko.plus_strand.per.site.csv 5

The output xmer.csv file contains the following columns:

Kmer: kmer (K consecutive bases)
Window: positions in the reference where the kmer was extracted
Ref: Reference sequenc ID
Strand: to which strand the reads were mapped and features were extracted and computed
Coverage: how many reads' bases were mapped at the K consecutive positions
q[i]: i=[123...], mean quality score of all reads' bases that are mapped to the same reference position and corresponding to the the ith position in the kmer
mis[i]: similar to q[i] but the metric is mismatch frequency
ins[i]: similar to q[i] but the metric is insertion fequency
del[i]: similar to q[i] but the metric is deletion fequency

extract electrical signals

sh $EPINANO_HOME/Epinano_Current.sh -b sample.reads.bam -r sample.reads.fastq -f reference.fasta  -t 6  -m g -d fast5_folder/

The output is organized on single-reference-sequence basis and can be combined using $EPINANO_HOME/misc/concat_events.py, and can also be re-organized in kmer basis with $EPINANO_HOME/misc/Slide_Intensity.py. Last but not least, the per kmer electric current information can be combined with per kmer error features using $EPINANO_HOME/misc/Join_variants_currents.py, and can then be used to train models or for preferred statistical analyses.

train SVM models Suppose we have a pair of samples and know that certain RNA nucleotides in one sample is modified but not in the other one. We can label the samples with modificaiton status.

bash $EPINANO_HOME/misc/Epinano_LabelSamples.sh -m mod.per_site.5mer.csv -u unm.per_site.5mer.csv -o combined.per_site_raw_feature.5mer.csv

Then, it is straightforward to train SVM models:

python $EPINANO_HOME/Epinano_Predict.py \
          --train combined.per_site_raw_feature.5mer.csv   \
          --predict combined.per_site_raw_feature.5mer.csv  \
          --accuracy_estimation --out_prefix train_and_test \
          --columns 8,13,23 --modification_status_column 26

This is an example using single site- quality, mismatch, and deletion to train SVM models. The output files names will depict the feature used for training. Moreover, the accuracy of all trained models are also listed to aid choosing the final most accurate one.

Note that, in this case, the same sample was used for both training and performance assessment, the Epinano_Predict will split the input data in halves for each purpose. It would be nice to have more than one samples.

detect modificaitons with trained model(s) Given a sample feature table and to predict modificaitons, you only need to run:

python $EPINANO_HOME/Epinano_Predict.py \
--model $EPINANO_HOME/models/q3.mis3.del3.MODEL.linear.model.dump \
--predict sample.per_site.5mer.csv \
--columns 8,13,23 \
--out_prefix some_sample_mod_predictiiton

detect modificaitons without using trained model(s)

Rscript $EPINANO_HOME/Epinano_DiffErr.R -k unm.per.site.var.csv -w mod.per.site.var.csv -d 0.1 -t 3 -p -o diffErr -f mis

Here, this script uses the mismatch feature (-f mis) to predict modifications.

It is also possible to use deletion and any other derived features for modification detection.

For example, we can compose a sum_err feature and then use it to detect modifications.

python $EPINANO_HOME/misc/Epinano_sumErr.py --file unm.per.site.var.csv --out unm.sum_err.csv --kmer 0 
python $EPINANO_HOME/misc/Epinano_sumErr.py --file mod.per.site.var.csv --out mod.sum_err.csv --kmer 0
Rscript $EPINANO_HOME/Epinano_DiffErr.R -k unm.sum_err.csv -w mod.sum_err.csv -d 0.1 -t 3 -p -o sumErr -f sum_err

We also offered Epinano_make_delta.py and Epinano_delta_sumErr.py to compute difference between modified and un-modiifed samples in terms of the raw and summed errors. These new features capturing difference between samples can also be used for training SVM models or directly utilized to detect modifications. We have shown that SVM models trained with deviance features tend to be robust to modification variaties, which means these models, though trained with a specific type of modificaiton, can be used to predict any type of modifications. But we will keep testing this is more types of RNA modifications.

Visulize results. We provided Epinano_Plot.R to visualize Epinnao predictions. All you need is to run this script on any prediciton result generated with the above mentioned method.

Rscript $EPINANO_HOME/misc/Epinano_Plot.R SVM_prediciton.csv

Rscript $EPINANO_HOME/Epinano_Plot.R diffErr.delta-mis.prediction.csv

Rscript $EPINANO_HOME/Epinano_Plot.R diffErr.linear-regression.prediction.csv

Commands for EpiNano 1.1

The latest version of epinano was re-written in python3 and re-organized to avoid typing multiple commands as required by the previous versions.

prepare reference transcriptome

samtools faidx transcriptome.fasta
java -jar picard.jar CreateSequenceDictionary R=transcriptome.fasta

mapping reads to reference transcriptome (No trimming)

minimap2 -ax map-ont transcriptome.fasta sample.fastq | samtools view -hSb - | samtools sort -@ 4 - sample.bam 
samtools index sample.bam

calling variants

samtools view -h -F 3844 sample.bam |  java -jar sam2tsv.jar -r transcriptome.fasta  > sample.tsv

compute per site/read variants frequencies

python /path/to/TSV_to_Variants_Freq.py3 -f sample.tsv -t 10

More functional components will be added to add current intensity information from both albacore and guppy basecalled fast5 files.

Commands for the first version of EpiNano (version 1.0)

The previous version of EpiNano was written in python2 and run in more steps than the new python3 version.

To extract features from basecalled FASTQ files:

#1 trim the first and last few bad quality bases from raw fastq with NanoFilt (feel free to replace nanofilt with custome script)

cat MOD.fastq|./NanoFilt -q 0 --headcrop 5 --tailcrop 3 --readtype 1D --logfile mod.nanofilt.log > mod.h5t3.fastq
cat UNM.fastq|./NanoFilt -q 0 --headcrop 5 --tailcrop 3 --readtype 1D --logfile unm.nanofilt.log > unm.h5t3.fastq

#2 'U' to 'T' conversion

awk '{ if (NR%4 == 2) {gsub(/U/,"T",$1); print $1} else print }' mod.h5t3.fastq > mod.U2T.fastq
awk '{ if (NR%4 == 2) {gsub(/U/,"T",$1); print $1} else print }' unm.h5t3.fastq > unm.U2T.fastq

#3 mapping to reference using minimap2

minimap2 -ax map-ont ref.fasta mod.U2T.fastq | samtools view -bhS - | samtools sort -@ 6 - mod  && samtools index mod.bam
minimap2 -ax map-ont ref.fasta unm.U2T.fastq | samtools view -bhS - | samtools sort -@ 6 - unm  && samtools index unm.bam

#4 calling variants for each single read-to-reference alignment
## reads mapped to reverse strand of reference seqeucne will be flipped

samtools view -h mod.bam| java -jar sam2tsv.jar -r  ref.fasta  > mod.bam.tsv
samtools view -h unm.bam | java -jar sam2tsv.jar -r  ref.fasta  > unm.bam.tsv

#5 convert results from step 4 and generate per_read variants information; the input file can be splitted based on read into smaller files to speed this step up.

python2 per_read_var.py mod.bam.tsv > mod.per_read.var.csv
python2 per_read_var.py unm.bam.tsv > unm.per_read.var.csv

#6 sumarize results from step 4 and generate variants information according the reference sequences (i.e., per_site variants); the input file can be splitted based on ref into smaller ones to speed this step up.

python2 per_site_var.py mod.bam.tsv > mod.per_site.var.csv
python2 per_site_var.py unm.bam.tsv > unm.per_site.var.csv

#7 slide per_site variants with window size of 5, so that fast5 event table information can be combined

python2 slide_per_site_var.py mod.ref.per_site.var.csv > mod.per_site.var.sliding.win.csv
python2 slide_per_site_var.py unm.ref.per_site.var.csv > unm.per_site.var.sliding.win.csv

To extract features from FAST5 files:

#1 extract event table from fast5 files; event table has 14 columns:
mean    stdv    start   length  model_state     move    weights p_model_state   mp_state        p_mp_state  p_A     p_C     p_G     p_T.
the meaning of these columns are explained at: https://community.nanoporetech.com/technical_documents/data-analysis/v/datd_5000_v1_revj_22aug2016/basecalled-fast5-files


python2 fast5ToEventTbl.py input.fast5 > output.event.tbl

#2 extract features needed (esp. current intensity) for downstream analyses.
The output contains the kmers and their centre base position (1-absed) in reads.

python2 event_tbl_feature_extraction.py output.event.tbl > output.event.tbl.features

#3 combine extracted features with per_read and per_site variants information

python2 fastq_len.py h5t3.fastq > h5t3.fastq.len
python2 adjust_read_base_positions.py  h5t3.fastq.len output.event.tbl.features number_of_chopped_leading_bases number_of_chopped_end_bases > output.event.tbl.features.readposition.adj.csv
python2 assign_current_to_per_read_kmer.py output.event.tbl.features.readposition.adj.csv per_rd_var.del.adjust.csv.summed.oneKmer_oneLine  > per_read.var.current.csv
python2 per_read_kmer_intensity_to_per_site_kmer_intensity.py per_read.var.current.csv per_site.varsliding.win.csv > per_site.var.current.csv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

EpiNano wiki

Commands for Epinano 1.2 (hereafter referred to as Epi1.2)

Commands for EpiNano 1.1

Commands for the first version of EpiNano (version 1.0)

Clone this wiki locally