Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Trimming with custom sequences #1174

Open
sarah-buddle opened this issue Dec 13, 2024 · 5 comments
Open

Trimming with custom sequences #1174

sarah-buddle opened this issue Dec 13, 2024 · 5 comments

Comments

@sarah-buddle
Copy link

Issue Report

I have run dorado demux and I've noticed that some primer/adapter sequences are left at the start the reads, so I have tried to run dorado trim to remove them. I've tried providing the sequence of the primers using --primer-sequences and also without this parameter. In both cases, some reads are trimmed but I find that in the output file many reads (100s out of 3195 total reads) still contain these sequences at the start. Is there a setting I should change so the trimming works for all the reads?

Steps to reproduce the issue:

The command I am using is:
dorado trim input.fastq --emit-fastq --primer-sequences primers.fasta > output.fastq

The input file was the output of the dorado demux command (with trimming enabled) run on a previously basecalled fastq file.

Custom primer sequences that are still found in output:

>seq1
TTCAGACGTGTGCTCTTCCGATCT
>seq2
CCTACACGACGCTCTTCCGATCT
>seq3
CCTACACGACG

Run environment:

  • Dorado version: 0.8.3+98456f7
  • Dorado command: see above
  • Operating system: CentOS Linux 7 (Core)
  • Hardware (CPUs, Memory, GPUs): CPUs, 10G memory currently (I'm running a small file for testing)
  • Source data type: fastq, previously basecalled from pod5
  • Source data location (on device or networked drive - NFS, etc.): On HPC
  • Details about data (flow cell, kit, read lengths, number of reads, total dataset size in MB/GB/TB): Test fastq file is 1.3M

Logs

[2024-12-13 16:18:57.672] [info] Running: "trim" "input.fastq" "--primer-sequences" "primers.fasta" "--emit-fastq" "-v"
[2024-12-13 16:18:57.672] [debug] > adapter/primer trimming threads 58, writer threads 6
[2024-12-13 16:18:57.673] [info] - Note: FASTQ output is not recommended as not all data can be preserved.
[2024-12-13 16:18:57.677] [info] > starting adapter/primer trimming
[2024-12-13 16:18:57.727] [debug] Total reads processed: 3195
[2024-12-13 16:18:57.781] [info] > Simplex reads basecalled: 3195
[2024-12-13 16:18:57.781] [info] > finished adapter/primer trimming

@malton-ont
Copy link
Collaborator

Hi @sarah-buddle,

There's nothing obvious in your command that should stop this from working. Are you able to share the test data file with us?

@sarah-buddle
Copy link
Author

Sorry for the delayed reply, here are my test input and output files.
input.fastq.gz
output.fastq.gz

@malton-ont
Copy link
Collaborator

Hi @sarah-buddle,

There are a few things that could be affecting this:

  1. In certain circumstances, barcode trimming and adapter trimming can generate a trim region that is infeasible (e.g. barcode trimming thinks we should retain bases 11-40, but adapter trimming thinks we should retain 52-60) - in these cases no trimming is performed. This can usually be rectified by performing the barcoding during the demux step instead of during basecalling.
  2. For very short reads, trimming can identify that it should remove the entire read - in these cases, no trimming is performed. These reads should probably be discarded, as they are likely just adapter fragments.
  3. Trimming identifies two candidate primers that have very similar alignment scores - in this case we trim the larger region, even if its score is slightly lower.

I think no.3 is what is occurring during your dorado trim calls - seq1 and seq2 are generating quite similar scores (in different locations of the read), and instead of taking the perfect match to seq2 we're taking the slightly worse (but still very good!) match to seq1 (or vice versa). You can resolve this by running dorado trim several times with a single primer sequence each time.

You may also want to try dorado 0.9.0 - this appears to have better results (since it's possible to specify which end the primer sequence relates to, and it's therefore possible to skip the RC search, I suspect). Note that this will require some tweaking to your custom primer file:

>seq1_front type=primer kits=<kit-name>
TTCAGACGTGTGCTCTTCCGATCT
>seq2_front type=primer kits=<kit-name>
CCTACACGACGCTCTTCCGATCT
>seq3_front type=primer kits=<kit-name>
CCTACACGACG

and then calling with:

dorado trim \
  --primer-sequences seqs.fasta \
  --emit-fastq --sequencing-kit <kit-name> \ 
  input.fastq 
> output.fastq

See the docs here for details.

@sarah-buddle
Copy link
Author

Thanks that's very helpful! I suspect it may be mainly down to point 2, possibly in combination with a non-standard adapter pattern. In answer to each of your points:

  1. We've already run SUP basecalling without demultiplexing or trimming, and then ran dorado demux separately with trimming enabled. I believe this doesn't trim the sequences automatically because we're using a non-standard barcode/adaptor arrangement.
  2. The reads are all very short (I am expecting an N50 of around 150 bp after trimming) - this is deliberate and part of a method we're developing. I tried extracting the reads where the first primer was not successfully trimmed and running our downstream analysis pipeline on these reads alone. The reads contain real useful sequence data and are not just adaptor sequences, so ideally I don't want to discard them. Could it be that our short reads often don’t meet some kind of minimum length threshold within dorado?
  3. I tried running the analysis sequentially for each primer and just with one primer alone. This didn't substantially change the results.

I tried running dorado v0.9.0. This didn't trim any of the primers at all (with v0.8.3 it trims most but not all of them), so perhaps I misunderstood the documentation. The command I used was:

dorado trim input.fastq \
--primer-sequences custom_primers.fasta \
--emit-fastq \
--sequencing-kit custom \
> output.fastq

and my custom_primers.fastq looked like this:

>seq1_front type=primer kits=custom
TTCAGACGTGTGCTCTTCCGATCT

@malton-ont
Copy link
Collaborator

Assuming your barcodes are arranged in the "normal" way:

5' --- ADAPTER/PRIMER --- LEADING_FLANK_1 --- BARCODE_1 --- TRAILING_FLANK_1 --- READ --- RC(TRAILING_FLANK_2) --- RC(BARCODE_2) --- RC(LEADING_FLANK_2) --- 3'

then I would expect barcode trimming to remove your adapters as well. If you have something else (like primers and barcodes the other way around) then anything inward of the barcode will remain in place (dorado demux only trims barcodes, not anything else), but dorado trim should be able to remove them (and does in the example you shared with me - 0.9.0 removed all but 1 leading primers in my testing, and that read was primer only).

Re. point 2: you could try running with the -v flag and see if there are any lines along the lines of Invalid trim interval...

Could you share the commands for your entire pipeline, from start to end?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants