Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Polishing index error #1196

Open
ekhtan opened this issue Dec 23, 2024 · 13 comments
Open

Polishing index error #1196

ekhtan opened this issue Dec 23, 2024 · 13 comments
Labels
polish Issues related to polishing

Comments

@ekhtan
Copy link

ekhtan commented Dec 23, 2024

Hello, I'm having issues with the polishing step, even after performing the alignment with dorado (running 0.9.0 on an A100 GPU). Any ideas about thie missing index file?

$ dorado aligner assembly.fasta filtered_sup.fq > aln.bam
[2024-12-23 07:14:55.234] [info] Running: "aligner" "assembly.fasta" "filtered_sup.fq"
[2024-12-23 07:14:55.235] [info] num input files: 1
[2024-12-23 07:14:55.235] [info] > loading index assembly.fasta
[2024-12-23 07:14:55.288] [info] processing filtered_sup.fq -> -
[2024-12-23 07:14:55.305] [info] > starting alignment
[2024-12-23 07:14:57.221] [info] > finished alignment
[2024-12-23 07:14:57.248] [info] > Finished in (ms): 1916
[2024-12-23 07:14:57.248] [info] > Simplex reads basecalled: 10043
[2024-12-23 07:14:57.248] [info] > total/primary/unmapped 12077/10039/4

Error seen below

$ dorado polish aln.bam assembly.fasta > polished.fasta
[2024-12-23 07:18:23.978] [info] Running: "polish" "aln.bam" "assembly.fasta"
[E::idx_find_and_load] Could not retrieve index file for 'aln.bam'
[2024-12-23 07:18:23.987] [error] Caught exception: Could not open index for BAM file: 'aln.bam'!
@malton-ont
Copy link
Collaborator

Hi @ekhtan,

The input bam file for dorado polish must be sorted and indexed. The output from dorado aligner is only sorted if the input is a folder or you have specified --output-dir, not when stdout redirection is used.

You can sort and index the bam file using samtools.

@malton-ont malton-ont added the polish Issues related to polishing label Dec 23, 2024
@michael-olufemi
Copy link

I am getting a different error myself after sorting and indexing
[error] Caught exception: Input BAM file has no basecaller models listed in the header.

@malton-ont
Copy link
Collaborator

@michael-olufemi,

dorado polish is intended to work with files basecalled by dorado. These files should contain @RG lines in the header, which contain a DS tag containing text like basecall_model=.... If these lines are missing, dorado polish will not accept the file.

@michael-olufemi
Copy link

I used the minknow GUI to do live basecalling. Do I need to rebasecall from the pod5s again?

@malton-ont
Copy link
Collaborator

malton-ont commented Dec 23, 2024

dorado polish currently only works with data basecalled using the v5 hac and sup models (see here). These models are not available in the latest minknow, so yes, you will need to rebasecall using dorado.

@michael-olufemi
Copy link

Alright, Thank you @malton-ont

@ekhtan
Copy link
Author

ekhtan commented Dec 24, 2024

Hi @ekhtan,

The input bam file for dorado polish must be sorted and indexed. The output from dorado aligner is only sorted if the input is a folder or you have specified --output-dir, not when stdout redirection is used.

You can sort and index the bam file using samtools.

Thank you!

@ekhtan
Copy link
Author

ekhtan commented Dec 24, 2024

Hello, I encountered similar issues with the other commentor here. The file was basecalled and aligned with with dorado all using sup v5.0.0 -- and it's having some header issues

[2024-12-24 09:54:07.150] [info] Running: "polish" "aln_dorado/calls_sup.bam" "assembly.fasta"
[2024-12-24 09:54:07.191] [error] Caught exception: Input BAM file has no basecaller models listed in the header.

Here's the fq file I used for alignment on dorado, which has the headers.
To me, it looks like dorado aligner removed the @rg in the bam files?

@d8f0ac7e-d18d-42d0-adc8-647e161e6d3b qs:f:20.3027      st:Z:2024-12-19T19:42:42.276+00:00      RG:Z:5af9c1720280fecd6c3a34a912c1addc6de26f5c_dna_r10.4.1_e8.2_400bps_sup@v5.0.0
        DS:Z:gpu:NVIDIA A100-PCIE-40GB
GTTATGTTTAGCCTTTGGTTCGTTTACGTATTGCTGGGTTTGGAGAAATAGAGTAAATCCCATAGGTTTTTCAGAATTAATGTTACCAGTTGCAGAGGATAATCTTCTTAAGGAGAGAGCTAAGGAGGGAAGCATAACTTTAAAGGATTTGTTATTAATGAGTTATGTTTGTGTCGCAGGCATTGACATGGTTGGTATATATTCAGATTTAGTTACATATGAAAATATATTGAAATCAATTTACTATATTCAATATACAAAACGAAAACCCTACGGTGTAAGGATGATACCAACTAATGGAAGCCCAGTATTAACTAAGATGTTTGGAGAAATACCCGAGGTAAAAGTAGTATAGTATAGTATAAAACTACTTT

bam file output from dorado aligner

f59d38ee-e0a5-491c-8771-8045a28d44c6 2064 contig_1 1 60 28024H261M1I149M1D1M3D382M1I288M1I27M1I7M1I8M1D170M3I5M1D1M2I1018M1I1M1I293M4I3M2I220M1I59M1D72M1I2M2I18M1I2M1I1065M2D7M2D654M1D108M1D128M1D64M1D41M2I240M1D606M1D3M1D254M2D230M1D6M1D38M2D6M2D3M1I495M1D3M2I5M1D244M2D1M1I28M1I10M3D9M1D9M1D142M1D370M1I447M1I493M1D2M1I450M2D204M1I43M1I450M1D21M1D53M3I160M1I22M3D75M1D316M3D223M1I239M1D5M1D88M1I211M1D256M3D13M3D66M1D42M1I5M1D2M2I1401M1D472M2I89M1H * 0 0 AGCTCCTGCTGGCGCGTCCCCAAAGGATTCAACACCGTCATTAGTTCTTACGAAAACTCTTATGGTGGGATTGCCCCTAGAATCTATGATTTCTAATCCCTTAACCTTCTCTATGGAAAAACGGTTAATCATATTTACTGAAGCGAATTAGCAATAATTAAATGTTAACTCCTAGCATGGCCTCGATAGTGTTTTTAGCATATGGTTTACTTTCCCTCATTTTCTCTAGATTTTTAAAAG

@michael-olufemi
Copy link

Hello @malton-ont ,

I am having similar errors after basecalling with dorado sup v5.0.0.
[2024-12-24 17:30:06.555] [info] Running: "polish" "FC108_reads_to_draft.sorted.bam" "FC108-scaffold.fasta" "--device" "cuda:0" "--threads" "64"
[2024-12-24 17:30:07.021] [error] Caught exception: Input BAM file has no basecaller models listed in the header.

@gwaldbieser
Copy link

Hello @malton-ont,

I basecalled with dorado v8.3 using model dna_r10.4.1_e8.2_400bps_sup@v5.0.0 then assembled the reads using hifiasm.

Then I aligned reads to the assembled contigs using dorado aligner v.0.9.0.
Input fastq has header line such as:
@74960cfd-0b82-43ed-ae04-05162e3c0a5a qs:f:27.7534 du:f:75.1604 ns:i:375802 ts:i:1858 mx:i:1 ch:i:295 st:Z:2024-08-29T22:06:03.400+00:00 rn:i:585 fn:Z:FBA17175_7da7e070_f8e851a5_5.pod5 sm:f:414.101 sd:f:107.157 sv:Z:pa dx:i:0 RG:Z:f8e851a5d56475e9ecaa43496da18fad316883d8_dna_r10.4.1_e8.2_400bps_sup@v5.0.0

Output bam for corresponding read (with long CIGAR, sequence, and quality fields omitted for space):
74960cfd-0b82-43ed-ae04-05162e3c0a5a 16 h1tg000034l 1 60 * 0 0 st:Z:2024-08-29T22:06:03.400+00:00 fn:Z:FBA17175_7da7e070_f8e851a5_5.pod5 sv:Z:pa RG:Z:f8e851a5d56475e9ecaa43496da18fad316883d8_dna_r10.4.1_e8.2_400bps_sup@v5.0.0

Then I invoked dorado polish and received the basecaller model error:
"Caught exception: Input BAM file has no basecaller models listed in the header"

Do we require a specific SAM header line to include the model? If so, please advise. Thanks!

$samtools view -h ont.bam | head -10
@hd VN:1.6 SO:coordinate
@pg ID:aligner PN:dorado VN:0.9.0+9dc15a8 DS:2.27-r1193
@pg ID:samtools PN:samtools PP:aligner VN:1.17 CL:samtools view -h hap1Align/BCNH.Q15.toCorrect.bam
@sq SN:h1tg000034l LN:84915712
@sq SN:h1tg000032l LN:68433863
@sq SN:h1tg000031l LN:51326835
@sq SN:h1tg000049l LN:50661075
@sq SN:h1tg000009l LN:46315850
@sq SN:h1tg000064l LN:39681836
@sq SN:h1tg000108l LN:39408415

@malton-ont
Copy link
Collaborator

@gwaldbieser, @ekhtan,

The BAM file must contain the @RG headers corresponding to the

dorado polish is intended to work with files basecalled by dorado. These files should contain @RG lines in the header, which contain a DS tag containing text like basecall_model=.... If these lines are missing, dorado polish will not accept the file.

I believe your issue is that you basecalled to fastq (which does not store these headers) prior to the alignment step.

Here's the steps I've just used:

dorado basecaller hac pod5/ > calls.bam
<assemble>
dorado aligner <reference> calls.bam > aligned.bam
samtools sort aligned.bam > sorted.bam
samtools index sorted.bam
samtools view -H sorted.bam

which gives:

@HD     VN:1.6  SO:coordinate
@PG     ID:basecaller   PN:dorado       VN:0.9.0+9dc15a8      CL:dorado basecaller hac pod5
@PG     ID:aligner      PP:basecaller   PN:dorado       VN:0.9.0+9dc15a8      DS:2.27-r1193
@PG     ID:samtools     PN:samtools     PP:aligner      VN:1.10 CL:samtools sort -o sorted.bam aligned.bam
@PG     ID:samtools.1   PN:samtools     PP:samtools     VN:1.10 CL:samtools view -H sorted.bam
@RG     ID:test_dna_r10.4.1_e8.2_400bps_hac@v5.0.0     PU:test PM:test DT:2023-11-25T19:13:44.029+00:00        PL:ONT  DS:basecall_model=dna_r10.4.1_e8.2_400bps_hac@v5.0.0 runid=test        LB:test SM:test

@gwaldbieser
Copy link

gwaldbieser commented Jan 2, 2025

Thanks. I actually have the bam files available. In the event of multiple runs and multiple bams, does it make a difference whether one cats them into a single bam that's aligned, or is it better to align each bam then cat the output into a single bam?

@malton-ont
Copy link
Collaborator

@gwaldbieser,

I don't believe it should make a difference either way.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
polish Issues related to polishing
Projects
None yet
Development

No branches or pull requests

4 participants