
dorado basecalling bam file end up corrupt during sort #1197

Open
uribertocchi opened this issue Dec 26, 2024 · 7 comments
@uribertocchi

I am experiencing an issue with Dorado where the generated BAM files appear corrupted. The basecalling process completes without any reported errors, but subsequent attempts to sort the BAM files (using both Sambamba sort and Samtools sort) result in errors such as:

sambamba-sort: [zlib] buf error
sambamba-sort: [zlib] data error

I expect to produce a correctly formatted and usable BAM file, but the sorting process consistently fails.

Steps to reproduce the issue:

dorado basecaller dna_r10.4.1_e8.2_400bps_hac@v5.0.0 --modified-bases 6mA --reference /path/to/reference.fa -x cuda:all --recursive /path/to/pod5_pass > output.bam

Run environment:

Dorado version: v0.9.0 (previous versions tested with similar results)

  • Operating system:

  • Hardware (CPUs, Memory, GPUs):
    CPUs: 96-core processor
    Memory: 1.5 TiB
    GPUs: NVIDIA L40 (4 GPUs)

  • Source data location (on device or networked drive - NFS, etc.): networked drive

  • Details about data: runs producing output BAM files over roughly 30GB usually end up corrupt.

@HalfPhoton
Collaborator

Hi @uribertocchi,
Can you add some more information, please?

  • Could you share more detailed logs from samtools, and more information about your system (e.g. OS, CPU architecture)?
  • Can you reproduce the issue with a single read, with/without the reference (alignment)?

Best regards,
Rich

@uribertocchi
Author

Hi @rich,

Thank you for your response! Here is the additional information:

System Details:
Operating System: Linux (specific version: Ubuntu 22.04.1 LTS, kernel 6.8.0-49-generic)
CPU Architecture: AMD EPYC 9654, 96-core processor
Memory: 1.5 TiB
GPUs: NVIDIA L40 (4 GPUs)

Samtools sort logs:
[E::bgzf_uncompress] CRC32 checksum mismatch
[E::bgzf_read_block] BGZF decode jobs returned error 1 for block offset 3605376433
[E::bgzf_read] Read block operation failed with error 1 after 0 of 4 bytes
samtools sort: truncated file. Aborting
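Those bgzf errors point at one specific bad compressed block. To pin down where in the file corruption starts (independent of samtools), here is a minimal Python sketch that walks the BGZF blocks one by one and verifies each block's stored CRC32, raising at the offset of the first bad block. It assumes a well-formed BGZF container up to that point; the block layout is taken from the SAM/BAM specification.

```python
import struct
import zlib

def scan_bgzf(path):
    """Walk every BGZF block in `path`; verify each block's stored
    CRC32 and uncompressed size. Returns the number of valid blocks,
    or raises ValueError with the byte offset of the first bad block."""
    blocks = 0
    with open(path, "rb") as fh:
        while True:
            offset = fh.tell()
            header = fh.read(12)
            if not header:
                return blocks                      # clean end of file
            if len(header) < 12 or header[:2] != b"\x1f\x8b":
                raise ValueError(f"bad gzip magic at offset {offset}")
            (xlen,) = struct.unpack("<H", header[10:12])
            extra = fh.read(xlen)
            bsize = None                           # total block size - 1
            i = 0
            while i + 4 <= len(extra):             # scan extra subfields
                slen = struct.unpack("<H", extra[i + 2:i + 4])[0]
                if extra[i:i + 2] == b"BC" and slen == 2:
                    bsize = struct.unpack("<H", extra[i + 4:i + 6])[0]
                i += 4 + slen
            if bsize is None:
                raise ValueError(f"no BC subfield at offset {offset} (not BGZF?)")
            cdata = fh.read(bsize + 1 - 12 - xlen - 8)
            crc, isize = struct.unpack("<II", fh.read(8))
            data = zlib.decompress(cdata, wbits=-15)
            if len(data) != isize or zlib.crc32(data) != crc:
                raise ValueError(f"CRC mismatch in block at offset {offset}")
            blocks += 1
```

If the reported offset is reproducible across re-runs of the scan but the basecalling run itself produces a bad block at a different offset each time, that would point toward the storage/transfer layer rather than a deterministic writer bug.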

@HalfPhoton
Collaborator

Can you reproduce the issue with a single read, with/without the reference (alignment)?

@uribertocchi
Author

Basecalling works fine for individual reads or small pod5/fast5 files. However, the issue arises when basecalling many reads and aligning them into BAM or SAM output. I have successfully produced working BAM or SAM files up to 30GB, but larger files fail. Unfortunately, most of my experiments exceed 60GB, so most of my files end up corrupted.

@HalfPhoton
Collaborator

Hi @uribertocchi,
I believe this problem is likely to be an issue with your system or network storage configuration and not an issue with Dorado.

Could you check the following:

  • Are you using a recent version of samtools?
  • Can you check that your BAM files pass samtools quickcheck <FILE>?
  • Are you able to basecall 2+ ~30G datasets and samtools merge them into a valid BAM file?
  • Does your system / network file store have a maximum file size?
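On the quickcheck point: among other things, samtools quickcheck verifies that the file ends with the fixed 28-byte BGZF EOF marker, which a truncated or interrupted write would lose. That one check can be sketched in a few lines of Python (the marker bytes are the EOF block defined in the SAM/BAM specification):

```python
# The fixed 28-byte BGZF end-of-file block from the SAM/BAM specification.
BGZF_EOF = bytes.fromhex(
    "1f8b08040000000000ff0600424302001b0003000000000000000000"
)

def has_bgzf_eof(path):
    """True if `path` ends with the BGZF end-of-file marker."""
    with open(path, "rb") as fh:
        fh.seek(0, 2)                  # seek to end of file
        if fh.tell() < 28:
            return False
        fh.seek(-28, 2)                # back up over the last 28 bytes
        return fh.read(28) == BGZF_EOF
```

Note this only catches truncation at the tail; a CRC error in the middle of the file (as in your logs) would still pass this check, so run the full quickcheck as well.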

Kind regards,
Rich

@uribertocchi
Author

Thank you, Rich. I'll check everything soon. How can I limit the size of the pod5 folder to 30GB, as it currently generates about 80GB?

@HalfPhoton
Collaborator

HalfPhoton commented Jan 2, 2025

No worries @uribertocchi - This is quite a puzzling issue that I've not come across and which hasn't been reported before as far as I can see. Hopefully we can get to the bottom of it.

To create smaller outputs you can either:

  • Batch your input data into smaller datasets (i.e. avoid using --recursive and collect input data into collections of N pod5 files). This could be easiest if your input data is already a collection of small datasets and it's just a matter of basecalling each separately.
  • Specify batches of reads in the form of lists of read ids to basecall from your overall datasets. Continue using --recursive but also use --read-ids <FILE> for N lists of read ids. You can create batches of read ids using the bash snippet below:
    # Set BATCH_SIZE_SAMPLES: 100G samples is roughly 10GB of output BAM
    BATCH_SIZE_SAMPLES=100000000000

    pod5 view -H --include "read_id, num_samples" --recursive --output view.txt reads/

    # Tally the 2nd column (num_samples) and calculate the number of sub-batches
    TOTAL_SAMPLES=$(awk --bignum '{sum+=$2;} END {print sum;}' view.txt)
    BATCHES=$(($TOTAL_SAMPLES / $BATCH_SIZE_SAMPLES  ))
    BATCHES=$(($BATCHES + 1 ))

    # Extract only the read ids (first column)
    awk '{print $1}' view.txt > read_ids.txt
    # split the read ids into separate files approximately even batches
    split read_ids.txt -n l/$BATCHES ids. -a 4 -d --additional-suffix .txt

Then call dorado in a loop over the id files, writing to separate output.$N.bam files:

for F in $(find -maxdepth 1 -iname "ids*txt"); do 
    N=$(echo $F | grep -oP "\d+")
    echo $N $F
    dorado basecaller hac,6mA /path/to/pod5_pass/ --recursive --reference /path/to/reference.fa -x cuda:all --read-ids $F > output.${N}.bam
done
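For reference, split just divides read_ids.txt into evenly sized line-count batches. If more even output BAM sizes matter, a sample-aware variant can be sketched in Python; the rows are (read_id, num_samples) pairs as in view.txt, and batch_read_ids is a hypothetical helper, not part of the pod5 tooling:

```python
def batch_read_ids(rows, max_samples):
    """Greedily group (read_id, num_samples) rows so that each batch's
    total sample count stays at or under max_samples (oversized single
    reads still get a batch of their own)."""
    batches, current, total = [], [], 0
    for read_id, n in rows:
        if current and total + n > max_samples:
            batches.append(current)    # flush the full batch
            current, total = [], 0
        current.append(read_id)
        total += n
    if current:
        batches.append(current)        # flush the trailing partial batch
    return batches
```

Each returned batch can then be written to its own ids.N.txt file and passed to --read-ids exactly as in the loop above.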

Best regards,
Rich
