
Difference between the rna004_130bps_sup@v5.0.0 and rna004_130bps_sup@v5.1.0? #1208

Open
VBHarrisN opened this issue Jan 3, 2025 · 6 comments
Labels
performance Issues related to basecalling performance

Comments

@VBHarrisN

Just curious: what is the difference between the rna004_130bps_sup@v5.0.0 and rna004_130bps_sup@v5.1.0 models? We noticed that the new v5.1.0 model takes approximately 4x as long to run as the v5.0.0 model on the same data. We are using a fairly standard RTX A5000 GPU. From a brief glance, the v5.1.0 model is not noted in the model list on GitHub nor in the changelog. Any clarification would be appreciated!

@HalfPhoton
Collaborator

Hi @VBHarrisN, there's no architectural difference between the rna004_130bps_sup v5.0.0 and v5.1.0 models, so they should run at the same speed. The v5.1.0 model gained its improvements from better training.

The models are listed on both the README and the Dorado-docs Model List.

To help us resolve the unexpected performance regression, can you share the following information requested in the issue template (an example of capturing a verbose log is sketched after the list):

  • Hardware specs
  • Dorado versions used
  • Verbose Dorado logs for both runs
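
As a minimal sketch (the paths here are placeholders, not taken from your setup), a verbose log can be captured by redirecting stderr, since dorado basecaller writes the calls to stdout and its log messages to stderr:

# Capture BAM output and a verbose log for one run
dorado basecaller --verbose \
    /path/to/models/rna004_130bps_sup@v5.1.0 \
    /path/to/pod5 \
    > calls_v5.1.0.bam 2> dorado_v5.1.0.log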

Kind regards,
Rich

@HalfPhoton HalfPhoton added the performance Issues related to basecalling performance label Jan 6, 2025
@VBHarrisN
Author

Run environment:
Dorado version: 0.9.0

Operating system: WSL

Hardware (CPUs, Memory, GPUs): 13th Gen Intel(R) i9-13900K, 64GB, NVIDIA RTX A5000

v5.1.0 verbose output:

(base) user@vernal:~$ sh dorado.sh
[2025-01-06 09:58:44.196] [info] Running: "basecaller" "--verbose" "--estimate-poly-a" "packages/dorado-0.9.0-linux-x64/models/rna004_130bps_sup@v5.1.0" "/filepath5/pod5"
[2025-01-06 09:58:44.239] [info] Normalised: overlap 500 -> 492
[2025-01-06 09:58:44.239] [info] > Creating basecall pipeline
[2025-01-06 09:58:44.245] [info]  - BAM format does not support `U`, so RNA output files will include `T` instead of `U` for all file types.
[2025-01-06 09:58:44.556] [debug] TxEncoderStack: use_koi_tiled false.
[2025-01-06 09:58:45.906] [debug] cuda:0 memory available: 24.19GB
[2025-01-06 09:58:45.906] [debug] cuda:0 memory limit 23.19GB
[2025-01-06 09:58:45.906] [debug] cuda:0 maximum safe estimated batch size at chunk size 18432 is 192
[2025-01-06 09:58:45.906] [debug] cuda:0 maximum safe estimated batch size at chunk size 9216 is 384
[2025-01-06 09:58:45.906] [debug] Auto batchsize cuda:0: testing up to 384 in steps of 32
[2025-01-06 09:58:45.907] [info] Calculating optimized batch size for GPU "NVIDIA RTX A5000" and model packages/dorado-0.9.0-linux-x64/models/rna004_130bps_sup@v5.1.0. Full benchmarking will run for this device, which may take some time.
[2025-01-06 09:58:46.170] [debug] Auto batchsize cuda:0: 32, time per chunk 1.164888 ms
[2025-01-06 09:58:46.314] [debug] Auto batchsize cuda:0: 64, time per chunk 0.875709 ms
[2025-01-06 09:58:46.488] [debug] Auto batchsize cuda:0: 96, time per chunk 0.800405 ms
[2025-01-06 09:58:46.737] [debug] Auto batchsize cuda:0: 128, time per chunk 0.776959 ms
[2025-01-06 09:58:47.007] [debug] Auto batchsize cuda:0: 160, time per chunk 0.772620 ms
[2025-01-06 09:58:47.323] [debug] Auto batchsize cuda:0: 192, time per chunk 0.760912 ms
[2025-01-06 09:58:47.680] [debug] Auto batchsize cuda:0: 224, time per chunk 0.743785 ms
[2025-01-06 09:58:48.085] [debug] Auto batchsize cuda:0: 256, time per chunk 0.739160 ms
[2025-01-06 09:58:48.544] [debug] Auto batchsize cuda:0: 288, time per chunk 0.740661 ms
[2025-01-06 09:58:49.054] [debug] Auto batchsize cuda:0: 320, time per chunk 0.741052 ms
[2025-01-06 09:58:49.612] [debug] Auto batchsize cuda:0: 352, time per chunk 0.738432 ms
[2025-01-06 09:58:50.218] [debug] Auto batchsize cuda:0: 384, time per chunk 0.739597 ms
[2025-01-06 09:58:50.218] [debug] Adding chunk timings to internal cache for GPU NVIDIA RTX A5000, model packages/dorado-0.9.0-linux-x64/models/rna004_130bps_sup@v5.1.0 (9 entries)
[2025-01-06 09:58:50.218] [debug] Largest batch size for cuda:0: 352, time per chunk 0.738432 ms
[2025-01-06 09:58:50.218] [debug] Final batch size for cuda:0[0]: 192
[2025-01-06 09:58:50.218] [debug] Final batch size for cuda:0[1]: 352
[2025-01-06 09:58:50.218] [info] cuda:0 using chunk size 18432, batch size 192
[2025-01-06 09:58:50.218] [debug] cuda:0 Model memory 20.55GB
[2025-01-06 09:58:50.218] [debug] cuda:0 Decode memory 2.50GB
[2025-01-06 09:58:51.827] [info] cuda:0 using chunk size 9216, batch size 352
[2025-01-06 09:58:51.827] [debug] cuda:0 Model memory 18.84GB
[2025-01-06 09:58:51.827] [debug] cuda:0 Decode memory 2.29GB
[2025-01-06 09:59:18.061] [debug] BasecallerNode chunk size 18432
[2025-01-06 09:59:18.061] [debug] BasecallerNode chunk size 9216
[2025-01-06 09:59:18.076] [debug] Load reads from file /filepath5/pod5/FBA14993_aee84685_df2dbdb3_2.pod5
[2025-01-06 10:36:35.734] [debug] Load reads from file /filepath5/pod5/FBA14993_aee84685_df2dbdb3_8.pod5
[2025-01-06 11:01:29.383] [debug] Load reads from file /filepath5/pod5/FBA14993_aee84685_df2dbdb3_3.pod5
[2025-01-06 11:39:01.033] [debug] Load reads from file /filepath5/pod5/FBA14993_aee84685_df2dbdb3_1.pod5
[2025-01-06 12:13:57.810] [debug] Load reads from file /filepath5/pod5/FBA14993_aee84685_df2dbdb3_6.pod5
[2025-01-06 12:47:54.882] [debug] Load reads from file /filepath5/pod5/FBA14993_aee84685_df2dbdb3_4.pod5
[2025-01-06 13:23:53.249] [debug] Load reads from file /filepath5/pod5/FBA14993_aee84685_df2dbdb3_5.pod5
[2025-01-06 13:58:21.834] [debug] Load reads from file /filepath5/pod5/FBA14993_aee84685_df2dbdb3_7.pod5
[2025-01-06 14:29:38.065] [debug] Load reads from file /filepath5/pod5/FBA14993_aee84685_df2dbdb3_0.pod5
[2025-01-06 15:15:05.324] [info] > Finished in (ms): 18947228
[2025-01-06 15:15:05.325] [info] > Simplex reads basecalled: 515863
[2025-01-06 15:15:05.325] [info] > Simplex reads filtered: 10310
[2025-01-06 15:15:05.325] [info] > Basecalled @ Samples/s: 3.820545e+05
[2025-01-06 15:15:05.325] [debug] > Including Padding @ Samples/s: 5.666e+05 (67.43%)
[2025-01-06 15:15:05.325] [debug] PolyA tail length distribution :
[Very long poly(A) tail length histogram omitted]
[2025-01-06 15:15:05.326] [info] > PolyA tails called 455749, not called 63956, avg tail length 44
[2025-01-06 15:15:05.326] [info] Modal tail length 22
[2025-01-06 15:15:05.379] [info] > Finished

v5.0.0 verbose output:

(base) user@vernal:~$ sh dorado.sh
[2025-01-06 15:21:44.116] [info] Running: "basecaller" "--verbose" "--estimate-poly-a" "user/packages/dorado-0.9.0-linux-x64/models/rna004_130bps_sup@v5.0.0" "filepath85/pod5"
[2025-01-06 15:21:44.160] [info] Normalised: overlap 500 -> 492
[2025-01-06 15:21:44.160] [info] > Creating basecall pipeline
[2025-01-06 15:21:44.163] [info]  - BAM format does not support `U`, so RNA output files will include `T` instead of `U` for all file types.
[2025-01-06 15:21:44.537] [debug] TxEncoderStack: use_koi_tiled false.
[2025-01-06 15:21:46.252] [debug] cuda:0 memory available: 24.19GB
[2025-01-06 15:21:46.252] [debug] cuda:0 memory limit 23.19GB
[2025-01-06 15:21:46.252] [debug] cuda:0 maximum safe estimated batch size at chunk size 18432 is 192
[2025-01-06 15:21:46.252] [debug] cuda:0 maximum safe estimated batch size at chunk size 9216 is 384
[2025-01-06 15:21:46.252] [debug] Auto batchsize cuda:0: testing up to 384 in steps of 32
[2025-01-06 15:21:46.252] [info] Calculating optimized batch size for GPU "NVIDIA RTX A5000" and model user/packages/dorado-0.9.0-linux-x64/models/rna004_130bps_sup@v5.0.0. Full benchmarking will run for this device, which may take some time.
[2025-01-06 15:21:46.514] [debug] Auto batchsize cuda:0: 32, time per chunk 1.078368 ms
[2025-01-06 15:21:46.655] [debug] Auto batchsize cuda:0: 64, time per chunk 0.859280 ms
[2025-01-06 15:21:46.824] [debug] Auto batchsize cuda:0: 96, time per chunk 0.800256 ms
[2025-01-06 15:21:47.043] [debug] Auto batchsize cuda:0: 128, time per chunk 0.783464 ms
[2025-01-06 15:21:47.312] [debug] Auto batchsize cuda:0: 160, time per chunk 0.773459 ms
[2025-01-06 15:21:47.626] [debug] Auto batchsize cuda:0: 192, time per chunk 0.761323 ms
[2025-01-06 15:21:47.984] [debug] Auto batchsize cuda:0: 224, time per chunk 0.745755 ms
[2025-01-06 15:21:48.391] [debug] Auto batchsize cuda:0: 256, time per chunk 0.740287 ms
[2025-01-06 15:21:48.851] [debug] Auto batchsize cuda:0: 288, time per chunk 0.742794 ms
[2025-01-06 15:21:49.364] [debug] Auto batchsize cuda:0: 320, time per chunk 0.745488 ms
[2025-01-06 15:21:49.926] [debug] Auto batchsize cuda:0: 352, time per chunk 0.741702 ms
[2025-01-06 15:21:50.536] [debug] Auto batchsize cuda:0: 384, time per chunk 0.743141 ms
[2025-01-06 15:21:50.536] [debug] Adding chunk timings to internal cache for GPU NVIDIA RTX A5000, model user/packages/dorado-0.9.0-linux-x64/models/rna004_130bps_sup@v5.0.0 (8 entries)
[2025-01-06 15:21:50.536] [debug] Largest batch size for cuda:0: 256, time per chunk 0.740287 ms
[2025-01-06 15:21:50.536] [debug] Final batch size for cuda:0[0]: 192
[2025-01-06 15:21:50.536] [debug] Final batch size for cuda:0[1]: 256
[2025-01-06 15:21:50.536] [info] cuda:0 using chunk size 18432, batch size 192
[2025-01-06 15:21:50.536] [debug] cuda:0 Model memory 20.55GB
[2025-01-06 15:21:50.536] [debug] cuda:0 Decode memory 2.50GB
[2025-01-06 15:21:52.159] [info] cuda:0 using chunk size 9216, batch size 256
[2025-01-06 15:21:52.159] [debug] cuda:0 Model memory 13.70GB
[2025-01-06 15:21:52.159] [debug] cuda:0 Decode memory 1.66GB
[2025-01-06 15:21:53.958] [debug] BasecallerNode chunk size 18432
[2025-01-06 15:21:53.958] [debug] BasecallerNode chunk size 9216
[2025-01-06 15:21:53.974] [debug] Load reads from file filepath85/pod5/FBA14993_aee84685_df2dbdb3_2.pod5
[2025-01-06 15:32:11.704] [debug] Load reads from file filepath85/pod5/FBA14993_aee84685_df2dbdb3_8.pod5
[2025-01-06 15:40:29.447] [debug] Load reads from file filepath85/pod5/FBA14993_aee84685_df2dbdb3_3.pod5
[2025-01-06 15:51:34.083] [debug] Load reads from file filepath85/pod5/FBA14993_aee84685_df2dbdb3_1.pod5
[2025-01-06 16:02:37.599] [debug] Load reads from file filepath85/pod5/FBA14993_aee84685_df2dbdb3_6.pod5
[2025-01-06 16:13:08.334] [debug] Load reads from file filepath85/pod5/FBA14993_aee84685_df2dbdb3_4.pod5
[2025-01-06 16:24:06.713] [debug] Load reads from file filepath85/pod5/FBA14993_aee84685_df2dbdb3_5.pod5
[2025-01-06 16:34:59.493] [debug] Load reads from file filepath85/pod5/FBA14993_aee84685_df2dbdb3_7.pod5
[2025-01-06 16:45:39.284] [debug] Load reads from file filepath85/pod5/FBA14993_aee84685_df2dbdb3_0.pod5
[2025-01-06 16:57:35.822] [info] > Finished in (ms): 5741827
[2025-01-06 16:57:35.822] [info] > Simplex reads basecalled: 525582
[2025-01-06 16:57:35.822] [info] > Simplex reads filtered: 727
[2025-01-06 16:57:35.822] [info] > Basecalled @ Samples/s: 1.260726e+06
[2025-01-06 16:57:35.822] [debug] > Including Padding @ Samples/s: 1.870e+06 (67.43%)
[2025-01-06 16:57:35.822] [debug] PolyA tail length distribution :
[Poly(A) tail length distribution omitted]
[2025-01-06 16:57:35.825] [info] > PolyA tails called 467641, not called 61647, avg tail length 50
[2025-01-06 16:57:35.825] [info] Modal tail length 41
[2025-01-06 16:57:35.865] [info] > Finished

The only difference I notice is that the CUDA batch size is much larger in the v5.1.0 output (352 vs 256 at chunk size 9216), which seems like it would make it faster, not slower.
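
For reference, the overall slowdown implied by the two full-dataset runs can be checked directly from the logs above (a quick sketch; the values are copied verbatim from the "Finished in (ms)" and "Samples/s" lines):

# Wall-clock ratio, v5.1.0 vs v5.0.0 full-dataset runs
echo "scale=3; 18947228 / 5741827" | bc      # ≈ 3.3x
# Throughput ratio from the Samples/s lines
echo "scale=3; 1260726 / 382054.5" | bc      # ≈ 3.3x

So the two full runs differ by roughly 3.3x.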

I also can't help but notice that the modal tail length from the poly(A) estimation is very different between the two models (22 vs 41).

Let me know if I can provide any more information!

@HalfPhoton
Collaborator

HalfPhoton commented Jan 7, 2025

Hi @VBHarrisN,

Thank you for the detailed logs - we'll look into this and get back to you once we've investigated.


Edit:

Could you run this side-by-side benchmark (a sketch of the commands follows the list):

  • Single pod5 file - Use the same file in both cases
  • Disable poly-A estimation
  • Fix the batch size using: --batchsize 192
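
A minimal sketch of the two runs under those constraints, assuming placeholder paths for the models and the single pod5 file (poly-A estimation is disabled simply by omitting --estimate-poly-a):

# v5.0.0: fixed batch size, single file, no poly-A estimation
dorado basecaller --verbose --batchsize 192 \
    /path/to/models/rna004_130bps_sup@v5.0.0 \
    /path/to/reads.pod5 > calls_v5.0.0.bam 2> v5.0.0.log

# v5.1.0: identical settings and input
dorado basecaller --verbose --batchsize 192 \
    /path/to/models/rna004_130bps_sup@v5.1.0 \
    /path/to/reads.pod5 > calls_v5.1.0.bam 2> v5.1.0.log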

Best regards,
Rich

@VBHarrisN
Author

Here is v5.0.0:

[2025-01-07 10:14:00.870] [info] Running: "basecaller" "--batchsize" "192" "--verbose" "filepath/packages/dorado-0.9.0-linux-x64/models/rna004_130bps_sup@v5.0.0" "filepath85/pod5/FBA14993_aee84685_df2dbdb3_4.pod5"
[2025-01-07 10:14:00.929] [info] Normalised: overlap 500 -> 492
[2025-01-07 10:14:00.929] [info] > Creating basecall pipeline
[2025-01-07 10:14:00.943] [info]  - BAM format does not support `U`, so RNA output files will include `T` instead of `U` for all file types.
[2025-01-07 10:14:01.265] [debug] TxEncoderStack: use_koi_tiled false.
[2025-01-07 10:14:01.937] [debug] cuda:0 memory available: 24.19GB
[2025-01-07 10:14:01.937] [debug] cuda:0 memory limit 23.19GB
[2025-01-07 10:14:01.937] [debug] cuda:0 maximum safe estimated batch size at chunk size 18432 is 192
[2025-01-07 10:14:01.937] [debug] cuda:0 maximum safe estimated batch size at chunk size 9216 is 384
[2025-01-07 10:14:01.937] [info] cuda:0 using chunk size 18432, batch size 192
[2025-01-07 10:14:01.937] [debug] cuda:0 Model memory 20.55GB
[2025-01-07 10:14:01.937] [debug] cuda:0 Decode memory 2.50GB
[2025-01-07 10:14:03.863] [info] cuda:0 using chunk size 9216, batch size 192
[2025-01-07 10:14:03.863] [debug] cuda:0 Model memory 10.28GB
[2025-01-07 10:14:03.863] [debug] cuda:0 Decode memory 1.25GB
[2025-01-07 10:14:04.694] [debug] BasecallerNode chunk size 18432
[2025-01-07 10:14:04.694] [debug] BasecallerNode chunk size 9216
[2025-01-07 10:14:04.712] [debug] Load reads from file filepath85/pod5/FBA14993_aee84685_df2dbdb3_4.pod5
[2025-01-07 10:23:23.367] [info] > Finished in (ms): 558643
[2025-01-07 10:23:23.367] [info] > Simplex reads basecalled: 59876
[2025-01-07 10:23:23.367] [info] > Simplex reads filtered: 60
[2025-01-07 10:23:23.367] [info] > Basecalled @ Samples/s: 1.479424e+06
[2025-01-07 10:23:23.367] [debug] > Including Padding @ Samples/s: 2.195e+06 (67.40%)
[2025-01-07 10:23:23.371] [info] > Finished

Here is v5.1.0:

[2025-01-07 10:23:24.003] [info] Running: "basecaller" "--batchsize" "192" "--verbose" "filepath/packages/dorado-0.9.0-linux-x64/models/rna004_130bps_sup@v5.1.0" "filepath85/pod5/FBA14993_aee84685_df2dbdb3_4.pod5"
[2025-01-07 10:23:24.040] [info] Normalised: overlap 500 -> 492
[2025-01-07 10:23:24.040] [info] > Creating basecall pipeline
[2025-01-07 10:23:24.041] [info]  - BAM format does not support `U`, so RNA output files will include `T` instead of `U` for all file types.
[2025-01-07 10:23:24.240] [debug] TxEncoderStack: use_koi_tiled false.
[2025-01-07 10:23:24.828] [debug] cuda:0 memory available: 24.19GB
[2025-01-07 10:23:24.828] [debug] cuda:0 memory limit 23.19GB
[2025-01-07 10:23:24.828] [debug] cuda:0 maximum safe estimated batch size at chunk size 18432 is 192
[2025-01-07 10:23:24.828] [debug] cuda:0 maximum safe estimated batch size at chunk size 9216 is 384
[2025-01-07 10:23:24.828] [info] cuda:0 using chunk size 18432, batch size 192
[2025-01-07 10:23:24.828] [debug] cuda:0 Model memory 20.55GB
[2025-01-07 10:23:24.828] [debug] cuda:0 Decode memory 2.50GB
[2025-01-07 10:23:26.669] [info] cuda:0 using chunk size 9216, batch size 192
[2025-01-07 10:23:26.669] [debug] cuda:0 Model memory 10.28GB
[2025-01-07 10:23:26.669] [debug] cuda:0 Decode memory 1.25GB
[2025-01-07 10:23:27.556] [debug] BasecallerNode chunk size 18432
[2025-01-07 10:23:27.556] [debug] BasecallerNode chunk size 9216
[2025-01-07 10:23:27.569] [debug] Load reads from file filepath85/pod5/FBA14993_aee84685_df2dbdb3_4.pod5
[2025-01-07 10:32:53.650] [info] > Finished in (ms): 566059
[2025-01-07 10:32:53.650] [info] > Simplex reads basecalled: 58763
[2025-01-07 10:32:53.650] [info] > Simplex reads filtered: 1159
[2025-01-07 10:32:53.650] [info] > Basecalled @ Samples/s: 1.460042e+06
[2025-01-07 10:32:53.650] [debug] > Including Padding @ Samples/s: 2.166e+06 (67.40%)
[2025-01-07 10:32:53.657] [info] > Finished
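
Comparing the two fixed-settings runs above (values copied from the "Finished in (ms)" lines), the gap all but disappears:

# Single-file, fixed batch size, no poly-A estimation
echo "scale=3; 566059 / 558643" | bc      # ≈ 1.013, i.e. v5.1.0 only ~1.3% slower here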

@malton-ont
Collaborator

@VBHarrisN,

Those numbers suggest that it is the polyA calculation that is causing the difference in speed, rather than the models themselves. The amount of time polyA estimation takes will be somewhat data-dependent, but a 60% slowdown seems excessive. Dorado 0.9.0 should already be refusing to estimate reads that have selected an implausibly large region to search for the polyA signal, but it's possible this needs tightening up.

Are you able to isolate a subset of reads that replicates this? It looks like some of the data in filepath5 should show the problem.
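
If a shareable reproducer becomes feasible, one way to cut a pod5 down to a handful of reads is the pod5 tools' filter subcommand (not discussed in this thread; a sketch assuming the pod5 package is installed and read_ids.txt lists the read IDs of interest, one per line):

# Extract only the listed reads into a smaller pod5 for sharing
# (assumes pod5 filter supports --ids/--output; check `pod5 filter --help`)
pod5 filter input.pod5 --ids read_ids.txt --output subset.pod5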

@VBHarrisN
Author

VBHarrisN commented Jan 15, 2025

Unfortunately we are unable to share that data, as it is sensitive information. However, I can say that even with the poly-A estimation flag turned off, the v5.1.0 model still takes about 60% longer when run on the full dataset. I can also confirm that this happens on other datasets, not just this most recent one.

Short of sharing the data, we are happy to assist in any capacity with resolving this issue!
