Adding more species takes less time - unclear why? #955
Comments
Hi Eric, that's definitely interesting! If you have the log files, we can try to work out which stage of OrthoFinder is the bottleneck for the 51-species run.
Great! I've made the logs and scripts available here: google drive

More confusing, and in contrast: Species67 with Diamond default and 100 CPUs is much faster than Species67 with Diamond ultrasensitive and 125 CPUs. This is the opposite of what Species51 shows with the two Diamond modes. In all cases, the time variability seems to arise in the final MSA step, in ways that look inconsistent given the data and tools going in. There must be something I am overlooking in the details, or missing in the bigger picture, to make sense of this.

Species51 Diamond Default 125 CPUs 1000 GB memory = 35,000+ MSAs at 5 days 6 hours final MSA step
Species51 Diamond Ultrasensitive 125 CPUs 600 GB memory = 34,000+ MSAs at 2 days 3 hours final MSA step
Species67 Diamond Default 100 CPUs 750 GB memory = 43,000+ MSAs at 2 days 15 hours final MSA step
Species67 Diamond Ultrasensitive 125 CPUs 750 GB memory = 42,000+ MSAs at 9 days+ and still running
Thanks for sharing the log files. I think I have an explanation for what is happening.

Diamond ultra-sensitive tends to make (marginally) fewer and smaller orthogroups. When it comes to multiple sequence alignment with MAFFT, the limiting factor is often the size of the largest orthogroups; those alignments take a very long time to run. Diamond ultra-sensitive is slow for the all-versus-all search, but the alignment step ends up being quicker because there are fewer super-large orthogroups.

My recommendation would be to switch to FAMSA for alignment instead of MAFFT (see https://github.com/davidemms/OrthoFinder?tab=readme-ov-file#configjson--adding-addtional-programs-for-tree-inference-local-alignment-or-msa). FAMSA is much quicker at alignment, so you don't run into the same issue (and it will soon become the default option in the new version of OrthoFinder).
Thank you for the quick and detailed explanation! I agree on Diamond sensitive vs ultrasensitive; that makes sense. I can check whether the largest orthogroup in Species51 Diamond sensitive is larger than in Species67 Diamond sensitive. That would be the prediction, I guess, to explain the unexpected time difference. I can update here later...

I'll give FAMSA a try. Do you think it is better in all cases, even when there are only tens or a few hundred sequences? Somehow I had the idea that it was meant for large-scale alignment but that MAFFT was best at small scales. I have no idea where I got that impression. Thank you again :)
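For anyone wanting to run the same check, here is a minimal Python sketch that lists the largest orthogroups in a run. It assumes the run wrote a tab-separated gene-count table with an `Orthogroup` column, one column per species, and optionally a `Total` column (as in `Orthogroups.GeneCount.tsv` from OrthoFinder 2; the file name and location in OrthoFinder 3 may differ). The function name `largest_orthogroups` is made up for illustration.

```python
import csv
import io


def largest_orthogroups(tsv_text, top_n=5):
    """Return the top_n (orthogroup, gene_count) pairs, largest first.

    Assumes a tab-separated table with an 'Orthogroup' column, one count
    column per species, and optionally a 'Total' column.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    sizes = []
    for row in reader:
        name = row["Orthogroup"]
        if row.get("Total"):
            # Use the precomputed total when the table provides one.
            total = int(row["Total"])
        else:
            # Otherwise sum the per-species counts.
            total = sum(
                int(value)
                for key, value in row.items()
                if key not in ("Orthogroup", "Total") and value
            )
        sizes.append((name, total))
    sizes.sort(key=lambda item: item[1], reverse=True)
    return sizes[:top_n]


if __name__ == "__main__":
    import sys

    # Pass one gene-count table per run to compare them side by side.
    for path in sys.argv[1:]:
        with open(path, newline="") as handle:
            top = largest_orthogroups(handle.read())
        print(path)
        for name, total in top:
            print(f"  {name}\t{total}")
```

Running it on the gene-count table from each of the four runs would show whether the slow runs are the ones with the largest orthogroups, which is what the explanation above predicts.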
Hi!
I'm running OrthoFinder 3 (3.0.1b1) on two sets of eukaryotic genomes, species51 and species67, where species51 is a subset of species67. I've run them both twice as core with slight differences in the user-provided species tree, and both times species67 finishes in 4 days while species51 takes 6 days.
Here is an example command line:
orthofinder -t 100 -a 100 -M msa -X -s input/Species67-hand_built_rooted_newick-14dec2024 -S diamond -A mafft -T fasttree -f input/species67-T1-protomes -n orthofinder_species67_fast
I was wondering what might be causing this substantial difference in time between the two species sets - and why the species set with all the same and more species is finishing significantly faster. Would like to make sure things are ok / make sense - as this seems counter-intuitive to me.
Thank you! Eric