Adding more species takes less time - unclear why? #955

Open
000generic opened this issue Dec 23, 2024 · 4 comments

@000generic

000generic commented Dec 23, 2024

Hi!

I'm running OrthoFinder 3 (3.0.1b1) on two sets of eukaryotic genomes - species51 and species67, where species51 is a subset of species67. I've run them both twice as core with slight differences in the user-provided species tree - and both times species67 finished in 4 days, while species51 took 6 days.

Here is an example command line:

orthofinder -t 100 -a 100 -M msa -X -s input/Species67-hand_built_rooted_newick-14dec2024 -S diamond -A mafft -T fasttree -f input/species67-T1-protomes -n orthofinder_species67_fast

I was wondering what might be causing this substantial difference in time between the two species sets - and why the set containing all the same species plus more finishes significantly faster. I would like to make sure things are ok / make sense - as this seems counter-intuitive to me.

Thank you! Eric

@lauriebelch

lauriebelch commented Jan 6, 2025

Hi Eric,

That's definitely interesting! If you have the log files, we can try to work out which stage of OrthoFinder is the bottleneck for the 51 species.

@000generic
Author

000generic commented Jan 6, 2025

Great! I've made the logs and scripts available here:

google drive
The time taken by the final MSA step seems inconsistent. Since I first posted this issue, Species51 with diamond ultrasensitive and 125 CPUs finished faster than Species51 with diamond default and 125 CPUs - despite diamond ultrasensitive being slower than diamond default, and with no other major differences between the two runs. This is similar to Species51 with diamond default and 125 CPUs vs Species67 with diamond default and 100 CPUs - here again, it doesn't make sense to me given the number of species and now CPUs (the run with fewer CPUs and more species finishes faster).

More confusing - and in contrast - Species67 with diamond default and 100 CPUs is much faster than Species67 with diamond ultrasensitive and 125 CPUs - this is the opposite of Species51 using the two versions of Diamond.

In all cases, the time variability seems to arise in the final MSA step, in ways that seem inconsistent with the data and tools going in.

There must be something I am overlooking in the details - or missing in the bigger picture - to make sense of this...?

Species51 Diamond Default 125 CPUs 1000 Gb memory = 35,000+ MSAs at 5 days 6 hours final MSA step

Wed Dec 18 00:03:31 EST 2024
Running Species51 on 125 CPU cores 1000 Gb memory
OrthoFinder version 3.0.1b1 Copyright (C) 2014 David Emms
...
Inferring multiple sequence alignments and gene trees
-----------------------------------------------------
2024-12-18 03:01:27 : Done 0 of 35357
2024-12-18 22:04:09 : Done 1000 of 35357
...
2024-12-19 04:50:07 : Done 35000 of 35357
2024-12-24 10:58:04 : Done MSA/Trees
...
OrthoFinder assigned 968455 genes (89.6% of total) to 36662 orthogroups

Species51 Diamond Ultrasensitive 125 CPUs 600 Gb memory = 34,000+ MSAs at 2 days 3 hours final MSA step

Wed Dec 25 20:07:54 EST 2024
Running Species51 on 125 CPU cores 600 Gb memory 
OrthoFinder version 3.0.1b1 Copyright (C) 2014 David Emms
...
Inferring multiple sequence alignments and gene trees
-----------------------------------------------------
2024-12-26 03:33:41 : Done 0 of 34436
2024-12-27 00:18:11 : Done 1000 of 34436
...
2024-12-27 06:46:14 : Done 34000 of 34436
2024-12-29 09:43:33 : Done MSA/Trees
...
OrthoFinder assigned 971862 genes (90.0% of total) to 35883 orthogroups.

Species67 Diamond Default 100 CPUs 750 Gb memory = 43,000+ MSAs at 2 days 15 hours final MSA step

Thu Dec 19 07:02:27 EST 2024
Running Species67 on 100 CPU cores 750 Gb memory
OrthoFinder version 3.0.1b1 Copyright (C) 2014 David Emms
...
Inferring multiple sequence alignments and gene trees
-----------------------------------------------------
2024-12-19 11:22:09 : Done 0 of 43809
2024-12-20 11:16:03 : Done 1000 of 43809
...
2024-12-20 21:04:46 : Done 43000 of 43809
2024-12-23 12:07:56 : Done MSA/Trees
...
OrthoFinder assigned 1191685 genes (89.5% of total) to 49297 orthogroups.

Species67 Diamond Ultrasensitive 125 CPUs 750 Gb memory = 42,000+ MSAs at 9 days+ and still running

Thu Dec 26 12:56:05 EST 2024
Running Species67 on 125 CPU cores 750 Gb memory
OrthoFinder version 3.0.1b1 Copyright (C) 2014 David Emms
...
Inferring multiple sequence alignments and gene trees
-----------------------------------------------------
2024-12-26 23:50:19 : Done 0 of 42370
2024-12-27 21:52:44 : Done 1000 of 42370
2024-12-28 06:37:06 : Done 42000 of 42370
....STILL RUNNING January 6 2:56 pm
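
The durations in the four headings above appear to be measured from the last `Done N of M` progress line to `Done MSA/Trees`, i.e. the time spent on the final few hundred alignments and trees. A minimal Python sketch for pulling that figure out of a log, assuming the timestamp and progress-line format shown in these excerpts (the function name `msa_tail` and the regexes are illustrative, not part of OrthoFinder):

```python
import re
from datetime import datetime

# Patterns assume the line format seen in the log excerpts in this thread.
PROGRESS = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) : Done (\d+) of (\d+)$")
FINISHED = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) : Done MSA/Trees$")
STAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def msa_tail(log_text):
    """Return (jobs_left_at_last_tick, bulk_time, tail_time) for one log.

    bulk_time: first progress line -> last progress line.
    tail_time: last progress line -> "Done MSA/Trees".
    """
    first_stamp = None
    last_tick = None
    finished = None
    for raw_line in log_text.splitlines():
        line = raw_line.strip()
        match = PROGRESS.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), STAMP_FORMAT)
            if first_stamp is None:
                first_stamp = stamp
            last_tick = (stamp, int(match.group(2)), int(match.group(3)))
            continue
        match = FINISHED.match(line)
        if match:
            finished = datetime.strptime(match.group(1), STAMP_FORMAT)
    if last_tick is None or finished is None:
        raise ValueError("log has no progress lines or has not finished yet")
    stamp, done, total = last_tick
    return total - done, stamp - first_stamp, finished - stamp


# Lines copied from the Species51 diamond default excerpt above.
SPECIES51_DEFAULT = """
2024-12-18 03:01:27 : Done 0 of 35357
2024-12-18 22:04:09 : Done 1000 of 35357
...
2024-12-19 04:50:07 : Done 35000 of 35357
2024-12-24 10:58:04 : Done MSA/Trees
"""

remaining, bulk, tail = msa_tail(SPECIES51_DEFAULT)
print(f"{remaining} jobs left at last tick; bulk {bulk}; tail {tail}")
```

On the Species51 diamond default excerpt this gives 357 remaining jobs and a tail of about 5 days 6 hours, against roughly 1 day 2 hours for the first 35,000.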

@lauriebelch

Thanks for sharing the log files - I think I have an explanation for what is happening

Diamond ultra sensitive tends to make (marginally) fewer and smaller orthogroups. When it comes to multiple sequence alignment with MAFFT, the limiting factor is often the size of the largest orthogroups. Those alignments take a long long time to run.

Diamond ultra-sensitive is slow for the all-versus-all search; however, the alignment step ends up being quicker because there are fewer super-large orthogroups.

My recommendation would be to switch to using FAMSA for alignment instead of MAFFT (see https://github.com/davidemms/OrthoFinder?tab=readme-ov-file#configjson--adding-addtional-programs-for-tree-inference-local-alignment-or-msa)

FAMSA is much quicker at alignment, so you don't run into the same issue (and it will soon become the default option in the new version of orthofinder)

@000generic
Author

Thank you for the quick and detailed explanation!

I agree on diamond sensitive vs ultrasensitive - that makes sense. Thank you!

I can check and see if the largest orthogroup in Species51 Diamond sensitive is larger than for Species67 Diamond sensitive. That would be the prediction, I guess, to explain the unexpected time difference. I can update here later...
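
One way that check could be scripted, as a rough Python sketch. It assumes each run wrote a tab-separated gene-count table with an `Orthogroup` column, one column per species, and a `Total` column (OrthoFinder 2 writes this as `Orthogroups.GeneCount.tsv`; whether 3.0.1b1 uses the same name and layout is an assumption). The function name and the toy table are made up for illustration:

```python
import csv
import io


def largest_orthogroups(gene_count_tsv, top=5):
    """Return the `top` (orthogroup, total_genes) pairs, largest first.

    `gene_count_tsv` is the text of a tab-separated table with an
    "Orthogroup" column and a "Total" column (assumed layout).
    """
    reader = csv.DictReader(io.StringIO(gene_count_tsv), delimiter="\t")
    sizes = [(row["Orthogroup"], int(row["Total"])) for row in reader]
    return sorted(sizes, key=lambda item: item[1], reverse=True)[:top]


# Toy table with invented numbers, only to show the expected shape.
TOY_TABLE = (
    "Orthogroup\tspeciesA\tspeciesB\tTotal\n"
    "OG0000000\t70\t50\t120\n"
    "OG0000001\t3\t2\t5\n"
    "OG0000002\t20\t25\t45\n"
)

for name, size in largest_orthogroups(TOY_TABLE, top=2):
    print(name, size)
```

For real runs, the table text would be read from each results directory and the top few sizes compared between Species51 and Species67.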

I'll give FAMSA a try - do you think it is better in all cases, even when there are only tens or a few hundred sequences? Somehow I had the idea that it was for large-scale alignment and that at small scales MAFFT was best. But I've no idea where I got that impression.

Thank you again :)
