-
Notifications
You must be signed in to change notification settings - Fork 362
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
mag: add pipeline compatible mini test database for upcoming CAT_pack #1408
base: mag
Are you sure you want to change the base?
Conversation
## Generate database using CAT_pack prepare | ||
CAT_pack prepare --db_fasta input_files/sequence_fixedheaders.txt --names input_files/names_reduced.dmp --nodes input_files/nodes_reduced.dmp --acc2tax input_files/accession2taxid_reduced.dmp --db_dir test/ | ||
|
||
## Test using uncompressed contigs from metaspades assembly (note: --no_stars was required for some reason but seesms to only occur when we have a single genome in there possible..) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this be a problem using it when testing in #mag? Or will it be OK to just add --no-stars
to the test config?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The latter as far as I can tell :)
The flag was added for this purpose according to GitHub
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
afaik Maxbin2 will force a split into the contigs, it just assumes there are 2+ genomes (at least that happened a few years ago). But thats fine I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
(It might be good at some point to add details to the README for the provenance of the test datasets...!)
I agree but would need @HadrienG @skrakau @d4straub to do some data archaeology for us 😅 Although I was speaking with @muabnezor about the long read only stuff he adding to mag and we were wondering if it would be worth replacing the test data now to something less mysterious and has paired short and long read data before we implement nf-test... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that we should document where the data comes from. I'm going to speak with Sabrina about that.
Regarding whether that is related to nf-test: I dont think that matters whatsoever, because no normal user is running the tests with nf-test on old versions (usually -profile test without nf-test), and devs might be fine that a past release fails with nf-test because they are usually just interested in the dev branch.
curl "https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?tool=portal&save=file&log$=seqview&db=nuccore&report=fasta_cds_aa&id=1992822979&extrafeat=null&conwithfeat=on&hide-cdd=on&ncbi_phid=CE8C15326D6BB8C10000000006490560" -o sequence.txt | ||
|
||
## use my scripts to filter NCBI nodes/names/acc2taxid files to just a given taxid (basically fancy iterative greps) | ||
bash ~/bin/taxdmp_filter.sh 817 ## for nodes/naames |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bash ~/bin/taxdmp_filter.sh 817 ## for nodes/naames | |
bash ~/bin/taxdmp_filter.sh 817 ## for nodes/names |
sed 's/lcl|//g;s/_/ /2' sequence.txt > sequence_fixedheaders.txt | ||
|
||
## Generate database using CAT_pack prepare | ||
CAT_pack prepare --db_fasta input_files/sequence_fixedheaders.txt --names input_files/names_reduced.dmp --nodes input_files/nodes_reduced.dmp --acc2tax input_files/accession2taxid_reduced.dmp --db_dir test/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CAT_pack prepare --db_fasta input_files/sequence_fixedheaders.txt --names input_files/names_reduced.dmp --nodes input_files/nodes_reduced.dmp --acc2tax input_files/accession2taxid_reduced.dmp --db_dir test/ | |
CAT_pack prepare --db_fasta input_files/sequence_fixedheaders.txt --names input_files/names_reduced.dmp --nodes input_files/nodes_reduced.dmp --acc2tax input_files/accession2taxid_reduced.dmp --db_dir test/ |
## Generate database using CAT_pack prepare | ||
CAT_pack prepare --db_fasta input_files/sequence_fixedheaders.txt --names input_files/names_reduced.dmp --nodes input_files/nodes_reduced.dmp --acc2tax input_files/accession2taxid_reduced.dmp --db_dir test/ | ||
|
||
## Test using uncompressed contigs from metaspades assembly (note: --no_stars was required for some reason but seesms to only occur when we have a single genome in there possible..) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
afaik Maxbin2 will force a split into the contigs, it just assumes there are 2+ genomes (at least that happened a few years ago). But thats fine I think.
About the origin of the test data: It seems to be from @HadrienG and there might be a discussion in slack about it. |
No description provided.