Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mag: add pipeline compatible mini test database for upcoming CAT_pack #1408

Open
wants to merge 6 commits into
base: mag
Choose a base branch
from

Conversation

jfy133
Copy link
Member

@jfy133 jfy133 commented Nov 28, 2024

No description provided.

## Generate database using CAT_pack prepare
CAT_pack prepare --db_fasta input_files/sequence_fixedheaders.txt --names input_files/names_reduced.dmp --nodes input_files/nodes_reduced.dmp --acc2tax input_files/accession2taxid_reduced.dmp --db_dir test/

## Test using uncompressed contigs from metaspades assembly (note: --no_stars was required for some reason but seesms to only occur when we have a single genome in there possible..)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this be a problem using it when testing in #mag? Or will it be OK to just add --no-stars to the test config?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The latter as far as I can tell :)

The flag was added for this purpose according to GitHub

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

afaik Maxbin2 will force a split into the contigs, it just assumes there are 2+ genomes (at least that happened a few years ago). But thats fine I think.

Copy link

@prototaxites prototaxites left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

(It might be good at some point to add details to the README for the provenance of the test datasets...!)

@jfy133
Copy link
Member Author

jfy133 commented Nov 28, 2024

LGTM!

(It might be good at some point to add details to the README for the provenance of the test datasets...!)

I agree but would need @HadrienG @skrakau @d4straub to do some data archaeology for us 😅

Although I was speaking with @muabnezor about the long read only stuff he adding to mag and we were wondering if it would be worth replacing the test data now to something less mysterious and has paired short and long read data before we implement nf-test...

Copy link

@d4straub d4straub left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that we should document where the data comes from. I'm going to speak with Sabrina about that.

Regarding whether that is related to nf-test: I dont think that matters whatsoever, because no normal user is running the tests with nf-test on old versions (usually -profile test without nf-test), and devs might be fine that a past release fails with nf-test because they are usually just interested in the dev branch.

curl "https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?tool=portal&save=file&log$=seqview&db=nuccore&report=fasta_cds_aa&id=1992822979&extrafeat=null&conwithfeat=on&hide-cdd=on&ncbi_phid=CE8C15326D6BB8C10000000006490560" -o sequence.txt

## use my scripts to filter NCBI nodes/names/acc2taxid files to just a given taxid (basically fancy iterative greps)
bash ~/bin/taxdmp_filter.sh 817 ## for nodes/naames

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
bash ~/bin/taxdmp_filter.sh 817 ## for nodes/naames
bash ~/bin/taxdmp_filter.sh 817 ## for nodes/names

sed 's/lcl|//g;s/_/ /2' sequence.txt > sequence_fixedheaders.txt

## Generate database using CAT_pack prepare
CAT_pack prepare --db_fasta input_files/sequence_fixedheaders.txt --names input_files/names_reduced.dmp --nodes input_files/nodes_reduced.dmp --acc2tax input_files/accession2taxid_reduced.dmp --db_dir test/

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
CAT_pack prepare --db_fasta input_files/sequence_fixedheaders.txt --names input_files/names_reduced.dmp --nodes input_files/nodes_reduced.dmp --acc2tax input_files/accession2taxid_reduced.dmp --db_dir test/
CAT_pack prepare --db_fasta input_files/sequence_fixedheaders.txt --names input_files/names_reduced.dmp --nodes input_files/nodes_reduced.dmp --acc2tax input_files/accession2taxid_reduced.dmp --db_dir test/

## Generate database using CAT_pack prepare
CAT_pack prepare --db_fasta input_files/sequence_fixedheaders.txt --names input_files/names_reduced.dmp --nodes input_files/nodes_reduced.dmp --acc2tax input_files/accession2taxid_reduced.dmp --db_dir test/

## Test using uncompressed contigs from metaspades assembly (note: --no_stars was required for some reason but seesms to only occur when we have a single genome in there possible..)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

afaik Maxbin2 will force a split into the contigs, it just assumes there are 2+ genomes (at least that happened a few years ago). But thats fine I think.

@d4straub
Copy link

About the origin of the test data: It seems to be from @HadrienG and there might be a discussion in slack about it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants