Find location of genes in reference sequences #2

Open
dkoslicki opened this issue Jun 21, 2022 · 1 comment

@dkoslicki

The following will help with that task:

NCBI FTP site: ftp.ncbi.nlm.nih.gov (username: anonymous, pw: your email)
The directory /gene/DATA contains a file, gene2refseq, that lists all known genes, their locations, and the accessions of the genomes they belong to.
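
A minimal sketch of pulling gene locations for one genome out of gene2refseq. It assumes the file is tab-separated with a `#`-prefixed header line, and that the column names used below (`GeneID`, `genomic_nucleotide_accession.version`, and so on) match the header of the downloaded file; check them before relying on this. The function name `gene_locations` is hypothetical.

```python
def gene_locations(handle, accession):
    """Yield (gene_id, start, end, orientation) for genes on one genomic accession.

    Assumes a tab-separated file whose first line is a '#'-prefixed header.
    Column names below are assumptions; verify them against the real file.
    """
    col = None
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if col is None:
            # Map header names to column indexes, dropping the leading '#'.
            col = {name.lstrip("#"): i for i, name in enumerate(fields)}
            continue
        if fields[col["genomic_nucleotide_accession.version"]] != accession:
            continue
        start = fields[col["start_position_on_the_genomic_accession"]]
        end = fields[col["end_position_on_the_genomic_accession"]]
        if start == "-" or end == "-":
            # Rows without a genomic location are skipped.
            continue
        yield fields[col["GeneID"]], int(start), int(end), fields[col["orientation"]]
```

Usage would look like `list(gene_locations(open("gene2refseq"), "NZ_CP096169.1"))` once the file is downloaded and decompressed.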

Pull out specific regions of a genome:

wget "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=NZ_CP096169.1&strand=1&rettype=fasta&seq_start=0&seq_stop=1385&retmode=text" -O out.fa
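
The same request can be assembled programmatically, which helps when looping over many regions. This is a sketch only: `efetch_url` is a hypothetical helper, and the parameters are exactly those in the wget command above.

```python
from urllib.parse import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def efetch_url(accession, seq_start, seq_stop, strand=1, rettype="fasta"):
    """Build an efetch URL for one region of a nuccore record."""
    params = {
        "db": "nuccore",
        "id": accession,
        "strand": strand,
        "rettype": rettype,
        "seq_start": seq_start,
        "seq_stop": seq_stop,
        "retmode": "text",
    }
    return f"{EFETCH}?{urlencode(params)}"

url = efetch_url("NZ_CP096169.1", 0, 1385)
```

The result can then be saved with `urllib.request.urlretrieve(url, "out.fa")`. One thing to verify: as far as I know, efetch treats `seq_start` as 1-based while gene2refseq positions are 0-based, so an offset may be needed when converting between the two.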

Idea:

  1. Select a bunch of accessions for which we have gene information (from the gene2refseq)
  2. Download their corresponding whole genomes
  3. Create a mapping file that maps regions to genes (slight modification of gene2refseq)
  4. Download the genes themselves for training of sourmash and Diamond
  5. Create a script that takes a bbmap simulation and reports the genes covered in the simulation, along with the percentage of each gene's bp that is covered and the mean/median/summary of the total bp mapped to each gene
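
A rough sketch of the per-gene statistics in step 5, under several assumptions: read alignments are already available as (start, end) intervals on the same genome as the gene, all coordinates are 0-based and end-inclusive, and "mean/median" is read as per-base depth over the gene. The function name `gene_coverage` and the output keys are hypothetical, and parsing the actual bbmap output is not shown.

```python
from statistics import mean, median

def gene_coverage(gene_start, gene_end, alignments):
    """Summarize how a set of alignments covers one gene.

    gene_start, gene_end and each (start, end) in alignments are assumed
    to be 0-based, end-inclusive coordinates on the same sequence.
    """
    length = gene_end - gene_start + 1
    depth = [0] * length
    for aln_start, aln_end in alignments:
        # Clip each alignment to the gene; non-overlapping ones add nothing.
        lo = max(aln_start, gene_start)
        hi = min(aln_end, gene_end)
        for pos in range(lo, hi + 1):
            depth[pos - gene_start] += 1
    covered = sum(1 for d in depth if d > 0)
    return {
        "percent_covered": 100.0 * covered / length,
        "total_bp_mapped": sum(depth),
        "mean_depth": mean(depth),
        "median_depth": median(depth),
    }
```

A per-base depth list is simple but uses memory proportional to gene length; for very long regions, an interval-merging approach would be a reasonable alternative.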

@dkoslicki

This will pull the amino acid sequences themselves (note rettype=fasta_cds_aa):

wget "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=NZ_CP096169.1&strand=1&rettype=fasta_cds_aa&seq_start=0&seq_stop=1385&retmode=text" -O out.fa
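
For downstream use (for example, building the training sets for sourmash and Diamond), the downloaded file may need to be split into individual records. Below is a small FASTA reader sketch; it assumes standard FASTA layout (a `>` header line followed by sequence lines), and `read_fasta` is a hypothetical helper, not part of any tool mentioned here.

```python
def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)
```

With the file from the command above, `for name, seq in read_fasta(open("out.fa")): ...` would iterate over the records.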
