Starting a list of barcoding genes and their synonyms. #93

Open
tobiasgf opened this issue Apr 12, 2024 · 20 comments

@tobiasgf
Collaborator

tobiasgf commented Apr 12, 2024

At some point we may want to make a more comprehensive list of barcoding genes and their synonyms (also for use in the sequence ID tool, etc.)

Starting here (with a bit of help from AI):

[edited: RNA to DNA, added "mt" to mitochondrial versions]
[EDIT: Idea: maybe we should add SSU/LSU to the ribosomal gene names, to make the distinction between these more clear... e.g.: "12S mtDNA (SSU)" (or some rearranged variant) rather than "12S mtDNA"]

  • 12S mtDNA (12S ribosomal RNA): 12SrDNA, 12S
  • 16S mtDNA (mitochondrial) (16S ribosomal RNA): 16SrDNA, 16S
  • 16S rDNA (bacterial) (16S ribosomal RNA)
  • 18S rDNA (18S ribosomal RNA): 18SrDNA, 18S
  • 23S rDNA (23S ribosomal RNA)
  • 28S rDNA (28S ribosomal RNA)
  • rbcL (Ribulose-bisphosphate carboxylase large chain)
  • CytB (Cytochrome b): CYTB, CYB
  • COI (Cytochrome c oxidase I): CO1, COX1
  • COII (Cytochrome c oxidase II): CO2, COX2
  • COIII (Cytochrome c oxidase III): CO3, COX3
  • nifH (Nitrogenase reductase)
  • ITS (Internal Transcribed Spacer): ITS1, ITS2
  • ND1 (NADH dehydrogenase subunit 1): NAD1
  • ND2 (NADH dehydrogenase subunit 2): NAD2
  • ND3 (NADH dehydrogenase subunit 3): NAD3
  • ND4 (NADH dehydrogenase subunit 4): NAD4
  • ND5 (NADH dehydrogenase subunit 5): NAD5
  • ND6 (NADH dehydrogenase subunit 6): NAD6
  • amoA (Ammonia monooxygenase subunit A)
  • rpoB (RNA polymerase beta subunit)
  • rpoC1 (RNA polymerase subunit C1)
  • rpoC2 (RNA polymerase subunit C2)
  • matK (Maturase K)
  • trnH (tRNA-His)
  • trnL (tRNA-Leu)
  • psbK (Photosystem II protein K)
  • trnK (tRNA-Lys)
  • D-loop
  • atp6 (ATP synthase F0 subunit 6)
  • atp8 (ATP synthase F0 subunit 8)
  • tufA (Elongation factor Tu)
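
A minimal sketch of how a list like the one above could back a synonym lookup, for example in a sequence ID tool. Only a few entries are copied from the list; the names `SYNONYMS`, `normalize` and `lookup_gene`, and the normalization rule itself, are illustrative assumptions and not part of any existing tool.

```python
# Hypothetical lookup table: preferred label -> alternative names
# (a small subset of the list above).
SYNONYMS = {
    "COI": ["CO1", "COX1"],
    "COII": ["CO2", "COX2"],
    "CytB": ["CYTB", "CYB"],
    "ND1": ["NAD1"],
    "18S rDNA": ["18SrDNA", "18S"],
}


def normalize(name: str) -> str:
    """Lower-case and drop spaces, hyphens and underscores so 'Cyt-B' matches 'CytB'."""
    return "".join(ch for ch in name.lower() if ch not in " -_")


# Reverse index: normalized name -> preferred label.
INDEX = {}
for label, alternatives in SYNONYMS.items():
    for name in [label, *alternatives]:
        INDEX[normalize(name)] = label


def lookup_gene(verbatim: str):
    """Return the preferred label for a verbatim gene name, or None if unknown."""
    return INDEX.get(normalize(verbatim))
```

An ambiguous name such as a bare "16S" is deliberately left out of this table; whether and how it should resolve is a separate decision.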
@tobiasgf
Collaborator Author

tobiasgf commented Apr 16, 2024

Consider using rDNA instead of rDNA, reflecting the focus on DNA sequences and not the gene product.
[edit, I meant: rDNA instead of rRNA]

@tobiasgf
Collaborator Author

Also, consider splitting ITS into: ITS1 and ITS2, and ITS region (ITS1+5.8S+ITS2)

@CecSve
Copy link

CecSve commented May 7, 2024

  • 16S rRNA (mitochondrial) (16S ribosomal RNA): 16SrDNA, 16S

  • 16S rRNA (bacterial) (16S ribosomal RNA)

Does it matter whether it is mitochondrial or bacterial? Concepts have to be unique, so I have added only one.

@tobiasgf
Collaborator Author

tobiasgf commented May 7, 2024

Two different genes
16S mitochondrial gene is the Large subunit (LSU) (homologue to the 23S in bacteria and 28S in eukaryotes)
16S bacterial gene is the small subunit (SSU) (homologue to the 12S in mitochondria and 18S in eukaryotes)
They just (unfortunately) have the same sedimentation coefficient (S-value).

So the unique concepts would rather be LSU and SSU. But they are just targeted with very different primers.

But I believe it is best to keep these two genes (LSU and SSU) separate between bacteria/archaea, eukaryotes and mitochondria.
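
A small sketch of the relationships described here, assuming each ribosomal gene is recorded as (S-value, subunit, lineage). The structure and the function names are hypothetical; "prokaryote" stands in for bacteria/archaea. Grouping by S-value shows that "16S" is the only label that names two different genes.

```python
from collections import defaultdict

# (S-value, subunit, lineage) for the six ribosomal RNA genes discussed here.
RRNA_GENES = [
    ("12S", "SSU", "mitochondria"),
    ("16S", "LSU", "mitochondria"),
    ("16S", "SSU", "prokaryote"),
    ("23S", "LSU", "prokaryote"),
    ("18S", "SSU", "eukaryote"),
    ("28S", "LSU", "eukaryote"),
]


def homologues(subunit: str) -> dict:
    """Map lineage -> S-value for all genes of one subunit (SSU or LSU)."""
    return {lineage: s for s, sub, lineage in RRNA_GENES if sub == subunit}


def ambiguous_s_values() -> dict:
    """Return S-values that name more than one gene, with the genes they could mean."""
    by_s = defaultdict(list)
    for s, sub, lineage in RRNA_GENES:
        by_s[s].append((sub, lineage))
    return {s: genes for s, genes in by_s.items() if len(genes) > 1}
```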

@CecSve

CecSve commented May 7, 2024

What do you propose we call the concepts then so we make sure they do not match?

@tobiasgf
Collaborator Author

tobiasgf commented May 7, 2024

Also, I think we should use the DNA-terminology (rDNA not rRNA), as we are talking about the DNA that is being targeted, not the gene product. Maybe:

12S mtDNA
16S mtDNA
16S rDNA
18S rDNA
23S rDNA
28S rDNA

@CecSve

CecSve commented May 7, 2024

Could I ask you to update the first post with the list? Then I'll add the concepts to the spreadsheet.

@CecSve

CecSve commented May 10, 2024

Also, consider splitting ITS into: ITS1 and ITS2, and ITS region (ITS1+5.8S+ITS2)

I have added 1 and 2 - not sure what is meant with the region. Could you please specify? https://docs.google.com/spreadsheets/d/1_cV4LZeqF_sm-JVaHuKusWBeVK2VkVWEcuPM8MPZN3s/edit#gid=1674909017

@CecSve

CecSve commented May 10, 2024

Two different genes
16S mitochondrial gene is the Large subunit (LSU) (homologue to the 23S in bacteria and 28S in eukaryotes)
16S bacterial gene is the small subunit (SSU) (homologue to the 12S in mitochondria and 18S in eukaryotes)
They just (unfortunately) have the same sedimentation coefficient (S-value).

So the unique concepts would rather be LSU and SSU. But they are just targeted with very different primers.

But I believe it is best to keep these two genes (LSU and SSU) separate between bacteria/archaea, eukaryotes and mitochondria.

I do not think it is currently captured well as concepts. Ideally, it should be clear to publishers as well as users, but mostly to users, what goes in the interpreted field. Let me know how you think we should separate the two in a clear way. For example, we have a verbatim value 16S: how should this be interpreted?

@CecSve

CecSve commented May 10, 2024

The verbatim values are now ready to be mapped. I have added some mappings already; however, since they are all identical to either the concept, label_en or alternativeLabel_en, they will be removed from the mapping sheet, as they will be mapped automatically on that basis.
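
A rough sketch of the pruning described here, assuming each vocabulary entry carries the fields `concept`, `label_en` and `alternativeLabel_en`, and that "identical" means an exact, case-sensitive string match. The sample entries, the row format and the function names are hypothetical, not the actual spreadsheet layout.

```python
# Hypothetical vocabulary entries using the field names mentioned above.
CONCEPTS = [
    {"concept": "COI", "label_en": "Cytochrome c oxidase I",
     "alternativeLabel_en": ["CO1", "COX1"]},
    {"concept": "CytB", "label_en": "Cytochrome b",
     "alternativeLabel_en": ["CYTB", "CYB"]},
]


def known_names(concepts) -> set:
    """Collect every string that already identifies a concept."""
    names = set()
    for entry in concepts:
        names.add(entry["concept"])
        names.add(entry["label_en"])
        names.update(entry["alternativeLabel_en"])
    return names


def prune_mapping(rows, concepts):
    """Keep only (verbatim, concept) rows whose verbatim value is not already a known name."""
    names = known_names(concepts)
    return [(verbatim, concept) for verbatim, concept in rows if verbatim not in names]
```

Only verbatim values that differ from every known name, such as a spelled-out variant, would remain in the mapping sheet.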

@only1chunts

fyi - i have linked a GSC MIxS ticket to this so that we can update the 'target_gene' accepted values list with the consensus list here, thanks.

@tobiasgf
Collaborator Author

fyi - i have linked a GSC MIxS ticket to this so that we can update the 'target_gene' accepted values list with the consensus list here, thanks.

Very good, @only1chunts. I don't yet consider this list as authoritative or community accepted. This is a first attempt to collect a list of barcode genes (and their synonyms / alternative names).

@tobiasgf
Collaborator Author

Also, I think we should use the DNA-terminology (rDNA not rRNA), as we are talking about the DNA that is being targeted, not the gene product. Maybe:

12S mtDNA
16S mtDNA
16S rDNA
18S rDNA
23S rDNA
28S rDNA

Based on more recent speculations and discussions, a suggestion for an unambiguous set of labels for the commonly used ribosomal genes could be:
12S rRNA (SSU mitochondria)
16S rRNA (LSU mitochondria)
16S rRNA (SSU prokaryote)
23S rRNA (LSU prokaryote)
18S rRNA (SSU eukaryote)
28S rRNA (LSU eukaryote)
(The specification for the eukaryotes may not be needed, as 18S and 28S are unambiguous. But for a systematic labelling, they may be relevant.)
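
One way to keep such labels systematic is to build them from their three components and check that no two collide. This is only a sketch: the label pattern is taken from the suggestion above, while `rrna_label`, `parse_rrna_label` and the `GENES` table are hypothetical names.

```python
import re

# (S-value, subunit, lineage) for the six suggested labels.
GENES = [
    ("12S", "SSU", "mitochondria"),
    ("16S", "LSU", "mitochondria"),
    ("16S", "SSU", "prokaryote"),
    ("23S", "LSU", "prokaryote"),
    ("18S", "SSU", "eukaryote"),
    ("28S", "LSU", "eukaryote"),
]

LABEL_PATTERN = re.compile(r"(\d+S) rRNA \((SSU|LSU) (\w+)\)")


def rrna_label(s_value: str, subunit: str, lineage: str) -> str:
    """Format a label such as '16S rRNA (SSU prokaryote)'."""
    return f"{s_value} rRNA ({subunit} {lineage})"


def parse_rrna_label(label: str):
    """Split a label back into (S-value, subunit, lineage), or return None if it does not match."""
    match = LABEL_PATTERN.fullmatch(label)
    return match.groups() if match else None


LABELS = [rrna_label(*gene) for gene in GENES]

# Every label must be unique, otherwise it cannot serve as a concept.
assert len(set(LABELS)) == len(LABELS)
```

A bare "16S" does not parse under this pattern, which makes the missing subunit and lineage explicit instead of silently guessing.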

@cpavloud

Consider using rDNA instead of rDNA, reflecting the focus on DNA sequences and not the gene product. [edit, I meant: rDNA instead of rRNA]

Technically, there is no such thing as rDNA in an actual living organism (despite the fact that you can find it in the literature...), so I would keep the rRNA.
If you want to be more specific, you can use a sentence like "the gene encoding for 16S rRNA", or "the 16S rRNA gene" or something like that.

@cpavloud

I think the main issue here arises from the primer (non-)specificities.
It is very common that e.g. with 16S rRNA you amplify Eukaryotes when aiming for Bacteria/Archaea.

So, maybe instead of modifying the target_gene term or proposing alternative values, we should be aiming to create a new target_organism term?

@only1chunts

only1chunts commented Jan 14, 2025 via email

@cpavloud

I am not in the group but I would love to be, so it would be great to put me in touch with the relevant people.
I don't want to hijack the conversation here.

@tobiasgf
Collaborator Author

I think the main issue here arises from the primer (non-)specificities. It is very common that e.g. with 16S rRNA you amplify Eukaryotes when aiming for Bacteria/Archaea.

So, maybe instead of modifying the target_gene term or proposing alternative values, we should be aiming to create a new target_organism term?

That is a good point (unspecific primers)
SSU and LSU would be the all-inclusive target_gene labels maybe?
As @only1chunts writes, there is a difference between stating what was targeted and what was amplified, so it is important to be aware e.g. that "18S data" likely includes 12S (mito) and 16S (bact) as well, for sure.

not easy

@CecSve

CecSve commented Jan 15, 2025

@tobiasgf will you update the vocabulary to reflect the changes? It is not implemented yet, but it would ease the transition if the list is updated in the vocabulary and mapped to verbatim values.

FYI, I will not put the vocabulary on the server until there is a (somewhat) static list of terms. Ideally, concepts should not be changed regularly once they are in the server since it would require full reinterpretation of occurrences.

@sformel-usgs

sformel-usgs commented Jan 21, 2025

This conversation moved fast! I was traveling last week, but here are my thoughts.

  1. Creating a controlled vocabulary to help clean up this mess is a great idea, it's sorely needed.
  2. To future proof this a bit, I think we should explore chunking this up by organelle using GO terms. This might be too onerous for users, but it also might raise awareness about the diversity of data that needs to be interoperable for data sharing purposes.
  3. Since primers may detect off-target taxa (e.g. chloroplast instead of prokaryote), IMO, it would be best if the taxonomy was kept in taxonomy terms, rather than be implied in the gene target terms.

5 participants