Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enforce start codon in pyrodigal ORF prediction #64

Open
LanderDC opened this issue Dec 4, 2024 · 3 comments
Open

Enforce start codon in pyrodigal ORF prediction #64

LanderDC opened this issue Dec 4, 2024 · 3 comments
Labels
enhancement New feature or request

Comments

@LanderDC
Copy link

LanderDC commented Dec 4, 2024

Hi,

I would like to use pyrodigal(-gv) to predict ORFs in eukaryotic viruses, I find that it generally works well. However, in some cases pyrodigal overestimates the gene size:

For example chryso_test.txt is a sequence with an ORF that has a start codon at position 20, but no stop codon. If I run pyrodigal with -c this ORF is not found (as expected), but without -c, pyrodigal puts the start codon outside of the sequence:

##gff-version  3
# Sequence Data: seqnum=1;seqlen=3312;seqhdr="NODE_1_length_3312_cov_25.710665_MEMO032"
# Model Data: version=pyrodigal.v3.6.3;run_type=Metagenomic;model="5|Anaplasma_marginale_Maries|B|49.8|11|1";gc_cont=49.80;transl_table=11;uses_sd=1
NODE_1_length_3312_cov_25.710665_MEMO032	pyrodigal_v3.6.3	CDS	2	3310	275.9	+	0	ID=NODE_1_length_3312_cov_25.710665_MEMO032_1;partial=11;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.460;conf=99.99;score=275.94;cscore=274.33;sscore=1.61;rscore=0.00;uscore=0.00;tscore=1.61;

Would it be possible to create a feature that allows the user to specify that a start codon should be enforced (and not both start and stop codon can be outside a gene as with -c)? Or would there be any reason that I'm overlooking, why this behaviour would not be desirable?

For people having a similar problem: while looking for a solution to my specific problem, I found that commenting out following lines (and rebuilding the package), would give me the output I want:

elif i <= 2 and not closed and last[i%3] - i > min_edge_gene:
saw_start[i%3] = True
self._add_node(
ndx = i,
type = node_type.ATG,
stop_val = last[i%3],
strand = 1,
edge = True,
)
nn += 1

and

elif i <= 2 and not closed and last[i%3] - i > min_edge_gene:
saw_start[i%3] = 1
node = self._add_node(
ndx = sequence.slen - i - 1,
type = node_type.ATG,
strand = -1,
stop_val = sequence.slen - last[i%3] - 1,
edge = True,
)
nn += 1

@LanderDC LanderDC changed the title Enforce start or stop codon in pyrodigal ORF prediction Enforce start codon in pyrodigal ORF prediction Dec 6, 2024
@LanderDC
Copy link
Author

LanderDC commented Dec 6, 2024

I tried to give it a go myself in the meantime (see forked repository), so if you're interested I can draft up a pull request. I guess there would be better ways to implement this, but that would involve breaking changes and I wanted to avoid that.

In summary, the closed argument of the GeneFinder class now takes a list of two boolean values, which is then parsed into closed_start and closed_stop, or alternatively, when only one boolean is given, both closed_start and closed_stop are set to that value (this way existing pipelines would not break).

For pyrodigal's CLI, -c now takes a string (options: none [default] -> no closed ends, both -> both closed ends, start -> only closed end at start) or when only -c is given, this corresponds to both.

@althonos
Copy link
Owner

Hi @LanderDC !

That sounds like a reasonable request, especially for viral genomes. I have to think about some of the implications with the code, and also whether this doesn't combine with #65 in a larger "how to handle circular topologies" question?

@althonos althonos added the enhancement New feature or request label Dec 11, 2024
@LanderDC
Copy link
Author

LanderDC commented Dec 12, 2024

Sounds great! In my forked repository, all unit tests run fine with my altered version (except 1 in test_nodes.py, because I changed the arguments of the extract function), but there might be of course hidden issues that I'm not able to spot.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants