API · GeneFinder.jl
The Main ORF type
The main type of the package is ORF, which represents an Open Reading Frame.
GeneFinder.ORF — Type
struct ORF{N,F} <: GenomicFeatures.AbstractGenomicInterval{F}
The ORF struct represents an Open Reading Frame (ORF) in genomics.
Fields
- groupname::String: The name of the group to which the ORF belongs.
- first::Int64: The starting position of the ORF.
- last::Int64: The ending position of the ORF.
- strand::Strand: The strand on which the ORF is located.
- frame::Int: The reading frame of the ORF.
- features::Features: The features associated with the ORF.
- scheme::Union{Nothing,Function}: The scheme used for the ORF.
Constructor
ORF{N,F}(
    groupname::String,
    first::Int64,
    last::Int64,
    strand::Strand,
    frame::Int,
    features::Features,
    scheme::Union{Nothing,Function}
)

Example

A full instance `ORF`

```julia
ORF{4,NaiveFinder}("seq01", 1, 33, STRAND_POS, 1, Features((score = 0.0,)), nothing)
```

A partial instance `ORF`

```julia
ORF{NaiveFinder}(1:33, '+', 1)
```
FASTX.sequence — Method
sequence(i::ORF{N,F})
Extracts the DNA sequence corresponding to the given open reading frame (ORF).
Arguments
- i::ORF{N,F}: The open reading frame (ORF) for which the DNA sequence needs to be extracted.
Returns
- The DNA sequence corresponding to the given open reading frame (ORF).
GeneFinder.features — Method
features(i::ORF{N,F})
Extracts the features from an ORF object.
Arguments
- i::ORF{N,F}: An ORF object.
Returns
- The features of the ORF object.
GeneFinder.source — Method
source(i::ORF{N,F})
Get the source sequence associated with the given ORF object.
Arguments
- i::ORF{N,F}: The ORF object for which to retrieve the source sequence.
Returns
- The source sequence associated with the ORF object.
Examples
seq = dna"ATGATGCATGCATGCATGCTAGTAACTAGCTAGCTAGCTAGTAA"
orfs = findorfs(seq)
source(orfs[1])

44nt DNA Sequence:
ATGATGCATGCATGCATGCTAGTAACTAGCTAGCTAGCTAGTAA
Warning
The source method works only if the sequence is defined in the global scope; otherwise it throws an error. A common failure is to define a simple ORF, which by default has "unnamedsource" as its groupname, and then try to get its source sequence.
orf = ORF{NaiveFinder}(1:33, '+', 1)
source(orf)

ERROR: UndefVarError: `unnamedsource` not defined
Stacktrace:
 [1] source(i::ORF{4, NaiveFinder})
   @ GeneFinder ~/.julia/dev/GeneFinder/src/types.jl:192
 [2] top-level scope
   @ REPL[12]:1
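One plausible reading of this error, stated here as an assumption rather than a description of GeneFinder's internals, is that the groupname is resolved as the name of a global variable. The sketch below reproduces that behaviour with Base Julia only; `lookup_source` is a hypothetical helper, not a GeneFinder function.

```julia
# Hypothetical helper, not GeneFinder code: resolve a name to the global
# variable of the same name in the current module (Main at the REPL).
lookup_source(name::AbstractString) = getfield(@__MODULE__, Symbol(name))

# A global binding with a matching name makes the lookup succeed.
seq01 = "ATGATGCATGCATGCATGCTAGTAA"
found = lookup_source("seq01")

# No global is called `unnamedsource`, so this lookup throws and we
# capture the exception instead of letting it propagate.
err = try
    lookup_source("unnamedsource")
catch e
    e
end

println(found == seq01, " ", err isa UndefVarError)
```

Under this assumption, an ORF built by hand needs a groupname that matches an existing global sequence before its source can be retrieved.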
Finding ORFs
The function findorfs is the main function of the package. It is a generic method that can handle different gene finding methods.
GeneFinder.findorfs — Method
findorfs(sequence::NucleicSeqOrView{DNAAlphabet{N}}; ::M, kwargs...) where {N, M<:GeneFinderMethod}
This is the main interface method for finding open reading frames (ORFs) in a DNA sequence.
It takes the following required arguments:
- sequence: The nucleic acid sequence to search for ORFs.
- method: The algorithm used to find ORFs. It can be NaiveFinder(), NaiveFinderScored(), or another implementation.
Keyword arguments, regardless of the finder method:
- alternative_start::Bool: Whether to consider alternative start codons. Default is false.
- minlen::Int: The minimum length of an ORF. Default is 6.
- scheme::Function: The scoring scheme used to score the sequence from the ORF. Default is nothing.
Returns
A vector of ORF objects representing the found ORFs.
Example
sequence = randdnaseq(120)
120nt DNA Sequence:
GCCGGACAGCGAAGGCTAATAAATGCCCGTGCCAGTATC…TCTGAGTTACTGTACACCCGAAAGACGTTGTACGCATTT
findorfs(sequence, NaiveFinder())
1-element Vector{ORF}:
 ORF{NaiveFinder}(77:118, '-', 2, 0.0)
Finding ORFs using BioRegex and scoring
GeneFinder.NaiveFinder — Method
NaiveFinder(sequence::NucleicSeqOrView{DNAAlphabet{N}}; kwargs...) -> Vector{ORF} where {N}
A simple implementation that finds ORFs in a DNA sequence.
The NaiveFinder method takes a LongSequence{DNAAlphabet{4}} sequence and returns a Vector{ORF} containing the ORFs found in the sequence. It matches each whole CDS with a regular expression, adding every ORF it finds to the vector. The function also searches the reverse complement of the sequence, so it finds ORFs on both strands. Setting alternative_start = true extends the start codons so that the search covers ATG, GTG, and TTG. Some studies have shown that in E. coli (K-12 strain), ATG, GTG and TTG are used 83 %, 14 % and 3 % of the time, respectively.
Note
By default this function applies neither an ORF scoring scheme nor length constraints. It might therefore report aa"M*" as a possible encoded protein among the resulting ORFs.
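The both-strand search and the alternative_start switch can be sketched on plain strings. This is a minimal illustration using only Base Julia, not GeneFinder's implementation: `complement_base`, `revcomp`, `start_pattern` and `orfs_both_strands` are hypothetical names, and `[ACGT]` is assumed as a stand-in for any nucleotide.

```julia
# Complement of a single base; anything unexpected becomes 'N'.
complement_base(c::Char) =
    c == 'A' ? 'T' : c == 'T' ? 'A' : c == 'C' ? 'G' : c == 'G' ? 'C' : 'N'

# Reverse complement of a plain DNA string.
revcomp(seq::AbstractString) = reverse(map(complement_base, seq))

# Start codons: ATG only, or ATG/GTG/TTG when alternative starts are on.
start_pattern(alternative_start::Bool) =
    alternative_start ? "(?:ATG|GTG|TTG)" : "ATG"

function orfs_both_strands(seq::AbstractString; alternative_start::Bool=false)
    pattern = Regex(start_pattern(alternative_start) * "(?:[ACGT]{3})*?T(?:AG|AA|GA)")
    n = length(seq)
    found = Tuple{Int,Int,Char}[]
    # Forward strand: positions are reported as matched.
    for m in eachmatch(pattern, seq; overlap=true)
        push!(found, (m.offset, m.offset + length(m.match) - 1, '+'))
    end
    # Reverse strand: search the reverse complement, then map the match
    # back to forward-strand coordinates.
    for m in eachmatch(pattern, revcomp(seq); overlap=true)
        stop = m.offset + length(m.match) - 1
        push!(found, (n - stop + 1, n - m.offset + 1, '-'))
    end
    return found
end

println(orfs_both_strands("TTACAT"))
println(orfs_both_strands("GTGTAA"; alternative_start=true))
```

The first call finds nothing on the forward strand but one ORF on the reverse strand; the second finds an ORF only because GTG is accepted as a start codon.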
Required Arguments
- sequence::NucleicSeqOrView{DNAAlphabet{N}}: The nucleic acid sequence to search for ORFs.
Keyword Arguments
- alternative_start::Bool: If true, the extended start codons are passed to the search. This increases the execution time 3x. Default is false.
- minlen::Int64=6: Minimum length of an allowed ORF. The default value allows aa"M*" as a possible encoded protein among the resulting ORFs.
- scheme::Function: The scoring scheme used to score the sequence from the ORF. Default is nothing.
Note
As the scheme is generally a scoring function that requires at least a sequence, one simple scheme is the log-odds ratio score. This score compares the probability of the sequence being generated by a coding model to the probability of it being generated by a non-coding model:
\[S(x) = \sum_{i=1}^{L} \beta_{x_{i-1}x_{i}} = \sum_{i=1}^{L} \log \frac{a^{\mathscr{m}_{1}}_{x_{i-1}x_{i}}}{a^{\mathscr{m}_{2}}_{x_{i-1}x_{i}}}\]
If the log-odds ratio exceeds a given threshold (η), the sequence is considered likely to be coding. See lordr for more information about coding criteria.
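As a rough illustration of such a scheme, the sketch below scores a plain DNA string against two first-order transition matrices and compares the result to a threshold. It uses only Base Julia. The matrices, the threshold and the names `nucindex` and `logodds` are invented for this example and are not taken from GeneFinder or from lordr. The sum starts at the second base because the first base has no predecessor.

```julia
# Row/column index of a nucleotide in the transition matrices.
nucindex(c::Char) = findfirst(==(c), "ACGT")

# Made-up transition matrices (rows: previous base, columns: next base).
# Each row sums to 1; the values are for illustration only.
coding = [0.40 0.20 0.20 0.20;
          0.20 0.40 0.20 0.20;
          0.20 0.20 0.40 0.20;
          0.20 0.20 0.20 0.40]
noncoding = fill(0.25, 4, 4)

# Sum of log(a_coding / a_noncoding) over consecutive base pairs.
function logodds(seq::AbstractString, m1::Matrix{Float64}, m2::Matrix{Float64})
    score = 0.0
    for i in 2:length(seq)
        prev, next = nucindex(seq[i-1]), nucindex(seq[i])
        score += log(m1[prev, next] / m2[prev, next])
    end
    return score
end

threshold = 0.0                      # hypothetical decision threshold
s = logodds("AAAA", coding, noncoding)
println(s > threshold)               # is the sequence called coding?
```

A function of this shape, taking a sequence and returning a score, is the kind of callable the scheme keyword is described as accepting.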
GeneFinder._locationiterator — Method
_locationiterator(sequence::NucleicSeqOrView{DNAAlphabet{N}}; alternative_start::Bool=false) where {N}
This is an iterator function that uses regular expressions to match an entire ORF (rather than separate start and stop codons) in a LongSequence{DNAAlphabet{4}} sequence. It uses an anonymous function that finds the first ORF matching the regular expression, and then builds an iterator that applies this function repeatedly until no further CDS is found.
Note
As an implementation note, here is how the ORFs are found:
The expression (?:[N]{3})*? serves as the boundary between the start and stop codons. Within this expression, the character class [N]{3} matches exactly three occurrences of any character (representing nucleotides using IUPAC codes). This portion matches the regular codons. Since it is enclosed within a non-capturing group (?:) and followed by *?, it allows for the matching of intermediate codons, but with a preference for the smallest number of repetitions.
In summary, the regular expression ATG(?:[N]{3})*?T(AG|AA|GA) identifies patterns that start with "ATG", are followed by any number of three-character codons (represented by "N" in the IUPAC code), and end with a stop codon "TAG", "TAA", or "TGA". This pattern is commonly used to identify potential protein-coding regions within genetic sequences.
See more about the discussion here
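The lazy pattern described in this note can be tried on an ordinary string with Base Julia's regex engine. The sketch below is an approximation and not the package's BioRegex code: it assumes `[ACGT]` in place of the IUPAC N class, and `find_orf_ranges` is a hypothetical helper. The `overlap=true` keyword lets the search restart just after each match start, so nested ORFs are reported too.

```julia
# Start codon, lazily repeated codons, then the first in-frame stop codon.
orf_pattern = r"ATG(?:[ACGT]{3})*?T(?:AG|AA|GA)"

# Collect the (start, stop) positions of every match, allowing overlaps.
function find_orf_ranges(seq::AbstractString)
    return [(m.offset, m.offset + length(m.match) - 1)
            for m in eachmatch(orf_pattern, seq; overlap=true)]
end

seq = "ATGATGCATGCATGCATGCTAGTAACTAGCTAGCTAGCTAGTAA"
ranges = find_orf_ranges(seq)
println(ranges)   # [(1, 33), (4, 33), (8, 22), (12, 29), (16, 33)]
```

Because `*?` is lazy, each match ends at the first stop codon that is in frame with its start codon, which is why the match beginning at position 1 stops at 33 rather than running on to a later stop.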
Writing ORFs to files
GeneFinder.write_orfs_bed — Method
write_orfs_bed(input::NucleicSeqOrView{DNAAlphabet{N}}, output::Union{IOStream, IOBuffer}, finder::F; kwargs...)
write_orfs_bed(input::NucleicSeqOrView{DNAAlphabet{N}}, output::String, finder::F; kwargs...)
Write BED data to a file.
Arguments
- input: The input DNA sequence NucSeq or a view.
- output: The output destination; it can be a file (String) or a buffer (IOStream or IOBuffer).
- finder: The algorithm used to find ORFs. It can be either NaiveFinder() or NaiveFinderScored().
Keywords
- alternative_start::Bool=false: If true, alternative start codons will be used when identifying CDSs. Default is false.
- minlen::Int64=6: The minimum length that a CDS must have in order to be included in the output file. Default is 6.
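To make the String versus IO pair of methods concrete, here is a small sketch of the same dispatch pattern in Base Julia. It is not the package's writer: `SimpleORF` and `write_bed_sketch` are hypothetical, and the column layout follows the generic BED convention (name, 0-based start, end, label, score, strand), which may differ from the columns GeneFinder emits.

```julia
# Hypothetical record type for this sketch only (not GeneFinder's ORF).
struct SimpleORF
    first::Int
    last::Int
    strand::Char
end

# IO method: one tab-separated BED line per ORF.
function write_bed_sketch(io::IO, chrom::AbstractString, orfs::Vector{SimpleORF})
    for (k, orf) in enumerate(orfs)
        # BED starts are 0-based, hence `first - 1`; ends are exclusive.
        println(io, join((chrom, orf.first - 1, orf.last, "ORF$k", 0, orf.strand), '\t'))
    end
    return nothing
end

# String method: open the file and delegate to the IO method.
function write_bed_sketch(path::AbstractString, chrom::AbstractString, orfs::Vector{SimpleORF})
    open(path, "w") do io
        write_bed_sketch(io, chrom, orfs)
    end
end

buf = IOBuffer()
write_bed_sketch(buf, "seq01", [SimpleORF(1, 33, '+'), SimpleORF(8, 22, '+')])
bed_text = String(take!(buf))
print(bed_text)
```

Writing to an IOBuffer first is a convenient way to inspect the output before committing it to a file.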
GeneFinder.write_orfs_faa — Method
write_orfs_faa(input::NucleicSeqOrView{DNAAlphabet{4}}, output::Union{IOStream, IOBuffer}, finder::F; kwargs...)
write_orfs_faa(input::NucleicSeqOrView{DNAAlphabet{4}}, output::String, finder::F; kwargs...)
Write the protein sequences encoded by the coding sequences (CDSs) of a given DNA sequence to the specified file.
Arguments
- input: The input DNA sequence NucSeq or a view.
- output: The output destination; it can be a file (String) or a buffer (IOStream or IOBuffer).
- finder: The algorithm used to find ORFs. It can be either NaiveFinder() or NaiveFinderScored().
Keywords
- code::GeneticCode=BioSequences.standard_genetic_code: The genetic code by which codons will be translated. See BioSequences.ncbi_trans_table for more info.
- alternative_start::Bool=false: If true, the extended start codons are passed to the search. This increases the execution time 3x.
- minlen::Int64=6: Minimum length of an allowed ORF. The default value allows aa"M*" as a possible encoded protein among the resulting ORFs.
Examples
filename = "output.faa"
[…]
open(filename, "w") do file
    write_orfs_fna(seq, file, NaiveFinder())
end
GeneFinder.write_orfs_gff — Method
write_orfs_gff(input::NucleicSeqOrView{DNAAlphabet{N}}, output::Union{IOStream, IOBuffer}, finder::F; kwargs...)
write_orfs_gff(input::NucleicSeqOrView{DNAAlphabet{N}}, output::String, finder::F; kwargs...)
Write GFF data to a file.
Arguments
- input: The input DNA sequence NucSeq or a view.
- output: The output destination; it can be a file (String) or a buffer (IOStream or IOBuffer).
- finder: The algorithm used to find ORFs. It can be either NaiveFinder() or NaiveFinderScored().
Keywords
- code::GeneticCode=BioSequences.standard_genetic_code: The genetic code by which codons will be translated. See BioSequences.ncbi_trans_table for more info.
- alternative_start::Bool=false: If true, the extended start codons are passed to the search. This increases the execution time 3x.
- minlen::Int64=6: Minimum length of an allowed ORF. The default value allows aa"M*" as a possible encoded protein among the resulting ORFs.
This document was generated with Documenter.jl version 1.5.0 on Wednesday 3 July 2024. Using Julia version 1.10.4.