diff --git a/dev/.documenter-siteinfo.json b/dev/.documenter-siteinfo.json index 0b92eb1..7a04e6c 100644 --- a/dev/.documenter-siteinfo.json +++ b/dev/.documenter-siteinfo.json @@ -1 +1 @@ -{"documenter":{"julia_version":"1.10.4","generation_timestamp":"2024-07-03T16:58:01","documenter_version":"1.5.0"}} \ No newline at end of file +{"documenter":{"julia_version":"1.10.4","generation_timestamp":"2024-07-03T22:47:22","documenter_version":"1.5.0"}} \ No newline at end of file diff --git a/dev/api/index.html b/dev/api/index.html index 8724158..1a048e6 100644 --- a/dev/api/index.html +++ b/dev/api/index.html @@ -1,5 +1,33 @@ -API · GeneFinder.jl

The Main ORF type

The main type of the package is ORF which represents an Open Reading Frame.

GeneFinder.ORFType
struct ORF{N,F} <: GenomicFeatures.AbstractGenomicInterval{F}

The ORF struct represents an Open Reading Frame (ORF) in genomics.

Fields

  • groupname::String: The name of the group to which the ORF belongs.
  • first::Int64: The starting position of the ORF.
  • last::Int64: The ending position of the ORF.
  • strand::Strand: The strand on which the ORF is located.
  • frame::Int: The reading frame of the ORF.
  • features::Features: The features associated with the ORF.
  • scheme::Union{Nothing,Function}: The scheme used for the ORF.

Constructor

source
FASTX.sequenceMethod
sequence(i::ORF{N,F})

Extracts the DNA sequence corresponding to the given open reading frame (ORF).

Arguments

  • i::ORF{N,F}: The open reading frame (ORF) for which the DNA sequence needs to be extracted.

Returns

  • The DNA sequence corresponding to the given open reading frame (ORF).
source

Finding ORFs

The function findorfs is the main function of the package. It is a generic method that can handle different gene finding methods.

GeneFinder.findorfsMethod
findorfs(sequence::NucleicSeqOrView{DNAAlphabet{N}}; ::M, kwargs...) where {N, M<:GeneFinderMethod}

This is the main interface method for finding open reading frames (ORFs) in a DNA sequence.

It takes the following required arguments:

  • sequence: The nucleic acid sequence to search for ORFs.
  • method: The algorithm used to find ORFs. It can be NaiveFinder(), NaiveFinderScored(), or another implementation.

Keyword Arguments regardless of the finder method:

  • alternative_start::Bool: A boolean indicating whether to consider alternative start codons. Default is false.
  • minlen::Int: The minimum length of an ORF. Default is 6.
  • scheme::Function: The scoring scheme to use for scoring the sequence from the ORF. Default is nothing.

Returns

A vector of ORF objects representing the found ORFs.

Example

sequence = randdnaseq(120)
+API · GeneFinder.jl

The Main ORF type

The main type of the package is ORF which represents an Open Reading Frame.

GeneFinder.ORFType
struct ORF{N,F} <: GenomicFeatures.AbstractGenomicInterval{F}

The ORF struct represents an Open Reading Frame (ORF) in genomics.

Fields

  • groupname::String: The name of the group to which the ORF belongs.
  • first::Int64: The starting position of the ORF.
  • last::Int64: The ending position of the ORF.
  • strand::Strand: The strand on which the ORF is located.
  • frame::Int: The reading frame of the ORF.
  • features::Features: The features associated with the ORF.
  • scheme::Union{Nothing,Function}: The scheme used for the ORF.

Constructor

ORF{N,F}(
+    groupname::String,
+    first::Int64,
+    last::Int64,
+    strand::Strand,
+    frame::Int,
+    features::Features,
+    scheme::Union{Nothing,Function}
+)
+
+# Example
+
+A full instance `ORF`
+

ORF{4,NaiveFinder}("seq01", 1, 33, STRAND_POS, 1, Features((score = 0.0,)), nothing)


+A partial instance `ORF`
+

ORF{NaiveFinder}(1:33, '+', 1)

source
FASTX.sequenceMethod
sequence(i::ORF{N,F})

Extracts the DNA sequence corresponding to the given open reading frame (ORF).

Arguments

  • i::ORF{N,F}: The open reading frame (ORF) for which the DNA sequence needs to be extracted.

Returns

  • The DNA sequence corresponding to the given open reading frame (ORF).
source
GeneFinder.featuresMethod
features(i::ORF{N,F})

Extracts the features from an ORF object.

Arguments

  • i::ORF{N,F}: An ORF object.

Returns

The features of the ORF object.

source
GeneFinder.sourceMethod
source(i::ORF{N,F})

Get the source sequence associated with the given ORF object.

Arguments

  • i::ORF{N,F}: The ORF object for which to retrieve the source sequence.

Returns

The source sequence associated with the ORF object.

Examples

seq = dna"ATGATGCATGCATGCATGCTAGTAACTAGCTAGCTAGCTAGTAA"
+orfs = findorfs(seq)
+source(orfs[1])
+
+44nt DNA Sequence:
+ATGATGCATGCATGCATGCTAGTAACTAGCTAGCTAGCTAGTAA
Warning

The source method works only if the sequence is defined in the global scope. Otherwise it will throw an error. For instance, a common failure is to define a simple ORF, which by default has "unnamedsource" as its groupname, and then try to get its source sequence.

orf = ORF{NaiveFinder}(1:33, '+', 1)
+source(orf)
+
+ERROR: UndefVarError: `unnamedsource` not defined
+Stacktrace:
+ [1] source(i::ORF{4, NaiveFinder})
+   @ GeneFinder ~/.julia/dev/GeneFinder/src/types.jl:192
+ [2] top-level scope
+   @ REPL[12]:1
source

Finding ORFs

The function findorfs is the main function of the package. It is a generic method that can handle different gene finding methods.

GeneFinder.findorfsMethod
findorfs(sequence::NucleicSeqOrView{DNAAlphabet{N}}; ::M, kwargs...) where {N, M<:GeneFinderMethod}

This is the main interface method for finding open reading frames (ORFs) in a DNA sequence.

It takes the following required arguments:

  • sequence: The nucleic acid sequence to search for ORFs.
  • method: The algorithm used to find ORFs. It can be NaiveFinder(), NaiveFinderScored(), or another implementation.

Keyword Arguments regardless of the finder method:

  • alternative_start::Bool: A boolean indicating whether to consider alternative start codons. Default is false.
  • minlen::Int: The minimum length of an ORF. Default is 6.
  • scheme::Function: The scoring scheme to use for scoring the sequence from the ORF. Default is nothing.

Returns

A vector of ORF objects representing the found ORFs.

Example

sequence = randdnaseq(120)
 
 120nt DNA Sequence:
  GCCGGACAGCGAAGGCTAATAAATGCCCGTGCCAGTATC…TCTGAGTTACTGTACACCCGAAAGACGTTGTACGCATTT
@@ -7,7 +35,7 @@
 findorfs(sequence, NaiveFinder())
 
 1-element Vector{ORF}:
- ORF{NaiveFinder}(77:118, '-', 2, 0.0)
source

Finding ORFs using BioRegex and scoring

GeneFinder.NaiveFinderMethod
NaiveFinder(sequence::NucleicSeqOrView{DNAAlphabet{N}}; kwargs...) -> Vector{ORF} where {N}

A simple implementation that finds ORFs in a DNA sequence.

The NaiveFinder method takes a LongSequence{DNAAlphabet{4}} sequence and returns a Vector{ORF} containing the ORFs found in the sequence. It matches each complete coding sequence (CDS) with a regular expression, adding each ORF it finds to the vector. The function also searches the reverse complement of the sequence, so it finds ORFs on both strands. Setting alternative_start = true extends the start codons so that the search covers ATG, GTG, and TTG. Some studies have shown that in E. coli (K-12 strain), ATG, GTG and TTG are used 83 %, 14 % and 3 % of the time, respectively.

Note

By default, this function applies neither an ORF scoring scheme nor a length constraint. Thus it might consider aa"M*" a possible encoded protein among the resulting ORFs.

Required Arguments

  • sequence::NucleicSeqOrView{DNAAlphabet{N}}: The nucleic acid sequence to search for ORFs.

Keyword Arguments

  • alternative_start::Bool: If true, the extended start codons are included in the search. This increases the execution time about threefold. Default is false.
  • minlen::Int64=6: The minimum allowed ORF length. The default value allows aa"M*" as a possible encoded protein among the resulting ORFs.
  • scheme::Function: The scoring scheme to use for scoring the sequence from the ORF. Default is nothing.
Note

As the scheme is generally a scoring function that at least requires a sequence, one simple scheme is the log-odds ratio score. This score is a log-odds ratio that compares the probability of the sequence generated by a coding model to the probability of the sequence generated by a non-coding model:

\[S(x) = \sum_{i=1}^{L} \beta_{x_{i-1}x_{i}} = \sum_{i=1}^{L} \log \frac{a^{\mathscr{m}_{1}}_{x_{i-1}x_{i}}}{a^{\mathscr{m}_{2}}_{x_{i-1}x_{i}}}\]

If the log-odds ratio exceeds a given threshold (η), the sequence is considered likely to be coding. See lordr for more information about coding criteria.

source
GeneFinder._locationiteratorMethod
locationiterator(sequence::NucleicSeqOrView{DNAAlphabet{N}}; alternative_start::Bool=false) where {N}

This is an iterator function that uses regular expressions to search the entire ORF (instead of start and stop codons) in a LongSequence{DNAAlphabet{4}} sequence. It uses an anonymous function that will find the first regularly expressed ORF. Then using this anonymous function it creates an iterator that will apply it until there is no other CDS.

Note

As a note of the implementation we want to expand on how the ORFs are found:

The expression (?:[N]{3})*? serves as the boundary between the start and stop codons. Within this expression, the character class [N]{3} captures exactly three occurrences of any character (representing nucleotides using IUPAC codes). This portion functions as the regular codon matches. Since it is enclosed within a non-capturing group (?:) and followed by *?, it allows for the matching of intermediate codons, but with a preference for the smallest number of repetitions.

In summary, the regular expression ATG(?:[N]{3})*?T(AG|AA|GA) identifies patterns that start with "ATG," followed by any number of three-character codons (represented by "N" in the IUPAC code), and ends with a stop codon "TAG," "TAA," or "TGA." This pattern is commonly used to identify potential protein-coding regions within genetic sequences.

See more about the discussion here

source

<!– ## Geting ORFs sequences

–>

Writing ORFs to files

GeneFinder.write_orfs_bedMethod
write_orfs_bed(input::NucleicSeqOrView{DNAAlphabet{N}}, output::Union{IOStream, IOBuffer}, finder::F; kwargs...)
+ ORF{NaiveFinder}(77:118, '-', 2, 0.0)
source

Finding ORFs using BioRegex and scoring

GeneFinder.NaiveFinderMethod
NaiveFinder(sequence::NucleicSeqOrView{DNAAlphabet{N}}; kwargs...) -> Vector{ORF} where {N}

A simple implementation that finds ORFs in a DNA sequence.

The NaiveFinder method takes a LongSequence{DNAAlphabet{4}} sequence and returns a Vector{ORF} containing the ORFs found in the sequence. It matches each complete coding sequence (CDS) with a regular expression, adding each ORF it finds to the vector. The function also searches the reverse complement of the sequence, so it finds ORFs on both strands. Setting alternative_start = true extends the start codons so that the search covers ATG, GTG, and TTG. Some studies have shown that in E. coli (K-12 strain), ATG, GTG and TTG are used 83 %, 14 % and 3 % of the time, respectively.
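As an illustration of the two-strand search described above, here is a minimal sketch in plain Base Julia. It is not the package's implementation: the names COMPLEMENT, revcomp and forward_coords are hypothetical, and plain strings stand in for BioSequences types. It shows how a sequence can be reverse-complemented and how a range found on the reverse complement maps back to forward-strand coordinates.

```julia
# Watson-Crick complements for the four unambiguous DNA letters (hypothetical helper).
const COMPLEMENT = Dict('A' => 'T', 'T' => 'A', 'C' => 'G', 'G' => 'C')

# Reverse complement of a DNA sequence stored as a plain string.
revcomp(seq::AbstractString) = join(COMPLEMENT[c] for c in reverse(seq))

# Map a range found on the reverse complement back to forward-strand
# coordinates, for a sequence of length n.
forward_coords(r::UnitRange{Int}, n::Int) = (n - last(r) + 1):(n - first(r) + 1)

revcomp("ATGC")            # reverse complement of a short toy sequence
forward_coords(1:3, 10)    # first three bases of the reverse strand, in forward coordinates
```

Searching both the sequence and its reverse complement, then converting coordinates this way, is one common way to report minus-strand ORFs on the forward axis.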

Note

By default, this function applies neither an ORF scoring scheme nor a length constraint. Thus it might consider aa"M*" a possible encoded protein among the resulting ORFs.

Required Arguments

  • sequence::NucleicSeqOrView{DNAAlphabet{N}}: The nucleic acid sequence to search for ORFs.

Keyword Arguments

  • alternative_start::Bool: If true, the extended start codons are included in the search. This increases the execution time about threefold. Default is false.
  • minlen::Int64=6: The minimum allowed ORF length. The default value allows aa"M*" as a possible encoded protein among the resulting ORFs.
  • scheme::Function: The scoring scheme to use for scoring the sequence from the ORF. Default is nothing.
Note

As the scheme is generally a scoring function that at least requires a sequence, one simple scheme is the log-odds ratio score. This score is a log-odds ratio that compares the probability of the sequence generated by a coding model to the probability of the sequence generated by a non-coding model:

\[S(x) = \sum_{i=1}^{L} \beta_{x_{i-1}x_{i}} = \sum_{i=1}^{L} \log \frac{a^{\mathscr{m}_{1}}_{x_{i-1}x_{i}}}{a^{\mathscr{m}_{2}}_{x_{i-1}x_{i}}}\]

If the log-odds ratio exceeds a given threshold (η), the sequence is considered likely to be coding. See lordr for more information about coding criteria.
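To make the log-odds idea concrete, the following is a minimal sketch in plain Base Julia, not the lors implementation from BioMarkovChains.jl. The function logodds_score, the NUC_INDEX lookup and both toy transition matrices are assumptions made up for illustration. The sketch uses a base-2 logarithm and sums over observed transitions only, so it starts at the second nucleotide and ignores any initial-state term.

```julia
# Row/column index of each nucleotide in the toy transition matrices (hypothetical).
const NUC_INDEX = Dict('A' => 1, 'C' => 2, 'G' => 3, 'T' => 4)

# Sum of log2 ratios between coding and non-coding transition probabilities
# over every consecutive pair of nucleotides in seq.
function logodds_score(seq::AbstractString, coding::Matrix{Float64}, noncoding::Matrix{Float64})
    score = 0.0
    for i in 2:length(seq)
        from = NUC_INDEX[seq[i - 1]]
        to = NUC_INDEX[seq[i]]
        score += log2(coding[from, to] / noncoding[from, to])
    end
    return score
end

# Toy models: the non-coding model is uniform, while the coding model
# favours the A -> T transition. These numbers are invented for the example.
noncoding = fill(0.25, 4, 4)
coding = [0.10 0.20 0.20 0.50;
          0.25 0.25 0.25 0.25;
          0.25 0.25 0.25 0.25;
          0.25 0.25 0.25 0.25]

logodds_score("ATA", coding, noncoding)   # the A -> T step contributes log2(0.5 / 0.25)
```

A positive total means the toy coding model explains the sequence better than the toy non-coding model. Comparing that total against a threshold gives the decision rule described above.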

source
GeneFinder._locationiteratorMethod
_locationiterator(sequence::NucleicSeqOrView{DNAAlphabet{N}}; alternative_start::Bool=false) where {N}

This is an iterator function that uses regular expressions to search the entire ORF (instead of start and stop codons) in a LongSequence{DNAAlphabet{4}} sequence. It uses an anonymous function that will find the first regularly expressed ORF. Then using this anonymous function it creates an iterator that will apply it until there is no other CDS.

Note

As a note of the implementation we want to expand on how the ORFs are found:

The expression (?:[N]{3})*? serves as the boundary between the start and stop codons. Within this expression, the character class [N]{3} captures exactly three occurrences of any character (representing nucleotides using IUPAC codes). This portion functions as the regular codon matches. Since it is enclosed within a non-capturing group (?:) and followed by *?, it allows for the matching of intermediate codons, but with a preference for the smallest number of repetitions.

In summary, the regular expression ATG(?:[N]{3})*?T(AG|AA|GA) identifies patterns that start with "ATG," followed by any number of three-character codons (represented by "N" in the IUPAC code), and ends with a stop codon "TAG," "TAA," or "TGA." This pattern is commonly used to identify potential protein-coding regions within genetic sequences.
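The pattern above can be tried out with Julia's built-in regular expressions. This is only a sketch under stated assumptions: it works on a plain String rather than a BioSequence, it replaces the IUPAC symbol N with the explicit class [ACGT], and the names ORF_PATTERN and naive_orf_ranges (including its minlen keyword) are hypothetical, not part of the package.

```julia
# Start codon, a lazy run of whole codons, then one of the stops TAG, TAA or TGA.
const ORF_PATTERN = r"ATG(?:[ACGT]{3})*?T(?:AG|AA|GA)"

# Collect the coordinate range of every match. overlap = true lets a later
# ATG inside an earlier match start its own ORF.
function naive_orf_ranges(seq::AbstractString; minlen::Int = 6)
    ranges = UnitRange{Int}[]
    for m in eachmatch(ORF_PATTERN, seq; overlap = true)
        r = m.offset:(m.offset + length(m.match) - 1)
        length(r) >= minlen && push!(ranges, r)
    end
    return ranges
end

seq = "ATGATGCATGCATGCATGCTAGTAACTAGCTAGCTAGCTAGTAA"
orf_ranges = naive_orf_ranges(seq)
```

Because the codon group is lazy, each match stops at the first in-frame stop codon after its start, so every returned range has a length that is a multiple of three.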

See more about the discussion here

source

Writing ORFs to files

GeneFinder.write_orfs_bedMethod
write_orfs_bed(input::NucleicSeqOrView{DNAAlphabet{N}}, output::Union{IOStream, IOBuffer}, finder::F; kwargs...)
 write_orfs_bed(input::NucleicSeqOrView{DNAAlphabet{N}}, output::String, finder::F; kwargs...)

Write BED data to a file.

Arguments

  • input: The input DNA sequence NucSeq or a view.
  • output: The output destination; it can be a file path (String) or a buffer (IOStream or IOBuffer).
  • finder: The algorithm used to find ORFs. It can be either NaiveFinder() or NaiveFinderScored().

Keywords

  • alternative_start::Bool=false: If true, alternative start codons will be used when identifying CDSs. Default is false.
  • minlen::Int64=6: The minimum length that a CDS must have in order to be included in the output file. Default is 6.
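The two signatures above follow a common Julia pattern: one method writes to any IO, and a second method accepts a path, opens the file and delegates to the first. Here is a minimal sketch of that pattern in plain Base Julia. The function write_ranges and its tab-separated output are hypothetical and do not reproduce the columns that write_orfs_bed actually emits.

```julia
# Write one tab-separated line per range to any IO (a file stream or an IOBuffer).
function write_ranges(ranges, output::IO)
    for r in ranges
        println(output, first(r), '\t', last(r))
    end
    return nothing
end

# Convenience method: take a file path, open it and delegate to the IO method.
function write_ranges(ranges, output::AbstractString)
    open(output, "w") do io
        write_ranges(ranges, io)
    end
    return nothing
end

# Writing to an in-memory buffer is handy for tests and quick inspection.
buf = IOBuffer()
write_ranges([1:33, 4:33], buf)
written = String(take!(buf))
```

Keeping the formatting logic in the IO method means the path method stays a thin wrapper, and both destinations produce identical output.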
source
GeneFinder.write_orfs_faaMethod
write_orfs_faa(input::NucleicSeqOrView{DNAAlphabet{4}}, output::Union{IOStream, IOBuffer}, finder::F; kwargs...)
 write_orfs_faa(input::NucleicSeqOrView{DNAAlphabet{4}}, output::String, finder::F; kwargs...)

Write the protein sequences encoded by the coding sequences (CDSs) of a given DNA sequence to the specified file.

Arguments

  • input: The input DNA sequence NucSeq or a view.
  • output: The output destination; it can be a file path (String) or a buffer (IOStream or IOBuffer).
  • finder: The algorithm used to find ORFs. It can be either NaiveFinder() or NaiveFinderScored().

Keywords

  • code::GeneticCode=BioSequences.standard_genetic_code: The genetic code by which codons will be translated. See BioSequences.ncbi_trans_table for more info.
  • alternative_start::Bool=false: If true, the extended start codons are included in the search. This increases the execution time about threefold.
  • minlen::Int64=6: The minimum allowed ORF length. The default value allows aa"M*" as a possible encoded protein among the resulting ORFs.

Examples

filename = "output.faa"
 
@@ -23,4 +51,4 @@
 open(filename, "w") do file
      write_orfs_fna(seq, file, NaiveFinder())
 end
source
GeneFinder.write_orfs_gffMethod
write_orfs_gff(input::NucleicSeqOrView{DNAAlphabet{N}}, output::Union{IOStream, IOBuffer}, finder::F; kwargs...)
-write_orfs_gff(input::NucleicSeqOrView{DNAAlphabet{N}}, output::String, finder::F; kwargs...)

Write GFF data to a file.

Arguments

  • input: The input DNA sequence NucSeq or a view.
  • output: The output destination; it can be a file path (String) or a buffer (IOStream or IOBuffer).
  • finder: The algorithm used to find ORFs. It can be either NaiveFinder() or NaiveFinderScored().

Keywords

  • code::GeneticCode=BioSequences.standard_genetic_code: The genetic code by which codons will be translated. See BioSequences.ncbi_trans_table for more info.
  • alternative_start::Bool=false: If true, the extended start codons are included in the search. This increases the execution time about threefold.
  • minlen::Int64=6: The minimum allowed ORF length. The default value allows aa"M*" as a possible encoded protein among the resulting ORFs.
source
+write_orfs_gff(input::NucleicSeqOrView{DNAAlphabet{N}}, output::String, finder::F; kwargs...)

Write GFF data to a file.

Arguments

  • input: The input DNA sequence NucSeq or a view.
  • output: The output destination; it can be a file path (String) or a buffer (IOStream or IOBuffer).
  • finder: The algorithm used to find ORFs. It can be either NaiveFinder() or NaiveFinderScored().

Keywords

  • code::GeneticCode=BioSequences.standard_genetic_code: The genetic code by which codons will be translated. See BioSequences.ncbi_trans_table for more info.
  • alternative_start::Bool=false: If true, the extended start codons are included in the search. This increases the execution time about threefold.
  • minlen::Int64=6: The minimum allowed ORF length. The default value allows aa"M*" as a possible encoded protein among the resulting ORFs.
source
diff --git a/dev/assets/lors-lamda.png b/dev/assets/lors-lamda.png new file mode 100644 index 0000000..678f9aa Binary files /dev/null and b/dev/assets/lors-lamda.png differ diff --git a/dev/features/index.html b/dev/features/index.html new file mode 100644 index 0000000..463c7e4 --- /dev/null +++ b/dev/features/index.html @@ -0,0 +1,113 @@ + +Scoring ORFs · GeneFinder.jl

The ORF features

The ORF type is designed to be flexible and can store various types of information about the ORF. This versatility allows it to hold data such as the score of the ORF based on a scoring function, the sequence of the ORF, or even the translated amino acid sequence. For example, in the NaiveFinder method, the score subfield is utilized to store the score of the ORF obtained from the scoring function. This capability is possible because the ORF type not only captures structural details of the ORF, such as the range, strand, and frame, but also provides a convenient field called Features for additional information.

phi = dna"GTGTGAGGTTATAACGCCGAAGCGGTAAAAATTTTAATTTTTGCCGCTGAGGGGTTGACCAAGCGAAGCGCGGTAGGTTTTCTGCTTAGGAGTTTAATCATGTTTCAGACTTTTATTTCTCGCCATAATTCAAACTTTTTTTCTGATAAGCTGGTTCTCACTTCTGTTACTCCAGCTTCTTCGGCACCTGTTTTACAGACACCTAAAGCTACATCGTCAACGTTATATTTTGATAGTTTGACGGTTAATGCTGGTAATGGTGGTTTTCTTCATTGCATTCAGATGGATACATCTGTCAACGCCGCTAATCAGGTTGTTTCTGTTGGTGCTGATATTGCTTTTGATGCCGACCCTAAATTTTTTGCCTGTTTGGTTCGCTTTGAGTCTTCTTCGGTTCCGACTACCCTCCCGACTGCCTATGATGTTTATCCTTTGAATGGTCGCCATGATGGTGGTTATTATACCGTCAAGGACTGTGTGACTATTGACGTCCTTCCCCGTACGCCGGGCAATAACGTTTATGTTGGTTTCATGGTTTGGTCTAACTTTACCGCTACTAAATGCCGCGGATTGGTTTCGCTGAATCAGGTTATTAAAGAGATTATTTGTCTCCAGCCACTTAAGTGAGGTGATTTATGTTTGGTGCTATTGCTGGCGGTATTGCTTCTGCTCTTGCTGGTGGCGCCATGTCTAAATTGTTTGGAGGCGGTCAAAAAGCCGCCTCCGGTGGCATTCAAGGTGATGTGCTTGCTACCGATAACAATACTGTAGGCATGGGTGATGCTGGTATTAAATCTGCCATTCAAGGCTCTAATGTTCCTAACCCTGATGAGGCCGCCCCTAGTTTTGTTTCTGGTGCTATGGCTAAAGCTGGTAAAGGACTTCTTGAAGGTACGTTGCAGGCTGGCACTTCTGCCGTTTCTGATAAGTTGCTTGATTTGGTTGGACTTGGTGGCAAGTCTGCCGCTGATAAAGGAAAGGATACTCGTGATTATCTTGCTGCTGCATTTCCTGAGCTTAATGCTTGGGAGCGTGCTGGTGCTGATGCTTCCTCTGCTGGTATGGTTGACGCCGGATTTGAGAATCAAAAAGAGCTTACTAAAATGCAACTGGACAATCAGAAAGAGATTGCCGAGATGCAAAATGAGACTCAAAAAGAGATTGCTGGCATTCAGTCGGCGACTTCACGCCAGAATACGAAAGACCAGGTATATGCACAAAATGAGATGCTTGCTTATCAACAGAAGGAGTCTACTGCTCGCGTTGCGTCTATTATGGAAAACACCAATCTTTCCAAGCAACAGCAGGTTTCCGAGATTATGCGCCAAATGCTTACTCAAGCTCAAACGGCTGGTCAGTATTTTACCAATGACCAAATCAAAGAAATGACTCGCAAGGTTAGTGCTGAGGTTGACTTAGTTCATCAGCAAACGCAGAATCAGCGGTATGGCTCTTCTCATATTGGCGCTACTGCAAAGGATATTTCTAATGTCGTCACTGATGCTGCTTCTGGTGTGGTTGATATTTTTCATGGTATTGATAAAGCTGTTGCCGATACTTGGAACAATTTCTGGAAAGACGGTAAAGCTGATGGTATTGGCTCTAATTTGTCTAGGAAATAACCGTCAGGATTGACACCCTCCCAATTGTATGTTTTCATGCCTCCAAATCTTGGAGGCTTTTTTATGGTTCGTTCTTATTACCCTTCTGAATGTCACGCTGATTATTTTGACTTTGAGCGTATCGAGGCTCTTAAACCTGCTATTGAGGCTTGTGGCATTTCTACTCTTTCTCAATCCCCAATGCTTGGCTTCCATAAGCAGATGGATAACCGCATCAAGCTCTTGGAAGAGATTCTGTCTTTTCGTATGCAGGGCGTTGAGTTCGATAATGGTGATATGTATGTTGACGGCCATAAGGCTGCTTCTGACGTTCGTGATGAGTTTGTATCTGTTACTGAGAAGTTAATGGATGAATTGGCACAATGCTA
CAATGTGCTCCCCCAACTTGATATTAATAACACTATAGACCACCGCCCCGAAGGGGACGAAAAATGGTTTTTAGAGAACGAGAAGACGGTTACGCAGTTTTGCCGCAAGCTGGCTGCTGAACGCCCTCTTAAGGATATTCGCGATGAGTATAATTACCCCAAAAAGAAAGGTATTAAGGATGAGTGTTCAAGATTGCTGGAGGCCTCCACTATGAAATCGCGTAGAGGCTTTGCTATTCAGCGTTTGATGAATGCAATGCGACAGGCTCATGCTGATGGTTGGTTTATCGTTTTTGACACTCTCACGTTGGCTGACGACCGATTAGAGGCGTTTTATGATAATCCCAATGCTTTGCGTGACTATTTTCGTGATATTGGTCGTATGGTTCTTGCTGCCGAGGGTCGCAAGGCTAATGATTCACACGCCGACTGCTATCAGTATTTTTGTGTGCCTGAGTATGGTACAGCTAATGGCCGTCTTCATTTCCATGCGGTGCACTTTATGCGGACACTTCCTACAGGTAGCGTTGACCCTAATTTTGGTCGTCGGGTACGCAATCGCCGCCAGTTAAATAGCTTGCAAAATACGTGGCCTTATGGTTACAGTATGCCCATCGCAGTTCGCTACACGCAGGACGCTTTTTCACGTTCTGGTTGGTTGTGGCCTGTTGATGCTAAAGGTGAGCCGCTTAAAGCTACCAGTTATATGGCTGTTGGTTTCTATGTGGCTAAATACGTTAACAAAAAGTCAGATATGGACCTTGCTGCTAAAGGTCTAGGAGCTAAAGAATGGAACAACTCACTAAAAACCAAGCTGTCGCTACTTCCCAAGAAGCTGTTCAGAATCAGAATGAGCCGCAACTTCGGGATGAAAATGCTCACAATGACAAATCTGTCCACGGAGTGCTTAATCCAACTTACCAAGCTGGGTTACGACGCGACGCCGTTCAACCAGATATTGAAGCAGAACGCAAAAAGAGAGATGAGATTGAGGCTGGGAAAAGTTACTGTAGCCGACGTTTTGGCGGCGCAACCTGTGACGACAAATCTGCTCAAATTTATGCGCGCTTCGATAAAAATGATTGGCGTATCCAACCTGCAGAGTTTTATCGCTTCCATGACGCAGAAGTTAACACTTTCGGATATTTCTGATGAGTCGAAAAATTATCTTGATAAAGCAGGAATTACTACTGCTTGTTTACGAATTAAATCGAAGTGGACTGCTGGCGGAAAATGAGAAAATTCGACCTATCCTTGCGCAGCTCGAGAAGCTCTTACTTTGCGACCTTTCGCCATCAACTAACGATTCTGTCAAAAACTGACGCGTTGGATGAGGAGAAGTGGCTTAATATGCTTGGCACGTTCGTCAAGGACTGGTTTAGATATGAGTCACATTTTGTTCATGGTAGAGATTCTCTTGTTGACATTTTAAAAGAGCGTGGATTACTATCTGAGTCCGATGCTGTTCAACCACTAATAGGTAAGAAATCATGAGTCAAGTTACTGAACAATCCGTACGTTTCCAGACCGCTTTGGCCTCTATTAAGCTCATTCAGGCTTCTGCCGTTTTGGATTTAACCGAAGATGATTTCGATTTTCTGACGAGTAACAAAGTTTGGATTGCTACTGACCGCTCTCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTACGCTGGACTTTGTGGGATACCCTCGCTTTCCTGCTCCTGTTGAGTTTATTGCTGCCGTCATTGCTTATTATGTTCATCCCGTCAACATTCAAACGGCCTGTCTCATCATGGAAGGCGCTGAATTTACGGAAAACATTATTAATGGCGTCGAGCGTCCGGTTAAAGCCGCTGAATTGTTCGCGTTTACCTTGCGTGTACGCGCAGGAAACACTGACGTTCTTACTGACGCAGAAGAAAACGTGCGTCAAAAATTACGTGCGGAAGGAGTGATGTAATGTCTAAAGGTAAAAAACGTTCTGGCGCTCGCCCTGGTCGTCCGCAGCCGT
TGCGAGGTACTAAAGGCAAGCGTAAAGGCGCTCGTCTTTGGTATGTAGGTGGTCAACAATTTTAATTGCAGGGGCTTCGGCCCCTTACTTGAGGATAAATTATGTCTAATATTCAAACTGGCGCCGAGCGTATGCCGCATGACCTTTCCCATCTTGGCTTCCTTGCTGGTCAGATTGGTCGTCTTATTACCATTTCAACTACTCCGGTTATCGCTGGCGACTCCTTCGAGATGGACGCCGTTGGCGCTCTCCGTCTTTCTCCATTGCGTCGTGGCCTTGCTATTGACTCTACTGTAGACATTTTTACTTTTTATGTCCCTCATCGTCACGTTTATGGTGAACAGTGGATTAAGTTCATGAAGGATGGTGTTAATGCCACTCCTCTCCCGACTGTTAACACTACTGGTTATATTGACCATGCCGCTTTTCTTGGCACGATTAACCCTGATACCAATAAAATCCCTAAGCATTTGTTTCAGGGTTATTTGAATATCTATAACAACTATTTTAAAGCGCCGTGGATGCCTGACCGTACCGAGGCTAACCCTAATGAGCTTAATCAAGATGATGCTCGTTATGGTTTCCGTTGCTGCCATCTCAAAAACATTTGGACTGCTCCGCTTCCTCCTGAGACTGAGCTTTCTCGCCAAATGACGACTTCTACCACATCTATTGACATTATGGGTCTGCAAGCTGCTTATGCTAATTTGCATACTGACCAAGAACGTGATTACTTCATGCAGCGTTACCATGATGTTATTTCTTCATTTGGAGGTAAAACCTCTTATGACGCTGACAACCGTCCTTTACTTGTCATGCGCTCTAATCTCTGGGCATCTGGCTATGATGTTGATGGAACTGACCAAACGTCGTTAGGCCAGTTTTCTGGTCGTGTTCAACAGACCTATAAACATTCTGTGCCGCGTTTCTTTGTTCCTGAGCATGGCACTATGTTTACTCTTGCGCTTGTTCGTTTTCCGCCTACTGCGACTAAAGAGATTCAGTACCTTAACGCTAAAGGTGCTTTGACTTATACCGATATTGCTGGCGACCCTGTTTTGTATGGCAACTTGCCGCCGCGTGAAATTTCTATGAAGGATGTTTTCCGTTCTGGTGATTCGTCTAAGAAGTTTAAGATTGCTGAGGGTCAGTGGTATCGTTATGCGCCTTCGTATGTTTCTCCTGCTTATCACCTTCTTGAAGGCTTCCCATTCATTCAGGAACCGCCTTCTGGTGATTTGCAAGAACGCGTACTTATTCGCCACCATGATTATGACCAGTGTTTCCAGTCCGTTCAGTTGTTGCAGTGGAATAGTCAGGTTAAATTTAATGTGACCGTTTATCGCAATCTGCCGACCACTCGCGATTCAATCATGACTTCGTGATAAAAGATTGA"
+
+phiorfs = findorfs(phi, finder=NaiveFinder, minlen=75, scheme=lors)
+
+124-element Vector{ORF{4, NaiveFinder}}:
+ ORF{NaiveFinder}(9:101, '-', 3)
+ ORF{NaiveFinder}(100:627, '+', 1)
+ ORF{NaiveFinder}(223:447, '-', 1)
+ ORF{NaiveFinder}(248:436, '+', 2)
+ ORF{NaiveFinder}(257:436, '+', 2)
+ ORF{NaiveFinder}(283:627, '+', 1)
+ ORF{NaiveFinder}(344:436, '+', 2)
+ ORF{NaiveFinder}(532:627, '+', 1)
+ ORF{NaiveFinder}(636:1622, '+', 3)
+ ORF{NaiveFinder}(687:1622, '+', 3)
+ ORF{NaiveFinder}(774:1622, '+', 3)
+ ORF{NaiveFinder}(781:1389, '+', 1)
+ ORF{NaiveFinder}(814:1389, '+', 1)
+ ORF{NaiveFinder}(829:1389, '+', 1)
+ ORF{NaiveFinder}(861:1622, '+', 3)
+ ⋮
+ ORF{NaiveFinder}(4671:5375, '+', 3)
+ ORF{NaiveFinder}(4690:4866, '+', 1)
+ ORF{NaiveFinder}(4728:5375, '+', 3)
+ ORF{NaiveFinder}(4741:4866, '+', 1)
+ ORF{NaiveFinder}(4744:4866, '+', 1)
+ ORF{NaiveFinder}(4777:4866, '+', 1)
+ ORF{NaiveFinder}(4806:5375, '+', 3)
+ ORF{NaiveFinder}(4863:5258, '-', 3)
+ ORF{NaiveFinder}(4933:5019, '+', 1)
+ ORF{NaiveFinder}(4941:5375, '+', 3)
+ ORF{NaiveFinder}(5082:5375, '+', 3)
+ ORF{NaiveFinder}(5089:5325, '+', 1)
+ ORF{NaiveFinder}(5122:5202, '-', 1)
+ ORF{NaiveFinder}(5152:5325, '+', 1)
+ ORF{NaiveFinder}(5164:5325, '+', 1)

In the example above we calculated a score using the lors scoring scheme (see lors from the BioMarkovChains.jl package). The score is stored in the score subfield of the ORF.

All features can be accessed using a convenient function called features, which returns a NamedTuple with the features of the ORF and can be broadcast over the entire collection of ORFs using the . syntax.

features.(phiorfs)
+
+124-element Vector{@NamedTuple{score::Float64}}:
+ (score = -3.002461366087374,)
+ (score = -10.814621287968222,)
+ (score = -5.344187934894264,)
+ (score = -1.316724559874126,)
+ (score = -1.796631200562138,)
+ (score = -3.2651518608269856,)
+ (score = -1.4019264441082822,)
+ (score = -2.3192349590107475,)
+ (score = 5.055524446434241,)
+ (score = 2.7116397224896436,)
+ (score = 2.2564640592402165,)
+ (score = 1.777499581940097,)
+ (score = 2.3474811908011186,)
+ (score = 2.38568188352799,)
+ (score = 2.498608044469827,)
+ ⋮
+ (score = -5.474837954151803,)
+ (score = 0.6909362932156138,)
+ (score = -5.900045211699447,)
+ (score = 1.2010656615619415,)
+ (score = 0.8541931309205604,)
+ (score = 2.7897961643147777,)
+ (score = -4.42890346770467,)
+ (score = -5.40624241726446,)
+ (score = -0.8080572222081075,)
+ (score = -5.571494087742448,)
+ (score = -4.882156920421228,)
+ (score = -5.639670353834974,)
+ (score = -0.8764121443326865,)
+ (score = -4.308687693802273,)
+ (score = -4.459423419810693,)
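Since the broadcast returns a vector of NamedTuples, the individual scores can be pulled out with ordinary Base Julia tools. The sketch below uses a small hand-written vector with invented values as a stand-in for features.(phiorfs); the variable names are hypothetical.

```julia
# Stand-in for the output of features.(phiorfs); the values are invented.
feats = [(score = -3.0,), (score = 5.05,), (score = 2.71,)]

# Broadcast getproperty to turn the vector of NamedTuples into a plain vector of scores.
scores = getproperty.(feats, :score)

# Indices of the entries whose score is positive.
positive = findall(>(0), scores)
```

The same indices could then be used to subset the original collection of ORFs.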

Analysing Lambda ORFs

In this case, lors calculates the log-odds ratio of the ORF sequence given two Markov models (by default ECOLICDS and ECOLINOCDS), one for the coding region and one for the non-coding region. The score is stored in the score field of the NamedTuple returned by the features function. By default, the lors function returns the base-2 logarithm of the odds ratio, so the score is analogous to bits of information supporting that the ORF sequence is coding.
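Because the score is a base-2 logarithm, it can be converted back into a plain odds ratio by exponentiation. The one-line helper below is a hypothetical illustration, not a function of the package.

```julia
# Convert a log-odds score in bits back to an odds ratio.
odds_ratio(bits::Real) = 2.0^bits

odds_ratio(3.0)   # three bits: the coding model is favoured eight to one
odds_ratio(0.0)   # zero bits: both models explain the sequence equally well
```

Read this way, each additional bit doubles the odds in favour of the coding model, and each negative bit halves them.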

Now we can even analyse how the ORF scores are distributed as a function of ORF length, compared to random sequences.


+lambda = fasta_to_dna("test/data/NC_001416.1.fasta")[1]
+
+lambdaorfs = findorfs(lambda, finder=NaiveFinder, minlen=100, scheme=lors)
+
+lambdascores = score.(lambdaorfs)
+lambdalengths = length.(lambdaorfs)
+
+## get some random sequences of variable lengths
+vseqs = LongDNA[]
+for i in 1:708
+    push!(vseqs, randdnaseq(rand(100:1000)))
+end
+
+## get the lengths and scores of the random generated sequences
+randlengths = length.(vseqs)
+randscores = lors.(vseqs)
+
+## plot the scores as a function of the lengths
+using CairoMakie
+
+f = Figure()
+ax = Axis(f[1, 1], xlabel="Length", ylabel="Log-odds ratio (Bits)")
+
+scatter!(ax,
+    randlengths,
+    randscores,
+    marker = :circle, 
+    markersize = 6, 
+    color = :black, 
+    label = "Random sequences"
+)
+scatter!(ax,
+    lambdalengths, 
+    lambdascores, 
+    marker = :rect, 
+    markersize = 6, 
+    color = :blue, 
+    label = "Lambda ORFs"
+)
+
+axislegend(ax)
+
+f

diff --git a/dev/index.html b/dev/index.html index e79538d..2db0494 100644 --- a/dev/index.html +++ b/dev/index.html @@ -1,5 +1,5 @@ -Home · GeneFinder.jl
+Home · GeneFinder.jl


A Gene Finder framework for Julia. @@ -41,4 +41,4 @@ version = {v0.3.0}, year = {2024}, month = {04} -}

+}
diff --git a/dev/iodocs/index.html b/dev/iodocs/index.html index 4cdf34c..8c235cd 100644 --- a/dev/iodocs/index.html +++ b/dev/iodocs/index.html @@ -1,5 +1,5 @@ -Writing ORFs In Files · GeneFinder.jl

Writing ORFs into bioinformatic formats

This package facilitates the creation of FASTA, BED, and GFF files, specifically extracting Open Reading Frame (ORF) information from BioSequence instances, particularly those of type NucleicSeqOrView{A} where A, and then writing the information into the desired format.

Functionality:

The package provides four distinct functions for writing files in different formats:

FunctionDescription
write_orfs_fnaWrites nucleotide sequences in FASTA format.
write_orfs_faaWrites amino acid sequences in FASTA format.
write_orfs_bedOutputs information in BED format.
write_orfs_gffGenerates files in GFF format.

All these functions support processing both BioSequence instances and external FASTA files. To write the ORFs of a BioSequence instance to an external file, simply provide the output path as a String. To demonstrate the use of the write_* methods with a BioSequence, consider the following example:

using BioSequences, GeneFinder
+Writing ORFs In Files · GeneFinder.jl

Writing ORFs into bioinformatic formats

This package facilitates the creation of FASTA, BED, and GFF files, specifically extracting Open Reading Frame (ORF) information from BioSequence instances, particularly those of type NucleicSeqOrView{A} where A, and then writing the information into the desired format.

Functionality:

The package provides four distinct functions for writing files in different formats:

FunctionDescription
write_orfs_fnaWrites nucleotide sequences in FASTA format.
write_orfs_faaWrites amino acid sequences in FASTA format.
write_orfs_bedOutputs information in BED format.
write_orfs_gffGenerates files in GFF format.

All these functions support processing both BioSequence instances and external FASTA files. To write the ORFs of a BioSequence instance to an external file, simply provide the output path as a String. To demonstrate the use of the write_* methods with a BioSequence, consider the following example:

using BioSequences, GeneFinder
 
 # > 180195.SAMN03785337.LFLS01000089 -> finds only 1 gene in Prodigal (from Pyrodigal tests)
 seq = dna"AACCAGGGCAATATCAGTACCGCGGGCAATGCAACCCTGACTGCCGGCGGTAACCTGAACAGCACTGGCAATCTGACTGTGGGCGGTGTTACCAACGGCACTGCTACTACTGGCAACATCGCACTGACCGGTAACAATGCGCTGAGCGGTCCGGTCAATCTGAATGCGTCGAATGGCACGGTGACCTTGAACACGACCGGCAATACCACGCTCGGTAACGTGACGGCACAAGGCAATGTGACGACCAATGTGTCCAACGGCAGTCTGACGGTTACCGGCAATACGACAGGTGCCAACACCAACCTCAGTGCCAGCGGCAACCTGACCGTGGGTAACCAGGGCAATATCAGTACCGCAGGCAATGCAACCCTGACGGCCGGCGACAACCTGACGAGCACTGGCAATCTGACTGTGGGCGGCGTCACCAACGGCACGGCCACCACCGGCAACATCGCGCTGACCGGTAACAATGCACTGGCTGGTCCTGTCAATCTGAACGCGCCGAACGGCACCGTGACCCTGAACACAACCGGCAATACCACGCTGGGTAATGTCACCGCACAAGGCAATGTGACGACTAATGTGTCCAACGGCAGCCTGACAGTCGCTGGCAATACCACAGGTGCCAACACCAACCTGAGTGCCAGCGGCAATCTGACCGTGGGCAACCAGGGCAATATCAGTACCGCGGGCAATGCAACCCTGACTGCCGGCGGTAACCTGAGC"

Once a BioSequence object has been instantiated, the write_orfs_fna function proves useful for generating a FASTA file containing the nucleotide sequences of the ORFs. Notably, the write_orfs* methods support either an IOStream or an IOBuffer as an output argument, allowing flexibility in directing the output either to a file or a buffer. In the following example, we demonstrate writing the output directly to a file.

outfile = "LFLS01000089.fna"
@@ -31,4 +31,4 @@
 >seq id=11 start=581 stop=601 strand=+ frame=2 score=0.0
 ATGTGTCCAACGGCAGCCTGA
 >seq id=12 start=695 stop=706 strand=+ frame=2 score=0.0
-ATGCAACCCTGA

The same pattern applies when writing a FASTA file with the nucleotide sequences of the ORFs using the write_orfs_fna function, and likewise for BED and GFF files using the write_orfs_bed and write_orfs_gff functions, respectively.

+ATGCAACCCTGA

The same pattern applies when writing a FASTA file with the nucleotide sequences of the ORFs using the write_orfs_fna function, and likewise for BED and GFF files using the write_orfs_bed and write_orfs_gff functions, respectively.
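To illustrate why accepting either an IOStream or an IOBuffer is convenient, here is a minimal, package-free sketch. `ToyRecord` and `write_toy_fna` are hypothetical names introduced only for this illustration (they are not part of GeneFinder), and the header layout simply mimics the listing above, with the score hard-coded to 0.0.

```julia
# Hypothetical record type; GeneFinder's own ORF type carries more information.
struct ToyRecord
    start::Int
    stop::Int
    strand::Char
    frame::Int
    seq::String
end

# Write records to any IO target (file stream or in-memory buffer).
# The header layout imitates the FASTA listing shown above; score is fixed at 0.0.
function write_toy_fna(io::IO, records::Vector{ToyRecord})
    for (i, r) in enumerate(records)
        id = lpad(i, 2, '0')  # zero-padded running index: 01, 02, ...
        println(io, ">seq id=$id start=$(r.start) stop=$(r.stop) strand=$(r.strand) frame=$(r.frame) score=0.0")
        println(io, r.seq)
    end
    return nothing
end

records = [ToyRecord(29, 40, '+', 2, "ATGCAACCCTGA"),
           ToyRecord(137, 145, '+', 2, "ATGCGCTGA")]

# Target 1: an in-memory buffer, handy for tests or further processing.
buf = IOBuffer()
write_toy_fna(buf, records)
fna_text = String(take!(buf))

# Target 2: a file on disk, opened as a stream.
path = tempname()
open(path, "w") do io
    write_toy_fna(io, records)
end
```

Because the writer only depends on the abstract `IO` interface, the same function body serves both targets; this is presumably the design reason behind the `Union{IOStream, IOBuffer}` output argument, although the package's actual implementation may differ.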

diff --git a/dev/naivefinder/index.html b/dev/naivefinder/index.html index f82336b..fd1d812 100644 --- a/dev/naivefinder/index.html +++ b/dev/naivefinder/index.html @@ -1,5 +1,5 @@ -Finding ORFs · GeneFinder.jl

The ORF type

For convenience, the ORF type is more stringent in preventing the creation of incompatible instances. As a result, attempting to create an instance with incompatible parameters will result in an error. For instance, the following code snippet will trigger an error:

ORF{4,NaiveFinder}(1:10, '+', 4) # Or any F <: GeneFinderMethod
+Finding ORFs · GeneFinder.jl

The ORF type

For convenience, the ORF type is more stringent in preventing the creation of incompatible instances. As a result, attempting to create an instance with incompatible parameters will result in an error. For instance, the following code snippet will trigger an error:

ORF{4,NaiveFinder}(1:10, '+', 4) # Or any F <: GeneFinderMethod
 
 ERROR: AssertionError: Invalid frame value. Frame must be 1, 2, or 3.
 Stacktrace:
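
The frame check behind the error above could be enforced in an inner constructor. The following is a minimal sketch under that assumption: `ToyORF` is a hypothetical stand-in, not GeneFinder's ORF type, and the strand check and its message are invented for illustration. Only the frame message is taken from the error shown above.

```julia
# Hypothetical stand-in for an ORF-like type with validated fields.
struct ToyORF
    range::UnitRange{Int}
    strand::Char
    frame::Int

    # Inner constructor: reject incompatible values before the object exists.
    function ToyORF(range::UnitRange{Int}, strand::Char, frame::Int)
        @assert frame in (1, 2, 3) "Invalid frame value. Frame must be 1, 2, or 3."
        # Assumed extra check, not documented in the source.
        @assert strand in ('+', '-') "Invalid strand value. Strand must be '+' or '-'."
        return new(range, strand, frame)
    end
end

# A valid instance is constructed normally.
ok = ToyORF(1:10, '+', 2)

# An invalid frame raises an AssertionError, captured here for inspection.
err = try
    ToyORF(1:10, '+', 4)
catch e
    e
end
```

Validating in the inner constructor means no partially valid instance can ever be created, which matches the "more stringent" behaviour described above.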
@@ -29,7 +29,7 @@
  ORF{NaiveFinder}(551:574, '+', 2)
  ORF{NaiveFinder}(569:574, '+', 2)
  ORF{NaiveFinder}(581:601, '+', 2)
- ORF{NaiveFinder}(695:706, '+', 2)

Two other methods were added to sequence to get the ORFs as DNA or amino acid sequences, respectively. They use the findorfs function to first get the ORFs and then return the corresponding array of BioSequence objects.

sequece.(orfs)
+ ORF{NaiveFinder}(695:706, '+', 2)

Two other methods were added to sequence to get the ORFs as DNA or amino acid sequences, respectively. They use the findorfs function to first get the ORFs and then return the corresponding array of BioSequence objects.

sequence.(orfs)
 
 12-element Vector{NucSeq{4, DNAAlphabet{4}}}
  ATGCAACCCTGA
@@ -57,4 +57,4 @@
  MSPHKAM*
  M*
  MCPTAA*
- MQP*
+ MQP*
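
The broadcasted `sequence.(orfs)` call shown earlier maps each ORF to its subsequence. A package-free sketch of that idea follows. `ToyInterval`, `revcomp`, and `toy_sequence` are hypothetical names, plain strings stand in for BioSequence objects, and taking the reverse complement for the '-' strand is an assumption about how such an extraction would typically work, not a statement about GeneFinder's internals.

```julia
# Hypothetical interval type: a coordinate range plus a strand.
struct ToyInterval
    range::UnitRange{Int}
    strand::Char
end

const COMPLEMENT = Dict('A' => 'T', 'T' => 'A', 'C' => 'G', 'G' => 'C')

# Reverse complement of a plain DNA string (unambiguous bases only).
revcomp(s::AbstractString) = join(COMPLEMENT[c] for c in reverse(s))

# Extract the subsequence for one interval; '-' strand intervals are
# reverse-complemented (an assumption made for this sketch).
function toy_sequence(dna::String, orf::ToyInterval)
    sub = dna[orf.range]
    return orf.strand == '+' ? sub : revcomp(sub)
end

dna = "CCATGCAACCCTGATT"
orfs = [ToyInterval(3:14, '+'), ToyInterval(1:4, '-')]

# Broadcast over the intervals; Ref keeps the DNA string from being iterated.
seqs = toy_sequence.(Ref(dna), orfs)
# seqs == ["ATGCAACCCTGA", "ATGG"]
```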
diff --git a/dev/objects.inv b/dev/objects.inv index 6d9a208..30a0889 100644 Binary files a/dev/objects.inv and b/dev/objects.inv differ diff --git a/dev/roadmap/index.html b/dev/roadmap/index.html index 3a3a264..38e11ac 100644 --- a/dev/roadmap/index.html +++ b/dev/roadmap/index.html @@ -1,2 +1,2 @@ -- · GeneFinder.jl

Roadmap

Coding genes (CDS - ORFs)

  • Finding ORFs
  • ☐ EasyGene
  • ☐ GLIMMER
  • ☐ Prodigal - Pyrodigal
  • ☐ PHANOTATE
  • ☐ k-mer based gene finders (?)
  • ☐ Augustus (?)

Non-coding genes (RNA)

  • ☐ Infernal
  • ☐ tRNAscan

Other features

  • ☐ parallelism SIMD ?
  • ☐ memory management (?)
  • ☐ incorporate Ribosome Binding Sites (RBS)
  • ☐ incorporate Programmed Reading Frame Shifting (PRFS)
  • ☐ specialized types
    • ☒ Gene
    • ☒ ORF
    • ☒ Codon
    • ☒ CDS
    • ☐ EukaryoticGene (?)
    • ☐ ProkaryoticGene (?)
    • ☐ Intron
    • ☐ Exon
    • ☐ GFF --> See other packages
    • ☐ FASTX --> See I/O in other packages

Compatibilities

Must interact with or extend:

  • GenomicAnnotations.jl
  • BioSequences.jl
  • SequenceVariation.jl
  • GenomicFeatures.jl
  • FASTX.jl
  • Kmers.jl
  • Graphs.jl
+- · GeneFinder.jl

Roadmap

Coding genes (CDS - ORFs)

  • Finding ORFs
  • ☐ EasyGene
  • ☐ GLIMMER
  • ☐ Prodigal - Pyrodigal
  • ☐ PHANOTATE
  • ☐ k-mer based gene finders (?)
  • ☐ Augustus (?)

Non-coding genes (RNA)

  • ☐ Infernal
  • ☐ tRNAscan

Other features

  • ☐ parallelism SIMD ?
  • ☐ memory management (?)
  • ☐ incorporate Ribosome Binding Sites (RBS)
  • ☐ incorporate Programmed Reading Frame Shifting (PRFS)
  • ☐ specialized types
    • ☒ Gene
    • ☒ ORF
    • ☒ Codon
    • ☒ CDS
    • ☐ EukaryoticGene (?)
    • ☐ ProkaryoticGene (?)
    • ☐ Intron
    • ☐ Exon
    • ☐ GFF --> See other packages
    • ☐ FASTX --> See I/O in other packages

Compatibilities

Must interact with or extend:

  • GenomicAnnotations.jl
  • BioSequences.jl
  • SequenceVariation.jl
  • GenomicFeatures.jl
  • FASTX.jl
  • Kmers.jl
  • Graphs.jl
diff --git a/dev/search_index.js b/dev/search_index.js index ab85c82..f6deb35 100644 --- a/dev/search_index.js +++ b/dev/search_index.js @@ -1,3 +1,3 @@ var documenterSearchIndex = {"docs": -[{"location":"iodocs/#Writting-ORFs-into-bioinformatic-formats","page":"Wrtiting ORFs In Files","title":"Writting ORFs into bioinformatic formats","text":"","category":"section"},{"location":"iodocs/","page":"Wrtiting ORFs In Files","title":"Wrtiting ORFs In Files","text":"This package facilitates the creation of FASTA, BED, and GFF files, specifically extracting Open Reading Frame (ORF) information from BioSequence instances, particularly those of type NucleicSeqOrView{A} where A, and then writing the information into the desired format.","category":"page"},{"location":"iodocs/","page":"Wrtiting ORFs In Files","title":"Wrtiting ORFs In Files","text":"Functionality:","category":"page"},{"location":"iodocs/","page":"Wrtiting ORFs In Files","title":"Wrtiting ORFs In Files","text":"The package provides four distinct functions for writing files in different formats:","category":"page"},{"location":"iodocs/","page":"Wrtiting ORFs In Files","title":"Wrtiting ORFs In Files","text":"Function Description\nwrite_orfs_fna Writes nucleotide sequences in FASTA format.\nwrite_orfs_faa Writes amino acid sequences in FASTA format.\nwrite_orfs_bed Outputs information in BED format.\nwrite_orfs_gff Generates files in GFF format.","category":"page"},{"location":"iodocs/","page":"Wrtiting ORFs In Files","title":"Wrtiting ORFs In Files","text":"All these functions support processing both BioSequence instances and external FASTA files. In the case of a BioSequence instace into external files, simply provide the path to the FASTA file using a String to the path. 
To demonstrate the use of the write_* methods with a BioSequence, consider the following example:","category":"page"},{"location":"iodocs/","page":"Wrtiting ORFs In Files","title":"Wrtiting ORFs In Files","text":"using BioSequences, GeneFinder\n\n# > 180195.SAMN03785337.LFLS01000089 -> finds only 1 gene in Prodigal (from Pyrodigal tests)\nseq = dna\"AACCAGGGCAATATCAGTACCGCGGGCAATGCAACCCTGACTGCCGGCGGTAACCTGAACAGCACTGGCAATCTGACTGTGGGCGGTGTTACCAACGGCACTGCTACTACTGGCAACATCGCACTGACCGGTAACAATGCGCTGAGCGGTCCGGTCAATCTGAATGCGTCGAATGGCACGGTGACCTTGAACACGACCGGCAATACCACGCTCGGTAACGTGACGGCACAAGGCAATGTGACGACCAATGTGTCCAACGGCAGTCTGACGGTTACCGGCAATACGACAGGTGCCAACACCAACCTCAGTGCCAGCGGCAACCTGACCGTGGGTAACCAGGGCAATATCAGTACCGCAGGCAATGCAACCCTGACGGCCGGCGACAACCTGACGAGCACTGGCAATCTGACTGTGGGCGGCGTCACCAACGGCACGGCCACCACCGGCAACATCGCGCTGACCGGTAACAATGCACTGGCTGGTCCTGTCAATCTGAACGCGCCGAACGGCACCGTGACCCTGAACACAACCGGCAATACCACGCTGGGTAATGTCACCGCACAAGGCAATGTGACGACTAATGTGTCCAACGGCAGCCTGACAGTCGCTGGCAATACCACAGGTGCCAACACCAACCTGAGTGCCAGCGGCAATCTGACCGTGGGCAACCAGGGCAATATCAGTACCGCGGGCAATGCAACCCTGACTGCCGGCGGTAACCTGAGC\"","category":"page"},{"location":"iodocs/","page":"Wrtiting ORFs In Files","title":"Wrtiting ORFs In Files","text":"Once a BioSequence object has been instantiated, the write_orfs_fna function proves useful for generating a FASTA file containing the nucleotide sequences of the ORFs. Notably, the write_orfs* methods support either an IOStream or an IOBuffer as an output argument, allowing flexibility in directing the output either to a file or a buffer. 
In the following example, we demonstrate writing the output directly to a file.","category":"page"},{"location":"iodocs/","page":"Wrtiting ORFs In Files","title":"Wrtiting ORFs In Files","text":"outfile = \"LFLS01000089.fna\"\n\nopen(outfile, \"w\") do io\n write_orfs_fna(seq, io, NaiveFinder())\nend","category":"page"},{"location":"iodocs/","page":"Wrtiting ORFs In Files","title":"Wrtiting ORFs In Files","text":"cat LFLS01000089.fna\n\n>seq id=01 start=29 stop=40 strand=+ frame=2 score=0.0\nATGCAACCCTGA\n>seq id=02 start=137 stop=145 strand=+ frame=2 score=0.0\nATGCGCTGA\n>seq id=03 start=164 stop=184 strand=+ frame=2 score=0.0\nATGCGTCGAATGGCACGGTGA\n>seq id=04 start=173 stop=184 strand=+ frame=2 score=0.0\nATGGCACGGTGA\n>seq id=05 start=236 stop=241 strand=+ frame=2 score=0.0\nATGTGA\n>seq id=06 start=248 stop=268 strand=+ frame=2 score=0.0\nATGTGTCCAACGGCAGTCTGA\n>seq id=07 start=362 stop=373 strand=+ frame=2 score=0.0\nATGCAACCCTGA\n>seq id=08 start=470 stop=496 strand=+ frame=2 score=0.0\nATGCACTGGCTGGTCCTGTCAATCTGA\n>seq id=09 start=551 stop=574 strand=+ frame=2 score=0.0\nATGTCACCGCACAAGGCAATGTGA\n>seq id=10 start=569 stop=574 strand=+ frame=2 score=0.0\nATGTGA\n>seq id=11 start=581 stop=601 strand=+ frame=2 score=0.0\nATGTGTCCAACGGCAGCCTGA\n>seq id=12 start=695 stop=706 strand=+ frame=2 score=0.0\nATGCAACCCTGA","category":"page"},{"location":"iodocs/","page":"Wrtiting ORFs In Files","title":"Wrtiting ORFs In Files","text":"This could also be done to writting a FASTA file with the nucleotide sequences of the ORFs using the write_orfs_fna function. 
Similarly for the BED and GFF files using the write_orfs_bed and write_orfs_gff functions respectively.","category":"page"},{"location":"api/","page":"API","title":"API","text":"CurrentModule = GeneFinder\nDocTestSetup = quote\n using GeneFinder\nend","category":"page"},{"location":"api/#The-Main-ORF-type","page":"API","title":"The Main ORF type","text":"","category":"section"},{"location":"api/","page":"API","title":"API","text":"The main type of the package is ORF which represents an Open Reading Frame.","category":"page"},{"location":"api/","page":"API","title":"API","text":"Modules = [GeneFinder]\nPages = [\"types.jl\"]","category":"page"},{"location":"api/#GeneFinder.ORF","page":"API","title":"GeneFinder.ORF","text":"struct ORF{N,F} <: GenomicFeatures.AbstractGenomicInterval{F}\n\nThe ORF struct represents an Open Reading Frame (ORF) in genomics.\n\nFields\n\ngroupname::String: The name of the group to which the ORF belongs.\nfirst::Int64: The starting position of the ORF.\nlast::Int64: The ending position of the ORF.\nstrand::Strand: The strand on which the ORF is located.\nframe::Int: The reading frame of the ORF.\nfeatures::Features: The features associated with the ORF.\nscheme::Union{Nothing,Function}: The scheme used for the ORF.\n\nConstructor\n\n\n\n\n\n","category":"type"},{"location":"api/#FASTX.sequence-Union{Tuple{ORF{N, F}}, Tuple{F}, Tuple{N}} where {N, F}","page":"API","title":"FASTX.sequence","text":"sequence(i::ORF{N,F})\n\nExtracts the DNA sequence corresponding to the given open reading frame (ORF).\n\nArguments\n\ni::ORF{N,F}: The open reading frame (ORF) for which the DNA sequence needs to be extracted.\n\nReturns\n\nThe DNA sequence corresponding to the given open reading frame (ORF).\n\n\n\n\n\n","category":"method"},{"location":"api/#Finding-ORFs","page":"API","title":"Finding ORFs","text":"","category":"section"},{"location":"api/","page":"API","title":"API","text":"The function findorfs is the main function of the package. 
It is generic method that can handle different gene finding methods. ","category":"page"},{"location":"api/","page":"API","title":"API","text":"Modules = [GeneFinder]\nPages = [\"findorfs.jl\"]","category":"page"},{"location":"api/#GeneFinder.findorfs-Union{Tuple{Union{BioSequences.LongDNA{N}, BioSequences.LongSubSeq{BioSequences.DNAAlphabet{N}}}}, Tuple{F}, Tuple{N}} where {N, F<:GeneFinder.GeneFinderMethod}","page":"API","title":"GeneFinder.findorfs","text":"findorfs(sequence::NucleicSeqOrView{DNAAlphabet{N}}; ::M, kwargs...) where {N, M<:GeneFinderMethod}\n\nThis is the main interface method for finding open reading frames (ORFs) in a DNA sequence.\n\nIt takes the following required arguments:\n\nsequence: The nucleic acid sequence to search for ORFs.\nmethod: The algorithm used to find ORFs. It can be either NaiveFinder(), NaiveFinderScored() or yet other implementations.\n\nKeyword Arguments regardless of the finder method:\n\nalternative_start::Bool: A boolean indicating whether to consider alternative start codons. Default is false.\nminlen::Int: The minimum length of an ORF. Default is 6.\nscheme::Function: The scoring scheme to use for scoring the sequence from the ORF. 
Default is nothing.\n\nReturns\n\nA vector of ORF objects representing the found ORFs.\n\nExample\n\nsequence = randdnaseq(120)\n\n120nt DNA Sequence:\n GCCGGACAGCGAAGGCTAATAAATGCCCGTGCCAGTATC…TCTGAGTTACTGTACACCCGAAAGACGTTGTACGCATTT\n\nfindorfs(sequence, NaiveFinder())\n\n1-element Vector{ORF}:\n ORF{NaiveFinder}(77:118, '-', 2, 0.0)\n\n\n\n\n\n","category":"method"},{"location":"api/#Finding-ORFs-using-BioRegex-and-scoring","page":"API","title":"Finding ORFs using BioRegex and scoring","text":"","category":"section"},{"location":"api/","page":"API","title":"API","text":"Modules = [GeneFinder]\nPages = [\"algorithms/naivefinder.jl\"]","category":"page"},{"location":"api/#GeneFinder.NaiveFinder-Union{Tuple{Union{BioSequences.LongDNA{N}, BioSequences.LongSubSeq{BioSequences.DNAAlphabet{N}}}}, Tuple{N}} where N","page":"API","title":"GeneFinder.NaiveFinder","text":"NaiveFinder(sequence::NucleicSeqOrView{DNAAlphabet{N}}; kwargs...) -> Vector{ORF} where {N}\n\nA simple implementation that finds ORFs in a DNA sequence.\n\nThe NaiveFinder method takes a LongSequence{DNAAlphabet{4}} sequence and returns a Vector{ORF} containing the ORFs found in the sequence. It searches entire regularly expressed CDS, adding each ORF it finds to the vector. The function also searches the reverse complement of the sequence, so it finds ORFs on both strands. Extending the starting codons with the alternative_start = true will search for ATG, GTG, and TTG. Some studies have shown that in E. coli (K-12 strain), ATG, GTG and TTG are used 83 %, 14 % and 3 % respectively.\n\nnote: Note\nThis function has neither ORFs scoring scheme by default nor length constraints. Thus it might consider aa\"M*\" a posible encoding protein from the resulting ORFs.\n\nRequired Arguments\n\nsequence::NucleicSeqOrView{DNAAlphabet{N}}: The nucleic acid sequence to search for ORFs.\n\nKeywords Arguments\n\nalternative_start::Bool: If true will pass the extended start codons to search. 
This will increase 3x the execution time. Default is false.\nminlen::Int64=6: Length of the allowed ORF. Default value allow aa\"M*\" a posible encoding protein from the resulting ORFs.\nscheme::Function: The scoring scheme to use for scoring the sequence from the ORF. Default is nothing.\n\nnote: Note\nAs the scheme is generally a scoring function that at least requires a sequence, one simple scheme is the log-odds ratio score. This score is a log-odds ratio that compares the probability of the sequence generated by a coding model to the probability of the sequence generated by a non-coding model:S(x) = sum_i=1^L beta_x_ix = sum_i=1 log fraca^mathscrm_1_i-1 x_ia^mathscrm_2_i-1 x_iIf the log-odds ratio exceeds a given threshold (η), the sequence is considered likely to be coding. See lordr for more information about coding creteria.\n\n\n\n\n\n","category":"method"},{"location":"api/#GeneFinder._locationiterator-Union{Tuple{Union{BioSequences.LongDNA{N}, BioSequences.LongSubSeq{BioSequences.DNAAlphabet{N}}}}, Tuple{N}} where N","page":"API","title":"GeneFinder._locationiterator","text":"locationiterator(sequence::NucleicSeqOrView{DNAAlphabet{N}}; alternative_start::Bool=false) where {N}\n\nThis is an iterator function that uses regular expressions to search the entire ORF (instead of start and stop codons) in a LongSequence{DNAAlphabet{4}} sequence. It uses an anonymous function that will find the first regularly expressed ORF. Then using this anonymous function it creates an iterator that will apply it until there is no other CDS.\n\nnote: Note\nAs a note of the implementation we want to expand on how the ORFs are found:The expression (?:[N]{3})*? serves as the boundary between the start and stop codons. Within this expression, the character class [N]{3} captures exactly three occurrences of any character (representing nucleotides using IUPAC codes). This portion functions as the regular codon matches. 
Since it is enclosed within a non-capturing group (?:) and followed by *?, it allows for the matching of intermediate codons, but with a preference for the smallest number of repetitions. In summary, the regular expression ATG(?:[N]{3})*?T(AG|AA|GA) identifies patterns that start with \"ATG,\" followed by any number of three-character codons (represented by \"N\" in the IUPAC code), and ends with a stop codon \"TAG,\" \"TAA,\" or \"TGA.\" This pattern is commonly used to identify potential protein-coding regions within genetic sequences.See more about the discussion here\n\n\n\n\n\n","category":"method"},{"location":"api/","page":"API","title":"API","text":"","category":"page"},{"location":"api/#Writing-ORFs-to-files","page":"API","title":"Writing ORFs to files","text":"","category":"section"},{"location":"api/","page":"API","title":"API","text":"Modules = [GeneFinder]\nPages = [\"io.jl\"]","category":"page"},{"location":"api/#GeneFinder.write_orfs_bed-Union{Tuple{F}, Tuple{N}, Tuple{Union{BioSequences.LongDNA{N}, BioSequences.LongSubSeq{BioSequences.DNAAlphabet{N}}}, Union{IOStream, IOBuffer}}} where {N, F<:GeneFinder.GeneFinderMethod}","page":"API","title":"GeneFinder.write_orfs_bed","text":"write_orfs_bed(input::NucleicSeqOrView{DNAAlphabet{N}}, output::Union{IOStream, IOBuffer}, finder::F; kwargs...)\nwrite_orfs_bed(input::NucleicSeqOrView{DNAAlphabet{N}}, output::String, finder::F; kwargs...)\n\nWrite BED data to a file.\n\nArguments\n\ninput: The input DNA sequence NucSeq or a view.\noutput: The otput format, it can be a file (String) or a buffer (IOStream or `IOBuffer)\nfinder: The algorithm used to find ORFs. It can be either NaiveFinder() or NaiveFinderScored().\n\nKeywords\n\nalternative_start::Bool=false: If true, alternative start codons will be used when identifying CDSs. Default is false.\nminlen::Int64=6: The minimum length that a CDS must have in order to be included in the output file. 
Default is 6.\n\n\n\n\n\n","category":"method"},{"location":"api/#GeneFinder.write_orfs_faa-Union{Tuple{F}, Tuple{N}, Tuple{Union{BioSequences.LongDNA{N}, BioSequences.LongSubSeq{BioSequences.DNAAlphabet{N}}}, Union{IOStream, IOBuffer}}} where {N, F<:GeneFinder.GeneFinderMethod}","page":"API","title":"GeneFinder.write_orfs_faa","text":"write_orfs_faa(input::NucleicSeqOrView{DNAAlphabet{4}}, output::Union{IOStream, IOBuffer}, finder::F; kwargs...)\nwrite_orfs_faa(input::NucleicSeqOrView{DNAAlphabet{4}}, output::String, finder::F; kwargs...)\n\nWrite the protein sequences encoded by the coding sequences (CDSs) of a given DNA sequence to the specified file.\n\nArguments\n\ninput: The input DNA sequence NucSeq or a view.\noutput: The otput format, it can be a file (String) or a buffer (IOStream or `IOBuffer)\nfinder: The algorithm used to find ORFs. It can be either NaiveFinder() or NaiveFinderScored().\n\nKeywords\n\ncode::GeneticCode=BioSequences.standard_genetic_code: The genetic code by which codons will be translated. See BioSequences.ncbi_trans_table for more info. \nalternative_start::Bool=false: If true will pass the extended start codons to search. This will increase 3x the exec. time.\nminlen::Int64=6: Length of the allowed ORF. 
Default value allow aa\"M*\" a posible encoding protein from the resulting ORFs.\n\nExamples\n\nfilename = \"output.faa\"\n\nseq = dna\"ATGATGCATGCATGCATGCTAGTAACTAGCTAGCTAGCTAGTAA\"\n\nopen(filename, \"w\") do file\n write_orfs_faa(seq, file)\nend\n\n\n\n\n\n","category":"method"},{"location":"api/#GeneFinder.write_orfs_fna-Union{Tuple{F}, Tuple{N}, Tuple{Union{BioSequences.LongDNA{N}, BioSequences.LongSubSeq{BioSequences.DNAAlphabet{N}}}, Union{IOStream, IOBuffer}}} where {N, F<:GeneFinder.GeneFinderMethod}","page":"API","title":"GeneFinder.write_orfs_fna","text":"write_orfs_fna(input::NucleicSeqOrView{DNAAlphabet{N}}, output::Union{IOStream, IOBuffer}, finder::F; kwargs...)\nwrite_orfs_fna(input::NucleicSeqOrView{DNAAlphabet{N}}, output::String, finder::F; kwargs...)\n\nWrite a file containing the coding sequences (CDSs) of a given DNA sequence to the specified file.\n\nArguments\n\ninput::NucleicAcidAlphabet{DNAAlphabet{N}}: The input DNA sequence.\noutput::IO: The otput format, it can be a file (String) or a buffer (IOStream or `IOBuffer)\nfinder::F: The algorithm used to find ORFs. It can be either NaiveFinder() or NaiveFinderScored().\n\nKeywords\n\nalternative_start::Bool=false: If true, alternative start codons will be used when identifying CDSs. Default is false.\nminlen::Int64=6: The minimum length that a CDS must have in order to be included in the output file. 
Default is 6.\n\nExamples\n\nfilename = \"output.fna\"\n\nseq = dna\"ATGATGCATGCATGCATGCTAGTAACTAGCTAGCTAGCTAGTAA\"\n\nopen(filename, \"w\") do file\n write_orfs_fna(seq, file, NaiveFinder())\nend\n\n\n\n\n\n","category":"method"},{"location":"api/#GeneFinder.write_orfs_gff-Union{Tuple{F}, Tuple{N}, Tuple{Union{BioSequences.LongDNA{N}, BioSequences.LongSubSeq{BioSequences.DNAAlphabet{N}}}, Union{IOStream, IOBuffer}}} where {N, F<:GeneFinder.GeneFinderMethod}","page":"API","title":"GeneFinder.write_orfs_gff","text":"write_orfs_gff(input::NucleicSeqOrView{DNAAlphabet{N}}, output::Union{IOStream, IOBuffer}, finder::F; kwargs...)\nwrite_orfs_gff(input::NucleicSeqOrView{DNAAlphabet{N}}, output::String, finder::F; kwargs...)\n\nWrite GFF data to a file.\n\nArguments\n\ninput: The input DNA sequence NucSeq or a view.\noutput: The otput format, it can be a file (String) or a buffer (IOStream or `IOBuffer)\nfinder: The algorithm used to find ORFs. It can be either NaiveFinder() or NaiveFinderScored().\n\nKeywords\n\ncode::GeneticCode=BioSequences.standard_genetic_code: The genetic code by which codons will be translated. See BioSequences.ncbi_trans_table for more info. \nalternative_start::Bool=false: If true will pass the extended start codons to search. This will increase 3x the exec. time.\nminlen::Int64=6: Length of the allowed ORF. Default value allow aa\"M*\" a posible encoding protein from the resulting ORFs.\n\n\n\n\n\n","category":"method"},{"location":"simplecodingrule/#Scoring-a-sequence-using-a-Markov-model","page":"A Simple Coding Rule","title":"Scoring a sequence using a Markov model","text":"","category":"section"},{"location":"simplecodingrule/","page":"A Simple Coding Rule","title":"A Simple Coding Rule","text":"A sequence of DNA could be scored using a Markov model of the transition probabilities of a known sequence. 
This could be done using a log-odds ratio score, which is the logarithm of the ratio of the transition probabilities of the sequence under two competing models. The log-odds ratio score is defined as:","category":"page"},{"location":"simplecodingrule/","page":"A Simple Coding Rule","title":"A Simple Coding Rule","text":"\\begin{align}\nS(x) = \\sum_{i=1}^{L} \\beta_{x_{i-1} x_i} = \\sum_{i=1}^{L} \\log \\frac{a^{\\mathscr{m}_1}_{x_{i-1} x_i}}{a^{\\mathscr{m}_2}_{x_{i-1} x_i}}\n\\end{align}","category":"page"},{"location":"simplecodingrule/","page":"A Simple Coding Rule","title":"A Simple Coding Rule","text":"Where a^{\\mathscr{m}_1}_{x_{i-1} x_i} is the transition probability of the first model (in this case, the one calculated for the given sequence) from the state x_{i-1} to the state x_i, and a^{\\mathscr{m}_2}_{x_{i-1} x_i} is the transition probability of the second model from the state x_{i-1} to the state x_i. The score is the sum of the log-odds ratios of the transition probabilities of the sequence under the two models.","category":"page"},{"location":"simplecodingrule/","page":"A Simple Coding Rule","title":"A Simple Coding Rule","text":"In the current implementation the second model is a CDS transition probability model of E. coli. This classification score is implemented in the naivescorefinder method. This method will return ORFs with the associated score of the sequence given the CDS model of E. 
coli.","category":"page"},{"location":"simplecodingrule/","page":"A Simple Coding Rule","title":"A Simple Coding Rule","text":"using GeneFinder, BioSequences\n\nseq = dna\"TTCGTCAGTCGTTCTGTTTCATTCAATACGATAGTAATGTATTTTTCGTGCATTTCCGGTGGAATCGTGCCGTCCAGCATAGCCTCCAGATATCCCCTTATAGAGGTCAGAGGGGAACGGAAATCGTGGGATACATTGGCTACAAACTTTTTCTGATCATCCTCGGAACGGGCAATTTCGCTTGCCATATAATTCAGACAGGAAGCCAGATAACCGATTTCATCCTCACTATCGACCTGAAATTCATAATGCATATTACCGGCAGCATACTGCTCTGTGGCATGAGTGATCTTCCTCAGAGGAATATATACGATCTCAGTGAAAAAGATCAGAATGATCAGGGATAGCAGGAACAGGATTGCCAGGGTGATATAGGAAATATTCAGCAGGTTGTTACAGGATTTCTGAATATCATTCATATCAGTATGGATGACTACATAGCCTTTTACCTTGTAGTTGGAGGTAATGGGAGCAAATACAGTAAGTACATCCGAATCAAAATTACCGAAGAAATCACCAACAATGTAATAGGAGCCGCTGGTTACGGTCGAATCAAAATTCTCAATGACAACCACATTCTCCACATCTAAGGGACTATTGGTATCCAGTACCAGTCGTCCGGAGGGATTGATGATGCGAATCTCGGAATTCAGGTAGACCGCCAGGGAGTCCAGCTGCATTTTAACGGTCTCCAAAGTTGTTTCACTGGTGTACAATCCGCCGGCATAGGTTCCGGCGATCAGGGTTGCTTCGGAATAGAGACTTTCTGCCTTTTCCCGGATCAGATGTTCTTTGGTCATATTGGGAACAAAAGTTGTAACAATGATGAAACCAAATACACCAAAAATAAAATATGCGAGTATAAATTTTAGATAAAGTGTTTTTTTCATAACAAATCCTGCTTTTGGTATGACTTAATTACGTACTTCGAATTTATAGCCGATGCCCCAGATGGTGCTGATCTTCCAGTTGGCATGATCCTTGATCTTCTC\"\n\nfindorfs(seq, minlen=75, finder=NaiveFinder)\n\n9-element Vector{ORF{4, NaiveFinder}}:\n ORF{NaiveFinder}(37:156, '+', 1,)\n ORF{NaiveFinder}(194:268, '-', 2)\n ORF{NaiveFinder}(194:283, '-', 2)\n ORF{NaiveFinder}(249:347, '+', 3)\n ORF{NaiveFinder}(426:590, '+', 3)\n ORF{NaiveFinder}(565:657, '+', 1)\n ORF{NaiveFinder}(650:727, '-', 2)\n ORF{NaiveFinder}(786:872, '+', 3)\n ORF{NaiveFinder}(887:976, '-', 2)","category":"page"},{"location":"simplecodingrule/#The-*log-odds-ratio*-decision-rule","page":"A Simple Coding Rule","title":"The log-odds ratio decision rule","text":"","category":"section"},{"location":"simplecodingrule/","page":"A Simple Coding Rule","title":"A Simple Coding Rule","text":"The sequence probability given a transition probability model could be used as the source of a sequence classification based 
on a decision rule to classify whether a sequence corresponds to one model or another. Now, imagine we have two DNA sequence transition models, a CDS model and a No-CDS model. The log-odds ratio decision rule could be established as:","category":"page"},{"location":"simplecodingrule/","page":"A Simple Coding Rule","title":"A Simple Coding Rule","text":"\\begin{align}\nS(X) = \\log \\frac{P_C(X_1=i_1, \\ldots, X_T=i_T)}{P_N(X_1=i_1, \\ldots, X_T=i_T)} \\begin{cases} > \\eta & \\Rightarrow \\text{coding} \\\\ < \\eta & \\Rightarrow \\text{noncoding} \\end{cases}\n\\end{align}","category":"page"},{"location":"simplecodingrule/","page":"A Simple Coding Rule","title":"A Simple Coding Rule","text":"Where P_C is the probability of the sequence given a CDS model and P_N is the probability of the sequence given a No-CDS model. The decision rule is based on whether the log ratio is greater or less than a given significance threshold η.","category":"page"},{"location":"simplecodingrule/","page":"A Simple Coding Rule","title":"A Simple Coding Rule","text":"In this package we have implemented this rule using basic CDS and No-CDS models of E. coli from the work of Axelson-Fisk (2015) (implemented in the BioMarkovChains.jl package). 
To check whether a random sequence could be coding based on these decision we use the predicate log_odds_ratio_decision_rule with the ECOLICDS and ECOLINOCDS models:","category":"page"},{"location":"simplecodingrule/","page":"A Simple Coding Rule","title":"A Simple Coding Rule","text":"orfsdna = findorfs(seq, minlen=75, alternative_start=true) .|> sequence\n\n20-element Vector{NucSeq{4, DNAAlphabet{4}}}\n ATGTATTTTTCGTGCATTTCCGGTGGAATCGTGCCGTCC…CGGAAATCGTGGGATACATTGGCTACAAACTTTTTCTGA\n GTGCATTTCCGGTGGAATCGTGCCGTCCAGCATAGCCTC…TACGATCTCAGTGAAAAAGATCAGAATGATCAGGGATAG\n GTGCCGTCCAGCATAGCCTCCAGATATCCCCTTATAGAG…CGGAAATCGTGGGATACATTGGCTACAAACTTTTTCTGA\n GTGGGATACATTGGCTACAAACTTTTTCTGATCATCCTC…TACGATCTCAGTGAAAAAGATCAGAATGATCAGGGATAG\n TTGCCATATAATTCAGACAGGAAGCCAGATAACCGATTT…GCATATTACCGGCAGCATACTGCTCTGTGGCATGAGTGA\n ATGCTGCCGGTAATATGCATTATGAATTTCAGGTCGATAGTGAGGATGAAATCGGTTATCTGGCTTCCTGTCTGA\n ATGCCACAGAGCAGTATGCTGCCGGTAATATGCATTATG…ATAGTGAGGATGAAATCGGTTATCTGGCTTCCTGTCTGA\n ATGCATATTACCGGCAGCATACTGCTCTGTGGCATGAGT…TACGATCTCAGTGAAAAAGATCAGAATGATCAGGGATAG\n GTGATCTTCCTCAGAGGAATATATACGATCTCAGTGAAA…ATCAGGGATAGCAGGAACAGGATTGCCAGGGTGATATAG\n ATGGATGACTACATAGCCTTTTACCTTGTAGTTGGAGGT…ATCAAAATTCTCAATGACAACCACATTCTCCACATCTAA\n TTGGTGATTTCTTCGGTAATTTTGATTCGGATGTACTTACTGTATTTGCTCCCATTACCTCCAACTACAAGGTAA\n TTGTTGGTGATTTCTTCGGTAATTTTGATTCGGATGTACTTACTGTATTTGCTCCCATTACCTCCAACTACAAGGTAA\n ATGACAACCACATTCTCCACATCTAAGGGACTATTGGTA…CCGGAGGGATTGATGATGCGAATCTCGGAATTCAGGTAG\n ATGCCGGCGGATTGTACACCAGTGAAACAACTTTGGAGACCGTTAAAATGCAGCTGGACTCCCTGGCGGTCTACCTGA\n TTGTTTCACTGGTGTACAATCCGCCGGCATAGGTTCCGG…TCAGATGTTCTTTGGTCATATTGGGAACAAAAGTTGTAA\n TTGCTTCGGAATAGAGACTTTCTGCCTTTTCCCGGATCAGATGTTCTTTGGTCATATTGGGAACAAAAGTTGTAA\n ATGTTCTTTGGTCATATTGGGAACAAAAGTTGTAACAAT…AAATACACCAAAAATAAAATATGCGAGTATAAATTTTAG\n TTGGTCATATTGGGAACAAAAGTTGTAACAATGATGAAA…ACACCAAAAATAAAATATGCGAGTATAAATTTTAGATAA\n TTGGGAACAAAAGTTGTAACAATGATGAAACCAAATACACCAAAAATAAAATATGCGAGTATAAATTTTAGATAA\n 
ATGCCAACTGGAAGATCAGCACCATCTGGGGCATCGGCT…TACGTAATTAAGTCATACCAAAAGCAGGATTTGTTATGA\n","category":"page"},{"location":"simplecodingrule/","page":"A Simple Coding Rule","title":"A Simple Coding Rule","text":"Now, we can score the sequences using the log-odds ratio score in the same line:","category":"page"},{"location":"simplecodingrule/","page":"A Simple Coding Rule","title":"A Simple Coding Rule","text":"orfsfeat = findorfs(seq, minlen=75, alternative_start=true, scheme=lors) .|> features\n\n20-element Vector{@NamedTuple{score::Float64}}:\n (score = -2.5146325834372343,)\n (score = -4.857592765476053,)\n (score = -1.9986133020444345,)\n (score = -3.4106894574555824,)\n (score = -1.763485388728319,)\n (score = 0.6825864481251348,)\n (score = 0.21287161698917936,)\n (score = -0.28187825646085224,)\n (score = -1.373474082107631,)\n (score = -4.273794970087796,)\n (score = -2.3961559066784597,)\n (score = -2.3663038090046142,)\n (score = -0.8406863072332524,)\n (score = 1.8013554455006733,)\n (score = -2.0768031699080756,)\n (score = -1.734088708668584,)\n (score = -2.9820908143871194,)\n (score = -3.072550585883162,)\n (score = -2.712493281013948,)\n (score = -2.0453354284951786,)","category":"page"},{"location":"simplecodingrule/","page":"A Simple Coding Rule","title":"A Simple Coding Rule","text":"Now the question is which of those sequences can we consider as coding sequences. 
We can use the iscoding predicate to check whether a sequence is coding or not based on the log-odds ratio decision rule:","category":"page"},{"location":"simplecodingrule/","page":"A Simple Coding Rule","title":"A Simple Coding Rule","text":"iscoding.(orfsdna) # criteria = log_odds_ratio_decision_rule","category":"page"},{"location":"simplecodingrule/","page":"A Simple Coding Rule","title":"A Simple Coding Rule","text":"20-element BitVector:\n 0\n 0\n 0\n 0\n 0\n 1\n 1\n 0\n 0\n 0\n 0\n 0\n 0\n 1\n 0\n 0\n 0\n 0\n 0\n 0","category":"page"},{"location":"simplecodingrule/","page":"A Simple Coding Rule","title":"A Simple Coding Rule","text":"In this case, the sequence has 20 ORFs and only 3 of them are classified as coding sequences. The classification is based on the log-odds ratio decision rule and the transition probability models of E. coli CDS and No-CDS. The log_odds_ratio_decision_rule method will return a boolean vector with the classification of each ORF in the sequence. Now we can simply filter the ORFs that are coding sequences:","category":"page"},{"location":"simplecodingrule/","page":"A Simple Coding Rule","title":"A Simple Coding Rule","text":"orfs = filter(orf -> iscoding(orf), orfsdna)\n\n3-element Vector{NucSeq{4, DNAAlphabet{4}}}\n ATGCTGCCGGTAATATGCATTATGAATTTCAGGTCGATAGTGAGGATGAAATCGGTTATCTGGCTTCCTGTCTGA\n ATGCCACAGAGCAGTATGCTGCCGGTAATATGCATTATG…ATAGTGAGGATGAAATCGGTTATCTGGCTTCCTGTCTGA\n ATGCCGGCGGATTGTACACCAGTGAAACAACTTTGGAGACCGTTAAAATGCAGCTGGACTCCCTGGCGGTCTACCTGA","category":"page"},{"location":"simplecodingrule/","page":"A Simple Coding Rule","title":"A Simple Coding Rule","text":"Or in terms of the ORF object:","category":"page"},{"location":"simplecodingrule/","page":"A Simple Coding Rule","title":"A Simple Coding Rule","text":"orfs = findorfs(seq, minlen=75, finder=NaiveFinder, alternative_start=true) # find ORFs with alternative start as well\norfs[iscoding.(orfsdna)]\n\n3-element Vector{ORF{4, NaiveFinder}}:\n ORF{NaiveFinder}(194:268, 
'-', 2, -0.026759927376272922)\n ORF{NaiveFinder}(194:283, '-', 2, -0.010354615336667268)\n ORF{NaiveFinder}(650:727, '-', 2, -0.04303976584597201)","category":"page"},{"location":"simplecodingrule/","page":"A Simple Coding Rule","title":"A Simple Coding Rule","text":"Or in a single line using another genome sequence:","category":"page"},{"location":"simplecodingrule/","page":"A Simple Coding Rule","title":"A Simple Coding Rule","text":"\nphi = dna\"GTGTGAGGTTATAACGCCGAAGCGGTAAAAATTTTAATTTTTGCCGCTGAGGGGTTGACCAAGCGAAGCGCGGTAGGTTTTCTGCTTAGGAGTTTAATCATGTTTCAGACTTTTATTTCTCGCCATAATTCAAACTTTTTTTCTGATAAGCTGGTTCTCACTTCTGTTACTCCAGCTTCTTCGGCACCTGTTTTACAGACACCTAAAGCTACATCGTCAACGTTATATTTTGATAGTTTGACGGTTAATGCTGGTAATGGTGGTTTTCTTCATTGCATTCAGATGGATACATCTGTCAACGCCGCTAATCAGGTTGTTTCTGTTGGTGCTGATATTGCTTTTGATGCCGACCCTAAATTTTTTGCCTGTTTGGTTCGCTTTGAGTCTTCTTCGGTTCCGACTACCCTCCCGACTGCCTATGATGTTTATCCTTTGAATGGTCGCCATGATGGTGGTTATTATACCGTCAAGGACTGTGTGACTATTGACGTCCTTCCCCGTACGCCGGGCAATAACGTTTATGTTGGTTTCATGGTTTGGTCTAACTTTACCGCTACTAAATGCCGCGGATTGGTTTCGCTGAATCAGGTTATTAAAGAGATTATTTGTCTCCAGCCACTTAAGTGAGGTGATTTATGTTTGGTGCTATTGCTGGCGGTATTGCTTCTGCTCTTGCTGGTGGCGCCATGTCTAAATTGTTTGGAGGCGGTCAAAAAGCCGCCTCCGGTGGCATTCAAGGTGATGTGCTTGCTACCGATAACAATACTGTAGGCATGGGTGATGCTGGTATTAAATCTGCCATTCAAGGCTCTAATGTTCCTAACCCTGATGAGGCCGCCCCTAGTTTTGTTTCTGGTGCTATGGCTAAAGCTGGTAAAGGACTTCTTGAAGGTACGTTGCAGGCTGGCACTTCTGCCGTTTCTGATAAGTTGCTTGATTTGGTTGGACTTGGTGGCAAGTCTGCCGCTGATAAAGGAAAGGATACTCGTGATTATCTTGCTGCTGCATTTCCTGAGCTTAATGCTTGGGAGCGTGCTGGTGCTGATGCTTCCTCTGCTGGTATGGTTGACGCCGGATTTGAGAATCAAAAAGAGCTTACTAAAATGCAACTGGACAATCAGAAAGAGATTGCCGAGATGCAAAATGAGACTCAAAAAGAGATTGCTGGCATTCAGTCGGCGACTTCACGCCAGAATACGAAAGACCAGGTATATGCACAAAATGAGATGCTTGCTTATCAACAGAAGGAGTCTACTGCTCGCGTTGCGTCTATTATGGAAAACACCAATCTTTCCAAGCAACAGCAGGTTTCCGAGATTATGCGCCAAATGCTTACTCAAGCTCAAACGGCTGGTCAGTATTTTACCAATGACCAAATCAAAGAAATGACTCGCAAGGTTAGTGCTGAGGTTGACTTAGTTCATCAGCAAACGCAGAATCAGCGGTATGGCTCTTCTCATATTGGCGCTACTGCAAAGGATATTTCTAATGTCGTCACTGATGCTGCTTCTGGTGTGGTTGATATTTTTCATGGTATTGATAAAGC
TGTTGCCGATACTTGGAACAATTTCTGGAAAGACGGTAAAGCTGATGGTATTGGCTCTAATTTGTCTAGGAAATAACCGTCAGGATTGACACCCTCCCAATTGTATGTTTTCATGCCTCCAAATCTTGGAGGCTTTTTTATGGTTCGTTCTTATTACCCTTCTGAATGTCACGCTGATTATTTTGACTTTGAGCGTATCGAGGCTCTTAAACCTGCTATTGAGGCTTGTGGCATTTCTACTCTTTCTCAATCCCCAATGCTTGGCTTCCATAAGCAGATGGATAACCGCATCAAGCTCTTGGAAGAGATTCTGTCTTTTCGTATGCAGGGCGTTGAGTTCGATAATGGTGATATGTATGTTGACGGCCATAAGGCTGCTTCTGACGTTCGTGATGAGTTTGTATCTGTTACTGAGAAGTTAATGGATGAATTGGCACAATGCTACAATGTGCTCCCCCAACTTGATATTAATAACACTATAGACCACCGCCCCGAAGGGGACGAAAAATGGTTTTTAGAGAACGAGAAGACGGTTACGCAGTTTTGCCGCAAGCTGGCTGCTGAACGCCCTCTTAAGGATATTCGCGATGAGTATAATTACCCCAAAAAGAAAGGTATTAAGGATGAGTGTTCAAGATTGCTGGAGGCCTCCACTATGAAATCGCGTAGAGGCTTTGCTATTCAGCGTTTGATGAATGCAATGCGACAGGCTCATGCTGATGGTTGGTTTATCGTTTTTGACACTCTCACGTTGGCTGACGACCGATTAGAGGCGTTTTATGATAATCCCAATGCTTTGCGTGACTATTTTCGTGATATTGGTCGTATGGTTCTTGCTGCCGAGGGTCGCAAGGCTAATGATTCACACGCCGACTGCTATCAGTATTTTTGTGTGCCTGAGTATGGTACAGCTAATGGCCGTCTTCATTTCCATGCGGTGCACTTTATGCGGACACTTCCTACAGGTAGCGTTGACCCTAATTTTGGTCGTCGGGTACGCAATCGCCGCCAGTTAAATAGCTTGCAAAATACGTGGCCTTATGGTTACAGTATGCCCATCGCAGTTCGCTACACGCAGGACGCTTTTTCACGTTCTGGTTGGTTGTGGCCTGTTGATGCTAAAGGTGAGCCGCTTAAAGCTACCAGTTATATGGCTGTTGGTTTCTATGTGGCTAAATACGTTAACAAAAAGTCAGATATGGACCTTGCTGCTAAAGGTCTAGGAGCTAAAGAATGGAACAACTCACTAAAAACCAAGCTGTCGCTACTTCCCAAGAAGCTGTTCAGAATCAGAATGAGCCGCAACTTCGGGATGAAAATGCTCACAATGACAAATCTGTCCACGGAGTGCTTAATCCAACTTACCAAGCTGGGTTACGACGCGACGCCGTTCAACCAGATATTGAAGCAGAACGCAAAAAGAGAGATGAGATTGAGGCTGGGAAAAGTTACTGTAGCCGACGTTTTGGCGGCGCAACCTGTGACGACAAATCTGCTCAAATTTATGCGCGCTTCGATAAAAATGATTGGCGTATCCAACCTGCAGAGTTTTATCGCTTCCATGACGCAGAAGTTAACACTTTCGGATATTTCTGATGAGTCGAAAAATTATCTTGATAAAGCAGGAATTACTACTGCTTGTTTACGAATTAAATCGAAGTGGACTGCTGGCGGAAAATGAGAAAATTCGACCTATCCTTGCGCAGCTCGAGAAGCTCTTACTTTGCGACCTTTCGCCATCAACTAACGATTCTGTCAAAAACTGACGCGTTGGATGAGGAGAAGTGGCTTAATATGCTTGGCACGTTCGTCAAGGACTGGTTTAGATATGAGTCACATTTTGTTCATGGTAGAGATTCTCTTGTTGACATTTTAAAAGAGCGTGGATTACTATCTGAGTCCGATGCTGTTCAACCACTAATAGGTAAGAAATCATGAGTCAAGTTACTGAACAATCCGTACGTTTCCAGACCGCTTTGGCCTCTATTAAGCTCATTCAG
GCTTCTGCCGTTTTGGATTTAACCGAAGATGATTTCGATTTTCTGACGAGTAACAAAGTTTGGATTGCTACTGACCGCTCTCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTACGCTGGACTTTGTGGGATACCCTCGCTTTCCTGCTCCTGTTGAGTTTATTGCTGCCGTCATTGCTTATTATGTTCATCCCGTCAACATTCAAACGGCCTGTCTCATCATGGAAGGCGCTGAATTTACGGAAAACATTATTAATGGCGTCGAGCGTCCGGTTAAAGCCGCTGAATTGTTCGCGTTTACCTTGCGTGTACGCGCAGGAAACACTGACGTTCTTACTGACGCAGAAGAAAACGTGCGTCAAAAATTACGTGCGGAAGGAGTGATGTAATGTCTAAAGGTAAAAAACGTTCTGGCGCTCGCCCTGGTCGTCCGCAGCCGTTGCGAGGTACTAAAGGCAAGCGTAAAGGCGCTCGTCTTTGGTATGTAGGTGGTCAACAATTTTAATTGCAGGGGCTTCGGCCCCTTACTTGAGGATAAATTATGTCTAATATTCAAACTGGCGCCGAGCGTATGCCGCATGACCTTTCCCATCTTGGCTTCCTTGCTGGTCAGATTGGTCGTCTTATTACCATTTCAACTACTCCGGTTATCGCTGGCGACTCCTTCGAGATGGACGCCGTTGGCGCTCTCCGTCTTTCTCCATTGCGTCGTGGCCTTGCTATTGACTCTACTGTAGACATTTTTACTTTTTATGTCCCTCATCGTCACGTTTATGGTGAACAGTGGATTAAGTTCATGAAGGATGGTGTTAATGCCACTCCTCTCCCGACTGTTAACACTACTGGTTATATTGACCATGCCGCTTTTCTTGGCACGATTAACCCTGATACCAATAAAATCCCTAAGCATTTGTTTCAGGGTTATTTGAATATCTATAACAACTATTTTAAAGCGCCGTGGATGCCTGACCGTACCGAGGCTAACCCTAATGAGCTTAATCAAGATGATGCTCGTTATGGTTTCCGTTGCTGCCATCTCAAAAACATTTGGACTGCTCCGCTTCCTCCTGAGACTGAGCTTTCTCGCCAAATGACGACTTCTACCACATCTATTGACATTATGGGTCTGCAAGCTGCTTATGCTAATTTGCATACTGACCAAGAACGTGATTACTTCATGCAGCGTTACCATGATGTTATTTCTTCATTTGGAGGTAAAACCTCTTATGACGCTGACAACCGTCCTTTACTTGTCATGCGCTCTAATCTCTGGGCATCTGGCTATGATGTTGATGGAACTGACCAAACGTCGTTAGGCCAGTTTTCTGGTCGTGTTCAACAGACCTATAAACATTCTGTGCCGCGTTTCTTTGTTCCTGAGCATGGCACTATGTTTACTCTTGCGCTTGTTCGTTTTCCGCCTACTGCGACTAAAGAGATTCAGTACCTTAACGCTAAAGGTGCTTTGACTTATACCGATATTGCTGGCGACCCTGTTTTGTATGGCAACTTGCCGCCGCGTGAAATTTCTATGAAGGATGTTTTCCGTTCTGGTGATTCGTCTAAGAAGTTTAAGATTGCTGAGGGTCAGTGGTATCGTTATGCGCCTTCGTATGTTTCTCCTGCTTATCACCTTCTTGAAGGCTTCCCATTCATTCAGGAACCGCCTTCTGGTGATTTGCAAGAACGCGTACTTATTCGCCACCATGATTATGACCAGTGTTTCCAGTCCGTTCAGTTGTTGCAGTGGAATAGTCAGGTTAAATTTAATGTGACCGTTTATCGCAATCTGCCGACCACTCGCGATTCAATCATGACTTCGTGATAAAAGATTGA\"\n\nfilter(x -> iscoding(sequence(x), η=1e-10) && length(x) > 100, findorfs(phi))\n\n34-element Vector{ORF{4, NaiveFinder}}:\n ORF{NaiveFinder}(636:1622, 
'+', 3)\n ORF{NaiveFinder}(687:1622, '+', 3)\n ORF{NaiveFinder}(774:1622, '+', 3)\n ORF{NaiveFinder}(781:1389, '+', 1)\n ORF{NaiveFinder}(814:1389, '+', 1)\n ORF{NaiveFinder}(829:1389, '+', 1)\n ORF{NaiveFinder}(861:1622, '+', 3)\n ORF{NaiveFinder}(1021:1389, '+', 1)\n ORF{NaiveFinder}(1386:1622, '+', 3)\n ORF{NaiveFinder}(1447:1635, '+', 1)\n ORF{NaiveFinder}(1489:1635, '+', 1)\n ORF{NaiveFinder}(1501:1635, '+', 1)\n ORF{NaiveFinder}(1531:1635, '+', 1)\n ORF{NaiveFinder}(2697:3227, '+', 3)\n ORF{NaiveFinder}(2745:3227, '+', 3)\n ⋮\n ORF{NaiveFinder}(2874:3227, '+', 3)\n ORF{NaiveFinder}(2973:3227, '+', 3)\n ORF{NaiveFinder}(3108:3227, '+', 3)\n ORF{NaiveFinder}(3142:3312, '+', 1)\n ORF{NaiveFinder}(3481:3939, '+', 1)\n ORF{NaiveFinder}(3659:3934, '+', 2)\n ORF{NaiveFinder}(3734:3934, '+', 2)\n ORF{NaiveFinder}(3772:3939, '+', 1)\n ORF{NaiveFinder}(3806:3934, '+', 2)\n ORF{NaiveFinder}(4129:4287, '+', 1)\n ORF{NaiveFinder}(4160:4291, '-', 2)\n ORF{NaiveFinder}(4540:4644, '+', 1)\n ORF{NaiveFinder}(4690:4866, '+', 1)\n ORF{NaiveFinder}(4741:4866, '+', 1)\n ORF{NaiveFinder}(4744:4866, '+', 1)","category":"page"},{"location":"naivefinder/#The-ORF-type","page":"Finding ORFs","title":"The ORF type","text":"","category":"section"},{"location":"naivefinder/","page":"Finding ORFs","title":"Finding ORFs","text":"For convenience, the ORF type is more stringent in preventing the creation of incompatible instances. As a result, attempting to create an instance with incompatible parameters will result in an error. For instance, the following code snippet will trigger an error:","category":"page"},{"location":"naivefinder/","page":"Finding ORFs","title":"Finding ORFs","text":"ORF{4,NaiveFinder}(1:10, '+', 4) # Or any F <: GeneFinderMethod\n\nERROR: AssertionError: Invalid frame value. 
Frame must be 1, 2, or 3.\nStacktrace:\n [1] ORF\n @ ~/.julia/dev/GeneFinder/src/types.jl:52 [inlined]\n [2] ORF{4, NaiveCollector}(range::UnitRange{Int64}, strand::Char, frame::Int64)\n @ GeneFinder ~/.julia/dev/GeneFinder/src/types.jl:79\n [3] top-level scope\n @ REPL[20]:1","category":"page"},{"location":"naivefinder/","page":"Finding ORFs","title":"Finding ORFs","text":"Similar behavior will be encountered when the strand is neither + nor -. This precautionary measure helps prevent the creation of invalid ORFs, ensuring greater stability and enabling the extension of its interface. For example, after creating a specific ORF, users can seamlessly iterate over a sequence of interest and verify whether the ORF is contained within the sequence.","category":"page"},{"location":"naivefinder/","page":"Finding ORFs","title":"Finding ORFs","text":"orf = ORF{4,NaiveFinder}(137:145, '+', 2)\nseq[orf]\n\n9nt DNA Sequence:\nATGCGCTGA","category":"page"},{"location":"naivefinder/","page":"Finding ORFs","title":"Finding ORFs","text":"warning: Warning\nIt is still possible to create an ORF and pass it to a sequence that does not necessarily contain an actual open reading frame. This will be addressed in future versions of the package. The benefit, however, is that it retrieves the corresponding subsequence of the sequence in a convenient way (5' to 3') regardless of the strand.","category":"page"},{"location":"naivefinder/#Finding-complete-and-overlapped-ORFs","page":"Finding ORFs","title":"Finding complete and overlapped ORFs","text":"","category":"section"},{"location":"naivefinder/","page":"Finding ORFs","title":"Finding ORFs","text":"The first implemented function is findorfs, a very non-restrictive ORF finder that catches all ORFs in a dedicated structure. Note that this will catch random ORFs, not necessarily genes, since it has no ORF size or overlap constraints. 
Thus it might consider aa\"M*\" a possible encoded protein among the resulting ORFs.","category":"page"},{"location":"naivefinder/","page":"Finding ORFs","title":"Finding ORFs","text":"using BioSequences, GeneFinder\n\n# > 180195.SAMN03785337.LFLS01000089 -> finds only 1 gene in Prodigal (from Pyrodigal tests)\nseq = dna\"AACCAGGGCAATATCAGTACCGCGGGCAATGCAACCCTGACTGCCGGCGGTAACCTGAACAGCACTGGCAATCTGACTGTGGGCGGTGTTACCAACGGCACTGCTACTACTGGCAACATCGCACTGACCGGTAACAATGCGCTGAGCGGTCCGGTCAATCTGAATGCGTCGAATGGCACGGTGACCTTGAACACGACCGGCAATACCACGCTCGGTAACGTGACGGCACAAGGCAATGTGACGACCAATGTGTCCAACGGCAGTCTGACGGTTACCGGCAATACGACAGGTGCCAACACCAACCTCAGTGCCAGCGGCAACCTGACCGTGGGTAACCAGGGCAATATCAGTACCGCAGGCAATGCAACCCTGACGGCCGGCGACAACCTGACGAGCACTGGCAATCTGACTGTGGGCGGCGTCACCAACGGCACGGCCACCACCGGCAACATCGCGCTGACCGGTAACAATGCACTGGCTGGTCCTGTCAATCTGAACGCGCCGAACGGCACCGTGACCCTGAACACAACCGGCAATACCACGCTGGGTAATGTCACCGCACAAGGCAATGTGACGACTAATGTGTCCAACGGCAGCCTGACAGTCGCTGGCAATACCACAGGTGCCAACACCAACCTGAGTGCCAGCGGCAATCTGACCGTGGGCAACCAGGGCAATATCAGTACCGCGGGCAATGCAACCCTGACTGCCGGCGGTAACCTGAGC\"","category":"page"},{"location":"naivefinder/","page":"Finding ORFs","title":"Finding ORFs","text":"Now let us find the ORFs:","category":"page"},{"location":"naivefinder/","page":"Finding ORFs","title":"Finding ORFs","text":"orfs = findorfs(seq, finder=NaiveFinder)\n\n12-element Vector{ORF}:\n ORF{NaiveFinder}(29:40, '+', 2)\n ORF{NaiveFinder}(137:145, '+', 2)\n ORF{NaiveFinder}(164:184, '+', 2)\n ORF{NaiveFinder}(173:184, '+', 2)\n ORF{NaiveFinder}(236:241, '+', 2)\n ORF{NaiveFinder}(248:268, '+', 2)\n ORF{NaiveFinder}(362:373, '+', 2)\n ORF{NaiveFinder}(470:496, '+', 2)\n ORF{NaiveFinder}(551:574, '+', 2)\n ORF{NaiveFinder}(569:574, '+', 2)\n ORF{NaiveFinder}(581:601, '+', 2)\n ORF{NaiveFinder}(695:706, '+', 2)","category":"page"},{"location":"naivefinder/","page":"Finding ORFs","title":"Finding ORFs","text":"Two other methods, sequence and translate, were implemented to get the ORFs as DNA or amino acid sequences, respectively. 
They use the findorfs function to first get the ORFs and then get the corresponding array of BioSequence objects.","category":"page"},{"location":"naivefinder/","page":"Finding ORFs","title":"Finding ORFs","text":"sequence.(orfs)\n\n12-element Vector{NucSeq{4, DNAAlphabet{4}}}\n ATGCAACCCTGA\n ATGCGCTGA\n ATGCGTCGAATGGCACGGTGA\n ATGGCACGGTGA\n ATGTGA\n ATGTGTCCAACGGCAGTCTGA\n ATGCAACCCTGA\n ATGCACTGGCTGGTCCTGTCAATCTGA\n ATGTCACCGCACAAGGCAATGTGA\n ATGTGA\n ATGTGTCCAACGGCAGCCTGA\n ATGCAACCCTGA","category":"page"},{"location":"naivefinder/","page":"Finding ORFs","title":"Finding ORFs","text":"translate.(orfs)\n\n12-element Vector{LongSubSeq{AminoAcidAlphabet}}:\n MQP*\n MR*\n MRRMAR*\n MAR*\n M*\n MCPTAV*\n MQP*\n MHWLVLSI*\n MSPHKAM*\n M*\n MCPTAA*\n MQP*","category":"page"},{"location":"roadmap/#Roadmap","page":"-","title":"Roadmap","text":"","category":"section"},{"location":"roadmap/#Coding-genes-(CDS-ORFs)","page":"-","title":"Coding genes (CDS - ORFs)","text":"","category":"section"},{"location":"roadmap/","page":"-","title":"-","text":"☒ Finding ORFs\n☐ EasyGene\n☐ GLIMMER\n☐ Prodigal - Pyrodigal\n☐ PHANOTATE\n☐ k-mer based gene finders (?)\n☐ Augustus (?)","category":"page"},{"location":"roadmap/#Non-coding-genes-(RNA)","page":"-","title":"Non-coding genes (RNA)","text":"","category":"section"},{"location":"roadmap/","page":"-","title":"-","text":"☐ Infernal\n☐ tRNAscan","category":"page"},{"location":"roadmap/#Other-features","page":"-","title":"Other features","text":"","category":"section"},{"location":"roadmap/","page":"-","title":"-","text":"☐ parallelism SIMD ?\n☐ memory management (?)\n☐ incorporate Ribosome Binding Sites (RBS)\n☐ incorporate Programmed Reading Frame Shifting (PRFS)\n☐ specialized types\n☒ Gene\n☒ ORF\n☒ Codon\n☒ CDS\n☐ EukaryoticGene (?)\n☐ ProkaryoticGene (?)\n☐ Intron\n☐ Exon\n☐ GFF –\\> See other packages\n☐ FASTX –\\> See I/O in other 
packages","category":"page"},{"location":"roadmap/#Compatibilities","page":"-","title":"Compatibilities","text":"","category":"section"},{"location":"roadmap/","page":"-","title":"-","text":"Must interact with or extend:","category":"page"},{"location":"roadmap/","page":"-","title":"-","text":"GenomicAnnotations.jl\nBioSequences.jl\nSequenceVariation.jl\nGenomicFeatures.jl\nFASTX.jl\nKmers.jl\nGraphs.jl","category":"page"},{"location":"","page":"Home","title":"Home","text":"\n

\n
\n A Gene Finder framework for Julia.\n

","category":"page"},{"location":"","page":"Home","title":"Home","text":"\n
\n","category":"page"},{"location":"","page":"Home","title":"Home","text":"","category":"page"},{"location":"#Overview","page":"Home","title":"Overview","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"This is a species-agnostic and algorithm extensible gene finder library for the Julia Language.","category":"page"},{"location":"#Installation","page":"Home","title":"Installation","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"You can install GeneFinder from the julia REPL. Press ] to enter pkg mode, and enter the following:","category":"page"},{"location":"","page":"Home","title":"Home","text":"add GeneFinder\n","category":"page"},{"location":"#Citing","page":"Home","title":"Citing","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"@misc{GeneFinder.jl,\n\tauthor = {Camilo García},\n\ttitle = {GeneFinder.jl},\n\turl = {https://github.com/camilogarciabotero/GeneFinder.jl},\n\tversion = {v0.3.0},\n\tyear = {2024},\n\tmonth = {04}\n}","category":"page"}] +[{"location":"iodocs/#Writting-ORFs-into-bioinformatic-formats","page":"Wrtiting ORFs In Files","title":"Writting ORFs into bioinformatic formats","text":"","category":"section"},{"location":"iodocs/","page":"Wrtiting ORFs In Files","title":"Wrtiting ORFs In Files","text":"This package facilitates the creation of FASTA, BED, and GFF files, specifically extracting Open Reading Frame (ORF) information from BioSequence instances, particularly those of type NucleicSeqOrView{A} where A, and then writing the information into the desired format.","category":"page"},{"location":"iodocs/","page":"Wrtiting ORFs In Files","title":"Wrtiting ORFs In Files","text":"Functionality:","category":"page"},{"location":"iodocs/","page":"Wrtiting ORFs In Files","title":"Wrtiting ORFs In Files","text":"The package provides four distinct functions for writing files in different 
formats:","category":"page"},{"location":"iodocs/","page":"Wrtiting ORFs In Files","title":"Wrtiting ORFs In Files","text":"Function Description\nwrite_orfs_fna Writes nucleotide sequences in FASTA format.\nwrite_orfs_faa Writes amino acid sequences in FASTA format.\nwrite_orfs_bed Outputs information in BED format.\nwrite_orfs_gff Generates files in GFF format.","category":"page"},{"location":"iodocs/","page":"Wrtiting ORFs In Files","title":"Wrtiting ORFs In Files","text":"All these functions support processing both BioSequence instances and external FASTA files. In the case of a BioSequence instace into external files, simply provide the path to the FASTA file using a String to the path. To demonstrate the use of the write_* methods with a BioSequence, consider the following example:","category":"page"},{"location":"iodocs/","page":"Wrtiting ORFs In Files","title":"Wrtiting ORFs In Files","text":"using BioSequences, GeneFinder\n\n# > 180195.SAMN03785337.LFLS01000089 -> finds only 1 gene in Prodigal (from Pyrodigal tests)\nseq = dna\"AACCAGGGCAATATCAGTACCGCGGGCAATGCAACCCTGACTGCCGGCGGTAACCTGAACAGCACTGGCAATCTGACTGTGGGCGGTGTTACCAACGGCACTGCTACTACTGGCAACATCGCACTGACCGGTAACAATGCGCTGAGCGGTCCGGTCAATCTGAATGCGTCGAATGGCACGGTGACCTTGAACACGACCGGCAATACCACGCTCGGTAACGTGACGGCACAAGGCAATGTGACGACCAATGTGTCCAACGGCAGTCTGACGGTTACCGGCAATACGACAGGTGCCAACACCAACCTCAGTGCCAGCGGCAACCTGACCGTGGGTAACCAGGGCAATATCAGTACCGCAGGCAATGCAACCCTGACGGCCGGCGACAACCTGACGAGCACTGGCAATCTGACTGTGGGCGGCGTCACCAACGGCACGGCCACCACCGGCAACATCGCGCTGACCGGTAACAATGCACTGGCTGGTCCTGTCAATCTGAACGCGCCGAACGGCACCGTGACCCTGAACACAACCGGCAATACCACGCTGGGTAATGTCACCGCACAAGGCAATGTGACGACTAATGTGTCCAACGGCAGCCTGACAGTCGCTGGCAATACCACAGGTGCCAACACCAACCTGAGTGCCAGCGGCAATCTGACCGTGGGCAACCAGGGCAATATCAGTACCGCGGGCAATGCAACCCTGACTGCCGGCGGTAACCTGAGC\"","category":"page"},{"location":"iodocs/","page":"Wrtiting ORFs In Files","title":"Wrtiting ORFs In Files","text":"Once a BioSequence object has been instantiated, the write_orfs_fna function proves useful for 
generating a FASTA file containing the nucleotide sequences of the ORFs. Notably, the write_orfs* methods support either an IOStream or an IOBuffer as an output argument, allowing flexibility in directing the output either to a file or a buffer. In the following example, we demonstrate writing the output directly to a file.","category":"page"},{"location":"iodocs/","page":"Wrtiting ORFs In Files","title":"Wrtiting ORFs In Files","text":"outfile = \"LFLS01000089.fna\"\n\nopen(outfile, \"w\") do io\n write_orfs_fna(seq, io, NaiveFinder())\nend","category":"page"},{"location":"iodocs/","page":"Wrtiting ORFs In Files","title":"Wrtiting ORFs In Files","text":"cat LFLS01000089.fna\n\n>seq id=01 start=29 stop=40 strand=+ frame=2 score=0.0\nATGCAACCCTGA\n>seq id=02 start=137 stop=145 strand=+ frame=2 score=0.0\nATGCGCTGA\n>seq id=03 start=164 stop=184 strand=+ frame=2 score=0.0\nATGCGTCGAATGGCACGGTGA\n>seq id=04 start=173 stop=184 strand=+ frame=2 score=0.0\nATGGCACGGTGA\n>seq id=05 start=236 stop=241 strand=+ frame=2 score=0.0\nATGTGA\n>seq id=06 start=248 stop=268 strand=+ frame=2 score=0.0\nATGTGTCCAACGGCAGTCTGA\n>seq id=07 start=362 stop=373 strand=+ frame=2 score=0.0\nATGCAACCCTGA\n>seq id=08 start=470 stop=496 strand=+ frame=2 score=0.0\nATGCACTGGCTGGTCCTGTCAATCTGA\n>seq id=09 start=551 stop=574 strand=+ frame=2 score=0.0\nATGTCACCGCACAAGGCAATGTGA\n>seq id=10 start=569 stop=574 strand=+ frame=2 score=0.0\nATGTGA\n>seq id=11 start=581 stop=601 strand=+ frame=2 score=0.0\nATGTGTCCAACGGCAGCCTGA\n>seq id=12 start=695 stop=706 strand=+ frame=2 score=0.0\nATGCAACCCTGA","category":"page"},{"location":"iodocs/","page":"Wrtiting ORFs In Files","title":"Wrtiting ORFs In Files","text":"The same pattern can be used to write a FASTA file with the amino acid sequences of the ORFs using the write_orfs_faa function. 
Similarly for the BED and GFF files using the write_orfs_bed and write_orfs_gff functions respectively.","category":"page"},{"location":"api/","page":"API","title":"API","text":"CurrentModule = GeneFinder\nDocTestSetup = quote\n using GeneFinder\nend","category":"page"},{"location":"api/#The-Main-ORF-type","page":"API","title":"The Main ORF type","text":"","category":"section"},{"location":"api/","page":"API","title":"API","text":"The main type of the package is ORF which represents an Open Reading Frame.","category":"page"},{"location":"api/","page":"API","title":"API","text":"Modules = [GeneFinder]\nPages = [\"types.jl\"]","category":"page"},{"location":"api/#GeneFinder.ORF","page":"API","title":"GeneFinder.ORF","text":"struct ORF{N,F} <: GenomicFeatures.AbstractGenomicInterval{F}\n\nThe ORF struct represents an Open Reading Frame (ORF) in genomics.\n\nFields\n\ngroupname::String: The name of the group to which the ORF belongs.\nfirst::Int64: The starting position of the ORF.\nlast::Int64: The ending position of the ORF.\nstrand::Strand: The strand on which the ORF is located.\nframe::Int: The reading frame of the ORF.\nfeatures::Features: The features associated with the ORF.\nscheme::Union{Nothing,Function}: The scheme used for the ORF.\n\nConstructor\n\nORF{N,F}(\n groupname::String,\n first::Int64,\n last::Int64,\n strand::Strand,\n frame::Int,\n features::Features,\n scheme::Union{Nothing,Function}\n)\n\n# Example\n\nA full instance `ORF`\n\n\njulia ORF{4,NaiveFinder}(\"seq01\", 1, 33, STRAND_POS, 1, Features((score = 0.0,)), nothing)\n\n\nA partial instance `ORF`\n\n\njulia ORF{NaiveFinder}(1:33, '+', 1) ```\n\n\n\n\n\n","category":"type"},{"location":"api/#FASTX.sequence-Union{Tuple{ORF{N, F}}, Tuple{F}, Tuple{N}} where {N, F}","page":"API","title":"FASTX.sequence","text":"sequence(i::ORF{N,F})\n\nExtracts the DNA sequence corresponding to the given open reading frame (ORF).\n\nArguments\n\ni::ORF{N,F}: The open reading frame (ORF) for which the DNA 
sequence needs to be extracted.\n\nReturns\n\nThe DNA sequence corresponding to the given open reading frame (ORF).\n\n\n\n\n\n","category":"method"},{"location":"api/#GeneFinder.features-Union{Tuple{ORF{N, F}}, Tuple{F}, Tuple{N}} where {N, F}","page":"API","title":"GeneFinder.features","text":"features(i::ORF{N,F})\n\nExtracts the features from an ORF object.\n\nArguments\n\ni::ORF{N,F}: An ORF object.\n\nReturns\n\nThe features of the ORF object.\n\n\n\n\n\n","category":"method"},{"location":"api/#GeneFinder.source-Union{Tuple{ORF{N, F}}, Tuple{F}, Tuple{N}} where {N, F}","page":"API","title":"GeneFinder.source","text":"source(i::ORF{N,F})\n\nGet the source sequence associated with the given ORF object.\n\nArguments\n\ni::ORF{N,F}: The ORF object for which to retrieve the source sequence.\n\nReturns\n\nThe source sequence associated with the ORF object.\n\nExamples\n\nseq = dna\"ATGATGCATGCATGCATGCTAGTAACTAGCTAGCTAGCTAGTAA\"\norfs = findorfs(seq)\nsource(orfs[1])\n\n44nt DNA Sequence:\nATGATGCATGCATGCATGCTAGTAACTAGCTAGCTAGCTAGTAA\n\nwarning: Warning\nThe source method works only if the sequence is defined in the global scope; otherwise it will throw an error. For instance, a common failure is to define a simple ORF, which by default has \"unnamedsource\" as its groupname, and then try to get the source sequence:\n\norf = ORF{NaiveFinder}(1:33, '+', 1)\nsource(orf)\n\nERROR: UndefVarError: `unnamedsource` not defined\nStacktrace:\n [1] source(i::ORF{4, NaiveFinder})\n @ GeneFinder ~/.julia/dev/GeneFinder/src/types.jl:192\n [2] top-level scope\n @ REPL[12]:1\n\n\n\n\n\n","category":"method"},{"location":"api/#Finding-ORFs","page":"API","title":"Finding ORFs","text":"","category":"section"},{"location":"api/","page":"API","title":"API","text":"The function findorfs is the main function of the package. It is a generic method that can handle different gene finding methods. 
","category":"page"},{"location":"api/","page":"API","title":"API","text":"Modules = [GeneFinder]\nPages = [\"findorfs.jl\"]","category":"page"},{"location":"api/#GeneFinder.findorfs-Union{Tuple{Union{BioSequences.LongDNA{N}, BioSequences.LongSubSeq{BioSequences.DNAAlphabet{N}}}}, Tuple{F}, Tuple{N}} where {N, F<:GeneFinder.GeneFinderMethod}","page":"API","title":"GeneFinder.findorfs","text":"findorfs(sequence::NucleicSeqOrView{DNAAlphabet{N}}; ::M, kwargs...) where {N, M<:GeneFinderMethod}\n\nThis is the main interface method for finding open reading frames (ORFs) in a DNA sequence.\n\nIt takes the following required arguments:\n\nsequence: The nucleic acid sequence to search for ORFs.\nmethod: The algorithm used to find ORFs. It can be either NaiveFinder(), NaiveFinderScored() or yet other implementations.\n\nKeyword Arguments regardless of the finder method:\n\nalternative_start::Bool: A boolean indicating whether to consider alternative start codons. Default is false.\nminlen::Int: The minimum length of an ORF. Default is 6.\nscheme::Function: The scoring scheme to use for scoring the sequence from the ORF. 
Default is nothing.\n\nReturns\n\nA vector of ORF objects representing the found ORFs.\n\nExample\n\nsequence = randdnaseq(120)\n\n120nt DNA Sequence:\n GCCGGACAGCGAAGGCTAATAAATGCCCGTGCCAGTATC…TCTGAGTTACTGTACACCCGAAAGACGTTGTACGCATTT\n\nfindorfs(sequence, NaiveFinder())\n\n1-element Vector{ORF}:\n ORF{NaiveFinder}(77:118, '-', 2, 0.0)\n\n\n\n\n\n","category":"method"},{"location":"api/#Finding-ORFs-using-BioRegex-and-scoring","page":"API","title":"Finding ORFs using BioRegex and scoring","text":"","category":"section"},{"location":"api/","page":"API","title":"API","text":"Modules = [GeneFinder]\nPages = [\"algorithms/naivefinder.jl\"]","category":"page"},{"location":"api/#GeneFinder.NaiveFinder-Union{Tuple{Union{BioSequences.LongDNA{N}, BioSequences.LongSubSeq{BioSequences.DNAAlphabet{N}}}}, Tuple{N}} where N","page":"API","title":"GeneFinder.NaiveFinder","text":"NaiveFinder(sequence::NucleicSeqOrView{DNAAlphabet{N}}; kwargs...) -> Vector{ORF} where {N}\n\nA simple implementation that finds ORFs in a DNA sequence.\n\nThe NaiveFinder method takes a LongSequence{DNAAlphabet{4}} sequence and returns a Vector{ORF} containing the ORFs found in the sequence. It searches entire regularly expressed CDS, adding each ORF it finds to the vector. The function also searches the reverse complement of the sequence, so it finds ORFs on both strands. Extending the starting codons with the alternative_start = true will search for ATG, GTG, and TTG. Some studies have shown that in E. coli (K-12 strain), ATG, GTG and TTG are used 83 %, 14 % and 3 % respectively.\n\nnote: Note\nThis function has neither ORFs scoring scheme by default nor length constraints. Thus it might consider aa\"M*\" a posible encoding protein from the resulting ORFs.\n\nRequired Arguments\n\nsequence::NucleicSeqOrView{DNAAlphabet{N}}: The nucleic acid sequence to search for ORFs.\n\nKeywords Arguments\n\nalternative_start::Bool: If true will pass the extended start codons to search. 
This will increase 3x the execution time. Default is false.\nminlen::Int64=6: Minimum length of an allowed ORF. The default value allows aa\"M*\" as a possible encoded protein among the resulting ORFs.\nscheme::Function: The scoring scheme to use for scoring the sequence from the ORF. Default is nothing.\n\nnote: Note\nAs the scheme is generally a scoring function that at least requires a sequence, one simple scheme is the log-odds ratio score. This score is a log-odds ratio that compares the probability of the sequence generated by a coding model to the probability of the sequence generated by a non-coding model:\n\nS(x) = \\sum_{i=1}^{L} \\beta_{x_{i}x} = \\sum_{i=1}^{L} \\log \\frac{a^{\\mathscr{m}_{1}}_{i-1} x_i}{a^{\\mathscr{m}_{2}}_{i-1} x_i}\n\nIf the log-odds ratio exceeds a given threshold (η), the sequence is considered likely to be coding. See lordr for more information about coding criteria.\n\n\n\n\n\n","category":"method"},{"location":"api/#GeneFinder._locationiterator-Union{Tuple{Union{BioSequences.LongDNA{N}, BioSequences.LongSubSeq{BioSequences.DNAAlphabet{N}}}}, Tuple{N}} where N","page":"API","title":"GeneFinder._locationiterator","text":"_locationiterator(sequence::NucleicSeqOrView{DNAAlphabet{N}}; alternative_start::Bool=false) where {N}\n\nThis is an iterator function that uses regular expressions to search the entire ORF (instead of start and stop codons) in a LongSequence{DNAAlphabet{4}} sequence. It uses an anonymous function that will find the first regularly expressed ORF. Then using this anonymous function it creates an iterator that will apply it until there is no other CDS.\n\nnote: Note\nAs a note of the implementation we want to expand on how the ORFs are found: The expression (?:[N]{3})*? serves as the boundary between the start and stop codons. Within this expression, the character class [N]{3} captures exactly three occurrences of any character (representing nucleotides using IUPAC codes). This portion functions as the regular codon matches. 
Since it is enclosed within a non-capturing group (?:) and followed by *?, it allows for the matching of intermediate codons, but with a preference for the smallest number of repetitions. In summary, the regular expression ATG(?:[N]{3})*?T(AG|AA|GA) identifies patterns that start with \"ATG,\" followed by any number of three-character codons (represented by \"N\" in the IUPAC code), and end with a stop codon \"TAG,\" \"TAA,\" or \"TGA.\" This pattern is commonly used to identify potential protein-coding regions within genetic sequences. See more about the discussion here\n\n\n\n\n\n","category":"method"},{"location":"api/#Writing-ORFs-to-files","page":"API","title":"Writing ORFs to files","text":"","category":"section"},{"location":"api/","page":"API","title":"API","text":"Modules = [GeneFinder]\nPages = [\"io.jl\"]","category":"page"},{"location":"api/#GeneFinder.write_orfs_bed-Union{Tuple{F}, Tuple{N}, Tuple{Union{BioSequences.LongDNA{N}, BioSequences.LongSubSeq{BioSequences.DNAAlphabet{N}}}, Union{IOStream, IOBuffer}}} where {N, F<:GeneFinder.GeneFinderMethod}","page":"API","title":"GeneFinder.write_orfs_bed","text":"write_orfs_bed(input::NucleicSeqOrView{DNAAlphabet{N}}, output::Union{IOStream, IOBuffer}, finder::F; kwargs...)\nwrite_orfs_bed(input::NucleicSeqOrView{DNAAlphabet{N}}, output::String, finder::F; kwargs...)\n\nWrite BED data to a file.\n\nArguments\n\ninput: The input DNA sequence NucSeq or a view.\noutput: The output destination; it can be a file (String) or a buffer (IOStream or IOBuffer).\nfinder: The algorithm used to find ORFs. It can be either NaiveFinder() or NaiveFinderScored().\n\nKeywords\n\nalternative_start::Bool=false: If true, alternative start codons will be used when identifying CDSs. Default is false.\nminlen::Int64=6: The minimum length that a CDS must have in order to be included in the output file. 
Default is 6.\n\n\n\n\n\n","category":"method"},{"location":"api/#GeneFinder.write_orfs_faa-Union{Tuple{F}, Tuple{N}, Tuple{Union{BioSequences.LongDNA{N}, BioSequences.LongSubSeq{BioSequences.DNAAlphabet{N}}}, Union{IOStream, IOBuffer}}} where {N, F<:GeneFinder.GeneFinderMethod}","page":"API","title":"GeneFinder.write_orfs_faa","text":"write_orfs_faa(input::NucleicSeqOrView{DNAAlphabet{4}}, output::Union{IOStream, IOBuffer}, finder::F; kwargs...)\nwrite_orfs_faa(input::NucleicSeqOrView{DNAAlphabet{4}}, output::String, finder::F; kwargs...)\n\nWrite the protein sequences encoded by the coding sequences (CDSs) of a given DNA sequence to the specified file.\n\nArguments\n\ninput: The input DNA sequence NucSeq or a view.\noutput: The output destination; it can be a file (String) or a buffer (IOStream or IOBuffer).\nfinder: The algorithm used to find ORFs. It can be either NaiveFinder() or NaiveFinderScored().\n\nKeywords\n\ncode::GeneticCode=BioSequences.standard_genetic_code: The genetic code by which codons will be translated. See BioSequences.ncbi_trans_table for more info. \nalternative_start::Bool=false: If true, the extended start codons will be included in the search. This will increase the execution time 3x.\nminlen::Int64=6: Minimum length of an allowed ORF. 
The default value allows aa\"M*\" as a possible encoded protein from the resulting ORFs.\n\nExamples\n\nfilename = \"output.faa\"\n\nseq = dna\"ATGATGCATGCATGCATGCTAGTAACTAGCTAGCTAGCTAGTAA\"\n\nopen(filename, \"w\") do file\n write_orfs_faa(seq, file, NaiveFinder())\nend\n\n\n\n\n\n","category":"method"},{"location":"api/#GeneFinder.write_orfs_fna-Union{Tuple{F}, Tuple{N}, Tuple{Union{BioSequences.LongDNA{N}, BioSequences.LongSubSeq{BioSequences.DNAAlphabet{N}}}, Union{IOStream, IOBuffer}}} where {N, F<:GeneFinder.GeneFinderMethod}","page":"API","title":"GeneFinder.write_orfs_fna","text":"write_orfs_fna(input::NucleicSeqOrView{DNAAlphabet{N}}, output::Union{IOStream, IOBuffer}, finder::F; kwargs...)\nwrite_orfs_fna(input::NucleicSeqOrView{DNAAlphabet{N}}, output::String, finder::F; kwargs...)\n\nWrite the coding sequences (CDSs) of a given DNA sequence to the specified file.\n\nArguments\n\ninput::NucleicAcidAlphabet{DNAAlphabet{N}}: The input DNA sequence.\noutput::IO: The output destination, which can be a file (String) or a buffer (IOStream or IOBuffer).\nfinder::F: The algorithm used to find ORFs. It can be either NaiveFinder() or NaiveFinderScored().\n\nKeywords\n\nalternative_start::Bool=false: If true, alternative start codons will be used when identifying CDSs. Default is false.\nminlen::Int64=6: The minimum length that a CDS must have in order to be included in the output file. 
Default is 6.\n\nExamples\n\nfilename = \"output.fna\"\n\nseq = dna\"ATGATGCATGCATGCATGCTAGTAACTAGCTAGCTAGCTAGTAA\"\n\nopen(filename, \"w\") do file\n write_orfs_fna(seq, file, NaiveFinder())\nend\n\n\n\n\n\n","category":"method"},{"location":"api/#GeneFinder.write_orfs_gff-Union{Tuple{F}, Tuple{N}, Tuple{Union{BioSequences.LongDNA{N}, BioSequences.LongSubSeq{BioSequences.DNAAlphabet{N}}}, Union{IOStream, IOBuffer}}} where {N, F<:GeneFinder.GeneFinderMethod}","page":"API","title":"GeneFinder.write_orfs_gff","text":"write_orfs_gff(input::NucleicSeqOrView{DNAAlphabet{N}}, output::Union{IOStream, IOBuffer}, finder::F; kwargs...)\nwrite_orfs_gff(input::NucleicSeqOrView{DNAAlphabet{N}}, output::String, finder::F; kwargs...)\n\nWrite GFF data to a file.\n\nArguments\n\ninput: The input DNA sequence NucSeq or a view.\noutput: The output destination, which can be a file (String) or a buffer (IOStream or IOBuffer).\nfinder: The algorithm used to find ORFs. It can be either NaiveFinder() or NaiveFinderScored().\n\nKeywords\n\ncode::GeneticCode=BioSequences.standard_genetic_code: The genetic code by which codons will be translated. See BioSequences.ncbi_trans_table for more info. \nalternative_start::Bool=false: If true, the extended set of start codons is included in the search. This will increase the execution time 3x.\nminlen::Int64=6: Minimum length of the allowed ORF. The default value allows aa\"M*\" as a possible encoded protein from the resulting ORFs.\n\n\n\n\n\n","category":"method"},{"location":"simplecodingrule/#Scoring-a-sequence-using-a-Markov-model","page":"A Simple Coding Rule","title":"Scoring a sequence using a Markov model","text":"","category":"section"},{"location":"simplecodingrule/","page":"A Simple Coding Rule","title":"A Simple Coding Rule","text":"A DNA sequence can be scored using a Markov model of the transition probabilities of a known sequence. 
This could be done using a log-odds ratio score, which is the logarithm of the ratio of the transition probabilities of the sequence given one model and another. The log-odds ratio score is defined as:","category":"page"},{"location":"simplecodingrule/","page":"A Simple Coding Rule","title":"A Simple Coding Rule","text":"\\begin{align}\nS(x) = \\sum_{i=1}^{L} \\beta_{x_{i}x} = \\sum_{i=1}^{L} \\log \\frac{a^{\\mathscr{m}_{1}}_{i-1} x_i}{a^{\\mathscr{m}_{2}}_{i-1} x_i}\n\\end{align}","category":"page"},{"location":"simplecodingrule/","page":"A Simple Coding Rule","title":"A Simple Coding Rule","text":"Where a^{\\mathscr{m}_{1}}_{i-1} x_i is the transition probability of the first model (in this case, the one calculated for the given sequence) from the state x_{i-1} to the state x_i, and a^{\\mathscr{m}_{2}}_{i-1} x_i is the transition probability of the second model from the state x_{i-1} to the state x_i. The score is the sum of the log-odds ratios of the transition probabilities of the sequence given the two models.","category":"page"},{"location":"simplecodingrule/","page":"A Simple Coding Rule","title":"A Simple Coding Rule","text":"In the current implementation the second model is a CDS transition probability model of E. coli. This classification score is implemented in the naivescorefinder method. This method will return ORFs with the associated score of the sequence given the CDS model of E. 
coli.","category":"page"},{"location":"simplecodingrule/","page":"A Simple Coding Rule","title":"A Simple Coding Rule","text":"using GeneFinder, BioSequences\n\nseq = dna\"TTCGTCAGTCGTTCTGTTTCATTCAATACGATAGTAATGTATTTTTCGTGCATTTCCGGTGGAATCGTGCCGTCCAGCATAGCCTCCAGATATCCCCTTATAGAGGTCAGAGGGGAACGGAAATCGTGGGATACATTGGCTACAAACTTTTTCTGATCATCCTCGGAACGGGCAATTTCGCTTGCCATATAATTCAGACAGGAAGCCAGATAACCGATTTCATCCTCACTATCGACCTGAAATTCATAATGCATATTACCGGCAGCATACTGCTCTGTGGCATGAGTGATCTTCCTCAGAGGAATATATACGATCTCAGTGAAAAAGATCAGAATGATCAGGGATAGCAGGAACAGGATTGCCAGGGTGATATAGGAAATATTCAGCAGGTTGTTACAGGATTTCTGAATATCATTCATATCAGTATGGATGACTACATAGCCTTTTACCTTGTAGTTGGAGGTAATGGGAGCAAATACAGTAAGTACATCCGAATCAAAATTACCGAAGAAATCACCAACAATGTAATAGGAGCCGCTGGTTACGGTCGAATCAAAATTCTCAATGACAACCACATTCTCCACATCTAAGGGACTATTGGTATCCAGTACCAGTCGTCCGGAGGGATTGATGATGCGAATCTCGGAATTCAGGTAGACCGCCAGGGAGTCCAGCTGCATTTTAACGGTCTCCAAAGTTGTTTCACTGGTGTACAATCCGCCGGCATAGGTTCCGGCGATCAGGGTTGCTTCGGAATAGAGACTTTCTGCCTTTTCCCGGATCAGATGTTCTTTGGTCATATTGGGAACAAAAGTTGTAACAATGATGAAACCAAATACACCAAAAATAAAATATGCGAGTATAAATTTTAGATAAAGTGTTTTTTTCATAACAAATCCTGCTTTTGGTATGACTTAATTACGTACTTCGAATTTATAGCCGATGCCCCAGATGGTGCTGATCTTCCAGTTGGCATGATCCTTGATCTTCTC\"\n\nfindorfs(seq, minlen=75, finder=NaiveFinder)\n\n9-element Vector{ORF{4, NaiveFinder}}:\n ORF{NaiveFinder}(37:156, '+', 1,)\n ORF{NaiveFinder}(194:268, '-', 2)\n ORF{NaiveFinder}(194:283, '-', 2)\n ORF{NaiveFinder}(249:347, '+', 3)\n ORF{NaiveFinder}(426:590, '+', 3)\n ORF{NaiveFinder}(565:657, '+', 1)\n ORF{NaiveFinder}(650:727, '-', 2)\n ORF{NaiveFinder}(786:872, '+', 3)\n ORF{NaiveFinder}(887:976, '-', 2)","category":"page"},{"location":"simplecodingrule/#The-*log-odds-ratio*-decision-rule","page":"A Simple Coding Rule","title":"The log-odds ratio decision rule","text":"","category":"section"},{"location":"simplecodingrule/","page":"A Simple Coding Rule","title":"A Simple Coding Rule","text":"The sequence probability given a transition probability model could be used as the source of a sequence classification based 
on a decision rule to classify whether a sequence corresponds to one model or another. Now, imagine we have two DNA sequence transition models, a CDS model and a No-CDS model. The log-odds ratio decision rule could be established as:","category":"page"},{"location":"simplecodingrule/","page":"A Simple Coding Rule","title":"A Simple Coding Rule","text":"\\begin{align}\nS(X) = \\log \\frac{P_C(X_1=i_1 \\ldots X_T=i_T)}{P_N(X_1=i_1 \\ldots X_T=i_T)} \\begin{cases} > \\eta & \\Rightarrow \\text{coding} \\\\\\\\ < \\eta & \\Rightarrow \\text{noncoding} \\end{cases}\n\\end{align}","category":"page"},{"location":"simplecodingrule/","page":"A Simple Coding Rule","title":"A Simple Coding Rule","text":"Where P_C is the probability of the sequence given a CDS model and P_N is the probability of the sequence given a No-CDS model. The decision is based on whether the log-odds ratio is greater or less than a given significance threshold η.","category":"page"},{"location":"simplecodingrule/","page":"A Simple Coding Rule","title":"A Simple Coding Rule","text":"In this package we have implemented this rule using basic E. coli CDS and No-CDS models from the work of Axelson-Fisk (2015), which are implemented in the BioMarkovChains.jl package. 
To check whether a random sequence could be coding based on these decision we use the predicate log_odds_ratio_decision_rule with the ECOLICDS and ECOLINOCDS models:","category":"page"},{"location":"simplecodingrule/","page":"A Simple Coding Rule","title":"A Simple Coding Rule","text":"orfsdna = findorfs(seq, minlen=75, alternative_start=true) .|> sequence\n\n20-element Vector{NucSeq{4, DNAAlphabet{4}}}\n ATGTATTTTTCGTGCATTTCCGGTGGAATCGTGCCGTCC…CGGAAATCGTGGGATACATTGGCTACAAACTTTTTCTGA\n GTGCATTTCCGGTGGAATCGTGCCGTCCAGCATAGCCTC…TACGATCTCAGTGAAAAAGATCAGAATGATCAGGGATAG\n GTGCCGTCCAGCATAGCCTCCAGATATCCCCTTATAGAG…CGGAAATCGTGGGATACATTGGCTACAAACTTTTTCTGA\n GTGGGATACATTGGCTACAAACTTTTTCTGATCATCCTC…TACGATCTCAGTGAAAAAGATCAGAATGATCAGGGATAG\n TTGCCATATAATTCAGACAGGAAGCCAGATAACCGATTT…GCATATTACCGGCAGCATACTGCTCTGTGGCATGAGTGA\n ATGCTGCCGGTAATATGCATTATGAATTTCAGGTCGATAGTGAGGATGAAATCGGTTATCTGGCTTCCTGTCTGA\n ATGCCACAGAGCAGTATGCTGCCGGTAATATGCATTATG…ATAGTGAGGATGAAATCGGTTATCTGGCTTCCTGTCTGA\n ATGCATATTACCGGCAGCATACTGCTCTGTGGCATGAGT…TACGATCTCAGTGAAAAAGATCAGAATGATCAGGGATAG\n GTGATCTTCCTCAGAGGAATATATACGATCTCAGTGAAA…ATCAGGGATAGCAGGAACAGGATTGCCAGGGTGATATAG\n ATGGATGACTACATAGCCTTTTACCTTGTAGTTGGAGGT…ATCAAAATTCTCAATGACAACCACATTCTCCACATCTAA\n TTGGTGATTTCTTCGGTAATTTTGATTCGGATGTACTTACTGTATTTGCTCCCATTACCTCCAACTACAAGGTAA\n TTGTTGGTGATTTCTTCGGTAATTTTGATTCGGATGTACTTACTGTATTTGCTCCCATTACCTCCAACTACAAGGTAA\n ATGACAACCACATTCTCCACATCTAAGGGACTATTGGTA…CCGGAGGGATTGATGATGCGAATCTCGGAATTCAGGTAG\n ATGCCGGCGGATTGTACACCAGTGAAACAACTTTGGAGACCGTTAAAATGCAGCTGGACTCCCTGGCGGTCTACCTGA\n TTGTTTCACTGGTGTACAATCCGCCGGCATAGGTTCCGG…TCAGATGTTCTTTGGTCATATTGGGAACAAAAGTTGTAA\n TTGCTTCGGAATAGAGACTTTCTGCCTTTTCCCGGATCAGATGTTCTTTGGTCATATTGGGAACAAAAGTTGTAA\n ATGTTCTTTGGTCATATTGGGAACAAAAGTTGTAACAAT…AAATACACCAAAAATAAAATATGCGAGTATAAATTTTAG\n TTGGTCATATTGGGAACAAAAGTTGTAACAATGATGAAA…ACACCAAAAATAAAATATGCGAGTATAAATTTTAGATAA\n TTGGGAACAAAAGTTGTAACAATGATGAAACCAAATACACCAAAAATAAAATATGCGAGTATAAATTTTAGATAA\n 
ATGCCAACTGGAAGATCAGCACCATCTGGGGCATCGGCT…TACGTAATTAAGTCATACCAAAAGCAGGATTTGTTATGA\n","category":"page"},{"location":"simplecodingrule/","page":"A Simple Coding Rule","title":"A Simple Coding Rule","text":"Now the question is which of those sequences we can consider coding sequences. We can use the iscoding predicate to check whether a sequence is coding or not based on the log-odds ratio decision rule:","category":"page"},{"location":"simplecodingrule/","page":"A Simple Coding Rule","title":"A Simple Coding Rule","text":"iscoding.(orfsdna) # criteria = log_odds_ratio_decision_rule","category":"page"},{"location":"simplecodingrule/","page":"A Simple Coding Rule","title":"A Simple Coding Rule","text":"20-element BitVector:\n 0\n 0\n 0\n 0\n 0\n 1\n 1\n 0\n 0\n 0\n 0\n 0\n 0\n 1\n 0\n 0\n 0\n 0\n 0\n 0","category":"page"},{"location":"simplecodingrule/","page":"A Simple Coding Rule","title":"A Simple Coding Rule","text":"In this case, the sequence has 20 ORFs and only 3 of them are classified as coding sequences. The classification is based on the log-odds ratio decision rule and the transition probability models of E. coli CDS and No-CDS. Broadcasting the iscoding predicate, which uses log_odds_ratio_decision_rule as its criterion, returns a Boolean vector with the classification of each ORF in the sequence. 
Now we can simply filter the ORFs that are coding sequences:","category":"page"},{"location":"simplecodingrule/","page":"A Simple Coding Rule","title":"A Simple Coding Rule","text":"orfs = filter(orf -> iscoding(orf), orfsdna)\n\n3-element Vector{NucSeq{4, DNAAlphabet{4}}}\n ATGCTGCCGGTAATATGCATTATGAATTTCAGGTCGATAGTGAGGATGAAATCGGTTATCTGGCTTCCTGTCTGA\n ATGCCACAGAGCAGTATGCTGCCGGTAATATGCATTATG…ATAGTGAGGATGAAATCGGTTATCTGGCTTCCTGTCTGA\n ATGCCGGCGGATTGTACACCAGTGAAACAACTTTGGAGACCGTTAAAATGCAGCTGGACTCCCTGGCGGTCTACCTGA","category":"page"},{"location":"simplecodingrule/","page":"A Simple Coding Rule","title":"A Simple Coding Rule","text":"Or in terms of the ORF object:","category":"page"},{"location":"simplecodingrule/","page":"A Simple Coding Rule","title":"A Simple Coding Rule","text":"orfs = findorfs(seq, minlen=75, finder=NaiveFinder, alternative_start=true) # find ORFs with alternative start as well\norfs[iscoding.(orfsdna)]\n\n3-element Vector{ORF{4, NaiveFinder}}:\n ORF{NaiveFinder}(194:268, '-', 2, -0.026759927376272922)\n ORF{NaiveFinder}(194:283, '-', 2, -0.010354615336667268)\n ORF{NaiveFinder}(650:727, '-', 2, -0.04303976584597201)","category":"page"},{"location":"simplecodingrule/","page":"A Simple Coding Rule","title":"A Simple Coding Rule","text":"Or in a single line using another genome sequence:","category":"page"},{"location":"simplecodingrule/","page":"A Simple Coding Rule","title":"A Simple Coding Rule","text":"phi = 
dna\"GTGTGAGGTTATAACGCCGAAGCGGTAAAAATTTTAATTTTTGCCGCTGAGGGGTTGACCAAGCGAAGCGCGGTAGGTTTTCTGCTTAGGAGTTTAATCATGTTTCAGACTTTTATTTCTCGCCATAATTCAAACTTTTTTTCTGATAAGCTGGTTCTCACTTCTGTTACTCCAGCTTCTTCGGCACCTGTTTTACAGACACCTAAAGCTACATCGTCAACGTTATATTTTGATAGTTTGACGGTTAATGCTGGTAATGGTGGTTTTCTTCATTGCATTCAGATGGATACATCTGTCAACGCCGCTAATCAGGTTGTTTCTGTTGGTGCTGATATTGCTTTTGATGCCGACCCTAAATTTTTTGCCTGTTTGGTTCGCTTTGAGTCTTCTTCGGTTCCGACTACCCTCCCGACTGCCTATGATGTTTATCCTTTGAATGGTCGCCATGATGGTGGTTATTATACCGTCAAGGACTGTGTGACTATTGACGTCCTTCCCCGTACGCCGGGCAATAACGTTTATGTTGGTTTCATGGTTTGGTCTAACTTTACCGCTACTAAATGCCGCGGATTGGTTTCGCTGAATCAGGTTATTAAAGAGATTATTTGTCTCCAGCCACTTAAGTGAGGTGATTTATGTTTGGTGCTATTGCTGGCGGTATTGCTTCTGCTCTTGCTGGTGGCGCCATGTCTAAATTGTTTGGAGGCGGTCAAAAAGCCGCCTCCGGTGGCATTCAAGGTGATGTGCTTGCTACCGATAACAATACTGTAGGCATGGGTGATGCTGGTATTAAATCTGCCATTCAAGGCTCTAATGTTCCTAACCCTGATGAGGCCGCCCCTAGTTTTGTTTCTGGTGCTATGGCTAAAGCTGGTAAAGGACTTCTTGAAGGTACGTTGCAGGCTGGCACTTCTGCCGTTTCTGATAAGTTGCTTGATTTGGTTGGACTTGGTGGCAAGTCTGCCGCTGATAAAGGAAAGGATACTCGTGATTATCTTGCTGCTGCATTTCCTGAGCTTAATGCTTGGGAGCGTGCTGGTGCTGATGCTTCCTCTGCTGGTATGGTTGACGCCGGATTTGAGAATCAAAAAGAGCTTACTAAAATGCAACTGGACAATCAGAAAGAGATTGCCGAGATGCAAAATGAGACTCAAAAAGAGATTGCTGGCATTCAGTCGGCGACTTCACGCCAGAATACGAAAGACCAGGTATATGCACAAAATGAGATGCTTGCTTATCAACAGAAGGAGTCTACTGCTCGCGTTGCGTCTATTATGGAAAACACCAATCTTTCCAAGCAACAGCAGGTTTCCGAGATTATGCGCCAAATGCTTACTCAAGCTCAAACGGCTGGTCAGTATTTTACCAATGACCAAATCAAAGAAATGACTCGCAAGGTTAGTGCTGAGGTTGACTTAGTTCATCAGCAAACGCAGAATCAGCGGTATGGCTCTTCTCATATTGGCGCTACTGCAAAGGATATTTCTAATGTCGTCACTGATGCTGCTTCTGGTGTGGTTGATATTTTTCATGGTATTGATAAAGCTGTTGCCGATACTTGGAACAATTTCTGGAAAGACGGTAAAGCTGATGGTATTGGCTCTAATTTGTCTAGGAAATAACCGTCAGGATTGACACCCTCCCAATTGTATGTTTTCATGCCTCCAAATCTTGGAGGCTTTTTTATGGTTCGTTCTTATTACCCTTCTGAATGTCACGCTGATTATTTTGACTTTGAGCGTATCGAGGCTCTTAAACCTGCTATTGAGGCTTGTGGCATTTCTACTCTTTCTCAATCCCCAATGCTTGGCTTCCATAAGCAGATGGATAACCGCATCAAGCTCTTGGAAGAGATTCTGTCTTTTCGTATGCAGGGCGTTGAGTTCGATAATGGTGATATGTATGTTGACGGCCATAAGGCTGCTTCTGACGTTCGTGATGAGTTTGTATCTGTTACTGAGAAGTTAATGGATGAATTGGCACAATGCTACAATG
TGCTCCCCCAACTTGATATTAATAACACTATAGACCACCGCCCCGAAGGGGACGAAAAATGGTTTTTAGAGAACGAGAAGACGGTTACGCAGTTTTGCCGCAAGCTGGCTGCTGAACGCCCTCTTAAGGATATTCGCGATGAGTATAATTACCCCAAAAAGAAAGGTATTAAGGATGAGTGTTCAAGATTGCTGGAGGCCTCCACTATGAAATCGCGTAGAGGCTTTGCTATTCAGCGTTTGATGAATGCAATGCGACAGGCTCATGCTGATGGTTGGTTTATCGTTTTTGACACTCTCACGTTGGCTGACGACCGATTAGAGGCGTTTTATGATAATCCCAATGCTTTGCGTGACTATTTTCGTGATATTGGTCGTATGGTTCTTGCTGCCGAGGGTCGCAAGGCTAATGATTCACACGCCGACTGCTATCAGTATTTTTGTGTGCCTGAGTATGGTACAGCTAATGGCCGTCTTCATTTCCATGCGGTGCACTTTATGCGGACACTTCCTACAGGTAGCGTTGACCCTAATTTTGGTCGTCGGGTACGCAATCGCCGCCAGTTAAATAGCTTGCAAAATACGTGGCCTTATGGTTACAGTATGCCCATCGCAGTTCGCTACACGCAGGACGCTTTTTCACGTTCTGGTTGGTTGTGGCCTGTTGATGCTAAAGGTGAGCCGCTTAAAGCTACCAGTTATATGGCTGTTGGTTTCTATGTGGCTAAATACGTTAACAAAAAGTCAGATATGGACCTTGCTGCTAAAGGTCTAGGAGCTAAAGAATGGAACAACTCACTAAAAACCAAGCTGTCGCTACTTCCCAAGAAGCTGTTCAGAATCAGAATGAGCCGCAACTTCGGGATGAAAATGCTCACAATGACAAATCTGTCCACGGAGTGCTTAATCCAACTTACCAAGCTGGGTTACGACGCGACGCCGTTCAACCAGATATTGAAGCAGAACGCAAAAAGAGAGATGAGATTGAGGCTGGGAAAAGTTACTGTAGCCGACGTTTTGGCGGCGCAACCTGTGACGACAAATCTGCTCAAATTTATGCGCGCTTCGATAAAAATGATTGGCGTATCCAACCTGCAGAGTTTTATCGCTTCCATGACGCAGAAGTTAACACTTTCGGATATTTCTGATGAGTCGAAAAATTATCTTGATAAAGCAGGAATTACTACTGCTTGTTTACGAATTAAATCGAAGTGGACTGCTGGCGGAAAATGAGAAAATTCGACCTATCCTTGCGCAGCTCGAGAAGCTCTTACTTTGCGACCTTTCGCCATCAACTAACGATTCTGTCAAAAACTGACGCGTTGGATGAGGAGAAGTGGCTTAATATGCTTGGCACGTTCGTCAAGGACTGGTTTAGATATGAGTCACATTTTGTTCATGGTAGAGATTCTCTTGTTGACATTTTAAAAGAGCGTGGATTACTATCTGAGTCCGATGCTGTTCAACCACTAATAGGTAAGAAATCATGAGTCAAGTTACTGAACAATCCGTACGTTTCCAGACCGCTTTGGCCTCTATTAAGCTCATTCAGGCTTCTGCCGTTTTGGATTTAACCGAAGATGATTTCGATTTTCTGACGAGTAACAAAGTTTGGATTGCTACTGACCGCTCTCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTACGCTGGACTTTGTGGGATACCCTCGCTTTCCTGCTCCTGTTGAGTTTATTGCTGCCGTCATTGCTTATTATGTTCATCCCGTCAACATTCAAACGGCCTGTCTCATCATGGAAGGCGCTGAATTTACGGAAAACATTATTAATGGCGTCGAGCGTCCGGTTAAAGCCGCTGAATTGTTCGCGTTTACCTTGCGTGTACGCGCAGGAAACACTGACGTTCTTACTGACGCAGAAGAAAACGTGCGTCAAAAATTACGTGCGGAAGGAGTGATGTAATGTCTAAAGGTAAAAAACGTTCTGGCGCTCGCCCTGGTCGTCCGCAGCCGTTGCGA
GGTACTAAAGGCAAGCGTAAAGGCGCTCGTCTTTGGTATGTAGGTGGTCAACAATTTTAATTGCAGGGGCTTCGGCCCCTTACTTGAGGATAAATTATGTCTAATATTCAAACTGGCGCCGAGCGTATGCCGCATGACCTTTCCCATCTTGGCTTCCTTGCTGGTCAGATTGGTCGTCTTATTACCATTTCAACTACTCCGGTTATCGCTGGCGACTCCTTCGAGATGGACGCCGTTGGCGCTCTCCGTCTTTCTCCATTGCGTCGTGGCCTTGCTATTGACTCTACTGTAGACATTTTTACTTTTTATGTCCCTCATCGTCACGTTTATGGTGAACAGTGGATTAAGTTCATGAAGGATGGTGTTAATGCCACTCCTCTCCCGACTGTTAACACTACTGGTTATATTGACCATGCCGCTTTTCTTGGCACGATTAACCCTGATACCAATAAAATCCCTAAGCATTTGTTTCAGGGTTATTTGAATATCTATAACAACTATTTTAAAGCGCCGTGGATGCCTGACCGTACCGAGGCTAACCCTAATGAGCTTAATCAAGATGATGCTCGTTATGGTTTCCGTTGCTGCCATCTCAAAAACATTTGGACTGCTCCGCTTCCTCCTGAGACTGAGCTTTCTCGCCAAATGACGACTTCTACCACATCTATTGACATTATGGGTCTGCAAGCTGCTTATGCTAATTTGCATACTGACCAAGAACGTGATTACTTCATGCAGCGTTACCATGATGTTATTTCTTCATTTGGAGGTAAAACCTCTTATGACGCTGACAACCGTCCTTTACTTGTCATGCGCTCTAATCTCTGGGCATCTGGCTATGATGTTGATGGAACTGACCAAACGTCGTTAGGCCAGTTTTCTGGTCGTGTTCAACAGACCTATAAACATTCTGTGCCGCGTTTCTTTGTTCCTGAGCATGGCACTATGTTTACTCTTGCGCTTGTTCGTTTTCCGCCTACTGCGACTAAAGAGATTCAGTACCTTAACGCTAAAGGTGCTTTGACTTATACCGATATTGCTGGCGACCCTGTTTTGTATGGCAACTTGCCGCCGCGTGAAATTTCTATGAAGGATGTTTTCCGTTCTGGTGATTCGTCTAAGAAGTTTAAGATTGCTGAGGGTCAGTGGTATCGTTATGCGCCTTCGTATGTTTCTCCTGCTTATCACCTTCTTGAAGGCTTCCCATTCATTCAGGAACCGCCTTCTGGTGATTTGCAAGAACGCGTACTTATTCGCCACCATGATTATGACCAGTGTTTCCAGTCCGTTCAGTTGTTGCAGTGGAATAGTCAGGTTAAATTTAATGTGACCGTTTATCGCAATCTGCCGACCACTCGCGATTCAATCATGACTTCGTGATAAAAGATTGA\"\n\nfilter(x -> iscoding(sequence(x), η=1e-10) && length(x) > 100, findorfs(phi))\n\n34-element Vector{ORF{4, NaiveFinder}}:\n ORF{NaiveFinder}(636:1622, '+', 3)\n ORF{NaiveFinder}(687:1622, '+', 3)\n ORF{NaiveFinder}(774:1622, '+', 3)\n ORF{NaiveFinder}(781:1389, '+', 1)\n ORF{NaiveFinder}(814:1389, '+', 1)\n ORF{NaiveFinder}(829:1389, '+', 1)\n ORF{NaiveFinder}(861:1622, '+', 3)\n ORF{NaiveFinder}(1021:1389, '+', 1)\n ORF{NaiveFinder}(1386:1622, '+', 3)\n ORF{NaiveFinder}(1447:1635, '+', 1)\n ORF{NaiveFinder}(1489:1635, '+', 1)\n ORF{NaiveFinder}(1501:1635, '+', 1)\n ORF{NaiveFinder}(1531:1635, 
'+', 1)\n ORF{NaiveFinder}(2697:3227, '+', 3)\n ORF{NaiveFinder}(2745:3227, '+', 3)\n ⋮\n ORF{NaiveFinder}(2874:3227, '+', 3)\n ORF{NaiveFinder}(2973:3227, '+', 3)\n ORF{NaiveFinder}(3108:3227, '+', 3)\n ORF{NaiveFinder}(3142:3312, '+', 1)\n ORF{NaiveFinder}(3481:3939, '+', 1)\n ORF{NaiveFinder}(3659:3934, '+', 2)\n ORF{NaiveFinder}(3734:3934, '+', 2)\n ORF{NaiveFinder}(3772:3939, '+', 1)\n ORF{NaiveFinder}(3806:3934, '+', 2)\n ORF{NaiveFinder}(4129:4287, '+', 1)\n ORF{NaiveFinder}(4160:4291, '-', 2)\n ORF{NaiveFinder}(4540:4644, '+', 1)\n ORF{NaiveFinder}(4690:4866, '+', 1)\n ORF{NaiveFinder}(4741:4866, '+', 1)\n ORF{NaiveFinder}(4744:4866, '+', 1)","category":"page"},{"location":"naivefinder/#The-ORF-type","page":"Finding ORFs","title":"The ORF type","text":"","category":"section"},{"location":"naivefinder/","page":"Finding ORFs","title":"Finding ORFs","text":"For convenience, the ORF type is more stringent in preventing the creation of incompatible instances. As a result, attempting to create an instance with incompatible parameters will result in an error. For instance, the following code snippet will trigger an error:","category":"page"},{"location":"naivefinder/","page":"Finding ORFs","title":"Finding ORFs","text":"ORF{4,NaiveFinder}(1:10, '+', 4) # Or any F <: GeneFinderMethod\n\nERROR: AssertionError: Invalid frame value. Frame must be 1, 2, or 3.\nStacktrace:\n [1] ORF\n @ ~/.julia/dev/GeneFinder/src/types.jl:52 [inlined]\n [2] ORF{4, NaiveCollector}(range::UnitRange{Int64}, strand::Char, frame::Int64)\n @ GeneFinder ~/.julia/dev/GeneFinder/src/types.jl:79\n [3] top-level scope\n @ REPL[20]:1","category":"page"},{"location":"naivefinder/","page":"Finding ORFs","title":"Finding ORFs","text":"Similar behavior will be encountered when the strand is neither + nor -. This precautionary measure helps prevent the creation of invalid ORFs, ensuring greater stability and enabling the extension of its interface. 
For example, after creating a specific ORF, users can seamlessly iterate over a sequence of interest and verify whether the ORF is contained within the sequence.","category":"page"},{"location":"naivefinder/","page":"Finding ORFs","title":"Finding ORFs","text":"orf = ORF{4,NaiveFinder}(137:145, '+', 2)\nseq[orf]\n\n9nt DNA Sequence:\nATGCGCTGA","category":"page"},{"location":"naivefinder/","page":"Finding ORFs","title":"Finding ORFs","text":"warning: Warning\nIt is still possible to create an ORF and pass it to a sequence that does not necessarily contain an actual open reading frame. This will be addressed in future versions of the package. The benefit of allowing it is that it retrieves the corresponding subsequence in a convenient way (5' to 3') regardless of the strand.","category":"page"},{"location":"naivefinder/#Finding-complete-and-overlapped-ORFs","page":"Finding ORFs","title":"Finding complete and overlapped ORFs","text":"","category":"section"},{"location":"naivefinder/","page":"Finding ORFs","title":"Finding ORFs","text":"The first implemented function is findorfs, a very non-restrictive ORF finder that collects all ORFs in a dedicated structure. Note that this will catch random ORFs, not necessarily genes, since it has no ORF size or overlap constraints. 
Thus it might consider aa\"M*\" a posible encoding protein from the resulting ORFs.","category":"page"},{"location":"naivefinder/","page":"Finding ORFs","title":"Finding ORFs","text":"using BioSequences, GeneFinder\n\n# > 180195.SAMN03785337.LFLS01000089 -> finds only 1 gene in Prodigal (from Pyrodigal tests)\nseq = dna\"AACCAGGGCAATATCAGTACCGCGGGCAATGCAACCCTGACTGCCGGCGGTAACCTGAACAGCACTGGCAATCTGACTGTGGGCGGTGTTACCAACGGCACTGCTACTACTGGCAACATCGCACTGACCGGTAACAATGCGCTGAGCGGTCCGGTCAATCTGAATGCGTCGAATGGCACGGTGACCTTGAACACGACCGGCAATACCACGCTCGGTAACGTGACGGCACAAGGCAATGTGACGACCAATGTGTCCAACGGCAGTCTGACGGTTACCGGCAATACGACAGGTGCCAACACCAACCTCAGTGCCAGCGGCAACCTGACCGTGGGTAACCAGGGCAATATCAGTACCGCAGGCAATGCAACCCTGACGGCCGGCGACAACCTGACGAGCACTGGCAATCTGACTGTGGGCGGCGTCACCAACGGCACGGCCACCACCGGCAACATCGCGCTGACCGGTAACAATGCACTGGCTGGTCCTGTCAATCTGAACGCGCCGAACGGCACCGTGACCCTGAACACAACCGGCAATACCACGCTGGGTAATGTCACCGCACAAGGCAATGTGACGACTAATGTGTCCAACGGCAGCCTGACAGTCGCTGGCAATACCACAGGTGCCAACACCAACCTGAGTGCCAGCGGCAATCTGACCGTGGGCAACCAGGGCAATATCAGTACCGCGGGCAATGCAACCCTGACTGCCGGCGGTAACCTGAGC\"","category":"page"},{"location":"naivefinder/","page":"Finding ORFs","title":"Finding ORFs","text":"Now lest us find the ORFs","category":"page"},{"location":"naivefinder/","page":"Finding ORFs","title":"Finding ORFs","text":"orfs = findorfs(seq, finder=NaiveFinder)\n\n12-element Vector{ORF}:\n ORF{NaiveFinder}(29:40, '+', 2)\n ORF{NaiveFinder}(137:145, '+', 2)\n ORF{NaiveFinder}(164:184, '+', 2)\n ORF{NaiveFinder}(173:184, '+', 2)\n ORF{NaiveFinder}(236:241, '+', 2)\n ORF{NaiveFinder}(248:268, '+', 2)\n ORF{NaiveFinder}(362:373, '+', 2)\n ORF{NaiveFinder}(470:496, '+', 2)\n ORF{NaiveFinder}(551:574, '+', 2)\n ORF{NaiveFinder}(569:574, '+', 2)\n ORF{NaiveFinder}(581:601, '+', 2)\n ORF{NaiveFinder}(695:706, '+', 2)","category":"page"},{"location":"naivefinder/","page":"Finding ORFs","title":"Finding ORFs","text":"Two other methods where implemented into sequence to get the ORFs in DNA or aminoacid sequences, respectively. 
They use the findorfs function to first get the ORFs and then get the corresponding array of BioSequence objects.","category":"page"},{"location":"naivefinder/","page":"Finding ORFs","title":"Finding ORFs","text":"sequence.(orfs)\n\n12-element Vector{NucSeq{4, DNAAlphabet{4}}}\n ATGCAACCCTGA\n ATGCGCTGA\n ATGCGTCGAATGGCACGGTGA\n ATGGCACGGTGA\n ATGTGA\n ATGTGTCCAACGGCAGTCTGA\n ATGCAACCCTGA\n ATGCACTGGCTGGTCCTGTCAATCTGA\n ATGTCACCGCACAAGGCAATGTGA\n ATGTGA\n ATGTGTCCAACGGCAGCCTGA\n ATGCAACCCTGA","category":"page"},{"location":"naivefinder/","page":"Finding ORFs","title":"Finding ORFs","text":"translate.(orfs)\n\n12-element Vector{LongSubSeq{AminoAcidAlphabet}}:\n MQP*\n MR*\n MRRMAR*\n MAR*\n M*\n MCPTAV*\n MQP*\n MHWLVLSI*\n MSPHKAM*\n M*\n MCPTAA*\n MQP*","category":"page"},{"location":"features/#The-ORF-features","page":"Scoring ORFs","title":"The ORF features","text":"","category":"section"},{"location":"features/","page":"Scoring ORFs","title":"Scoring ORFs","text":"The ORF type is designed to be flexible and can store various types of information about the ORF. This versatility allows it to hold data such as the score of the ORF based on a scoring function, the sequence of the ORF, or even the translated amino acid sequence. For example, in the NaiveFinder method, the score subfield is used to store the score of the ORF obtained from the scoring function. 
This capability is possible because the ORF type not only captures structural details of the ORF, such as the range, strand, and frame, but also provides a convenient field called Features for additional information.","category":"page"},{"location":"features/","page":"Scoring ORFs","title":"Scoring ORFs","text":"phi = dna\"GTGTGAGGTTATAACGCCGAAGCGGTAAAAATTTTAATTTTTGCCGCTGAGGGGTTGACCAAGCGAAGCGCGGTAGGTTTTCTGCTTAGGAGTTTAATCATGTTTCAGACTTTTATTTCTCGCCATAATTCAAACTTTTTTTCTGATAAGCTGGTTCTCACTTCTGTTACTCCAGCTTCTTCGGCACCTGTTTTACAGACACCTAAAGCTACATCGTCAACGTTATATTTTGATAGTTTGACGGTTAATGCTGGTAATGGTGGTTTTCTTCATTGCATTCAGATGGATACATCTGTCAACGCCGCTAATCAGGTTGTTTCTGTTGGTGCTGATATTGCTTTTGATGCCGACCCTAAATTTTTTGCCTGTTTGGTTCGCTTTGAGTCTTCTTCGGTTCCGACTACCCTCCCGACTGCCTATGATGTTTATCCTTTGAATGGTCGCCATGATGGTGGTTATTATACCGTCAAGGACTGTGTGACTATTGACGTCCTTCCCCGTACGCCGGGCAATAACGTTTATGTTGGTTTCATGGTTTGGTCTAACTTTACCGCTACTAAATGCCGCGGATTGGTTTCGCTGAATCAGGTTATTAAAGAGATTATTTGTCTCCAGCCACTTAAGTGAGGTGATTTATGTTTGGTGCTATTGCTGGCGGTATTGCTTCTGCTCTTGCTGGTGGCGCCATGTCTAAATTGTTTGGAGGCGGTCAAAAAGCCGCCTCCGGTGGCATTCAAGGTGATGTGCTTGCTACCGATAACAATACTGTAGGCATGGGTGATGCTGGTATTAAATCTGCCATTCAAGGCTCTAATGTTCCTAACCCTGATGAGGCCGCCCCTAGTTTTGTTTCTGGTGCTATGGCTAAAGCTGGTAAAGGACTTCTTGAAGGTACGTTGCAGGCTGGCACTTCTGCCGTTTCTGATAAGTTGCTTGATTTGGTTGGACTTGGTGGCAAGTCTGCCGCTGATAAAGGAAAGGATACTCGTGATTATCTTGCTGCTGCATTTCCTGAGCTTAATGCTTGGGAGCGTGCTGGTGCTGATGCTTCCTCTGCTGGTATGGTTGACGCCGGATTTGAGAATCAAAAAGAGCTTACTAAAATGCAACTGGACAATCAGAAAGAGATTGCCGAGATGCAAAATGAGACTCAAAAAGAGATTGCTGGCATTCAGTCGGCGACTTCACGCCAGAATACGAAAGACCAGGTATATGCACAAAATGAGATGCTTGCTTATCAACAGAAGGAGTCTACTGCTCGCGTTGCGTCTATTATGGAAAACACCAATCTTTCCAAGCAACAGCAGGTTTCCGAGATTATGCGCCAAATGCTTACTCAAGCTCAAACGGCTGGTCAGTATTTTACCAATGACCAAATCAAAGAAATGACTCGCAAGGTTAGTGCTGAGGTTGACTTAGTTCATCAGCAAACGCAGAATCAGCGGTATGGCTCTTCTCATATTGGCGCTACTGCAAAGGATATTTCTAATGTCGTCACTGATGCTGCTTCTGGTGTGGTTGATATTTTTCATGGTATTGATAAAGCTGTTGCCGATACTTGGAACAATTTCTGGAAAGACGGTAAAGCTGATGGTATTGGCTCTAATTTGTCTAGGAAATAACCGTCAGGATTGACACCCTCCCAATTGTATGTTTTCATGCCTCCAAATCTTGG
AGGCTTTTTTATGGTTCGTTCTTATTACCCTTCTGAATGTCACGCTGATTATTTTGACTTTGAGCGTATCGAGGCTCTTAAACCTGCTATTGAGGCTTGTGGCATTTCTACTCTTTCTCAATCCCCAATGCTTGGCTTCCATAAGCAGATGGATAACCGCATCAAGCTCTTGGAAGAGATTCTGTCTTTTCGTATGCAGGGCGTTGAGTTCGATAATGGTGATATGTATGTTGACGGCCATAAGGCTGCTTCTGACGTTCGTGATGAGTTTGTATCTGTTACTGAGAAGTTAATGGATGAATTGGCACAATGCTACAATGTGCTCCCCCAACTTGATATTAATAACACTATAGACCACCGCCCCGAAGGGGACGAAAAATGGTTTTTAGAGAACGAGAAGACGGTTACGCAGTTTTGCCGCAAGCTGGCTGCTGAACGCCCTCTTAAGGATATTCGCGATGAGTATAATTACCCCAAAAAGAAAGGTATTAAGGATGAGTGTTCAAGATTGCTGGAGGCCTCCACTATGAAATCGCGTAGAGGCTTTGCTATTCAGCGTTTGATGAATGCAATGCGACAGGCTCATGCTGATGGTTGGTTTATCGTTTTTGACACTCTCACGTTGGCTGACGACCGATTAGAGGCGTTTTATGATAATCCCAATGCTTTGCGTGACTATTTTCGTGATATTGGTCGTATGGTTCTTGCTGCCGAGGGTCGCAAGGCTAATGATTCACACGCCGACTGCTATCAGTATTTTTGTGTGCCTGAGTATGGTACAGCTAATGGCCGTCTTCATTTCCATGCGGTGCACTTTATGCGGACACTTCCTACAGGTAGCGTTGACCCTAATTTTGGTCGTCGGGTACGCAATCGCCGCCAGTTAAATAGCTTGCAAAATACGTGGCCTTATGGTTACAGTATGCCCATCGCAGTTCGCTACACGCAGGACGCTTTTTCACGTTCTGGTTGGTTGTGGCCTGTTGATGCTAAAGGTGAGCCGCTTAAAGCTACCAGTTATATGGCTGTTGGTTTCTATGTGGCTAAATACGTTAACAAAAAGTCAGATATGGACCTTGCTGCTAAAGGTCTAGGAGCTAAAGAATGGAACAACTCACTAAAAACCAAGCTGTCGCTACTTCCCAAGAAGCTGTTCAGAATCAGAATGAGCCGCAACTTCGGGATGAAAATGCTCACAATGACAAATCTGTCCACGGAGTGCTTAATCCAACTTACCAAGCTGGGTTACGACGCGACGCCGTTCAACCAGATATTGAAGCAGAACGCAAAAAGAGAGATGAGATTGAGGCTGGGAAAAGTTACTGTAGCCGACGTTTTGGCGGCGCAACCTGTGACGACAAATCTGCTCAAATTTATGCGCGCTTCGATAAAAATGATTGGCGTATCCAACCTGCAGAGTTTTATCGCTTCCATGACGCAGAAGTTAACACTTTCGGATATTTCTGATGAGTCGAAAAATTATCTTGATAAAGCAGGAATTACTACTGCTTGTTTACGAATTAAATCGAAGTGGACTGCTGGCGGAAAATGAGAAAATTCGACCTATCCTTGCGCAGCTCGAGAAGCTCTTACTTTGCGACCTTTCGCCATCAACTAACGATTCTGTCAAAAACTGACGCGTTGGATGAGGAGAAGTGGCTTAATATGCTTGGCACGTTCGTCAAGGACTGGTTTAGATATGAGTCACATTTTGTTCATGGTAGAGATTCTCTTGTTGACATTTTAAAAGAGCGTGGATTACTATCTGAGTCCGATGCTGTTCAACCACTAATAGGTAAGAAATCATGAGTCAAGTTACTGAACAATCCGTACGTTTCCAGACCGCTTTGGCCTCTATTAAGCTCATTCAGGCTTCTGCCGTTTTGGATTTAACCGAAGATGATTTCGATTTTCTGACGAGTAACAAAGTTTGGATTGCTACTGACCGCTCTCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTACGCTGGACTTT
GTGGGATACCCTCGCTTTCCTGCTCCTGTTGAGTTTATTGCTGCCGTCATTGCTTATTATGTTCATCCCGTCAACATTCAAACGGCCTGTCTCATCATGGAAGGCGCTGAATTTACGGAAAACATTATTAATGGCGTCGAGCGTCCGGTTAAAGCCGCTGAATTGTTCGCGTTTACCTTGCGTGTACGCGCAGGAAACACTGACGTTCTTACTGACGCAGAAGAAAACGTGCGTCAAAAATTACGTGCGGAAGGAGTGATGTAATGTCTAAAGGTAAAAAACGTTCTGGCGCTCGCCCTGGTCGTCCGCAGCCGTTGCGAGGTACTAAAGGCAAGCGTAAAGGCGCTCGTCTTTGGTATGTAGGTGGTCAACAATTTTAATTGCAGGGGCTTCGGCCCCTTACTTGAGGATAAATTATGTCTAATATTCAAACTGGCGCCGAGCGTATGCCGCATGACCTTTCCCATCTTGGCTTCCTTGCTGGTCAGATTGGTCGTCTTATTACCATTTCAACTACTCCGGTTATCGCTGGCGACTCCTTCGAGATGGACGCCGTTGGCGCTCTCCGTCTTTCTCCATTGCGTCGTGGCCTTGCTATTGACTCTACTGTAGACATTTTTACTTTTTATGTCCCTCATCGTCACGTTTATGGTGAACAGTGGATTAAGTTCATGAAGGATGGTGTTAATGCCACTCCTCTCCCGACTGTTAACACTACTGGTTATATTGACCATGCCGCTTTTCTTGGCACGATTAACCCTGATACCAATAAAATCCCTAAGCATTTGTTTCAGGGTTATTTGAATATCTATAACAACTATTTTAAAGCGCCGTGGATGCCTGACCGTACCGAGGCTAACCCTAATGAGCTTAATCAAGATGATGCTCGTTATGGTTTCCGTTGCTGCCATCTCAAAAACATTTGGACTGCTCCGCTTCCTCCTGAGACTGAGCTTTCTCGCCAAATGACGACTTCTACCACATCTATTGACATTATGGGTCTGCAAGCTGCTTATGCTAATTTGCATACTGACCAAGAACGTGATTACTTCATGCAGCGTTACCATGATGTTATTTCTTCATTTGGAGGTAAAACCTCTTATGACGCTGACAACCGTCCTTTACTTGTCATGCGCTCTAATCTCTGGGCATCTGGCTATGATGTTGATGGAACTGACCAAACGTCGTTAGGCCAGTTTTCTGGTCGTGTTCAACAGACCTATAAACATTCTGTGCCGCGTTTCTTTGTTCCTGAGCATGGCACTATGTTTACTCTTGCGCTTGTTCGTTTTCCGCCTACTGCGACTAAAGAGATTCAGTACCTTAACGCTAAAGGTGCTTTGACTTATACCGATATTGCTGGCGACCCTGTTTTGTATGGCAACTTGCCGCCGCGTGAAATTTCTATGAAGGATGTTTTCCGTTCTGGTGATTCGTCTAAGAAGTTTAAGATTGCTGAGGGTCAGTGGTATCGTTATGCGCCTTCGTATGTTTCTCCTGCTTATCACCTTCTTGAAGGCTTCCCATTCATTCAGGAACCGCCTTCTGGTGATTTGCAAGAACGCGTACTTATTCGCCACCATGATTATGACCAGTGTTTCCAGTCCGTTCAGTTGTTGCAGTGGAATAGTCAGGTTAAATTTAATGTGACCGTTTATCGCAATCTGCCGACCACTCGCGATTCAATCATGACTTCGTGATAAAAGATTGA\"\n\nphiorfs = findorfs(phi, finder=NaiveFinder, minlen=75, scheme=lors)\n\n124-element Vector{ORF{4, NaiveFinder}}:\n ORF{NaiveFinder}(9:101, '-', 3)\n ORF{NaiveFinder}(100:627, '+', 1)\n ORF{NaiveFinder}(223:447, '-', 1)\n ORF{NaiveFinder}(248:436, '+', 2)\n ORF{NaiveFinder}(257:436, 
'+', 2)\n ORF{NaiveFinder}(283:627, '+', 1)\n ORF{NaiveFinder}(344:436, '+', 2)\n ORF{NaiveFinder}(532:627, '+', 1)\n ORF{NaiveFinder}(636:1622, '+', 3)\n ORF{NaiveFinder}(687:1622, '+', 3)\n ORF{NaiveFinder}(774:1622, '+', 3)\n ORF{NaiveFinder}(781:1389, '+', 1)\n ORF{NaiveFinder}(814:1389, '+', 1)\n ORF{NaiveFinder}(829:1389, '+', 1)\n ORF{NaiveFinder}(861:1622, '+', 3)\n ⋮\n ORF{NaiveFinder}(4671:5375, '+', 3)\n ORF{NaiveFinder}(4690:4866, '+', 1)\n ORF{NaiveFinder}(4728:5375, '+', 3)\n ORF{NaiveFinder}(4741:4866, '+', 1)\n ORF{NaiveFinder}(4744:4866, '+', 1)\n ORF{NaiveFinder}(4777:4866, '+', 1)\n ORF{NaiveFinder}(4806:5375, '+', 3)\n ORF{NaiveFinder}(4863:5258, '-', 3)\n ORF{NaiveFinder}(4933:5019, '+', 1)\n ORF{NaiveFinder}(4941:5375, '+', 3)\n ORF{NaiveFinder}(5082:5375, '+', 3)\n ORF{NaiveFinder}(5089:5325, '+', 1)\n ORF{NaiveFinder}(5122:5202, '-', 1)\n ORF{NaiveFinder}(5152:5325, '+', 1)\n ORF{NaiveFinder}(5164:5325, '+', 1)","category":"page"},{"location":"features/","page":"Scoring ORFs","title":"Scoring ORFs","text":"In the example above we calculated a score using the lors scoring scheme (see lors from the BioMarkovChains.jl package). The score is stored in the score subfield of the ORF.","category":"page"},{"location":"features/","page":"Scoring ORFs","title":"Scoring ORFs","text":"All features can be accessed using a convenience function called features, which returns a NamedTuple with the features of the ORF and can be broadcast over the entire collection of ORFs using the . 
syntax.","category":"page"},{"location":"features/","page":"Scoring ORFs","title":"Scoring ORFs","text":"features.(phiorfs)\n\n124-element Vector{@NamedTuple{score::Float64}}:\n (score = -3.002461366087374,)\n (score = -10.814621287968222,)\n (score = -5.344187934894264,)\n (score = -1.316724559874126,)\n (score = -1.796631200562138,)\n (score = -3.2651518608269856,)\n (score = -1.4019264441082822,)\n (score = -2.3192349590107475,)\n (score = 5.055524446434241,)\n (score = 2.7116397224896436,)\n (score = 2.2564640592402165,)\n (score = 1.777499581940097,)\n (score = 2.3474811908011186,)\n (score = 2.38568188352799,)\n (score = 2.498608044469827,)\n ⋮\n (score = -5.474837954151803,)\n (score = 0.6909362932156138,)\n (score = -5.900045211699447,)\n (score = 1.2010656615619415,)\n (score = 0.8541931309205604,)\n (score = 2.7897961643147777,)\n (score = -4.42890346770467,)\n (score = -5.40624241726446,)\n (score = -0.8080572222081075,)\n (score = -5.571494087742448,)\n (score = -4.882156920421228,)\n (score = -5.639670353834974,)\n (score = -0.8764121443326865,)\n (score = -4.308687693802273,)\n (score = -4.459423419810693,)","category":"page"},{"location":"features/#Analysing-Lamda-ORFs","page":"Scoring ORFs","title":"Analysing Lambda ORFs","text":"","category":"section"},{"location":"features/","page":"Scoring ORFs","title":"Scoring ORFs","text":"In this case, lors calculates the log-odds ratio of the ORF sequence given two Markov models (by default: ECOLICDS and ECOLINOCDS), one for the coding region and one for the non-coding region. The score is stored in the score field of the NamedTuple returned by the features function. 
By default the lors function returns the base-2 logarithm of the odds ratio, so it is analogous to the bits of information that the ORF sequence is coding.","category":"page"},{"location":"features/","page":"Scoring ORFs","title":"Scoring ORFs","text":"Now we can even analyse how the ORFs' scores are distributed as a function of their lengths, compared to random sequences.","category":"page"},{"location":"features/","page":"Scoring ORFs","title":"Scoring ORFs","text":"\nlambda = fasta_to_dna(\"test/data/NC_001416.1.fasta\")[1]\n\nlambdaorfs = findorfs(lambda, finder=NaiveFinder, minlen=100, scheme=lors)\n\nlambdascores = score.(lambdaorfs)\nlambdalengths = length.(lambdaorfs)\n\n## get some random sequences of variable lengths\nvseqs = LongDNA[]\nfor i in 1:708\n push!(vseqs, randdnaseq(rand(100:1000)))\nend\n\n## get the lengths and scores of the randomly generated sequences\nrandlengths = length.(vseqs)\nrandscores = lors.(vseqs)\n\n## plot the scores as a function of the lengths\nusing CairoMakie\n\nf = Figure()\nax = Axis(f[1, 1], xlabel=\"Length\", ylabel=\"Log-odds ratio (Bits)\")\n\nscatter!(ax,\n randlengths,\n randscores,\n marker = :circle, \n markersize = 6, \n color = :black, \n label = \"Random sequences\"\n)\nscatter!(ax,\n lambdalengths, \n lambdascores, \n marker = :rect, \n markersize = 6, \n color = :blue, \n label = \"Lambda ORFs\"\n)\n\naxislegend(ax)\n\nf","category":"page"},{"location":"features/","page":"Scoring ORFs","title":"Scoring ORFs","text":"(Image: )","category":"page"},{"location":"roadmap/#Roadmap","page":"-","title":"Roadmap","text":"","category":"section"},{"location":"roadmap/#Coding-genes-(CDS-ORFs)","page":"-","title":"Coding genes (CDS - ORFs)","text":"","category":"section"},{"location":"roadmap/","page":"-","title":"-","text":"☒ Finding ORFs\n☐ EasyGene\n☐ GLIMMER\n☐ Prodigal - Pyrodigal\n☐ PHANOTATE\n☐ k-mer based gene finders (?)\n☐ Augustus 
(?)","category":"page"},{"location":"roadmap/#Non-coding-genes-(RNA)","page":"-","title":"Non-coding genes (RNA)","text":"","category":"section"},{"location":"roadmap/","page":"-","title":"-","text":"☐ Infernal\n☐ tRNAscan","category":"page"},{"location":"roadmap/#Other-features","page":"-","title":"Other features","text":"","category":"section"},{"location":"roadmap/","page":"-","title":"-","text":"☐ parallelism SIMD ?\n☐ memory management (?)\n☐ incorporate Ribosome Binding Sites (RBS)\n☐ incorporate Programmed Reading Frame Shifting (PRFS)\n☐ specialized types\n☒ Gene\n☒ ORF\n☒ Codon\n☒ CDS\n☐ EukaryoticGene (?)\n☐ ProkaryoticGene (?)\n☐ Intron\n☐ Exon\n☐ GFF –\\> See other packages\n☐ FASTX –\\> See I/O in other packages","category":"page"},{"location":"roadmap/#Compatibilities","page":"-","title":"Compatibilities","text":"","category":"section"},{"location":"roadmap/","page":"-","title":"-","text":"Must interact with or extend:","category":"page"},{"location":"roadmap/","page":"-","title":"-","text":"GenomicAnnotations.jl\nBioSequences.jl\nSequenceVariation.jl\nGenomicFeatures.jl\nFASTX.jl\nKmers.jl\nGraphs.jl","category":"page"},{"location":"","page":"Home","title":"Home","text":"\n

\n
\n A Gene Finder framework for Julia.\n

","category":"page"},{"location":"","page":"Home","title":"Home","text":"\n
\n\n\n \"Documentation\"\n\n\n \"Release\"\n\n\n \"DOI\"\n\n\n
\n \"GitHub\n
\n\n \"License\"\n\n\n \"Repo\n\n\n \"Downloads\"\n\n\n \"Aqua\n\n\n
\n","category":"page"},{"location":"","page":"Home","title":"Home","text":"","category":"page"},{"location":"#Overview","page":"Home","title":"Overview","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"This is a species-agnostic and algorithm-extensible gene finder library for the Julia language.","category":"page"},{"location":"#Installation","page":"Home","title":"Installation","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"You can install GeneFinder from the Julia REPL. Press ] to enter pkg mode, and enter the following:","category":"page"},{"location":"","page":"Home","title":"Home","text":"add GeneFinder\n","category":"page"},{"location":"#Citing","page":"Home","title":"Citing","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"@misc{GeneFinder.jl,\n\tauthor = {Camilo García},\n\ttitle = {GeneFinder.jl},\n\turl = {https://github.com/camilogarciabotero/GeneFinder.jl},\n\tversion = {v0.3.0},\n\tyear = {2024},\n\tmonth = {04}\n}","category":"page"}] } diff --git a/dev/simplecodingrule/index.html b/dev/simplecodingrule/index.html index 188d516..500c62e 100644 --- a/dev/simplecodingrule/index.html +++ b/dev/simplecodingrule/index.html @@ -1,5 +1,5 @@ -A Simple Coding Rule · GeneFinder.jl

Scoring a sequence using a Markov model

A sequence of DNA could be scored using a Markov model of the transition probabilities of a known sequence. This could be done using a log-odds ratio score, which is the logarithm of the ratio of the transition probabilities of the sequence given a model and. The log-odds ratio score is defined as:

\[\begin{align} +A Simple Coding Rule · GeneFinder.jl

Scoring a sequence using a Markov model

A sequence of DNA can be scored using a Markov model of the transition probabilities of a known sequence. This can be done using a log-odds ratio score, which is the logarithm of the ratio of the transition probabilities of the sequence under two competing models. The log-odds ratio score is defined as:

\[\begin{align} S(x) = \sum_{i=1}^{L} \beta_{x_{i-1}x_{i}} = \sum_{i=1}^{L} \log \frac{a^{\mathscr{m}_{1}}_{x_{i-1} x_i}}{a^{\mathscr{m}_{2}}_{x_{i-1} x_i}} \end{align}\]

Here $a^{\mathscr{m}_{1}}_{x_{i-1} x_i}$ is the transition probability from the state $x_{i-1}$ to the state $x_i$ under the first model (in this case, the one calculated for the given sequence), and $a^{\mathscr{m}_{2}}_{x_{i-1} x_i}$ is the transition probability between the same states under the second model. The score is the sum of the log-odds ratios of the transition probabilities of the sequence given the two models.
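
To make the sum concrete, here is a minimal sketch in plain Julia that scores a string against two toy first-order transition matrices. The names `NUCLEOTIDES`, `nucindex` and `logodds`, as well as both matrices, are hypothetical and only for illustration; they are not the GeneFinder.jl or BioMarkovChains.jl API. The sketch starts at the first observed transition (it ignores any begin-state term) and uses the base-2 logarithm.

```julia
# Hypothetical helper names; this is NOT the GeneFinder.jl / BioMarkovChains.jl API.
const NUCLEOTIDES = ['A', 'C', 'G', 'T']

# Map a nucleotide to its row/column index in a 4x4 transition matrix.
nucindex(c::Char) = findfirst(==(c), NUCLEOTIDES)

# Sum of log2 ratios of first-order transition probabilities under two models.
# Rows are the "from" state, columns are the "to" state.
function logodds(seq::AbstractString, m1::Matrix{Float64}, m2::Matrix{Float64})
    score = 0.0
    for i in 2:length(seq)
        from = nucindex(seq[i-1])
        to = nucindex(seq[i])
        score += log2(m1[from, to] / m2[from, to])
    end
    return score
end

# Toy models (assumed values): m1 favours repeating the same nucleotide, m2 is uniform.
m1 = fill(0.1, 4, 4)
for k in 1:4
    m1[k, k] = 0.7
end
m2 = fill(0.25, 4, 4)

logodds("AAAA", m1, m2)  # positive: the sequence is better explained by m1
logodds("ACGT", m1, m2)  # negative: the sequence is better explained by m2
```

A positive score means the transitions observed in the sequence are more probable under the first model than under the second; a negative score means the opposite.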

In the current implementation the second model is a CDS transition probability model of E. coli. This classification score is implemented in the naivescorefinder method. This method will return ORFs with the associated score of the sequence given the CDS model of E. coli.

using GeneFinder, BioSequences
 
@@ -41,29 +41,7 @@
  TTGGTCATATTGGGAACAAAAGTTGTAACAATGATGAAA…ACACCAAAAATAAAATATGCGAGTATAAATTTTAGATAA
  TTGGGAACAAAAGTTGTAACAATGATGAAACCAAATACACCAAAAATAAAATATGCGAGTATAAATTTTAGATAA
  ATGCCAACTGGAAGATCAGCACCATCTGGGGCATCGGCT…TACGTAATTAAGTCATACCAAAAGCAGGATTTGTTATGA
-

Now, we can score the sequences using the log-odds ratio score in the same line:

orfsfeat = findorfs(seq, minlen=75, alternative_start=true, scheme=lors) .|> features
-
-20-element Vector{@NamedTuple{score::Float64}}:
- (score = -2.5146325834372343,)
- (score = -4.857592765476053,)
- (score = -1.9986133020444345,)
- (score = -3.4106894574555824,)
- (score = -1.763485388728319,)
- (score = 0.6825864481251348,)
- (score = 0.21287161698917936,)
- (score = -0.28187825646085224,)
- (score = -1.373474082107631,)
- (score = -4.273794970087796,)
- (score = -2.3961559066784597,)
- (score = -2.3663038090046142,)
- (score = -0.8406863072332524,)
- (score = 1.8013554455006733,)
- (score = -2.0768031699080756,)
- (score = -1.734088708668584,)
- (score = -2.9820908143871194,)
- (score = -3.072550585883162,)
- (score = -2.712493281013948,)
- (score = -2.0453354284951786,)

Now the question is which of those sequences can we consider as coding sequences. We can use the iscoding predicate to check whether a sequence is coding or not based on the log-odds ratio decision rule:

iscoding.(orfsdna) # criteria = log_odds_ratio_decision_rule
20-element BitVector:
+

Now the question is which of those sequences we can consider coding sequences. We can use the iscoding predicate to check whether a sequence is coding or not based on the log-odds ratio decision rule:

iscoding.(orfsdna) # criteria = log_odds_ratio_decision_rule
20-element BitVector:
  0
  0
  0
@@ -94,8 +72,7 @@
 3-element Vector{ORF{4, NaiveFinder}}:
  ORF{NaiveFinder}(194:268, '-', 2, -0.026759927376272922)
  ORF{NaiveFinder}(194:283, '-', 2, -0.010354615336667268)
- ORF{NaiveFinder}(650:727, '-', 2, -0.04303976584597201)

Or in a single line using another genome sequence:


-phi = dna"GTGTGAGGTTATAACGCCGAAGCGGTAAAAATTTTAATTTTTGCCGCTGAGGGGTTGACCAAGCGAAGCGCGGTAGGTTTTCTGCTTAGGAGTTTAATCATGTTTCAGACTTTTATTTCTCGCCATAATTCAAACTTTTTTTCTGATAAGCTGGTTCTCACTTCTGTTACTCCAGCTTCTTCGGCACCTGTTTTACAGACACCTAAAGCTACATCGTCAACGTTATATTTTGATAGTTTGACGGTTAATGCTGGTAATGGTGGTTTTCTTCATTGCATTCAGATGGATACATCTGTCAACGCCGCTAATCAGGTTGTTTCTGTTGGTGCTGATATTGCTTTTGATGCCGACCCTAAATTTTTTGCCTGTTTGGTTCGCTTTGAGTCTTCTTCGGTTCCGACTACCCTCCCGACTGCCTATGATGTTTATCCTTTGAATGGTCGCCATGATGGTGGTTATTATACCGTCAAGGACTGTGTGACTATTGACGTCCTTCCCCGTACGCCGGGCAATAACGTTTATGTTGGTTTCATGGTTTGGTCTAACTTTACCGCTACTAAATGCCGCGGATTGGTTTCGCTGAATCAGGTTATTAAAGAGATTATTTGTCTCCAGCCACTTAAGTGAGGTGATTTATGTTTGGTGCTATTGCTGGCGGTATTGCTTCTGCTCTTGCTGGTGGCGCCATGTCTAAATTGTTTGGAGGCGGTCAAAAAGCCGCCTCCGGTGGCATTCAAGGTGATGTGCTTGCTACCGATAACAATACTGTAGGCATGGGTGATGCTGGTATTAAATCTGCCATTCAAGGCTCTAATGTTCCTAACCCTGATGAGGCCGCCCCTAGTTTTGTTTCTGGTGCTATGGCTAAAGCTGGTAAAGGACTTCTTGAAGGTACGTTGCAGGCTGGCACTTCTGCCGTTTCTGATAAGTTGCTTGATTTGGTTGGACTTGGTGGCAAGTCTGCCGCTGATAAAGGAAAGGATACTCGTGATTATCTTGCTGCTGCATTTCCTGAGCTTAATGCTTGGGAGCGTGCTGGTGCTGATGCTTCCTCTGCTGGTATGGTTGACGCCGGATTTGAGAATCAAAAAGAGCTTACTAAAATGCAACTGGACAATCAGAAAGAGATTGCCGAGATGCAAAATGAGACTCAAAAAGAGATTGCTGGCATTCAGTCGGCGACTTCACGCCAGAATACGAAAGACCAGGTATATGCACAAAATGAGATGCTTGCTTATCAACAGAAGGAGTCTACTGCTCGCGTTGCGTCTATTATGGAAAACACCAATCTTTCCAAGCAACAGCAGGTTTCCGAGATTATGCGCCAAATGCTTACTCAAGCTCAAACGGCTGGTCAGTATTTTACCAATGACCAAATCAAAGAAATGACTCGCAAGGTTAGTGCTGAGGTTGACTTAGTTCATCAGCAAACGCAGAATCAGCGGTATGGCTCTTCTCATATTGGCGCTACTGCAAAGGATATTTCTAATGTCGTCACTGATGCTGCTTCTGGTGTGGTTGATATTTTTCATGGTATTGATAAAGCTGTTGCCGATACTTGGAACAATTTCTGGAAAGACGGTAAAGCTGATGGTATTGGCTCTAATTTGTCTAGGAAATAACCGTCAGGATTGACACCCTCCCAATTGTATGTTTTCATGCCTCCAAATCTTGGAGGCTTTTTTATGGTTCGTTCTTATTACCCTTCTGAATGTCACGCTGATTATTTTGACTTTGAGCGTATCGAGGCTCTTAAACCTGCTATTGAGGCTTGTGGCATTTCTACTCTTTCTCAATCCCCAATGCTTGGCTTCCATAAGCAGATGGATAACCGCATCAAGCTCTTGGAAGAGATTCTGTCTTTTCGTATGCAGGGCGTTGAGTTCGATAATGGTGATATGTATGTTGACGGCCATAAGGCTGCTTCTGACGTTCGTGATGAGTTTGTATCTGTTACTGAGAAGTTAATGGATGAATTGGCACAATGCT
ACAATGTGCTCCCCCAACTTGATATTAATAACACTATAGACCACCGCCCCGAAGGGGACGAAAAATGGTTTTTAGAGAACGAGAAGACGGTTACGCAGTTTTGCCGCAAGCTGGCTGCTGAACGCCCTCTTAAGGATATTCGCGATGAGTATAATTACCCCAAAAAGAAAGGTATTAAGGATGAGTGTTCAAGATTGCTGGAGGCCTCCACTATGAAATCGCGTAGAGGCTTTGCTATTCAGCGTTTGATGAATGCAATGCGACAGGCTCATGCTGATGGTTGGTTTATCGTTTTTGACACTCTCACGTTGGCTGACGACCGATTAGAGGCGTTTTATGATAATCCCAATGCTTTGCGTGACTATTTTCGTGATATTGGTCGTATGGTTCTTGCTGCCGAGGGTCGCAAGGCTAATGATTCACACGCCGACTGCTATCAGTATTTTTGTGTGCCTGAGTATGGTACAGCTAATGGCCGTCTTCATTTCCATGCGGTGCACTTTATGCGGACACTTCCTACAGGTAGCGTTGACCCTAATTTTGGTCGTCGGGTACGCAATCGCCGCCAGTTAAATAGCTTGCAAAATACGTGGCCTTATGGTTACAGTATGCCCATCGCAGTTCGCTACACGCAGGACGCTTTTTCACGTTCTGGTTGGTTGTGGCCTGTTGATGCTAAAGGTGAGCCGCTTAAAGCTACCAGTTATATGGCTGTTGGTTTCTATGTGGCTAAATACGTTAACAAAAAGTCAGATATGGACCTTGCTGCTAAAGGTCTAGGAGCTAAAGAATGGAACAACTCACTAAAAACCAAGCTGTCGCTACTTCCCAAGAAGCTGTTCAGAATCAGAATGAGCCGCAACTTCGGGATGAAAATGCTCACAATGACAAATCTGTCCACGGAGTGCTTAATCCAACTTACCAAGCTGGGTTACGACGCGACGCCGTTCAACCAGATATTGAAGCAGAACGCAAAAAGAGAGATGAGATTGAGGCTGGGAAAAGTTACTGTAGCCGACGTTTTGGCGGCGCAACCTGTGACGACAAATCTGCTCAAATTTATGCGCGCTTCGATAAAAATGATTGGCGTATCCAACCTGCAGAGTTTTATCGCTTCCATGACGCAGAAGTTAACACTTTCGGATATTTCTGATGAGTCGAAAAATTATCTTGATAAAGCAGGAATTACTACTGCTTGTTTACGAATTAAATCGAAGTGGACTGCTGGCGGAAAATGAGAAAATTCGACCTATCCTTGCGCAGCTCGAGAAGCTCTTACTTTGCGACCTTTCGCCATCAACTAACGATTCTGTCAAAAACTGACGCGTTGGATGAGGAGAAGTGGCTTAATATGCTTGGCACGTTCGTCAAGGACTGGTTTAGATATGAGTCACATTTTGTTCATGGTAGAGATTCTCTTGTTGACATTTTAAAAGAGCGTGGATTACTATCTGAGTCCGATGCTGTTCAACCACTAATAGGTAAGAAATCATGAGTCAAGTTACTGAACAATCCGTACGTTTCCAGACCGCTTTGGCCTCTATTAAGCTCATTCAGGCTTCTGCCGTTTTGGATTTAACCGAAGATGATTTCGATTTTCTGACGAGTAACAAAGTTTGGATTGCTACTGACCGCTCTCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTACGCTGGACTTTGTGGGATACCCTCGCTTTCCTGCTCCTGTTGAGTTTATTGCTGCCGTCATTGCTTATTATGTTCATCCCGTCAACATTCAAACGGCCTGTCTCATCATGGAAGGCGCTGAATTTACGGAAAACATTATTAATGGCGTCGAGCGTCCGGTTAAAGCCGCTGAATTGTTCGCGTTTACCTTGCGTGTACGCGCAGGAAACACTGACGTTCTTACTGACGCAGAAGAAAACGTGCGTCAAAAATTACGTGCGGAAGGAGTGATGTAATGTCTAAAGGTAAAAAACGTTCTGGCGCTCGCCCTGGTCGTCCGCAGCCG
TTGCGAGGTACTAAAGGCAAGCGTAAAGGCGCTCGTCTTTGGTATGTAGGTGGTCAACAATTTTAATTGCAGGGGCTTCGGCCCCTTACTTGAGGATAAATTATGTCTAATATTCAAACTGGCGCCGAGCGTATGCCGCATGACCTTTCCCATCTTGGCTTCCTTGCTGGTCAGATTGGTCGTCTTATTACCATTTCAACTACTCCGGTTATCGCTGGCGACTCCTTCGAGATGGACGCCGTTGGCGCTCTCCGTCTTTCTCCATTGCGTCGTGGCCTTGCTATTGACTCTACTGTAGACATTTTTACTTTTTATGTCCCTCATCGTCACGTTTATGGTGAACAGTGGATTAAGTTCATGAAGGATGGTGTTAATGCCACTCCTCTCCCGACTGTTAACACTACTGGTTATATTGACCATGCCGCTTTTCTTGGCACGATTAACCCTGATACCAATAAAATCCCTAAGCATTTGTTTCAGGGTTATTTGAATATCTATAACAACTATTTTAAAGCGCCGTGGATGCCTGACCGTACCGAGGCTAACCCTAATGAGCTTAATCAAGATGATGCTCGTTATGGTTTCCGTTGCTGCCATCTCAAAAACATTTGGACTGCTCCGCTTCCTCCTGAGACTGAGCTTTCTCGCCAAATGACGACTTCTACCACATCTATTGACATTATGGGTCTGCAAGCTGCTTATGCTAATTTGCATACTGACCAAGAACGTGATTACTTCATGCAGCGTTACCATGATGTTATTTCTTCATTTGGAGGTAAAACCTCTTATGACGCTGACAACCGTCCTTTACTTGTCATGCGCTCTAATCTCTGGGCATCTGGCTATGATGTTGATGGAACTGACCAAACGTCGTTAGGCCAGTTTTCTGGTCGTGTTCAACAGACCTATAAACATTCTGTGCCGCGTTTCTTTGTTCCTGAGCATGGCACTATGTTTACTCTTGCGCTTGTTCGTTTTCCGCCTACTGCGACTAAAGAGATTCAGTACCTTAACGCTAAAGGTGCTTTGACTTATACCGATATTGCTGGCGACCCTGTTTTGTATGGCAACTTGCCGCCGCGTGAAATTTCTATGAAGGATGTTTTCCGTTCTGGTGATTCGTCTAAGAAGTTTAAGATTGCTGAGGGTCAGTGGTATCGTTATGCGCCTTCGTATGTTTCTCCTGCTTATCACCTTCTTGAAGGCTTCCCATTCATTCAGGAACCGCCTTCTGGTGATTTGCAAGAACGCGTACTTATTCGCCACCATGATTATGACCAGTGTTTCCAGTCCGTTCAGTTGTTGCAGTGGAATAGTCAGGTTAAATTTAATGTGACCGTTTATCGCAATCTGCCGACCACTCGCGATTCAATCATGACTTCGTGATAAAAGATTGA"
+ ORF{NaiveFinder}(650:727, '-', 2, -0.04303976584597201)

Or in a single line using another genome sequence:

phi = dna"GTGTGAGGTTATAACGCCGAAGCGGTAAAAATTTTAATTTTTGCCGCTGAGGGGTTGACCAAGCGAAGCGCGGTAGGTTTTCTGCTTAGGAGTTTAATCATGTTTCAGACTTTTATTTCTCGCCATAATTCAAACTTTTTTTCTGATAAGCTGGTTCTCACTTCTGTTACTCCAGCTTCTTCGGCACCTGTTTTACAGACACCTAAAGCTACATCGTCAACGTTATATTTTGATAGTTTGACGGTTAATGCTGGTAATGGTGGTTTTCTTCATTGCATTCAGATGGATACATCTGTCAACGCCGCTAATCAGGTTGTTTCTGTTGGTGCTGATATTGCTTTTGATGCCGACCCTAAATTTTTTGCCTGTTTGGTTCGCTTTGAGTCTTCTTCGGTTCCGACTACCCTCCCGACTGCCTATGATGTTTATCCTTTGAATGGTCGCCATGATGGTGGTTATTATACCGTCAAGGACTGTGTGACTATTGACGTCCTTCCCCGTACGCCGGGCAATAACGTTTATGTTGGTTTCATGGTTTGGTCTAACTTTACCGCTACTAAATGCCGCGGATTGGTTTCGCTGAATCAGGTTATTAAAGAGATTATTTGTCTCCAGCCACTTAAGTGAGGTGATTTATGTTTGGTGCTATTGCTGGCGGTATTGCTTCTGCTCTTGCTGGTGGCGCCATGTCTAAATTGTTTGGAGGCGGTCAAAAAGCCGCCTCCGGTGGCATTCAAGGTGATGTGCTTGCTACCGATAACAATACTGTAGGCATGGGTGATGCTGGTATTAAATCTGCCATTCAAGGCTCTAATGTTCCTAACCCTGATGAGGCCGCCCCTAGTTTTGTTTCTGGTGCTATGGCTAAAGCTGGTAAAGGACTTCTTGAAGGTACGTTGCAGGCTGGCACTTCTGCCGTTTCTGATAAGTTGCTTGATTTGGTTGGACTTGGTGGCAAGTCTGCCGCTGATAAAGGAAAGGATACTCGTGATTATCTTGCTGCTGCATTTCCTGAGCTTAATGCTTGGGAGCGTGCTGGTGCTGATGCTTCCTCTGCTGGTATGGTTGACGCCGGATTTGAGAATCAAAAAGAGCTTACTAAAATGCAACTGGACAATCAGAAAGAGATTGCCGAGATGCAAAATGAGACTCAAAAAGAGATTGCTGGCATTCAGTCGGCGACTTCACGCCAGAATACGAAAGACCAGGTATATGCACAAAATGAGATGCTTGCTTATCAACAGAAGGAGTCTACTGCTCGCGTTGCGTCTATTATGGAAAACACCAATCTTTCCAAGCAACAGCAGGTTTCCGAGATTATGCGCCAAATGCTTACTCAAGCTCAAACGGCTGGTCAGTATTTTACCAATGACCAAATCAAAGAAATGACTCGCAAGGTTAGTGCTGAGGTTGACTTAGTTCATCAGCAAACGCAGAATCAGCGGTATGGCTCTTCTCATATTGGCGCTACTGCAAAGGATATTTCTAATGTCGTCACTGATGCTGCTTCTGGTGTGGTTGATATTTTTCATGGTATTGATAAAGCTGTTGCCGATACTTGGAACAATTTCTGGAAAGACGGTAAAGCTGATGGTATTGGCTCTAATTTGTCTAGGAAATAACCGTCAGGATTGACACCCTCCCAATTGTATGTTTTCATGCCTCCAAATCTTGGAGGCTTTTTTATGGTTCGTTCTTATTACCCTTCTGAATGTCACGCTGATTATTTTGACTTTGAGCGTATCGAGGCTCTTAAACCTGCTATTGAGGCTTGTGGCATTTCTACTCTTTCTCAATCCCCAATGCTTGGCTTCCATAAGCAGATGGATAACCGCATCAAGCTCTTGGAAGAGATTCTGTCTTTTCGTATGCAGGGCGTTGAGTTCGATAATGGTGATATGTATGTTGACGGCCATAAGGCTGCTTCTGACGTTCGTGATGAGTTTGTATCTGTTACTGAGAAGTTAATGGATGAATTGGCACAATGCTA
CAATGTGCTCCCCCAACTTGATATTAATAACACTATAGACCACCGCCCCGAAGGGGACGAAAAATGGTTTTTAGAGAACGAGAAGACGGTTACGCAGTTTTGCCGCAAGCTGGCTGCTGAACGCCCTCTTAAGGATATTCGCGATGAGTATAATTACCCCAAAAAGAAAGGTATTAAGGATGAGTGTTCAAGATTGCTGGAGGCCTCCACTATGAAATCGCGTAGAGGCTTTGCTATTCAGCGTTTGATGAATGCAATGCGACAGGCTCATGCTGATGGTTGGTTTATCGTTTTTGACACTCTCACGTTGGCTGACGACCGATTAGAGGCGTTTTATGATAATCCCAATGCTTTGCGTGACTATTTTCGTGATATTGGTCGTATGGTTCTTGCTGCCGAGGGTCGCAAGGCTAATGATTCACACGCCGACTGCTATCAGTATTTTTGTGTGCCTGAGTATGGTACAGCTAATGGCCGTCTTCATTTCCATGCGGTGCACTTTATGCGGACACTTCCTACAGGTAGCGTTGACCCTAATTTTGGTCGTCGGGTACGCAATCGCCGCCAGTTAAATAGCTTGCAAAATACGTGGCCTTATGGTTACAGTATGCCCATCGCAGTTCGCTACACGCAGGACGCTTTTTCACGTTCTGGTTGGTTGTGGCCTGTTGATGCTAAAGGTGAGCCGCTTAAAGCTACCAGTTATATGGCTGTTGGTTTCTATGTGGCTAAATACGTTAACAAAAAGTCAGATATGGACCTTGCTGCTAAAGGTCTAGGAGCTAAAGAATGGAACAACTCACTAAAAACCAAGCTGTCGCTACTTCCCAAGAAGCTGTTCAGAATCAGAATGAGCCGCAACTTCGGGATGAAAATGCTCACAATGACAAATCTGTCCACGGAGTGCTTAATCCAACTTACCAAGCTGGGTTACGACGCGACGCCGTTCAACCAGATATTGAAGCAGAACGCAAAAAGAGAGATGAGATTGAGGCTGGGAAAAGTTACTGTAGCCGACGTTTTGGCGGCGCAACCTGTGACGACAAATCTGCTCAAATTTATGCGCGCTTCGATAAAAATGATTGGCGTATCCAACCTGCAGAGTTTTATCGCTTCCATGACGCAGAAGTTAACACTTTCGGATATTTCTGATGAGTCGAAAAATTATCTTGATAAAGCAGGAATTACTACTGCTTGTTTACGAATTAAATCGAAGTGGACTGCTGGCGGAAAATGAGAAAATTCGACCTATCCTTGCGCAGCTCGAGAAGCTCTTACTTTGCGACCTTTCGCCATCAACTAACGATTCTGTCAAAAACTGACGCGTTGGATGAGGAGAAGTGGCTTAATATGCTTGGCACGTTCGTCAAGGACTGGTTTAGATATGAGTCACATTTTGTTCATGGTAGAGATTCTCTTGTTGACATTTTAAAAGAGCGTGGATTACTATCTGAGTCCGATGCTGTTCAACCACTAATAGGTAAGAAATCATGAGTCAAGTTACTGAACAATCCGTACGTTTCCAGACCGCTTTGGCCTCTATTAAGCTCATTCAGGCTTCTGCCGTTTTGGATTTAACCGAAGATGATTTCGATTTTCTGACGAGTAACAAAGTTTGGATTGCTACTGACCGCTCTCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTACGCTGGACTTTGTGGGATACCCTCGCTTTCCTGCTCCTGTTGAGTTTATTGCTGCCGTCATTGCTTATTATGTTCATCCCGTCAACATTCAAACGGCCTGTCTCATCATGGAAGGCGCTGAATTTACGGAAAACATTATTAATGGCGTCGAGCGTCCGGTTAAAGCCGCTGAATTGTTCGCGTTTACCTTGCGTGTACGCGCAGGAAACACTGACGTTCTTACTGACGCAGAAGAAAACGTGCGTCAAAAATTACGTGCGGAAGGAGTGATGTAATGTCTAAAGGTAAAAAACGTTCTGGCGCTCGCCCTGGTCGTCCGCAGCCGT
TGCGAGGTACTAAAGGCAAGCGTAAAGGCGCTCGTCTTTGGTATGTAGGTGGTCAACAATTTTAATTGCAGGGGCTTCGGCCCCTTACTTGAGGATAAATTATGTCTAATATTCAAACTGGCGCCGAGCGTATGCCGCATGACCTTTCCCATCTTGGCTTCCTTGCTGGTCAGATTGGTCGTCTTATTACCATTTCAACTACTCCGGTTATCGCTGGCGACTCCTTCGAGATGGACGCCGTTGGCGCTCTCCGTCTTTCTCCATTGCGTCGTGGCCTTGCTATTGACTCTACTGTAGACATTTTTACTTTTTATGTCCCTCATCGTCACGTTTATGGTGAACAGTGGATTAAGTTCATGAAGGATGGTGTTAATGCCACTCCTCTCCCGACTGTTAACACTACTGGTTATATTGACCATGCCGCTTTTCTTGGCACGATTAACCCTGATACCAATAAAATCCCTAAGCATTTGTTTCAGGGTTATTTGAATATCTATAACAACTATTTTAAAGCGCCGTGGATGCCTGACCGTACCGAGGCTAACCCTAATGAGCTTAATCAAGATGATGCTCGTTATGGTTTCCGTTGCTGCCATCTCAAAAACATTTGGACTGCTCCGCTTCCTCCTGAGACTGAGCTTTCTCGCCAAATGACGACTTCTACCACATCTATTGACATTATGGGTCTGCAAGCTGCTTATGCTAATTTGCATACTGACCAAGAACGTGATTACTTCATGCAGCGTTACCATGATGTTATTTCTTCATTTGGAGGTAAAACCTCTTATGACGCTGACAACCGTCCTTTACTTGTCATGCGCTCTAATCTCTGGGCATCTGGCTATGATGTTGATGGAACTGACCAAACGTCGTTAGGCCAGTTTTCTGGTCGTGTTCAACAGACCTATAAACATTCTGTGCCGCGTTTCTTTGTTCCTGAGCATGGCACTATGTTTACTCTTGCGCTTGTTCGTTTTCCGCCTACTGCGACTAAAGAGATTCAGTACCTTAACGCTAAAGGTGCTTTGACTTATACCGATATTGCTGGCGACCCTGTTTTGTATGGCAACTTGCCGCCGCGTGAAATTTCTATGAAGGATGTTTTCCGTTCTGGTGATTCGTCTAAGAAGTTTAAGATTGCTGAGGGTCAGTGGTATCGTTATGCGCCTTCGTATGTTTCTCCTGCTTATCACCTTCTTGAAGGCTTCCCATTCATTCAGGAACCGCCTTCTGGTGATTTGCAAGAACGCGTACTTATTCGCCACCATGATTATGACCAGTGTTTCCAGTCCGTTCAGTTGTTGCAGTGGAATAGTCAGGTTAAATTTAATGTGACCGTTTATCGCAATCTGCCGACCACTCGCGATTCAATCATGACTTCGTGATAAAAGATTGA"
 
 filter(x -> iscoding(sequence(x), η=1e-10) && length(x) > 100, findorfs(phi))
 
@@ -130,4 +107,4 @@
  ORF{NaiveFinder}(4540:4644, '+', 1)
  ORF{NaiveFinder}(4690:4866, '+', 1)
  ORF{NaiveFinder}(4741:4866, '+', 1)
- ORF{NaiveFinder}(4744:4866, '+', 1)
+ ORF{NaiveFinder}(4744:4866, '+', 1)