Skip to content

Commit

Permalink
Refactor code for improved readability and variable naming
Browse files Browse the repository at this point in the history
  • Loading branch information
camilogarciabotero committed Jul 29, 2024
1 parent 117a9db commit 115db07
Showing 1 changed file with 45 additions and 47 deletions.
92 changes: 45 additions & 47 deletions docs/src/features.md
Original file line number Diff line number Diff line change
@@ -1,13 +1,13 @@
## The ORF features
## Scoring ORFs

The `ORF` type is designed to be flexible and can store various types of information about the ORF. This versatility lies on the `Features` field and allows every instance of the `ORF` to hold data such as a score of an ORF based on a scoring scheme, the sequence of the ORF, or even the translated amino acid sequence. For example, in the `NaiveFinder` method, the `score` subfield is used to store the score of the ORF using an scoring scheme (i.e. a function).
The `ORF` type is designed to be flexible and can store various types of information about the ORF. A very neat feature is that stores a view of the sequence that the ORF represents. Since the the ORF sequence is stored in the struct, then any method that can be applied to a sequence can be applied to the ORF sequence. This is useful for scoring the ORFs by overloading a method that calculates a score for a `BioSequence`. For instance the `lors` function from the [BioMarkovChains.jl](https://camilogarciabotero.github.io/BioMarkovChains.jl/dev/) package can be used to calculate a score of the ORFs predicted for the phi genome.

Take the following example:

```julia
phi = dna"GTGTGAGGTTATAACGCCGAAGCGGTAAAAATTTTAATTTTTGCCGCTGAGGGGTTGACCAAGCGAAGCGCGGTAGGTTTTCTGCTTAGGAGTTTAATCATGTTTCAGACTTTTATTTCTCGCCATAATTCAAACTTTTTTTCTGATAAGCTGGTTCTCACTTCTGTTACTCCAGCTTCTTCGGCACCTGTTTTACAGACACCTAAAGCTACATCGTCAACGTTATATTTTGATAGTTTGACGGTTAATGCTGGTAATGGTGGTTTTCTTCATTGCATTCAGATGGATACATCTGTCAACGCCGCTAATCAGGTTGTTTCTGTTGGTGCTGATATTGCTTTTGATGCCGACCCTAAATTTTTTGCCTGTTTGGTTCGCTTTGAGTCTTCTTCGGTTCCGACTACCCTCCCGACTGCCTATGATGTTTATCCTTTGAATGGTCGCCATGATGGTGGTTATTATACCGTCAAGGACTGTGTGACTATTGACGTCCTTCCCCGTACGCCGGGCAATAACGTTTATGTTGGTTTCATGGTTTGGTCTAACTTTACCGCTACTAAATGCCGCGGATTGGTTTCGCTGAATCAGGTTATTAAAGAGATTATTTGTCTCCAGCCACTTAAGTGAGGTGATTTATGTTTGGTGCTATTGCTGGCGGTATTGCTTCTGCTCTTGCTGGTGGCGCCATGTCTAAATTGTTTGGAGGCGGTCAAAAAGCCGCCTCCGGTGGCATTCAAGGTGATGTGCTTGCTACCGATAACAATACTGTAGGCATGGGTGATGCTGGTATTAAATCTGCCATTCAAGGCTCTAATGTTCCTAACCCTGATGAGGCCGCCCCTAGTTTTGTTTCTGGTGCTATGGCTAAAGCTGGTAAAGGACTTCTTGAAGGTACGTTGCAGGCTGGCACTTCTGCCGTTTCTGATAAGTTGCTTGATTTGGTTGGACTTGGTGGCAAGTCTGCCGCTGATAAAGGAAAGGATACTCGTGATTATCTTGCTGCTGCATTTCCTGAGCTTAATGCTTGGGAGCGTGCTGGTGCTGATGCTTCCTCTGCTGGTATGGTTGACGCCGGATTTGAGAATCAAAAAGAGCTTACTAAAATGCAACTGGACAATCAGAAAGAGATTGCCGAGATGCAAAATGAGACTCAAAAAGAGATTGCTGGCATTCAGTCGGCGACTTCACGCCAGAATACGAAAGACCAGGTATATGCACAAAATGAGATGCTTGCTTATCAACAGAAGGAGTCTACTGCTCGCGTTGCGTCTATTATGGAAAACACCAATCTTTCCAAGCAACAGCAGGTTTCCGAGATTATGCGCCAAATGCTTACTCAAGCTCAAACGGCTGGTCAGTATTTTACCAATGACCAAATCAAAGAAATGACTCGCAAGGTTAGTGCTGAGGTTGACTTAGTTCATCAGCAAACGCAGAATCAGCGGTATGGCTCTTCTCATATTGGCGCTACTGCAAAGGATATTTCTAATGTCGTCACTGATGCTGCTTCTGGTGTGGTTGATATTTTTCATGGTATTGATAAAGCTGTTGCCGATACTTGGAACAATTTCTGGAAAGACGGTAAAGCTGATGGTATTGGCTCTAATTTGTCTAGGAAATAACCGTCAGGATTGACACCCTCCCAATTGTATGTTTTCATGCCTCCAAATCTTGGAGGCTTTTTTATGGTTCGTTCTTATTACCCTTCTGAATGTCACGCTGATTATTTTGACTTTGAGCGTATCGAGGCTCTTAAACCTGCTATTGAGGCTTGTGGCATTTCTACTCTTTCTCAATCCCCAATGCTTGGCTTCCATAAGCAGATGGATAACCGCATCAAGCTCTTGGAAGAGATTCTGTCTTTTCGTATGCAGGGCGTTGAGTTCGATAATGGTGATATGTATGTTGACGGCCATAAGGCTGCTTCTGACGTTCGTGATGAGTTTGTATCTGTTACTGAGAAGTTAATGGATGAATTGGCACAATGCTACAATGTGCTCCCCCAACTTGATATTAATAACACTATAGACCACCGCCCCGAAGGGGACGAAAAATGGTTTTTAGAGAACGAGAAGACGGTTACGCAGTTTTGCCGCAAGCTGGCTGCTGAACGCCCTCTTAAGGATATTCGCGATGAGTATAATTACCCCAAAAAGAAAGGTATTAAGGATGAGTGTTCAAGATTGCTGGAGGCCTCCACTATGAAATCGCGTAGAGGCTTTGCTATTCAGCGTTTGATGAATGCAATGCGACAGGCTCATGCTGATGGTTGGTTTATCGTTTTTGACACTCTCACGTTGGCTGACGACCGATTAGAGGCGTTTTATGATAATCCCAATGCTTTGCGTGACTATTTTCGTGATATTGGTCGTATGGTTCTTGCTGCCGAGGGTCGCAAGGCTAATGATTCACACGCCGACTGCTATCAGTATTTTTGTGTGCCTGAGTATGGTACAGCTAATGGCCGTCTTCATTTCCATGCGGTGCACTTTATGCGGACACTTCCTACAGGTAGCGTTGACCCTAATTTTGGTCGTCGGGTACGCAATCGCCGCCAGTTAAATAGCTTGCAAAATACGTGGCCTTATGGTTACAGTATGCCCATCGCAGTTCGCTACACGCAGGACGCTTTTTCACGTTCTGGTTGGTTGTGGCCTGTTGATGCTAAAGGTGAGCCGCTTAAAGCTACCAGTTATATGGCTGTTGGTTTCTATGTGGCTAAATACGTTAACAAAAAGTCAGATATGGACCTTGCTGCTAAAGGTCTAGGAGCTAAAGAATGGAACAACTCACTAAAAACCAAGCTGTCGCTACTTCCCAAGAAGCTGTTCAGAATCAGAATGAGCCGCAACTTCGGGATGAAAATGCTCACAATGACAAATCTGTCCACGGAGTGCTTAATCCAACTTACCAAGCTGGGTTACGACGCGACGCCGTTCAACCAGATATTGAAGCAGAACGCAAAAAGAGAGATGAGATTGAGGCTGGGAAAAGTTACTGTAGCCGACGTTTTGGCGGCGCAACCTGTGACGACAAATCTGCTCAAATTTATGCGCGCTTCGATAAAAATGATTGGCGTATCCAACCTGCAGAGTTTTATCGCTTCCATGACGCAGAAGTTAACACTTTCGGATATTTCTGATGAGTCGAAAAATTATCTTGATAAAGCAGGAATTACTACTGCTTGTTTACGAATTAAATCGAAGTGGACTGCTGGCGGAAAATGAGAAAATTCGACCTATCCTTGCGCAGCTCGAGAAGCTCTTACTTTGCGACCTTTCGCCATCAACTAACGATTCTGTCAAAAACTGACGCGTTGGATGAGGAGAAGTGGCTTAATATGCTTGGCACGTTCGTCAAGGACTGGTTTAGATATGAGTCACATTTTGTTCATGGTAGAGATTCTCTTGTTGACATTTTAAAAGAGCGTGGATTACTATCTGAGTCCGATGCTGTTCAACCACTAATAGGTAAGAAATCATGAGTCAAGTTACTGAACAATCCGTACGTTTCCAGACCGCTTTGGCCTCTATTAAGCTCATTCAGGCTTCTGCCGTTTTGGATTTAACCGAAGATGATTTCGATTTTCTGACGAGTAACAAAGTTTGGATTGCTACTGACCGCTCTCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTACGCTGGACTTTGTGGGATACCCTCGCTTTCCTGCTCCTGTTGAGTTTATTGCTGCCGTCATTGCTTATTATGTTCATCCCGTCAACATTCAAACGGCCTGTCTCATCATGGAAGGCGCTGAATTTACGGAAAACATTATTAATGGCGTCGAGCGTCCGGTTAAAGCCGCTGAATTGTTCGCGTTTACCTTGCGTGTACGCGCAGGAAACACTGACGTTCTTACTGACGCAGAAGAAAACGTGCGTCAAAAATTACGTGCGGAAGGAGTGATGTAATGTCTAAAGGTAAAAAACGTTCTGGCGCTCGCCCTGGTCGTCCGCAGCCGTTGCGAGGTACTAAAGGCAAGCGTAAAGGCGCTCGTCTTTGGTATGTAGGTGGTCAACAATTTTAATTGCAGGGGCTTCGGCCCCTTACTTGAGGATAAATTATGTCTAATATTCAAACTGGCGCCGAGCGTATGCCGCATGACCTTTCCCATCTTGGCTTCCTTGCTGGTCAGATTGGTCGTCTTATTACCATTTCAACTACTCCGGTTATCGCTGGCGACTCCTTCGAGATGGACGCCGTTGGCGCTCTCCGTCTTTCTCCATTGCGTCGTGGCCTTGCTATTGACTCTACTGTAGACATTTTTACTTTTTATGTCCCTCATCGTCACGTTTATGGTGAACAGTGGATTAAGTTCATGAAGGATGGTGTTAATGCCACTCCTCTCCCGACTGTTAACACTACTGGTTATATTGACCATGCCGCTTTTCTTGGCACGATTAACCCTGATACCAATAAAATCCCTAAGCATTTGTTTCAGGGTTATTTGAATATCTATAACAACTATTTTAAAGCGCCGTGGATGCCTGACCGTACCGAGGCTAACCCTAATGAGCTTAATCAAGATGATGCTCGTTATGGTTTCCGTTGCTGCCATCTCAAAAACATTTGGACTGCTCCGCTTCCTCCTGAGACTGAGCTTTCTCGCCAAATGACGACTTCTACCACATCTATTGACATTATGGGTCTGCAAGCTGCTTATGCTAATTTGCATACTGACCAAGAACGTGATTACTTCATGCAGCGTTACCATGATGTTATTTCTTCATTTGGAGGTAAAACCTCTTATGACGCTGACAACCGTCCTTTACTTGTCATGCGCTCTAATCTCTGGGCATCTGGCTATGATGTTGATGGAACTGACCAAACGTCGTTAGGCCAGTTTTCTGGTCGTGTTCAACAGACCTATAAACATTCTGTGCCGCGTTTCTTTGTTCCTGAGCATGGCACTATGTTTACTCTTGCGCTTGTTCGTTTTCCGCCTACTGCGACTAAAGAGATTCAGTACCTTAACGCTAAAGGTGCTTTGACTTATACCGATATTGCTGGCGACCCTGTTTTGTATGGCAACTTGCCGCCGCGTGAAATTTCTATGAAGGATGTTTTCCGTTCTGGTGATTCGTCTAAGAAGTTTAAGATTGCTGAGGGTCAGTGGTATCGTTATGCGCCTTCGTATGTTTCTCCTGCTTATCACCTTCTTGAAGGCTTCCCATTCATTCAGGAACCGCCTTCTGGTGATTTGCAAGAACGCGTACTTATTCGCCACCATGATTATGACCAGTGTTTCCAGTCCGTTCAGTTGTTGCAGTGGAATAGTCAGGTTAAATTTAATGTGACCGTTTATCGCAATCTGCCGACCACTCGCGATTCAATCATGACTTCGTGATAAAAGATTGA"

phiorfs = findorfs(phi, finder=NaiveFinder, minlen=75, scheme=lors)
phiorfs = findorfs(phi, finder=NaiveFinder, minlen=75)

124-element Vector{ORF{4, NaiveFinder}}:
ORF{NaiveFinder}(9:101, '-', 3)
Expand Down Expand Up @@ -43,7 +43,44 @@ phiorfs = findorfs(phi, finder=NaiveFinder, minlen=75, scheme=lors)
ORF{NaiveFinder}(5164:5325, '+', 1)
```

In the example above we calculated a score using the `lors` (`logg_odds_ratio_score`) scoring scheme (see [lors](https://github.com/camilogarciabotero/BioMarkovChains.jl/blob/533e53d97cf5951f1ca050454bce1423ec8d7c36/src/transitions.jl#L179) from the [BioMarkovChains.jl](https://camilogarciabotero.github.io/BioMarkovChains.jl/dev/) package). The score is stored in the `score` subfield of the `ORF.Features`.
We can now calculate a score using the `lors` (`logg_odds_ratio_score`) scoring scheme (see [lors](https://github.com/camilogarciabotero/BioMarkovChains.jl/blob/533e53d97cf5951f1ca050454bce1423ec8d7c36/src/transitions.jl#L179) from the [BioMarkovChains.jl](https://camilogarciabotero.github.io/BioMarkovChains.jl/dev/) package).

```julia
lors.(phiorfs)

204-element Vector{Float64}:
-3.002461366087374
0.1063473191426695
-10.814621287968222
0.0015932011596656963
-2.635307521118871
-5.344187934894264
-1.35785984913167
-1.316724559874126
-1.796631200562138
0.4895532892313278
-3.2651518608269856
-1.4019264441082822
-0.32264884956233186
-0.6695213802037128
1.1023617306499074
-5.40624241726446
-0.8080572222081075
-5.571494087742448
0.8307089386780881
-4.882156920421228
-5.639670353834974
-0.8764121443326865
-4.308687693802273
-4.459423419810693
0.5077309777574499
-2.2808971022330824
-2.138892671183213
0.6106494455192994
1.1981063545591812
0.7566845754226063
```

Briefly, a sequence of DNA could be scored using a Markov model of the transition probabilities of a known sequence. This could be done using a *log-odds ratio score*, which is the logarithm of the ratio of the transition probabilities of the sequence given two models. The log-odds ratio score is defined as:

Expand All @@ -55,50 +92,11 @@ S(x) = \sum_{i=1}^{L} \beta_{x_{i}x} = \sum_{i=1} \log \frac{a^{\mathscr{m}_{1}}

Where the ``a^{\mathscr{m}_{1}}_{i-1} x_i`` is the transition probability of the first model (in this case the calculated for the given sequence) from the state ``x_{i-1}`` to the state ``x_i`` and ``a^{\mathscr{m}_{2}}_{i-1} x_i`` is the transition probability of the second model from the state ``x_{i-1}`` to the state ``x_i``. The score is the sum of the log-odds ratio of the transition probabilities of the sequence given the two models.

In the `lors` case, the two models are the coding and non-coding models of the E. coli genome. The coding model is a Markov model of the transition probabilities of the coding regions of the E. coli genome, and the non-coding model is a Markov model of the transition probabilities of the non-coding regions of the E. coli genome.

All features can be accesed using a conviniente funciton called `features` that returns a `NamedTuple` with the features of the `ORF` and can be broadcasted to the entire collection of `ORF`s using the `.` (broadcast) syntax.

```julia
features.(phiorfs)

124-element Vector{@NamedTuple{score::Float64}}:
(score = -3.002461366087374,)
(score = -10.814621287968222,)
(score = -5.344187934894264,)
(score = -1.316724559874126,)
(score = -1.796631200562138,)
(score = -3.2651518608269856,)
(score = -1.4019264441082822,)
(score = -2.3192349590107475,)
(score = 5.055524446434241,)
(score = 2.7116397224896436,)
(score = 2.2564640592402165,)
(score = 1.777499581940097,)
(score = 2.3474811908011186,)
(score = 2.38568188352799,)
(score = 2.498608044469827,)
(score = -5.474837954151803,)
(score = 0.6909362932156138,)
(score = -5.900045211699447,)
(score = 1.2010656615619415,)
(score = 0.8541931309205604,)
(score = 2.7897961643147777,)
(score = -4.42890346770467,)
(score = -5.40624241726446,)
(score = -0.8080572222081075,)
(score = -5.571494087742448,)
(score = -4.882156920421228,)
(score = -5.639670353834974,)
(score = -0.8764121443326865,)
(score = -4.308687693802273,)
(score = -4.459423419810693,)
```
In the `lors` case, the two models are the coding and non-coding models of the *E. coli* genome. The coding model is a Markov model of the transition probabilities of the coding regions of the *E. coli* genome, and the non-coding model is a Markov model of the transition probabilities of the non-coding regions of the *E. coli* genome.

## Analysing Lambda ORFs

As mentioned above the `lors` calculates the log odds ratio of the ORF sequence given two Markov models (by default: [ECOLICDS](https://github.com/camilogarciabotero/BioMarkovChains.jl/blob/533e53d97cf5951f1ca050454bce1423ec8d7c36/src/models.jl#L3) and [ECOLINOCDS](https://github.com/camilogarciabotero/BioMarkovChains.jl/blob/533e53d97cf5951f1ca050454bce1423ec8d7c36/src/models.jl#L16)), one for the coding region and one for the non-coding region. The score is stored in the `score` field of the `NamedTuple` returned by the `features` function. By default the `lors` function return the base 2 logarithm of the odds ratio, so it is analogous to the bits of information that the ORF sequence is coding.
As mentioned above the `lors` calculates the log odds ratio of the ORF sequence given two Markov models (by default: [ECOLICDS](https://github.com/camilogarciabotero/BioMarkovChains.jl/blob/533e53d97cf5951f1ca050454bce1423ec8d7c36/src/models.jl#L3) and [ECOLINOCDS](https://github.com/camilogarciabotero/BioMarkovChains.jl/blob/533e53d97cf5951f1ca050454bce1423ec8d7c36/src/models.jl#L16)), one for the coding region and one for the non-coding region. By default the `lors` function return the base 2 logarithm of the odds ratio, so it is analogous to the bits of information that the ORF sequence is coding.

Now we can even analyse how is the distribution of the ORFs' scores as a function of their lengths compared to random sequences.

Expand All @@ -113,9 +111,9 @@ open(FASTA.Reader, lambdafile) do reader
end

# find the ORFs in the lambda genome
lambaorfs = findorfs(lambdaseq, finder=NaiveFinder, minlen=100, scheme=lors)
lambaorfs = findorfs(lambdaseq, finder=NaiveFinder, minlen=100)

lambdascores = score.(lambaorfs)
lambdascores = lors.(lambaorfs)
lambdalengths = length.(lambaorfs)

## get some random sequences of variable lengths
Expand Down

0 comments on commit 115db07

Please sign in to comment.