---
title: 'rubias --- genetic stock identification (GSI) in the tidyverse'
date: '`r format(Sys.time(), "%d %B, %Y")`'
output:
github_document:
toc: yes
bibliography: rubias_bib.bibtex
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
<!-- badges: start -->
[![CRAN status](https://www.r-pkg.org/badges/version/rubias)](https://CRAN.R-project.org/package=rubias)
<!-- badges: end -->
```{r, echo = FALSE, message=FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "tools/"
)
```
This is an R package for performing genetic stock identification (GSI) and
associated tasks. Additionally, it includes a method
designed to diagnose and correct a bias recently
documented in genetic stock identification. The bias occurs when mixture
proportion estimates are desired for groups of populations (reporting units)
and the number of populations within each reporting unit are uneven.
In order to run C++ implementations of MCMC, rubias requires the package
Rcpp (and now, also, RcppParallel for the baseline resampling option),
which in turn requires an Rtools installation (if you are on Windows)
or XCode (if you are on a Mac). After cloning the repository
with the above dependencies installed, build & reload the package
to view further documentation.
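For a typical user who just wants the released package, installation looks like this (the CRAN route follows from the badge above; the GitHub remote is the repository linked later in this README, and `remotes` is assumed to be installed for the development version):

```{r, eval=FALSE}
# released version from CRAN
install.packages("rubias")

# or the development version from GitHub (assumes the remotes package)
# remotes::install_github("eriqande/rubias")
```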
The script "/R-main/coalescent_sim" was used to generate coalescent simulations
for bias correction validation. This is unnecessary for testing the applicability
of our methods to any particular dataset, which should be done using
`assess_reference_loo()` and `assess_pb_bias_correction()`.
`coalescent_sim()` creates simulated populations using the `ms` coalescent simulation
program, available from the Hudson lab at UChicago, and the `GSImulator`
and `ms2geno` packages, available at [https://github.com/eriqande](https://github.com/eriqande),
and so requires more dependencies than the rest of the package.
# Input Data
The functions for conducting genetic mixture analysis and for doing simulation
assessment to predict the accuracy of a set of genetic markers for genetic stock
identification require that genetic data be input as a data frame in a
specific format:
- one row per individual
- each locus is represented by two adjacent columns, one for each allele (this package is only
configured for diploids, at the moment). Allelic types can be expressed as any number
or character
- missing data at a locus is expressed with NA values for each gene copy at the locus
- if one gene copy is missing from a locus in an individual, then both gene copies must be missing at the locus.
- the name of the locus is taken to be the column name of the _first_ column of each pair
of locus columns. The header on the second column is ignored.
- the data frame must have four columns of meta data for each individual:
* `sample_type`: a column telling whether the sample is a `reference` sample or a `mixture` sample.
* `repunit`: the reporting unit that an individual/collection belongs to. This is required if sample_type is
`reference`. And if sample_type is `mixture` then repunit must be `NA`.
This must be a character vector. Not a factor. The idea of a "reporting unit"
is well-known amongst people doing genetic stock identification of salmon, but might not be familiar
elsewhere. Briefly, a reporting unit is a group of populations (which we call "collections") that
are typically closely related genetically, and which will likely be aggregated in the results of the
GSI exercise.
* `collection`: for reference samples, the name of the population that the individual is from. For mixture
samples, this is the name of the particular sample (i.e. stratum or port that is to be treated together in
space and time). This must be a character, not a factor.
* `indiv`: a character vector with the ID of the fish. These must be unique.
- When we started developing `rubias`, we intended to allow both the `repunit` and the `collection` columns to be
either character vectors or factors. Having them as factors might be desirable if, for example, a certain
sort order of the collections or repunits was desired. _However_ at some point it became clear to Eric that,
given our approach to converting all the data to a C++ data structure of integers, for rapid analysis,
we would be exposing ourselves to greater opportunities for bugginess
by allowing `repunit` and `collection` to be factors. Accordingly, they **must** be character vectors. If they
are not, `rubias` will throw an error. **Note**: if you do have a specific sort order for your collections or
repunits, you can always change them into factors after analysis with `rubias`. Additionally, you can keep
extra columns in your original data frame (for example `repunit_f` or `collection_f`) in which the repunits or
the collections are stored as factors. See, for example the data file `alewife`. Or you can just keep a character
vector that has the sort order you would like, so as to use it when changing things to factors
after `rubias` analysis. (See, for instance, `chinook_repunit_levels`.)
- The file can have any number of other meta data columns; however, _they must all occur in the data frame **before** the columns of genetic data_.
- When you pass a data frame into any of these functions, you have to tell it which column the genetic data starts
in, and it is assumed that all the columns after that one contain genetic data.
- If you are doing a mixture analysis, the data frame of mixture fish and of the reference fish must have the
same column structure, i.e., they must have exactly the same number of columns with exactly the same
column names, in the same order and of the same type.
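To make the format above concrete, here is a minimal, hand-built reference data frame (the collection, repunit, and locus names are purely illustrative, not from the package data sets):

```{r toy-reference, eval=FALSE}
library(tibble)

# two individuals, four meta data columns, then two loci
# (two adjacent columns per locus; only the first column's
# header supplies the locus name)
toy_ref <- tibble(
  sample_type  = c("reference", "reference"),
  repunit      = c("UpperRiver", "UpperRiver"),  # character, not factor
  collection   = c("Toy_Creek", "Toy_Creek"),    # character, not factor
  indiv        = c("fish_001", "fish_002"),      # unique IDs
  Locus1       = c("A", "G"),                    # gene copy 1 of Locus1
  Locus1.1     = c("A", "A"),                    # gene copy 2 of Locus1
  Locus2       = c(1, NA),                       # alleles may be numbers
  Locus2.1     = c(2, NA)                        # both copies NA = missing
)

# genetic data start in column 5, so gen_start_col = 5
```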
## How about haploid markers?
At the request of the good folks at ADFG, I introduced a few hacks to allow the input to
include markers that are haploid (for example mtDNA haplotypes). To denote a marker as haploid
you _still give it two columns of data_ in your data frame, but the second column of the haploid
marker must be entirely NAs. When `rubias` is processing the data and it sees this, it assumes that
the marker is haploid and it treats it appropriately.
Note that if you have a diploid marker it typically does not make sense to mark one of the gene copies as
missing and the other as non-missing. Accordingly, if you have a diploid marker that records just one of the
gene copies as missing in any individual, `rubias` will throw an error. Likewise, if your haploid marker does
not have an NA at the second gene copy for every single individual, it will also throw an error.
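For example, a haploid mtDNA marker would be encoded like this (a sketch with made-up haplotype names; the essential point is that the second column of the pair is NA for every individual):

```{r haploid-sketch, eval=FALSE}
# the all-NA second column signals to rubias that the marker is haploid
toy_haploid <- tibble::tibble(
  indiv   = c("fish_001", "fish_002", "fish_003"),
  mtDNA   = c("hap_A", "hap_B", "hap_A"),  # observed haplotype
  mtDNA.1 = c(NA, NA, NA)                  # must be NA for every individual
)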
## An example reference data file
Here are the meta data columns and the first two loci for eight individuals in the `chinook` reference data set that comes with the
package:
```{r head-chinook}
library(tidyverse) # load up the tidyverse library, we will use it later...
library(rubias)
head(chinook[, 1:8])
```
## An example mixture data file
Here is the same for the mixture data frame that goes along with that reference data set:
```{r head-chinook-mix}
head(chinook_mix[, 1:8])
```
## Preliminary good practice --- check for duplicate individuals
Sometimes, for a variety of reasons, an individual's genotype might appear more than once in a data
set. `rubias` has a quick and dirty function to spot pairs of individuals that share a large number of
genotypes. Clearly you only want to look at pairs that don't have a whole lot of missing data, so one
parameter is the fraction of loci that are non-missing in either fish. In our experience with Fluidigm
assays, if a fish is missing at > 10% of the SNPs, the remaining genotypes are likely to have
a fairly high error rate. So, to look for matching samples, let's require 85% of the genotypes
to be non-missing in both members of the pair. The last parameter is the
fraction of non-missing loci at which the pair has the same genotype. We will set that to 0.94 first.
Here we see it in action:
```{r}
# combine chinook and chinook_mix into one big data frame,
# but drop the California_Coho collection because Coho all
# have pretty much the same genotype at these loci!
chinook_all <- bind_rows(chinook, chinook_mix) %>%
filter(collection != "California_Coho")
# then toss them into a function. This takes half a minute or so...
matchy_pairs <- close_matching_samples(D = chinook_all,
gen_start_col = 5,
min_frac_non_miss = 0.85,
min_frac_matching = 0.94
)
# see what that looks like:
matchy_pairs %>%
arrange(desc(num_non_miss), desc(num_match))
```
Check that out. This reveals 33 pairs in the data set that are likely duplicate
samples.
If we reduce `min_frac_matching`, we get more matches, but these are very unlikely
to be the same individual, unless genotyping error rates are very high.
```{r}
# then toss them into a function. This takes half a minute or so...
matchy_pairs2 <- close_matching_samples(D = chinook_all,
gen_start_col = 5,
min_frac_non_miss = 0.85,
min_frac_matching = 0.85
)
# see what that looks like:
matchy_pairs2 %>%
arrange(desc(num_non_miss), desc(num_match))
```
A more principled approach would be to use the allele frequencies in each collection
and take a likelihood based approach, but this is adequate for finding obvious duplicates.
## How about known-origin individuals in the mixture?
In some cases, you might know (more or less unambiguously) the origin of some fish
in a particular mixture sample. For example, if 10% of the individuals in a mixture
carried coded wire tags, then you would want to include them in the sample, but
make sure that their collections of origin were hard-coded to be what the CWTs said.
Another scenario in which this might occur is when the genetic data were used for parentage-based
tagging of the individuals in the mixture sample. In that case, some individuals might be placed
with very high confidence to parents. Then, they should be included in the mixture as having come
from a known collection. The folks at the DFO in Nanaimo, Canada are doing an amazing
job with PBT and wondered if rubias could be modified to deal with the latter situation.
We've made some small additions to accommodate this. rubias does not do any actual inference of parentage, but if you
know the origin of some fish in the mixture, that can be included in the rubias analysis. The way you do this with
the function `infer_mixture()` is to include a column called `known_collection` in both the reference data frame
and the mixture data frame. In the reference data frame, `known_collection` should just be a copy of the
`collection` column. In the mixture data frame, each entry in `known_collection` should be the collection
that the individual is known to be from (i.e. from parentage inference or a CWT), or NA if the individual is not known
to be from any collection. Note that the names of the collections in `known_collection` must match
those found in the `collection` column in the reference data set.
These modifications are not allowed for the parametric bootstrap (PB) method in `infer_mixture()`.
# Performing a Genetic Mixture Analysis
This is done with the `infer_mixture` function. In the example data
`chinook_mix` our data consist of fish caught in three different fisheries, `rec1`,
`rec2`, and `rec3` as denoted in the collection column. Each of those collections is
treated as a separate sample, getting its own mixing proportion estimate. This is how
it is run with the default options:
```{r infer_mixture1}
mix_est <- infer_mixture(reference = chinook,
mixture = chinook_mix,
gen_start_col = 5)
```
The result comes back as a list of four tidy data frames:
1. `mixing_proportions`: the mixing proportions. The column `pi` holds the estimated mixing proportion for each collection.
2. `indiv_posteriors`: this holds, for each individual, the posterior means of group membership in each
collection. Column `PofZ` holds those values. Column `log_likelihood` holds the log of the probability
of the individual's genotype given that it is from the collection. Also included are `n_non_miss_loci` and `n_miss_loci`, which are
the number of observed loci and the number of missing loci at the individual. A list column `missing_loci` contains
vectors with the indices (and the names) of the loci that are missing in that individual. It also includes a column
`z_score` which can be used to diagnose fish that don't belong to any samples in the reference database (see below).
3. `mix_prop_traces`: MCMC traces of the mixing proportions for each collection. You will use these if you
want to make density estimates of the posterior distribution of the mixing proportions or if you want
to compute credible intervals.
4. `bootstrapped_proportions`: This is NULL in the above example, but if we had chosen
`method = "PB"` then this would be a tibble of bootstrap-corrected reporting unit
mixing proportions.
These data frames look like this:
```{r look-at-mix-est}
lapply(mix_est, head)
```
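For reference, a sketch of the call that would populate `bootstrapped_proportions` is the same as above, just with the PB method selected (not evaluated here, since the parametric bootstrap takes considerably longer):

```{r, eval=FALSE}
# method = "PB" adds the parametric-bootstrap bias correction, so
# bootstrapped_proportions holds corrected reporting-unit proportions
mix_est_pb <- infer_mixture(reference = chinook,
                            mixture = chinook_mix,
                            gen_start_col = 5,
                            method = "PB")
```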
## Setting the prior for the mixing proportions
In some cases there might be a reason to explicitly set the parameters of the
Dirichlet prior on the mixing proportions of the collections. For a contrived example,
we could imagine that we wanted a Dirichlet prior with all parameters equal to
1/(# of collections), except for the parameters for all the Central Valley Fall Run
populations, to which we would like to assign Dirichlet parameters of 2. That can be
accomplished with the `pi_prior` argument to the `infer_mixture()` function,
which will let you pass in a tibble in which
one column named "collection" gives the collection, and the other column, named "pi_param"
gives the desired parameter.
Here we construct that kind of input:
```{r}
prior_tibble <- chinook %>%
count(repunit, collection) %>%
filter(repunit == "CentralValleyfa") %>%
select(collection) %>%
mutate(pi_param = 2)
# see what it looks like:
prior_tibble
```
Then we can run that in infer_mixture():
```{r}
set.seed(12)
mix_est_with_prior <- infer_mixture(reference = chinook,
mixture = chinook_mix,
gen_start_col = 5,
pi_prior = prior_tibble)
```
and now, for fun, we can compare the mixing proportion estimates for each collection
with and without the prior, for the mixture collection `rec1`:
```{r}
comp_mix_ests <- list(
`pi (default prior)` = mix_est$mixing_proportions,
`pi (cv fall gets 2s prior)` = mix_est_with_prior$mixing_proportions
) %>%
bind_rows(.id = "prior_type") %>%
filter(mixture_collection == "rec1") %>%
select(prior_type, repunit, collection, pi) %>%
spread(prior_type, pi) %>%
mutate(coll_group = ifelse(repunit == "CentralValleyfa", "CV_fall", "Not_CV_fall"))
ggplot(comp_mix_ests,
aes(x = `pi (default prior)`,
y = `pi (cv fall gets 2s prior)`,
colour = coll_group
)) +
geom_point() +
geom_abline(slope = 1, intercept = 0, linetype = "dashed")
```
Yep, slightly different than before. Let's look at the sums of everything:
```{r}
comp_mix_ests %>%
group_by(coll_group) %>%
summarise(with_explicit_prior = sum(`pi (cv fall gets 2s prior)`),
with_default_prior = sum(`pi (default prior)`))
```
We see that, for the most part, this change to the prior redistributed
fish among the different collections within the Central Valley Fall reporting unit. This
is not surprising---it is very hard to tell apart fish from those different collections.
However, it did not greatly change the estimated proportion of the whole reporting unit,
which also makes sense if you consider the effect that the extra weight in the prior will have.
## Aggregating collections into reporting units
This is a simple operation in the [tidyverse](https://www.tidyverse.org/):
```{r aggregating}
# for mixing proportions
rep_mix_ests <- mix_est$mixing_proportions %>%
group_by(mixture_collection, repunit) %>%
summarise(repprop = sum(pi)) # adding mixing proportions over collections in the repunit
# for individuals posteriors
rep_indiv_ests <- mix_est$indiv_posteriors %>%
group_by(mixture_collection, indiv, repunit) %>%
summarise(rep_pofz = sum(PofZ))
```
## Creating posterior density curves from the traces
The full MCMC output for the mixing proportions is available by default in the
field `$mix_prop_traces`. This
can be used to obtain an estimate of the posterior density of the mixing proportions.
Here we plot kernel density estimates for the 6 most abundant repunits from the
`rec1` fishery:
```{r plot-6}
# find the top 6 most abundant:
top6 <- rep_mix_ests %>%
filter(mixture_collection == "rec1") %>%
arrange(desc(repprop)) %>%
slice(1:6)
# check how many MCMC sweeps were done:
nsweeps <- max(mix_est$mix_prop_traces$sweep)
# keep only rec1, then discard the first 200 sweeps as burn-in,
# and then aggregate over reporting units
# and then keep only the top6 from above
trace_subset <- mix_est$mix_prop_traces %>%
filter(mixture_collection == "rec1", sweep > 200) %>%
group_by(sweep, repunit) %>%
summarise(repprop = sum(pi)) %>%
filter(repunit %in% top6$repunit)
# now we can plot those:
ggplot(trace_subset, aes(x = repprop, colour = repunit)) +
geom_density()
```
## Computing Credible Intervals from the Traces
Following on from the above example, we will use `trace_subset` to compute
the equal-tail 95% credible intervals for the 6 most abundant reporting units in the
`rec1` fishery:
```{r}
top6_cis <- trace_subset %>%
group_by(repunit) %>%
summarise(loCI = quantile(repprop, probs = 0.025),
hiCI = quantile(repprop, probs = 0.975))
top6_cis
```
## Assessing whether individuals are not from any of the reference populations
Sometimes totally unexpected things happen. One situation we saw in the California Chinook
fishery was samples coming to us that were actually coho salmon. Before we included coho salmon
in the reference sample, these coho always assigned quite
strongly to Alaska populations of Chinook, even though they don't really look like Chinook at all.
In this case, it is useful to look at the raw log-likelihood values computed for the
individual, rather than the scaled posterior probabilities, because aberrantly low values
of the genotype log-likelihood can indicate that something is wrong. However, the raw likelihood
that you get will depend on the number of missing loci, etc. `rubias` deals with this
by computing a _z-score_ for each fish. The Z-score is the Z statistic obtained
from the fish's log-likelihood (by subtracting from it the expected log-likelihood and
dividing by the expected standard deviation). `rubias`'s implementation of the z-score
accounts for the pattern of missing data, but it does this without all the simulation that
[`gsi_sim`](https://github.com/eriqande/gsi_sim) does. This makes it much, much faster---fast
enough that we can compute it by default for every fish and every population.
Here, we will look at the z-score computed for each fish to the population with the highest
posterior. (It is worth noting that you would **never** want to use the z-score to assign
fish to different populations---it is only there to decide whether it looks like it might
not have actually come from the population that it was assigned to, or any other population
in the reference data set.)
```{r plot-zs}
# get the maximum-a-posteriori population for each individual
map_rows <- mix_est$indiv_posteriors %>%
group_by(indiv) %>%
top_n(1, PofZ) %>%
ungroup()
```
If everything is kosher, then we expect that the z-scores we see will be roughly
normally distributed. We can compare the distribution of z-scores we see with
a bunch of simulated normal random variables.
```{r z-score-density}
normo <- tibble(z_score = rnorm(1e06))
ggplot(map_rows, aes(x = z_score)) +
geom_density(colour = "blue") +
geom_density(data = normo, colour = "black")
```
The normal density is in black and the distribution of our observed z_scores is in blue. They fit reasonably
well, suggesting that there is not too much weird stuff going on overall. (That is good!)
The z_score statistic is most useful as a quick check for aberrant individuals. If you see a
z-score to the maximum-a-posteriori population for an individual
in your mixture sample that is considerably less than the z_scores you saw in the reference,
then you might infer that the individual doesn't actually fit any of the
populations in the reference well.
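As a sketch of that check, using the `map_rows` tibble computed above (the cutoff of -3 is purely illustrative, not a package recommendation):

```{r, eval=FALSE}
# flag fish whose z-score to their maximum-a-posteriori collection
# is aberrantly low; -3 is just an illustrative threshold
suspect_fish <- map_rows %>%
  filter(z_score < -3) %>%
  select(indiv, collection, z_score)
suspect_fish
```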
## Individuals of known origin in the mixture
Here I include a small, contrived example. We use the `small_chinook` data set so that it goes
fast.
First, we analyze the data with no fish in the mixture of known collection
```{r}
no_kc <- infer_mixture(small_chinook_ref, small_chinook_mix, gen_start_col = 5)
```
And look at the results for the mixing proportions:
```{r}
no_kc$mixing_proportions %>%
arrange(mixture_collection, desc(pi))
```
Now, we will do the same analysis, but pretend that we know that the first 8 of the 36
fish in fishery rec1 are from the Deer_Cr_sp collection.
First we have to add the known_collection column to the reference.
```{r}
# make reference file that includes the known_collection column
kc_ref <- small_chinook_ref %>%
mutate(known_collection = collection) %>%
select(known_collection, everything())
# see what that looks like
kc_ref[1:10, 1:8]
```
Then we add the known collection column to the mixture. We start out making it all NAs, and
then we change that to Deer_Cr_sp for 8 of the rec1 fish:
```{r}
kc_mix <- small_chinook_mix %>%
mutate(known_collection = NA) %>%
select(known_collection, everything())
kc_mix$known_collection[kc_mix$collection == "rec1"][1:8] <- "Deer_Cr_sp"
# here is what that looks like now (dropping most of the genetic data columns)
kc_mix[1:20, 1:7]
```
And now we can do the mixture analysis:
```{r}
# note that the genetic data start in column 6 now
with_kc <- infer_mixture(kc_ref, kc_mix, 6)
```
And, when we look at the estimated proportions, we see that for rec1 they reflect
the fact that 8 of those fish were singled out as known fish from Deer_Cr_sp:
```{r}
with_kc$mixing_proportions %>%
arrange(mixture_collection, desc(pi))
```
The output from `infer_mixture()` in this case can be used just like it was before
without known individuals in the baseline.
## Fully Bayesian model (with updating of allele frequencies)
The default model in `rubias` is a conditional model in which inference is done
with the baseline allele counts fixed. In a fully Bayesian version, fish from within
the mixture that are allocated (on any particular step of the MCMC) to one of the reference
samples have their alleles added to that reference sample, thus (one hopes) refining the
estimate of allele frequencies in that sample. This is more computationally intensive and is
done using parallel computation, by default running one thread for every core on your machine.
The basic way to invoke the fully Bayesian model is to use the `infer_mixture` function with
the `method` option set to "BR". For example:
```{r, eval=FALSE}
full_model_results <- infer_mixture(
reference = chinook,
mixture = chinook_mix,
gen_start_col = 5,
method = "BR"
)
```
More details about different options for working with the fully Bayesian model are
available in the vignette about the fully Bayesian model.
# Assessment of Genetic References
## Self-assigning fish from the reference
A standard analysis in molecular ecology is to assign individuals in the reference
back to the collections in the reference using a _leave-one-out_ procedure.
This is taken care of by the `self_assign()` function.
```{r self-ass}
sa_chinook <- self_assign(reference = chinook, gen_start_col = 5)
```
Now, you can look at the self assignment results:
```{r self-ass-results}
head(sa_chinook, n = 100)
```
The `log_likelihood` is the log probability of the fish's genotype given it is from the `inferred_collection`
computed using leave-one-out. The `scaled_likelihood` is the posterior prob of assigning the fish to the
`inferred_collection` given an equal prior on every collection in the reference. Other columns are as
in the output for `infer_mixture()`. Note that the `z_score` computed here can be used to
assess the distribution of the `z_score` statistic for fish from known, reference populations. This can
be used to compare to values obtained in mixed fisheries.
The output can be summarized by repunit as was done above:
```{r summ2repu}
sa_to_repu <- sa_chinook %>%
group_by(indiv, collection, repunit, inferred_repunit) %>%
summarise(repu_scaled_like = sum(scaled_likelihood))
head(sa_to_repu, n = 200)
```
## Simulated mixtures using a leave-one-out type of approach
If you want to know how much accuracy you can expect given a set of genetic markers and
a grouping of populations (`collection`s) into reporting units (`repunit`s), there are two
different functions you might use:
1. `assess_reference_loo()`: This function carries out simulation of mixtures using the leave-one-out
approach of @Andersonetal2008.
2. `assess_reference_mc()`: This function breaks the reference data set into different subsets, one of which
is used as the reference data set and the other the mixture. It is difficult to simulate very large mixture
samples using this method, because it is constrained by the number of fish in the reference data set.
Additionally, there are constraints on the mixing proportions that can be simulated because of variation in the
number of fish from each collection in the reference.
Both of the functions take two required arguments: 1) a data frame of reference genetic data, and 2) the
number of the column in which the genetic data start.
Here we use the `chinook` data to simulate 50 mixture samples of size 200 fish
using the default values (Dirichlet parameters of 1.5 for each reporting unit, and
Dirichlet parameters of 1.5 for each collection within a reporting unit...)
```{r chin-sims, message=FALSE}
chin_sims <- assess_reference_loo(reference = chinook,
gen_start_col = 5,
reps = 50,
mixsize = 200)
```
Here is what the output looks like:
```{r show-chin-sims}
chin_sims
```
The columns here are:
- `repunit_scenario`: an integer giving the repunit simulation parameters (see below about simulating
multiple scenarios).
- `collections_scenario`: an integer giving the collection simulation parameters (see below about simulating
multiple scenarios).
- `iter` the simulation number (1 up to `reps`)
- `repunit` the reporting unit
- `collection` the collection
- `true_pi` the true simulated mixing proportion
- `n` the actual number of fish from the collection in the simulated mixture.
- `post_mean_pi` the posterior mean of mixing proportion.
- `mle_pi` the maximum likelihood of `pi` obtained using an EM-algorithm.
## Specifying mixture proportions in `assess_reference_loo()`
By default, each iteration, the proportions of fish from each reporting unit are simulated
from a Dirichlet distribution with parameter (1.5,...,1.5). And, within each reporting unit the mixing
proportions from different collections are
drawn from a Dirichlet distribution with parameter (1.5,...,1.5).
The value of 1.5 for the Dirichlet parameter for reporting units can be changed with the
`alpha_repunit` argument. The Dirichlet parameter for collections can be set using the `alpha_collection` argument.
Sometimes, however, more control over the composition of the simulated mixtures is desired. This is achieved
by passing a two-column _data.frame_ to either `alpha_repunit` or `alpha_collection` (or both). If you are
passing the data.frame in for `alpha_repunit`, the first column must be named `repunit` and it must contain
a character vector specifying reporting units. In the data.frame for `alpha_collection` the first column must be
named `collection` and must hold a character vector specifying different collections. It is an error if
a repunit or collection is specified that does not exist in the reference. However, you do not need to
specify a value for every reporting unit or collection. (If they are absent, the value is assumed to be zero.)
The second column of the data frame must be named `count`, `ppn`, or `dirichlet`. These specify, respectively:
1. the exact count of individuals to be simulated from each repunit (or collection);
2. the proportion of individuals from each repunit (or collection)---these `ppn`
values will be normalized to sum to one if they do not, so they can be regarded as weights;
3. the parameters of a Dirichlet distribution from which the proportions of individuals should be simulated.
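As a sketch of what each of the three forms looks like, here are hypothetical `alpha_collection` data frames built with `tribble()`, using two collection names that appear later in this document (the numbers are arbitrary):

```r
library(tibble)

# 1. exact counts: simulate exactly 30 Eel_R fish and 20 Russian_R fish
counts_df <- tribble(
  ~collection,  ~count,
  "Eel_R",      30,
  "Russian_R",  20
)

# 2. proportions: weights that get normalized to sum to one
ppn_df <- tribble(
  ~collection,  ~ppn,
  "Eel_R",      0.6,
  "Russian_R",  0.4
)

# 3. Dirichlet parameters: the proportions are drawn anew each iteration
dirichlet_df <- tribble(
  ~collection,  ~dirichlet,
  "Eel_R",      6,
  "Russian_R",  4
)
```

Any one of these could then be passed as the `alpha_collection` argument to `assess_reference_loo()`.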
Let's say that we want to simulate data that roughly have proportions like what we saw in the
Chinook `rec1` fishery. We have those estimates in the variable `top6`:
```{r top6list}
top6
```
We could, if we put those `repprop` values into a `ppn` column, simulate mixtures with
exactly those proportions. Or if we wanted to simulate exact numbers of fish in a sample
of `r sum(round(top6$repprop * 350))` fish, we could get those values like this:
```{r roundem}
round(top6$repprop * 350)
```
and then put them in a `count` column.
However, in this case, we want to simulate mixtures that look similar to the one we
estimated, but with some variation. For that we will want to supply Dirichlet random variable
parameters in a column named `dirichlet`. If we make the values proportional to the mixing proportions,
then, on average, that is what the simulated proportions will be. If the values are large, there will be little
variation between simulated mixtures; if the values are small, there will be lots of variation.
We'll scale them so that they sum to 10---that should give some variation, but not too much. Accordingly, the
tibble that we pass in as the `alpha_repunit` parameter, which describes the variation in reporting unit
proportions we would like to simulate, looks like this:
```{r make-arep}
arep <- top6 %>%
ungroup() %>%
mutate(dirichlet = 10 * repprop) %>%
select(repunit, dirichlet)
arep
```
Let's do some simulations with those repunit parameters. By default, if we don't
specify anything extra for the _collections_, they get Dirichlet parameters of 1.5.
```{r chin-sim-top6, message=FALSE}
chin_sims_repu_top6 <- assess_reference_loo(reference = chinook,
gen_start_col = 5,
reps = 50,
mixsize = 200,
alpha_repunit = arep)
```
Now, we can summarise the output by reporting unit...
```{r summ-top6}
# now, call those repunits that we did not specify in arep "OTHER"
# and then sum up over reporting units
tmp <- chin_sims_repu_top6 %>%
mutate(repunit = ifelse(repunit %in% arep$repunit, repunit, "OTHER")) %>%
group_by(iter, repunit) %>%
summarise(true_repprop = sum(true_pi),
reprop_posterior_mean = sum(post_mean_pi),
repu_n = sum(n)) %>%
mutate(repu_n_prop = repu_n / sum(repu_n))
```
...and then plot it for the values we are interested in:
```{r plot-top6}
# then plot them
ggplot(tmp, aes(x = true_repprop, y = reprop_posterior_mean, colour = repunit)) +
geom_point() +
geom_abline(intercept = 0, slope = 1) +
facet_wrap(~ repunit)
```
Or we can plot against the `n` value, which is the actual number of fish from each
reporting unit in the sample:
```{r plot-top6-n}
ggplot(tmp, aes(x = repu_n_prop, y = reprop_posterior_mean, colour = repunit)) +
geom_point() +
geom_abline(intercept = 0, slope = 1) +
facet_wrap(~ repunit)
```
## Retrieving the individual simulated fish posteriors
Quite often you might be curious about how much you can expect to be able to trust
the posterior for individual fish from a mixture like this. You can retrieve all
the posteriors computed for the fish simulated in `assess_reference_loo()` using the
`return_indiv_posteriors` option. When you do this, the function returns a list
with components `mixture_proportions` (which holds a tibble like `chin_sims_repu_top6`
in the previous section) and `indiv_posteriors`, which holds all the posteriors (`PofZ`s) for
the simulated individuals.
```{r chin-sim-top6-indiv, message=FALSE}
set.seed(100)
chin_sims_with_indivs <- assess_reference_loo(reference = chinook,
gen_start_col = 5,
reps = 50,
mixsize = 200,
alpha_repunit = arep,
return_indiv_posteriors = TRUE)
# print out the indiv posteriors
chin_sims_with_indivs$indiv_posteriors
```
In this tibble:
- `indiv` is an integer specifier of the simulated individual
- `simulated_repunit` is the reporting unit the individual was simulated from
- `simulated_collection` is the collection the simulated genotype came from
- `PofZ` is the mean over the MCMC of the posterior probability that the
individual originated from the collection.
Now that we have done that, we can see what the distribution of posteriors
to the correct reporting unit is for fish from the different simulated collections.
We'll do that with a boxplot, coloring by repunit:
```{r boxplot-pofz-indiv-sim, fig.width=9}
# summarise things
repu_pofzs <- chin_sims_with_indivs$indiv_posteriors %>%
filter(repunit == simulated_repunit) %>%
group_by(iter, indiv, simulated_collection, repunit) %>% # first aggregate over reporting units
summarise(repu_PofZ = sum(PofZ)) %>%
ungroup() %>%
arrange(repunit, simulated_collection) %>%
mutate(simulated_collection = factor(simulated_collection, levels = unique(simulated_collection)))
# also get the number of simulated individuals from each collection
num_simmed <- chin_sims_with_indivs$indiv_posteriors %>%
group_by(iter, indiv) %>%
slice(1) %>%
ungroup() %>%
count(simulated_collection)
# note, the last few steps make simulated collection a factor so that collections within
# the same repunit are grouped together in the plot.
# now, plot it
ggplot(repu_pofzs, aes(x = simulated_collection, y = repu_PofZ)) +
geom_boxplot(aes(colour = repunit)) +
geom_text(data = num_simmed, mapping = aes(y = 1.025, label = n), angle = 90, hjust = 0, vjust = 0.5, size = 3) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 9, vjust = 0.5)) +
ylim(c(NA, 1.05))
```
Great. That is helpful.
## Changing the resampling unit
By default, individuals are simulated in `assess_reference_loo()` by resampling full
multilocus genotypes. This tends to be more realistic, because it carries into the
simulations all the missing data patterns of individuals in the reference. However, because all the genes of an
individual that has been incorrectly placed in a reference collection stay together,
that individual might have a low value of PofZ for the population it was simulated
from. Due to the latter issue, this method might also yield a more pessimistic assessment
of the power for GSI.
An alternative is to resample over gene copies---the
CV-GC method of @Andersonetal2008.
Let us do that and see how the simulated PofZ results change. Here we do the simulations...
```{r, message=FALSE}
set.seed(101) # for reproducibility
# do the simulation
chin_sims_by_gc <- assess_reference_loo(reference = chinook,
gen_start_col = 5,
reps = 50,
mixsize = 200,
alpha_repunit = arep,
return_indiv_posteriors = TRUE,
resampling_unit = "gene_copies")
```
and here we process the output and plot it:
```{r boxplot-pofz-gc-sim, fig.width=9}
# summarise things
repu_pofzs_gc <- chin_sims_by_gc$indiv_posteriors %>%
filter(repunit == simulated_repunit) %>%
group_by(iter, indiv, simulated_collection, repunit) %>% # first aggregate over reporting units
summarise(repu_PofZ = sum(PofZ)) %>%
ungroup() %>%
arrange(repunit, simulated_collection) %>%
mutate(simulated_collection = factor(simulated_collection, levels = unique(simulated_collection)))
# also get the number of simulated individuals from each collection
num_simmed_gc <- chin_sims_by_gc$indiv_posteriors %>%
group_by(iter, indiv) %>%
slice(1) %>%
ungroup() %>%
count(simulated_collection)
# note, the last few steps make simulated collection a factor so that collections within
# the same repunit are grouped together in the plot.
# now, plot it
ggplot(repu_pofzs_gc, aes(x = simulated_collection, y = repu_PofZ)) +
geom_boxplot(aes(colour = repunit)) +
geom_text(data = num_simmed_gc, mapping = aes(y = 1.025, label = n), angle = 90, hjust = 0, vjust = 0.5, size = 3) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 9, vjust = 0.5)) +
ylim(c(NA, 1.05))
```
And in that, we find somewhat fewer fish with low posteriors, but there are
still some. This reminds us that, with this data set, it is occasionally possible to get
individuals carrying genotypes that make it difficult to correctly assign them to a reporting unit.
## "Sub-specifying" collection proportions or Dirichlet parameters
If you are simulating the reporting unit proportions or numbers, and want to
have more control over which collections those fish are simulated from, within
the reporting units, then the `sub_ppn` and `sub_dirichlet` settings are for you.
These are given as column names in the `alpha_collection` data frame.
For example, let's say we want to simulate reporting unit proportions as before,
using `arep` from above:
```{r}
arep
```
But, now, let's say that within each reporting unit we want specific weights
for the different collections. Then we could specify those, for example, like
this:
```{r}
arep_subs <- tribble(
~collection, ~sub_ppn,
"Eel_R", 0.1,
"Russian_R", 0.9,
"Butte_Cr_fa", 0.7,
"Feather_H_sp", 0.3
)
```
Collections that are not listed are given equal proportions _within repunits that had no collections listed_.
However, if a collection is not listed, but other collections within its repunit are, then its
simulated proportion will be zero. (Technically, it is not zero, but it is so small---like $10^{-8}$---that it is
effectively 0. Doing it that way made the coding a lot easier...)
Now, we can simulate with that and see what the resulting proportion
of fish from each collection is:
```{r, message=FALSE}
chin_sims_sub_ppn <- assess_reference_loo(reference = chinook,
gen_start_col = 5,
reps = 50,
mixsize = 200,
alpha_repunit = arep,
alpha_collection = arep_subs,
return_indiv_posteriors = FALSE) # don't bother returning individual posteriors
```
Now observe the average proportions of the collections and repunits that were simulated, and
the average fraction, _within reporting units_, of each collection:
```{r}
chin_sims_sub_ppn %>%
group_by(repunit, collection) %>%
summarise(mean_pi = mean(true_pi)) %>%
group_by(repunit) %>%
mutate(repunit_mean_pi = sum(mean_pi),
fract_within = mean_pi / repunit_mean_pi) %>%
mutate(fract_within = ifelse(fract_within < 1e-06, 0, fract_within)) %>% # anything less than 1 in a million gets called 0
filter(repunit_mean_pi > 0.0)
```
## Multiple simulation scenarios and "100% Simulations"
In the fisheries world, "100% simulations" have been a staple. In these simulations,
mixtures are simulated in which 100% of the individuals are from one collection (or
reporting unit, I suppose). Eric has never been a big fan of these since they don't
necessarily tell you how you might do inferring actual mixtures that you might encounter.
Nonetheless, since they have been such a mainstay in the field, it is worthwile showing
how to do 100% simulations using `rubias`. Furthermore, when people asked for this feature
it made it clear that Eric had to provide a way to simulate multiple
different scenarios without re-processing the reference data set each time. So this is what
I came up with: the way we do it is to pass a _list_ of
scenarios to the `alpha_repunit` or `alpha_collection` option in `assess_reference_loo()`.
These can be named lists, if desired. So, for example, let's do 100% simulations for each
of the repunits in `arep`:
```{r}
arep$repunit
```
We will let the collections within them be drawn from a Dirichlet distribution
with parameter 10 (so, pretty close to equal proportions).
So, to do this, we make a list of data frames with the proportions. We'll give it some names
too:
```{r six-hundy1}
six_hundy_scenarios <- lapply(arep$repunit, function(x) tibble(repunit = x, ppn = 1.0))
names(six_hundy_scenarios) <- paste("All", arep$repunit, sep = "-")
```
Then, we use it, producing only 5 replicates for each scenario:
```{r six-hundy2}
repu_hundy_results <- assess_reference_loo(reference = chinook,
gen_start_col = 5,
reps = 5,
mixsize = 50,
alpha_repunit = six_hundy_scenarios,
alpha_collection = 10)
repu_hundy_results
```
### Do it again with 100% collections
Just to make sure that it is clear how to do this with collections (rather than reporting units),
let's do 100% simulations for a handful of the collections. We'll randomly take
5 of them and do 6 reps for each:
```{r hundy-colls1}
set.seed(10)
hundy_colls <- sample(unique(chinook$collection), 5)
hundy_colls
```
So, now make a list of those with 100% specifications in the tibbles:
```{r hundy-colls2}
hundy_coll_list <- lapply(hundy_colls, function(x) tibble(collection = x, ppn = 1.0)) %>%
setNames(paste("100%", hundy_colls, sep = "_"))
```
Then, do it:
```{r hundy-colls-do, message=FALSE}
hundy_coll_results <- assess_reference_loo(reference = chinook,
gen_start_col = 5,
reps = 6,
mixsize = 50,
alpha_collection = hundy_coll_list)
hundy_coll_results
```
# Bootstrap-Corrected Reporting Unit Proportions
These are obtained using `method = "PB"` in `infer_mixture()`. When invoked, this will return the
regular MCMC results as before, but will also populate the `bootstrapped_proportions` field
of the output. Doing so takes a little longer, computationally, because a good deal
of simulation is involved:
```{r infer_mixture_pb, eval=FALSE}
mix_est_pb <- infer_mixture(reference = chinook,
mixture = chinook_mix,
gen_start_col = 5,
method = "PB")
```
And now we can compare the estimates, showing here the 10 most prevalent
repunits, in the `rec1` fishery:
```{r, eval=FALSE}
mix_est_pb$mixing_proportions %>%
group_by(mixture_collection, repunit) %>%
summarise(repprop = sum(pi)) %>%
left_join(mix_est_pb$bootstrapped_proportions) %>%
ungroup() %>%
filter(mixture_collection == "rec1") %>%
arrange(desc(repprop)) %>%
slice(1:10)
```
It gives us a result that we expect: no appreciable difference, because the reporting units
are already very well resolved, so we don't expect that the parametric bootstrap procedure
would find any benefit in correcting them.
# References