**Last updated:** 2019-03-31

**Checks:** 6 0

**Knit directory:** `fiveMinuteStats/analysis/`

This reproducible R Markdown analysis was created with workflowr (version 1.2.0). The *Report* tab describes the reproducibility checks that were applied when the results were created. The *Past versions* tab lists the development history.

`set.seed(12345)`

was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use `wflow_publish`

or `wflow_git_commit`

). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:

```
Ignored files:
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: analysis/.Rhistory
Ignored: analysis/bernoulli_poisson_process_cache/
Untracked files:
Untracked: _workflowr.yml
Untracked: analysis/CI.Rmd
Untracked: analysis/gibbs_structure.Rmd
Untracked: analysis/libs/
Untracked: analysis/results.Rmd
Untracked: analysis/shiny/tester/
Untracked: docs/MH_intro_files/
Untracked: docs/citations.bib
Untracked: docs/hmm_files/
Untracked: docs/libs/
Untracked: docs/shiny/tester/
Unstaged changes:
Modified: analysis/BF_and_pvalue.Rmd
Modified: analysis/CI_conditional_coverage.Rmd
Modified: analysis/Importance_sampling.Rmd
Modified: analysis/LR_and_BF.Rmd
Modified: analysis/LR_error.Rmd
Modified: analysis/MH-examples1.Rmd
Modified: analysis/MH_intro.Rmd
Modified: analysis/_site.yml
Modified: analysis/approx_wright_fisher_model.Rmd
Modified: analysis/asymptotic_normality_mle.Rmd
Modified: analysis/bayes_beta_binomial.Rmd
Modified: analysis/bayes_conjugate.Rmd
Modified: analysis/bayes_conjugate_normal_mean.Rmd
Modified: analysis/bayes_independent.Rmd
Modified: analysis/bayes_multiclass.Rmd
Modified: analysis/bernoulli_poisson_process.Rmd
Modified: analysis/beta.Rmd
Deleted: analysis/chunks.R
Modified: analysis/citations.bib
Modified: analysis/decision_theory_bayes_rule.Rmd
Modified: analysis/decisions_costs_intro.Rmd
Modified: analysis/dirichlet.Rmd
Modified: analysis/ebnm_normal.Rmd
Modified: analysis/gibbs1.Rmd
Modified: analysis/gibbs2.Rmd
Modified: analysis/gibbs_structure_simple.Rmd
Modified: analysis/hmm.Rmd
Modified: analysis/index.Rmd
Modified: analysis/integral.Rmd
Modified: analysis/intro_to_em.Rmd
Modified: analysis/intro_to_mixture_models.Rmd
Modified: analysis/inverse_transform_sampling.Rmd
Modified: analysis/is_simple.Rmd
Modified: analysis/likelihood_combining_independent.Rmd
Modified: analysis/likelihood_do_dont.Rmd
Modified: analysis/likelihood_function.Rmd
Modified: analysis/likelihood_ratio_simple_continuous_data.Rmd
Modified: analysis/likelihood_ratio_simple_models.Rmd
Modified: analysis/loglikelihood.Rmd
Modified: analysis/markov_chains_discrete_intro.Rmd
Modified: analysis/markov_chains_discrete_stationary_dist.Rmd
Modified: analysis/markov_chains_time_reversibility.Rmd
Modified: analysis/model.Rmd
Modified: analysis/mvnorm.Rmd
Modified: analysis/normal_markov_chain.Rmd
Modified: analysis/poisson_process_time_dependent_thinning.Rmd
Modified: analysis/pvalue_difficult_calibrate_example.Rmd
Modified: analysis/shiny_normal_example.Rmd
Modified: analysis/simple_nonconjugate.Rmd
Modified: analysis/simulating_discrete_chains_1.Rmd
Modified: analysis/simulating_discrete_chains_2.Rmd
Modified: analysis/stationary_distribution.Rmd
Modified: analysis/summarize_interpret_posterior.Rmd
Modified: analysis/wilks.Rmd
Modified: fiveMinuteStats.Rproj
```

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the R Markdown and HTML files. If you’ve configured a remote Git repository (see `?wflow_git_remote`

), click on the hyperlinks in the table below to view them.

File | Version | Author | Date | Message |
---|---|---|---|---|

html | 34bcc51 | John Blischak | 2017-03-06 | Build site. |

Rmd | 5fbc8b5 | John Blischak | 2017-03-06 | Update workflowr project with wflow_update (version 0.4.0). |

Rmd | 391ba3c | John Blischak | 2017-03-06 | Remove front and end matter of non-standard templates. |

html | fb0f6e3 | stephens999 | 2017-03-03 | Merge pull request #33 from mdavy86/f/review |

html | 0713277 | stephens999 | 2017-03-03 | Merge pull request #31 from mdavy86/f/review |

Rmd | d674141 | Marcus Davy | 2017-02-27 | typos, refs |

html | c3b365a | John Blischak | 2017-01-02 | Build site. |

Rmd | 67a8575 | John Blischak | 2017-01-02 | Use external chunk to set knitr chunk options. |

Rmd | 5ec12c7 | John Blischak | 2017-01-02 | Use session-info chunk. |

Rmd | ed83285 | jhmarcus | 2016-04-07 | added popgen section to index and wright_fisher tutorials |

A basic knowledge of:

- introductory probability
- genetics terminology

- discrete-time Markov chains.

A major goal of population genetics is to understand the effect of various evolutionary forces, such as genetic drift, selection, mutation and migration, on the change in allele frequencies within a population. Named after early pioneers of theoretical population genetics, Sewall Wright and Ronald A. Fisher, the Wright-Fisher model describes the sampling of alleles in a population with no selection, no mutation, no migration, non-overlapping generation times and random mating. Of course, real populations in nature do not adhere to these assumptions, yet the Wright-Fisher model provides a tool for studying how introducing more complex evolutionary forces can effect a relatively simple model.

Let \(A\) and \(a\) denote two alleles segregating at a locus in a given population. The Wright-Fisher model is a discrete-time Markov chain that describes the evolution of the count of one of these alleles over time. Let \(X_t\) be the count of the \(A\) allele in a population with \(N\) diploid individuals at generation \(t\). The state space of this Markov chain is the set of possible counts of the \(A\) allele i.e. \(X_t \in \{0, 1, \dots, 2N\}\). Each generation, a collection of alleles are sampled, with replacement, from the current population at generation \(t\) to form a new population at generation \(t + 1\). This process describes the binomial sampling of alleles each generation, which allows us to write the probability transition matrix for the Markov chain as:

\[P_{ij} = {2N \choose j} (\frac{i}{2N})^{j} (1 - \frac{i}{2N})^{2N - j}\]

In other words the probability of transitioning from an allele count of \(i\), at generation \(t-1\), to an allele count of \(j\), at generation \(t\), can be computed from binomial probability mass function with size \(2N\) and success probability being equal to the frequency of the \(A\) allele at time \(t-1\):

\[X_{t} \mid X_{t-1} = x_{t-1} \sim Binomial(n = 2N, p = \frac{x_{t-1}}{2N})\]

One useful property to know is the expected value of \(X_t\) at any time throughout the process. This can be computed using the fact that the expectation of a binomial distribution is \(np\) and the law of total expectation:

\[E(X_t) = E(E(X_t \mid X_{t-1})) = E(2N \frac{X_{t-1}}{2N})\] \[= E(X_{t-1}) = E(E(X_{t-1} \mid X_{t-2})) = E(2N \frac{X_{t-2}}{2N})\] \[= E[X_{t-2}] = \dots = X_0\]

We see that the expected value of the Wright-Fisher process at any time-point is just the count of the \(A\) allele in the first generation. If \(Y_t\) is the frequency of the \(A\) allele at time \(t\) then the above result can be extended such that:

\[E(Y_t) = Y_0\]

It is a little bit harder to find the variance of \(X_t\) but it is useful to know and will be important for later tutorials. Recall from basic probability theory the definition of the variance:

\[Var(X_t) = E(X^2_t) - E(X_t)^2\]

We have shown about that \(E(X_t) = X_0\) thus:

\[Var(X_t) = E(X^2_t) - X^2_0\]

Using the law of total variance and the mean and variance of the binomial distribution:

\[E(X^2_{t}) = E(E(X^2_{t} \mid X_{t-1})) = E(Var(X_{t} \mid X_{t-1}) + E(X_{t} \mid X_{t-1})^2)\] \[ = E(2N\frac{X_{t-1}}{2N}(1 - \frac{X_{t-1}}{2N}) + (2N\frac{X_{t-1}}{2N})^2) \] \[ = E(X_{t-1} - \frac{X^2_{t-1}}{2N} + X^2_{t-1}) = E(X^2_{t-1} - \frac{X^2_{t-1}}{2N} + X_{t-1})\] \[= E(X^2_{t-1}) (1 - \frac{1}{2N}) + E(X_{t-1})\]

thus we can rewrite \(E(X^2_{t})\) as:

\[Var(X_{t}) + X^2_0 = (Var(X_{t-1}) + X^2_0) (1 - \frac{1}{2N}) + X_0\] \[= Var(X_{t-1})(1 - \frac{1}{2N}) + X^2_0 - \frac{X^2_0}{2N} + X_0\] \[\Rightarrow Var(X_{t}) = Var(X_{t-1})(1 - \frac{1}{2N}) + X_0(1 - \frac{X_0}{2N})\]

solving the above recurrence results in:

\[Var(X_{t}) = 2N X_0 (1 - \frac{X_0}{2N}) (1 - (1 - \frac{1}{2N})^t)\]

and finally in terms of the frequency of the \(A\) allele:

\[Var(Y_{t}) = Y_0 (1 - Y_0) (1 - (1 - \frac{1}{2N})^t)\]

Using the above moments we can compute some interesting features of the Markov chain which also have important biological relevance. Recall that absorbing states of a Markov chain are the states, once entered, can never be exited. More formally, if \(X^{*}_t\) is an absorbing state then:

\[P_{\{i = {X^{*}_t}\}j} = 0\]

In the Wright-Fisher model we can intuitively see that the fixation or loss of an allele are absorbing states i.e. if all of the individuals in the population carry two copies of the \(A\) allele or \(a\) allele the allele of the other type cannot be sampled without mutation or migration. The absorbing states in the Wright-Fisher model are \(X^{*}_t \in \{0, 2N\}\). We can compute the probability of fixation of the \(A\) allele, transitioning to the absorbing state \(X^{*}_{t+1} = 2N\) from any other state, using the conditional expectation described above:

\[E(X^{*}_{t+1} \mid X_t = i) = 2NP(X_{t+1} = 2N \mid X_t = i) \] \[\Rightarrow i = 2NP(X_{t+1} = 2N \mid X_{t} = i)\] \[\Rightarrow P(X_{t+1} = 2N \mid X_{t} = i) = \frac{i}{2N}\]

Therefore we can reciprocally write the probability of losing the \(A\) allele (fixation of the \(a\) allele):

\[ P(X_{t+1} = 0 \mid X_{t} = i) = 1 - P(X_{t+1} = 2N \mid X_{t} = i) = 1 - \frac{i}{2N}\]

We can see that the probability of fixation or loss of an allele in a pure drift Wright-Fisher model only depends on the previous count of the allele and the effective population size.

Visualization of the probability transition matrix can provide some intuition on the how the process can proceed given different starting points.

```
library(ggplot2)
library(dplyr)
library(tidyr)
library(viridis)
N <- 10 # effective population size
# fill up the probility transition matrix
P <- matrix(NA, ncol = 2*N + 1, 2*N + 1)
P_df <- data.frame()
for(i in 0:(2*N)){
for(j in 0:(2*N)){
# add to matrix for examples
P[i+1, j+1] <- dbinom(j, 2*N, (i / (2*N)))
# add to data.frame for viz below
P_df <- bind_rows(P_df, data.frame(i = i, j = j, p = P[i+1, j+1]))
}
}
# plot it up!
p <- ggplot(P_df, aes(x = i, y = j, fill = p)) +
geom_tile() + scale_fill_viridis(option="D")
p
```

Version | Author | Date |
---|---|---|

c3b365a | John Blischak | 2017-01-02 |

Lets simulate a grid of parameter values to explore the concepts introduced above:

```
# data.frame to be filled
wf_df <- data.frame()
# effective population sizes
sizes <- c(50, 100, 1000, 5000)
# starting allele frequencies
starting_p <- c(.01, .1, .5, .8)
# number of generations
n_gen <- 100
# number of replicates per simulation
n_reps <- 50
# run the simulations
for(N in sizes){
for(p in starting_p){
p0 <- p
for(j in 1:n_gen){
X <- rbinom(n_reps, 2*N, p)
p <- X / (2*N)
rows <- data.frame(replicate = 1:n_reps, N = rep(N, n_reps),
gen = rep(j, n_reps), p0 = rep(p0, n_reps),
p = p)
wf_df <- bind_rows(wf_df, rows)
}
}
}
# plot it up!
p <- ggplot(wf_df, aes(x = gen, y = p, group = replicate)) +
geom_path(alpha = .5) + facet_grid(N ~ p0) + guides(colour=FALSE)
p
```

Version | Author | Date |
---|---|---|

c3b365a | John Blischak | 2017-01-02 |

The horizontal facet is the starting frequency and the vertical facet is the effective population size. We see that as we increase the population size the less noisy the allele frequency trajectory as expected from our derivation. Additionally we see that probability of fixation or loss is more likely when lower and higher starting allele frequencies (again as expected).

`sessionInfo()`

```
R version 3.5.2 (2018-12-20)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] viridis_0.5.1 viridisLite_0.3.0 tidyr_0.8.3 dplyr_0.8.0.1
[5] ggplot2_3.1.0
loaded via a namespace (and not attached):
[1] Rcpp_1.0.0 compiler_3.5.2 pillar_1.3.1 git2r_0.24.0
[5] plyr_1.8.4 workflowr_1.2.0 tools_3.5.2 digest_0.6.18
[9] evaluate_0.12 tibble_2.0.1 gtable_0.2.0 pkgconfig_2.0.2
[13] rlang_0.3.1 yaml_2.2.0 xfun_0.4 gridExtra_2.3
[17] withr_2.1.2 stringr_1.3.1 knitr_1.21 fs_1.2.6
[21] rprojroot_1.3-2 grid_3.5.2 tidyselect_0.2.5 glue_1.3.0
[25] R6_2.3.0 rmarkdown_1.11 reshape2_1.4.3 purrr_0.3.1
[29] magrittr_1.5 whisker_0.3-2 backports_1.1.3 scales_1.0.0
[33] htmltools_0.3.6 assertthat_0.2.0 colorspace_1.4-0 labeling_0.3
[37] stringi_1.2.4 lazyeval_0.2.1 munsell_0.5.0 crayon_1.3.4
```

This site was created with R Markdown