Last updated: 2019-03-31

Checks: 6 0

Knit directory: fiveMinuteStats/analysis/

This reproducible R Markdown analysis was created with workflowr (version 1.2.0). The Report tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

The command set.seed(12345) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
Ignored:    .Rhistory
Ignored:    .Rproj.user/
Ignored:    analysis/.Rhistory
Ignored:    analysis/bernoulli_poisson_process_cache/

Untracked files:
Untracked:  _workflowr.yml
Untracked:  analysis/CI.Rmd
Untracked:  analysis/gibbs_structure.Rmd
Untracked:  analysis/libs/
Untracked:  analysis/results.Rmd
Untracked:  analysis/shiny/tester/
Untracked:  docs/MH_intro_files/
Untracked:  docs/citations.bib
Untracked:  docs/hmm_files/
Untracked:  docs/libs/
Untracked:  docs/shiny/tester/

Unstaged changes:
Modified:   analysis/BF_and_pvalue.Rmd
Modified:   analysis/CI_conditional_coverage.Rmd
Modified:   analysis/Importance_sampling.Rmd
Modified:   analysis/LR_and_BF.Rmd
Modified:   analysis/LR_error.Rmd
Modified:   analysis/MH-examples1.Rmd
Modified:   analysis/MH_intro.Rmd
Modified:   analysis/_site.yml
Modified:   analysis/approx_wright_fisher_model.Rmd
Modified:   analysis/asymptotic_normality_mle.Rmd
Modified:   analysis/bayes_beta_binomial.Rmd
Modified:   analysis/bayes_conjugate.Rmd
Modified:   analysis/bayes_conjugate_normal_mean.Rmd
Modified:   analysis/bayes_independent.Rmd
Modified:   analysis/bayes_multiclass.Rmd
Modified:   analysis/bernoulli_poisson_process.Rmd
Modified:   analysis/beta.Rmd
Deleted:    analysis/chunks.R
Modified:   analysis/citations.bib
Modified:   analysis/decision_theory_bayes_rule.Rmd
Modified:   analysis/decisions_costs_intro.Rmd
Modified:   analysis/dirichlet.Rmd
Modified:   analysis/ebnm_normal.Rmd
Modified:   analysis/gibbs1.Rmd
Modified:   analysis/gibbs2.Rmd
Modified:   analysis/gibbs_structure_simple.Rmd
Modified:   analysis/hmm.Rmd
Modified:   analysis/index.Rmd
Modified:   analysis/integral.Rmd
Modified:   analysis/intro_to_em.Rmd
Modified:   analysis/intro_to_mixture_models.Rmd
Modified:   analysis/inverse_transform_sampling.Rmd
Modified:   analysis/is_simple.Rmd
Modified:   analysis/likelihood_combining_independent.Rmd
Modified:   analysis/likelihood_do_dont.Rmd
Modified:   analysis/likelihood_function.Rmd
Modified:   analysis/likelihood_ratio_simple_continuous_data.Rmd
Modified:   analysis/likelihood_ratio_simple_models.Rmd
Modified:   analysis/loglikelihood.Rmd
Modified:   analysis/markov_chains_discrete_intro.Rmd
Modified:   analysis/markov_chains_discrete_stationary_dist.Rmd
Modified:   analysis/markov_chains_time_reversibility.Rmd
Modified:   analysis/model.Rmd
Modified:   analysis/mvnorm.Rmd
Modified:   analysis/normal_markov_chain.Rmd
Modified:   analysis/poisson_process_time_dependent_thinning.Rmd
Modified:   analysis/pvalue_difficult_calibrate_example.Rmd
Modified:   analysis/shiny_normal_example.Rmd
Modified:   analysis/simple_nonconjugate.Rmd
Modified:   analysis/simulating_discrete_chains_1.Rmd
Modified:   analysis/simulating_discrete_chains_2.Rmd
Modified:   analysis/stationary_distribution.Rmd
Modified:   analysis/summarize_interpret_posterior.Rmd
Modified:   analysis/wilks.Rmd
Modified:   fiveMinuteStats.Rproj



Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the R Markdown and HTML files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view them.

File Version Author Date Message
html 34bcc51 John Blischak 2017-03-06 Build site.
Rmd 5fbc8b5 John Blischak 2017-03-06 Update workflowr project with wflow_update (version 0.4.0).
Rmd 391ba3c John Blischak 2017-03-06 Remove front and end matter of non-standard templates.
html fb0f6e3 stephens999 2017-03-03 Merge pull request #33 from mdavy86/f/review
html 0713277 stephens999 2017-03-03 Merge pull request #31 from mdavy86/f/review
Rmd d674141 Marcus Davy 2017-02-27 typos, refs
html c3b365a John Blischak 2017-01-02 Build site.
Rmd 67a8575 John Blischak 2017-01-02 Use external chunk to set knitr chunk options.
Rmd 5ec12c7 John Blischak 2017-01-02 Use session-info chunk.
Rmd ed83285 jhmarcus 2016-04-07 added popgen section to index and wright_fisher tutorials

# Pre-requisites

A basic knowledge of:

• introductory probability
• genetics terminology
• discrete-time Markov chains.

# Overview

A major goal of population genetics is to understand the effect of various evolutionary forces, such as genetic drift, selection, mutation and migration, on the change in allele frequencies within a population. Named after early pioneers of theoretical population genetics, Sewall Wright and Ronald A. Fisher, the Wright-Fisher model describes the sampling of alleles in a population with no selection, no mutation, no migration, non-overlapping generation times and random mating. Of course, real populations in nature do not adhere to these assumptions, yet the Wright-Fisher model provides a tool for studying how introducing more complex evolutionary forces can effect a relatively simple model.

# Definition

Let $$A$$ and $$a$$ denote two alleles segregating at a locus in a given population. The Wright-Fisher model is a discrete-time Markov chain that describes the evolution of the count of one of these alleles over time. Let $$X_t$$ be the count of the $$A$$ allele in a population with $$N$$ diploid individuals at generation $$t$$. The state space of this Markov chain is the set of possible counts of the $$A$$ allele i.e. $$X_t \in \{0, 1, \dots, 2N\}$$. Each generation, a collection of alleles are sampled, with replacement, from the current population at generation $$t$$ to form a new population at generation $$t + 1$$. This process describes the binomial sampling of alleles each generation, which allows us to write the probability transition matrix for the Markov chain as:

$P_{ij} = {2N \choose j} (\frac{i}{2N})^{j} (1 - \frac{i}{2N})^{2N - j}$

In other words the probability of transitioning from an allele count of $$i$$, at generation $$t-1$$, to an allele count of $$j$$, at generation $$t$$, can be computed from binomial probability mass function with size $$2N$$ and success probability being equal to the frequency of the $$A$$ allele at time $$t-1$$:

$X_{t} \mid X_{t-1} = x_{t-1} \sim Binomial(n = 2N, p = \frac{x_{t-1}}{2N})$

### Properties

#### Mean

One useful property to know is the expected value of $$X_t$$ at any time throughout the process. This can be computed using the fact that the expectation of a binomial distribution is $$np$$ and the law of total expectation:

$E(X_t) = E(E(X_t \mid X_{t-1})) = E(2N \frac{X_{t-1}}{2N})$ $= E(X_{t-1}) = E(E(X_{t-1} \mid X_{t-2})) = E(2N \frac{X_{t-2}}{2N})$ $= E[X_{t-2}] = \dots = X_0$

We see that the expected value of the Wright-Fisher process at any time-point is just the count of the $$A$$ allele in the first generation. If $$Y_t$$ is the frequency of the $$A$$ allele at time $$t$$ then the above result can be extended such that:

$E(Y_t) = Y_0$

#### Variance

It is a little bit harder to find the variance of $$X_t$$ but it is useful to know and will be important for later tutorials. Recall from basic probability theory the definition of the variance:

$Var(X_t) = E(X^2_t) - E(X_t)^2$

We have shown about that $$E(X_t) = X_0$$ thus:

$Var(X_t) = E(X^2_t) - X^2_0$

Using the law of total variance and the mean and variance of the binomial distribution:

$E(X^2_{t}) = E(E(X^2_{t} \mid X_{t-1})) = E(Var(X_{t} \mid X_{t-1}) + E(X_{t} \mid X_{t-1})^2)$ $= E(2N\frac{X_{t-1}}{2N}(1 - \frac{X_{t-1}}{2N}) + (2N\frac{X_{t-1}}{2N})^2)$ $= E(X_{t-1} - \frac{X^2_{t-1}}{2N} + X^2_{t-1}) = E(X^2_{t-1} - \frac{X^2_{t-1}}{2N} + X_{t-1})$ $= E(X^2_{t-1}) (1 - \frac{1}{2N}) + E(X_{t-1})$

thus we can rewrite $$E(X^2_{t})$$ as:

$Var(X_{t}) + X^2_0 = (Var(X_{t-1}) + X^2_0) (1 - \frac{1}{2N}) + X_0$ $= Var(X_{t-1})(1 - \frac{1}{2N}) + X^2_0 - \frac{X^2_0}{2N} + X_0$ $\Rightarrow Var(X_{t}) = Var(X_{t-1})(1 - \frac{1}{2N}) + X_0(1 - \frac{X_0}{2N})$

solving the above recurrence results in:

$Var(X_{t}) = 2N X_0 (1 - \frac{X_0}{2N}) (1 - (1 - \frac{1}{2N})^t)$

and finally in terms of the frequency of the $$A$$ allele:

$Var(Y_{t}) = Y_0 (1 - Y_0) (1 - (1 - \frac{1}{2N})^t)$

#### Fixation probabilities

Using the above moments we can compute some interesting features of the Markov chain which also have important biological relevance. Recall that absorbing states of a Markov chain are the states, once entered, can never be exited. More formally, if $$X^{*}_t$$ is an absorbing state then:

$P_{\{i = {X^{*}_t}\}j} = 0$

In the Wright-Fisher model we can intuitively see that the fixation or loss of an allele are absorbing states i.e. if all of the individuals in the population carry two copies of the $$A$$ allele or $$a$$ allele the allele of the other type cannot be sampled without mutation or migration. The absorbing states in the Wright-Fisher model are $$X^{*}_t \in \{0, 2N\}$$. We can compute the probability of fixation of the $$A$$ allele, transitioning to the absorbing state $$X^{*}_{t+1} = 2N$$ from any other state, using the conditional expectation described above:

$E(X^{*}_{t+1} \mid X_t = i) = 2NP(X_{t+1} = 2N \mid X_t = i)$ $\Rightarrow i = 2NP(X_{t+1} = 2N \mid X_{t} = i)$ $\Rightarrow P(X_{t+1} = 2N \mid X_{t} = i) = \frac{i}{2N}$

Therefore we can reciprocally write the probability of losing the $$A$$ allele (fixation of the $$a$$ allele):

$P(X_{t+1} = 0 \mid X_{t} = i) = 1 - P(X_{t+1} = 2N \mid X_{t} = i) = 1 - \frac{i}{2N}$

We can see that the probability of fixation or loss of an allele in a pure drift Wright-Fisher model only depends on the previous count of the allele and the effective population size.

# Examples

## Visualizing $$P_{ij}$$

Visualization of the probability transition matrix can provide some intuition on the how the process can proceed given different starting points.

library(ggplot2)
library(dplyr)
library(tidyr)
library(viridis)

N <- 10 # effective population size

# fill up the probility transition matrix
P <- matrix(NA, ncol = 2*N + 1, 2*N + 1)
P_df <- data.frame()
for(i in 0:(2*N)){
for(j in 0:(2*N)){
# add to matrix for examples
P[i+1, j+1] <- dbinom(j, 2*N, (i / (2*N)))

# add to data.frame for viz below
P_df <- bind_rows(P_df, data.frame(i = i, j = j, p = P[i+1, j+1]))
}
}

# plot it up!
p <- ggplot(P_df, aes(x = i, y = j, fill = p)) +
geom_tile() + scale_fill_viridis(option="D")
p

Version Author Date
c3b365a John Blischak 2017-01-02

## Simulating the Wright-Fisher Model

Lets simulate a grid of parameter values to explore the concepts introduced above:

# data.frame to be filled
wf_df <- data.frame()

# effective population sizes
sizes <- c(50, 100, 1000, 5000)

# starting allele frequencies
starting_p <- c(.01, .1, .5, .8)

# number of generations
n_gen <- 100

# number of replicates per simulation
n_reps <- 50

# run the simulations
for(N in sizes){
for(p in starting_p){
p0 <- p
for(j in 1:n_gen){
X <- rbinom(n_reps, 2*N, p)
p <- X / (2*N)
rows <- data.frame(replicate = 1:n_reps, N = rep(N, n_reps),
gen = rep(j, n_reps), p0 = rep(p0, n_reps),
p = p)
wf_df <- bind_rows(wf_df, rows)
}
}
}

# plot it up!
p <- ggplot(wf_df, aes(x = gen, y = p, group = replicate)) +
geom_path(alpha = .5) + facet_grid(N ~ p0) + guides(colour=FALSE)
p

Version Author Date
c3b365a John Blischak 2017-01-02

The horizontal facet is the starting frequency and the vertical facet is the effective population size. We see that as we increase the population size the less noisy the allele frequency trajectory as expected from our derivation. Additionally we see that probability of fixation or loss is more likely when lower and higher starting allele frequencies (again as expected).

sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.1

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] viridis_0.5.1     viridisLite_0.3.0 tidyr_0.8.3       dplyr_0.8.0.1
[5] ggplot2_3.1.0

loaded via a namespace (and not attached):
[1] Rcpp_1.0.0       compiler_3.5.2   pillar_1.3.1     git2r_0.24.0
[5] plyr_1.8.4       workflowr_1.2.0  tools_3.5.2      digest_0.6.18
[9] evaluate_0.12    tibble_2.0.1     gtable_0.2.0     pkgconfig_2.0.2
[13] rlang_0.3.1      yaml_2.2.0       xfun_0.4         gridExtra_2.3
[17] withr_2.1.2      stringr_1.3.1    knitr_1.21       fs_1.2.6
[21] rprojroot_1.3-2  grid_3.5.2       tidyselect_0.2.5 glue_1.3.0
[25] R6_2.3.0         rmarkdown_1.11   reshape2_1.4.3   purrr_0.3.1
[29] magrittr_1.5     whisker_0.3-2    backports_1.1.3  scales_1.0.0
[33] htmltools_0.3.6  assertthat_0.2.0 colorspace_1.4-0 labeling_0.3
[37] stringi_1.2.4    lazyeval_0.2.1   munsell_0.5.0    crayon_1.3.4    

This site was created with R Markdown