Last updated: 2019-03-31


Pre-requisites

You should be familiar with Bayesian inference for a continuous parameter.

Summary

Suppose we want to do inference for multiple parameters, and suppose that the data that are informative for each parameter are independent. Then, provided the prior distributions on these parameters are independent, the posterior distributions are also independent. This is useful because it means we can do Bayesian inference for all the parameters by doing the inference for each parameter separately.

Overview

Suppose we have data \(D_1\) that depend on parameter \(\theta_1\), and independent data \(D_2\) that depend on a second parameter \(\theta_2\). That is, suppose that the joint distribution of the data \((D_1,D_2)\) factorizes as \[p(D_1,D_2 | \theta_1, \theta_2) = p(D_1 | \theta_1)p(D_2 | \theta_2).\]

Now assume that our prior distribution on \((\theta_1,\theta_2)\) has the property that \(\theta_1\) and \(\theta_2\) are independent. (This is sometimes expressed by saying that “\(\theta_1\) and \(\theta_2\) are a priori independent”.) Intuitively, this independence assumption means that telling you \(\theta_1\) would not tell you anything about \(\theta_2\). Mathematically, the independence assumption means that the prior distribution for \((\theta_1,\theta_2)\) factorizes as \[p(\theta_1,\theta_2) = p(\theta_1)p(\theta_2).\]

Applying Bayes' theorem, we have

\[\begin{align} p(\theta_1, \theta_2 | D_1,D_2) & \propto p(D_1, D_2 | \theta_1, \theta_2) p(\theta_1, \theta_2) \\ & = p(D_1 | \theta_1) p(D_2 | \theta_2) p(\theta_1) p(\theta_2) \\ & = p(D_1 | \theta_1)p(\theta_1) \, p(D_2 | \theta_2) p(\theta_2) \\ & \propto p(\theta_1 | D_1) \, p(\theta_2 | D_2) \end{align}\]

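That is, the posterior distribution on \((\theta_1,\theta_2)\) factorizes into independent parts \(p(\theta_1 | D_1)\) and \(p(\theta_2 | D_2)\). Because both sides are probability distributions over \((\theta_1,\theta_2)\) that integrate to 1, the final proportionality is in fact an equality. We say “\(\theta_1\) and \(\theta_2\) are a posteriori independent”.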
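As a quick numerical check of this factorization, the following minimal sketch computes the joint posterior directly on a grid and compares it with the product of the two separately computed posteriors. Everything specific here is an illustrative assumption rather than part of the argument above: the binomial likelihood, the grid of parameter values, the data `d1` and `d2`, and the two priors are all hypothetical.

```python
from math import comb


def binomial_likelihood(successes, trials, theta):
    """p(D | theta) when D is `successes` out of `trials` Bernoulli(theta) draws."""
    return comb(trials, successes) * theta**successes * (1 - theta) ** (trials - successes)


def normalize(weights):
    """Rescale non-negative weights so that they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


# Assumed setup (hypothetical): each parameter takes values on a discrete grid.
grid = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
d1 = (3, 10)   # D_1: 3 successes in 10 trials, depends only on theta_1
d2 = (12, 15)  # D_2: 12 successes in 15 trials, depends only on theta_2

# Independent priors: p(theta_1) proportional to theta_1, p(theta_2) uniform.
prior1 = normalize(grid)
prior2 = normalize([1.0] * len(grid))

# Separate posteriors p(theta_1 | D_1) and p(theta_2 | D_2).
post1 = normalize([binomial_likelihood(*d1, t) * p for t, p in zip(grid, prior1)])
post2 = normalize([binomial_likelihood(*d2, t) * p for t, p in zip(grid, prior2)])

# Joint posterior computed directly from joint likelihood times joint prior.
joint = [
    [
        binomial_likelihood(*d1, t1) * binomial_likelihood(*d2, t2) * p1 * p2
        for t2, p2 in zip(grid, prior2)
    ]
    for t1, p1 in zip(grid, prior1)
]
total = sum(sum(row) for row in joint)
joint = [[v / total for v in row] for row in joint]

# Largest discrepancy between the joint posterior and the product of the parts.
max_abs_diff = max(
    abs(joint[i][j] - post1[i] * post2[j])
    for i in range(len(grid))
    for j in range(len(grid))
)
print(max_abs_diff)  # expected to be zero up to floating-point rounding
```

The two priors are deliberately different from each other, to emphasize that the result needs the priors to be independent, not identical.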

Generalization

This result extends naturally from 2 parameters to \(J\) parameters. That is, if we have independent data sets \(D_1,\dots,D_J\) that depend on parameters \(\theta_1,\dots,\theta_J\), with \[p(D_1,\dots, D_J | \theta_1,\dots,\theta_J) = \prod_{j=1}^J p(D_j | \theta_j)\] and we assume independent priors \[p(\theta_1,\dots,\theta_J) = \prod_{j=1}^J p(\theta_j)\] then the posteriors also factorize \[p(\theta_1,\dots, \theta_J | D_1,\dots, D_J) = \prod_{j=1}^J p(\theta_j | D_j).\]

Example

Suppose we collect genetic data on \(n\) elephants at \(J\) locations along the genome (“loci”). Suppose that at each location there are two genetic types (“alleles”) that we label “0” and “1”. Our goal is to estimate the frequency of the “1” allele, \(q_j\), at each locus \(j=1,\dots,J\).

Let \(n_{ja}\) denote the number of alleles of type \(a\) observed at locus \(j\) (\(a \in \{0,1\}\), \(j \in \{1,2,\dots,J\}\)). Let \(n_j\) denote the data at locus \(j\) (so \(n_j = (n_{j0},n_{j1})\)) and let \(n\) denote the data at all \(J\) loci. (From here on, \(n\) refers to these data rather than to the number of elephants.)

Also let \(q\) denote the vector \((q_1,\dots,q_J)\).

Thus, \(n\) denotes the data and \(q\) denotes the unknown parameters. To do Bayesian inference for \(q\) we want to compute the posterior distribution \(p(q | n)\).

To apply the above results we must assume that:

1. data at different loci are independent, so \[p(n | q) = \prod_{j=1}^J p(n_j | q_j),\] and
2. the \(q_j\) are a priori independent. This would imply, for example, that telling you \(q_1\) (the frequency of the 1 allele at locus 1) would not tell you anything about \(q_2\) (the frequency of the 1 allele at locus 2).

In practice these are reasonable assumptions provided that the loci are well separated along the genome and the samples are taken from a well-mixing (“random-mating”) population of elephants without substructure.

Under these assumptions we have that the \(q_j\) are a posteriori independent, with \[p(q | n ) = \prod_j p(q_j | n_j).\]

Furthermore, we know from conjugacy that if the prior distribution on \(q_j\) is a Beta distribution, say \(q_j \sim \text{Beta}(a_j,b_j)\), then the posterior \(p(q_j | n_j)\) is also a Beta distribution, with \(q_j | n_j \sim \text{Beta}(a_j + n_{j1}, b_j + n_{j0})\).
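To make the per-locus update concrete, here is a minimal sketch that applies the conjugate Beta update at each locus separately and then draws one sample from the joint posterior \(p(q | n)\) by sampling each \(q_j\) independently. The allele counts, the uniform Beta(1, 1) priors, and the helper name `beta_posterior` are hypothetical choices made only for illustration.

```python
import random

# Hypothetical allele counts (n_j0, n_j1) at J = 3 loci.
counts = [(14, 6), (3, 17), (10, 10)]
# Hypothetical prior parameters (a_j, b_j); Beta(1, 1) is uniform on [0, 1].
priors = [(1.0, 1.0), (1.0, 1.0), (1.0, 1.0)]


def beta_posterior(prior, n_j):
    """Return (a_j + n_j1, b_j + n_j0), the Beta posterior parameters for q_j."""
    a, b = prior
    n_j0, n_j1 = n_j
    return (a + n_j1, b + n_j0)


# Because the q_j are a posteriori independent, each locus is handled on its own.
posteriors = [beta_posterior(prior, n_j) for prior, n_j in zip(priors, counts)]

# Posterior mean of q_j, using the Beta(a, b) mean a / (a + b).
post_means = [a / (a + b) for a, b in posteriors]

# One draw from the joint posterior: sample each q_j from its own Beta posterior.
rng = random.Random(1)
draw = [rng.betavariate(a, b) for a, b in posteriors]

for j, ((a, b), mean) in enumerate(zip(posteriors, post_means), start=1):
    print(f"locus {j}: q_j | n_j ~ Beta({a}, {b}), posterior mean {mean:.3f}")
```

No joint computation over all \(J\) loci is needed at any point: the work grows linearly with \(J\), which is the practical payoff of the independence result.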


Bayesian inference for multiple parameters under independence

Matthew Stephens

2017-02-19
