Last updated: 2017-03-06
Code version: 80315be
This is a very brief intro to some data from Sebastian Pott (a research associate here), which he kindly agreed to share with us for teaching purposes. The initial goal is to fit a Markov chain to some of these data, and compare it to a model where the data are independent. (I suspect that neither of these models will be a good fit to the data, although I have not actually tried it yet.)
The data related to methylation or non-methylation at positions along the genome. However, these are not standard methylation data - essentially the experimental protocol is such that locations where the DNA is more “open” are more likely to read out as “methylated”. So think of the methylation/non-methlation as a noisy measure of “open”/“closed”.
Sebastian processed the sequence data, aligning reads to the genome, to produce a bed file Test_region_NOM37_methylation_C_in_GpC_hg38_chr9_131946471_134832670.bed
. Columns 1-6 contain the position of the nucleotide in the genome. Column 7 shows the the number of reads with a methylated C in this position. Column 8 contains the total number of reads mapping to the correct strand at this position. So Column 7 should be less than Column 8, and the difference in the two columns is the number of unmethylated reads at that position.
x = read.table("../data/potts/Test_region_NOM37_methylation_C_in_GpC_hg38_chr9_131946471_134832670.bed",stringsAsFactors = FALSE, header=FALSE)
head(x)
V1 V2 V3 V4 V5 V6 V7 V8
1 chr9 131946692 131946693 region1 0 + 0 3
2 chr9 131946703 131946704 region1 0 + 0 3
3 chr9 131946722 131946723 region1 0 + 2 3
4 chr9 131946726 131946727 region1 0 + 0 1
5 chr9 131946749 131946750 region1 0 + 0 1
6 chr9 131946796 131946797 region1 0 + 0 2
#check that, as claimed, V7 is never larger than V8
sum(x$V7>x$V8)
[1] 0
#make tables of coverage and counts
table(x$V7)
0 1 2 3 4 5 6 7 8 9 10 11
62784 11464 1550 345 99 56 30 8 8 4 3 1
12 14 21
1 1 1
table(x$V8)
1 2 3 4 5 6 7 8 9 10 11 12
27285 18698 11944 7585 4652 2736 1633 857 469 249 122 59
13 14 15 17 20 21
36 21 5 2 1 1
The coverage (column 8) is pretty low, and for now - mostly for simplicity - we are going to ignore differences in coverage. So we’re just going to ignore column 8. And, again for simplicity, we are going to binarize column 7 (most observations are 0 or 1 anyway!). So here goes:
m = 1*(x$V7>0)
head(m,100)
[1] 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[36] 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0
[71] 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 1 1 0 1 0 0 0 0 0 0 1
OK, so now m
represents a very noisy binary observation of whether each position is “open” or not. The exercise here is to fit and compare likelihoods for models where i) elements of \(m\) are independent; b) \(m\) follows a Markov chain. (It would also be good form to do some more simple plots or summaries to familiarize yourself more with the data before fitting the models!)
Fit a 2-state (0/1) Markov Chain to the vector \(m\) by maximum likelihood. Here the parameter is the 2 by 2 transition matrix \(P\) between the two states 0 and 1. [To get the Markov chain started, assume that the first observation \(m_1\) is Bernoulli($), where \(\hat{q}\) is the parameter estimated above.]
Again, compute the log-likelihood value at the mle you have found. That is, if \(l_{markov}\) denotes the log-likelihood function and \(\hat{P}\) the mle for \(P\), compute \(l_{markov}(\hat{P}; m)\).
sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] workflowr_0.4.0 rmarkdown_1.3.9004
loaded via a namespace (and not attached):
[1] Rcpp_0.12.9 lattice_0.20-34 digest_0.6.12 rprojroot_1.2
[5] grid_3.3.2 backports_1.0.5 git2r_0.18.0 magrittr_1.5
[9] evaluate_0.10 stringi_1.1.2 geosphere_1.5-5 rstudioapi_0.6
[13] sp_1.2-4 whisker_0.3-2 tools_3.3.2 stringr_1.2.0
[17] yaml_2.1.14 htmltools_0.3.5 knitr_1.15.1
This R Markdown site was created with workflowr