Last updated: 2017-03-06

Code version: 80315be

This is a very brief intro to some data from Sebastian Pott (a research associate here), which he kindly agreed to share with us for teaching purposes. The initial goal is to fit a Markov chain to some of these data, and compare it to a model where the data are independent. (I suspect that neither of these models will be a good fit to the data, although I have not actually tried it yet.)

Read Data

The data relate to methylation or non-methylation at positions along the genome. However, these are not standard methylation data: the experimental protocol is such that locations where the DNA is more “open” are more likely to read out as “methylated”. So think of methylation/non-methylation as a noisy measure of “open”/“closed”.

Sebastian processed the sequence data, aligning reads to the genome, to produce a bed file Test_region_NOM37_methylation_C_in_GpC_hg38_chr9_131946471_134832670.bed. Columns 1-6 contain the position of the nucleotide in the genome. Column 7 shows the number of reads with a methylated C at this position. Column 8 contains the total number of reads mapping to the correct strand at this position. So Column 7 should never exceed Column 8, and the difference between the two columns is the number of unmethylated reads at that position.

x = read.table("../data/potts/Test_region_NOM37_methylation_C_in_GpC_hg38_chr9_131946471_134832670.bed",stringsAsFactors = FALSE, header=FALSE)
head(x)
    V1        V2        V3      V4 V5 V6 V7 V8
1 chr9 131946692 131946693 region1  0  +  0  3
2 chr9 131946703 131946704 region1  0  +  0  3
3 chr9 131946722 131946723 region1  0  +  2  3
4 chr9 131946726 131946727 region1  0  +  0  1
5 chr9 131946749 131946750 region1  0  +  0  1
6 chr9 131946796 131946797 region1  0  +  0  2
#check that, as claimed, V7 is never larger than V8
sum(x$V7>x$V8)
[1] 0
#make tables of coverage and counts
table(x$V7)

    0     1     2     3     4     5     6     7     8     9    10    11 
62784 11464  1550   345    99    56    30     8     8     4     3     1 
   12    14    21 
    1     1     1 
table(x$V8)

    1     2     3     4     5     6     7     8     9    10    11    12 
27285 18698 11944  7585  4652  2736  1633   857   469   249   122    59 
   13    14    15    17    20    21 
   36    21     5     2     1     1 

The coverage (column 8) is pretty low, and for now - mostly for simplicity - we are going to ignore differences in coverage; that is, we will simply drop column 8. And, again for simplicity, we are going to binarize column 7 (most observations are 0 or 1 anyway!). So here goes:

m = 1*(x$V7>0)
head(m,100)
  [1] 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 [36] 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0
 [71] 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 1 1 0 1 0 0 0 0 0 0 1

OK, so now m represents a very noisy binary observation of whether each position is “open” or not. The exercise here is to fit and compare likelihoods for two models: i) the elements of \(m\) are independent; ii) \(m\) follows a Markov chain. (It would also be good form to do some simple plots or summaries to familiarize yourself with the data before fitting the models; one possible starting point is sketched below.)
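As a hypothetical example of such exploration (not part of the original analysis), the code below computes the overall proportion of 1s in m, and plots how that proportion varies along the region in non-overlapping windows of 1,000 consecutive sites; the window size is an arbitrary choice.

# overall proportion of (noisily) "open" sites
mean(m)

# proportion of 1s in non-overlapping windows of 1000 consecutive sites,
# to get a rough sense of how "openness" varies along the region
window = 1000
nwin = floor(length(m)/window)
prop = sapply(1:nwin, function(i) mean(m[((i-1)*window + 1):(i*window)]))
plot(prop, type="l", xlab="window index (1000 sites each)", ylab="proportion of 1s")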

  1. Bernoulli model
    1. Fit a Bernoulli model to the data. That is, assuming the elements \(m_i\) are independent Bernoulli(\(q\)) for some \(q\), estimate \(q\) by maximum likelihood.
    2. If \(\hat{q}\) denotes the maximum likelihood estimate, and \(l_{bern}\) denotes the log-likelihood function, compute the value of \(l_{bern}(\hat{q}; m)\). (A possible approach is sketched after this list.)
  2. Markov Chain
    1. Fit a 2-state (0/1) Markov chain to the vector \(m\) by maximum likelihood. Here the parameter is the 2 by 2 transition matrix \(P\) between the two states 0 and 1. [To get the Markov chain started, assume that the first observation \(m_1\) is Bernoulli(\(\hat{q}\)), where \(\hat{q}\) is the parameter estimated above.]

    2. Again, compute the log-likelihood value at the mle you have found. That is, if \(l_{markov}\) denotes the log-likelihood function and \(\hat{P}\) the mle for \(P\), compute \(l_{markov}(\hat{P}; m)\). (Again, a possible approach is sketched after this list.)

  3. Compare: comment on how different the two log-likelihoods you have computed are.
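For part 1, the MLE of \(q\) under the independent Bernoulli model is the sample proportion of 1s, and the log-likelihood at \(\hat{q}\) follows directly from the counts of 0s and 1s. A minimal sketch (the variable names here are my own, not from the original analysis):

# MLE of q under the independent Bernoulli model: the sample proportion of 1s
q.hat = mean(m)

# log-likelihood at the MLE: l_bern(q.hat; m) = n1*log(q.hat) + n0*log(1 - q.hat)
n1 = sum(m == 1)
n0 = sum(m == 0)
l.bern = n1*log(q.hat) + n0*log(1 - q.hat)
l.bern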
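For parts 2 and 3, one way to proceed is to tabulate the observed transitions between consecutive elements of \(m\): the MLE of each row of \(P\) is the corresponding row of transition counts, normalized to sum to 1. A hedged sketch along these lines (it assumes the q.hat and l.bern objects defined above, and that all four transition counts are nonzero):

# counts of transitions m[t] -> m[t+1]; rows = current state, columns = next state
trans.counts = table(factor(m[-length(m)], levels = 0:1),
                     factor(m[-1], levels = 0:1))

# MLE of the transition matrix: normalize each row to sum to 1
P.hat = trans.counts / rowSums(trans.counts)
P.hat

# log-likelihood: first observation treated as Bernoulli(q.hat),
# then one term per observed transition
l.markov = dbinom(m[1], 1, q.hat, log = TRUE) + sum(trans.counts * log(P.hat))
l.markov

# for part 3, compare with the Bernoulli log-likelihood
l.markov - l.bern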

Session information

sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] workflowr_0.4.0    rmarkdown_1.3.9004

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.9     lattice_0.20-34 digest_0.6.12   rprojroot_1.2  
 [5] grid_3.3.2      backports_1.0.5 git2r_0.18.0    magrittr_1.5   
 [9] evaluate_0.10   stringi_1.1.2   geosphere_1.5-5 rstudioapi_0.6 
[13] sp_1.2-4        whisker_0.3-2   tools_3.3.2     stringr_1.2.0  
[17] yaml_2.1.14     htmltools_0.3.5 knitr_1.15.1   

This R Markdown site was created with workflowr