selective_inference

Last updated: 2019-04-19

Checks: 6 0

Knit directory: misc/analysis/

This reproducible R Markdown analysis was created with workflowr (version 1.2.0). The Report tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(12345)

The command set.seed(12345) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Repository version: d3b402a

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    analysis/.RData
    Ignored:    analysis/.Rhistory
    Ignored:    analysis/ALStruct_cache/
    Ignored:    data/.Rhistory
    Ignored:    docs/figure/.DS_Store

Untracked files:
    Untracked:  .dropbox
    Untracked:  Icon
    Untracked:  _workflowr.yml
    Untracked:  analysis/GTEX-cogaps.Rmd
    Untracked:  analysis/SPCAvRP.rmd
    Untracked:  analysis/cormotif.Rmd
    Untracked:  analysis/eQTL.perm.rand.pdf
    Untracked:  analysis/flash_test_tree.Rmd
    Untracked:  analysis/ieQTL.perm.rand.pdf
    Untracked:  analysis/m6amash.Rmd
    Untracked:  analysis/mash_bhat_z.Rmd
    Untracked:  analysis/mash_ieqtl_permutations.Rmd
    Untracked:  analysis/mixsqp.Rmd
    Untracked:  analysis/normalize.Rmd
    Untracked:  analysis/pseudodata.Rmd
    Untracked:  analysis/sc_bimodal.Rmd
    Untracked:  analysis/susie_z_investigate.Rmd
    Untracked:  analysis/svd-timing.Rmd
    Untracked:  analysis/test_sparse.Rmd
    Untracked:  analysis/z.txt
    Untracked:  code/multivariate_testfuncs.R
    Untracked:  data/4matthew/
    Untracked:  data/4matthew2/
    Untracked:  data/E-MTAB-2805.processed.1/
    Untracked:  data/ENSG00000156738.Sim_Y2.RDS
    Untracked:  data/GDS5363_full.soft.gz
    Untracked:  data/GSE41265_allGenesTPM.txt
    Untracked:  data/Muscle_Skeletal.ACTN3.pm1Mb.RDS
    Untracked:  data/Thyroid.FMO2.pm1Mb.RDS
    Untracked:  data/bmass.HaemgenRBC2016.MAF01.Vs2.MergedDataSources.200kRanSubset.ChrBPMAFMarkerZScores.vs1.txt.gz
    Untracked:  data/bmass.HaemgenRBC2016.Vs2.NewSNPs.ZScores.hclust.vs1.txt
    Untracked:  data/bmass.HaemgenRBC2016.Vs2.PreviousSNPs.ZScores.hclust.vs1.txt
    Untracked:  data/finemap_data/fmo2.sim/b.txt
    Untracked:  data/finemap_data/fmo2.sim/dap_out.txt
    Untracked:  data/finemap_data/fmo2.sim/dap_out2.txt
    Untracked:  data/finemap_data/fmo2.sim/dap_out2_snp.txt
    Untracked:  data/finemap_data/fmo2.sim/dap_out_snp.txt
    Untracked:  data/finemap_data/fmo2.sim/data
    Untracked:  data/finemap_data/fmo2.sim/fmo2.sim.config
    Untracked:  data/finemap_data/fmo2.sim/fmo2.sim.k
    Untracked:  data/finemap_data/fmo2.sim/fmo2.sim.k4.config
    Untracked:  data/finemap_data/fmo2.sim/fmo2.sim.k4.snp
    Untracked:  data/finemap_data/fmo2.sim/fmo2.sim.ld
    Untracked:  data/finemap_data/fmo2.sim/fmo2.sim.snp
    Untracked:  data/finemap_data/fmo2.sim/fmo2.sim.z
    Untracked:  data/finemap_data/fmo2.sim/pos.txt
    Untracked:  data/logm.csv
    Untracked:  data/m.cd.RDS
    Untracked:  data/m.cdu.old.RDS
    Untracked:  data/m.new.cd.RDS
    Untracked:  data/m.old.cd.RDS
    Untracked:  data/mainbib.bib.old
    Untracked:  data/mat.csv
    Untracked:  data/mat.txt
    Untracked:  data/mat_new.csv
    Untracked:  data/matrix_lik.rds
    Untracked:  data/paintor_data/
    Untracked:  data/temp.txt
    Untracked:  data/y.txt
    Untracked:  data/y_f.txt
    Untracked:  data/zscore_jointLCLs_m6AQTLs_susie_eQTLpruned.rds
    Untracked:  data/zscore_jointLCLs_random.rds
    Untracked:  docs/figure/eigen.Rmd/
    Untracked:  docs/figure/fmo2.sim.Rmd/
    Untracked:  docs/figure/newVB.elbo.Rmd/
    Untracked:  docs/figure/rbc_zscore_mash2.Rmd/
    Untracked:  docs/figure/rbc_zscore_mash2_analysis.Rmd/
    Untracked:  docs/figure/rbc_zscores.Rmd/
    Untracked:  docs/trend_files/
    Untracked:  docs/z.txt
    Untracked:  explore_udi.R
    Untracked:  output/fit.varbvs.RDS
    Untracked:  output/glmnet.fit.RDS
    Untracked:  output/test.bv.txt
    Untracked:  output/test.gamma.txt
    Untracked:  output/test.hyp.txt
    Untracked:  output/test.log.txt
    Untracked:  output/test.param.txt
    Untracked:  output/test2.bv.txt
    Untracked:  output/test2.gamma.txt
    Untracked:  output/test2.hyp.txt
    Untracked:  output/test2.log.txt
    Untracked:  output/test2.param.txt
    Untracked:  output/test3.bv.txt
    Untracked:  output/test3.gamma.txt
    Untracked:  output/test3.hyp.txt
    Untracked:  output/test3.log.txt
    Untracked:  output/test3.param.txt
    Untracked:  output/test4.bv.txt
    Untracked:  output/test4.gamma.txt
    Untracked:  output/test4.hyp.txt
    Untracked:  output/test4.log.txt
    Untracked:  output/test4.param.txt
    Untracked:  output/test5.bv.txt
    Untracked:  output/test5.gamma.txt
    Untracked:  output/test5.hyp.txt
    Untracked:  output/test5.log.txt
    Untracked:  output/test5.param.txt

Unstaged changes:
    Modified:   analysis/_site.yml
    Deleted:    analysis/chunks.R
    Modified:   analysis/eigen.Rmd
    Modified:   analysis/fmo2.sim.Rmd
    Modified:   analysis/newVB.Rmd
    Modified:   analysis/wSVD.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the R Markdown and HTML files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view them.

File	Version	Author	Date	Message
Rmd	d3b402a	Matthew Stephens	2019-04-19	workflowr::wflow_publish(“selective_inference.Rmd”)
html	900273f	Matthew Stephens	2019-04-19	Build site.
Rmd	eb867f8	Matthew Stephens	2019-04-19	wflow_publish(“selective_inference.Rmd”)

Introduction

The aim here is to illustrate selective inference with correlated predictors.

library("selectiveInference")

Loading required package: glmnet

Loading required package: Matrix

Loading required package: foreach

Loaded glmnet 2.0-16

Loading required package: intervals


Attaching package: 'intervals'

The following object is masked from 'package:Matrix':

    expand

Loading required package: survival

This function simulates matrix x with highly columns, and then \(y=x\beta+e\) with \(\beta\) having one non-zero element (\(b\)).

simdata = function(n,p,b,signal_index,cor_sd,sigma=1){
  z = rnorm(n)
  x = matrix(rep(z,p),nrow=n) + matrix(rnorm(n*p,sd=cor_sd),nrow=n)

  beta = rep(0,p)
  beta[signal_index] = b
  y = x %*% beta + sigma*rnorm(n)
  return(list(y=y,x=x,beta=beta))
}

Very high correlation

Here the correlation is very high (\(>0.999\) on average). The true non-zero element is \(beta[5]=3\). Here the method selects the wrong variable. It was unexpected to me that the method considered the first (wrong) variable entered highly significant. I thought the presence of other highly correlated variables would mean that the estimate of that coefficient be highly uncertain. I guess that maybe the method is estimating the coefficient of the selected variable in a univariate regression, rather than the multiple regression coefficient?

set.seed(33)
dat= simdata(50,10,3,5,0.01,1)
mean(cor(dat$x))

[1] 0.9998704

# run forward stepwise, plot results
fsfit = fs(dat$x,dat$y)
plot(fsfit)

Version	Author	Date
900273f	Matthew Stephens	2019-04-19

# compute sequential p-values and confidence intervals
# (sigma estimated from full model)
out = fsInf(fsfit)
print(out)


Call:
fsInf(obj = fsfit)

Standard deviation of noise (specified or estimated) sigma = 0.962

Sequential testing results with alpha = 0.100
 Step Var   Coef Z-score P-value LowConfPt UpConfPt LowTailArea UpTailArea
    1  10  3.113  19.195   0.000     2.844    3.381       0.049      0.049
    2   6 -8.301  -0.826   0.909    -9.130  418.832       0.050      0.050
    3   1  7.490   0.704   0.963      -Inf  -72.373       0.000      0.045
    4   8 -8.556  -0.719   0.521      -Inf      Inf       0.000      0.000
    5   2 -5.977  -0.542   0.105      -Inf  615.374       0.000      0.050
    6   4 -5.093  -0.432   0.072      -Inf  103.535       0.000      0.050
    7   7  5.150   0.360   0.816      -Inf  387.692       0.000      0.050
    8   3 -3.132  -0.227   0.788  -589.574      Inf       0.050      0.000
    9   9  2.065   0.136   0.837      -Inf  263.414       0.000      0.050
   10   5  1.719   0.112   0.942      -Inf  199.868       0.000      0.050

Estimated stopping point from ForwardStop rule = 1

High correlation

Here I try with a less extreme, but still high correlation (\(0.986\) average). Here it picks the right variable first.

set.seed(33)
dat= simdata(50,10,3,5,0.1,1)
mean(cor(dat$x))

[1] 0.9869493

# run forward stepwise, plot results
fsfit = fs(dat$x,dat$y)
plot(fsfit)

Version	Author	Date
900273f	Matthew Stephens	2019-04-19

# compute sequential p-values and confidence intervals
# (sigma estimated from full model)
out = fsInf(fsfit)
print(out)


Call:
fsInf(obj = fsfit)

Standard deviation of noise (specified or estimated) sigma = 0.962

Sequential testing results with alpha = 0.100
 Step Var   Coef Z-score P-value LowConfPt UpConfPt LowTailArea UpTailArea
    1   5  3.099  19.495   0.000     2.837    3.362       0.049      0.049
    2  10  0.862   0.902   0.894      -Inf    5.609       0.000      0.050
    3   6 -0.793  -0.607   0.737   -15.527   54.765       0.050      0.050
    4   1  0.948   0.804   0.819  -101.155   14.770       0.050      0.050
    5   8 -0.780  -0.631   0.739   -54.642      Inf       0.050      0.000
    6   2 -0.603  -0.510   0.280      -Inf  114.566       0.000      0.050
    7   4 -0.533  -0.441   0.226      -Inf   34.909       0.000      0.050
    8   7  0.514   0.358   0.354   -27.841   54.030       0.050      0.050
    9   3 -0.287  -0.207   0.208      -Inf   32.315       0.000      0.050
   10   9  0.212   0.139   0.290   -50.990  143.626       0.050      0.050

Estimated stopping point from ForwardStop rule = 1

High correlation, smaller effect

Try reducing effect size to \(b=1\).

set.seed(33)
dat= simdata(50,10,1,5,0.1,1)
mean(cor(dat$x))

[1] 0.9869493

# run forward stepwise, plot results
fsfit = fs(dat$x,dat$y)
plot(fsfit)

Version	Author	Date
900273f	Matthew Stephens	2019-04-19

# compute sequential p-values and confidence intervals
# (sigma estimated from full model)
out = fsInf(fsfit)
print(out)


Call:
fsInf(obj = fsfit)

Standard deviation of noise (specified or estimated) sigma = 0.962

Sequential testing results with alpha = 0.100
 Step Var   Coef Z-score P-value LowConfPt UpConfPt LowTailArea UpTailArea
    1  10  1.135   6.967   0.000     0.813    1.404       0.049      0.049
    2   1  0.418   0.532   0.550   -16.629   12.778       0.050      0.050
    3   6 -1.210  -0.879   0.766   -10.567   48.569       0.050      0.050
    4   3 -0.681  -0.571   0.338      -Inf   64.672       0.000      0.050
    5   2 -0.375  -0.341   0.685   -80.765      Inf       0.050      0.000
    6   5  0.680   0.467   0.398   -60.147   94.450       0.050      0.050
    7   4 -0.494  -0.397   0.367      -Inf  118.456       0.000      0.050
    8   8 -0.475  -0.348   0.136      -Inf   26.384       0.000      0.050
    9   7  0.516   0.360   0.577      -Inf  121.478       0.000      0.050
   10   9  0.212   0.139   0.923      -Inf   11.036       0.000      0.050

Estimated stopping point from ForwardStop rule = 1

High correlation, smaller effect

Try lower correlation (about 0.95), and repeat 10 times. Each time we store the first selected variable and the corresponding \(p\) value.

We also run susie. Note that it outputs large CSs (because the variables are correlated) and 96 out of 100 include the true effect variable.

set.seed(32)
n_iter = 100
out_select = rep(0,n_iter)
out_pv = rep(0,n_iter)
out_susie_cs_size =rep(0,n_iter) # size of first CS
out_susie_cs_inc = rep(0,n_iter) # whether CS includes true effect variable
out_susie_cs_num = rep(0,n_iter) # number of CSs reported (just to check not more than 1)
for(i in 1:n_iter){
  dat= simdata(50,10,1,5,0.2,1)
  mean(cor(dat$x))
# run forward stepwise, plot results
  fsfit = fs(dat$x,dat$y)

# compute sequential p-values and confidence intervals
# (sigma estimated from full model)
  out = fsInf(fsfit)
  out_select[i] = out$vars[1]
  out_pv[i] = out$pv[1]
  
  s.fit = susieR::susie(dat$x,dat$y,estimate_prior_variance = TRUE)
  
  s.cs = susieR::susie_get_cs(s.fit)$cs
  out_susie_cs_num[i] = length(s.cs)
  s.cs = s.cs[[1]]
  out_susie_cs_size[i] = length(s.cs)
  out_susie_cs_inc[i] = is.element(5,s.cs)
  
}
hist(out_select)

Version	Author	Date
900273f	Matthew Stephens	2019-04-19

summary(out_pv)

     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
0.0000000 0.0000000 0.0000000 0.0282402 0.0008448 0.9067797

mean(out_select[out_pv<0.05]==5)

[1] 0.5164835

hist(out_susie_cs_size,breaks=seq(0.5,10.5,length=11))

mean(out_susie_cs_inc)

[1] 0.96

mean(out_susie_cs_num)

[1] 1

sessionInfo()

R version 3.5.2 (2018-12-20)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.1

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] selectiveInference_1.2.4 survival_2.43-3         
[3] intervals_0.15.1         glmnet_2.0-16           
[5] foreach_1.4.4            Matrix_1.2-15           

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.0         knitr_1.21         whisker_0.3-2     
 [4] magrittr_1.5       workflowr_1.2.0    splines_3.5.2     
 [7] lattice_0.20-38    susieR_0.7.1.0487  stringr_1.3.1     
[10] tools_3.5.2        grid_3.5.2         xfun_0.4          
[13] git2r_0.24.0       matrixStats_0.54.0 htmltools_0.3.6   
[16] iterators_1.0.10   yaml_2.2.0         rprojroot_1.3-2   
[19] digest_0.6.18      fs_1.2.6           codetools_0.2-15  
[22] glue_1.3.0         evaluate_0.12      rmarkdown_1.11    
[25] stringi_1.2.4      compiler_3.5.2     backports_1.1.3

selective_inference

Matthew Stephens

2019-04-19

Introduction

Very high correlation

High correlation

High correlation, smaller effect

High correlation, smaller effect