I think it is time I started to outline what I have in mind in more detail. Remember the goal: to develop a framework for performing “dynamic statistical comparisons” (DSCs), that is, statistical comparisons that are more reproducible and more easily extensible (by adding new data sets or new methods to the comparison).

I’m going to start simple, which I think is a good way to start most projects. So initially I am going to restrict myself to comparisons that can be done fully in R. By doing this I hope to put off a lot of the trickier issues around platform compatibility, etc., that will have to be dealt with later.

Concrete Example - comparing regression methods

To take a concrete example, I plan to take a simple simulation study from the Elastic Net paper by Zou and Hastie (Section 5 in this pdf), and try to turn it into a DSC. This is going to be a medium-term project - not finished in a single blog post. By outlining my thinking here I’m hoping to give people a chance to make suggestions on how to improve, or even to contribute directly to a github repository.

Let’s list the components of a typical (simple) simulation study, like this one:

  1. It has parameters, which indicate how the data are to be simulated.
  2. It has a way (or ways) of simulating input data from those parameters.
  3. It has methods that turn input into output.
  4. It has a way of scoring methods by comparing the output with something – often with the parameters, but perhaps also with some other meta-data – to assess how good the performance is.

I think most of the terms I have highlighted above are somewhat self-explanatory, but maybe meta-data deserves some elaboration. Here I am thinking of pretty much anything that might be needed to score methods. In most cases the meta-data will be generated at the same time as the input data. For the regression DSC I think we probably don’t need meta-data, but I believe including it here may allow additional flexibility in the future.

To make these ideas more concrete, let’s look at how some of them apply in the regression simulation. In this case we are considering regression models of the form \( Y = XB + E, \) where the elements of \( E \) are iid \( N(0,\sigma^2) \).

The input \( (Y,X) \) is obtained by i) simulating \( X \), and ii) simulating \( Y|X \) from \( Y=XB+E. \) The parameters needed to do this are i) the covariance of \( X \), and ii) \( B \) and \( \sigma \). The methods (in this case, ridge regression, LASSO, EN, and naive EN) are given input \( (Y, X) \) and must output an estimate of \( B \), call it \( \hat{B} \).

The score computed for each simulation is the squared error \( \sum_j (B_j-\hat{B}_j)^2. \) (Methods are compared by the median squared error, MSE, over simulations in their Table 2).
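To make this fully concrete, here is a minimal sketch in R of simulating one data set and scoring an estimate. The function names (sim_regression, score_sse) and the use of MASS::mvrnorm are just illustrative choices on my part, not part of any package:

```r
# Minimal sketch (illustrative only): simulate one regression data set
# from parameters (B, sigma, covariance of X) and score an estimate of B.
library(MASS)  # for mvrnorm

sim_regression <- function(n, B, sigma, Sigma_X) {
  X <- mvrnorm(n, mu = rep(0, length(B)), Sigma = Sigma_X)  # simulate X
  Y <- drop(X %*% B) + rnorm(n, 0, sigma)                   # simulate Y | X
  list(input = list(X = X, Y = Y), meta = NULL)             # no meta-data needed here
}

score_sse <- function(B, Bhat) sum((B - Bhat)^2)  # squared error for one simulation
```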

Recap

So in outline a typical comparison consists of the following:

  1. Make parameters
  2. Make input and meta-data from the parameters
  3. Run methods to turn input into output
  4. Score each method by comparing its output with the parameters and meta-data
  5. Make a graph or table of the results

The first 4 steps may well be repeated many times - for example, for multiple seeds of the random number generator, and for multiple different simulation scenarios (ways of producing input data). And indeed, some methods may be run multiple times with different parameter settings. For example, we might run LASSO with both 2-fold CV and 10-fold CV. We will refer to these as different flavors of a method.
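For instance, two flavors of LASSO might just be two thin wrappers around the same fitting routine with different settings. The sketch below uses glmnet (one common LASSO implementation); the input/output interface is only for illustration:

```r
# Two "flavors" of the same method: LASSO tuned by 2-fold vs 10-fold CV.
# (Interface is illustrative; glmnet is one possible LASSO implementation.)
library(glmnet)

lasso_cv <- function(input, nfolds) {
  fit <- cv.glmnet(input$X, input$Y, nfolds = nfolds)                   # cross-validated LASSO
  list(Bhat = as.numeric(as.matrix(coef(fit, s = "lambda.min")))[-1])   # drop the intercept
}

lasso_2fold  <- function(input) lasso_cv(input, nfolds = 2)   # flavor 1
lasso_10fold <- function(input) lasso_cv(input, nfolds = 10)  # flavor 2
```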

Putting it together

Based on this, I propose the following basic structure for a simple DSC repository implementing a simulation study. First, we will index simulations by the seed (an integer) and scenario (a string). We will have directories param/, data/, methods/, output/ and results/ to store various steps in the process. (There is some question about exactly how much we want to store - for now I will go over the top and store everything, which will be OK if the simulations are not too large.) We’ll use subdirectories to store different scenarios. So for example, the parameters used in simulation with seed 7 and scenario A will be stored in a file with a name along the lines of param/A/param.7.RData. (In general we may not want to restrict ourselves to the RData format, but I’m going with this for now.)
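To illustrate the naming convention (param_file here is just a hypothetical helper, not part of any package):

```r
# Hypothetical helper mapping (scenario, seed) to a file path under param/.
param_file <- function(scenario, seed) {
  file.path("param", scenario, paste0("param.", seed, ".RData"))
}
param_file("A", 7)
#> [1] "param/A/param.7.RData"
```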

Now the user will have to supply the following:

  1. A function parammaker for making parameters from a given seed and scenario.
  2. A function datamaker for making data = (input, meta) from the parameters (it could also depend on the seed and scenario).
  3. Methods that turn input into output.
  4. A function score for turning output, data and parameters into scores.
  5. A list of seeds to be used for each simulation scenario.

Given this, we’ll provide a function that generates all the parameter and data files, runs the methods to produce output files, and scores all the methods (perhaps with multiple flavors of some methods), outputting a dataframe of results.
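Putting the regression example into this shape, the user-supplied pieces might look roughly like the sketch below (reusing sim_regression, score_sse, and the two LASSO flavors from above). I want to stress this is R-flavored pseudocode: the exact interface the dscr package ends up using may well differ, run_dsc is just a placeholder name for the driver function, and the parameter values are illustrative only.

```r
# R-flavored pseudocode for the user-supplied pieces in the regression example.
parammaker <- function(seed, scenario) {
  set.seed(seed)
  list(B = c(3, 1.5, 0, 0, 2, 0, 0, 0), sigma = 3, Sigma_X = diag(8))  # illustrative values
}

datamaker <- function(params, seed, scenario) {
  set.seed(seed)
  sim_regression(n = 20, B = params$B, sigma = params$sigma, Sigma_X = params$Sigma_X)
}

score <- function(params, data, output) {
  c(sse = score_sse(params$B, output$Bhat))
}

methods <- list(lasso_2fold = lasso_2fold, lasso_10fold = lasso_10fold)

# and then, roughly, something like:
# results <- run_dsc(parammaker, datamaker, methods, score, seeds = 1:100, scenarios = "A")
```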

An Example

Initially I thought I would simply post this and wait for feedback, but during the writing phase I found myself on a long flight and decided to start putting together the code for this. If you are interested, I hope you will take a look at my github repository dscr, which contains the start of an R package. You should be able to install the dscr package directly from github using devtools::install_github("stephens999/dscr"). You should also be able to clone the repository and take a look at the example in vignette/one_sample_location.rmd, perhaps even run it! As you will see, I decided to start with an example even simpler than the regression one. The functionality is basic at best, but it does run (for me at least!) and illustrates the ideas. The output I get from running this rmd file is here.

If you have comments I welcome them, either on the blog here, or preferably by opening an Issue. And if you further develop or improve the code, go ahead and put in a pull request!