Friday, April 12, 2013

R: Table 1 - Done in One?

Every paper going to a biomedical journal has the formulaic Table 1: it's something I've done time and time again- and had to revise, time and time again. Since it's ubiquitous and highly formulaic, it seems like an ideal candidate for automating with a one-line function, but I still see these tables being constructed piecemeal. Typically you see continuous variables presented as means and standard deviations or quartiles, and categorical variables are presented as frequencies - we don't have too many options to choose from here, so why shouldn't we be able to easily construct our table using some sort of formula?

We can use operators to denote each of the types of summaries we'd like to do:
  • + : mean (±SD), for continuous variables that are roughly symmetric.
  • % : frequency, for categorical variables
  • - : IQR/Quartiles, for skewed continuous variables
R formulas may not play nicely with these symbols, but we don't have to use formulas per se: we can pass a string that looks much like a formula and parse it with regular expressions. Ideally, I'd love to be able to generate Table 1 with a simple call like this:

# Take variables X1-X3 from my.data.frame, and summarize them by the factor "group."
# Present X1 as Mean (±SD), break down X2 by frequency, and display the Quartiles of X3.
table.summary("group ~ +X1 %X2 -X3", my.data.frame)

In essence, we'd be able to construct our summary table using the familiar formula syntax of regression, and Table 1 is done in one line, ready to be beautifully presented using xtable. I haven't had time to fully flesh this out - for all I know, someone's beaten me to this and it's floating about CRAN somewhere while I'm obliviously re-inventing the wheel.

table.summary <- function (table.fmla, table.labels = NULL, dataset) {

  # Replace all whitespace in formula, split at tilde (~)
  table.fmla <- unlist(strsplit(gsub(" ", "", table.fmla), split = "~"))
  group.var <- table.fmla[1]

  # Split formula up by the operators
  table.var <- unlist(strsplit(table.fmla[2],
                               split = "[\\+\\-\\%]", perl = TRUE))[-1] 

  # Find the operators associated with each variable
  fmla.chars <- unlist(strsplit(table.fmla[2], split = ""))
  table.ops <- fmla.chars[fmla.chars %in% c("+", "-", "%")]

  table.sections <- list()
  for (i in 1:length(table.var)) {
    table.sections[[i]] <- switch(table.ops[i],
           "+" = with(dataset, mean.sd(get(table.var[i]), get(group.var))),
           "%" = with(dataset, xtab(get(table.var[i]), get(group.var))),
           "-" = with(dataset, med.iqr(get(table.var[i]), get(group.var))))
  }

  # combine elements of list into a data frame, adjust row/column names, and return
  ...
}

Basically, all this needs is a few helper functions (mean.sd, xtab, and med.iqr) that compute the appropriate statistics, round them to sensible significant figures, and format them - parentheses around the SD, percent signs for frequencies, and brackets around the quartiles - essentially a combination of unlist, by, round, format, and paste. We could also add the ability to do paired/unpaired tests, but I think those should be requested explicitly as arguments rather than chosen by default, to avoid abuse.

It seems like this is something which should exist- I know there are packages out there which can do tables, however I think a simple syntax would really make this highly accessible and highly utilized. I'm sure I'm not the only researcher who yearns for the day when Table 1 is a one-liner.

UPDATE: After a bit of tinkering, I have a working version - there are a couple of features I'd like to add, but we can always fine-tune down the line. Let's create some data for a motivating example: a randomized stroke trial. We'll be summarizing age, gender, neurological status (measured by the Glasgow Coma Scale), and blood pressure:

rm(list = ls(all = T))
set.seed(1)
N <- 150
Age <- rpois(N, 65)
Gender <- cut(rbinom(N, 1, 0.5), breaks = c(-Inf, 0, Inf),
              labels = c("Female", "Male"))
Group <- cut(rbinom(N, 1, 0.5), breaks = c(-Inf, 0, Inf),
             labels = c("Control", "Treatment"))
GCS <- cut(ceiling(runif(N, 2, 15)), breaks = c(-Inf, 8, 12, Inf),
           labels = c("Severe (3-8)", "Moderate (9-12)", "Mild (13-15)"))
SBP <- rpois(N, 160)
DBP <- rpois(N, 100)
stroke <- data.frame(Age, Gender, GCS, SBP, DBP, Group)

# Look at the first few records:
head(stroke)


  Age Gender             GCS SBP DBP     Group
1  59 Female Moderate (9-12) 182 113 Treatment
2  75   Male    Severe (3-8) 161 116 Treatment
3  75   Male Moderate (9-12) 177 116   Control
4  68 Female    Mild (13-15) 167 106 Treatment
5  52   Male    Severe (3-8) 161  83 Treatment
6  68   Male    Severe (3-8) 171 115   Control


Now, let's look at the function to generate the table:

table.summary <- function (table.fmla, dataset, suppress.missing = F) {
  # Replace all whitespace in formula, split at tilde (~)
  table.fmla <- unlist(strsplit(gsub(" ", "", table.fmla), split = "~"))
  group.var <- table.fmla[1]
 
  # Split formula up by the operators
  table.var <- unlist(strsplit(table.fmla[2],
                               split = "[\\+\\-\\%]", perl = TRUE))[-1] 
 
  # Find the operators associated with each variable
  fmla.chars <- unlist(strsplit(table.fmla[2], split = ""))
  table.ops <- fmla.chars[fmla.chars %in% c("+", "-", "%")]

  # Helper: frequency table of a categorical variable by group, formatted as "n (x%)"
  xtab <- function (variable, group){
    if (suppress.missing == T ) na.flag <- "no" else na.flag <- "ifany"
    freqs <- t(rbind(table(group, variable, useNA = na.flag)))
    props <- round(100*prop.table(freqs, 2),1)
    rownames(freqs)[which(is.na(rownames(freqs)))] <- "Missing"
    freqs <- data.frame(freqs) 
   
    for (i in 1: ncol(freqs)) {
      freqs[,i] <- paste0(freqs[,i], " (", props[,i], "%)")
    }
    colnames(freqs) <-levels(group)
    return(freqs)
  } 
 
  # Helper: mean (SD) of a roughly symmetric continuous variable by group, plus a count of missing values
  mean.sd <- function (variable, group) {
    means <- paste0(round(matrix(unlist(by(variable, group, mean, na.rm = T))), 1),
                    " (", round(matrix(unlist(by(variable, group, sd, na.rm = T))), 1), ")")
    missing.values <- unlist(by(variable, group, function (x) sum(is.na(x))))
    means <- data.frame(t(cbind(means, missing.values)))
    rownames(means)[1] <- "(±SD)"
    if (sum(is.na(variable)) == 0 | suppress.missing == T) means <- means[-2,]
    colnames(means) <- levels(group)
    return(means)
  }
 

  # Helper: median [Q1, Q3] of a skewed continuous variable by group, plus a count of missing values
  med.iqr <- function (variable, group) {
    qs <- paste0(round(matrix(unlist(by(variable, group, median, na.rm = T))), 1),
      " [", round(matrix(unlist(by(variable, group, quantile, .25, na.rm = T))), 1),
      ", ", round(matrix(unlist(by(variable, group, quantile, .75, na.rm = T))), 1), "]")
    missing.values <- unlist(by(variable, group, function (x) sum(is.na(x))))
    qs <- data.frame(t(cbind(qs, missing.values)))
    rownames(qs)[1] <- "[IQR]"
    if (sum(is.na(variable)) == 0 | suppress.missing == T) qs <- qs[-2,]
    colnames(qs) <- levels(group)
    return(qs)
  }
   
  table.sum <- NULL
  for (i in 1: length(table.var)) {
    switch(table.ops[i],
           "+" = table.section <-
             with(dataset, mean.sd(get(table.var[i]), get(group.var))),
           "%" = table.section <-
             with(dataset, xtab(get(table.var[i]), get(group.var))),
           "-" = table.section <-
             with(dataset, med.iqr(get(table.var[i]), get(group.var))))

    rownames(table.section) <- paste(gsub("\\s*\\s", " ",
        gsub("*[_|.]", " ", table.var[i])),
        gsub("\\s*\\s", " ",   gsub("*[_|.]", " ", rownames(table.section))))
    table.sum <- rbind(table.sum, table.section)
  }
 
  colnames(table.sum) <- paste0(levels(get(group.var,dataset) ), " (N=",
    table(get(group.var,dataset))[match(levels(get(group.var,dataset)),
    rownames(table(get(group.var,dataset))))], ")")
 
  return(table.sum)
}

For the most part, it's shoe-horning results into a data frame of character data: it's not terribly beautiful to look at, and I'm sure there's a more elegant way to go about it, but let's see the call and the results:

table1 <- table.summary(table.fmla = "Group ~ +Age %Gender %GCS -SBP -DBP",
              dataset = stroke)
library(xtable)
print(xtable(table1), type = "html")



                     Control (N=70)      Treatment (N=80)
Age (±SD)            64.5 (6.8)          65.8 (7.9)
Gender Female        33 (47.1%)          45 (56.2%)
Gender Male          37 (52.9%)          35 (43.8%)
GCS Severe (3-8)     33 (47.1%)          36 (45%)
GCS Moderate (9-12)  29 (41.4%)          21 (26.2%)
GCS Mild (13-15)     8 (11.4%)           23 (28.7%)
SBP [IQR]            160.5 [151.2, 169]  158 [147.8, 167]
DBP [IQR]            100 [93.2, 107.5]   99 [93.8, 107]

Not too shabby for a one-liner! Now let's see how it handles missing or incomplete data by knocking out a few of our values:

stroke.incomplete <- stroke
stroke.incomplete$Age[sample(1:N, 4)] <- NA
stroke.incomplete$GCS[sample(1:N, 4)] <- NA
stroke.incomplete$SBP[sample(1:N, 4)] <- NA
stroke.incomplete$DBP[sample(1:N, 4)] <- NA


Let's have a look at the default option, suppress.missing = FALSE:

# Incomplete Data
table1.m <- table.summary(table.fmla = "Group ~ +Age %Gender %GCS -SBP -DBP",
              dataset = stroke.incomplete)
print(xtable(table1.m), type = "html")



                     Control (N=70)     Treatment (N=80)
Age (±SD)            64.7 (6.8)         65.9 (7.9)
Age missing values   2                  2
Gender Female        33 (47.1%)         45 (56.2%)
Gender Male          37 (52.9%)         35 (43.8%)
GCS Severe (3-8)     32 (45.7%)         35 (43.8%)
GCS Moderate (9-12)  28 (40%)           20 (25%)
GCS Mild (13-15)     8 (11.4%)          23 (28.7%)
GCS Missing          2 (2.9%)           2 (2.5%)
SBP [IQR]            161 [151, 169]     158 [148, 167]
SBP missing values   1                  3
DBP [IQR]            100 [93, 108.2]    99.5 [94, 107]
DBP missing values   2                  2


We can suppress the missingness information by setting suppress.missing = TRUE:

# Incomplete Data - Suppress missingness information
table1.sm <- table.summary(table.fmla = "Group ~ +Age %Gender %GCS -SBP -DBP",
              dataset = stroke.incomplete, suppress.missing = T)
print(xtable(table1.sm), type = "html")



                     Control (N=70)     Treatment (N=80)
Age (±SD)            64.7 (6.8)         65.9 (7.9)
Gender Female        33 (47.1%)         45 (56.2%)
Gender Male          37 (52.9%)         35 (43.8%)
GCS Severe (3-8)     32 (47.1%)         35 (44.9%)
GCS Moderate (9-12)  28 (41.2%)         20 (25.6%)
GCS Mild (13-15)     8 (11.8%)          23 (29.5%)
SBP [IQR]            161 [151, 169]     158 [148, 167]
DBP [IQR]            100 [93, 108.2]    99.5 [94, 107]


There are plenty of things we could add, like customizable labels, hypothesis tests, and so forth, but this is a good start. Maybe next time we can port our results to file formats more conducive to collaboration, editing, and submission.
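
As a small first step in that direction, the same xtable object can be written straight to a file, or the character data frame dumped to a CSV for collaborators who live in spreadsheets. A minimal sketch (the file names are just placeholders):

# Write the table out for sharing (file names are illustrative)
print(xtable(table1), type = "latex", file = "table1.tex")  # drop into a LaTeX manuscript
print(xtable(table1), type = "html", file = "table1.html")  # opens cleanly in Word or a browser
write.csv(table1, "table1.csv")                             # plain CSV for spreadsheet users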

Wednesday, March 20, 2013

SAS: Adding Watermarks to ODS Report - LIFETEST, LIFEREG, and PHREG

Building off the Weibull survival simulation and the earlier snippet of code that creates a watermark background, we can put together a quick report with SAS ODS. If you want to play along at home, just add the following line to the survival simulation; it seeds the random number generator, so you should get the exact same data:

set.seed(1)

Now, let's start up a report with SAS 9.2 using ODS: we can load in the data using IMPORT, and start playing around with LIFETEST, LIFEREG, and PHREG. This won't be extensive or exhaustive- just an excuse to show off a new trick. If you're new to survival analysis in SAS, or survival analysis in general, there's plenty of material over at UCLA's Institute for Digital Research and Education- possibly one of the best stat resources out there.

/* This will keep the watermark properly aligned*/
options papersize = (8.5in 11in);

proc import
    datafile = "C:\Data\Didactic\Data\Survival sim 2013-03-20.csv"
    out = surv dbms = csv replace;
run;


/* create our template using the results */
proc template;
    define style watermark;
    parent = styles.printer;
    style header from header / background=_undef_;
    style body from document / background=_undef_
        backgroundimage = "C:\Data\Didactic\Results\2013-03-20\wm - CONFIDENTIAL 03-19-2013 300 dpi.png";
    end;
run;


/* begin the PDF report */
ods pdf file = "C:\Data\Didactic\Results\2013-03-20\Survival Sim - 2013-03-05.pdf"
    style = watermark;
ods noproctitle;
ods escapechar="^";
ods pdf text = "^S={just = center font_weight = bold font_style = italic
    font_size = 18pt}Survival Simulated Data";
 

/* plot survival, log survival, and log-log survival */
ods proclabel = "Survival - X1";
title "Survival Function - Stratified by X1";
proc lifetest data = surv plots = (s, ls, lls) graphics notable;
    time time_to_event*event(0);
    strata x1;
run;

ods proclabel = "Survival - X2";
title "Survival Function - Stratified by X2";
proc lifetest data = surv plots = (s, ls, lls) graphics notable;
    time time_to_event*event(0);
    strata x2;
run;

ods proclabel = "Survival - X3";
title "Survival Function - Stratified by X3";
proc lifetest data = surv plots = (s, ls, lls) graphics notable;
    time time_to_event*event(0);
    strata x3;
run;


ods proclabel = "Weibull GLM";
title "Parametric Regression - Weibull";
proc lifereg data = surv;
    class x1 x2 x3;
    model time_to_event*event(0) = x1 x2 x3 x4 x5 x6
        x7 x8 x9 x10 x11 / dist = weibull;
run;

ods proclabel = "Cox PH Model";
title "Nonparametric Regression - Cox Proportional Hazards";
proc phreg data = surv;
    class x1 x2 x3;
    model time_to_event*event(0) = x1 x2 x3
        x4 x5 x6 x7 x8 x9 x10 x11;
run;
ods pdf close;


The results aren't too shabby, and should give you something like this:


Another day, we can delve deeper into the dataset in SAS, or maybe boot up the survival package in R.
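
If and when we do move over to R, a rough sketch of the same analyses might look like the following, using the survival package and the CSV written by the simulation script (the path matches the SAS IMPORT above):

# A rough sketch of the same analyses in R, using the survival package
library(survival)
surv <- read.csv("C:/Data/Didactic/Data/Survival sim 2013-03-20.csv")

# Kaplan-Meier curves, stratified by X1 (analogous to PROC LIFETEST)
km.x1 <- survfit(Surv(time_to_event, event) ~ X1, data = surv)
plot(km.x1)

# Weibull regression (analogous to PROC LIFEREG)
weib.fit <- survreg(Surv(time_to_event, event) ~ factor(X1) + factor(X2) + factor(X3) +
    X4 + X5 + X6 + X7 + X8 + X9 + X10 + X11, data = surv, dist = "weibull")
summary(weib.fit)

# Cox proportional hazards model (analogous to PROC PHREG)
cox.fit <- coxph(Surv(time_to_event, event) ~ factor(X1) + factor(X2) + factor(X3) +
    X4 + X5 + X6 + X7 + X8 + X9 + X10 + X11, data = surv)
summary(cox.fit)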

Monday, March 18, 2013

R: Create Watermark for SAS ODS Reports

Now to see if we can add some watermarks in SAS ODS - I've seen pages on how to create watermarked reports using background images. The first part is creating the background image: we can steal some code from the earlier post on watermarks in R, and adapt it to create PNG backdrops for our reports.

So let's create a function which does this for us: we just tell it what we want the watermark to say, and pass it a couple parameters: the color, size, angle, and font of the text.

# create a date-stamped folder for the results
run.date <- format(Sys.Date(), "%m-%d-%Y")
working.dir <- paste0("c:/Data/Didactic/Results/", run.date)
create.watermark <- function(wm = "", rep.wm = T, n.reps = 5,
    result.dir, res.wm = 1200, cex.wm = 2, font.wm = 2,
    col.wm = rgb(1, 0, 0, .2), angle.wm = 45,
    page.width = 8.5, page.height = 11, page.unit = "in") {

    # Create the results directory if it doesn't already exist, then move there
    if (!file.exists(result.dir)) {
        dir.create(result.dir, recursive = T, showWarnings = F)
    }
    setwd(file.path(result.dir))

    # Repeat the watermark along the diagonal, or place a single one in the center
    if (rep.wm == T) {
        x.loc <- seq(0, 1, length.out = n.reps)
        y.loc <- seq(1, 0, length.out = n.reps)
    } else {
        x.loc <- .5; y.loc <- .5;
        n.reps <- 1
    }

    # create a date-stamped .png file
    png(paste0("Watermark Background - ", wm, " ", run.date,
    " ", res.wm, " dpi.png"),
        width = page.width, height = page.height, res = res.wm,
        units = page.unit)

        par(mar = c(0, 0, 0, 0))
        plot.new()
        text(grconvertX(x.loc, from = "npc"),
            grconvertY(y.loc, from = "npc"),
            labels = paste(rep(wm, n.reps), collapse = "   "),
            cex = cex.wm, font = font.wm,
            col = col.wm,
            srt = angle.wm)
    dev.off()
}



The program creates the image and labels it appropriately with the watermark text, run date, and resolution. In part 2, I'll see if I can make a nice-looking report with the background image, maybe even a report on simulated survival data.


Now, let's see if it works. First, let's create a repeated watermark across the page:

create.watermark(wm = "CONFIDENTIAL", result.dir = working.dir, res.wm = 300)


Next, just one single large watermark in the background:

create.watermark(wm = "FOR PEER REVIEW ONLY", rep.wm = F,
col.wm = rgb(0, 0, 1, .2), result.dir = working.dir, cex.wm = 5, res.wm = 300)




Suggestions for further improvements welcome!

R: Reproducible Research - Watermarks in Plots

Research is an iterative process: data is constantly being acquired, cleaned, and finalized while the analysis is being conducted. Since we work in a constantly evolving data and program landscape, version control is extremely important when programming, making research decisions, drafting reports, and writing manuscripts. Thankfully, reproducible research techniques are making this easier, which can mean fewer version-control headaches and more productivity. Watermarks are a quick way to improve version control and better communicate the scope of results.

The problem: Images can be abstracted and embedded, obscuring their context or data provenance. This can lead to preliminary, dummy, or simulation results being confused for a final result or privileged/confidential results being accidentally disseminated.

The solution: Image watermarks can’t be easily removed by cropping and embedding by the lay user, don’t obscure results, and provide information about data provenance and information sensitivity. These can be added in your software package of choice, so that no post-processing of images is necessary, and can be easily suppressed in final results by flag variables.
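
For instance, a single draft flag can toggle the watermark, so final figures come out clean without touching the plotting code. A minimal sketch (the draft.figures flag is a made-up name for illustration):

# Toggle the watermark with a flag (draft.figures is purely illustrative)
draft.figures <- TRUE
plot(rnorm(100))
if (draft.figures) {
    mtext("DRAFT - NOT FOR DISTRIBUTION", side = 3, line = -2,
          col = rgb(1, 0, 0, .2), cex = 2)
}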

The tools:
  • mtext, text: add text to margins (mtext) or within the plot (text)
  • grconvertX, grconvertY: find coordinates in a device, independent of axes
  • rgb: create translucent text from RGB values
Let's save our results as a date-stamped .png: first we open a new device, use the generic R plot function, and then use text and mtext to add watermarks to the image.

run.date <- format(Sys.Date(), "%m-%d-%Y")
png(paste0("Watermark 1 - ", run.date, ".png"))
plot(rnorm(100))
text(x = grconvertX(0.5, from = "npc"),  # align to center of plot X axis
     y = grconvertY(0.5, from = "npc"),  # align to center of plot Y axis
     labels = "CONFIDENTIAL",            # our watermark
     cex = 3, font = 2,                  # large, bold font - hard to miss
     col = rgb(1, 0, 0, .2),             # translucent (0.2 = 20%) red color
     srt = 45)                           # srt = angle of text: 45 degrees to the X axis
# Add another watermark in lower (side = 1) right (adj = 1) corner
watermark <- paste("Data embargo until", run.date)
mtext(watermark, side = 1, line = -1, adj = 1, col = rgb(1, 0, 0, .2), cex = 1.2)
dev.off()  # close device, create image


The result: a scatterplot with a large translucent CONFIDENTIAL across the center, and the data-embargo note tucked into the lower right corner.

Watermarks in ggplot2 and lattice:
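
I haven't polished these yet, but the same idea carries over directly. A rough sketch (coordinates and styling are purely illustrative):

# Shared toy data
set.seed(1)
dat <- data.frame(x = 1:100, y = rnorm(100))

# ggplot2: annotate() adds a translucent text layer, much like text() in base graphics
library(ggplot2)
ggplot(dat, aes(x, y)) +
    geom_point() +
    annotate("text", x = 50, y = 0, label = "CONFIDENTIAL",
             colour = "red", alpha = 0.2, size = 12, angle = 45)

# lattice: add the watermark inside the panel function with panel.text()
library(lattice)
xyplot(y ~ x, data = dat,
       panel = function(x, y, ...) {
           panel.xyplot(x, y, ...)
           panel.text(mean(x), mean(y), "CONFIDENTIAL",
                      col = rgb(1, 0, 0, .2), cex = 3, srt = 45)
       })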

Sunday, March 17, 2013

R: Time To Event Simulation - Weibull Model

This was a quick Saturday afternoon project- I wanted to write the guts of a program to simulate survival data. While there's plenty of survival datasets around to play with, I wanted to make something that could eventually be used to explore models which I don't get a chance to use that often, such as competing risks, discrete-time survival, and joint longitudinal-survival models. This will get us started in the right direction, and utilize some code I already had kicking around. First, let's get started by clearing the workspace, and setting up a directory for the resulting dataset.

rm(list = ls(all = T))
working.dir <- "c:/Data/Didactic/Data"


if (file.exists(working.dir)) {
    setwd(file.path(working.dir))
} else {
    dir.create(working.dir, recursive = T, showWarnings = F)

    setwd(working.dir)
}


Now let's get to some actual fun stuff. First, a quick function that gives us a positive definite correlation matrix, so we can simulate some covariates from the multivariate normal distribution. It uses the QR decomposition to get an orthogonal basis, fills in the eigenvalues, and converts the resulting covariance matrix to a correlation matrix.

generate.rho <- function (p = 10) {
    # Generate orthogonal columns via the QR decomposition
    E <- qr.Q(qr(matrix(runif(p), p, p)))

    # Generate eigenvalues from a uniform distribution
    evalues <- runif(p)

    # Scale eigenvalues to give a sum of 1
    evalues <- evalues/sum(evalues)

    # Reassemble the covariance matrix and convert it to a correlation matrix
    rho <- cov2cor(E %*% diag(evalues) %*% t(E))
    return(rho)
}


Let's sample some covariates - these will play into our censoring and event processes. Since I am sampling my covariates from a multivariate normal distribution with zero mean and a correlation matrix as its covariance, all the covariates will have standard normal marginal distributions (zero mean, unit variance). We can create categorical covariates by using the cut function, and control the proportions in each category using the quantiles of the standard normal distribution.

# Number of Observations
n.obs <- 500

# Time at which all observations will be censored
censoring.time <- 10

# Number of covariates to be converted to categorical (tertiles)
n.categorical <- 4

# Number of covariates tied to the censoring process, the event process, both, or neither
n.cens.cov   <- 3;
n.event.cov  <- 3;
n.common.cov <- 2;
n.null.cov   <- 3;
n.cov <- n.cens.cov + n.event.cov + n.common.cov + n.null.cov

# Generate a correlation matrix for the covariates
rho <- generate.rho(n.cov)

# Sample covariates & coefficients, assign sign at random
library(MASS) # for mvrnorm function
X <- matrix(mvrnorm(n.obs, matrix(0, n.cov), rho), n.obs)

# Create Categorical Covariates - Tertiles based on Z scores
X[,1:n.categorical] <- apply(X[,1:n.categorical], 2, cut, breaks = c(-Inf, -.43, .43, Inf), labels = F)


So now that we have our covariates, let's randomly assign each of our covariates to the censoring process, the event process, both processes, or being unrelated to the processes. Once we have our covariates, we sample some regression parameters, and create our linear predictors.

# Switch up the order, so we don't know which variables are related to the
# censoring and event processes
var.type <- sample(c(rep("cens", n.cens.cov), rep("event", n.event.cov),
    rep("common", n.common.cov), rep("null", n.null.cov)))
X.cens   <- X[, which(var.type == "cens")]
X.event  <- X[, which(var.type == "event")]
X.common <- X[, which(var.type == "common")]
X.null   <- X[, which(var.type == "null")]

B.cens   <- matrix(rbeta(n.cens.cov, 8, 4)*sign(rnorm(n.cens.cov)), n.cens.cov)
B.event  <- matrix(rbeta(n.event.cov, 20, 2)*sign(rnorm(n.event.cov)),
    n.event.cov)
B.common <- matrix(rbeta(n.common.cov, 6, 4)*sign(rnorm(n.common.cov)),
    n.common.cov)

Eta.cens <- cbind(X.cens, X.common) %*% rbind(B.cens, B.common)
Eta.event <- cbind(X.event, X.common) %*% rbind(B.event, B.common)


The last thing we need to do is sample our censoring and event times: in our current example, we only have two competing processes, censoring and one event. This could easily be generalized to several competing events.

Our event times will be sampled from a Weibull distribution. In the Weibull model, the shape parameter is fixed, and the linear predictor enters through the scale parameter. The data we generate can be adapted to both parametric and non-parametric survival models.

Once we have our events and times, we gather all our variables into a data frame, write it out to a .CSV, and we're ready to go.

# Set scale and shape parameters for distribution
event.shape <- 2
cens.shape  <- 2

event.scale <- .9*censoring.time
cens.scale  <- .9*censoring.time


# Parameterize scale with linear predictor
event.time <- rweibull(n.obs, event.shape, scale = event.scale + Eta.event)
cens.time  <- rweibull(n.obs, event.shape, scale = cens.scale + Eta.cens)

time_to_event <- apply(cbind(event.time, cens.time), 1, min)
censored <- matrix(0, n.obs)
censored[which(time_to_event == cens.time)] <- 1
censored[which(event.time >= censoring.time)] <- 1
time_to_event[which(time_to_event > censoring.time)] <- censoring.time
event <- 1 - censored

surv.data <- data.frame(censored, event, time_to_event, X)

write.csv(surv.data, paste0("Survival sim ", format(Sys.Date(), "%Y-%m-%d"), ".csv"),
    row.names = F)

This will definitely need some tweaks, but it has some potential. With a little modification, we could introduce other factors like left truncation, discrete time scale, and so on. This was a spur-of-the-moment project, so give it a try, tweak it, and let me know the results.
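
For example, moving to a discrete time scale is mostly a matter of binning the simulated times. A minimal sketch using cut() with yearly intervals (the interval width is arbitrary here):

# Bin the continuous event times into yearly intervals for a discrete-time setup
surv.data$interval <- cut(surv.data$time_to_event,
    breaks = seq(0, censoring.time, by = 1),
    include.lowest = TRUE, labels = FALSE)

# Events and censorings per interval
table(surv.data$interval, surv.data$event)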

Tuesday, March 5, 2013

I've got woes from different missingness codes

Now for yet another chapter in... Tales from Statistical Consulting:
A client came in with data from a survey (which shall remain nameless), looking for assistance with evaluating the functional form of a logistic regression. The main covariate of interest was a proportion: it was positively skewed, with a clump of density at one extreme of the data. This clump was still within the potential range of a percentage, but it only assumed one value, and it was distinct from the right tail of the data. From prior experience, I immediately began to suspect that this was not actual data, but metadata - a missingness or exception code. The client had not read the codebook, and didn't believe that anyone would use a plausible value of a variable as a missingness code.
People do strange things, and this is precisely why you should always read the codebooks. Sure enough, this was actually metadata, but until this consultation, it was being treated as actual data. This is an important cautionary tale, not just about the value of codebooks, but about the value of good data set design: try to make exception and missing data codes as obvious as possible.
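
On the analysis side, the fix is simple once the code is known: recode it to NA before any modeling. A toy example (the 0.999 exception code is made up for illustration):

# A plausible-looking value (0.999 here, purely hypothetical) doubling as a missingness code
exposure.prop <- c(0.12, 0.999, 0.34, 0.999, 0.05)
exposure.prop[exposure.prop == 0.999] <- NA  # recode the exception code before modeling
summary(exposure.prop)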

Monday, March 4, 2013

Baseline Imbalance in RCTs: To test or not to test?

In the biomedical research world, stakes don't get any higher than randomized trials, where no stone goes unturned. The notion of confounding and baseline imbalance is a frequent source of heartburn, especially for statisticians, as they're often asked about testing for "baseline balance" and "to see if randomization worked."
When we perform hypothesis tests, we are using a sample to infer about a population, for example:
  • Is the mean of one population equal to a given value? (One sample t-test)
  • Are the means of two independent populations equal? (Two sample t-test)
In a randomized trial, our sample is from one population, which is divided at random into two groups. If we're testing the equality of means between two populations, a natural question to ask is what two populations are we talking about? Even if we could come up with a sensible answer for this question, the usual purpose for these hypothesis tests is about the comparability of the allocated groups, not populations- something that's beyond the scope of hypothesis testing. If we ever reject the null hypothesis, it's by definition a Type I Error, since both groups are from the exact same population.

Let's set aside the muddled logic behind the tests, and see if they give us some practical value. Suppose the p-value from the resulting test is 0.06: does this mean that the variable is not a confounder? Remember, if the covariate is not related to both treatment and response, it cannot be a confounder, and it can be safely ignored. If it is a known confounder from previous studies, we should be adjusting for it anyway, regardless of the degree of balance. Additionally, if we have a large study, we can have a statistically significant result with absolutely no practical significance. The hypothesis test is not only nonsensical, but its results can't even inform us about confounding.

In summary, a statistically significant difference doesn't mean that a variable is a confounder, lack of a statistically significant difference doesn't preclude confounding, and any rejection of the null hypothesis is inherently a false positive, or indicative of deep problems with the allocation mechanism. I can't really see the utility of hypothesis testing for baseline balance in randomized trials, even if somehow there were a coherent justification for it.

If a variable is known to be associated with the outcome of interest, it should be included as a covariate in analyses regardless of balance between groups. This is especially important if the response variable (e.g. blood pressure) is measured at baseline, as past history is usually the best predictor of future events.
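
In practice, that adjustment is just one extra term in the model. A small sketch with simulated (entirely hypothetical) data, adjusting follow-up blood pressure for its baseline value and treatment group:

# Hypothetical trial: follow-up SBP depends on baseline SBP and treatment
set.seed(1)
n <- 200
trial <- data.frame(group = factor(rep(c("Control", "Treatment"), each = n/2)))
trial$sbp.baseline <- rnorm(n, 160, 15)
trial$sbp.followup <- trial$sbp.baseline - 5*(trial$group == "Treatment") + rnorm(n, 0, 10)

# Adjust for the baseline measurement regardless of "balance" - no hypothesis test required
fit <- lm(sbp.followup ~ group + sbp.baseline, data = trial)
summary(fit)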

Don't take my word for it - read the people who know far more than I do:



What's Ailing Introductory Statistics?



Introductory courses are the most important in any academic department. They are often a student's first and only exposure to a discipline, and reach the widest audience of any course in the department. A bad first impression not only cheats people out of a potential career option, but can leave them with a lasting aversion to an entire field. Introductory statistics has the added importance of being the basis of every scientific field – no pressure there.

So why are there so many introductory statistics horror stories? Introductory statistics has a high conceptual overhead, but makes very light computational demands – you can get very far with basic arithmetic, and even further with a little calculus. Without a firm grasp of the conceptual basis, introductory statistics can easily become a vacant exercise in arithmetical procedures.
How do we keep introductory statistics from becoming a rote arithmetic exercise, devoid of all its utility? In my experience, there are a couple concepts that people don’t seem to be grasping and retaining:
  • What is randomness? Where does it come from? How does it fit into research?
  • What’s the difference between probability and statistics?
  • What does a null hypothesis mean? Why can we only collect evidence against it?
  • What are Type I Error and Power? What’s the intuition behind test statistics and null distributions?
There are a lot of challenges in constructing a good introductory statistics curriculum. Lots of introductory statistics books are garbage, and developing good lecture material takes a lot of time, which is often in very short supply. We have a wide audience to reach, from the next generation of statisticians to those who just care about degree requirements and letter grades – our job is to convert as much of the latter as possible into the former. This is a formidable task, but not an impossible one.

Fortunately, the tools available to teachers are evolving. I think R has the potential to make a huge impact in statistics education at all levels, for a number of reasons:
  • Free – students can use it outside of computer labs and after they graduate.
  • Open source – makes tinkering easier, and that's the best way to learn a great deal in a limited time.
  • Community – user groups, bloggers, free open courses, and more, all lending support.
  • R Markdown – weave R into blogs, labs, presentations, and such. When people are copying from a blackboard or slides, they're not spending as much effort listening.
  • Shiny – this seems like an incredibly powerful tool for creating and disseminating interactive didactic tools, which previously could only be accomplished using JavaScript.
R isn’t the only game in town, but I do think it’s the best way forward for students.