Friday, April 12, 2013

R: Table 1 - Done in One?

Every paper going to a biomedical journal has the formulaic Table 1: it's something I've done time and time again- and had to revise, time and time again. Since it's ubiquitous and highly formulaic, it seems like an ideal candidate for automating with a one-line function, but I still see these tables being constructed piecemeal. Typically you see continuous variables presented as means and standard deviations or quartiles, and categorical variables are presented as frequencies - we don't have too many options to choose from here, so why shouldn't we be able to easily construct our table using some sort of formula?

We can use operators to denote each of the types of summaries we'd like to do:
  • + : mean (±SD), for continuous variables that are roughly symmetric.
  • % : frequency, for categorical variables
  • - : IQR/Quartiles, for skewed continuous variables
R formulas may not play nicely with certain symbols, but we don't have to use formulas per se, we can just pass it a string much like a formula, and parse it using regular expressions. Ideally, I'd love to be able to generate Table 1 with a simple call like this:

# Take variables X1-X3 from my.data.frame, and summarize them by the factor "group."
# Present X1 as Mean (±SD), break down X2 by frequency, and display the Quartiles of X3.
table.summary("group ~ +X1 %X2 -X3", my.data.frame)

In essence, we'd be able to construct our summary table using the familiar formula syntax of regression, and Table 1 is done in one line, ready to be beautifully presented using xtable. I haven't had time to fully flesh this out- for all I know, someone's beat me to this, and it's floating about CRAN somewhere, while I'm obliviously re-inventing the wheel.

table.summary <- function (table.fmla, table.labels = NULL, dataset)

  # Replace all whitespace in formula, split at tilde (~)
  table.fmla <- unlist(strsplit(gsub(" ", "", table.fmla), split = "~"))
  group.var <- table.fmla[1]

  # Split formula up by the operators
  table.var <- unlist(strsplit(table.fmla[2],
                               split = "[\\+\\-\\%]", perl = TRUE))[-1] 

  # Find the operators associated with each variable
  fmla.chars <- unlist(strsplit(table.fmla[2], split = ""))
  table.ops <- fmla.chars[fmla.chars %in% c("+", "-", "%")]

  table.sections <- list()
  for (i in 1: length(table.var)){
    table.sections[i] <- switch(table.ops[i],
           "+" = with(dataset, mean.sd(get(table.var[i]), get(group.var))),
           "%" = with(dataset, xtab(get(table.var[i]), get(group.var))),
           "-" = with(dataset, med.iqr(get(table.var[i]), get(group.var))))
  }


# combine elements of list into a data frame, adjust row/column names, and return
...

Basically all this needs is some functions (mean.sd, xtab, and med.iqr) which will get the appropriate statistics, round to proper significant figures, and format them - add parens for the SD, % for percentages, and brackets around the quartiles- a combination of unlist, by, round, format, and paste. We could also add an ability to do paired/unpaired tests, but I think these should be passed as arguments, and not chosen by default to avoid abuse.

It seems like this is something which should exist- I know there are packages out there which can do tables, however I think a simple syntax would really make this highly accessible and highly utilized. I'm sure I'm not the only researcher who yearns for the day when Table 1 is a one-liner.

UPDATE: After a bit of tinkering, I have a working version- there's a couple features I'd like to add, but we can always fine tune down the line. Let's create some data for a motivating example: a randomized trial in Stroke. We'll be summarizing age, gender, neurological status (measured by the Glasgow Coma Scale), and blood pressure:

rm(list = ls(all = T))
set.seed(1)
N <- 150
Age <- rpois(N, 65)
Gender <- cut(rbinom(N, 1, 0.5), breaks = c(-Inf, 0, Inf),
              labels = c("Female", "Male"))
Group <- cut(rbinom(N, 1, 0.5), breaks = c(-Inf, 0, Inf),
             labels = c("Control", "Treatment"))
GCS <- cut(ceiling(runif(N, 2, 15)), breaks = c(-Inf, 8, 12, Inf),
           labels = c("Severe (3-8)", "Moderate (9-12)", "Mild (13-15)"))
SBP <- rpois(N, 160)
DBP <- rpois(N, 100)
stroke <- data.frame(Age, Gender, GCS, SBP, DBP, Group)

# Look at the first few records:
head(stroke)


  Age Gender             GCS SBP DBP     Group
1  59 Female Moderate (9-12) 182 113 Treatment
2  75   Male    Severe (3-8) 161 116 Treatment
3  75   Male Moderate (9-12) 177 116   Control
4  68 Female    Mild (13-15) 167 106 Treatment
5  52   Male    Severe (3-8) 161  83 Treatment
6  68   Male    Severe (3-8) 171 115   Control


Now, let's look at the function to generate the table:

table.summary <- function (table.fmla, dataset, suppress.missing = F) {
  # Replace all whitespace in formula, split at tilde (~)
  table.fmla <- unlist(strsplit(gsub(" ", "", table.fmla), split = "~"))
  group.var <- table.fmla[1]
 
  # Split formula up by the operators
  table.var <- unlist(strsplit(table.fmla[2],
                               split = "[\\+\\-\\%]", perl = TRUE))[-1] 
 
  # Find the operators associated with each variable
  fmla.chars <- unlist(strsplit(table.fmla[2], split = ""))
  table.ops <- fmla.chars[fmla.chars %in% c("+", "-", "%")]

  xtab <- function (variable, group){
    if (suppress.missing == T ) na.flag <- "no" else na.flag <- "ifany"
    freqs <- t(rbind(table(group, variable, useNA = na.flag)))
    props <- round(100*prop.table(freqs, 2),1)
    rownames(freqs)[which(is.na(rownames(freqs)))] <- "Missing"
    freqs <- data.frame(freqs) 
   
    for (i in 1: ncol(freqs)) {
      freqs[,i] <- paste0(freqs[,i], " (", props[,i], "%)")
    }
    colnames(freqs) <-levels(group)
    return(freqs)
  } 
 
  mean.sd <- function (variable, group) {
    means <- paste0(round(matrix(unlist(by(variable, group, mean, na.rm = T))), 1),
                    " (", round(matrix(unlist(by(variable, group, sd, na.rm = T))), 1), ")")
    missing.values <- unlist(by(variable, group, function (x) sum(is.na(x))))
    means <- data.frame(t(cbind(means, missing.values)))
    rownames(means)[1] <- "(±SD)"
    if (sum(is.na(variable)) == 0 | suppress.missing == T) means <- means[-2,]
    colnames(means) <- levels(group)
    return(means)
  }
 

  med.iqr <- function (variable, group) {
    qs <- paste0(round(matrix(unlist(by(variable, group, median, na.rm = T))), 1),
      " [", round(matrix(unlist(by(variable, group, quantile, .25, na.rm = T))), 1),
      ", ", round(matrix(unlist(by(variable, group, quantile, .75, na.rm = T))), 1), "]")
    missing.values <- unlist(by(variable, group, function (x) sum(is.na(x))))
    qs <- data.frame(t(cbind(qs, missing.values)))
    rownames(qs)[1] <- "[IQR]"
    if (sum(is.na(variable)) == 0 | suppress.missing == T) qs <- qs[-2,]
    colnames(qs) <- levels(group)
    return(qs)
  }
   
  table.sum <- NULL
  for (i in 1: length(table.var)) {
    switch(table.ops[i],
           "+" = table.section <-
             with(dataset, mean.sd(get(table.var[i]), get(group.var))),
           "%" = table.section <-
             with(dataset, xtab(get(table.var[i]), get(group.var))),
           "-" = table.section <-
             with(dataset, med.iqr(get(table.var[i]), get(group.var))))

    rownames(table.section) <- paste(gsub("\\s*\\s", " ",
        gsub("*[_|.]", " ", table.var[i])),
        gsub("\\s*\\s", " ",   gsub("*[_|.]", " ", rownames(table.section))))
    table.sum <- rbind(table.sum, table.section)
  }
 
  colnames(table.sum) <- paste0(levels(get(group.var,dataset) ), " (N=",
    table(get(group.var,dataset))[match(levels(get(group.var,dataset)),
    rownames(table(get(group.var,dataset))))], ")")
 
  return(table.sum)
}

For the most part, it's shoe-horning results into a data frame of character data: It's not terribly beautiful to look at, and I'm sure there's a more elegant way about it, but let's see the call and the results:

table1 <- table.summary(table.fmla = "Group ~ +Age %Gender %GCS -SBP -DBP",
              dataset = stroke)
library(xtable)
print(xtable(table1), type = "html")



Control (N=70)Treatment (N=80)
Age (±SD)64.5 (6.8)65.8 (7.9)
Gender Female33 (47.1%)45 (56.2%)
Gender Male37 (52.9%)35 (43.8%)
GCS Severe (3-8)33 (47.1%)36 (45%)
GCS Moderate (9-12)29 (41.4%)21 (26.2%)
GCS Mild (13-15)8 (11.4%)23 (28.7%)
SBP [IQR]160.5 [151.2, 169]158 [147.8, 167]
DBP [IQR]100 [93.2, 107.5]99 [93.8, 107]

Not too shabby for a one-liner! Now, let's look at some missing or incomplete data by tossing out some of our variables:

stroke.incomplete <- stroke
stroke.incomplete$Age[sample(1:N, 4)] <- NA
stroke.incomplete$GCS[sample(1:N, 4)] <- NA
stroke.incomplete$SBP[sample(1:N, 4)] <- NA
stroke.incomplete$DBP[sample(1:N, 4)] <- NA


Let's have a look at the default option: suppress.data = FALSE

# Incomplete Data
table1.m <- table.summary(table.fmla = "Group ~ +Age %Gender %GCS -SBP -DBP",
              dataset = stroke.incomplete)
print(xtable(table1.m), type = "html")



Control (N=70)Treatment (N=80)
Age (±SD)64.7 (6.8)65.9 (7.9)
Age missing values22
Gender Female33 (47.1%)45 (56.2%)
Gender Male37 (52.9%)35 (43.8%)
GCS Severe (3-8)32 (45.7%)35 (43.8%)
GCS Moderate (9-12)28 (40%)20 (25%)
GCS Mild (13-15)8 (11.4%)23 (28.7%)
GCS Missing2 (2.9%)2 (2.5%)
SBP [IQR]161 [151, 169]158 [148, 167]
SBP missing values13
DBP [IQR]100 [93, 108.2]99.5 [94, 107]
DBP missing values22


We can suppress the missingness information by setting suppress.data = TRUE:

# Incomplete Data - Suppress missingness information
table1.sm <- table.summary(table.fmla = "Group ~ +Age %Gender %GCS -SBP -DBP",
              dataset = stroke.incomplete, suppress.missing = T)
print(xtable(table1.sm), type = "html")



Control (N=70)Treatment (N=80)
Age (±SD)64.7 (6.8)65.9 (7.9)
Gender Female33 (47.1%)45 (56.2%)
Gender Male37 (52.9%)35 (43.8%)
GCS Severe (3-8)32 (47.1%)35 (44.9%)
GCS Moderate (9-12)28 (41.2%)20 (25.6%)
GCS Mild (13-15)8 (11.8%)23 (29.5%)
SBP [IQR]161 [151, 169]158 [148, 167]
DBP [IQR]100 [93, 108.2]99.5 [94, 107]


There's a lot of things we could add, like customize the labels, add hypothesis tests, and so fourth, but this is a good start. Maybe next time we can port our results to file formats more conducive to collaboration, editing, and submission.