Friday, April 12, 2013

R: Table 1 - Done in One?

Every paper going to a biomedical journal has the formulaic Table 1: it's something I've done time and time again- and had to revise, time and time again. Since it's ubiquitous and highly formulaic, it seems like an ideal candidate for automating with a one-line function, but I still see these tables being constructed piecemeal. Typically you see continuous variables presented as means and standard deviations or quartiles, and categorical variables are presented as frequencies - we don't have too many options to choose from here, so why shouldn't we be able to easily construct our table using some sort of formula?

We can use operators to denote each of the types of summaries we'd like to do:
  • + : mean (±SD), for continuous variables that are roughly symmetric.
  • % : frequency, for categorical variables
  • - : IQR/Quartiles, for skewed continuous variables
R formulas may not play nicely with certain symbols, but we don't have to use formulas per se, we can just pass it a string much like a formula, and parse it using regular expressions. Ideally, I'd love to be able to generate Table 1 with a simple call like this:

# Take variables X1-X3 from my.data.frame, and summarize them by the factor "group."
# Present X1 as Mean (±SD), break down X2 by frequency, and display the Quartiles of X3.
table.summary("group ~ +X1 %X2 -X3", my.data.frame)

In essence, we'd be able to construct our summary table using the familiar formula syntax of regression, and Table 1 is done in one line, ready to be beautifully presented using xtable. I haven't had time to fully flesh this out- for all I know, someone's beat me to this, and it's floating about CRAN somewhere, while I'm obliviously re-inventing the wheel.

table.summary <- function (table.fmla, table.labels = NULL, dataset)

  # Replace all whitespace in formula, split at tilde (~)
  table.fmla <- unlist(strsplit(gsub(" ", "", table.fmla), split = "~"))
  group.var <- table.fmla[1]

  # Split formula up by the operators
  table.var <- unlist(strsplit(table.fmla[2],
                               split = "[\\+\\-\\%]", perl = TRUE))[-1] 

  # Find the operators associated with each variable
  fmla.chars <- unlist(strsplit(table.fmla[2], split = ""))
  table.ops <- fmla.chars[fmla.chars %in% c("+", "-", "%")]

  table.sections <- list()
  for (i in 1: length(table.var)){
    table.sections[i] <- switch(table.ops[i],
           "+" = with(dataset, mean.sd(get(table.var[i]), get(group.var))),
           "%" = with(dataset, xtab(get(table.var[i]), get(group.var))),
           "-" = with(dataset, med.iqr(get(table.var[i]), get(group.var))))
  }


# combine elements of list into a data frame, adjust row/column names, and return
...

Basically all this needs is some functions (mean.sd, xtab, and med.iqr) which will get the appropriate statistics, round to proper significant figures, and format them - add parens for the SD, % for percentages, and brackets around the quartiles- a combination of unlist, by, round, format, and paste. We could also add an ability to do paired/unpaired tests, but I think these should be passed as arguments, and not chosen by default to avoid abuse.

It seems like this is something which should exist- I know there are packages out there which can do tables, however I think a simple syntax would really make this highly accessible and highly utilized. I'm sure I'm not the only researcher who yearns for the day when Table 1 is a one-liner.

UPDATE: After a bit of tinkering, I have a working version- there's a couple features I'd like to add, but we can always fine tune down the line. Let's create some data for a motivating example: a randomized trial in Stroke. We'll be summarizing age, gender, neurological status (measured by the Glasgow Coma Scale), and blood pressure:

rm(list = ls(all = T))
set.seed(1)
N <- 150
Age <- rpois(N, 65)
Gender <- cut(rbinom(N, 1, 0.5), breaks = c(-Inf, 0, Inf),
              labels = c("Female", "Male"))
Group <- cut(rbinom(N, 1, 0.5), breaks = c(-Inf, 0, Inf),
             labels = c("Control", "Treatment"))
GCS <- cut(ceiling(runif(N, 2, 15)), breaks = c(-Inf, 8, 12, Inf),
           labels = c("Severe (3-8)", "Moderate (9-12)", "Mild (13-15)"))
SBP <- rpois(N, 160)
DBP <- rpois(N, 100)
stroke <- data.frame(Age, Gender, GCS, SBP, DBP, Group)

# Look at the first few records:
head(stroke)


  Age Gender             GCS SBP DBP     Group
1  59 Female Moderate (9-12) 182 113 Treatment
2  75   Male    Severe (3-8) 161 116 Treatment
3  75   Male Moderate (9-12) 177 116   Control
4  68 Female    Mild (13-15) 167 106 Treatment
5  52   Male    Severe (3-8) 161  83 Treatment
6  68   Male    Severe (3-8) 171 115   Control


Now, let's look at the function to generate the table:

table.summary <- function (table.fmla, dataset, suppress.missing = F) {
  # Replace all whitespace in formula, split at tilde (~)
  table.fmla <- unlist(strsplit(gsub(" ", "", table.fmla), split = "~"))
  group.var <- table.fmla[1]
 
  # Split formula up by the operators
  table.var <- unlist(strsplit(table.fmla[2],
                               split = "[\\+\\-\\%]", perl = TRUE))[-1] 
 
  # Find the operators associated with each variable
  fmla.chars <- unlist(strsplit(table.fmla[2], split = ""))
  table.ops <- fmla.chars[fmla.chars %in% c("+", "-", "%")]

  xtab <- function (variable, group){
    if (suppress.missing == T ) na.flag <- "no" else na.flag <- "ifany"
    freqs <- t(rbind(table(group, variable, useNA = na.flag)))
    props <- round(100*prop.table(freqs, 2),1)
    rownames(freqs)[which(is.na(rownames(freqs)))] <- "Missing"
    freqs <- data.frame(freqs) 
   
    for (i in 1: ncol(freqs)) {
      freqs[,i] <- paste0(freqs[,i], " (", props[,i], "%)")
    }
    colnames(freqs) <-levels(group)
    return(freqs)
  } 
 
  mean.sd <- function (variable, group) {
    means <- paste0(round(matrix(unlist(by(variable, group, mean, na.rm = T))), 1),
                    " (", round(matrix(unlist(by(variable, group, sd, na.rm = T))), 1), ")")
    missing.values <- unlist(by(variable, group, function (x) sum(is.na(x))))
    means <- data.frame(t(cbind(means, missing.values)))
    rownames(means)[1] <- "(±SD)"
    if (sum(is.na(variable)) == 0 | suppress.missing == T) means <- means[-2,]
    colnames(means) <- levels(group)
    return(means)
  }
 

  med.iqr <- function (variable, group) {
    qs <- paste0(round(matrix(unlist(by(variable, group, median, na.rm = T))), 1),
      " [", round(matrix(unlist(by(variable, group, quantile, .25, na.rm = T))), 1),
      ", ", round(matrix(unlist(by(variable, group, quantile, .75, na.rm = T))), 1), "]")
    missing.values <- unlist(by(variable, group, function (x) sum(is.na(x))))
    qs <- data.frame(t(cbind(qs, missing.values)))
    rownames(qs)[1] <- "[IQR]"
    if (sum(is.na(variable)) == 0 | suppress.missing == T) qs <- qs[-2,]
    colnames(qs) <- levels(group)
    return(qs)
  }
   
  table.sum <- NULL
  for (i in 1: length(table.var)) {
    switch(table.ops[i],
           "+" = table.section <-
             with(dataset, mean.sd(get(table.var[i]), get(group.var))),
           "%" = table.section <-
             with(dataset, xtab(get(table.var[i]), get(group.var))),
           "-" = table.section <-
             with(dataset, med.iqr(get(table.var[i]), get(group.var))))

    rownames(table.section) <- paste(gsub("\\s*\\s", " ",
        gsub("*[_|.]", " ", table.var[i])),
        gsub("\\s*\\s", " ",   gsub("*[_|.]", " ", rownames(table.section))))
    table.sum <- rbind(table.sum, table.section)
  }
 
  colnames(table.sum) <- paste0(levels(get(group.var,dataset) ), " (N=",
    table(get(group.var,dataset))[match(levels(get(group.var,dataset)),
    rownames(table(get(group.var,dataset))))], ")")
 
  return(table.sum)
}

For the most part, it's shoe-horning results into a data frame of character data: It's not terribly beautiful to look at, and I'm sure there's a more elegant way about it, but let's see the call and the results:

table1 <- table.summary(table.fmla = "Group ~ +Age %Gender %GCS -SBP -DBP",
              dataset = stroke)
library(xtable)
print(xtable(table1), type = "html")



Control (N=70)Treatment (N=80)
Age (±SD)64.5 (6.8)65.8 (7.9)
Gender Female33 (47.1%)45 (56.2%)
Gender Male37 (52.9%)35 (43.8%)
GCS Severe (3-8)33 (47.1%)36 (45%)
GCS Moderate (9-12)29 (41.4%)21 (26.2%)
GCS Mild (13-15)8 (11.4%)23 (28.7%)
SBP [IQR]160.5 [151.2, 169]158 [147.8, 167]
DBP [IQR]100 [93.2, 107.5]99 [93.8, 107]

Not too shabby for a one-liner! Now, let's look at some missing or incomplete data by tossing out some of our variables:

stroke.incomplete <- stroke
stroke.incomplete$Age[sample(1:N, 4)] <- NA
stroke.incomplete$GCS[sample(1:N, 4)] <- NA
stroke.incomplete$SBP[sample(1:N, 4)] <- NA
stroke.incomplete$DBP[sample(1:N, 4)] <- NA


Let's have a look at the default option: suppress.data = FALSE

# Incomplete Data
table1.m <- table.summary(table.fmla = "Group ~ +Age %Gender %GCS -SBP -DBP",
              dataset = stroke.incomplete)
print(xtable(table1.m), type = "html")



Control (N=70)Treatment (N=80)
Age (±SD)64.7 (6.8)65.9 (7.9)
Age missing values22
Gender Female33 (47.1%)45 (56.2%)
Gender Male37 (52.9%)35 (43.8%)
GCS Severe (3-8)32 (45.7%)35 (43.8%)
GCS Moderate (9-12)28 (40%)20 (25%)
GCS Mild (13-15)8 (11.4%)23 (28.7%)
GCS Missing2 (2.9%)2 (2.5%)
SBP [IQR]161 [151, 169]158 [148, 167]
SBP missing values13
DBP [IQR]100 [93, 108.2]99.5 [94, 107]
DBP missing values22


We can suppress the missingness information by setting suppress.data = TRUE:

# Incomplete Data - Suppress missingness information
table1.sm <- table.summary(table.fmla = "Group ~ +Age %Gender %GCS -SBP -DBP",
              dataset = stroke.incomplete, suppress.missing = T)
print(xtable(table1.sm), type = "html")



Control (N=70)Treatment (N=80)
Age (±SD)64.7 (6.8)65.9 (7.9)
Gender Female33 (47.1%)45 (56.2%)
Gender Male37 (52.9%)35 (43.8%)
GCS Severe (3-8)32 (47.1%)35 (44.9%)
GCS Moderate (9-12)28 (41.2%)20 (25.6%)
GCS Mild (13-15)8 (11.8%)23 (29.5%)
SBP [IQR]161 [151, 169]158 [148, 167]
DBP [IQR]100 [93, 108.2]99.5 [94, 107]


There's a lot of things we could add, like customize the labels, add hypothesis tests, and so fourth, but this is a good start. Maybe next time we can port our results to file formats more conducive to collaboration, editing, and submission.

6 comments:

  1. I have coded something vaguely similar for Stata, because students needed it and it did not exist.

    ReplyDelete
  2. I have a crude-ish version we use at the University of Louisville. This function does a fisher's exact for categorical vars and a Mann Whitney U for continuous. All you need is your dataset (data here) and a data dictionary (dictionary) which has the format you want the variable names to have (eg if your variable name is demog_age, in the dictionary you may have one column called "variable" that has demog_age and another column called "label" that has Age so it shows up more professionally in the table). The only major downfall (besides no Chi square or T-test) is that it only works for dichotomous categorical variables. Hopefully this helps someone -- if you use it please feel free to cite our work! You can DL the script here: http://www.wikiupload.com/ER5R3QERE23Y7T2

    Timothy Wiemken, PhD MPH CIC
    Assistant Professor of Medicine
    Assistant Director of Epidemiology and Biostatistics
    University of Louisville
    Division of Infectious Diseases
    501 E. Broadway Suite 120
    Louisville, KY 40202
    Office: 502.852.4627
    Fax: 502.852.1147
    Email: tlwiem01@louisville.edu
    www.iduofl.us
    www.ctrsc.net

    ReplyDelete
    Replies
    1. Thanks, Timothy- I'll give the code a look. I figure if someone gets a good, polished version out there, it'd be a real boon to researchers. It's surprising how tables are still a chore, even with all of the packages out there.

      Delete
  3. Thanks for the heads up, Fran├žois- I'll keep that bookmarked. Right now I've only done reports in ODS, R Markdown, and knitR, but Stata + TeX is next in line.

    ReplyDelete
  4. I totally agree. Please let me know what you think and if you make any changes. We are constantly tweaking everything to try to make it easier to do these. I probably do at least 4 of these a week -- painful!

    ReplyDelete
  5. Not my work, but have a look at this http://rpubs.com/kaz_yos/tableone-demo-e

    ReplyDelete