Asymptotically Unbiased: R: Table 1

Every paper going to a biomedical journal has the formulaic Table 1: it's something I've done time and time again- and had to revise, time and time again. Since it's ubiquitous and highly formulaic, it seems like an ideal candidate for automating with a one-line function, but I still see these tables being constructed piecemeal. Typically you see continuous variables presented as means and standard deviations or quartiles, and categorical variables are presented as frequencies - we don't have too many options to choose from here, so why shouldn't we be able to easily construct our table using some sort of formula?

We can use operators to denote each of the types of summaries we'd like to do:

+ : mean (±SD), for continuous variables that are roughly symmetric.
% : frequency, for categorical variables
- : IQR/Quartiles, for skewed continuous variables

R formulas may not play nicely with certain symbols, but we don't have to use formulas per se, we can just pass it a string much like a formula, and parse it using regular expressions. Ideally, I'd love to be able to generate Table 1 with a simple call like this:

# Take variables X1-X3 from my.data.frame, and summarize them by the factor "group."
# Present X1 as Mean (±SD), break down X2 by frequency, and display the Quartiles of X3.
table.summary("group ~ +X1 %X2 -X3", my.data.frame)

In essence, we'd be able to construct our summary table using the familiar formula syntax of regression, and Table 1 is done in one line, ready to be beautifully presented using xtable. I haven't had time to fully flesh this out- for all I know, someone's beat me to this, and it's floating about CRAN somewhere, while I'm obliviously re-inventing the wheel.

table.summary <- function (table.fmla, table.labels = NULL, dataset)

# Replace all whitespace in formula, split at tilde (~)
table.fmla <- unlist(strsplit(gsub(" ", "", table.fmla), split = "~"))
group.var <- table.fmla[1]

# Split formula up by the operators
table.var <- unlist(strsplit(table.fmla[2],
                               split = "[\\+\\-\\%]", perl = TRUE))[-1]

# Find the operators associated with each variable
fmla.chars <- unlist(strsplit(table.fmla[2], split = ""))
table.ops <- fmla.chars[fmla.chars %in% c("+", "-", "%")]

table.sections <- list()
for (i in 1: length(table.var)){
    table.sections[i] <- switch(table.ops[i],
           "+" = with(dataset, mean.sd(get(table.var[i]), get(group.var))),
           "%" = with(dataset, xtab(get(table.var[i]), get(group.var))),
           "-" = with(dataset, med.iqr(get(table.var[i]), get(group.var))))
}

# combine elements of list into a data frame, adjust row/column names, and return
...

Basically all this needs is some functions (mean.sd, xtab, and med.iqr) which will get the appropriate statistics, round to proper significant figures, and format them - add parens for the SD, % for percentages, and brackets around the quartiles- a combination of unlist, by, round, format, and paste. We could also add an ability to do paired/unpaired tests, but I think these should be passed as arguments, and not chosen by default to avoid abuse.

It seems like this is something which should exist- I know there are packages out there which can do tables, however I think a simple syntax would really make this highly accessible and highly utilized. I'm sure I'm not the only researcher who yearns for the day when Table 1 is a one-liner.

UPDATE: After a bit of tinkering, I have a working version- there's a couple features I'd like to add, but we can always fine tune down the line. Let's create some data for a motivating example: a randomized trial in Stroke. We'll be summarizing age, gender, neurological status (measured by the Glasgow Coma Scale), and blood pressure:

rm(list = ls(all = T))
set.seed(1)
N <- 150
Age <- rpois(N, 65)
Gender <- cut(rbinom(N, 1, 0.5), breaks = c(-Inf, 0, Inf),
              labels = c("Female", "Male"))
Group <- cut(rbinom(N, 1, 0.5), breaks = c(-Inf, 0, Inf),
             labels = c("Control", "Treatment"))
GCS <- cut(ceiling(runif(N, 2, 15)), breaks = c(-Inf, 8, 12, Inf),
           labels = c("Severe (3-8)", "Moderate (9-12)", "Mild (13-15)"))
SBP <- rpois(N, 160)
DBP <- rpois(N, 100)
stroke <- data.frame(Age, Gender, GCS, SBP, DBP, Group)

# Look at the first few records:
head(stroke)

  Age Gender             GCS SBP DBP     Group
1  59 Female Moderate (9-12) 182 113 Treatment
2  75   Male    Severe (3-8) 161 116 Treatment
3  75   Male Moderate (9-12) 177 116   Control
4  68 Female    Mild (13-15) 167 106 Treatment
5  52   Male    Severe (3-8) 161  83 Treatment
6  68   Male    Severe (3-8) 171 115   Control

Now, let's look at the function to generate the table:

table.summary <- function (table.fmla, dataset, suppress.missing = F) {
# Replace all whitespace in formula, split at tilde (~)
table.fmla <- unlist(strsplit(gsub(" ", "", table.fmla), split = "~"))
group.var <- table.fmla[1]

# Split formula up by the operators
table.var <- unlist(strsplit(table.fmla[2],
                               split = "[\\+\\-\\%]", perl = TRUE))[-1]

# Find the operators associated with each variable
fmla.chars <- unlist(strsplit(table.fmla[2], split = ""))
table.ops <- fmla.chars[fmla.chars %in% c("+", "-", "%")]

xtab <- function (variable, group){
    if (suppress.missing == T ) na.flag <- "no" else na.flag <- "ifany"
    freqs <- t(rbind(table(group, variable, useNA = na.flag)))
    props <- round(100*prop.table(freqs, 2),1)
    rownames(freqs)[which(is.na(rownames(freqs)))] <- "Missing"
    freqs <- data.frame(freqs)

    for (i in 1: ncol(freqs)) {
      freqs[,i] <- paste0(freqs[,i], " (", props[,i], "%)")
    }
    colnames(freqs) <-levels(group)
    return(freqs)
}

mean.sd <- function (variable, group) {
    means <- paste0(round(matrix(unlist(by(variable, group, mean, na.rm = T))), 1),
                    " (", round(matrix(unlist(by(variable, group, sd, na.rm = T))), 1), ")")
    missing.values <- unlist(by(variable, group, function (x) sum(is.na(x))))
    means <- data.frame(t(cbind(means, missing.values)))
    rownames(means)[1] <- "(±SD)"
    if (sum(is.na(variable)) == 0 | suppress.missing == T) means <- means[-2,]
    colnames(means) <- levels(group)
    return(means)
}

med.iqr <- function (variable, group) {
    qs <- paste0(round(matrix(unlist(by(variable, group, median, na.rm = T))), 1),
      " [", round(matrix(unlist(by(variable, group, quantile, .25, na.rm = T))), 1),
      ", ", round(matrix(unlist(by(variable, group, quantile, .75, na.rm = T))), 1), "]")
    missing.values <- unlist(by(variable, group, function (x) sum(is.na(x))))
    qs <- data.frame(t(cbind(qs, missing.values)))
    rownames(qs)[1] <- "[IQR]"
    if (sum(is.na(variable)) == 0 | suppress.missing == T) qs <- qs[-2,]
    colnames(qs) <- levels(group)
    return(qs)
}

table.sum <- NULL
for (i in 1: length(table.var)) {
    switch(table.ops[i],
           "+" = table.section <-
             with(dataset, mean.sd(get(table.var[i]), get(group.var))),
           "%" = table.section <-
             with(dataset, xtab(get(table.var[i]), get(group.var))),
           "-" = table.section <-
             with(dataset, med.iqr(get(table.var[i]), get(group.var))))

    rownames(table.section) <- paste(gsub("\\s*\\s", " ",
        gsub("*[_|.]", " ", table.var[i])),
        gsub("\\s*\\s", " ",   gsub("*[_|.]", " ", rownames(table.section))))
    table.sum <- rbind(table.sum, table.section)
}

colnames(table.sum) <- paste0(levels(get(group.var,dataset) ), " (N=",
    table(get(group.var,dataset))[match(levels(get(group.var,dataset)),
    rownames(table(get(group.var,dataset))))], ")")

return(table.sum)
}
For the most part, it's shoe-horning results into a data frame of character data: It's not terribly beautiful to look at, and I'm sure there's a more elegant way about it, but let's see the call and the results:

table1 <- table.summary(table.fmla = "Group ~ +Age %Gender %GCS -SBP -DBP",
              dataset = stroke)
library(xtable)
print(xtable(table1), type = "html")

	Control (N=70)	Treatment (N=80)
Age (±SD)	64.5 (6.8)	65.8 (7.9)
Gender Female	33 (47.1%)	45 (56.2%)
Gender Male	37 (52.9%)	35 (43.8%)
GCS Severe (3-8)	33 (47.1%)	36 (45%)
GCS Moderate (9-12)	29 (41.4%)	21 (26.2%)
GCS Mild (13-15)	8 (11.4%)	23 (28.7%)
SBP [IQR]	160.5 [151.2, 169]	158 [147.8, 167]
DBP [IQR]	100 [93.2, 107.5]	99 [93.8, 107]

Not too shabby for a one-liner! Now, let's look at some missing or incomplete data by tossing out some of our variables:

stroke.incomplete <- stroke
stroke.incomplete$Age[sample(1:N, 4)] <- NA
stroke.incomplete$GCS[sample(1:N, 4)] <- NA
stroke.incomplete$SBP[sample(1:N, 4)] <- NA
stroke.incomplete$DBP[sample(1:N, 4)] <- NA

Let's have a look at the default option: suppress.data = FALSE

# Incomplete Data
table1.m <- table.summary(table.fmla = "Group ~ +Age %Gender %GCS -SBP -DBP",
dataset = stroke.incomplete)
print(xtable(table1.m), type = "html")

	Control (N=70)	Treatment (N=80)
Age (±SD)	64.7 (6.8)	65.9 (7.9)
Age missing values	2	2
Gender Female	33 (47.1%)	45 (56.2%)
Gender Male	37 (52.9%)	35 (43.8%)
GCS Severe (3-8)	32 (45.7%)	35 (43.8%)
GCS Moderate (9-12)	28 (40%)	20 (25%)
GCS Mild (13-15)	8 (11.4%)	23 (28.7%)
GCS Missing	2 (2.9%)	2 (2.5%)
SBP [IQR]	161 [151, 169]	158 [148, 167]
SBP missing values	1	3
DBP [IQR]	100 [93, 108.2]	99.5 [94, 107]
DBP missing values	2	2

We can suppress the missingness information by setting suppress.data = TRUE:

# Incomplete Data - Suppress missingness information
table1.sm <- table.summary(table.fmla = "Group ~ +Age %Gender %GCS -SBP -DBP",
dataset = stroke.incomplete, suppress.missing = T)
print(xtable(table1.sm), type = "html")

	Control (N=70)	Treatment (N=80)
Age (±SD)	64.7 (6.8)	65.9 (7.9)
Gender Female	33 (47.1%)	45 (56.2%)
Gender Male	37 (52.9%)	35 (43.8%)
GCS Severe (3-8)	32 (47.1%)	35 (44.9%)
GCS Moderate (9-12)	28 (41.2%)	20 (25.6%)
GCS Mild (13-15)	8 (11.8%)	23 (29.5%)
SBP [IQR]	161 [151, 169]	158 [148, 167]
DBP [IQR]	100 [93, 108.2]	99.5 [94, 107]

There's a lot of things we could add, like customize the labels, add hypothesis tests, and so fourth, but this is a good start. Maybe next time we can port our results to file formats more conducive to collaboration, editing, and submission.

6 comments:

Fr.April 13, 2013 at 4:49 PM
I have coded something vaguely similar for Stata, because students needed it and it did not exist.
UnknownApril 15, 2013 at 5:51 AM
I have a crude-ish version we use at the University of Louisville. This function does a fisher's exact for categorical vars and a Mann Whitney U for continuous. All you need is your dataset (data here) and a data dictionary (dictionary) which has the format you want the variable names to have (eg if your variable name is demog_age, in the dictionary you may have one column called "variable" that has demog_age and another column called "label" that has Age so it shows up more professionally in the table). The only major downfall (besides no Chi square or T-test) is that it only works for dichotomous categorical variables. Hopefully this helps someone -- if you use it please feel free to cite our work! You can DL the script here: http://www.wikiupload.com/ER5R3QERE23Y7T2

Timothy Wiemken, PhD MPH CIC
Assistant Professor of Medicine
Assistant Director of Epidemiology and Biostatistics
University of Louisville
Division of Infectious Diseases
501 E. Broadway Suite 120
Louisville, KY 40202
Office: 502.852.4627
Fax: 502.852.1147
Email: tlwiem01@louisville.edu
www.iduofl.us
www.ctrsc.net
Asymptotically UnbiasedApril 15, 2013 at 6:45 PM
Thanks for the heads up, François- I'll keep that bookmarked. Right now I've only done reports in ODS, R Markdown, and knitR, but Stata + TeX is next in line.
UnknownApril 29, 2013 at 7:13 AM
I totally agree. Please let me know what you think and if you make any changes. We are constantly tweaking everything to try to make it easier to do these. I probably do at least 4 of these a week -- painful!
UnknownOctober 9, 2014 at 2:31 PM
Not my work, but have a look at this http://rpubs.com/kaz_yos/tableone-demo-e

Asymptotically Unbiased

Friday, April 12, 2013

R: Table 1 - Done in One?

6 comments: