We can use operators to denote each of the types of summaries we'd like to do:
- + : mean (±SD), for continuous variables that are roughly symmetric.
- % : frequency, for categorical variables
- - : IQR/Quartiles, for skewed continuous variables
# Take variables X1-X3 from my.data.frame, and summarize them by the factor "group."
# Present X1 as Mean (±SD), break down X2 by frequency, and display the Quartiles of X3.
table.summary("group ~ +X1 %X2 -X3", my.data.frame)
In essence, we'd be able to construct our summary table using the familiar formula syntax of regression, and Table 1 is done in one line, ready to be beautifully presented using xtable. I haven't had time to fully flesh this out- for all I know, someone's beat me to this, and it's floating about CRAN somewhere, while I'm obliviously re-inventing the wheel.
table.summary <- function (table.fmla, table.labels = NULL, dataset)
# Replace all whitespace in formula, split at tilde (~)
table.fmla <- unlist(strsplit(gsub(" ", "", table.fmla), split = "~"))
group.var <- table.fmla[1]
# Split formula up by the operators
table.var <- unlist(strsplit(table.fmla[2],
split = "[\\+\\-\\%]", perl = TRUE))[-1]
# Find the operators associated with each variable
fmla.chars <- unlist(strsplit(table.fmla[2], split = ""))
table.ops <- fmla.chars[fmla.chars %in% c("+", "-", "%")]
table.sections <- list()
for (i in 1: length(table.var)){
table.sections[i] <- switch(table.ops[i],
"+" = with(dataset, mean.sd(get(table.var[i]), get(group.var))),
"%" = with(dataset, xtab(get(table.var[i]), get(group.var))),
"-" = with(dataset, med.iqr(get(table.var[i]), get(group.var))))
}
# combine elements of list into a data frame, adjust row/column names, and return
...
Basically all this needs is some functions (mean.sd, xtab, and med.iqr) which will get the appropriate statistics, round to proper significant figures, and format them - add parens for the SD, % for percentages, and brackets around the quartiles- a combination of unlist, by, round, format, and paste. We could also add an ability to do paired/unpaired tests, but I think these should be passed as arguments, and not chosen by default to avoid abuse.
It seems like this is something which should exist- I know there are packages out there which can do tables, however I think a simple syntax would really make this highly accessible and highly utilized. I'm sure I'm not the only researcher who yearns for the day when Table 1 is a one-liner.
UPDATE: After a bit of tinkering, I have a working version- there's a couple features I'd like to add, but we can always fine tune down the line. Let's create some data for a motivating example: a randomized trial in Stroke. We'll be summarizing age, gender, neurological status (measured by the Glasgow Coma Scale), and blood pressure:
rm(list = ls(all = T))
set.seed(1)
N <- 150
Age <- rpois(N, 65)
Gender <- cut(rbinom(N, 1, 0.5), breaks = c(-Inf, 0, Inf),
labels = c("Female", "Male"))
Group <- cut(rbinom(N, 1, 0.5), breaks = c(-Inf, 0, Inf),
labels = c("Control", "Treatment"))
GCS <- cut(ceiling(runif(N, 2, 15)), breaks = c(-Inf, 8, 12, Inf),
labels = c("Severe (3-8)", "Moderate (9-12)", "Mild (13-15)"))
SBP <- rpois(N, 160)
DBP <- rpois(N, 100)
stroke <- data.frame(Age, Gender, GCS, SBP, DBP, Group)
# Look at the first few records:
head(stroke)
Age Gender GCS SBP DBP Group 1 59 Female Moderate (9-12) 182 113 Treatment 2 75 Male Severe (3-8) 161 116 Treatment 3 75 Male Moderate (9-12) 177 116 Control 4 68 Female Mild (13-15) 167 106 Treatment 5 52 Male Severe (3-8) 161 83 Treatment 6 68 Male Severe (3-8) 171 115 Control
Now, let's look at the function to generate the table:
table.summary <- function (table.fmla, dataset, suppress.missing = F) {
# Replace all whitespace in formula, split at tilde (~)
table.fmla <- unlist(strsplit(gsub(" ", "", table.fmla), split = "~"))
group.var <- table.fmla[1]
# Split formula up by the operators
table.var <- unlist(strsplit(table.fmla[2],
split = "[\\+\\-\\%]", perl = TRUE))[-1]
# Find the operators associated with each variable
fmla.chars <- unlist(strsplit(table.fmla[2], split = ""))
table.ops <- fmla.chars[fmla.chars %in% c("+", "-", "%")]
xtab <- function (variable, group){
if (suppress.missing == T ) na.flag <- "no" else na.flag <- "ifany"
freqs <- t(rbind(table(group, variable, useNA = na.flag)))
props <- round(100*prop.table(freqs, 2),1)
rownames(freqs)[which(is.na(rownames(freqs)))] <- "Missing"
freqs <- data.frame(freqs)
for (i in 1: ncol(freqs)) {
freqs[,i] <- paste0(freqs[,i], " (", props[,i], "%)")
}
colnames(freqs) <-levels(group)
return(freqs)
}
mean.sd <- function (variable, group) {
means <- paste0(round(matrix(unlist(by(variable, group, mean, na.rm = T))), 1),
" (", round(matrix(unlist(by(variable, group, sd, na.rm = T))), 1), ")")
missing.values <- unlist(by(variable, group, function (x) sum(is.na(x))))
means <- data.frame(t(cbind(means, missing.values)))
rownames(means)[1] <- "(±SD)"
if (sum(is.na(variable)) == 0 | suppress.missing == T) means <- means[-2,]
colnames(means) <- levels(group)
return(means)
}
med.iqr <- function (variable, group) {
qs <- paste0(round(matrix(unlist(by(variable, group, median, na.rm = T))), 1),
" [", round(matrix(unlist(by(variable, group, quantile, .25, na.rm = T))), 1),
", ", round(matrix(unlist(by(variable, group, quantile, .75, na.rm = T))), 1), "]")
missing.values <- unlist(by(variable, group, function (x) sum(is.na(x))))
qs <- data.frame(t(cbind(qs, missing.values)))
rownames(qs)[1] <- "[IQR]"
if (sum(is.na(variable)) == 0 | suppress.missing == T) qs <- qs[-2,]
colnames(qs) <- levels(group)
return(qs)
}
table.sum <- NULL
for (i in 1: length(table.var)) {
switch(table.ops[i],
"+" = table.section <-
with(dataset, mean.sd(get(table.var[i]), get(group.var))),
"%" = table.section <-
with(dataset, xtab(get(table.var[i]), get(group.var))),
"-" = table.section <-
with(dataset, med.iqr(get(table.var[i]), get(group.var))))
rownames(table.section) <- paste(gsub("\\s*\\s", " ",
gsub("*[_|.]", " ", table.var[i])),
gsub("\\s*\\s", " ", gsub("*[_|.]", " ", rownames(table.section))))
table.sum <- rbind(table.sum, table.section)
}
colnames(table.sum) <- paste0(levels(get(group.var,dataset) ), " (N=",
table(get(group.var,dataset))[match(levels(get(group.var,dataset)),
rownames(table(get(group.var,dataset))))], ")")
return(table.sum)
}
For the most part, it's shoe-horning results into a data frame of character data: It's not terribly beautiful to look at, and I'm sure there's a more elegant way about it, but let's see the call and the results:
table1 <- table.summary(table.fmla = "Group ~ +Age %Gender %GCS -SBP -DBP",
dataset = stroke)
library(xtable)
print(xtable(table1), type = "html")
Control (N=70) | Treatment (N=80) | |
---|---|---|
Age (±SD) | 64.5 (6.8) | 65.8 (7.9) |
Gender Female | 33 (47.1%) | 45 (56.2%) |
Gender Male | 37 (52.9%) | 35 (43.8%) |
GCS Severe (3-8) | 33 (47.1%) | 36 (45%) |
GCS Moderate (9-12) | 29 (41.4%) | 21 (26.2%) |
GCS Mild (13-15) | 8 (11.4%) | 23 (28.7%) |
SBP [IQR] | 160.5 [151.2, 169] | 158 [147.8, 167] |
DBP [IQR] | 100 [93.2, 107.5] | 99 [93.8, 107] |
Not too shabby for a one-liner! Now, let's look at some missing or incomplete data by tossing out some of our variables:
stroke.incomplete <- stroke
stroke.incomplete$Age[sample(1:N, 4)] <- NA
stroke.incomplete$GCS[sample(1:N, 4)] <- NA
stroke.incomplete$SBP[sample(1:N, 4)] <- NA
stroke.incomplete$DBP[sample(1:N, 4)] <- NA
Let's have a look at the default option: suppress.data = FALSE
# Incomplete Data
table1.m <- table.summary(table.fmla = "Group ~ +Age %Gender %GCS -SBP -DBP",
dataset = stroke.incomplete)
print(xtable(table1.m), type = "html")
Control (N=70) | Treatment (N=80) | |
---|---|---|
Age (±SD) | 64.7 (6.8) | 65.9 (7.9) |
Age missing values | 2 | 2 |
Gender Female | 33 (47.1%) | 45 (56.2%) |
Gender Male | 37 (52.9%) | 35 (43.8%) |
GCS Severe (3-8) | 32 (45.7%) | 35 (43.8%) |
GCS Moderate (9-12) | 28 (40%) | 20 (25%) |
GCS Mild (13-15) | 8 (11.4%) | 23 (28.7%) |
GCS Missing | 2 (2.9%) | 2 (2.5%) |
SBP [IQR] | 161 [151, 169] | 158 [148, 167] |
SBP missing values | 1 | 3 |
DBP [IQR] | 100 [93, 108.2] | 99.5 [94, 107] |
DBP missing values | 2 | 2 |
We can suppress the missingness information by setting suppress.data = TRUE:
# Incomplete Data - Suppress missingness information
table1.sm <- table.summary(table.fmla = "Group ~ +Age %Gender %GCS -SBP -DBP",
dataset = stroke.incomplete, suppress.missing = T)
print(xtable(table1.sm), type = "html")
Control (N=70) | Treatment (N=80) | |
---|---|---|
Age (±SD) | 64.7 (6.8) | 65.9 (7.9) |
Gender Female | 33 (47.1%) | 45 (56.2%) |
Gender Male | 37 (52.9%) | 35 (43.8%) |
GCS Severe (3-8) | 32 (47.1%) | 35 (44.9%) |
GCS Moderate (9-12) | 28 (41.2%) | 20 (25.6%) |
GCS Mild (13-15) | 8 (11.8%) | 23 (29.5%) |
SBP [IQR] | 161 [151, 169] | 158 [148, 167] |
DBP [IQR] | 100 [93, 108.2] | 99.5 [94, 107] |
There's a lot of things we could add, like customize the labels, add hypothesis tests, and so fourth, but this is a good start. Maybe next time we can port our results to file formats more conducive to collaboration, editing, and submission.
I have coded something vaguely similar for Stata, because students needed it and it did not exist.
ReplyDeleteI have a crude-ish version we use at the University of Louisville. This function does a fisher's exact for categorical vars and a Mann Whitney U for continuous. All you need is your dataset (data here) and a data dictionary (dictionary) which has the format you want the variable names to have (eg if your variable name is demog_age, in the dictionary you may have one column called "variable" that has demog_age and another column called "label" that has Age so it shows up more professionally in the table). The only major downfall (besides no Chi square or T-test) is that it only works for dichotomous categorical variables. Hopefully this helps someone -- if you use it please feel free to cite our work! You can DL the script here: http://www.wikiupload.com/ER5R3QERE23Y7T2
ReplyDeleteTimothy Wiemken, PhD MPH CIC
Assistant Professor of Medicine
Assistant Director of Epidemiology and Biostatistics
University of Louisville
Division of Infectious Diseases
501 E. Broadway Suite 120
Louisville, KY 40202
Office: 502.852.4627
Fax: 502.852.1147
Email: tlwiem01@louisville.edu
www.iduofl.us
www.ctrsc.net
Thanks, Timothy- I'll give the code a look. I figure if someone gets a good, polished version out there, it'd be a real boon to researchers. It's surprising how tables are still a chore, even with all of the packages out there.
DeleteThanks for the heads up, François- I'll keep that bookmarked. Right now I've only done reports in ODS, R Markdown, and knitR, but Stata + TeX is next in line.
ReplyDeleteI totally agree. Please let me know what you think and if you make any changes. We are constantly tweaking everything to try to make it easier to do these. I probably do at least 4 of these a week -- painful!
ReplyDeleteNot my work, but have a look at this http://rpubs.com/kaz_yos/tableone-demo-e
ReplyDelete