We can use operators to denote each of the types of summaries we'd like to do:
- + : mean (±SD), for continuous variables that are roughly symmetric.
- % : frequency, for categorical variables
- - : IQR/Quartiles, for skewed continuous variables
# Take variables X1-X3 from my.data.frame, and summarize them by the factor "group."
# Present X1 as Mean (±SD), break down X2 by frequency, and display the Quartiles of X3.
table.summary("group ~ +X1 %X2 -X3", my.data.frame)
In essence, we'd be able to construct our summary table using the familiar formula syntax of regression, and Table 1 is done in one line, ready to be beautifully presented using xtable. I haven't had time to fully flesh this out- for all I know, someone's beat me to this, and it's floating about CRAN somewhere, while I'm obliviously re-inventing the wheel.
table.summary <- function (table.fmla, table.labels = NULL, dataset)
# Replace all whitespace in formula, split at tilde (~)
table.fmla <- unlist(strsplit(gsub(" ", "", table.fmla), split = "~"))
group.var <- table.fmla[1]
# Split formula up by the operators
table.var <- unlist(strsplit(table.fmla[2],
split = "[\\+\\-\\%]", perl = TRUE))[-1]
# Find the operators associated with each variable
fmla.chars <- unlist(strsplit(table.fmla[2], split = ""))
table.ops <- fmla.chars[fmla.chars %in% c("+", "-", "%")]
table.sections <- list()
for (i in 1: length(table.var)){
table.sections[i] <- switch(table.ops[i],
"+" = with(dataset, mean.sd(get(table.var[i]), get(group.var))),
"%" = with(dataset, xtab(get(table.var[i]), get(group.var))),
"-" = with(dataset, med.iqr(get(table.var[i]), get(group.var))))
}
# combine elements of list into a data frame, adjust row/column names, and return
...
Basically all this needs is some functions (mean.sd, xtab, and med.iqr) which will get the appropriate statistics, round to proper significant figures, and format them - add parens for the SD, % for percentages, and brackets around the quartiles- a combination of unlist, by, round, format, and paste. We could also add an ability to do paired/unpaired tests, but I think these should be passed as arguments, and not chosen by default to avoid abuse.
It seems like this is something which should exist- I know there are packages out there which can do tables, however I think a simple syntax would really make this highly accessible and highly utilized. I'm sure I'm not the only researcher who yearns for the day when Table 1 is a one-liner.
UPDATE: After a bit of tinkering, I have a working version- there's a couple features I'd like to add, but we can always fine tune down the line. Let's create some data for a motivating example: a randomized trial in Stroke. We'll be summarizing age, gender, neurological status (measured by the Glasgow Coma Scale), and blood pressure:
rm(list = ls(all = T))
set.seed(1)
N <- 150
Age <- rpois(N, 65)
Gender <- cut(rbinom(N, 1, 0.5), breaks = c(-Inf, 0, Inf),
labels = c("Female", "Male"))
Group <- cut(rbinom(N, 1, 0.5), breaks = c(-Inf, 0, Inf),
labels = c("Control", "Treatment"))
GCS <- cut(ceiling(runif(N, 2, 15)), breaks = c(-Inf, 8, 12, Inf),
labels = c("Severe (3-8)", "Moderate (9-12)", "Mild (13-15)"))
SBP <- rpois(N, 160)
DBP <- rpois(N, 100)
stroke <- data.frame(Age, Gender, GCS, SBP, DBP, Group)
# Look at the first few records:
head(stroke)
Age Gender GCS SBP DBP Group 1 59 Female Moderate (9-12) 182 113 Treatment 2 75 Male Severe (3-8) 161 116 Treatment 3 75 Male Moderate (9-12) 177 116 Control 4 68 Female Mild (13-15) 167 106 Treatment 5 52 Male Severe (3-8) 161 83 Treatment 6 68 Male Severe (3-8) 171 115 Control
Now, let's look at the function to generate the table:
table.summary <- function (table.fmla, dataset, suppress.missing = F) {
# Replace all whitespace in formula, split at tilde (~)
table.fmla <- unlist(strsplit(gsub(" ", "", table.fmla), split = "~"))
group.var <- table.fmla[1]
# Split formula up by the operators
table.var <- unlist(strsplit(table.fmla[2],
split = "[\\+\\-\\%]", perl = TRUE))[-1]
# Find the operators associated with each variable
fmla.chars <- unlist(strsplit(table.fmla[2], split = ""))
table.ops <- fmla.chars[fmla.chars %in% c("+", "-", "%")]
xtab <- function (variable, group){
if (suppress.missing == T ) na.flag <- "no" else na.flag <- "ifany"
freqs <- t(rbind(table(group, variable, useNA = na.flag)))
props <- round(100*prop.table(freqs, 2),1)
rownames(freqs)[which(is.na(rownames(freqs)))] <- "Missing"
freqs <- data.frame(freqs)
for (i in 1: ncol(freqs)) {
freqs[,i] <- paste0(freqs[,i], " (", props[,i], "%)")
}
colnames(freqs) <-levels(group)
return(freqs)
}
mean.sd <- function (variable, group) {
means <- paste0(round(matrix(unlist(by(variable, group, mean, na.rm = T))), 1),
" (", round(matrix(unlist(by(variable, group, sd, na.rm = T))), 1), ")")
missing.values <- unlist(by(variable, group, function (x) sum(is.na(x))))
means <- data.frame(t(cbind(means, missing.values)))
rownames(means)[1] <- "(±SD)"
if (sum(is.na(variable)) == 0 | suppress.missing == T) means <- means[-2,]
colnames(means) <- levels(group)
return(means)
}
med.iqr <- function (variable, group) {
qs <- paste0(round(matrix(unlist(by(variable, group, median, na.rm = T))), 1),
" [", round(matrix(unlist(by(variable, group, quantile, .25, na.rm = T))), 1),
", ", round(matrix(unlist(by(variable, group, quantile, .75, na.rm = T))), 1), "]")
missing.values <- unlist(by(variable, group, function (x) sum(is.na(x))))
qs <- data.frame(t(cbind(qs, missing.values)))
rownames(qs)[1] <- "[IQR]"
if (sum(is.na(variable)) == 0 | suppress.missing == T) qs <- qs[-2,]
colnames(qs) <- levels(group)
return(qs)
}
table.sum <- NULL
for (i in 1: length(table.var)) {
switch(table.ops[i],
"+" = table.section <-
with(dataset, mean.sd(get(table.var[i]), get(group.var))),
"%" = table.section <-
with(dataset, xtab(get(table.var[i]), get(group.var))),
"-" = table.section <-
with(dataset, med.iqr(get(table.var[i]), get(group.var))))
rownames(table.section) <- paste(gsub("\\s*\\s", " ",
gsub("*[_|.]", " ", table.var[i])),
gsub("\\s*\\s", " ", gsub("*[_|.]", " ", rownames(table.section))))
table.sum <- rbind(table.sum, table.section)
}
colnames(table.sum) <- paste0(levels(get(group.var,dataset) ), " (N=",
table(get(group.var,dataset))[match(levels(get(group.var,dataset)),
rownames(table(get(group.var,dataset))))], ")")
return(table.sum)
}
For the most part, it's shoe-horning results into a data frame of character data: It's not terribly beautiful to look at, and I'm sure there's a more elegant way about it, but let's see the call and the results:
table1 <- table.summary(table.fmla = "Group ~ +Age %Gender %GCS -SBP -DBP",
dataset = stroke)
library(xtable)
print(xtable(table1), type = "html")
Control (N=70) | Treatment (N=80) | |
---|---|---|
Age (±SD) | 64.5 (6.8) | 65.8 (7.9) |
Gender Female | 33 (47.1%) | 45 (56.2%) |
Gender Male | 37 (52.9%) | 35 (43.8%) |
GCS Severe (3-8) | 33 (47.1%) | 36 (45%) |
GCS Moderate (9-12) | 29 (41.4%) | 21 (26.2%) |
GCS Mild (13-15) | 8 (11.4%) | 23 (28.7%) |
SBP [IQR] | 160.5 [151.2, 169] | 158 [147.8, 167] |
DBP [IQR] | 100 [93.2, 107.5] | 99 [93.8, 107] |
Not too shabby for a one-liner! Now, let's look at some missing or incomplete data by tossing out some of our variables:
stroke.incomplete <- stroke
stroke.incomplete$Age[sample(1:N, 4)] <- NA
stroke.incomplete$GCS[sample(1:N, 4)] <- NA
stroke.incomplete$SBP[sample(1:N, 4)] <- NA
stroke.incomplete$DBP[sample(1:N, 4)] <- NA
Let's have a look at the default option: suppress.data = FALSE
# Incomplete Data
table1.m <- table.summary(table.fmla = "Group ~ +Age %Gender %GCS -SBP -DBP",
dataset = stroke.incomplete)
print(xtable(table1.m), type = "html")
Control (N=70) | Treatment (N=80) | |
---|---|---|
Age (±SD) | 64.7 (6.8) | 65.9 (7.9) |
Age missing values | 2 | 2 |
Gender Female | 33 (47.1%) | 45 (56.2%) |
Gender Male | 37 (52.9%) | 35 (43.8%) |
GCS Severe (3-8) | 32 (45.7%) | 35 (43.8%) |
GCS Moderate (9-12) | 28 (40%) | 20 (25%) |
GCS Mild (13-15) | 8 (11.4%) | 23 (28.7%) |
GCS Missing | 2 (2.9%) | 2 (2.5%) |
SBP [IQR] | 161 [151, 169] | 158 [148, 167] |
SBP missing values | 1 | 3 |
DBP [IQR] | 100 [93, 108.2] | 99.5 [94, 107] |
DBP missing values | 2 | 2 |
We can suppress the missingness information by setting suppress.data = TRUE:
# Incomplete Data - Suppress missingness information
table1.sm <- table.summary(table.fmla = "Group ~ +Age %Gender %GCS -SBP -DBP",
dataset = stroke.incomplete, suppress.missing = T)
print(xtable(table1.sm), type = "html")
Control (N=70) | Treatment (N=80) | |
---|---|---|
Age (±SD) | 64.7 (6.8) | 65.9 (7.9) |
Gender Female | 33 (47.1%) | 45 (56.2%) |
Gender Male | 37 (52.9%) | 35 (43.8%) |
GCS Severe (3-8) | 32 (47.1%) | 35 (44.9%) |
GCS Moderate (9-12) | 28 (41.2%) | 20 (25.6%) |
GCS Mild (13-15) | 8 (11.8%) | 23 (29.5%) |
SBP [IQR] | 161 [151, 169] | 158 [148, 167] |
DBP [IQR] | 100 [93, 108.2] | 99.5 [94, 107] |
There's a lot of things we could add, like customize the labels, add hypothesis tests, and so fourth, but this is a good start. Maybe next time we can port our results to file formats more conducive to collaboration, editing, and submission.