This was a quick Saturday afternoon project: I wanted to write the guts of a program to simulate survival data. While there are plenty of survival datasets around to play with, I wanted to make something that could eventually be used to explore models I don't get a chance to use that often, such as competing risks, discrete-time survival, and joint longitudinal-survival models. This will get us started in the right direction, and it uses some code I already had kicking around. First, let's clear the workspace and set up a directory for the resulting dataset.

rm(list = ls(all.names = TRUE))

working.dir <- "c:/Data/Didactic/Data"

if (file.exists(working.dir)) {

setwd(file.path(working.dir))

} else {

dir.create(working.dir, recursive = TRUE, showWarnings = FALSE)

setwd(working.dir)

}

Now let's get to the actual fun stuff. First, a quick function that gives us a positive definite correlation matrix, so we can simulate some covariates from the multivariate normal distribution. It uses the QR decomposition of a random matrix to get an orthogonal basis, fills in random positive eigenvalues, and converts the resulting covariance matrix to a correlation matrix.

generate.rho <- function (p = 10) {

# Generate Orthogonal Columns

E <- qr.Q(qr(matrix(runif(p * p), p, p)))

# Generate Eigenvalues from Uniform Distribution

evalues <- runif(p)

# Scale eigenvalues to give a sum of 1

evalues <- evalues/sum(evalues)

L <- diag(evalues, nrow = p)

# E %*% L %*% t(E) is symmetric positive definite; rescale it to a correlation matrix

cov2cor(E %*% L %*% t(E))

}
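Whenever a correlation matrix is built by hand, a quick sanity check is worthwhile. The sketch below is a minimal example and not part of the original code: `is.valid.rho` is a hypothetical helper name, and the tolerance is an arbitrary assumption. It checks for symmetry, a unit diagonal, and strictly positive eigenvalues, and tries the check on a small matrix built with the same QR-plus-eigenvalues idea.

```r
# Hypothetical helper (not from the original post): checks that a matrix
# is usable as a correlation matrix
is.valid.rho <- function(rho, tol = 1e-8) {
  symmetric <- isSymmetric(rho, tol = tol)
  unit.diag <- all(abs(diag(rho) - 1) < tol)
  # only.values = TRUE skips the eigenvectors, which we do not need here
  pos.def <- all(eigen(rho, symmetric = TRUE, only.values = TRUE)$values > tol)
  symmetric && unit.diag && pos.def
}

set.seed(1)
p <- 5
Q <- qr.Q(qr(matrix(rnorm(p * p), p, p)))    # random orthogonal basis
ev <- runif(p, 0.1, 1)                       # strictly positive eigenvalues
rho.check <- cov2cor(Q %*% diag(ev) %*% t(Q))

is.valid.rho(rho.check)                      # a properly built matrix
is.valid.rho(matrix(c(1, 2, 2, 1), 2))       # eigenvalues 3 and -1, so not valid
```

If the check fails on a matrix you intend to pass to mvrnorm, the usual culprit is a missing transpose in the eigen-reconstruction step.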

Let's sample some covariates; these will play into our censoring and event processes. Because I am sampling the covariates from a multivariate normal distribution with zero mean and a correlation matrix as its covariance, each covariate is marginally standard normal (zero mean, unit variance). We can create categorical covariates with the cut function, and control the proportion in each category using quantiles of the standard normal distribution.

# Number of Observations

n.obs <- 500

# Time at which all observations will be censored

censoring.time <- 10

# Number of covariates: how many are categorical, then how many play each role

n.categorical <- 4

n.cens.cov <- 3

n.event.cov <- 3

n.common.cov <- 2

n.null.cov <- 3

n.cov <- n.cens.cov + n.event.cov + n.common.cov + n.null.cov

rho <- generate.rho(n.cov)

# Sample covariates from a zero-mean multivariate normal with correlation rho

library(MASS) # for mvrnorm function

X <- matrix(mvrnorm(n.obs, mu = rep(0, n.cov), Sigma = rho), n.obs)

# Create Categorical Covariates - Tertiles based on Z scores

X[, 1:n.categorical] <- apply(X[, 1:n.categorical], 2, cut, breaks = c(-Inf, -.43, .43, Inf), labels = FALSE)
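The cut points of -.43 and .43 used above are roughly the 1/3 and 2/3 quantiles of the standard normal distribution, which is what makes the categories tertiles. As a minimal sketch of how the same idea extends to other proportions, the hypothetical helper `categorize.by.prop` below (my own name, not something from the original code) derives the interior break points from qnorm:

```r
# Hypothetical helper: split a standard normal variable into categories
# whose expected proportions are given by props
categorize.by.prop <- function(z, props) {
  stopifnot(abs(sum(props) - 1) < 1e-8)
  # Interior break points are normal quantiles of the cumulative proportions
  inner <- qnorm(cumsum(props)[-length(props)])
  cut(z, breaks = c(-Inf, inner, Inf), labels = FALSE)
}

set.seed(2)
z <- rnorm(10000)

tertile.breaks <- round(qnorm(c(1/3, 2/3)), 2)   # the tertile cut points
tertile.breaks

# Observed shares for a 20% / 50% / 30% split
obs.props <- prop.table(table(categorize.by.prop(z, c(0.2, 0.5, 0.3))))
obs.props
```

The observed shares will only approximate the requested ones, since they depend on the sample drawn.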

So now that we have our covariates, let's randomly assign each of them to the censoring process, the event process, both processes, or neither. Once the assignment is made, we sample some regression parameters and create our linear predictors.

# Switch up the order, so we don't know which variables are related to the

# censoring and event processes

var.type <- sample(c(rep("cens", n.cens.cov), rep("event", n.event.cov),

rep("common", n.common.cov), rep("null", n.null.cov)))

X.cens <- X[, which(var.type == "cens")]

X.event <- X[, which(var.type == "event")]

X.common <- X[, which(var.type == "common")]

X.null <- X[, which(var.type == "null")]

B.cens <- matrix(rbeta(n.cens.cov, 8, 4)*sign(rnorm(n.cens.cov)), n.cens.cov)

B.event <- matrix(rbeta(n.event.cov, 20, 2)*sign(rnorm(n.event.cov)), n.event.cov)

B.common <- matrix(rbeta(n.common.cov, 6, 4)*sign(rnorm(n.common.cov)), n.common.cov)

Eta.cens <- cbind(X.cens, X.common) %*% rbind(B.cens, B.common)

Eta.event <- cbind(X.event, X.common) %*% rbind(B.event, B.common)

The last thing we need to do is sample our censoring and event times: in our current example, we only have two competing processes, censoring and one event. This could easily be generalized to several competing events.
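To give a feel for that generalization, here is a minimal sketch, assuming two hypothetical event types plus censoring, each with its own latent Weibull time. The shapes, scales, and column names are placeholders I picked for illustration. The observed time is the smallest latent time, and the cause is whichever process produced it.

```r
set.seed(3)
n <- 200

# Assumed latent times: two hypothetical event types and a censoring process
latent <- cbind(event1 = rweibull(n, shape = 2,   scale = 9),
                event2 = rweibull(n, shape = 1.5, scale = 12),
                cens   = rweibull(n, shape = 2,   scale = 9))

obs.time <- apply(latent, 1, min)                        # first process to occur
cause <- colnames(latent)[apply(latent, 1, which.min)]   # which process it was

table(cause)
```

This treats the latent times as independent, which is a simplifying assumption; dependence between the processes would need to be built into how the latent times are drawn.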

Our event times will be sampled from a Weibull distribution. In this Weibull model, the shape parameter is fixed, and the linear predictor enters through the scale parameter. The data we generate can be used with both parametric and non-parametric survival models.
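The code in this post adds the linear predictor directly to the baseline scale, which works as long as the sum stays positive. A common alternative, sketched here under my own assumptions rather than as the method used in this post, is a log-linear link where the scale is the baseline multiplied by exp(eta). That keeps the scale positive for any value of the linear predictor and gives the accelerated failure time form, in which log(time) is linear in eta. The values of `eta`, `base.scale`, and `shape` are stand-ins chosen for illustration.

```r
set.seed(4)
n <- 1000

eta <- rnorm(n, sd = 0.5)    # stand-in linear predictor (assumed)
base.scale <- 9              # assumed baseline scale
shape <- 2                   # assumed fixed shape

# Log-linear link: the scale is positive whatever the value of eta
scale.i <- base.scale * exp(eta)
t.event <- rweibull(n, shape = shape, scale = scale.i)

# Under this form log(time) is linear in eta, so the fitted slope
# should land near 1
fit <- lm(log(t.event) ~ eta)
coef(fit)
```

Regressing log(time) on the linear predictor is only a rough check that suits uncensored simulated times; with censoring in play, a proper survival model would be the right tool.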

Once we have our events and times, we gather all our variables into a data frame, write it to a .CSV file, and we're ready to go.

# Set scale and shape parameters for distribution

event.shape <- 2

cens.shape <- 2

event.scale <- .9*censoring.time

cens.scale <- .9*censoring.time

# Parameterize scale with linear predictor (added to the baseline scale, so the

# sum must stay positive for rweibull to return valid times)

event.time <- rweibull(n.obs, event.shape, scale = event.scale + Eta.event)

cens.time <- rweibull(n.obs, cens.shape, scale = cens.scale + Eta.cens)

time_to_event <- pmin(event.time, cens.time)

# Censored if the censoring time comes first, or if the event falls at or

# beyond the administrative censoring time

censored <- as.numeric(cens.time <= event.time | event.time >= censoring.time)

# Nothing is observed past the administrative censoring time

time_to_event <- pmin(time_to_event, censoring.time)

event <- 1 - censored

surv.data <- data.frame(censored, event, time_to_event, X)

write.csv(surv.data, paste0("Survival sim ", format(Sys.Date(), "%Y-%m-%d"), ".csv"),

row.names = FALSE)

This will definitely need some tweaks, but it has some potential. With a little modification, we could introduce other factors like left truncation, discrete time scale, and so on. This was a spur-of-the-moment project, so give it a try, tweak it, and let me know the results.
