Survival analysis is the analysis of any 'time to event' response variable. A 'time to event' variable reflects the waiting time until the occurrence of a well-defined event. This time is often referred to as a failure time, survival time, or event time. For example,
Typical problem statements in survival analysis are:
Time to event variables have a set of unique features.
As discussed above, censoring implies that we have some information about a subject’s event time, but the exact event time is unknown. For survival analysis to be valid, censoring should be random, that is, unrelated to the subject's event time. There are many types of censoring, the most common being right censoring. The three most common reasons for right censoring are:
Right censoring is further classified into Type I and Type II censoring.
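The two schemes can be sketched in a few lines of Python. This is a minimal illustration under the usual definitions, which are assumed here: Type I censoring stops the study at a pre-set time, and Type II censoring stops it after a pre-set number of failures. The function names, the failure-time distribution, and all numbers are hypothetical.

```python
import random

def type_i_censor(failure_times, study_end):
    """Type I: stop at a fixed time; units still running are censored."""
    return [(min(t, study_end), t <= study_end) for t in failure_times]

def type_ii_censor(failure_times, n_failures):
    """Type II: stop as soon as n_failures failures have been observed."""
    stop_time = sorted(failure_times)[n_failures - 1]
    return [(min(t, stop_time), t <= stop_time) for t in failure_times]

rng = random.Random(0)
# Hypothetical failure times in days, drawn from an exponential with mean 20
times = [rng.expovariate(1 / 20) for _ in range(10)]

# Each record is (observed duration, event observed?)
type_i = type_i_censor(times, study_end=14)
type_ii = type_ii_censor(times, n_failures=4)

print(sum(obs for _, obs in type_i), "failures observed under Type I")
print(sum(obs for _, obs in type_ii), "failures observed under Type II")
```

Under Type I the study length is fixed and the number of observed failures is random; under Type II the number of failures is fixed and the study length is random.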
The image below depicts censoring in a study with 8 subjects.
Consider a factory of machine parts. Stress testing is conducted on these parts and a study is designed for time to failure of the parts. 1000 parts are sampled at random, and the study is designed to stop after 2 weeks. What is the censoring type? Assign 'Type I' or 'Type II' to the variable cens.
#cens = ''
Refer to the discussion above.
cens='Type I'
try:
    if cens == 'Type I':
        ref_assert_var = True
        print('continue')
    else:
        ref_assert_var = False
        print('Please follow the instructions given and use the same variables provided in the instructions. ')
except Exception:
    print('Please follow the instructions given and use the same variables provided in the instructions. ')
Let T denote the continuous non-negative random variable that represents the time to event response variable. T has the probability density function (pdf) f(t) and the cumulative distribution function (cdf) F(t) = Pr(T <= t), which gives the probability that the event has occurred by duration t. The survival function is the complement of the cdf: S(t) = Pr(T > t) = 1 − Pr(T <= t) = 1 − F(t).
The survival function gives the probability that a subject will survive past time t. The survival function has the following properties:
The hazard function, h(t), is an alternative characterization of the distribution of T. It is the instantaneous rate at which events occur at time t, given that no event has occurred before t.
$h(t) = \frac{f(t)}{S(t)}$
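The ratio follows from the definition of the hazard as a conditional event rate. A short derivation, using only the definitions of $f(t)$, $F(t)$ and $S(t)$ given earlier and the fact that $\Pr(T \ge t) = S(t)$ for continuous $T$:

$$h(t) = \lim_{\Delta t \to 0} \frac{\Pr(t \le T < t + \Delta t \mid T \ge t)}{\Delta t} = \lim_{\Delta t \to 0} \frac{F(t + \Delta t) - F(t)}{\Delta t} \cdot \frac{1}{S(t)} = \frac{f(t)}{S(t)}$$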
$H(t) = \int_0^t h(u)\,du$ is the cumulative hazard function, which describes the accumulated risk up to time t. If any one of the functions F(t), S(t), H(t), or h(t) is known, the others can be derived from it.
The image below shows the survival and hazard functions for two drugs where the event is death. The survival function shows that Drug A performs better, because its survival probability is higher than that of Drug B at all times $t$.
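As a numerical check of how these functions determine one another, the sketch below starts from a Weibull survival function and derives the rest using the standard identities $F(t) = 1 - S(t)$, $h(t) = f(t)/S(t)$ and $H(t) = -\log S(t)$. The Weibull parameters and function names are hypothetical choices for illustration.

```python
import math

SHAPE, SCALE = 1.5, 10.0  # hypothetical Weibull parameters

def survival(t):
    """S(t) for a Weibull distribution."""
    return math.exp(-((t / SCALE) ** SHAPE))

def cdf(t):
    """F(t) = 1 - S(t)."""
    return 1.0 - survival(t)

def pdf(t):
    """f(t), the derivative of F(t)."""
    return (SHAPE / SCALE) * (t / SCALE) ** (SHAPE - 1) * survival(t)

def hazard(t):
    """h(t) = f(t) / S(t)."""
    return pdf(t) / survival(t)

def cumulative_hazard(t):
    """H(t) = -log S(t)."""
    return -math.log(survival(t))

def integrate_hazard(t, steps=20_000):
    """Approximate the integral of h(u) over [0, t] with the midpoint rule."""
    width = t / steps
    return sum(hazard((i + 0.5) * width) for i in range(steps)) * width

t = 5.0
print("h(t)            =", hazard(t))
print("H(t) = -log S   =", cumulative_hazard(t))
print("integral of h   =", integrate_hazard(t))
```

The last two printed values should agree closely, which illustrates that the cumulative hazard can be obtained either from S(t) directly or by integrating h(t).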
We can use both parametric and non-parametric methods to estimate the survival function. Both approaches assume that every subject follows the same survival function. When there is no censoring, a non-parametric estimator of $S(t)$ is $1-F_n(t)$, where $F_n(t)$ is the empirical cumulative distribution function of the response variable $T$. In case of censoring, we can estimate $S(t)$ using the Kaplan-Meier product-limit estimator. By making further assumptions and specifying a parametric form for $S(t)$, we can estimate the expected failure times and derive smooth functions for estimating $S(t)$ and $H(t)$. Weibull, exponential, log-normal and log-logistic are some of the popular distributions used for estimating survival functions.
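For the uncensored case, the estimator $1-F_n(t)$ is simply the share of subjects whose event time exceeds $t$. A minimal sketch with made-up event times (the function name and data are hypothetical):

```python
def empirical_survival(times, t):
    """Return 1 - F_n(t): the fraction of event times strictly greater than t."""
    return sum(1 for x in times if x > t) / len(times)

# Hypothetical, fully observed event times (no censoring)
times = [2.0, 3.5, 3.5, 6.0, 9.0]

for t in (0, 3.5, 7, 10):
    print(t, empirical_survival(times, t))
```

The estimate starts at 1, drops by $1/n$ for each event, and reaches 0 after the last event time.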
The Kaplan-Meier estimator estimates the survival function using the following product:
$\hat{S}(t) = \displaystyle\prod_{t_i \le t} \frac{n_i - d_i}{n_i}$ where $d_i$ is the number of events at time $t_i$ and $n_i$ is the number of subjects at risk of the event just prior to time $t_i$.
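To make the product concrete, here is a small hand-rolled version of the estimator. It is a sketch for illustration rather than a replacement for a library implementation, and it assumes the common conventions that an event at $t_i$ already lowers the estimate at $t_i$ and that a subject censored at $t_i$ still counts as at risk at $t_i$. The function name and toy data are hypothetical.

```python
def kaplan_meier(durations, observed):
    """Return [(t_i, S_hat(t_i))] for each distinct event time t_i."""
    event_times = sorted({t for t, e in zip(durations, observed) if e})
    curve, s_hat = [], 1.0
    for t_i in event_times:
        # n_i: subjects still at risk just before t_i
        n_i = sum(1 for t in durations if t >= t_i)
        # d_i: events that occur exactly at t_i
        d_i = sum(1 for t, e in zip(durations, observed) if e and t == t_i)
        s_hat *= (n_i - d_i) / n_i
        curve.append((t_i, s_hat))
    return curve

# Hypothetical toy data: 1 = event observed, 0 = censored
durations = [1, 2, 2, 3, 4, 5]
observed = [1, 1, 0, 1, 0, 1]

curve = kaplan_meier(durations, observed)
for t_i, s in curve:
    print(t_i, round(s, 4))
```

Censored subjects never trigger a drop in the curve themselves, but they leave the risk set, so later events cause larger drops than they would in a fully observed sample.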
Let us now look at a dataset on marriage dissolution in the US. The dataset covers 3771 couples; the unit of observation is the couple and the event of interest is divorce, with interview and widowhood treated as censoring events. There are two fixed covariates: the husband's education and an indicator of whether the couple is of mixed ethnicity. The variables are:
id: a couple number.
heduc: education of the husband, coded 0 = less than 12 years, 1 = 12 to 15 years, and 2 = 16 or more years.
mixed: coded 1 if the husband and wife have different ethnicity (defined as black or other), 0 otherwise.
years: duration of marriage, from the date of wedding to divorce or censoring (due to widowhood or interview).
div: the failure indicator, coded 1 for divorce and 0 for censoring.
Dataset source: http://data.princeton.edu/wws509/datasets/#divorce
The lifelines library can be used to estimate the survival function. Use the following code to read in the data and plot the survival function using the Kaplan-Meier estimator.
%matplotlib inline
import pandas as pd
from lifelines import KaplanMeierFitter

# Read the data without a header row and assign the column names explicitly
div_df = pd.read_csv("../data/divorce.csv", header=None)
div_df.columns = ['ID', 'HEduc', 'Mixed', 'Years', 'Div']

# Years is the duration; Div is 1 for divorce (event) and 0 for censoring
kmf = KaplanMeierFitter()
kmf.fit(durations=div_df.Years, event_observed=div_df.Div)
kmf.plot()
try:
    a = 1
    if a == 1:
        ref_assert_var = True
        print('continue')
    else:
        ref_assert_var = False
        print('Please follow the instructions given and use the same variables provided in the instructions. ')
except Exception:
    print('Please follow the instructions given and use the same variables provided in the instructions. ')