Survival analysis is a collection of statistical methods for analyzing time-to-event data. Its origins date back to the 18th century, when analyses of the mortality experience of human populations began. During World War II, survival analysis focused on engineering: the reliability of military equipment was analyzed. After World War II, interest turned towards economics and medicine. In the 1960s, after the fundamental article of E. L. Kaplan and P. Meier had been published, medical applications of survival analysis shifted to the center of statistical focus.
2. Basic concepts in the survival analysis
2.1 The survival and hazard function
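Let X be a non-negative random variable representing the time to the event of interest. For a continuous X, its distribution can be characterized by any of the following functions.
The probability density of X: f(x), x≥0.
The survival function:
$$S(x) = P(X > x) = \int_x^{\infty} f(t)\,dt.$$
The hazard function:
$$\lambda(x) = \lim_{\Delta x \to 0^+} \frac{P(x \le X < x+\Delta x \mid X \ge x)}{\Delta x} = \frac{f(x)}{S(x)}.$$
The cumulative hazard function:
$$\Lambda(x) = \int_0^x \lambda(t)\,dt = -\ln S(x).$$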
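If X is a discrete random variable taking values x1<x2<... with probability mass function p(xj)=P(X=xj), the hazard at xj is
$$\lambda(x_j) = P(X = x_j \mid X \ge x_j) = \frac{p(x_j)}{S(x_j-)},$$
where S(x-)=lim t→x- S(t). The survival function and probability mass function can also be written as (see )
$$S(x) = \prod_{x_j \le x} \{1-\lambda(x_j)\}, \qquad p(x_j) = \lambda(x_j)\prod_{i<j}\{1-\lambda(x_i)\},$$
and the cumulative hazard function is
$$\Lambda(x) = \sum_{x_j \le x} \lambda(x_j).$$
Let dΛ(x) be a differential increment of the cumulative hazard over the interval [x,x+dx):
$$d\Lambda(x) = P(x \le X < x+dx \mid X \ge x).$$
The survival function in the discrete, continuous, or mixed cases can then be written as
$$S(x) = \prod_{0}^{x} \{1 - d\Lambda(u)\},$$
where the right-hand side is the product integral.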
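The relation between the discrete hazard and the survival function can be checked numerically. The following is a minimal sketch, assuming a small made-up probability mass function (the values in `pmf` are illustrative only); it computes the hazards as p(xj)/S(xj-) and then recovers the survival function as the product of the factors 1-λ(xj).

```python
def discrete_hazards(pmf):
    """Hazards lambda(x_j) = p(x_j) / S(x_j-) for a pmf given in increasing order of x_j."""
    hazards = []
    surv_before = 1.0  # S(x_j-) = P(X >= x_j)
    for p in pmf:
        hazards.append(p / surv_before)
        surv_before -= p
    return hazards


def survival_from_hazards(hazards):
    """S(x_j) as the running product of (1 - lambda(x_i)) over i <= j."""
    surv, out = 1.0, []
    for h in hazards:
        surv *= 1.0 - h
        out.append(surv)
    return out


# Illustrative pmf on four support points (an assumption, not data from the paper).
pmf = [0.1, 0.2, 0.3, 0.4]
hazards = discrete_hazards(pmf)
surv = survival_from_hazards(hazards)
print(hazards)
print(surv)
```

The product form should agree, up to rounding, with the direct computation of S(xj) as one minus the cumulative sum of the probability masses.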
2.2 Censoring and truncation
Compared to other statistical data, survival data possess the special feature of censoring. Censoring occurs when the survival time is not known exactly and the event is only known to have occurred within some time interval. There are several types of censoring: right, left and interval. In biomedical applications, right censoring is the most common type. It occurs when the survival time is incomplete on the right-hand side of the follow-up period, i.e. the study ends before all patients experience the event, or a patient is lost to follow-up (dies due to reasons other than the event of interest, withdraws from the study, moves to another city, etc.).
Let X1,X2,...,Xn be independent and identically distributed (i.i.d.) survival times and C1,C2,...,Cn be i.i.d. censoring times. The lifetime Xi of the i-th individual is known if, and only if, Xi≤Ci. If Ci<Xi, the event time is censored at Ci. It is therefore convenient to represent the survival experience of a group of patients by the pairs of random variables (Ti, δi), where Ti=min(Xi,Ci) is the observed time and δi=I(Xi≤Ci) is the event indicator, equal to one if the event is observed and zero if the observation is censored.
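The construction of the observed pairs (Ti, δi) can be illustrated by a short simulation. This is only a sketch: the exponential event times, the uniform censoring times and the function name `observe` are assumptions made for illustration, and ties between Xi and Ci are counted as observed events, which is immaterial for continuous times.

```python
import random


def observe(event_times, censor_times):
    """Return pairs (T_i, delta_i) with T_i = min(X_i, C_i) and delta_i = I(X_i <= C_i)."""
    return [(min(x, c), int(x <= c)) for x, c in zip(event_times, censor_times)]


rng = random.Random(0)  # fixed seed so the run is reproducible
n = 5
X = [rng.expovariate(0.5) for _ in range(n)]  # hypothetical latent survival times
C = [rng.uniform(0.0, 4.0) for _ in range(n)]  # hypothetical censoring times
for t, delta in observe(X, C):
    print(f"T = {t:.3f}, delta = {delta}")
```

Only the pairs printed here would be available to the analyst; the latent Xi of a censored individual is known only to exceed Ti.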
Another feature common in survival data is truncation. Truncation occurs when only those individuals whose event time lies within a certain time interval (TL, TR) are observed. For left truncation, TR=∞; for right truncation, TL=0. Individuals whose event time is not in this interval are not observed, and no information on these subjects is available. This is in contrast to censoring, where at least partial information is available on each patient. When data are truncated, a conditional distribution has to be used in constructing the likelihood (see ).
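As an illustration of the conditional likelihood under left truncation, consider exponentially distributed event times with rate r, where individual i is observed only if Xi exceeds an entry time Li. Each likelihood contribution is then f(xi)/S(Li) = r·exp(-r(xi-Li)). The sketch below assumes this exponential model, no censoring, and made-up data; the function names are hypothetical.

```python
import math


def exp_loglik_left_truncated(rate, event_times, entry_times):
    """Conditional log-likelihood: sum of log[f(x_i) / S(L_i)] for exponential times."""
    return sum(math.log(rate) - rate * (x - l)
               for x, l in zip(event_times, entry_times))


def exp_mle_left_truncated(event_times, entry_times):
    """Closed-form maximizer: number of events divided by total time at risk after entry."""
    exposure = sum(x - l for x, l in zip(event_times, entry_times))
    return len(event_times) / exposure


# Illustrative data (assumed): event times and the corresponding entry (truncation) times.
event_times = [2.0, 3.5, 5.0, 4.0]
entry_times = [1.0, 0.5, 2.0, 3.0]
rate_hat = exp_mle_left_truncated(event_times, entry_times)
print(rate_hat)
```

Ignoring the truncation and using the unconditional density would count the time before entry as time at risk and so bias the estimated rate downwards.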
2.3 Counting processes and martingales
In the counting process formulation, the survival experience of the i-th individual is described by the pair of processes (Ni(t), Yi(t)), where Ni(t) is the number of events observed for the individual in the interval (0,t] and Yi(t)=1 indicates that the individual is at risk of failure at time t (has neither failed nor been censored prior to t). Ni(t) is a counting process, while Yi(t) is a predictable process, i.e. a process whose value at time t is known infinitesimally before t, at time t-. The process Yi(t) has left-continuous sample paths. Right-censored survival data are included in this formulation as a special case: Ni(t)=I(Ti≤t,δi=1) and Yi(t)=I(Ti≥t).
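For right-censored data, the two processes can be evaluated directly from the observed pairs (Ti, δi). The following sketch uses three made-up observations and evaluates the aggregated processes, i.e. the sums of Ni(t) and Yi(t) over individuals, on a small time grid; the function names are chosen for illustration only.

```python
def counting_process(t, T_i, delta_i):
    """N_i(t) = I(T_i <= t, delta_i = 1): has the event been observed by time t?"""
    return int(T_i <= t and delta_i == 1)


def at_risk(t, T_i):
    """Y_i(t) = I(T_i >= t): is the individual still under observation just before t?"""
    return int(T_i >= t)


# Illustrative observations (T_i, delta_i); the second one is censored.
data = [(2.0, 1), (3.0, 0), (5.0, 1)]
grid = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

N_total = [sum(counting_process(t, T, d) for T, d in data) for t in grid]
Y_total = [sum(at_risk(t, T) for T, _ in data) for t in grid]
print(N_total)
print(Y_total)
```

Note that an individual failing at time t still counts as at risk at t, which reflects the left-continuity of Yi(t).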
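For each t>0, let
$$\mathcal{F}_{t-} = \sigma\{N_i(s), Y_i(s);\ i=1,\dots,n;\ 0 \le s < t\}$$
denote the full history of the processes Ni(s),Yi(s),i=1,...,n, up to but not including t. Then (see ):
$$P\{dN_i(t) = 1 \mid \mathcal{F}_{t-}\} = Y_i(t)\,\lambda_i(t)\,dt,$$
where λi(t) is the hazard function. The process
$$\Lambda_i(t) = \int_0^t Y_i(s)\,\lambda_i(s)\,ds$$
is called the cumulative intensity process. At each fixed t, this process is a random variable which approximates the number of jumps by Ni over (0,t]. In fact, ENi(t)=EΛi(t), and conditionally on the history
$$E\{dN_i(t) \mid \mathcal{F}_{t-}\} = d\Lambda_i(t).$$
For any given i, define the process
$$M_i(t) = N_i(t) - \Lambda_i(t). \qquad (1)$$
It can be seen that
$$E\,M_i(t) = 0$$
for all t and
$$E\{dM_i(t) \mid \mathcal{F}_{t-}\} = 0,$$
so that Mi(t) is a martingale with respect to the history.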
The approach using martingale methods is very useful in yielding results for censored and truncated data, especially for calculating and verifying asymptotic properties of test statistics and estimators.
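The zero-mean property of the process defined as the counting process minus the cumulative intensity can be checked by simulation. The sketch below assumes a constant hazard (exponential event times) and uniform censoring, so that the cumulative intensity up to t equals the rate times min(Ti, t); all distributional choices and names are assumptions made for illustration.

```python
import random


def martingale_value(t, x, c, rate):
    """M(t) = N(t) - Lambda(t) for one individual with constant hazard `rate`."""
    T = min(x, c)                 # observed time
    delta = x <= c                # event indicator
    N_t = 1 if (delta and T <= t) else 0
    Lambda_t = rate * min(T, t)   # integral of Y(s) * rate over (0, t], Y(s) = I(T >= s)
    return N_t - Lambda_t


def mean_martingale(t, rate=1.0, n=20000, seed=1):
    """Monte Carlo average of M(t) over n simulated individuals."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.expovariate(rate)   # latent event time
        c = rng.uniform(0.0, 3.0)   # independent censoring time
        total += martingale_value(t, x, c, rate)
    return total / n


estimate = mean_martingale(1.5)
print(estimate)  # should be close to zero
```

The average stays near zero for any fixed t, whereas the averages of N(t) and of the cumulative intensity taken separately do not.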
3. Non-parametric and semi-parametric models
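In standard statistical methods, the distribution function F(t) of a random sample X1,...,Xn is estimated by the empirical distribution function
$$F_n(t) = \frac{1}{n}\sum_{i=1}^{n} I(X_i \le t),$$
so that a continuous distribution is estimated by a discrete one. For an uncensored sample of n distinct failure times, the survival function is then estimated by the empirical survival function Sn(t)=1-Fn(t). The only problem with this approach is censoring, which is not taken into account in standard statistical methods. Important steps in the development of appropriate methods were made by Kaplan and Meier and by Cox.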
3.1 Kaplan-Meier and Nelson-Aalen estimators
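Let t1<t2<...<tk be the distinct observed event times, dj the number of events at tj, and nj the number of individuals at risk just prior to tj. The Kaplan-Meier (product-limit) estimator of the survival function is
$$\hat S(t) = \prod_{t_j \le t}\left(1 - \frac{d_j}{n_j}\right).$$
The product-limit estimator can also be used to estimate the cumulative hazard function:
$$\hat\Lambda(t) = -\ln \hat S(t).$$
An alternative is the Nelson-Aalen estimator of the cumulative hazard function,
$$\tilde\Lambda(t) = \sum_{t_j \le t} \frac{d_j}{n_j}.$$
Based on the Nelson-Aalen estimator of the cumulative hazard function, an alternative estimator of the survival function becomes
$$\tilde S(t) = \exp\{-\tilde\Lambda(t)\}.$$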
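Both estimators are simple to compute by hand on small data sets. The following is a minimal sketch on five made-up observations: the Kaplan-Meier estimate multiplies the factors 1 - d/n over the distinct event times, while the Nelson-Aalen estimate sums the ratios d/n, where d is the number of events and n the number at risk. The function name and the data are assumptions for illustration.

```python
import math


def km_and_nelson_aalen(times, events):
    """Return rows (t_j, KM survival, Nelson-Aalen cumulative hazard) at distinct event times."""
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    surv, cumhaz, rows = 1.0, 0.0, []
    for tj in event_times:
        n_j = sum(1 for t in times if t >= tj)  # at risk just prior to t_j
        d_j = sum(1 for t, e in zip(times, events) if t == tj and e == 1)
        surv *= 1.0 - d_j / n_j
        cumhaz += d_j / n_j
        rows.append((tj, surv, cumhaz))
    return rows


# Illustrative data: observed times and event indicators (0 = censored).
times = [1, 2, 2, 3, 4]
events = [1, 1, 0, 1, 0]
rows = km_and_nelson_aalen(times, events)
# Alternative survival estimate exp(-Nelson-Aalen) at the same time points.
alt_surv = [math.exp(-h) for _, _, h in rows]
for (t, s, h), s_alt in zip(rows, alt_surv):
    print(t, round(s, 4), round(h, 4), round(s_alt, 4))
```

Because 1 - x ≤ exp(-x), the survival estimate based on the Nelson-Aalen estimator is never smaller than the Kaplan-Meier estimate; the two are close when the risk sets are large.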
A common interest is to compare two or more samples, i.e. to test whether there is a significant difference in survival experience of distinct groups of patients. Several generalizations of standard non-parametric tests have been developed to deal with censored and truncated data. The most common tests are the log-rank test, Gehan-Wilcoxon test and Peto-Peto test. For more information, see e.g. .
3.2 The Cox model
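The Cox proportional hazards model relates the hazard of an individual with covariate vector X to an unspecified baseline hazard λ0(t):
$$\lambda(t \mid X) = \lambda_0(t)\exp(\beta^{\top}X),$$
where β is a vector of regression coefficients. In the absence of tied failure times, β is estimated by maximizing the partial likelihood
$$L(\beta) = \prod_{j=1}^{k} \frac{\exp(\beta^{\top}X_j)}{\sum_{l \in R(t_j)} \exp(\beta^{\top}X_l)},$$
where t1<...<tk are the uncensored failure times of the study group, R(tj) is the set of subjects at risk of failure at time tj- (just prior to time tj), and Xj denotes the covariate vector for the individual failing at tj. The partial likelihood function is treated as a standard likelihood, and inference is carried out by the usual means.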
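The partial likelihood is straightforward to evaluate for a given coefficient value. The sketch below assumes a single covariate, no tied failure times and made-up data; it computes the log partial likelihood as the sum, over uncensored failure times, of the linear predictor of the failing individual minus the logarithm of the sum of exp(β·x) over the risk set. The function name is hypothetical.

```python
import math


def cox_log_partial_likelihood(beta, times, events, x):
    """Log partial likelihood for one covariate, assuming no tied failure times."""
    loglik = 0.0
    for tj, ej, xj in zip(times, events, x):
        if ej != 1:
            continue  # censored observations contribute only through the risk sets
        # Risk set R(t_j): everyone still under observation just prior to t_j.
        risk_sum = sum(math.exp(beta * xl) for tl, xl in zip(times, x) if tl >= tj)
        loglik += beta * xj - math.log(risk_sum)
    return loglik


# Illustrative data (assumed): times, event indicators and one covariate.
times = [1.0, 2.0, 3.0, 4.0]
events = [1, 0, 1, 1]
x = [0.5, -1.0, 1.2, 0.0]
print(cox_log_partial_likelihood(0.0, times, events, x))
print(cox_log_partial_likelihood(0.5, times, events, x))
```

At β=0 every member of a risk set is equally likely to fail, so the value reduces to minus the sum of the logarithms of the risk set sizes. In practice the function would be maximized numerically, for example by Newton-Raphson iteration.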
4. Multivariate survival analysis
In most clinical applications, univariate survival analysis assumes that the observed survival times are mutually independent (i.i.d. failure times). In practice, however, dependence can occur for very different kinds of data, e.g. survival of twins or of other groups of several individuals, similar organs, recurrent events or multi-state events. Multivariate survival analysis covers the field where independence between survival times cannot be assumed. According to , the various approaches to analyzing multivariate survival data fall into four main categories: multi-state models, frailty models, marginal modeling and non-parametric methods.
The data structure should be considered as well. The data can be parallel (where the number of failures is fixed by the design of the study) or longitudinal (where the number of failures is random for each object under study). The data sets are classified into six types: several individuals, similar organs, recurrent events, repeated measurements, different events and competing risks. The relation of the data types to the two main approaches of analysis (multi-state and frailty models) is described in Table 1. Only these two approaches to analyzing multivariate survival data are presented in this paper. For more information on marginal and non-parametric methods, see e.g. , , or .
Tab. 1. Overview of data types and approaches; x means relevant, blank not relevant. Adapted from .
| Type of data | Multi-state | Frailty |
|---|---|---|
| Several individuals | x | x |
| Similar organs | x | x |
| Recurrent events | x | x |
| Repeated measurements | | x |
| Different events | x | |
| Competing risks | x | |
4.1 Competing risk and multi-state models
Multi-state models are commonly used for describing the development of longitudinal data. They model stochastic processes which at any time point occupy one of a set of discrete states. In medicine, the states can be e.g. healthy, diseased, and dead. A change of state is called a transition. The competing risks model is an example of multi-state modeling. In competing risks, various causes of death "compete" in the life of a patient, and the occurrence of one event precludes the occurrence of the other events. There are generally three areas of interest in the analysis of competing risks:
Studying the relationship between a vector of covariates and the rate of occurrence of specific types of failure.
Analyzing whether patients at high risk of one type of failure are also at high risk for others.
Estimating the risk of one type of failure after removing others.
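To model competing risks, a cause-specific hazard function is considered:
$$\lambda_j(t) = \lim_{\Delta t \to 0^+} \frac{P(t \le T < t+\Delta t,\ J = j \mid T \ge t)}{\Delta t}, \qquad j=1,\dots,m,$$
where T is the failure time and J denotes the type (cause) of failure.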
It is possible to calculate the Kaplan-Meier estimator for each type of failure separately, but the result is difficult to interpret as a survival function, and this approach is therefore not recommended. Instead, generalizations of the Kaplan-Meier and Nelson-Aalen estimators can be made (see e.g. ). The generalized estimator includes all causes of failure and is usually called the Aalen-Johansen estimator.
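For competing risks, the generalized estimator yields a cumulative incidence function for each cause: at every event time, the overall Kaplan-Meier survival just before that time is multiplied by the cause-specific ratio of events to individuals at risk, and the products are accumulated. The sketch below implements this special case on made-up data; the coding of causes (0 for censored) and the function name are assumptions for illustration.

```python
def cumulative_incidence(times, causes, n_causes):
    """Cumulative incidence per cause for competing risks data; cause 0 means censored.

    Returns rows (t_k, overall survival at t_k, tuple of cumulative incidences).
    """
    surv = 1.0                      # overall Kaplan-Meier survival just before t_k
    cif = [0.0] * (n_causes + 1)    # index j holds the cumulative incidence of cause j
    rows = []
    for tk in sorted(set(times)):
        n_k = sum(1 for t in times if t >= tk)  # at risk just prior to t_k
        d = [sum(1 for t, c in zip(times, causes) if t == tk and c == j)
             for j in range(n_causes + 1)]
        d_all = sum(d[1:])
        if d_all == 0:
            continue  # only censorings at this time
        for j in range(1, n_causes + 1):
            cif[j] += surv * d[j] / n_k
        surv *= 1.0 - d_all / n_k
        rows.append((tk, surv, tuple(cif[1:])))
    return rows


# Illustrative data: failure from cause 1 or 2, or censoring (0).
times = [1, 2, 3, 4, 5]
causes = [1, 2, 0, 1, 2]
cif_rows = cumulative_incidence(times, causes, n_causes=2)
for row in cif_rows:
    print(row)
```

At every time point the cumulative incidences of all causes and the overall survival sum to one, which is exactly the property that separate Kaplan-Meier estimates per cause fail to satisfy.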
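The relationship between the covariates and the cause-specific hazards can be described by a Cox-type regression model
$$\lambda_j(t \mid X) = \lambda_{0j}(t)\exp(\beta_j^{\top}X), \qquad j=1,\dots,m.$$
Both the baseline hazards λ0j and the regression coefficients βj vary arbitrarily over the m failure types. Estimation and comparison of the coefficients βj can be conducted by applying asymptotic likelihood techniques individually to the m factors of the corresponding partial likelihood.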
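In a general multi-state model with states 1,...,m, let Nijk(t) denote the number of direct transitions from state i to state j observed for individual k in the interval (0,t], and let Yik(t)=1 if individual k is in state i and under observation just prior to time t, and Yik(t)=0 otherwise, for k=1,...,n; i,j=1,...,m. Let Λij(t) denote the cumulative intensity of the transition from i to j, and suppose that censoring is independent, so that
$$P\{dN_{ijk}(t) = 1 \mid \mathcal{F}_{t-}\} = Y_{ik}(t)\,d\Lambda_{ij}(t),$$
which must hold for all i, j, k and t>0. The Nelson-Aalen estimator of Λij(t) is then given by
$$\hat\Lambda_{ij}(t) = \int_0^t \frac{I\{Y_{i\cdot}(u) > 0\}}{Y_{i\cdot}(u)}\,dN_{ij\cdot}(u), \qquad N_{ij\cdot} = \sum_{k=1}^{n} N_{ijk}, \quad Y_{i\cdot} = \sum_{k=1}^{n} Y_{ik},$$
for all i≠j.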
Parametric and semi-parametric models for λijk are obtained analogously to the models described earlier and may be found in .
4.2 Frailty models
Frailty models represent an extension of the Cox proportional hazards model. The concept of frailty provides a way to introduce random effects into the model to account for association (correlation) and unobserved heterogeneity. This heterogeneity may be difficult to assess but is nevertheless of great importance. The frailty is an unobserved random factor that multiplicatively modifies the hazard function of an individual or a group of individuals. The key idea of these models is that the most "frail" individuals die earlier than the others. Frailty models are relevant to lifetimes of several individuals, similar organs and repeated measurements. They are not generally relevant for the case of different events.
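First, bivariate models will be considered. Let
$$S_{12}(t_1,t_2) = P(T_1 > t_1,\ T_2 > t_2)$$
be the joint survival function of two lifetimes T1 and T2. The marginal survival functions are then
$$S_1(t_1) = S_{12}(t_1,0), \qquad S_2(t_2) = S_{12}(0,t_2).$$
If T1 and T2 are independent, S12(t1,t2)=S1(t1)S2(t2). The joint hazard function is
$$\lambda_{12}(t_1,t_2) = \lim_{\Delta t \to 0^+} \frac{P(t_1 \le T_1 < t_1+\Delta t,\ t_2 \le T_2 < t_2+\Delta t \mid T_1 \ge t_1,\ T_2 \ge t_2)}{(\Delta t)^2}$$
and the marginal hazards are
$$\lambda_j(t_j) = \lim_{\Delta t \to 0^+} \frac{P(t_j \le T_j < t_j+\Delta t \mid T_j \ge t_j)}{\Delta t}, \qquad j=1,2.$$
Usually, the frailty Z is assumed to act multiplicatively on the hazard, so that, given Z=z, the hazard of Tj is zλ0j(tj) with baseline hazard λ0j and cumulative baseline hazard Λ0j, and the marginal survival function is
$$S_j(t_j) = \int_0^{\infty} \exp\{-z\Lambda_{0j}(t_j)\}\,g(z)\,dz, \qquad j=1,2,$$
where g(z) is the probability density of Z. For the bivariate survival function thus, assuming that T1 and T2 are conditionally independent given Z,
$$S_{12}(t_1,t_2) = \int_0^{\infty} \exp\{-z[\Lambda_{01}(t_1)+\Lambda_{02}(t_2)]\}\,g(z)\,dz = L\{\Lambda_{01}(t_1)+\Lambda_{02}(t_2)\},$$
where the Laplace transform L(s)=∫exp(-sz)g(z)dz of g(z) is evaluated at s=Λ01(t1)+Λ02(t2).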
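A common choice for the frailty distribution is the gamma distribution with shape parameter k>0 and scale parameter θ>0, with probability density
$$g(z) = \frac{z^{k-1}e^{-z/\theta}}{\Gamma(k)\,\theta^{k}}, \qquad z>0.$$
The gamma function in the denominator of the probability density function is defined as
$$\Gamma(k) = \int_0^{\infty} s^{k-1}e^{-s}\,ds.$$
It satisfies Γ(k+1)=kΓ(k). The gamma distribution fits failure data very well and is also convenient from the computational and analytical points of view.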
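The convenience of the gamma distribution comes from its closed-form Laplace transform. As a sketch, assume the gamma density is parametrized by a shape k and a scale θ (this parametrization is an assumption of the example); the Laplace transform is then (1+θs)^(-k), and with a mean-one frailty (k=1/θ) the bivariate survival function follows by evaluating the transform at the sum of the two cumulative baseline hazards. The code compares the closed form with a numerical integral; all names and numbers are illustrative.

```python
import math


def gamma_density(z, k, theta):
    """Gamma density with shape k and scale theta (assumed parametrization)."""
    return z ** (k - 1) * math.exp(-z / theta) / (math.gamma(k) * theta ** k)


def gamma_laplace(s, k, theta):
    """Closed-form Laplace transform E[exp(-s Z)] = (1 + theta * s) ** (-k)."""
    return (1.0 + theta * s) ** (-k)


def laplace_numeric(s, k, theta, upper=60.0, steps=60000):
    """Trapezoidal approximation of the integral of exp(-s z) g(z) over (0, upper)."""
    h = upper / steps
    total = 0.0
    for i in range(steps + 1):
        z = i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * math.exp(-s * z) * gamma_density(z, k, theta)
    return total * h


def bivariate_survival(cumhaz1, cumhaz2, theta):
    """S12(t1, t2) for a shared mean-one gamma frailty with variance theta."""
    return gamma_laplace(cumhaz1 + cumhaz2, 1.0 / theta, theta)


closed = gamma_laplace(1.3, k=2.0, theta=0.5)
numeric = laplace_numeric(1.3, k=2.0, theta=0.5)
joint = bivariate_survival(0.2, 0.3, theta=0.5)
print(closed, numeric, joint)
```

The joint survival obtained this way exceeds the product of the two marginal survival functions, which reflects the positive dependence induced by the shared frailty.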
Survival analysis is a collection of specific statistical methods. In this paper, a short overview of these methods was presented. The standard univariate models were extended to multivariate models dealing with parallel and longitudinal data. Two major multivariate concepts were introduced: multi-state and frailty models.
The paper has been supported by the project SVV-2010-265513.
Aalen O. O.: Nonparametric Inference for a Family of Counting Processes. The Annals of Statistics 6 (1978), 701-726.
Andersen P. K., Gill R. D.: Cox's regression model for counting processes: a large sample study. The Annals of Statistics 10 (1982), 1100-1120.
Cox D. R.: Regression Models and Life-Tables. Journal of the Royal Statistical Society B 34 (1972), 187-220.
Fleming T. R., Harrington D. P.: Counting Processes and Survival Analysis. John Wiley & Sons, New York, 1991.
Fürstová J.: Multivariate Methods of Survival Analysis. Doktorandský den 2010, Matfyzpress, Praha, 2010.
Gill R. D.: Understanding Cox's regression model: a martingale approach. Journal of the American Statistical Association 79 (1984), 441-447.
Hougaard P.: Analysis of Multivariate Survival Data. Springer, New York, 2000.
Kalbfleisch J. D., Prentice R. L.: The Statistical Analysis of Failure Time Data. John Wiley & Sons, New York, 2002.
Kaplan E. L., Meier P.: Nonparametric Estimation from Incomplete Observations. Journal of the American Statistical Association 53 (1958), 457-481.
Klein J. P., Moeschberger M. L.: Survival Analysis. Techniques for Censored and Truncated Data. Springer, New York, 2003.
Miller R. G., Gong G., Muñoz A.: Survival Analysis. John Wiley & Sons, New York, 1998.
Nelson W.: Theory and Applications of Hazard Plotting for Censored Failure Data. Technometrics 14 (1972), 945-965.
Prentice R. L., Williams B. J., Peterson A. V.: On the regression analysis of multivariate failure time data. Biometrika 68 (1981), 373-379.
Rodríguez G.: Multivariate Survival Models. Available at http://data.princeton.edu/, cited on April 10, 2010.
Self S. G., Prentice R. L.: Commentary on Andersen and Gill's "Cox's regression model for counting processes: a large sample study". The Annals of Statistics 10 (1982), 1121-1124.
Therneau T. M., Grambsch P. M.: Modeling Survival Data. Extending the Cox Model. Springer, New York, 2000.
Vaupel J. W., Manton K. G., Stallard E.: The impact of heterogeneity in individual frailty on the dynamics of mortality. Demography 16 (1979), 439-454.
Wei L. J., Lin D. Y., Weissfeld L.: Regression analysis of multivariate incomplete failure time data by modeling marginal distributions. Journal of the American Statistical Association 84 (1989), 1065-1073.
Wienke A.: Frailty Models in Survival Analysis. Habilitation. Martin-Luther-Universität Halle-Wittenberg, 2007. Available at http://sundoc.bibliothek.uni-halle.de/habil-online/, cited on September 30, 2009.
Faculty of Medicine and Dentistry
Tř. Svobody 8
771 26 Olomouc