Genome Institute of Singapore, Biopolis, Singapore

Univ Paris-Sud, U669, Villejuif, F-94807 France

Inserm, UMRS 1018, Villejuif, F-94807 France; Univ Paris-Sud, Villejuif, F-94807 France

Hôpital Paul Brousse AP-HP, Villejuif, F-94807 France

Abstract

Background

In genomic medical studies, one of the major objectives is to identify genomic factors with a prognostic impact on time-to-event outcomes so as to provide new insights into the disease process. Selection usually relies on univariate statistical indices based on the Cox model. Such a model assumes proportional hazards (PH), which is unlikely to hold for every genomic marker.

Methods

In this paper, we introduce a novel pseudo-R^{2} measure derived from a crossing hazards model and designed for the selection of markers with crossing effects. The proposed index is related to the score statistic and quantifies the ability of a genomic factor to separate patients according to their survival times and marker measurements. We also show the importance of considering genomic markers with crossing effects, as they potentially reflect the complex interplay between markers belonging to the same pathway.

Results

Simulations show that our index is affected neither by censoring nor by the sample size of the study. It also performs better than classical indices under the crossing hazards assumption. The practical use of our index is illustrated in a lung cancer study. The use of the proposed pseudo-R^{2} allows the identification of cell-cycle dependent genes that are not identified when relying on the PH assumption.

Conclusions

The proposed index is a novel and promising tool for selecting markers with crossing hazards effects.

Background

In genomic medical research, one of the major objectives is to identify genomic markers having a prognostic impact on clinical outcomes (e.g. relapse, death) so as to provide new insights into the disease process. Most studies that investigate the relationship between genomic markers and time-to-event outcomes rely on marginal survival analyses that consider univariate prognostic indices derived from the semi-parametric Cox proportional hazards model. The proportional hazards (PH) assumption states that the ratio of the hazard functions of different individuals remains constant over time. Although this assumption is arbitrary, it is widely used since it offers a convenient way to summarize the effect of a covariate on the baseline hazard function, and the resulting inference on the parameters of the model is robust enough to encompass some instances of non-proportionality (monotone, converging or diverging hazard functions). However, PH modeling clearly cannot cope with crossing hazard functions. Crossing-hazards models explicitly specify that there is a time at which the hazard curves for different levels of a covariate cross. To the best of our knowledge, the crossing hazards phenomenon has barely been investigated in genomic studies, and it is usually described as a time-dependent effect of the genomic marker without any meaningful bioclinical interpretation.

In this paper, we introduce a novel pseudo-R^{2} index derived from a semi-parametric non-proportional hazards model that is suited for the selection of genomic markers with crossing hazard functions. We also discuss one plausible interpretation of such a crossing phenomenon, which relates to a gene effect modification. For censored survival data, two main approaches have been considered for quantifying the predictive ability of a variable to separate patients: concordance and proportion of explained variation. The latter quantifies the relative gain in prediction ability between a covariate-based model and a null model, by analogy with the well-known linear model (and the R^{2} criterion). In this framework, we propose a novel statistical quantity which is related to the score statistic. The proposed pseudo-R^{2} index relies on the partial likelihood function in such a way that it has an interpretation in terms of percentage of separability between patients according to their survival times and marker measurements. It extends a previous work ^{2} measure for such non-proportional situations. Then, we introduce a semi-parametric non-proportional hazards model which gives rise to a crossing effect of the hazard functions. Finally, we derive from this model a pseudo-R^{2} measure well suited for crossing hazard functions and show its link to the robust score statistic for testing no effect of the considered marker.

Methods

In this section, we first present a simple situation which motivates the use of the semi-parametric non-proportional hazards model introduced in the next subsection.

Notations

Let the random variables _{i}
_{i}
_{i }
_{i}
_{i}

Let _{i}
_{i}
^{- }+ _{i}
^{-}) be the number of events occurring in the interval [^{th }covariate for individual

Motivational situation: the modulating effect

In the following, we show how a simple interplay between two binary markers Z^{(1)} and Z^{(2)} can lead to marginal crossing hazard functions.

The joint distribution of Z^{(1)} and Z^{(2)} is defined by:

It is assumed that the hazard function of subject

where _{0}(

Model (1) describes a modulating effect of the two markers Z^{(1)} and Z^{(2)}, whereby Z^{(2)} has a multiplicative effect on the hazard and Z^{(1)} has a multiplicative effect only if Z^{(2)} equals one (so-called effect modification). The corresponding hazard functions according to the values of Z^{(1)} and Z^{(2)} are shown in the table below.

Hazard function

| Z^{(2)} | Z^{(1)} = 0 | Z^{(1)} = 1 |
|---|---|---|
| 0 | λ_{0}(t) | λ_{0}(t) |
| 1 | λ_{0}(t)e^{γ} | λ_{0}(t)e^{α+γ} |

Assuming that model (1) is the true one, the consequences of omitting Z^{(2)} on the formulation of the observed hazards ratio relative to Z^{(1)} are described below. Expressing model (1) in terms of the conditional survival function given

where _{0}(_{0}(

It is worth noting that this latter expression can be obtained as the expectation of (1) taken over Z^{(2)} given the at-risk process. Finally, the hazards ratio relative to the values

It appears from this expression that hazards may cross over time. More precisely, it is shown in Additional File ^{(1)}, ^{(2)}), the hazards ratio inverts at a given time in (0; +∞). Obviously, such a time-dependence cannot be properly handled by using the proportional hazards model to analyze the data.
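The inversion of the marginal hazards ratio can be checked numerically. The sketch below is only an illustration under assumptions that are not taken from the paper: the conditional hazard is written as λ_{0} exp(γ Z^{(2)} + α Z^{(1)} Z^{(2)}), which matches the hazard table above, the baseline hazard λ_{0} is constant, Z^{(2)} equals one with probability 0.5 whatever the value of Z^{(1)}, and α = γ = 1. The function names are hypothetical.

```python
import math


def marginal_hazard(t, z1, alpha=1.0, gamma=1.0, p=0.5, lam0=1.0):
    """Hazard at time t for Z1 = z1 when the binary marker Z2 is omitted.

    Assumed conditional model: lam0 * exp(gamma * Z2 + alpha * Z1 * Z2),
    with a constant baseline lam0 and P(Z2 = 1 | Z1) = p (illustrative values).
    """
    rates = (lam0, lam0 * math.exp(gamma + alpha * z1))  # strata Z2 = 0 and Z2 = 1
    weights = (1.0 - p, p)
    # Subjects still at risk at t form a mixture of the two Z2 strata,
    # each stratum being weighted by its own survival probability.
    surv = [w * math.exp(-r * t) for w, r in zip(weights, rates)]
    return sum(r * s for r, s in zip(rates, surv)) / sum(surv)


def hazard_ratio(t, **kw):
    """Marginal hazards ratio of Z1 = 1 versus Z1 = 0 at time t."""
    return marginal_hazard(t, 1, **kw) / marginal_hazard(t, 0, **kw)


def crossing_time(lo=0.0, hi=3.0, tol=1e-8, **kw):
    """Bisection for HR(t) = 1; assumes HR(lo) > 1 and HR(hi) < 1."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if hazard_ratio(mid, **kw) > 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    for t in (0.0, 0.5, 1.0, 2.0, 3.0):
        print(f"t = {t:.1f}  HR = {hazard_ratio(t):.4f}")
    print("approximate crossing time:", round(crossing_time(), 4))
```

With these values, the ratio starts above one and falls below one later on, because the high-risk subjects (Z^{(2)} = 1) are depleted faster in the group Z^{(1)} = 1 than in the group Z^{(1)} = 0.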

**The hazards ratio function inverts for a given time**.


Semi-parametric model

The proposed model defines the survival function of subject _{i }

where _{0}(

where

In the simple case of a covariate

Thus, as expected, model (4) allows hazards to cross over time. Note that the survival functions cross at a time larger than the crossing time of the hazards, and may not cross at a finite time.

At time

The score function evaluated for

with _{0}(

Pseudo-R^{2} measure

The goal of this section is to propose a pseudo-R^{2} index that can be interpreted in terms of percentage of separability between patients according to their survival times and marker measurements under the crossing hazards model (4). The approach used below is based on the score function (5). It extends the particular case that we considered in a former work _{i }

With

From this expression, we show that, for a given covariate _{i }

An estimation of the _{i }

where _{0}(_{i}
_{i}
_{i}

For distributional reason, instead of the _{i}
_{i }

Where

The _{i }

The sum over _{i}
_{i}
_{i }
_{i }

Finally, the index is equal to the robust score statistic divided by the number of distinct uncensored failure times

The index **D**_{0} is interpreted in terms of percentage of separability over time between the event/non-event groups. Its calculation is easy as it does not require the estimation of the regression parameter. It ranges from zero to one: 0 ≤ **D**_{0} ≤ 1.

It is worth noting that the index **D**_{0} can be interpreted as a pseudo-R^{2} measure. In the linear regression model, the R^{2} (coefficient of determination) can be directly linked to likelihood-related quantities such as the Wald test, the likelihood ratio and the score statistics (see ^{2}. In the framework of non-linear models, statisticians have searched for a corresponding index and different pseudo-R^{2} statistics have been proposed for censored data. Our proposed index is an extension of the definition of the R^{2} for survival models with crossing hazards which relies on the score statistic.
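The link between R^{2} and the three likelihood-based statistics in the linear model can be verified on a small example. The sketch below assumes a simple Gaussian linear regression with one covariate, tests the hypothesis of a zero slope, and estimates the error variance by maximum likelihood. Under these assumptions the score statistic equals n R^{2}, the Wald statistic equals n R^{2}/(1 - R^{2}) and the likelihood ratio statistic equals -n log(1 - R^{2}). The data and the function name are invented for illustration only.

```python
import math


def linear_model_statistics(x, y):
    """R^2 and the Wald, likelihood ratio and score statistics for H0: slope = 0
    in a simple Gaussian linear regression (error variance estimated by ML)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    # Residual sums of squares under the null (intercept only) and full models.
    rss0 = sum((b - my) ** 2 for b in y)
    rss1 = sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))
    return {
        "n": n,
        "r2": 1.0 - rss1 / rss0,
        "wald": n * (rss0 - rss1) / rss1,
        "lr": n * math.log(rss0 / rss1),
        "score": n * (rss0 - rss1) / rss0,
    }


if __name__ == "__main__":
    # Toy data, made up for this illustration.
    x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    y = [1.2, 1.9, 3.2, 3.8, 5.1, 5.9]
    s = linear_model_statistics(x, y)
    print("score / n =", s["score"] / s["n"], " R^2 =", s["r2"])
    print("Wald >= LR >= score:", s["wald"] >= s["lr"] >= s["score"])
```

Dividing the score statistic by the sample size gives back R^{2} in this setting, which parallels the construction of the proposed index as a robust score statistic divided by the number of distinct uncensored failure times.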

Results

Simulation Scheme

A simulation study was performed to describe the behavior of the proposed index, _{i}
_{0}(^{-βZ}

The coefficient ^{β }
^{β }
_{i }
_{i }
_{i }
_{i }
_{c }
_{i }
_{i }
_{i }
_{i }
_{i}
_{i}

Simulation Results

Figures ^{β }
_{c}

**Simulation results for the compared indices, for n = 100 subjects and a uniform censoring (1,000 repetitions)**. Boxplots of the different indices according to the values of ^{β} and _{c}.

**Simulation results for the compared indices, for n = 100 subjects and a uniform censoring (1,000 repetitions)**. Boxplots of the different indices according to the values of ^{β} and _{c}.

**Simulation results for the compared indices, for n = 100 subjects and a uniform censoring (1,000 repetitions)**. Boxplots of the different indices according to the values of ^{β} and _{c}.

**Simulation results for the compared indices, for n = 50 subjects and a uniform censoring (1,000 repetitions)**. Boxplots of the different indices according to the values of ^{β} and _{c}.

**Simulation results for the compared indices, for n = 50 subjects and a uniform censoring (1,000 repetitions)**. Boxplots of the different indices according to the values of ^{β} and _{c}.

**Simulation results for the compared indices, for n = 50 subjects and a uniform censoring (1,000 repetitions)**. Boxplots of the different indices according to the values of ^{β} and _{c}.

**Simulation results for the compared indices, for n = 500 and a uniform censoring (1,000 repetitions)**. Graphic: Boxplots of the different indices according to the values of ^{β} and _{c}.

**Simulation results for the compared indices, for n = 500 and a uniform censoring (1,000 repetitions)**. Graphic: Boxplots of the different indices according to the values of ^{β} and _{c}.

**Simulation results for the compared indices, for n = 500 and a uniform censoring (1,000 repetitions)**. Graphic: Boxplots of the different indices according to the values of ^{β} and _{c}.

As seen from Figures ^{β }

The standard errors of the six indices are small when

The mean value of

**Simulation results for the compared indices, for n = 100 subjects and a uniform censoring (1,000 repetitions)**. Graphic: Boxplots of the different indices according to the values of ^{β} and _{c}.

**Simulation results for the compared indices, for n = 100 subjects and a uniform censoring (1,000 repetitions)**. Graphic: Boxplots of the different indices according to the values of ^{β} and _{c}.

**Simulation results for the compared indices, for n = 100 subjects and a uniform censoring (1,000 repetitions)**. Graphic: Boxplots of the different indices according to the values of ^{β} and _{c}.

Application of the index on real data

In this section, we illustrate the use of the proposed index by selecting transcriptomic prognostic factors having a crossing effect in a lung cancer study. We compare the selection to the one obtained when relying on the index calculated under a proportional hazards model.

Dataset

This series is composed of 74 patients who underwent surgery at the Hôtel-Dieu Hospital (AP-HP, France) between August 2000 and February 2004 for stage IB (pT2N0) primary adenocarcinoma or large cell lung carcinoma of peripheral location

Selection of the variables

The genes were ranked according to the value of either

We then examined the biological processes that were significantly over-represented in the two sublists using the PANTHER (Protein ANalysis THrough Evolutionary Relationships) classification system

**Biological processes (obtained from the PANTHER classification system) for the lung cancer study cohort**. Table: List of the biological process obtained according to


**Lists of the cell cycle related transcripts among the selection according to the value of the index calculated either under the crossing hazards effect or the PH model**. Tables: Lists of the cell cycle related transcripts according to


Among the 25 cell-cycle related transcripts selected according to

**Kaplan-Meier curves of the groups according to the expression levels of genes (a) FGFR2, (b) MCL1, and of the groups defined by the four combinations of expression levels of (c) FGFR2 and FGF4, and (d) MCL1 and BCL2 in the lung cancer study**. The groups of patients were determined according to the low-high expression status of the genes considered. Patients whose expression measurement was higher (resp. lower) than the median were assigned to the "highly expressed" (resp. "lowly expressed") group.

The gene

In the same way, we discussed the interaction between

In these two examples, we could hypothesize that the crossing effect observed in the marginal analysis of

Discussion

For survival data analysis, univariate feature selection is mainly based on ranking markers according to the value of a test statistic or a predictive index obtained under the classical Cox PH model. In such a setting, we demonstrated in a previous work the interest of using a pseudo-R^{2} measure for genomic studies. However, various departures from the PH assumption can be observed, and the crossing hazards phenomenon can be encountered in real situations.

In this context, we propose a novel pseudo-R^{2} measure that is suitable for identifying genomic markers with crossing effects. It is linked to a semi-parametric survival model that provides sufficient flexibility to handle data with crossing hazards. Selecting such markers is potentially important since it could reflect the complex interplay between genes belonging to the same pathway.

The proposed index ranges from zero to one and can be interpreted in terms of percentage of separability over time between the subgroup of subject(s) experiencing the event and the subgroup of those experiencing the event at a later time. It quantifies the prognostic separability of markers under a crossing hazard function assumption, whereas for the proportional hazards setting other specialized indices have previously been proposed. The proposed pseudo-R^{2} is derived from the partial log-likelihood function and directly linked to the robust score statistic, while similar derivations from Wald or likelihood ratio statistics are not trivial and not easily tractable. As seen from our simulation results, the proposed index increases with the value of the regression parameter and is affected neither by the percentage of censoring nor by the sample size of the study. The results show that our pseudo-R^{2} is the most suitable for taking into account the crossing hazards phenomenon, as compared to classical indices.

Using a real dataset on lung cancer, we show that our index makes it possible to identify genes that are involved in biological processes linked to tumor evolution and that are not selected under the PH assumption.

Among the cell-cycle related genes of our selection, we investigate two genes, ^{2 }measure that is specifically designed for crossing hazards situations.

Conclusions

We propose a novel pseudo-R^{2} measure that quantifies the prognostic separability of markers under a crossing hazard function assumption. This phenomenon can be encountered in real situations, which motivates the use of this novel index.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

SR, TM and PB developed the original index. PB coordinated the project and is SR's PhD thesis advisor. All authors read and approved the final manuscript.

Acknowledgements

We acknowledge the following institutions for general funding: the Genome Institute of Singapore (Singapore) and the French Ministry of Higher Education and Research (France). We thank all our colleagues from the Computational and Mathematical Biology group for fruitful discussions.
