Estimating the treatment effect from non-randomized studies: The example of reduced intensity conditioning allogeneic stem cell transplantation in hematological diseases

1471-2326-12-10 1471-2326 Technical advance Estimating the treatment effect from non-randomized studies: The example of reduced intensity conditioning allogeneic stem cell transplantation in hematological diseases Resche-RigonMatthieumatthieu.resche-rigon@univ-paris-diderot.fr PirracchioRomainromainpirracchio@yahoo.fr RobinMariemarie.robin@univ-paris-diderot.fr De LatourPeffaultRegisregis.peffaultdelatour@sls.aphp.fr SibonDaviddavid.sibon@nck.aphp.fr AdesLionellionel.ades@avc.aphp.fr RibaudPatriciapatricia.ribaud@univ-paris-diderot.fr FermandJean-Pauljpfermand@yahoo.fr ThieblemontCatherinecatherine.thieblemont@sls.aphp.fr SociéGérardgerard.socie@sls.aphp.fr ChevretSylviesylvie.chevret@univ-paris-diderot.fr

Département de Biostatistique et Informatique Médicale, Hôpital Saint-Louis, AP-HP, Paris 75010, France

INSERM, UMRS 717, Paris 75010, France

Université Denis Diderot Paris 7, Paris 75010, France

Service d’Hématologie Greffe, Hôpital Saint-Louis, AP-HP, Paris 75010, France

Service d'Hématologie Clinique, Hôpital Avicenne, AP-HP, Bobigny 93100, France

Service d'Immuno-Hématologie, Hôpital Saint-Louis, AP-HP, Paris 75010, France

Service d'Onco-Hématologie, Hôpital Saint-Louis, AP-HP, Paris 75010, France

BMC Blood Disorders 1471-2326 2012 12 1 10 http://www.biomedcentral.com/1471-2326/12/10 10.1186/1471-2326-12-1022898556 30120129820121682012 2012Resche-Rigon et al.; licensee BioMed Central Ltd.This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Propensity score Allogeneic stem cell transplantation Treatment effect Non-randomized studies

Abstract

Background

In some clinical situations, for which RCT are rare or impossible, the majority of the evidence comes from observational studies, but standard estimations could be biased because they ignore covariates that confound treatment decisions and outcomes.

Methods

Three observational studies were conducted to assess the benefit of Allo-SCT in hematological malignancies of multiple myeloma, follicular lymphoma and Hodgkin’s disease. Two statistical analyses were performed: the propensity score (PS) matching approach and the inverse probability weighting (IPW) approach.

Results

Based on PS-matched samples, a survival benefit in MM patients treated by Allo-SCT, as compared to similar non-allo treated patients, was observed with an HR of death at 0.35 (95%CI: 0.14-0.88). Similar results were observed in HD, 0.23 (0.07-0.80) but not in FL, 1.28 (0.43-3.77). Estimated benefits of Allo-SCT for the original population using IPW were erased in HR for death at 0.72 (0.37-1.39) for MM patients, 0.60 (0.19-1.89) for HD patients, and 2.02 (0.88-4.66) for FL patients.

Conclusion

Differences in estimated benefits rely on whether the underlying population to which they apply is an ideal randomized experimental population (PS) or the original population (IPW). These useful methods should be employed when assessing the effects of innovative treatment in non-randomized experiments.

Background

Randomized controlled trial (RCT) is considered the gold standard study design for removing sources of bias from observations when estimating the effects of a treatment 1 2 . However, in some situations, it may be difficult, unnecessary, inappropriate, or impossible to perform an RCT 3 , and the majority of the evidence comes from observational studies 4 5 .

This is notably true when evaluating non-myeloablative or reduced-intensity conditioning (RIC) regimens before allogeneic stem cell transplantation (Allo-SCT). RIC Allo-SCT has emerged in the last decade as an attractive modality to decrease transplant-related toxicity. The enthusiasm for this technique has been based on heterogeneous observational studies ranging from case reports to registry cohort studies 6 7 8 9 10 11 12 13 14 . These studies are very heterogeneous in terms of patient selection criteria and outcomes, RIC regimens and timing. For this reason, conclusions regarding the the overall body of evidence in this area are very limited 15 . Only a few prospective controlled clinical trials have been performed in studies of myeloma. This is mostly due to practical difficulties and selection restrictions for patients affected by advanced or refractory diseases, elderly patients, or patients with comorbidities for whom no other treatment option could be clearly proposed. In these few recent prospective non-randomized studies that have been conducted 16 17 18 , the availability of an HLA-identical or non-identical sibling donor has been considered equivalent to so-called ”genetic randomization” of bone marrow transplant (BMT) against chemotherapy, justifying the absence of RCT 19 20 21 . Nevertheless, results of such studies are still vulnerable to selection bias and confounding factors.

In RCTs, the use of inclusion and exclusion criteria yields a sample of subjects that are all eligible for each of the treatments under study. By contrast, in observational studies, baseline selection criteria differing between Allo-SCT and other treatments may also affect patient outcome and lead to bias in the estimated effect of 2 22 . Thus, non-randomized comparative designs expose to unequal distributions of covariates that impact both the outcome and the decision to treat, so-called ”confounding by indication” 23 . Adjusted techniques of treatment estimation through the use of multivariate regression models have been widely used to control for confounding in observational data, but these methods do not provide any causal evidence comparable to that derived from RCTs. Formally, an association is considered causal when the observed outcome under the studied exposition is different from what would have been observed in the absence of the exposition. Because the latter outcome cannot actually happen, it is generally known as a counterfactual outcome 24 . In an ideal randomized design with blind assignment, full compliance, and no loss during follow-up, the absence of confounding data ensures that treated and non-treated patients exchangeable. In this setting, RCT allows causal claims about the population in the study to be deduced from differences between the treatment groups 25 . By contrast, in observational studies, because treated and non-treated populations are not exchangeable, no causal evidence could be derived from the original data 26 . Therefore, specific statistical tools have been developed to enable appopriate causal conclusions to be derived from observational data. These tools re-create the conditions of conditional exchangeability as observed in an RCT.

This article provides an illustration of two of these specific statistical approaches in the particular setting of Allo-SCT evaluation of observational cohorts. The methods described here aim at handling confounding variables induced by non-randomized designs, namely, the propensity score-based (PS) matching approach 27 and the inverse probability of treatment weighting (IPW) approach, which is derived from the marginal structural models 28 . These statistical methods have both been developed to re-create exchangeability in the presence of all confounding variables. By re-creating populations in which all the confounding variables have comparable distributions (Figure 1), they allow a causal inference and unbiased estimation of treatment effect 26 29 .

Figure 1

Illustration of the different distributions of a covariate (X) in two non-randomized samples (A & B).

Illustration of the different distributions of a covariate (X) in two non-randomized samples (A & B). The propensity score method (PS) aims at re-creating the conditions of a pseudorandomization, while the inverse probability weighting (IPW) approach aims at re-creating a pseudopopulation where patients A and B are exchangeable. Both methods aim at obtaining a similar distribution of the covariate X in the two groups.

Methods

The Allogeneic Stem Cell Transplantation cohorts

Allogeneic Stem Cell Transplantation (Allo-SCT) was performed in patients who relapsed after autologous transplantation (in Saint-Louis Hospital, Paris, France) but remained chemosensitive. Among them, all consecutive patients with multiple myeloma (MM, 23 pts), follicular lymphoma (FL, 28 pts) or Hodgkin’s disease (HD, 31 pts), were considered for analysis as follows.

• MM: Between October 2002 and August 2006, 23 consecutive MM patients under 60 years of age and in their first or second relapse received RIC Allo-SCT.

• FL: All 28 consecutive patients who received Allo-SCT for relapsing/refractory FL from December 1989 to January 2007 were eligible for analysis.

• HD: A total of 31 HD patients who received Allo-SCT from January 1995 to December 2008 were consecutively analyzed.

Selection of controls

The main issue in observational studies is the definition of control subjects to whom comparison of outcomes can be applied. As reported by Austin 30 , observational studies should be designed to approximate randomized experiments as closely as possible. This suggests that particular attention should be paid to include only those subjects who are eligible to receive either treatment or intervention 31 . This refers to the “positivity” or "overlap" 32 assumption and requires a careful selection of the original cohorts of untreated patients.

As summarized in the flow chart depicted in Figure 2, controls were selected carefully. MM controls were selected from patients enrolled in the MAG-95 and MAG-2002 trials 33 , while FL and HD patients were selected from hospital cohorts. The clinical trials from which the Multiple Myeloma control patients were selected, have been carried out in compliance with the Helsinki Declaration and French laws regarding biomedical research at the time the trials were conducted. In particular the studies were approved by the Ethics Committee of Saint Louis Hospital (Paris, France). To insure the validity of the overlap assumption, we restricted the controls to patients who survived at least six months after relapse (MM) or one year after auto-SCT (HD), since this was the minimal time between relapse or first Auto-SCT and Allo-SCT in MM and HD patients from the Allo-SCT groups, respectively.

Figure 2

Flow chart for the selection of the control patients.

Flow chart for the selection of the control patients.

Three cohorts comprised of 276 patients (142 MM, 115 FL and 19 HD) who relapsed after autologous transplantation (auto-SCT) but did not undergo allogeneic stem cell transplantation were retained for analysis. Patients who had contraindications (severe comorbidities, age > 65 years..) to Allo-SCT were excluded from the cohort.

To estimate the benefit of Allo-SCT from observational cohort data, three analyses were performed in each cohort of MM, FL and HD patients separately. Both approaches require modeling the probability of being treated.

Probability of treatment model: Propensity Score

The propensity score (PS) is derived from the probability that a given patient would receive Allo-SCT conditionally to his confounding covariates, X. It is estimated by fitting a multivariate logistic model to the original cohorts of treated and untreated patients in order to predict allocation to Allo-SCT from patient covariates, X 27 34 35 . This aims to re-create exchangeability, that is, there is no unmeasured confounding variable. Unfortunately, this assumption cannot be tested, and the PS model requires the analyst to have confidence that X contains almost all characteristics related to both treatment and outcome, and that there are no additional, unmeasured, confounders 36 .

Since one cannot know all the covariates that are confounding, this multivariable model should include most of the covariates measured at baseline, or at least those known or suspected to be confounding, in the hope that there is at least one measured covariate strongly related to all the confounders 37 38 . Nevertheless, due to the sample size of the cohorts, we only included those variables that were strongly related to the treatment allocation in the PS models 38 . These included age at diagnosis, time to relapse and beta-2-microglobulin level for the MM cohorts, age at relapse, time from relapse to SCT and number of previous regimens for the FL cohorts, and age at diagnosis and stage for the HD cohort.

Estimation of causal benefit of Allo-SCT

The main endpoints were overall survival (OS) and event-free survival (EFS). These were defined in the Allo-SCT groups from the date of Allo-SCT for MM and FL and from the date of first autologous SCT for HD. In the non-Allo-SCT patients, OS and EFS were defined from the date of relapse plus six months for MM, from the date to autologous SCT for FL and from the date of first autologous SCT plus 12 months for HD. We first fitted standard Cox models to the original samples. Then, specific methods to handle confounding variables were applied.

Matched propensity score-based approach

Propensity score (PS) analysis attempts to create a comparison group of non-treated patients that closely mimics the group of treated patients by matching based on the likelihood that a given patient has received Allo-SCT considering all his confounders (Figure 1) 34 .

It is based on a matched-paired analysis as follows 39 40 : Allo-SCT patients and controls are matched on the logit of the PS using calipers of width equal to 0.2 of its standard deviation (SD). Two patients of a pair cannot differ in the linear score of being treated by more than 0.2 SD 39 40 . A nearest-neighbor matching algorithm was thus used to form pairs of treated and untreated subjects with the constraint that once a patient had been matched, he(she) could not be further matched.

The degree to which the matching procedure adequately balanced covariates between patients who received Allo-SCT and those who did not was evaluated by comparing the standardized mean differences of the main measured baseline covariates between treated and untreated patients in the original and matched samples 35 41 .

The benefit of Allo-SCT to outcome was then estimated by fitting a Cox model that applies to the propensity-based matched sample using a robust variance estimator to take into account the correlation induced by the matching 42 43 .

Inverse probability weighting approach

As an alternative to the PS matching approach, inverse probability of treatment weighted (IPW) estimators have been developed to draw causal conclusions from observational data in the presence of confounding variables by indication 24 44 45 . This approach consists of creating a hypothetical population, the so-called pseudo population, that includes patients for which there are no example of Allo-SCT treated or untreated patients sharing the same characteristics (Figure 1) 28 46 47 . In that pseudo population, in which the probability of treatment no longer depends on covariates, the effect of the treatment on outcome is the same as in the original selected population. This pseudo-population is expected to have the X distribution of the total population.

This method uses propensity scores to derive weights for individual observations. Actually, each individual is assigned a weight, which is inversely proportional to his (her) probability of receiving the treatment he (she) actually received (either Allo-SCT or not), conditionally to the value of his (her) counfonding covariate X 28 . It is thus computed directly from 1/PS or 1/(1-PS), respectively. This is also referred as the "PS weighted modelling method" or the "inverse propensity weighted method" 28 29 36 46 48 .

A marginal causal effect of Allo-SCT on survival or EFS in the resulting pseudo-cohorts is then analyzed by using a weighted Cox proportional hazard model. As in the matched propensity score-based approach, a robust variance estimator is applied to take into account that each patient contributed more than once, given that weights are not equal to one 28 .

Statistical analysis

Logistic models, Cox models and weighted Cox models were fitted using standard packages of R software 49 . Matching was performed using the Matching R package. Equivalent packages are available in standard statistical softwares.

We checked for model misspecifications, i.e., of either the PS or IPW models. For the PS model, we checked for linearity between continuous covariates and the log-odds of receiving treatment 41 . For the IPW model, we explored the distribution of weights (mean, standard deviations, minimum and maximum) 39 . Weights distribution was considered as optimized when mean weights were close to 1 with limited dispersion 28 46 . Reductions in the imbalances reached by each method were assessed using graphical displays of the standardized mean difference in main covariates between treatment groups 41 50 .

Finally, Cox model assumptions of proportional hazards and log-linearity for continuous covariates were checked 51 .

Results and discussion

Three separate analyses were thus performed corresponding to MM, FL and HD patients, respectively.

Baseline comparison

As expected due to the-non randomized designs, and although controls were selected carefully to avoid non-overlapped confounding variables, Allo-SCT and control patients markedly differed at baseline (Table 1). As expected, all patients who received Allo-SCT were younger than those who did not. Moreover, MM patients who received Allo-SCT had relapsed earlier (median 16 vs. 26.5) than those who did not; by contrast, HD patients from the Allo-SCT group had delayed relapse as compared to the control group (median: 1.2 vs. 2.4). Otherwise, FL patients from the Allo-SCT group received a higher number of previous regimens (4 vs. 3) while those HD patients had less (3 vs. 4). This is illustrated on plots of absolute mean standardized differences in Figure 3.

Table 1

Median [Q1-Q3]

Allo-SCT

Controls

p-value

N (%)

Original set

n=23

n=142

Age

48 [40.5-51]

51.5 47 48 49 50 51 52 53 54 55

0.005

Beta2 ≥ 3.5

4 (17 %)

52 (37 %)

0.12

Months to relapse

16 [11–32.5]

26.5 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38

0.014

Matched set

n=21

Age

49 41 42 43 44 45 46 47 48 49 50 51

46 42 43 44 45 46 47 48 49 50

0.24

Beta2 ≥ 3.5

4 (19 %)

0.71

Months to relapse

17 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33

24 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32

0.22

Weighted set

n=268

n=165

Age

56 51 52 53 54 55 56 57 58

51 46 47 48 49 50 51 52 53 54

0.15

Beta2 ≥ 3.5

4 (19%)

0.28

Months to relapse

58 [26–70]

25 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36

0.08

Original set

n=28

n=115

Age

38 33 34 35 36 37 38 39 40 41 42

46 40 41 42 43 44 45 46 47 48 49 50 51 52

0.0001

No previous regimens

4 3 4

3 2 3 4

0.005

Months to relapse

6.7 [5.6-9.2]

4.6 [3.7-6.1]

0.0001

Matched set

n=19

Age

38 33 34 35 36 37 38 39 40 41 42

38 33 34 35 36 37 38 39 40 41 42 43 44 45

0.35

No previous regimens

3 3 4

0.90

Months to relapse

6.3 [4.7-8.9]

7.8 [4.0-10.7]

0.82

Weighted set

n=78

n=117

Age

39 35 36 37 38 39 40 41 42 43 44 45 46

44 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50

0.42

No previous regimens

3 3 4

3 2 3 4

0.16

Months to relapse

5.9 [4.7-7.4]

5.3 [3.8-9.3]

0.03

Hodgkin disease

Original set

n=23

n=19

Age

23 19 20 21 22 23 24 25 26 27 28 29

29 24 25 26 27 28 29 30 31 32 33 34 35

0.05

No previous regimens

4 3 4

4 4 5

0.05

Months to relapse

1.2 [0–9.2]

2.4 [0–7.2]

0.94

Matched set

n=15

Age

24 21 22 23 24 25 26 27 28 29 30 31

25 23 24 25 26 27 28 29 30

0.98

No previous regimens

3 3 4

4 4 5

0.21

Months to relapse

1.9 [0.3-9.9]

1.8 [0–5.4]

0.49

Weighted set

n=41

n=40

Age

26 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49

25 20 21 22 23 24 25 26 27 28 29 30 31 32

0.71

No previous regimens

3 3 4

4 4

0.17

Months to relapse

0.1 [0–0.7]

0.1 [0–0.5]

0.35

Table 2

Numbers of patients

OS : HR (CI95%)

EFS : HR (CI95%)

Allo-SCT/Controls

Original samples

PS-matched samples

Naive

IPW

Naive

IPW

23/142

21/21

0.38 (0.18;0.80)

0.35 (0.14;0.88)

0.72 (0.37;1.39)

28/115

19/19

2.55 (1.37;4.75)

1.28 (0.43;3.77)

2.02 (0.88;4.66)

1.21 (0.68;2.18)

0.45 (0.17;1.21)

0.67 (0.31;1.41)

22/19

15/15

0.33 (0.12;0.87)

0.23 (0.07;0.80)

0.60 (0.19;1.89)

0.71 (0.38;1.35)

0.47 (0.20;1.09)

0.64 (0.33;1.22)

Figure 3

Imbalances in the MM, FL and HD cohorts, defined as the standardized means differences of covariate values between the two treatment groups.

Imbalances in the MM, FL and HD cohorts, defined as the standardized means differences of covariate values between the two treatment groups. · Naive Analysis, red circle symbol = Propensity Score Model, blue circle symbol = Marginal Structural Model.

Treatment effect

From the naive analyses based on standard Cox models, a significant benefit associated with RIC Allo-SCT was observed for MM patients with an estimated hazard ratio (HR) of death at 0.38 (95% confidence interval 95%CI: 0.18;0.80) and for HD patients (HR = 0.33, 95%CI: 0.12;0.87) while Allo-SCT seemed to be deleterious in FL patients (HR = 2.55, 95%CI: 1.37;4.75). No significant benefit was found in terms of EFS (HR =1.21, 95%CI: 0.68;2.18, HR =0.71, 95%CI: 0.38;1.35 for FL and HD respectively).

Matched propensity score-based approach

The matching procedure resulted in a drastic reduction of the sample size of the PS-matched samples. From the original datasets, 21 (91% of RIC Allo-SCT patients, 15% of controls) matched pairs could be constituted from MM patients, as compared to 19 (68% of Allo-SCTpatients, 17% of controls) from the the FL patients, and 15 (48% of the Allo-SCT patients and 79% of the controls) from the HD patients. This relies both on the original differences in sample sizes and the non-overlapped covariates values (Table 1). As a result, baseline imbalances between the two matched sets were reduced (Figure 3). Note that imbalance was also reduced for those covariates not included in the PS, especially age at diagnosis and age at transplantation in the FL cohort.

Based on these PS-matched samples, we observed a significant benefit to the survival of Allo-SCT as compared to non Allo-SCT MM patients with an estimated HR of death at 0.35 (95%CI: 0.14-0.88), as well as HD (HR = 0.23, 95%CI: 0.07;0.80). A similar result was not found for FL patients (HR = 1.28; 95%CI: 0.43;3.77). No significant benefit was found for EFS with the estimated HR of event at 0.45 (95%CI: 0.17;1.21) in FL and 0.47 (95%CI: 0.20;1.09) in HD.

IPW approach

Using the IPW approach, imbalances in the pseudo-cohorts were also reduced, though reduction was slightly less effective than that observed using the PS (Figure 3). Actually, the distribution of the covariates in the weighted samples (pseudo-population, was close to that observed in the original datasets (Table 1).

Despite similar trends, the survival benefit associated with Allo-SCT in MM and HD patients was erased using IPW based analyses as compared to PS-based analyses, which yielded an estimated HR of death of 0.72 (95%CI: 0.37-1.39) and 0.60 (95%CI: 0.19-1.89), respectively. Results for FL patients remained non-significant (HR = 2.02, 95%CI: 0.88;4.66). No significant benefit was found for EFS, which gave an estimated HR of event of 0.67 (95%CI: 0.31;1.41) in FL and 0.64 (95%CI: 0.33;1.22) in HD.

The main objective of this paper was to report examples of treatment estimation from observational cohorts in the particular setting of Allogeneic Stem Cell Transplantation. Despite the fact that the randomized controlled trial (RCT) is the gold standard for removal of most sources of bias from observational data, such studies are difficult to conduct when evaluating Allo-SCT. In situations such as HLA-matched sibling allogeneic transplants, some authors have advocated a biological assignment trial 16 . Such trials are also known as genetic or Mendelian randomization trials, and these trials consider the selection of the sibling donor and recipient genes from their parents as a random process at the time of conception. Nevertheless, implementing such a trial requires careful consideration of the ethical issues and potential biases (prognostic factor imbalance, enrollment bias) 21 . Moreover, these trials are prospective and require several years to provide estimates of survival benefits, while observational information about treatment effect are already available.

Indeed, observational studies have several advantages over randomized, controlled trials, including lower cost, greater timeliness, and a broader range of patients 8 . Moreover, systematic reviews tend to demonstrate that, when adequaltely performed, observational studies give results similar to those of randomized clinical trials 52 . In the hematology field, and especially in that of Allo-SCT, many international cooperating groups exist and register all blood or marrow transplantation experiments. Notably, the European Group for Blood and Marrow Transplantation (EBMT) and the Center for International Blood and Marrow Transplant Research (CIBMTR) have collected information about patients undergoing Allo-SCT since the 1970s. Such observational registers could be a an important source of information when estimating the causal effect of Allo-SCT as compared to autologous SCT or other standard treatments. Nevertheless, standard statistical analyses from such observational data may result in biased and associational rather than causal estimates of treatment effect 27 28 .

Since 2000, there has been a growing interest in the use of statistical methods to estimate unbiased treatment effects from observational studies and begin to be used in haematology or oncology 53 54 55 56 . Most of these methods are based on the propensity score, i.e. re-creation of the exchangeability between the two treatments groups. Two main approaches have been proposed in this setting, namely, the propensity score-matched approach and the inverse probability weighting approach 36 . If these approaches were initially proposed for large studies, recent work by Pirracchio et al. showed that propensity score approaches (matching or IPW) are also valid and useful on small sample studies 5 . We illustrated how those methods could perform to estimate the effect of Allo-SCT on survival and event-free survival using observational data from multiple myeloma, follicular lymphoma and Hodgkin’s disease observational cohorts. Obviously, considering our low sample sizes, our findinds should be confirmed by larger studies.

However, as recently pointed out 32 , both approaches are interested in estimating different quantities, namely the average treatment effect (ATE) and the ATE for the treated (ATT). The propensity based approach aims at estimating the ATT, i.e. the effect of treatment on those subjects who are treated, allowing observational studies to be designed similarly to randomized experiments 57 . By contrast, the inverse probability weighting approach aims at estimating the ATE, that is, the average effect on the population of moving all subjects from being untreated to treated. According to specific clinical contexts, researchers should determine the most clinically meaningful treatment effect. When evaluating the benefit of Allo-SCT as compared to chemotherapy, ATE (and thus, the IPW approach) would answer the question about how outcomes would change if a policy was instituted that all patients eligible for either therapy were offered Allo-SCT. By contrast, ATT would answer the question of what was the effect of treatment for those who selected a particular modality such as Allo-SCT. This explains why estimated resulting hazard ratio estimates differed between the two approaches. Indeed, by contrast to the PS-based approach, the IPW approach never showed a significant impact of Allo-SCT on overall survival or event-free survival. In other words, the benefit of Allo-SCT appeared to be restricted to treated patients, while no average benefit appeared to be expected in the whole eligible population for Allo-SCT. This is likely to rely on the fact that the benefit of Allo-SCT may be restricted to some subsets of patients that have been excluded by matching in the PS-matched analyses but maintened, and possibly heavily weighted, in the IPW method. This further highlights the importance of the positivity (overlap) assumption.

Indeed, whatever the approach, each subject is assumed to have a non-zero probability of receiving either treatment. This suggests that observational studies should be designed similar to RCTs. That is, subjects who are ineligible for at least one of the treatments should be excluded 32 . Actually, this was exemplified in our cohorts by the percentage of control patients who could not be matched, ranging from 21% in HD up to 85% in MM. Such percentages could be related to the differences in the criteria used to define controls. Moreover, it is assumed that all variables related to both outcomes and treatment assignments were introduced in the propensity score model 35 . Rubin suggested including only variables that are strongly related to the treatment allocation, while others have proposed the application of selection algorithms 37 58 . Our PS models were based on unbalanced characteristics with known clinical significance and the number of variables was limited by the sample size. Therefore, one cannot exclude that other confusing characteristics should have been included in the PS model.

Other methods could be proposed to estimate treatment effect in non-randomized studies. The most popular method consists in estimating treatment effects using adjustment on covariates with a multivariable regression model 5 .The main limitation of this approach is that the treatment effect estimated is neither the ATE nor the ATT. Indeed, the treatment effect measured is conditional on the other covariates and then biased if used as an estimate of the ATE or ATT. Another emerging approach is the instrumental variable (IV) approach which is an econometric method used to remove the effects of hidden bias in observational studies 5 . An instrumental variable has 2 key characteristics: it is highly correlated with treatment and does not independently affect the outcome, so that it is not associated with measured or unmeasured patient health status. In our case, none available variable could be considered as an IV. Moreover, this approach hasn’t been validated on small samples. This should deserve further evaluation to be used in such clinical settings.

Conclusion

In summary, it is expected that hematologists involved in clinical research will face an increasing need for methods such as those discussed here when assessing effects of innovative treatments based on cohorts or registries. Actually, though they do not replace randomized trials, these approaches have already been widely used in other medical settings such as cardiology or critical care 7 59 . This could be similar to what happened a decade ago with competing risks approaches in estimating the incidence of relapse. Whatever the statistical innovation, full understanding of the method is required. Notably, differences in the proposed methods should be anticipated by considering the population of interest for which the benefit is likely to apply. In other words, physicians and researchers should carefully assess whether they are interested in estimating the average treatment effect in the eligible population or only in those who were treated.

Abbreviations

RCT: Randomized Controlled Trial; RIC: Reduced-Intensity Conditioning; SCT: Stem Cell Transplantation; Allo-SCT: Allogeneic Stem Cell Transplantation; BMT: Bone Marrow Transplant; PS: Propensity Score; IPW: Inverse Probability of treatment Weighting; MM: Multiple Myeloma; FL: Follicular Lymphoma; HD: Hodgkin’s Disease; ATE: Average Treatment Effect; ATT: Average Treatment effect for the Treated; OS: Overall Survival; EFS: Event-Free Survival; SD: Standard Deviation; HR: Hazard ratio; EBMT: European Group for Blood and Marrow Transplantation; CIBMTR: Center for International Blood and Marrow Transplant Research.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

MRR, RP, SC participated in the design of the study, performed the statistical analysis and drafted the manuscript. MR, DS, JPF, CT, GS participated in the design of the study. All authors read and approved the final manuscript.

Statistics notes. Treatment allocation in controlled trials: why randomise?AltmanDGBlandJMBMJ199931871921209111559510221955The need for randomization in the study of intended effectsMiettinenOSStat Med19832226727110.1002/sim.47800202226648141Why we need observational studies to evaluate the effectiveness of health careBlackNBMJ199631270401215121810.1136/bmj.312.7040.121523509408634569A comparison of observational studies and randomized, controlled trialsBensonKHartzAJAm J Ophthalmol2000130568811078862Analysis of observational studies in the presence of treatment selection bias: effects of invasive cardiac management on AMI survival using propensity score and instrumental variable methodsStukelTAFisherESWennbergDEAlterDAGottliebDJVermeulenMJJAMA2007297327828510.1001/jama.297.3.278217052417227979Comparative outcome of reduced intensity and myeloablative conditioning regimen in HLA identical sibling allogeneic haematopoietic stem cell transplantation for patients older than 50 years of age with acute myeloblastic leukaemia: a retrospective survey from the Acute Leukemia Working Party (ALWP) of the European group for Blood and Marrow Transplantation (EBMT)AoudjhaneMLabopinMGorinNCShimoniARuutuTKolbHJFrassoniFBoironJMYinJLFinkeJLeukemia200519122304231210.1038/sj.leu.240396716193083Engraftment of allogeneic hematopoietic progenitor cells with purine analog-containing chemotherapy: harnessing graft-versus-leukemia without myeloablative therapyGiraltSEsteyEAlbitarMvan BesienKRondonGAnderliniPO'BrienSKhouriIGajewskiJMehraRBlood19978912453145369192777Transplant-lite: induction of graft-versus-malignancy using fludarabine-based nonablative chemotherapy and allogeneic blood progenitor-cell transplantation as treatment for lymphoid malignanciesKhouriIFKeatingMKorblingMPrzepiorkaDAnderliniPO'BrienSGiraltSIppolitiCvon WolffBGajewskiJJ Clin Oncol1998168281728249704734Hematopoietic cell transplantation in older patients with hematologic malignancies: replacing high-dose cytotoxic therapy with graft-versus-tumor effectsMcSweeneyPANiederwieserDShizuruJASandmaierBMMolinaAJMaloneyDGChaunceyTRGooleyTAHegenbartUNashRABlood200197113390340010.1182/blood.V97.11.339011369628Allogeneic bone marrow transplant is not better than autologous transplant for patients with relapsed Hodgkin's disease. European Group for Blood and Bone Marrow Transplantation.MilpiedNFieldingAKPearceRMErnstPGoldstoneAHJ Clin Oncol1996144129112968648386An EBMT registry matched study of allogeneic stem cell transplants for lymphoma: allogeneic transplantation is associated with a lower relapse rate but a higher procedure-related mortality rate than autologous transplantationPeniketAJRuiz de ElviraMCTaghipourGCordonnierCGluckmanEde WitteTSantiniGBlaiseDGreinixHFerrantABone Marrow Transplant200331866767810.1038/sj.bmt.170389112692607Comparison of autologous and allogeneic hematopoietic stem cell transplantation for follicular lymphomavan BesienKLoberizaFRJrBajorunaiteRArmitageJOBasheyABurnsLJFreytesCOGibsonJHorowitzMMInwardsDJBlood2003102103521352910.1182/blood-2003-04-120512893748Allogeneic transplantation improves the overall and progression-free survival of Hodgkin lymphoma patients relapsing after autologous transplantation: a retrospective study based on the time of HLA typing and donor availabilitySarinaBCastagnaLFarinaLPatriarcaFBenedettiFCarellaAMFaldaMGuidiSCiceriFBoniniABlood2010115183671367710.1182/blood-2009-12-25385620220116Nonmyeloablative stem cell transplantation and cell therapy as an alternative to conventional bone marrow transplantation with lethal cytoreduction for the treatment of malignant and nonmalignant hematologic diseasesSlavinSNaglerANaparstekEKapelushnikYAkerMCividalliGVaradiGKirschbaumMAckersteinASamuelSBlood19989137567639446633Reduced-intensity conditioning allogeneic stem cell transplantation: hype, reality or time for a rethink?MohtyMNaglerAKillmannNMLeukemia200620101653165410.1038/sj.leu.240433617041634A comparison of allografting with autografting for newly diagnosed myelomaBrunoBRottaMPatriarcaFMordiniNAllioneBCarnevale-SchiancaFGiacconeLSorasioROmedePBaldiIN Engl J Med2007356111110112010.1056/NEJMoa06546417360989Prospective comparison of autologous stem cell transplantation followed by dose-reduced allograft (IFM99-03 trial) with tandem autologous stem cell transplantation (IFM99-04 trial) in high-risk de novo multiple myelomaGarbanFAttalMMichalletMHulinCBourhisJHYakoub-AghaILamyTMaritGMaloiselFBerthouCBlood200610793474348010.1182/blood-2005-09-386916397129Long-term follow-up results of IFM99-03 and IFM99-04 trials comparing nonmyeloablative allotransplantation with autologous transplantation in high-risk de novo multiple myelomaMoreauPGarbanFAttalMMichalletMMaritGHulinCBenboubkerLDoyenCMohtyMYakoub-AghaIBlood200811293914391510.1182/blood-2008-07-16882318948589Prospective genetically randomized comparison between intensive postinduction chemotherapy and bone marrow transplantation in adults with newly diagnosed acute myeloid leukemiaArchimbaudEThomasXMichalletMJaubertJTroncyJGuyotatDFiereDJ Clin Oncol19941222622678113835Myeloablative allogeneic versus autologous stem cell transplantation in adult patients with acute lymphoblastic leukemia in first remission: a prospective sibling donor versus no-donor comparisonCornelissenJJvan der HoltBVerhoefGEvan't VeerMBvan OersMHSchoutenHCOssenkoppeleGSonneveldPvan Marwijk KooyMBlood200911361375138210.1182/blood-2008-07-16862518988865Use of biological assignment in hematopoietic stem cell transplantation clinical trialsLoganBLeiferEBredesonCHorowitzMEwellMCarterSGellerNClin Trials20085660761610.1177/1740774508098326267101519029209Randomized trials or observational tribulations?PocockSJElbourneDRN Engl J Med2000342251907190910.1056/NEJM20000622342251110861329Confounding and indication for treatment in evaluation of drug treatment for hypertensionGrobbeeDEHoesAWBMJ199731571161151115410.1136/bmj.315.7116.115121277259374894A definition of causal effect for epidemiological researchHernanMAJ Epidemiol Community Health200458265271BMJ Publishing Group Ltd10.1136/jech.2002.006361173273715026432Causal inference in retrospective studiesHollandPWRubinDBEval Rev19881220310.1177/0193841X8801200301Estimating causal effects from epidemiological dataHernanMARobinsJMJ Epidemiol Community Health200660757858610.1136/jech.2004.029496265288216790829The central role of the propensity score in observational studies for causal effectsRosenbaumPRRubinDBBiometrika1983701415510.1093/biomet/70.1.41Marginal structural models and causal inference in epidemiologyRobinsJMHernanMABrumbackBEpidemiology200011555056010.1097/00001648-200009000-0001110955408A distributional approach for causal inference using propensity scoresZhiqiangTJ Am Stat Assoc20061014761619163710.1198/016214506000000023The design versus the analysis of observational studies for causal effects: parallels with the design of randomized trialsAustinPCStat Med200726203610.1002/sim.273917072897An application of model-fitting procedures for marginal structural modelsMortimerKMNeugebauerRvan der LaanMTagerIBAm J Epidemiol2005162438238810.1093/aje/kwi20816014771Different measures of treatment effect for different research questionsAustinPCJ Clin Epidemiol201063191010.1016/j.jclinepi.2009.07.00619837563Tandem autologous non-myeloablative allogeneic transplantation in patients with multiple myeloma relapsing after a first high dose therapyKarlinLArnulfBChevretSAdesLRobinMDe LatourRPMalphettesMKabbaraNAsliBRochaVBone Marrow Transplant201046225025620400980Propensity score methods for bias reduction in the comparison of a treatment to a non-randomized control groupD'AgostinoRBJrStat Med199817192265228110.1002/(SICI)1097-0258(19981015)17:19<2265::AID-SIM918>3.0.CO;2-B9802183Reducing bias in observational studies using subclassification on the propensity scoreRosenbaumPRRubinDBJ Am Stat Assoc19847938751652410.1080/01621459.1984.10478078Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative studyLuncefordJKDavidianMStat Med200423192937296010.1002/sim.190315351954Variable selection for propensity score modelsBrookhartMASchneeweissSRothmanKJGlynnRJAvornJSturmerTAm J Epidemiol2006163121149115610.1093/aje/kwj149151319216624967Matching using estimated propensity scores: relating theory to practiceRubinDBThomasNBiometrics199652124926410.2307/25331608934595Type I error rates, coverage of confidence intervals, and variance estimation in propensity-score matched analysesAustinPCInt J Biostat20095115574679April 2009. http://dx.crossref.org/10.2202%2F1557-4679.114610.2202/1557-4679.1146A comparison of the ability of different propensity score models to balance measured variables between treated and untreated subjects: a Monte Carlo studyAustinPCGrootendorstPAndersonGMStat Med200726473475310.1002/sim.258016708349Balance diagnostics for comparing the distribution of baseline covariates between treatment groups in propensity-score matched samplesAustinPCStat Med200928253083310710.1002/sim.3697347207519757444A comparison of propensity score methods: a case-study estimating the effectiveness of post-AMI statin useAustinPCMamdaniMMStat Med2005251220842106Model Selection, Confounder Control, and Marginal Structural ModelsJoffeMMTen HaveTRFeldmanHIAm Stat200458427227910.1198/000313004X5824Instruments for causal inference: an epidemiologist's dream?HernanMARobinsJMEpidemiology200617436037210.1097/01.ede.0000222409.00878.3716755261Marginal structural models to estimate the causal effect of zidovudine on the survival of HIV-positive menHernanMABrumbackBRobinsJMEpidemiology200011556157010.1097/00001648-200009000-0001210955409Constructing inverse probability weights for marginal structural modelsColeSRHernanMAAm J Epidemiol2008168665666410.1093/aje/kwn164273295418682488RothmanKJGreenlandSLashTLModern epidemiology2008Model-based direct adjustmentRosenbaumPRJ Am Stat Assoc19878239838739410.1080/01621459.1987.10478441Team RDCR: A Language and Environment for Statistical ComputingVienna, Austria: R Foundation for Statistical Computing2009Outcomes in ambulatory chronic systolic and diastolic heart failure: a propensity score analysisAhmedAPerryGJFlegJLLoveTEGoffDCJrKitzmanDWAm Heart J2006152595696610.1016/j.ahj.2006.06.020262847417070167ColletDModelling survival data in medical researchLondon, UK: Chapman & Hall/CRC22003Randomized, controlled trials, observational studies, and the hierarchy of research designsConcatoJShahNHorwitzRIN Engl J Med2000342251887189210.1056/NEJM200006223422507155764210861325Application of propensity score model to examine the prognostic significance of lymph node number as a care quality indicatorChangYJChenLJChungKPLaiMSSurg Oncol2012212e75e8510.1016/j.suronc.2011.12.00322221938Propensity scores in intensive care and anaesthesiology literature: a systematic reviewGayatEPirracchioRResche-RigonMMebazaaAMaryJYPorcherRIntensive Care Med201036121993200310.1007/s00134-010-1991-520689924Intensive chemotherapy for elderly patients with acute myelogeneous leukemia: a propensity score analysis by the Japan Hematology and Oncology Clinical Study Group (J-HOCS)OshimaKTakahashiWAsano-MoriYIzutsuKTakahashiTAraiYNakagawaYUsukiKKurokawaMSuzukiKAnn Hematol201217Comparison of adverse events during 5-fluorouracil versus 5-fluorouracil/oxaliplatin adjuvant chemotherapy for stage III colon cancer: A Population-based analysisSanoffHKCarpenterWRFreburgerJLiLChenKZulligLLGoldbergRMSchymuraMJSchragDCancer2012Using propensity scores to help design observational studies: application to the tobacco litigationRubinDBHealth Services and Outcomes Research Methodology20012316918810.1023/A:1020363010465Estimation of causal effects using propensity score weighting: An application to data on right heart catheterizationHiranoKImbensGWHealth Services and Outcomes Research Methodology20012325927810.1023/A:1020371312283Perioperative statin therapy and renal outcomes after major vascular surgery: a propensity-based analysisKorDJBrownMJIscimenRBrownDRWhalenFXRoyTKKeeganMTJ Cardiothorac Vasc Anesth200822221021610.1053/j.jvca.2007.12.01918375322

Pre-publication history

The pre-publication history for this paper can be accessed here:

http://www.biomedcentral.com/1471-2326/12/10/prepub