Journal articles

A new hybrid record linkage process to make epidemiological databases interoperable: application to the GEMO and GENEPSO studies involving BRCA1 and BRCA2 mutation carriers

Yue Jiao 1, 2, 3, 4; Fabienne Lesueur 3, 4, 1, 2; Chloé-Agathe Azencott 3, 4, 2, 5; Maïté Laurent 1, 2; Noura Mebirouk 3, 4, 1, 2; Lilian Laborde 6, 7, 8; Juana Beauvallet 3, 4, 1, 2; Marie-Gabrielle Dondon 3, 4, 1, 2; Séverine Eon-Marchais 3, 4, 1, 2; Anthony Laugé 1, 2; Catherine C. Noguès 6, 7, 8, 9, 10; Nadine Andrieu 3, 4, 1, 2; Dominique Stoppa-Lyonnet 1, 2, 11, 7; Sandrine M Caputo 1, 2; Nadia Boutry-Kryza; Alain Calender; Sophie Giraud; Mélanie Léone; Brigitte Bressac-de Paillerets; Olivier Caron; Marine Guillaud-Bataille; Yves-Jean Bignon; Nancy Uhrhammer; Valérie Bonadona; Christine Lasset; Pascaline Berthet; Laurent Castera; Dominique Vaur; Violaine Bourdon; Tetsuro Noguchi; Cornel Popovici; Audrey Remenieras; Hagay Sobol; Isabelle Coupier; Pierre-Olivier Harmand; Pascal Pujol 12, 13; Paul Vilquin; Aurélie Dumont; Françoise Révillion; Danièle Muller; Emmanuelle Barouk-Simonet; Françoise Bonnet; Virginie Bubien; Michel Longy; Nicolas Sevenet; Laurence Gladieff; Rosine Guimbaud; Viviane Feillel; Christine Toulas; Hélène Dreyfus; Dominique Leroux; Magalie Peysselon; Christine Rebischung; Amandine Baurand; Geoffrey Bertolone; Fanny Coron; Laurence Faivre; Vincent Goussot; Caroline Jacquot; Caroline Sawka; Caroline Kientz; Marine Lebrun; Fabienne Prieur; Sandra Fert-Ferrer; Véronique Mari; Laurence Venat-Bouvet; Stéphane Bézieau; Capucine Delnatte; Isabelle Mortemousque; Florence Coulet; Florent Soubrier; Mathilde Warcoin; Myriam Bronner; Sarab Lizard; Johanna Sokolowska; Marie-Agnès Collonge-Rame; Alexandre Damette; Paul Gesta; Hakima Lallaoui; Jean Chiesa; Denise Molina-Gomes; Olivier Ingster; Sylvie Manouvrier-Hanu; Sophie Lejeune; Pauline Pontois; Dominique Stoppa Lyonnet; Marion Gauthier-Villars; Bruno Buecher; Emmanuelle Mouret-Fourme; Jean-Pierre Fricker; Elisabeth Luporsi; Marc Frenay; Francois Eisinger; Jessica Moretta; Catherine Dugast; Chrystelle Colas; Alain Lortholary; Philippe Vennin; Claude Adenis; Tan Dat Nguyen; Annick Rossi; Julie Tinat; Isabelle Tennevet; Jean-Marc Limacher; Christine Maugard; Jean-Yves Bignon; Liliane Demange; Odile Cohen-Haguenauer; Brigitte Gilbert; Hélène Zattara-Cannoni
Abstract : Background: Linking independent sources of data describing the same individuals enables innovative epidemiological and health studies but requires a robust record linkage approach. We describe a hybrid record linkage process to link databases from two independent ongoing French national studies, GEMO (Genetic Modifiers of BRCA1 and BRCA2), which focuses on the identification of genetic factors modifying cancer risk of BRCA1 and BRCA2 mutation carriers, and GENEPSO (prospective cohort of BRCAx mutation carriers), which focuses on environmental and lifestyle risk factors. Methods: To identify as many as possible of the individuals participating in the two studies but not registered under a shared identifier, we combined probabilistic record linkage (PRL) and supervised machine learning (ML). This approach (named "PRL + ML") combined the candidate matches identified by both approaches. We built the ML model using the gold standard on a first version of the two databases as a training dataset. This gold standard was obtained from PRL-derived matches verified by an exhaustive manual review. Results: The Random Forest (RF) algorithm showed the highest recall (0.985) among six widely used ML algorithms: RF, Bagged trees, AdaBoost, Support Vector Machine, Neural Network. Therefore, RF was selected to build the ML model since our goal was to identify the maximum number of true matches. Our combined linkage PRL + ML showed a higher recall (range 0.988-0.992) than either PRL (range 0.916-0.991) or ML (0.981) alone. It identified 1995 individuals participating in both GEMO (6375 participants) and GENEPSO (4925 participants). Conclusions: Our hybrid linkage process represents an efficient tool for linking GEMO and GENEPSO. It may be generalizable to other epidemiological studies involving other databases and registries.
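As a rough illustration of why pooling candidate matches can raise recall, the sketch below reads "PRL + ML" as a set union of the record pairs proposed by each method and scores each set against a gold standard. This reading is an assumption based on the abstract, not the authors' implementation; the function names (`recall`, `combine`) and the toy record identifiers are hypothetical.

```python
def recall(candidates, gold):
    """Fraction of gold-standard matches found among the candidate pairs."""
    if not gold:
        raise ValueError("gold standard must not be empty")
    return len(candidates & gold) / len(gold)


def combine(prl_pairs, ml_pairs):
    """Assumed 'PRL + ML' step: keep every pair proposed by either method."""
    return prl_pairs | ml_pairs


# Hypothetical toy pairs: (GEMO record id, GENEPSO record id).
gold = {("g1", "p1"), ("g2", "p2"), ("g3", "p3"), ("g4", "p4")}
prl_pairs = {("g1", "p1"), ("g2", "p2"), ("g9", "p9")}  # misses g3 and g4
ml_pairs = {("g2", "p2"), ("g3", "p3")}                 # misses g1 and g4

combined = combine(prl_pairs, ml_pairs)

print("PRL recall:     ", recall(prl_pairs, gold))
print("ML recall:      ", recall(ml_pairs, gold))
print("PRL + ML recall:", recall(combined, gold))
```

Under this union reading, the combined recall can never fall below that of either method alone, which is consistent with the ranges reported in the abstract. The trade-off is that false candidates from both methods are pooled as well, so precision has to be checked separately.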
Document type : Journal articles
Contributor : Odile Malbec
Submitted on : Wednesday, August 4, 2021 - 4:59:54 PM
Last modification on : Tuesday, July 5, 2022 - 9:48:59 AM


Publisher files allowed on an open archive


Distributed under a Creative Commons Attribution 4.0 International License



Yue Jiao, Fabienne Lesueur, Chloé-Agathe Azencott, Maïté Laurent, Noura Mebirouk, et al. A new hybrid record linkage process to make epidemiological databases interoperable: application to the GEMO and GENEPSO studies involving BRCA1 and BRCA2 mutation carriers. BMC Medical Research Methodology, BioMed Central, 2021, 21 (1), pp.155. ⟨10.1186/s12874-021-01299-6⟩. ⟨inserm-03313811⟩


