Evaluating the impact of text duplications on a corpus of more than 600,000 clinical narratives in a French Hospital

Abstract : A significant part of medical knowledge is stored as unstructured free text. However, clinical narratives are known to contain duplicated sections because clinicians copy and paste parts of a former report into a new one. In this study, we aim to evaluate the duplications found within patient records in 650,000 French clinical narratives. We adapted a method to identify duplicated zones efficiently, within a reasonable time. We evaluated the potential impact of duplications in two use cases: the presence of (i) treatments and/or (ii) relative dates. We identified an average duplication rate of 33%. We found that 20% of the documents contained drugs mentioned only in duplicated zones and that 1.48% of the documents contained mentions of relative dates in duplicated zones, which could potentially lead to erroneous interpretation. We suggest the systematic identification and annotation of duplicated zones in clinical narratives for information extraction and temporal-oriented tasks.

https://hal.archives-ouvertes.fr/hal-02265124
Contributor : William Digan
Submitted on : Thursday, August 8, 2019 - 2:58:58 PM
Last modification on : Sunday, August 11, 2019 - 1:18:50 AM

File

WilliamDiganMedinfo2919submiss...
Files produced by the author(s)

Identifiers

  • HAL Id : hal-02265124, version 1

Citation

William Digan, Maxime Wack, Vincent Looten, Antoine Neuraz, Anita Burgun, et al. Evaluating the impact of text duplications on a corpus of more than 600,000 clinical narratives in a French Hospital. MEDINFO 2019, Aug 2019, Lyon, France. ⟨hal-02265124⟩
