Menya Edmond, Interdonato Roberto, Owuor Dickson, Roche Mathieu. 2024. Explainable epidemiological thematic features for event based disease surveillance. Expert Systems with Applications, 250:123894, 21 p.
Published version
- English
Menya_et_al_ESWA2024.pdf (2MB)
Dataset URL (Cirad Dataverse): https://doi.org/10.18167/DVN1/WD1UC2 / Other associated data URL: https://github.com/menya-edmond/EpidBioELECTRA
HCERES journal list (social sciences and humanities): yes
HCERES journal theme(s) (social sciences and humanities): Economics-management
Abstract: Event-based disease surveillance (EBS) systems are biosurveillance systems that detect and alert on (re)-emerging infectious diseases by monitoring acute public or animal health event patterns from sources such as blogs, online news reports and curated expert accounts. These information-rich sources, however, are largely unstructured text data, requiring novel text mining techniques to achieve EBS goals such as epidemiological text classification. The main objective of this research was to improve epidemiological text classification by proposing a novel technique for enriching thematic features using a weak supervision approach. In our approach, we train and test a mixed-domain language model named EpidBioELECTRA to first enrich thematic features, which are then used to improve epidemiological text classification. We train EpidBioELECTRA on a large dataset that we created, consisting of 70,700 annotated documents that include 70,400 labeled thematic features. We empirically compare EpidBioELECTRA with both general-purpose and domain-specific language models on the task of epidemiological corpus classification. Our findings show that epidemiological classification systems work best with language models pre-trained on both epidemiological and biomedical corpora with a continual pre-training strategy. EpidBioELECTRA improves epidemiological document classification by 19.2 F1 score points compared to its vanilla implementation, BioELECTRA. We observe this by comparing BioELECTRA versus EpidBioELECTRA on our most challenging dataset, PADI-Web, where our approach records a precision of 92.33, a recall of 94.62 and an F1 score of 93.46. We also experiment with increasing the context length of training documents in epidemiological document classification and find that this improves the classification task by 7.79 F1 score points, as recorded by EpidBioELECTRA's performance.
We also compute Almost Stochastic Order (ASO) scores to track EpidBioELECTRA's statistical dominance. In addition, we carry out ablation studies on our proposed thematic feature enrichment approach using explainable AI techniques. We present explanations for the most critical thematic features and how they influence the epidemiological classification task. We find that biomedical features (such as mentions of disease names and symptoms) are the most influential, while spatio-temporal features (such as the mention of the date of a given disease outbreak) are the least influential in epidemiological document classification. Our model can easily be extended to fit other domains.
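As a quick consistency check on the PADI-Web figures above, the third value (93.46) matches the harmonic mean of the reported precision and recall, which is the standard F1 definition. The sketch below assumes that definition; the function name `f1_score` is illustrative and is not taken from the authors' code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall reported in the abstract for PADI-Web (percent scale).
padi_web_f1 = f1_score(92.33, 94.62)
print(round(padi_web_f1, 2))  # 93.46
```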
Agrovoc keywords: text mining, epidemiology, disease surveillance, animal health, data analysis, infectious diseases, data mining, public health
Free keywords: Text Mining, Language Model, Explainability, Event-based surveillance, Disease surveillance, Epidemic intelligence, One Health
Agris classification: L73 - Animal diseases
U10 - Computer science, mathematics and statistics
Cirad strategic field: CTS 4 (2019-) - Health of plants, animals and ecosystems
European funding agencies: European Commission
Non-EU funding agencies: Ambassade de France à Nairobi, Direction générale de l'alimentation
European funding programme: H2020
Funded projects: (EU) MOnitoring Outbreak events for Disease surveillance in a data science context
Authors and affiliations
- Menya Edmond, CIRAD-ES-UMR TETIS (FRA) - corresponding author
- Interdonato Roberto, CIRAD-ES-UMR TETIS (FRA) ORCID: 0000-0002-0536-6277
- Owuor Dickson, Strathmore University (KEN)
- Roche Mathieu, CIRAD-ES-UMR TETIS (FRA) ORCID: 0000-0003-3272-8568
Source : Cirad-Agritrop (https://agritrop.cirad.fr/609247/)