
Explainable epidemiological thematic features for event based disease surveillance

Menya Edmond, Interdonato Roberto, Owuor Dickson, Roche Mathieu. 2024. Explainable epidemiological thematic features for event based disease surveillance. Expert Systems with Applications, 250:123894, 21 p.

Journal article; Research article; Journal article with impact factor
Published version - English
Licensed under a Creative Commons license.
Menya_et_al_ESWA2024.pdf


URL - dataset - Cirad Dataverse: https://doi.org/10.18167/DVN1/WD1UC2 / URL - other associated data: https://github.com/menya-edmond/EpidBioELECTRA

HCERES journal list (humanities and social sciences, SHS): yes

HCERES journal theme(s) (humanities and social sciences, SHS): Economics and management

Abstract: Event-based disease surveillance (EBS) systems are biosurveillance systems that can detect and alert on (re-)emerging infectious diseases by monitoring acute public or animal health event patterns from sources such as blogs, online news reports and curated expert accounts. These information-rich sources, however, are largely unstructured text data requiring novel text mining techniques to achieve EBS goals such as epidemiological text classification. The main objective of this research was to improve epidemiological text classification by proposing a novel technique of enriching thematic features using a weak supervision approach. In our approach, we train and test a mixed-domain language model named EpidBioELECTRA to first enrich thematic features, which are then used to improve epidemiological text classification. We train EpidBioELECTRA on a large dataset that we create, consisting of 70,700 annotated documents that include 70,400 labeled thematic features. We empirically compare EpidBioELECTRA with both general-purpose language models and domain-specific language models in the task of epidemiological corpus classification. Our findings show that epidemiological classification systems work best with language models pre-trained on both epidemiological and biomedical corpora with a continual pre-training strategy. EpidBioELECTRA improves epidemiological document classification by 19.2 score points compared with its vanilla implementation, BioELECTRA. We observe this by comparing BioELECTRA with EpidBioELECTRA on our most challenging dataset, PADI-Web, where our approach records a 92.33 precision score, a 94.62 recall score and a 93.46 F1 score. We also experiment with increasing the context length of training documents in epidemiological document classification and find that this improves the classification task by 7.79 score points, as recorded by EpidBioELECTRA's performance.
We also compute Almost Stochastic Order (ASO) scores to track EpidBioELECTRA's statistical dominance. In addition, we carry out ablation studies on our proposed thematic feature enrichment approach using explainable AI techniques. We present explanations for the most critical thematic features and how they influence the epidemiological classification task. We find that biomedical features (such as mentions of names of diseases and symptoms) are the most influential, while spatio-temporal features (such as the mention of the date of a given disease outbreak) are the least influential in epidemiological document classification. Our model can easily be extended to fit other domains.
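
The three PADI-Web figures quoted in the abstract are consistent with the third being the F1 score, that is, the harmonic mean of precision and recall. The short Python sketch below checks that arithmetic. The function name `f1_score` is a hypothetical helper introduced for illustration only; it is not taken from the authors' code, and the reading of the third figure as F1 is an assumption based on the numbers themselves.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both on the same scale).

    Hypothetical helper for illustration; not part of the EpidBioELECTRA code.
    """
    if precision + recall == 0:
        # Convention: F1 is 0 when both precision and recall are 0.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract for PADI-Web.
padi_web_f1 = f1_score(92.33, 94.62)
print(round(padi_web_f1, 2))  # 93.46
```

The result matches the 93.46 reported in the abstract to two decimal places.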

Agrovoc keywords: text mining, epidemiology, epidemiological surveillance, animal health, data analysis, infectious disease, data mining, public health

Free keywords: Text Mining, Language Model, Explainability, Event-based surveillance, Disease surveillance, Epidemic intelligence, One Health

Agris classification: L73 - Animal diseases
U10 - Computer science, mathematics and statistics

Cirad strategic field: CTS 4 (2019-) - Health of plants, animals and ecosystems

European funding agencies: European Commission

Non-EU funding agencies: French Embassy in Nairobi, Direction générale de l'alimentation

European funding programme: H2020

Funded projects: (EU) MOnitoring Outbreak events for Disease surveillance in a data science context


Source: Cirad-Agritrop (https://agritrop.cirad.fr/609247/)
