EpidGPT: A combined strategy to discriminate between redundant and new information for epidemiological surveillance systems

Menya Edmond, Roche Mathieu, Interdonato Roberto, Owuor Dickson. 2024. EpidGPT: A combined strategy to discriminate between redundant and new information for epidemiological surveillance systems. In: Natural language processing and information systems: 29th International Conference on Applications of Natural Language to Information Systems, NLDB 2024, Turin, Italy, June 25–27, 2024, Proceedings, Part I. Rapp Amon (ed.), Di Caro Luigi (ed.), Meziane Farid (ed.), Sugumaran Vijayan (ed.). Cham: Springer, 439-454. (Lecture Notes in Computer Science, 14762) ISBN 978-3-031-70238-9. Natural Language Processing and Information Systems (NLDB 2024), Turin, Italy, 25-27 June 2024.

https://doi.org/10.1007/978-3-031-70239-6_30

Conference paper with proceedings

Published version - English
Access restricted to Cirad staff
Use subject to authorization by the author or Cirad.
Menya_et_al_EpidGPT_2024.pdf

Dataset URL - Cirad Dataverse: https://doi.org/10.18167/DVN1/WD1UC2

Abstract: Textual documents such as online news articles have become a key source for epidemiological surveillance, for example in detecting new and re-emerging diseases. However, such sources contain redundant information, which creates a need to automate the identification of novel information. In this paper, we propose a framework for learning novel thematic information in epidemiological news documents. Our approach involves both the extraction and the classification of new, duplicate, additional and/or missing pieces of relevant information in epidemiological news documents. First, we propose an initial step to address the limited-data problem, where few gold-labelled datasets exist for training text-based epidemiological surveillance systems. This initial step uses an extractive question answering technique to automate the extraction of relevant thematic features, including disease and host names, the location and date of reported events, and the reported number of cases, in order to create a large silver-labelled dataset. We then propose a main step in which we build a novelty classification model trained on this large silver-labelled dataset. We test our novelty classifier alongside competitive models on the task of detecting whether a target epidemiological news article contains novel, redundant and/or missing information. Finally, we carry out ablation studies on the most informative document segments in epidemiological news articles.
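To make the four categories named in the abstract concrete, the sketch below compares the thematic features of a target article against those of a reference article. It is only an illustration under stated assumptions, not the trained classifier described in the paper: the rule-based comparison, the function name `compare_features`, the feature keys, the reading of "additional" and "missing" relative to the reference article, and the example values are all hypothetical.

```python
from typing import Dict, Optional

# Hypothetical feature keys, mirroring the thematic features listed in the abstract.
FEATURES = ("disease", "host", "location", "date", "cases")


def _normalise(value: Optional[str]) -> Optional[str]:
    """Lower-case and collapse whitespace; treat empty strings as absent."""
    if value is None:
        return None
    cleaned = " ".join(value.lower().split())
    return cleaned or None


def compare_features(
    reference: Dict[str, Optional[str]],
    target: Dict[str, Optional[str]],
) -> Dict[str, str]:
    """Label each feature of `target` relative to `reference`.

    Assumed labelling rules (not taken from the paper):
      duplicate  - both articles give the same value
      new        - both give a value, but the values differ
      additional - only the target gives a value
      missing    - only the reference gives a value
      absent     - neither article gives a value
    """
    labels = {}
    for name in FEATURES:
        ref_value = _normalise(reference.get(name))
        tgt_value = _normalise(target.get(name))
        if ref_value is None and tgt_value is None:
            labels[name] = "absent"
        elif tgt_value is None:
            labels[name] = "missing"
        elif ref_value is None:
            labels[name] = "additional"
        elif ref_value == tgt_value:
            labels[name] = "duplicate"
        else:
            labels[name] = "new"
    return labels


# Made-up example values, for illustration only.
reference_article = {"disease": "Avian influenza", "location": "Kenya",
                     "date": "2024-01-10", "cases": "120"}
target_article = {"disease": "avian  influenza", "host": "poultry",
                  "date": "2024-01-12", "cases": None}

print(compare_features(reference_article, target_article))
```

In the pipeline the abstract describes, the feature values would come from the extractive question answering step and the labels from a trained model; exact string matching stands in for both here.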

Free keywords: Text Mining, Language Model, Animal disease surveillance

European funding agencies: European Commission

Non-EU funding agencies: Ambassade de France à Nairobi, Direction générale de l'alimentation

European funding programme: H2020

Funded projects: (EU) MOnitoring Outbreak events for Disease surveillance in a data science context

Authors and affiliations