Enriching Epidemiological Thematic Features For Disease Surveillance Corpora Classification

Menya Edmond, Roche Mathieu, Interdonato Roberto, Owuor Dickson. 2022. Enriching Epidemiological Thematic Features For Disease Surveillance Corpora Classification. In: Proceedings of the 13th Language Resources and Evaluation Conference. Calzolari N. (ed.), Béchet F. (ed.), Blache P. (ed.), Choukri K. (ed.), Cieri C. (ed.), Declerck T. (ed.), Goggi S. (ed.), Isahara H. (ed.), Maegaard B. (ed.), Mariani J. (ed.), Mazo H. (ed.), Odijk J. (ed.), Piperidis S. (ed.). ELRA. Marseille: European Language Resources Association, 3741-3750. Language Resources and Evaluation Conference (LREC 2022). 13, Marseille, France, 20-25 June 2022.

Conference paper with proceedings
Published version - English
Licensed under a Creative Commons license.
Menya_et_al_LREC2022.pdf (603 kB)

Published version - English
Licensed under a Creative Commons license.
704.pdf (376 kB)

Dataset URL (Cirad Dataverse): https://doi.org/10.18167/DVN1/MSLEFC

Supplementary material: 1 poster

Abstract: We present EpidBioBERT, a biosurveillance epidemiological document tagger for disease surveillance over the PADI-Web system. Our model is trained on the PADI-Web corpus, which contains news articles on animal disease outbreaks extracted from the web. We train a classifier to discriminate between relevant and irrelevant documents based on their epidemiological thematic feature content, in preparation for further epidemiological information extraction. Our approach proposes a new way to perform epidemiological document classification by enriching epidemiological thematic features, namely disease, host, location and date, which are used as inputs to our epidemiological document classifier. We adopt a pre-trained biomedical language model with a novel fine-tuning approach that enriches these epidemiological thematic features. We find these thematic features rich enough to improve epidemiological document classification over a smaller data set than initially used in the PADI-Web classifier. This improves the classifier's ability to avoid false positive alerts in disease surveillance systems. To further understand the information encoded in EpidBioBERT, we examine the impact of each epidemiological thematic feature on the classifier through ablation studies. We compare our biomedical pre-trained approach with a model based on a general language model, finding that thematic feature embeddings pre-trained on general English documents are not rich enough for the epidemiology classification task. Our model achieves an F1-score of 95.5% on an unseen test set, an improvement of +5.5 points in F1-score over the PADI-Web classifier, with nearly half the training data set.
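
For readers unfamiliar with the metric reported above, the following is a minimal sketch of how an F1-score can be computed for a relevant/irrelevant document classifier. The labels are hypothetical and are not taken from the PADI-Web corpus. Scoring the "relevant" class as the positive class is an assumption, since the abstract does not state which averaging scheme the authors used.

```python
def f1_score(y_true, y_pred, positive="relevant"):
    """Return (precision, recall, F1) for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    # True positives: relevant documents correctly tagged as relevant.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    # False positives: irrelevant documents tagged as relevant (false alerts).
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # False negatives: relevant documents that were missed.
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical gold labels and predictions, for illustration only.
y_true = ["relevant", "relevant", "irrelevant", "relevant", "irrelevant", "irrelevant"]
y_pred = ["relevant", "irrelevant", "irrelevant", "relevant", "relevant", "irrelevant"]

precision, recall, f1 = f1_score(y_true, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Because F1 is the harmonic mean of precision and recall, reducing false positive alerts raises precision and therefore F1, provided recall does not drop.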

Free keywords: Natural Language Processing, Text Mining, Epidemic intelligence, PADI-Web, BERT

Authors and affiliations

Source: Cirad-Agritrop (https://agritrop.cirad.fr/601462/)
