Menya Edmond, Roche Mathieu, Interdonato Roberto, Owuor Dickson.
2022. Enriching Epidemiological Thematic Features For Disease Surveillance Corpora Classification.
In: Proceedings of the 13th Language Resources and Evaluation Conference. Calzolari N. (ed.), Béchet F. (ed.), Blache P. (ed.), Choukri K. (ed.), Cieri C. (ed.), Declerck T. (ed.), Goggi S. (ed.), Isahara H. (ed.), Maegaard B. (ed.), Mariani J. (ed.), Mazo H. (ed.), Odijk J. (ed.), Piperidis S. (ed.). ELRA
Published version (English): Menya_et_al_LREC2022.pdf (603kB)
Published version (English): 704.pdf (376kB)
Dataset URL (Cirad Dataverse): https://doi.org/10.18167/DVN1/MSLEFC
Supplementary material: 1 poster
Abstract: We present EpidBioBERT, a biosurveillance epidemiological document tagger for disease surveillance over the PADI-Web system. Our model is trained on the PADI-Web corpus, which contains news articles on animal disease outbreaks extracted from the web. We train a classifier to discriminate between relevant and irrelevant documents based on their epidemiological thematic feature content, in preparation for further epidemiological information extraction. Our approach proposes a new way to perform epidemiological document classification by enriching epidemiological thematic features, namely disease, host, location and date, which are used as inputs to our epidemiological document classifier. We adopt a pre-trained biomedical language model with a novel fine-tuning approach that enriches these epidemiological thematic features. We find these thematic features rich enough to improve epidemiological document classification over a smaller data set than initially used in the PADI-Web classifier. This improves the classifier's ability to avoid false positive alerts in disease surveillance systems. To further understand the information encoded in EpidBioBERT, we examine the impact of each epidemiological thematic feature on the classifier through ablation studies. We compare our biomedical pre-trained approach with a model based on a general language model, finding that thematic feature embeddings pre-trained on general English documents are not rich enough for the epidemiological classification task. Our model achieves an F1-score of 95.5% on an unseen test set, an improvement of 5.5 points in F1-score over the PADI-Web classifier, with nearly half the training data set.
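The abstract reports results as an F1-score and links the gain to fewer false positive alerts. As a reading aid only, the sketch below shows how F1 for the "relevant" class is computed from confusion counts and how it moves when false positives drop. The function name and all counts are hypothetical illustrations; they are not taken from the paper or from the PADI-Web data.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 for the positive ('relevant') class from confusion counts.

    tp: relevant documents correctly flagged
    fp: irrelevant documents wrongly flagged (false alerts)
    fn: relevant documents that were missed
    """
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts (NOT from the paper): same recall, fewer false alerts.
baseline = f1_score(tp=80, fp=20, fn=20)
improved = f1_score(tp=80, fp=5, fn=20)

# A difference between two F1 percentages is expressed in points.
gain_points = (improved - baseline) * 100

print(f"baseline F1: {baseline:.3f}")
print(f"improved F1: {improved:.3f}")
print(f"gain: {gain_points:.1f} points")
```

With recall held fixed, cutting false positives raises precision and therefore F1, which is the sense in which a point gain in F1 can reflect fewer false alerts.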
Free keywords: Natural Language Processing, Text Mining, Epidemic intelligence, PADI-Web, BERT
Authors and affiliations
- Menya Edmond, Strathmore University (KEN)
- Roche Mathieu, CIRAD-ES-UMR TETIS (FRA) ORCID: 0000-0003-3272-8568
- Interdonato Roberto, CIRAD-ES-UMR TETIS (FRA) ORCID: 0000-0002-0536-6277
- Owuor Dickson, Strathmore University (KEN)
Source: Cirad-Agritrop (https://agritrop.cirad.fr/601462/)