Decoupes Rémy, Roche Mathieu, Teisseire Maguelonne. 2024. GeoNLPlify: A spatial data augmentation enhancing text classification for crisis monitoring. Intelligent Data Analysis, 28 (2) : 507-531.
Published version
- English
Decoupes_et_al_IDA2024.pdf (5MB)
URL - other associated data: https://github.com/remydecoupes/GeoNLPlify
HCERES journal list (in humanities and social sciences): yes
HCERES journal theme(s) (in humanities and social sciences): Psychology-ethology-ergonomics
Abstract: Crises such as natural disasters and public health emergencies generate vast amounts of text data, making it challenging to classify the information into relevant categories. Acquiring expert-labeled data for such scenarios can be difficult, leading to limited training datasets for text classification by fine-tuning BERT-like models. Unfortunately, traditional data augmentation techniques only slightly improve F1-scores. How can data augmentation be used to obtain better results in this applied domain? In this paper, using neural network explainability methods, we aim to highlight that BERT-like models fine-tuned on crisis corpora give too much importance to spatial information when making their predictions. This overfitting to spatial information limits their ability to generalize, especially when the event occurring in a place has evolved since the training dataset was built. To reduce this bias, we propose GeoNLPlify, a novel data augmentation technique that leverages spatial information to generate new labeled data for text classification related to crises. Our approach aims to address overfitting without necessitating modifications to the underlying model architecture, distinguishing it from other prevalent methods employed to combat overfitting. Our results show that GeoNLPlify significantly improves F1-scores, demonstrating the potential of spatial information for data augmentation in crisis-related text classification tasks. To evaluate the contribution of our method, GeoNLPlify is applied to three public datasets (PADI-web, CrisisNLP and SST2) and compared with classical natural language processing data augmentations.
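Illustrative note: the abstract does not detail how the new labeled data is generated, so the following is only a minimal sketch of the general idea of spatial data augmentation, not the GeoNLPlify implementation. It assumes that place names can be swapped for comparable places without changing the class label. The gazetteer `EQUIVALENT_PLACES`, the function `augment_by_place_substitution` and the sample sentence are all hypothetical.

```python
import re

# Hypothetical toy gazetteer: each known place maps to same-level alternatives.
# A real system would detect place names with a named-entity recognizer
# and look them up in a geographic database.
EQUIVALENT_PLACES = {
    "Paris": ["Lyon", "Marseille"],
    "France": ["Spain", "Italy"],
}


def augment_by_place_substitution(text, label):
    """Return new (text, label) pairs with each known place name swapped.

    The label is copied unchanged: the assumption is that the class of a
    crisis report does not depend on where the event happened.
    """
    augmented = []
    for place, alternatives in EQUIVALENT_PLACES.items():
        # Match the place name as a whole word only.
        pattern = re.compile(r"\b" + re.escape(place) + r"\b")
        if not pattern.search(text):
            continue
        for alternative in alternatives:
            augmented.append((pattern.sub(alternative, text), label))
    return augmented


examples = augment_by_place_substitution(
    "Avian influenza outbreak reported near Paris", "outbreak"
)
for new_text, new_label in examples:
    print(new_label, "|", new_text)
```

Each generated sentence keeps the original label while varying the location, which is one plausible way to discourage a classifier from tying a class to a specific place.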
Agrovoc keywords: spatial data, data analysis, modelling, natural disasters, public health, text mining
Free keywords: Natural Language Processing, Language Model, Explainability, Data augmentation, Crisis
Agris classification: C30 - Documentation and information
U10 - Computer science, mathematics and statistics
L73 - Animal diseases
Cirad strategic field: CTS 7 (2019-) - Outside strategic fields
European funding agencies: European Commission
Funded projects: (EU) MOnitoring Outbreak events for Disease surveillance in a data science context
Authors and affiliations
- Decoupes Rémy, INRAE (FRA) - corresponding author
- Roche Mathieu, CIRAD-ES-UMR TETIS (FRA) ORCID: 0000-0003-3272-8568
- Teisseire Maguelonne, INRAE (FRA)
Source: Cirad-Agritrop (https://agritrop.cirad.fr/609252/)