
GeoNLPlify: A spatial data augmentation enhancing text classification for crisis monitoring

Decoupes Rémy, Roche Mathieu, Teisseire Maguelonne. 2024. GeoNLPlify: A spatial data augmentation enhancing text classification for crisis monitoring. Intelligent Data Analysis, 28 (2) : 507-531.

Journal article; Research article; Journal article with impact factor
Published version - English
Licensed under a Creative Commons license.
Decoupes_et_al_IDA2024.pdf


URL - other associated data: https://github.com/remydecoupes/GeoNLPlify

HCERES journal list (humanities and social sciences): yes

HCERES journal theme(s) (humanities and social sciences): Psychology-ethology-ergonomics

Abstract: Crises such as natural disasters and public health emergencies generate vast amounts of text data, making it challenging to classify the information into relevant categories. Acquiring expert-labeled data for such scenarios can be difficult, which limits the training datasets available for text classification by fine-tuning BERT-like models. Unfortunately, traditional data augmentation techniques only slightly improve F1-scores. How can data augmentation be used to obtain better results in this applied domain? In this paper, using neural network explainability methods, we aim to highlight that BERT-like models fine-tuned on crisis corpora give too much importance to spatial information when making their predictions. This overfitting to spatial information limits their ability to generalize, especially when the event occurring in a place has evolved since the training dataset was built. To reduce this bias, we propose GeoNLPlify, a novel data augmentation technique that leverages spatial information to generate new labeled data for crisis-related text classification. Our approach aims to address overfitting without requiring modifications to the underlying model architecture, which distinguishes it from other prevalent methods for combating overfitting. Our results show that GeoNLPlify significantly improves F1-scores, demonstrating the potential of spatial information for data augmentation in crisis-related text classification tasks. To evaluate the contribution of our method, GeoNLPlify is applied to three public datasets (PADI-web, CrisisNLP and SST2) and compared with classical natural language processing data augmentations.
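
The idea summarized in the abstract, generating new labeled examples by varying the spatial information in a text, can be illustrated with a minimal sketch. This is not the authors' implementation (the GitHub repository listed in this record is the reference for that). The toy gazetteer, the function name `augment_spatial`, and the simple place-name substitution strategy are all assumptions made purely for illustration.

```python
import random
import re

# Hypothetical toy gazetteer (an assumption for this sketch); a real pipeline
# would more likely rely on a named-entity recognizer or a geocoder.
GAZETTEER = ["France", "Vietnam", "Brazil", "Kenya"]

# Match longer names first so overlapping entries behave predictably.
_PLACE_RE = re.compile(
    r"\b("
    + "|".join(sorted(map(re.escape, GAZETTEER), key=len, reverse=True))
    + r")\b"
)


def augment_spatial(text, label, n_variants=2, seed=0):
    """Return (text, label) pairs in which each known place name is swapped
    for a different one. The label is kept unchanged, on the assumption that
    the class does not depend on where the event happens."""
    rng = random.Random(seed)

    def swap(match):
        # Pick any gazetteer entry other than the place that was matched.
        candidates = [p for p in GAZETTEER if p != match.group(0)]
        return rng.choice(candidates)

    variants = []
    for _ in range(n_variants):
        new_text = _PLACE_RE.sub(swap, text)
        # Texts without any known place name yield no augmented example.
        if new_text != text:
            variants.append((new_text, label))
    return variants
```

For instance, calling `augment_spatial("Avian influenza outbreak reported in France", "outbreak")` returns copies of the sentence with `France` replaced by another gazetteer entry, each still labeled `outbreak`. The variants may repeat, since each replacement is drawn independently.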

Agrovoc keywords: spatial data, data analysis, modelling, natural disasters, public health, text mining

Free keywords: Natural Language Processing, Language Model, Explainability, Data augmentation, Crisis

Agris classification: C30 - Documentation and information
U10 - Computer science, mathematics and statistics
L73 - Animal diseases

Cirad strategic field: CTS 7 (2019-) - Outside strategic fields

European funding agencies: European Commission

Funded projects: (EU) MOnitoring Outbreak events for Disease surveillance in a data science context

Authors and affiliations

  • Decoupes Rémy, INRAE (FRA) - corresponding author
  • Roche Mathieu, CIRAD-ES-UMR TETIS (FRA) ORCID: 0000-0003-3272-8568
  • Teisseire Maguelonne, INRAE (FRA)

Source: Cirad-Agritrop (https://agritrop.cirad.fr/609252/)

