Monitoring online media reports for early detection of unknown diseases: Insight from a retrospective study of COVID‐19 emergence

Abstract Event‐based surveillance (EBS) systems monitor a broad range of information sources to detect early signals of disease emergence, including new and unknown diseases. In December 2019, a newly identified coronavirus emerged in Wuhan (China), causing a global coronavirus disease (COVID‐19) pandemic. A retrospective study was conducted to evaluate the capacity of three event‐based surveillance (EBS) systems (ProMED, HealthMap and PADI‐web) to detect early COVID‐19 emergence signals. We focused on changes in online news vocabulary over the period before/after the identification of COVID‐19, while also assessing its contagiousness and pandemic potential. ProMED was the timeliest EBS, detecting signals one day before the official notification. At this early stage, the specific vocabulary used was related to ‘pneumonia symptoms’ and ‘mystery illness’. Once COVID‐19 was identified, the vocabulary changed to virus family and specific COVID‐19 acronyms. Our results suggest that the three EBS systems are complementary regarding data sources, and all require timeliness improvements. EBS methods should be adapted to the different stages of disease emergence to enhance early detection of future unknown disease outbreaks.

large network of experts worldwide who produce and share verified reports on disease outbreaks in a common platform (Carrion & Madoff, 2017). HealthMap is a semi-automated system launched by the Boston Children's Hospital in 2006. The tool monitors both official and non-official web news sources. Both HealthMap and ProMED monitor a list of human and animal diseases and syndromes thereof. The Platform for Automated extraction of animal Disease Information from the web (PADI-web) was created in 2016 to monitor online animal health-related news for the French Epidemic Intelligence System (FEIS) (Arsevska et al., 2018). Both HealthMap and PADI-web automatically retrieve health-related news from Google News using customized Really Simple Syndication (RSS) feeds. For news detection, the two systems mine terms for known diseases, as well as for clinical signs and syndromes (Arsevska et al., 2016). All three EBS systems monitor news in multiple languages, including Chinese.
On 31 December 2019, local health officials of the Chinese city of Wuhan reported a cluster of 27 cases of 'pneumonia of unknown cause'. These cases were linked to a wholesale live animal and seafood market in the city. The first death was reported in January 2020. The causative agent was identified as a new coronavirus, SARS-CoV-2, and the disease was named COVID-19. The first epidemiological study on patients with laboratory-confirmed COVID-19 infection reported the onset of illness as early as 1 December 2019 (Huang et al., 2020).
This retrospective study aimed first to evaluate three EBS systems (ProMED, HealthMap and PADI-web) and their capacity for timely detection of the COVID-19 emergence in China. Secondly, we focused on PADI-web to understand how an animal health EBS system contributed to the detection of a human EID. We analysed the RSS feeds from PADI-web that detected COVID-19-related news articles (hereafter referred to as 'news'). Thirdly, we assessed the vocabulary in the news detected by PADI-web and its change in relation to identification of the pathogen and the EID spread.

| COVID-19-related news detection
News from 1 to 31 December 2019 was mined to assess the timeliness of the three EBS. We compared the first news regarding the publication date, language and source.
To gain insight into how PADI-web detected the COVID-19 emergence, we further filtered a second corpus of news published from 31 December 2019 to 26 January 2020 containing at least one of the following words in the title and body of the news: 'pneumonia', 'respiratory illness', 'coronavirus', 'nCoV' (an early name for COVID-19), and 'Wuhan'. After manual verification of their relevance, we retained 275 out of 333 news items for analysis.
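The keyword and date filter described above can be sketched as follows. This is a minimal illustration and not the PADI-web implementation: the item layout (`title`, `body`, `published`), the function names, and the choice of case-insensitive substring matching on title or body are all assumptions made for the example.

```python
from datetime import date

# Keywords listed in the text, lowercased for case-insensitive matching (assumption).
KEYWORDS = ("pneumonia", "respiratory illness", "coronavirus", "ncov", "wuhan")

# Study window for the second corpus, as stated in the text.
START, END = date(2019, 12, 31), date(2020, 1, 26)


def matches(item):
    """Return True if the item is inside the window and mentions a keyword.

    `item` is a hypothetical dict with 'title', 'body' and 'published' keys.
    """
    if not (START <= item["published"] <= END):
        return False
    text = f'{item["title"]} {item["body"]}'.lower()
    return any(keyword in text for keyword in KEYWORDS)


def filter_news(items):
    """Keep only the candidate news items; manual relevance checks come afterwards."""
    return [item for item in items if matches(item)]
```

A filter of this kind only yields candidates (333 items in the study); the manual verification step that reduced them to 275 is not automated here.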
We assessed the link between the detected news items and the animal health RSS feeds from PADI-web that served to retrieve those news items. To this end, we read each news item and categorized it into (i) disease-specific RSS feeds (containing specific disease names) and (ii) syndromic RSS feeds (containing combinations of symptoms and animal hosts).

| News vocabulary
We analysed the vocabulary change spanning the period from the initial discovery of the COVID-19 outbreak to its spread outside China by extracting terms from the whole corpus. A word frequency-based method was first implemented to highlight important keywords according to periods (Figure 1). Secondly, we used a ranking function based on the frequency and discriminance of terms (i.e. words and multi-word terms) extracted with BioTex, a text-mining tool tailored for biomedical terminology (Lossio-Ventura, Jonquet, Roche, & Teisseire, 2016). BioTex is based on the use of (i) a relevant combination of information retrieval techniques and statistical methods and (ii) a list of syntactic structures of terms that have been learnt via relevant sources (e.g. UMLS, MeSH). BioTex-extracted terms can be lowercase words (e.g. influenza), or phrases (e.g. avian influenza).
FIGURE 1 Wordclouds generated from COVID-19-related news articles during three consecutive periods: (a) 31 December 2019 to 08 January 2020, (b) 09 to 19 January 2020, (c) 20 to 26 January 2020
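The word frequency step can be illustrated with a short sketch that counts words per period, using the three periods given for Figure 1. The tokenizer, the small stopword list, and the item layout (`body`, `published`) are assumptions for the example; it does not reproduce BioTex or its ranking function.

```python
import re
from collections import Counter
from datetime import date

# Three consecutive periods, as given in the Figure 1 caption.
PERIODS = {
    "a": (date(2019, 12, 31), date(2020, 1, 8)),
    "b": (date(2020, 1, 9), date(2020, 1, 19)),
    "c": (date(2020, 1, 20), date(2020, 1, 26)),
}

# Tiny illustrative stopword list (assumption, not the one used in the study).
STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "was"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def frequencies_by_period(items):
    """Return one Counter of word frequencies per period label."""
    counts = {label: Counter() for label in PERIODS}
    for item in items:
        for label, (start, end) in PERIODS.items():
            if start <= item["published"] <= end:
                counts[label].update(tokenize(item["body"]))
                break  # periods do not overlap, so one match is enough
    return counts
```

Calling `counts["a"].most_common(20)` would then give the most frequent words of the first period, which is the kind of input a wordcloud is typically drawn from.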
We further identified terms referring to COVID-19, such as 'new virus' and 'mystery pneumonia'. We manually categorized the terms as 'mystery' (referring to the unknown threat), 'pneumonia' (referring to the clinical signs), 'coronavirus' (referring to the virus taxonomy) and 'technical' (technical acronyms specifically pertaining to the virus). One news item could contain terms from different categories.
We calculated the daily proportion of each category, expressed as the sum of occurrences of the category divided by the total number of occurrences.
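As a worked illustration of this proportion, the sketch below divides each category's occurrences on a given day by all term occurrences on that day. Taking the denominator to be the same-day total across the four categories is an assumption, as are the input layout (one `(day, category)` pair per occurrence) and the function name.

```python
from collections import Counter, defaultdict

# The four manually assigned categories named in the text.
CATEGORIES = ("mystery", "pneumonia", "coronavirus", "technical")


def daily_proportions(occurrences):
    """Compute the share of each category per day.

    `occurrences` is an iterable of (day, category) pairs, one pair per
    occurrence of a COVID-19 term in the news published that day.
    """
    per_day = defaultdict(Counter)
    for day, category in occurrences:
        per_day[day][category] += 1

    proportions = {}
    for day, counts in per_day.items():
        total = sum(counts.values())  # all term occurrences on that day
        proportions[day] = {c: counts[c] / total for c in CATEGORIES}
    return proportions
```

By construction, the four proportions for any day with at least one occurrence sum to 1, so the daily series can be compared across days with different news volumes.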

| News detection
The Program for Monitoring Emerging Diseases (ProMED) was the first EBS system to detect and report a news item from a Chinese online source. The ProMED report dated back to 30 December 2019, a day before the first official notification of pneumonia-like cases in Wuhan (Wuhan Municipal Health Commission, 2020). PADI-web and HealthMap respectively detected three and one COVID-19-related news items on 31 December 2019, the same day as the first official notification of pneumonia-like cases in Wuhan (one HealthMap news item from an English source, three PADI-web news items from two English sources and one Chinese source). The news detected by the three EBS originated from five different media outlets.
Among the three EBS systems compared, only ProMED relies on local expert information to alert on health threats. This result suggests that the network of local field experts is crucial for the detection of EID events and their reporting. Otherwise, HealthMap and PADI-web detected news on the same day as the official reporting.
It is therefore essential to understand their current limitations and promote the key role of experts in EBS systems. Further studies should also focus on assessing whether the timeliness of automated systems depends on the communication strategies of online media, as well as on determining their health event reporting threshold, and how these features impact the sensitivity of EBS systems.
The three EBS systems included in this study monitor media in multiple languages, thus facilitating detection of local media news.
A further increase in the number of available languages should enhance the sensitivity of EBS systems (Barboza et al., 2014). Our study also showed that the three EBS systems were complementary regarding scope (animal and public health), moderation (manual, semi-automated, automated) and number of covered languages.
PADI-web could retrieve COVID-19-related news through animal health-related RSS feeds, thus proving its usefulness for the detection of information of relevance for public health risk assessors.
From 275 COVID-19-related news items retrieved by PADI-web, 54.5% (n = 150) were retrieved via syndromic RSS feeds, while the remaining 45.5% (n = 125) were retrieved via disease-specific RSS feeds (Table 1). The fact that disease-specific RSS feeds contributed as much as syndromic RSS feeds to the detection of COVID-19 news by PADI-web was unexpected, thus highlighting the importance of combining [...] (which is not a zoonotic disease), thus explaining why they were detected by PADI-web.

The ability of EBS tools to encompass a broad scope of health-related topics through a limited number of queries (RSS feeds) is a major asset compared to formal sources. This capacity largely depends on the intrinsic features of online news in which outbreak-related content is often bulked up with additional information, such as comparisons with previous disease outbreaks, thus increasing the probability of being detected by EBS tools. However, the probability of detection of an EID event might be higher for (actual or assumed) zoonotic diseases and countries with ongoing animal disease outbreaks. This is not a major shortcoming in practice.

| News vocabulary
From the terms referring to either the virus or the disease, 18 terms were in the 'pneumonia' category, eight terms in the 'mystery' category, three terms in the 'coronavirus' category (one of them, 'coronovirus', being a misspelt form of 'coronavirus'), and seven terms in the 'technical' category (Table 2).
The wordclouds generated from the overall news contents mined over three consecutive periods are shown in Figure 1.
Before identification of the virus (31 December 2019 to 8 January 2020), 58.1% (n = 317) of the COVID-19 terms were in the [...] importantly, EID event alerts should feed the risk assessment process to ensure early mitigation of EID events by the health managers and decision-makers.

ACKNOWLEDGEMENTS
The authors thank ProMED for data sharing. This work was partly funded by the H2020 'Monitoring outbreak events for disease sur- under the Investments for the Future Program, referred to as ANR-16-CONV-0004.

CONFLICTS OF INTEREST
The authors declare no conflicts of interest.

ETHICAL APPROVAL
The authors confirm compliance with the ethical policies of the journal, as noted on the journal's author guidelines page. No ethical approval was required because this study did not involve any experimental protocol on humans or animals, and only open source online data were used.

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are available for