Bactrocera dorsalis in the Indian Ocean: A tale of two invasions

Abstract An increasing number of invasive fruit fly pests are colonizing new grounds. With this study, we aimed to uncover the invasion pathways of the oriental fruit fly, Bactrocera dorsalis into the islands of the Indian Ocean. By using genome‐wide SNP data and a multipronged approach consisting of PCA, ancestry analysis, phylogenetic inference, and kinship networks, we were able to resolve two independent invasion pathways. A western invasion pathway involved the stepping‐stone migration of B. dorsalis from the east African coast into the Comoros, along Mayotte and into Madagascar with a decreasing genetic diversity. The Mascarene islands (Reunion and Mauritius), on the contrary, were colonized directly from Asia and formed a distinct cluster. The low nucleotide diversity suggests that only a few genotypes invaded the Mascarenes. The presence of many long runs of homozygosity (ROH) in the introduced populations is indicative of population bottlenecks, with evidence of a more severe bottleneck for populations along the western migration pathway than on the Mascarene islands. More strict phytosanitary regulations are recommended in order to prevent the further spread of B. dorsalis.


| INTRODUC TI ON
In response to globalization, there has been an increased concern surrounding invasive phytophagous insect species, which is motivated by the accumulating economic losses in the fruit trade (Suckling et al., 2013;Welsh et al., 2021). The oriental fruit fly, Bactrocera dorsalis (Hendel) (Diptera: Tephritidae), is a major pest species natively found in wet tropical regions of Asia (Clarke et al., 2019;Zeng et al., 2019). The host range of B. dorsalis consists of a large variety of plants in both the species' native (Tan & Serit, 1994;Zeng et al., 2019) and invasive range (Hassani et al., 2016;Moquet et al., 2021;Rwomushana et al., 2008). With increasing invasions worldwide, many efforts have been undertaken to eradicate or suppress the species locally (Seewooruthun, Soonnoo, & Alleck, 2000;Vargas et al., 2010;Zeng et al., 2019), including the Sterile Insect Technique (SIT), a method relying on the release of sterile male flies that compete with resident fertile males in order to suppress the target pest population (Dyck et al., 2021). Additionally, B. dorsalis is effectively displacing native fruit fly species as has been recently shown by Moquet et al. (2021), drawing the attention of researchers.
Up to now, the genetic population structure of B. dorsalis has been studied using microsatellite markers (Kim et al., 2021;Shi et al., 2012) (Choudhary et al., 2016;Kim et al., 2021;Shi et al., 2012). However, the weak spatial structure of B. dorsalis could also be related to the markers used so far, which may not allow highresolution characterization of population structure.
Within the last century B. dorsalis has invaded the USA in Hawaii (Vargas et al., 2010), California (Barr et al., 2014), and Florida (Alvarez et al., 2016;Steck et al., 2019;Zeng et al., 2019), was introduced to Africa and Oceania (Vargas et al., 2015), and expanded its range in China (Garzón-Orduña et al., 2019;Shi et al., 2012). While recent range expansions in India (Choudhary et al., 2016) and several other Asian countries (Zeng et al., 2019) have also been suggested, these statements likely originate from the confounding of "first detection" with "first taxonomic record" where the latter does not imply that the species was previously absent and incomplete knowledge of B. dorsalis taxonomic history (Clarke et al., 2019). Furthermore, B. dorsalis was first documented in (East) Africa in the Coast Province of Kenya, during routine field surveys by Lux et al. (2003) where it was first inaccurately described as B. invadens (Ekesi et al., 2006;Schutze et al., 2015;Zhang et al., 2016). From there, it has rapidly expanded westwards and southwards along the Indian Ocean coast and has reached South Africa in 2010 (Drew et al., 2005;Zeng et al., 2019).
During the last decade, B. dorsalis managed to be re-established in Florida in 2015 (Steck et al., 2019) and caused several outbreaks in California during (Barr et al., 2014USDA APHIS, 2021) after initial eradication. First detections of B. dorsalis in Europe occurred in Italy (Nugnes et al., 2018) and France (https:// draaf.paca.agric ulture.gouv.fr/) in recent years.
Range expansions of B. dorsalis are fairly well documented, but little research has been done on the genetic nature of migration pathways of B. dorsalis on a regional scale (Garzón-Orduña et al., 2019;Zeng et al., 2019). Especially information on humanmediated range expansions, which can be complex in nature (e.g., colonization of oceanic islands) is generally lacking and could provide new information beneficial to sanitary protocols such as improved quarantine systems and better detection and screening methods (Kim et al., 2021;Suckling et al., 2013;Welsh et al., 2021 (Seewooruthun, Permalloo, et al., 2000;Seewooruthun, Soonnoo, & Alleck, 2000). In 2013, the species was recorded again, but its re-occurrence was soon followed by a second eradication (Sookar & Deguine, 2016). In 2015, B. dorsalis was trapped for the third time in Mauritius, and no largescale eradication programs have been pursued since then (Sookar et al., 2020).
Knowledge of the invasion pathway and genetic structure of B. dorsalis in the Indian Ocean is lacking and could be beneficial for policymakers (Szyniszewska, 2016). Moreover, the invasion of the islands in the Indian Ocean by B. dorsalis could be a prime example of founder effects as a byproduct of island hopping (Mayr, 1942;Sendell-Price et al., 2021;Tinghitella et al., 2011). Colonized populations may undergo a step-wise reduction in genetic diversity relative to their sources and rapid differentiation from source populations (Sendell-Price et al., 2021;Tinghitella et al., 2011).
Given the different dates for historical recordings on several islands in the western Indian Ocean and possible discrepancies between the detection timeline and actual range expansion, we aim to explore the colonization pathway of this invasive species and its genetic relationship with the population in its native Asian range and invaded range on mainland Africa. Furthermore, we expect to pick up an erosion of genetic diversity within the invasive range, especially on oceanic islands, and a buildup of homozygous genomic regions within the invasive range compared with the genetic source. In order to test the above, we performed whole genome (re-)sequencing of B. dorsalis samples collected in the Indian Ocean, adjacent continental Africa, and Asia. A multi-pronged analytic approach was employed, including an allele frequency-based tree building algorithm and network theory on SNP markers.

| Sampling, sequence filtering, and genotyping
For this study, we used 237 B. dorsalis specimens obtained from 21 populations in Asia (seven populations, 66 specimens), Africa (three populations, 30 specimens), and several islands in the Indian Ocean (11 populations and 141 specimens): Grande Comoro, Anjouan, Mohéli (Comoros W, E, and C, respectively), Mayotte, Madagascar, and the Mascarene islands (Mauritius and Réunion) using methyl eugenol traps. More than one population was included for Mayotte, Madagascar, and Réunion (two, three, and two, respectively; Table S1). Additionally, seven specimens of one Bactrocera zonata population collected in Tel Aviv, Israel, were included to serve as an outgroup for phylogeographic tree reconstruction. Samples were collected between 2016 and 2021, with the exception of those from Burundi and Malawi, which were collected in 2011. See Table S1 for further details on coordinates, number of samples, and collection year per location. Assessment of the general quality of the raw reads was performed using "fastQC" (https://github.com/s-andre ws/FastQC).
Reads were then quality trimmed with "fastp" (Chen et al., 2018), discarding read pairs of which more than 2% of the bases failed to meet a quality threshold of 10. Trimmed reads were aligned to the B. dorsalis reference genome (GenBank assembly accession: GCA_020283865.1) using the bwa-mem command from the burrows-wheeler aligner tool (Li & Durbin, 2009). Variant calling was performed using the GATK Haplotypecaller algorithm implemented in "Elprep" (Herzeel et al., 2021). "ElPrep" is a tool designed as an in-memory and multithreaded toolset to fully take advantage of the processing power available with modern servers and consists of a 5-step variant calling best practices pipeline. After individual variant calling, gVCF files were combined using the GATK CombineVCFs command (McKenna et al., 2010). Joint variant calling was performed separately for each of the six chromosomes using GATK GenotypeGVCFs, and chromosome VCF files were merged afterward. Joint calling allows for the rescue of low-coverage genotypes in one sample when another sample provides evidence supporting the existence of the genotype. Biallelic variants were hard filtered using GATK VariantFiltration using the recommended settings (quality-bydepth ratio (QD) < 2.0 || read mapping quality (MQ) < 40.0 || probability of strand bias (FS) > 60.0 || symmetric odds ratio (SOR) > 3.0 || MQRankSum < −12.5 || ReadPosRankSum < −8.0). Filtering on QD normalizes QUAL by the number of reads supporting every SNP and is preferred above filtering quality and depth in individual steps. Samples with missingness larger than 20% were omitted from further analysis. Furthermore, SNPs with more than two alleles, a minimum depth and minor allele count lower than three and a missingness higher than 10% were omitted from further analysis using "VCFtools" (Danecek et al., 2011). Finally, to account for the effects of linkage between SNPs in the datasets, we pruned the VCF file based on a pairwise correlation threshold of 0.1 and a window size of 10,000 bases using the package "SNPRelate" (Zheng et al., 2012) implemented in R (version 4.0.3).

| Population genetic parameters
Multiple estimates of genetic diversity were calculated using the same filters as stated above, but no minor allele filtering or linkage pruning was applied in order to best approximate the real genomic diversity. We used "STACKS populations" (Catchen et al., 2013) to estimate observed heterozygosity (H O ), expected heterozygosity (H E ), nucleotide diversity (π), and the number of private alleles (A PR ) at the population level. A measure of genetic differentiation (F ST ) between population pairs was calculated using the same software. Lastly, we attributed populations to a higher level grouping (i.e., Asia, Africa, the Comoros islands, Mayotte, Madagascar, and the Mascarene islands) based on admixture results and geographical borders and repeated the estimation for all genetic parameters at this level.

| Runs of homozygosity
To infer the demographic history and more specifically, historical bottlenecks, we estimated the total number (NROH) and total length (SROH) of long runs of homozygosity (ROH) using "bcftools" (Danecek et al., 2021). Different demographic scenarios cause different ROH signatures (Ceballos et al., 2018), hence, ROH analyses are increasingly being used in population genomic studies (e.g., in the range-restricted Pyrenean Desman, Escoda & Castresana, 2021; Swedish wels catfish, Jensen et al., 2021 andin killer whales, Foote et al., 2021). Since ROH are sensitive to minor allele frequency (MAF) filtering and linkage pruning, such filters were not applied before calculating the total number of ROH and the total summed length of ROH (Meyermans et al., 2020). To test for significant differences between the six geographical groups, a parametric linear model was carried out using SROH as the predictor variable. A p-value was calculated for every pairwise comparison between geographical groups, and a Tukey's range test was applied to correct for multiple testing using the package "mixlm" (https://github.com/khlil and/mixlm) implemented in R.

| Admixture analysis and PCA
To infer admixture between samples across the whole dataset, ancestry coefficients for K number of ancestral populations were calculated using the sNMF function (Frichot et al., 2014) of the "LEA" package (Frichot & François, 2015) implemented in R. K varied from 2 to 10 while performing 10 replicate runs for each K. For each value of K, the best run was selected based on the cross-entropy criterion.
Ancestral proportions (Q values) were then visualized for K = 2 up to the value of K with the lowest cross-entropy level + 1 to ensure that further subdivision is not meaningful using the R package "ggplot2" (Wickham, 2017).
In order to investigate and visualize the partitioning of genetic variation, an analysis of principal components (PCA) was performed using the snpgdsPCA function of the R package "SNPRelate" (Zheng et al., 2012). To test for significant differences between geographical groups, a permutational ANOVA (permanova) was carried out using the first seven eigenvectors as predictor variables and employing 10,000 permutations using the pairwise.adonis2 function of the "pairwiseAdonis" package (https://github.com/pmart ineza rbizu/ pairw iseAd onis). A p-value was calculated for every pairwise comparison, and the Benjamini-Hochberg method (Benjamini & Hochberg, 1995) was applied to correct for multiple testing.

| PoMo phylogenetic tree
In order to infer a population-level phylogenetic tree, a Polymorphisms-aware phylogenetic Model (PoMo) was constructed (De Maio et al., 2015). This approach is well-suited for tree estimation in scenarios of incomplete lineage sorting and accounts for ancestral variation while being computationally more efficient than other methods. Secondly, bcftools consensus (Li, 2011) was used to create a whole genome alignment in fasta format using the B. dorsalis reference genome (GenBank assembly accession: GCA_020283865.1).
The alignment was then converted to a "counts file" containing the allele counts per population for each site with the publicly available python library "cflib" (https://github.com/pomo-dev/cflib). Sites with missing data were subsequently omitted from the count's file.
To account for the assumption of independent sites, 1 million sites across the whole genome were randomly selected of which 8680 were variant sites. Based on these data, we employed a generalized time-reversible polymorphism-aware evolutionary model with eight rate heterogeneity categories (GTR + G8 + P; Schrempf et al., 2019) to reconstruct a maximum-likelihood phylogeny in "IQ-tree" (Nguyen et al., 2015). This substitution model features an expanded nucleotide state alphabet so that taxa can have polymorphic states, and thus extends traditional models that are based on nucleotide states that are fixed within a taxon (Schrempf et al., 2016). One-thousand ultrafast bootstrap trees were constructed to calculate node support (Hoang et al., 2018). The resulting tree reconstruction was visualized with ITOL (Letunic & Bork, 2021).

| Kinship networks
As a flexible and highly scalable approach for analyzing individualbased genetic relationships, a kinship network was constructed. The advantage of using an individual-based metric over a populationbased one such as F ST is that no a priori assumptions are made about (meta-)population structure and thus allowing the data to directly inform the fine-grain genetic structure (Jones & Manseau, 2022).
Kinship is ideally suited to infer the impact of recent events on relationships between individuals (such as captive breeding programs; Escoda et al., 2019) but is also applicable to wild populations without prior knowledge of demography, shared ancestry, or pedigree (Staples et al., 2014).
Kinship was estimated using the snpgdsIBDMLE function within the R package "SNPRelate" (Zheng et al., 2012) applying the Expectation-Maximization (EM) algorithm for exploring the maximum value of the log-likelihood function (Milligan, 2003;Zheng et al., 2012). Two networks were created. The first network includes all edges with edge weight in proportion to the respective kinship value. A second network was constructed by extracting the backbone structure of the network, simplifying the network by retaining the most important edges, using the disparity filter algorithm by Serrano et al. (2009) implemented in the R package "backbone" (available at: https://CRAN.R-proje ct.org/packa ge=backbone). By imposing a significance level α = 0.05 or 0.10, the links that carry weights that are not different than under the null hypothesis can be pruned from the network. This method is preferred above a global weight threshold pruning method and reveals the multilayered and often complex nature of the network (Jones & Manseau, 2022).
Pruning at a small value of α reveals those edges that constitute a central part of the network backbone while pruning at a higher value of α can reveal the more shallow substructures within the network.
Nodes without edges are not shown. Node degree is calculated as the number of edges surrounding the respective node.

| RE SULTS
On average, 56,655,392 reads were produced per sample with an average number of mapped reads equal to 48,252,941 (85.17% mapping success). After SNP calling and application of GATK hard filtering, 140,132,105 SNPs remained. Five samples were omitted from further analysis based on missingness larger than 20% (three samples from Reunion, one sample from Burundi, and one sample from Sri Lanka). Consecutive filtering at minimum depth and minor allele count higher than or equal to three, site missingness lower than 10% and linkage pruning rendered 909,776 biallelic SNPs for further PCA and kinship network analysis.

| Runs of homozygosity
Madagascar-Mayotte (p = 0.79). See Table S3 for the full results of the post hoc comparisons.

| Admixture analysis, PCA, and PoMo tree
The sNMF-based estimates of shared ancestry point toward the presence of four distinct genetic clusters with K = 5 as the optimal K according to the level of cross-entropy ( Figure 2). The first cluster contains all Asian samples, excluding five samples from Tamil Nadu (India S, Table S1) Table S1) are forming a distinct cluster, agreeing with results from the admixture analysis ( Figure 2). Individuals from Africa were more closely related to individuals sampled in the Comoros, corroborating the clinal pattern seen in the admixture type analysis. The first and second principal components explained 3.58% and 2.57% of total variations, respectively. The permanova showed significant structuring (p adj < 0.05) for all but one pairwise comparison between the six major geographical groups. There was no observed significant difference between Mayotte and the Comoros (p = 0.105).
Overall, the topology of the PoMo tree was strongly supported based on bootstrap values (Figure 4). The phylogeography of  (Figure 4). were also exclusively detected between individuals belonging to the same population and could primarily be found in the Western group

| DISCUSS ION
In light of tracing the origin of invasive fruit fly incursions, researchers have increasingly focused on resolving the population structure of various tephritid species (Deschepper et al., 2021;Dupuis et al., 2018;Kim et al., 2021;Narde et al., 2005;Popa-Báez et al., 2021;Ruiz-Arce et al., 2020;Virgilio et al., 2010). For example, by amplifying the N5N4 mitochondrial region in Ceratitis capitata, Ruiz-Arce et al. (2020) were able to discriminate between six worldwide geographical regions but struggled to provide a more detailed population genetic structure. A comparable approach was used by Kim et al. (2021) andQ in et al. (2018), who used COI and NAD6 haplotype diversity, respectively, in combination with microsatellite markers to infer regional clustering patterns of B. dorsalis in a large part of its Asian range. Here, they found evidence for founder effects but did not achieve proof for extensive genetic structure, details that could inform invasion pathways. To our knowledge, our study is the first to use SNP markers resulting from whole-genome resequencing to examine tephritid population structuring and have high enough resolution to characterize invasion routes. Within the Asian range, we were able to discriminate between populations using a combination of population-wide allele frequency information in the PoMo analysis and kinship metrics to add granularity at the individual level (Figures 4 and 5). We found populations in Telangana, India (Table S1), and Sri Lanka to be a sister group to populations in Punjab and Tamil Nadu, India (Table S1) Figure 4). The Comoros, Mayotte, and Madagascar, together with the African populations, belonged to one pathway, while the Mascarenes were contained within a separate cluster (Figures 2-4).
Source populations for both colonization pathways are likely different but share genetic variation to some extent. However, fully shared genetically diverse origins cannot be excluded, with drift effects and bottlenecks driving the divergent genetic structure. In the latter case, we would expect to see a larger degree of shared ancestry (Figure 2) in the case of a single origin for both pathways, certainly because the sampling took place only 4-6 years after the invasion of the Mascarenes, rendering little time for drift effects to alter the genetic composition in such a drastic way as observed here.
Furthermore, the ML tree reconstruction showed a strong support for two independent invasion routes within the Indian Ocean with a basal position of the Mascarenes ( Figure 5).
It has to be noted that every population in this study is characterized by a different time frame between the moment of sampling and the time of colonization, which could bias the amount of genetic differentiation that can be observed between the source and introduced population. However, drift especially acts strongly in the early stages after a population bottleneck and wears off over time (Gilchrist et al., 2012). The fast lifecycle of B. dorsalis could imply that drift rapidly changes population genetic structure and differentiation after colonization took place and time and thus only a limited time is needed to pick up signatures of a genetic bottleneck after the invasion.

| Western invasion pathway
A western invasion pathway is characterized by step-wise island hopping starting from the east African coast and ending in Madagascar.
Grande Comore (Comoros W) was most likely the first Indian Ocean island to be colonized as suggested by the backbone structure of

| Eastern invasion pathway
Réunion and Mauritius are always clustered together (Figures 2-5) and are significantly related (Figure 5b). In contrast to islands along the western invasion pathway, the Mascarene islands were colonized directly from Asia and our results do not suggest a relationship with the western pathway. The weak links between both pathways observed in Figure 5a disappear when backbone pruning is applied and thus suggests the insignificance of these links (Figure 5b,c).  Figure 1). Hence, the low genetic diversity is a product of the recency of the founder effect rather than long-term drift on the islands themselves. This is in contrast to Madagascar and Mayotte where the lower genetic diversity compared with the African mainland is likely related to a series of bottlenecks upon introduction on the islands, accompanied by prolonged periods of drift and inbreeding as indicated by the higher SROH ( Figure 1). Additionally, the high number of private alleles, both on a regional and at a population level (Table 1), corroborates the lack of drift on the Mascarenes though caution is advised when interpreting the number of private alleles since this statistic is influenced by a priori assignment of samples to groups.
Ultimately, we think that this study could provide policymakers with valuable information to prevent the re-establishment of B. dorsalis after successful eradication or prevent new invasions.
Long-distance dispersal in the case of isolated areas such as oceanic islands is usually linked with human activities. The main driver for the invasion of agricultural pests is still considered to trade (see Hulme, 2021 for a review). Nevertheless, other major pathways such as luggage of airline passengers (Liebhold et al., 2006) and international shipment of goods (both private and through internet commerce, Lenda et al., 2014) are considered of increasing importance. In popular tourist destinations with a large proportion of the local population having foreign roots and family ties, as is the case in the majority of the western Indian Ocean islands, these pathways are even more pertinent. Characterizing invasion pathways can, therefore, assist policymakers in developing appropriate measures and policies to curb further introductions. It will allow them to target particular trade routes and visitors for more scrutinous phytosanitary inspections, which can even be modeled and adapted by taking into account aspects such as traffic peaks, travel, seasons, and temporal abundance of potential pests in countries of origin (Szyniszewska, 2016). Models for a particular horticultural pest such as the one developed in this research can then be extrapolated to other organisms that are similarly associated with the same crops. Moreover, looking at infestation rates and host range of the invasive B dorsalis populations from the Mascarenes (Moquet et al., 2021;Sookar et al., 2020) compared with the ones from Madagascar or Comoros (Hassani et al., 2016;Rasolofoarivao et al., 2022) it seems that the Mascarenes populations have slightly different ecological parameters with a broader host range and higher infestation rates, highlighting the importance of preventive measures for pest control.

| CON CLUS ION
Our study is the first to investigate the population structure and invasion history of B. dorsalis in the Indian Ocean. By using whole genome SNP data, we resolved two independent invasion path-

CO N FLI C T O F I NTE R E S T
All authors declare that they have no conflicts of interest.

DATA AVA I L A B I L I T Y S TAT E M E N T
With the aim of full access and benefit-sharing within the framework of the Nagoya Protocol, a zenodo data archive was created containing a filtered VCF file, a metadata file, and a count file. The VCF file contains raw SNP information with only the GATK filters applied.
The data can be accessed using the following doi: 10.5281/zenodo.6602072. Raw paired-end read data are stored as a BioProject (PRJNA893460). Admittance to related genetic resources and databases is herewith provided. Research and development results are fully shared by all co-authors involved, and joint ownership of relevant intellectual property rights is assured.

B EN EFIT-S H A R I N G S TATEM ENT
All geographic samples were collected by the different co-authors in compliance with their national jurisdiction or sourced from collections obtained prior to the Nagoya Protocol ratification.