Expert analysis

11 December 2018

What drives internal displacement? A machine learning approach

IDMC has partnered with the ReDI School of Digital Integration to explore how displacement relates to a wide range of socioeconomic indicators. Data Analytics students applied machine learning, a field of artificial intelligence in which models learn patterns from data, to World Bank data to draw the following conclusions.

Every year, millions of people are displaced around the globe. In 2017 alone, over 30 million people were displaced across 143 different countries due to conflict and environmental catastrophes. Can we predict these events in advance and so provide effective humanitarian support? Do we understand the complexity of factors associated with large-scale internal displacement when a disaster occurs? Nowadays, big data and machine learning tools offer us an opportunity to revisit these questions and draw more robust conclusions about the role of wider factors such as education, gender equality or public sector infrastructure in displacement.

There are two main causes of internal displacement: conflict and disasters. However, a multitude of other factors could influence the scale of displacement. Displacement can affect individuals on one hand, or larger groups on the other. Furthermore, one might argue that the risk of suffering internal displacement due to disasters is simply correlated with geolocation or climatic conditions. However, countries like Costa Rica or Japan show that those assumptions are not always correct. Thus, our hypothesis is that the reasons behind the extent of internal displacement must be multiple and less superficial.

Organisations such as the UN, the World Bank and the NRC / IDMC have collected data for over 10 years, covering facts of concern, such as displacements, as well as socioeconomic indicators which might be considered risk factors. Faced with an overwhelming number of factors to consider, machine learning methods such as linear regression provide the means to select those which describe historical events most accurately.

Robust socioeconomic analysis requires large and reliable data sets. Thanks to the work done by IDMC, we have public records of internal displacements since 2007, covering 35 countries for conflict and 107 countries for environmental disasters. We focus only on disasters for the sake of robust conclusions. To further simplify the analysis, we sum the number of people displaced per year in a given country, regardless of the size of each displacement event in that year. Finally, the number of displacements is divided by the total number of inhabitants and expressed on a logarithmic scale.
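As an illustration of this preparation step, the following is a minimal sketch rather than the code used in the study. The function name, the input layout and the choice of a base-10 logarithm are assumptions made for the example; the article only states that a logarithmic scale was used.

```python
import math
from collections import defaultdict


def log_displacement_rate(events, population):
    """Aggregate displacement events per country and year, then normalise.

    events: iterable of (country, year, people_displaced) tuples (hypothetical layout).
    population: dict mapping (country, year) to the number of inhabitants.
    Returns a dict mapping (country, year) to log10(displaced / inhabitants).
    """
    totals = defaultdict(int)
    for country, year, displaced in events:
        # Sum all events in the same country and year, whatever their size.
        totals[(country, year)] += displaced

    rates = {}
    for key, total in totals.items():
        if total > 0:  # the logarithm is undefined for zero displacements
            # Base 10 is an assumption; the source does not name the base.
            rates[key] = math.log10(total / population[key])
    return rates
```

Country-years with no recorded displacement are simply left out here; how the study handled them is not stated.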

The World Data Bank offers a reliable collection of socioeconomic factors. For the analysis we consider the 232 relative indicators with sufficient data available for all reported countries in the time frame since 2007. We rely on common machine learning tools to extract the most relevant indicators. First, we build 100 subsets of the data to cross-validate the results, and on each subset we fit a Lasso linear regression model, which retains only the indicators with the most impact on the model. We then keep for further analysis only those indicators that are selected in more than two-thirds of the subsets.
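The screening step can be sketched roughly as follows, using scikit-learn's `Lasso`. This is an illustrative sketch, not the students' implementation: the way subsets are drawn (random 80% subsamples without replacement), the regularisation strength `alpha`, the standardisation of indicators and the function name are all assumptions, since the article does not specify them.

```python
import numpy as np
from sklearn.linear_model import Lasso


def stable_lasso_selection(X, y, n_subsets=100, subset_fraction=0.8,
                           alpha=0.1, threshold=2 / 3, seed=0):
    """Fit a Lasso on many random subsets and keep frequently selected columns.

    X: array of shape (n_samples, n_indicators); y: array of shape (n_samples,).
    Returns (indices of retained indicators, selection frequency per indicator).
    """
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    subset_size = int(subset_fraction * n_samples)
    counts = np.zeros(n_features)

    for _ in range(n_subsets):
        # Assumed subset scheme: sample rows without replacement.
        idx = rng.choice(n_samples, size=subset_size, replace=False)
        X_sub, y_sub = X[idx], y[idx]

        # Standardise so the L1 penalty treats all indicators alike.
        mean = X_sub.mean(axis=0)
        std = X_sub.std(axis=0)
        std[std == 0] = 1.0
        model = Lasso(alpha=alpha, max_iter=10_000)
        model.fit((X_sub - mean) / std, y_sub)

        # An indicator counts as selected when its coefficient is non-zero.
        counts += model.coef_ != 0

    frequency = counts / n_subsets
    return np.flatnonzero(frequency > threshold), frequency
```

The two-thirds threshold and the 100 subsets follow the description above; a larger `alpha` would prune more indicators in each fit.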

Figure 1. The graph illustrates quantitatively the percentage of indicators per category selected from the World Bank, after the first screening with the Lasso linear regression and after the statistical analysis with OLS.

Notice that the quality of healthcare, infrastructure, urban development and social protection are not especially relevant for the incidence of large internal displacements due to disasters. On the other hand, the importance of the private sector, the status and growth of the national economy, environmental conditions and the level of education is quite high. Additionally, energy, mining and the public sector gained relevance after the OLS analysis.

The final indicators are extracted from an ordinary least squares (OLS) linear regression model, and only those significant at the 95% confidence level are considered further (third column in the bar plot of Figure 1). We observe that internal displacement occurs more often in countries with higher restrictions on access to fuel or technologies for cooking, a lower proportion of women in national parliaments, a shorter duration of primary school and lower labor tax contributions. The findings should not be taken at face value. Nevertheless, the fact that improvements in all of these factors are associated with smaller numbers of displacements strongly suggests the importance of good governance in meeting the fundamental needs of the broad population.
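The significance filter can be illustrated with a short sketch. It computes OLS coefficients with NumPy and two-sided p-values from the t-distribution in SciPy, then keeps indicators with p < 0.05, which is one common reading of "significant at the 95% level". The function name and the example indicator names are hypothetical, and the study may have used a different library or additional checks.

```python
import numpy as np
from scipy import stats


def ols_significant(X, y, names, level=0.05):
    """Fit OLS with an intercept and return the significant indicators.

    Returns a dict mapping indicator name to (coefficient, p_value)
    for every indicator whose two-sided p-value is below `level`.
    """
    n, k = X.shape
    design = np.column_stack([np.ones(n), X])  # prepend intercept column
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)

    residuals = y - design @ beta
    dof = n - k - 1  # residual degrees of freedom
    sigma2 = residuals @ residuals / dof
    covariance = sigma2 * np.linalg.inv(design.T @ design)
    std_err = np.sqrt(np.diag(covariance))

    t_values = beta / std_err
    p_values = 2 * stats.t.sf(np.abs(t_values), dof)

    # Skip the intercept (index 0) when reporting indicators.
    return {
        name: (coef, p)
        for name, coef, p in zip(names, beta[1:], p_values[1:])
        if p < level
    }
```

A negative coefficient for a hypothetical indicator such as "women_in_parliament" would then correspond to the kind of statement made above: a higher value goes with a lower displacement rate.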

With this analysis we would like to showcase how a data-first approach, using predictive models, can provide interesting insights into the driving forces of a phenomenon such as displacement. We hope that this kind of research can serve as a starting point for further investigation of the underlying dynamics.

ReDI School of Digital Integration is a non-profit digital school for newcomers in Berlin, driven by the belief that education and technology have the potential to break down barriers and connect human potential with opportunities. 

This project emerges from an educational collaboration between IDMC and ReDI School, which has given the students of the Data Analytics course the opportunity to work on a real challenge suggested by IDMC. The resulting study, which has been fully developed by ReDI students Noelia, Shireen, Liubomyr, Tarek, Ramez and Vincetita, under the supervision of their volunteer teachers Celsa, Levin, Max, Evan, Amelie and Jörg, is the outcome of their learning journey, and constitutes the final project for their course.