Skip directly to content

Preparing Child Mortality Data Set for Analysis and Visualization

on Sun, 12/27/2015 - 20:06

The year 2015 marks the end of the Millennium Development Goals (MDGs) and the final milestone for world's countries to reach the eight goals and their targets.  In this context, I started a project to visualize the MDG 4: Reduce Child Mortality; particularly to show the global and country's achievements toward reaching Target 4.A: Reduce by two thirds the under-five mortality rate between 1990 and 2015.

The United Nation Inter-agency Group for Child Mortality Estimation (UN IGME) -integrated by UNICEF, WHO, the World Bank and the Population Division of the United Nations Department of Economic and Social Affairs- provides reliable and comparable estimates of child and infant mortality, which is the perfect data source for this project. Data sets are available online in the Child Mortality Estimate - CME Info web site.

After a quick inspection to the data sets, we realized that data format is not ready for analysis and visualization, requiring data preparation and transformation. This article describes the problems of the data format and illustrate a data preparation solution using Alteryx.

We will use the dataset of estimates for under-five, infant and neonatal mortality (file: RatedDeaths_AllIndicators.xlsx). Figure 1 shows the data table format.

Figure 1. Data table format as is distributed in CME Info web site

Why this format is not ready for data analysis and visualization?

Check out the bullets below:

  • Rows/lines 1 to 6 contains titles and descriptive texts, which is not part of the data set itself. 
  • Row 7 contains the names of variables or column names
  • Inthe current format, there are three dimension (first 3 columns), ISO Code and CountryName for identification of countries, and Uncertainty bounds containing categories for main estimate, and lower and upper uncertainty limits 
  • There are a lot of measures (those columns containing numbers), where names of columns are formed with the combination of indicator short name, a dot, and a four digit number that represent year. For instance, U5MR.1976 is the name for the measure under-five mortality rates for year 1976. The data set contains six indicators for years 1950 to 2015. This means there are combination of six indicators and 65 years, for a total of 390 measures.

This type of data format is called a cross-tabulation, which is sort of a report, but it is very difficult for data exploration, analysis, and visualization. Having repetitive columns with combination of dimensions and measures increase friction and disruption during the analytic cycle. For instance, it is hard to create a trend line of under-five mortality rates for a selected country by years because of year is fragmented in multiple columns.

This format need to be transformed and better preparaed for visualization. Figure 2 shows the workflow to transform the original data set and prepare a new data set ready for analysis and visualization. 

Figure 2. Data preparation workflow using Alteryx

What this workflow does?

  1. Remove lines 1 to 6
  2. Take values from row 7 as names for columns or variables
  3. Separate the first three columns ISO Code, Country Name, and Uncertainty bounds
  4. For each indicator, dinamically select those columns (combination of indicator names and year) related to it, remove indicator name part to leave juts the year, and join them to the first three columns (those that were selected in step 3). Then columns of years are transpose to create the new dimension year, This step transforms the format from horizontal to vertical. These steps are repeated for the six indicators included in the data set.
  5. The six data streams are finaly unionized and null values are filtered.
  6. The data stream from step 5 is then blended (using a left-join) with the Country list data set in order to add variables such as region and income groups. 
  7. Final step creates two similar data sets, one in comma separated values (.csv) format and the other in Tableau data extract (.tde) format.

The transformed and ready-to-use data set is displayed below in figure 3.

Figure 3. New data format as result of the data preparation and transformation process

The new format of the data set contains ten columns or fields: nine dimensions of analysis (ISO Code, Country Name, Uncertainty bounds, Year, Metric, WHO Member States, WHO Region, WHO Region Code, and Income group ) and one measure (Value) . Now using Tableau Desktop, with few steps we are able to create a line chart to see trends of child mortality rates from 1950 to 2015 for a selected country.

Steps to create the line chart in Tableau Desktop:

  1. Connect to data set in Tableau data exteract format (.tde)
  2. Drag-and-drop Year in Columns shelf
  3. Drag-and-drop Value in Rows shelf
  4. Drag-and-drop Uncertainty bounds in Filters shelf and select Median
  5. Drag-and-drop Metric in Filters and select Under Five Mortality rates
  6. Drag Country Name to Filters shelf, select one country and show the filter

Figure 4. Trends of child mortality rates in selected country, 1950-2015 

Conclusions

Data preparation is a time consuming and a fundamental step in visual analytics and visualization. An undersatnding of concepts such as dimensions, measures, database relational model, dimensional model, and how data should be better organized for data exploration, analysis and visualization is key for a successful project. Spend some time reviewing how the data source is organized and formated in order to determine if data preparation is required.  

PS. I presented and discussed this data preparation solution and workflow in the Washington DC Alteryx User Group Meeting on December 16, 2015. 

Comments

George Ikuta Mbevi's picture

Useful insights

brian's picture

It would really be helpful if the .yxmd was available for download. Am I missing the link somewhere?

Post new comment