Tuesday, December 29, 2015

Reproducible research from a book

 

Preamble

Sometimes, you don't have direct access to the data, or the data changes over time. 
 
Yeah, I know, scary. So that's my point in this post. Provide a URL to a "frozen" version of your data, if at all possible. Toward the end of the article I provide a link to the notebook. This repo also holds the data I used for the visualization.
 
Let's get right into it...
 
In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
sns.set_context("talk")

 

Reproducible visualization

In "The Functional Art: An introduction to information graphics and visualization" by Alberto Cairo, on page 12 we are presented with a visualization of UN data time series of Fertility rate (average number of children per woman) per country:

Figure 1.6 Highlighting the relevant, keeping the secondary in the background.

Book url:
The Functional Art


Let's try to reproduce this.

 

Getting the data

The visualization was done in 2012, but limited the visualization to 2010.

This should make it easy, in theory, to get the data, since it is historical. These are directly available as excel spreadsheets now, we'll just ignore the last bucket (2010-2015).
Pandas allows loading an excel spreadsheet straight from a URL, but here we will download it first so we have a local copy.

In [3]:
!wget 'http://esa.un.org/unpd/wpp/DVD/Files/1_Indicators%20(Standard)/
EXCEL_FILES/2_Fertility/WPP2015_FERT_F04_TOTAL_FERTILITY.XLS'
--2015-12-29 16:57:23--  http://esa.un.org/unpd/wpp/DVD/Files/
1_Indicators%20(Standard)/EXCEL_FILES/2_Fertility/
WPP2015_FERT_F04_TOTAL_FERTILITY.XLS
Resolving esa.un.org... 157.150.185.69
Connecting to esa.un.org|157.150.185.69|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 869376 (849K) [application/vnd.ms-excel]
Saving to: 'WPP2015_FERT_F04_TOTAL_FERTILITY.XLS'

WPP2015_FERT_F04_TO 100%[=====================>] 849.00K   184KB/s   in 4.6s   

2015-12-29 16:57:28 (184 KB/s) - 
'WPP2015_FERT_F04_TOTAL_FERTILITY.XLS' saved [869376/869376]

 

World Population Prospects: The 2015 Revision

File FERT/4: Total fertility by major area, region and country, 1950-2100 (children per woman)
Estimates, 1950 - 2015                                
POP/DB/WPP/Rev.2015/FERT/F04                                
July 2015 - Copyright © 2015 by United Nations. All rights reserved                                
Suggested citation: United Nations, Department of Economic and Social Affairs,
Population Division (2015). 
World Population Prospects: The 2015 Revision, DVD Edition.
 
In [2]:
df = pd.read_excel('WPP2015_FERT_F04_TOTAL_FERTILITY.XLS', skiprows=16,  
                   index_col = 'Country code')
df = df[df.index < 900]
 
In [3]:
len(df)
Out[3]:
201
 
In [4]:
df.head()
Out[4]:

Index Variant Major area, region, country or area * Notes 1950-1955 1955-1960 1960-1965 1965-1970 1970-1975 1975-1980 1980-1985 1985-1990 1990-1995 1995-2000 2000-2005 2005-2010 2010-2015
Country code
















108 15 Estimates Burundi NaN 6.8010 6.8570 7.0710 7.2680 7.3430 7.4760 7.4280 7.5920 7.4310 7.1840 6.908 6.523 6.0756
174 16 Estimates Comoros NaN 6.0000 6.6010 6.9090 7.0500 7.0500 7.0500 7.0500 6.7000 6.1000 5.6000 5.200 4.900 4.6000
262 17 Estimates Djibouti NaN 6.3120 6.3874 6.5470 6.7070 6.8450 6.6440 6.2570 6.1810 5.8500 4.8120 4.210 3.700 3.3000
232 18 Estimates Eritrea NaN 6.9650 6.9650 6.8150 6.6990 6.6200 6.6200 6.7000 6.5100 6.2000 5.6000 5.100 4.800 4.4000
231 19 Estimates Ethiopia NaN 7.1696 6.9023 6.8972 6.8691 7.1038 7.1838 7.4247 7.3673 7.0888 6.8335 6.131 5.258 4.5889

First problem... The book states on page 8:
--
Yet we have 201 countries (codes 900+ are regions) with complete data. We do not have a easy way to identify which countries were added to this. Still, let's move forward and prep our data.
In [5]:
df.rename(columns={df.columns[2]:'Description'}, inplace=True)
In [6]:
df.drop(df.columns[[0, 1, 3, 16]], axis=1, inplace=True) # drop what we dont need
In [7]:
df.head()
Out[7]:

Description 1950-1955 1955-1960 1960-1965 1965-1970 1970-1975 1975-1980 1980-1985 1985-1990 1990-1995 1995-2000 2000-2005 2005-2010
Country code












108 Burundi 6.8010 6.8570 7.0710 7.2680 7.3430 7.4760 7.4280 7.5920 7.4310 7.1840 6.908 6.523
174 Comoros 6.0000 6.6010 6.9090 7.0500 7.0500 7.0500 7.0500 6.7000 6.1000 5.6000 5.200 4.900
262 Djibouti 6.3120 6.3874 6.5470 6.7070 6.8450 6.6440 6.2570 6.1810 5.8500 4.8120 4.210 3.700
232 Eritrea 6.9650 6.9650 6.8150 6.6990 6.6200 6.6200 6.7000 6.5100 6.2000 5.6000 5.100 4.800
231 Ethiopia 7.1696 6.9023 6.8972 6.8691 7.1038 7.1838 7.4247 7.3673 7.0888 6.8335 6.131 5.258
In [8]:
highlight_countries = ['Niger','Yemen','India',
                       'Brazil','Norway','France','Sweden','United Kingdom',
                       'Spain','Italy','Germany','Japan', 'China'
                      ]
In [9]:
# Subset only countries to highlight, transpose for timeseries
df_high = df[df.Description.isin(highlight_countries)].T[1:]
In [10]:
# Subset the rest of the countries, transpose for timeseries
df_bg = df[~df.Description.isin(highlight_countries)].T[1:]

Let's make some art

In [11]:
# background
ax = df_bg.plot(legend=False, color='k', alpha=0.02, figsize=(12,12))
ax.xaxis.tick_top()

# highlighted countries
df_high.plot(legend=False, ax=ax)

# replacement level line
ax.hlines(y=2.1, xmin=0, xmax=12, color='k', alpha=1, linestyle='dashed')

# Average over time on all countries
df.mean().plot(ax=ax, color='k', label='World\naverage')

# labels for highlighted countries on the right side
for country in highlight_countries:
    ax.text(11.2,df[df.Description==country].values[0][12],country)
    
# start y axis at 1
ax.set_ylim(ymin=1)
Out[11]:
(1, 9.0)