Makeover Monday, 16 April 2018

Analysis and visualisation of (simulated) malaria cases for Makeover Monday.

Data from VisualizeNoMalaria via Makeover Monday.

In [1]:
import collections
from datetime import datetime
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.cm as cm
%matplotlib inline

import numpy as np
import pandas as pd
import scipy.stats

Read the dataset

Drop the Disclaimer column and rename the remaining columns while we're here.

In [2]:
malaria_raw = pd.read_excel('Simulated VisualizeNoMalaria Counts.xlsx').drop('Disclaimer', axis=1)
malaria_raw.columns = ['country', 'province', 'district', 'ruralurban', 'date', 'report', 'cases']
malaria_raw.head()
Out[2]:
   country  province    district  ruralurban        date                   report  cases
0   Zambia  Southern  Chikankata       Rural  2014-01-01          Health Facility      0
1   Zambia  Southern  Chikankata       Rural  2014-01-01  Community Health Worker    288
2   Zambia  Southern  Chikankata       Rural  2014-02-01          Health Facility      0
3   Zambia  Southern  Chikankata       Rural  2014-02-01  Community Health Worker    251
4   Zambia  Southern  Chikankata       Rural  2014-03-01          Health Facility      0

Explore the data

Count how many rows there are for each value of every categorical column.

In [3]:
malaria_raw.country.value_counts()
Out[3]:
Zambia    3586
Name: country, dtype: int64
In [4]:
malaria_raw.province.value_counts()
Out[4]:
Southern    3586
Name: province, dtype: int64
In [5]:
malaria_raw.district.value_counts()
Out[5]:
Monze          400
Kalomo         400
Kazungula      400
Mazabuka       400
Choma          400
Pemba          200
Chikankata     200
Gwembe         200
Siavonga       200
Namwala        200
Zimba          200
Sinazongwe     200
Livingstone    186
Name: district, dtype: int64
In [6]:
malaria_raw.ruralurban.value_counts()
Out[6]:
Rural    2404
Urban    1182
Name: ruralurban, dtype: int64
In [7]:
malaria_raw.report.value_counts()
Out[7]:
Health Facility            1793
Community Health Worker    1793
Name: report, dtype: int64

Country and province each take a single value (Zambia and Southern), so they carry no information for this analysis.
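
The same check can be automated rather than read off each `value_counts()` output. A minimal sketch, assuming a small made-up frame that reuses the notebook's column names (it is not the real spreadsheet): count the distinct values per column and drop any column that has only one.

```python
import pandas as pd

# Made-up sample with the same column names as malaria_raw (not the real data).
sample = pd.DataFrame({
    'country': ['Zambia'] * 4,
    'province': ['Southern'] * 4,
    'district': ['Choma', 'Choma', 'Monze', 'Monze'],
    'cases': [0, 12, 3, 7],
})

# A column with a single distinct value cannot distinguish any rows.
distinct = sample.nunique()
constant_cols = distinct[distinct == 1].index.tolist()

# Keep only the columns that actually vary.
trimmed = sample.drop(columns=constant_cols)
print(constant_cols)
print(trimmed.columns.tolist())
```

Running the same two lines on `malaria_raw` should flag `country` and `province`, given the counts above.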

In [8]:
malaria_raw.groupby(['district', 'ruralurban']).size()
Out[8]:
district     ruralurban
Chikankata   Rural         200
Choma        Rural         200
             Urban         200
Gwembe       Rural         200
Kalomo       Rural         200
             Urban         200
Kazungula    Rural         200
             Urban         200
Livingstone  Rural           4
             Urban         182
Mazabuka     Rural         200
             Urban         200
Monze        Rural         200
             Urban         200
Namwala      Rural         200
Pemba        Rural         200
Siavonga     Rural         200
Sinazongwe   Rural         200
Zimba        Rural         200
dtype: int64
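
The long listing above is easier to scan as a district-by-setting table. A minimal sketch on made-up rows (the district names are real, the row counts are not): `unstack(fill_value=0)` turns the missing district/setting pairs into explicit zeros, which makes rural-only districts easy to pick out.

```python
import pandas as pd

# Made-up rows; only the two columns needed for the count are included.
sample = pd.DataFrame({
    'district': ['Choma', 'Choma', 'Choma', 'Gwembe', 'Gwembe'],
    'ruralurban': ['Rural', 'Urban', 'Urban', 'Rural', 'Rural'],
})

# One row per district, one column per setting; absent pairs become 0.
counts = (sample.groupby(['district', 'ruralurban'])
          .size()
          .unstack(fill_value=0))

# Districts with no urban rows at all.
rural_only = counts.index[counts['Urban'] == 0].tolist()
print(counts)
print(rural_only)
```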

Initial plots

A few quick plots to see what the data looks like.

In [9]:
malaria_raw.groupby('date').sum().plot()
Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff983268588>
In [10]:
malaria_raw.groupby(['date', 'report']).sum().unstack().plot()
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff981157438>
In [11]:
ax = malaria_raw.groupby(['date', 'district']).sum().unstack().plot()
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))
Out[11]:
<matplotlib.legend.Legend at 0x7ff9810d8128>
In [12]:
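# Select the cases column explicitly: recent pandas versions no longer drop
# non-numeric columns (such as date) silently when summing.
malaria_raw.groupby('district')[['cases']].sum().sort_values(by='cases')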
Out[12]:
              cases
district
Livingstone    4790
Namwala        6439
Mazabuka       8068
Monze          9243
Chikankata    13917
Zimba         14984
Choma         32397
Kazungula     33731
Pemba         35081
Kalomo        35529
Siavonga      40703
Gwembe        64059
Sinazongwe   158874
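
Raw totals are hard to compare at a glance, so it can help to express each district as a share of all cases. A minimal sketch that retypes the totals from the table above into a Series (the names `totals` and `share` are illustrative, not from the original notebook):

```python
import pandas as pd

# District totals copied from the table above.
totals = pd.Series({
    'Livingstone': 4790, 'Namwala': 6439, 'Mazabuka': 8068, 'Monze': 9243,
    'Chikankata': 13917, 'Zimba': 14984, 'Choma': 32397, 'Kazungula': 33731,
    'Pemba': 35081, 'Kalomo': 35529, 'Siavonga': 40703, 'Gwembe': 64059,
    'Sinazongwe': 158874,
}, name='cases')

# Fraction of all simulated cases per district, largest first.
share = (totals / totals.sum()).sort_values(ascending=False)
print(share.round(3).head(3))
```

On these totals Sinazongwe accounts for just over a third of all cases.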
In [13]:
ax = malaria_raw.groupby(['date', 'ruralurban']).sum().unstack().plot()
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))
Out[13]:
<matplotlib.legend.Legend at 0x7ff981089d68>
In [14]:
ax = malaria_raw.groupby(['date', 'district', 'report']).sum().unstack([-2, -1]).plot(figsize=(15, 15))
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5));

Just 2014

The 2014 figures look atypical compared with later years. Let's split the data into 2014 and 2015 onwards and look at each separately.

In [15]:
malaria_2014 = malaria_raw[malaria_raw.date.dt.year == 2014]
malaria_2014.head()
Out[15]:
   country  province    district  ruralurban        date                   report  cases
0   Zambia  Southern  Chikankata       Rural  2014-01-01          Health Facility      0
1   Zambia  Southern  Chikankata       Rural  2014-01-01  Community Health Worker    288
2   Zambia  Southern  Chikankata       Rural  2014-02-01          Health Facility      0
3   Zambia  Southern  Chikankata       Rural  2014-02-01  Community Health Worker    251
4   Zambia  Southern  Chikankata       Rural  2014-03-01          Health Facility      0
In [16]:
malaria_2015p = malaria_raw[malaria_raw.date.dt.year >= 2015]
malaria_2015p.head()
Out[16]:
    country  province    district  ruralurban        date                   report  cases
24   Zambia  Southern  Chikankata       Rural  2015-01-01          Health Facility      0
25   Zambia  Southern  Chikankata       Rural  2015-01-01  Community Health Worker     87
26   Zambia  Southern  Chikankata       Rural  2015-02-01          Health Facility      0
27   Zambia  Southern  Chikankata       Rural  2015-02-01  Community Health Worker     77
28   Zambia  Southern  Chikankata       Rural  2015-03-01          Health Facility      0
In [17]:
ax = malaria_2014.groupby(['date', 'district']).sum().unstack().plot()
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5));
In [18]:
ax = malaria_2014.groupby(['date', 'district', 'report']).sum().unstack([-2, -1]).plot()
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5));
In [19]:
ax = malaria_2015p.groupby(['date', 'district']).sum().unstack().plot()
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5));
In [20]:
ax = malaria_2015p.groupby(['date', 'district', 'report']).sum().unstack([-2, -1]).plot()
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5));
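
To put a number on the difference between 2014 and later years, one option is to label each row by period and compare the mean monthly count. A minimal sketch on a made-up sample: the first four counts are Chikankata's community health worker figures shown above, the 2016 value is invented, and the `period` column is an assumption rather than part of the dataset.

```python
import pandas as pd

# Small illustrative sample; the last row is invented.
sample = pd.DataFrame({
    'date': pd.to_datetime(['2014-01-01', '2014-02-01',
                            '2015-01-01', '2015-02-01', '2016-01-01']),
    'cases': [288, 251, 87, 77, 40],
})

# Label each row as belonging to 2014 or to 2015 onwards.
sample['period'] = (sample.date.dt.year == 2014).map({True: '2014', False: '2015+'})

# Mean count per row within each period.
period_mean = sample.groupby('period')['cases'].mean()
print(period_mean)
```

Applied to `malaria_raw`, the same grouping could also include `district` to see which districts drive the change.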

Sinazongwe

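Sinazongwe is an outlier: its 158,874 total cases are more than twice those of the next-highest district (Gwembe, 64,059). Let's look at Sinazongwe on its own, and at everything except Sinazongwe.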

In [21]:
malaria_sina = malaria_raw[malaria_raw.district == 'Sinazongwe']
malaria_sina.head()
Out[21]:
      country  province    district  ruralurban        date                   report  cases
3186   Zambia  Southern  Sinazongwe       Rural  2014-01-01          Health Facility      0
3187   Zambia  Southern  Sinazongwe       Rural  2014-01-01  Community Health Worker      3
3188   Zambia  Southern  Sinazongwe       Rural  2014-02-01          Health Facility      0
3189   Zambia  Southern  Sinazongwe       Rural  2014-02-01  Community Health Worker      0
3190   Zambia  Southern  Sinazongwe       Rural  2014-03-01          Health Facility      0
In [22]:
malaria_not_sina = malaria_raw[malaria_raw.district != 'Sinazongwe']
malaria_not_sina.head()
Out[22]:
   country  province    district  ruralurban        date                   report  cases
0   Zambia  Southern  Chikankata       Rural  2014-01-01          Health Facility      0
1   Zambia  Southern  Chikankata       Rural  2014-01-01  Community Health Worker    288
2   Zambia  Southern  Chikankata       Rural  2014-02-01          Health Facility      0
3   Zambia  Southern  Chikankata       Rural  2014-02-01  Community Health Worker    251
4   Zambia  Southern  Chikankata       Rural  2014-03-01          Health Facility      0
In [23]:
malaria_not_sina.groupby(['date', 'report']).sum().unstack().plot()
Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff981149978>
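
Instead of keeping two filtered frames, the Sinazongwe/other split can be carried as a label so both groups sit side by side in one table. A minimal sketch on invented counts (the `group` column and the numbers are assumptions for illustration only):

```python
import pandas as pd

# Invented monthly counts for two districts.
sample = pd.DataFrame({
    'district': ['Sinazongwe', 'Gwembe', 'Sinazongwe', 'Gwembe'],
    'date': pd.to_datetime(['2016-01-01', '2016-01-01',
                            '2016-02-01', '2016-02-01']),
    'cases': [900, 120, 1100, 90],
})

# Boolean mask for the outlier district, turned into a readable label.
is_sina = sample.district == 'Sinazongwe'
labelled = sample.assign(group=is_sina.map({True: 'Sinazongwe', False: 'Others'}))

# One row per month, one column per group.
by_group = labelled.groupby(['date', 'group'])['cases'].sum().unstack()
print(by_group)
```

Calling `by_group.plot()` would then draw both series on the same axes for a direct comparison.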