K-Means Clustering - Data Exploration


UN Data on Countries of the World

We are going to explore or dataset which we get in a csv format but may have missing values. We need to be able to drill down on useful dimensions to explore after cleaning up the data. Since we only have one observation per country, we may not have the option to use columns where there are many missing values as we are effectively going to drop many countries when we drop rows with missing values. But then how did we drop such rows before? Because in those cases there were many observations per individual entity and dropping some did not eliminate an entity altogether.

So first we import the data and explore the columns and types - this time rather than doing it manually we are going to use the facilities in our software to do that.

In [5]:
%pylab inline
import pandas as pd
df = pd.read_csv('../datasets/UN.csv')
print('----')
# print the raw column information plus summary header
print(df)
print('----')
# look at the types of each column explicitly
print('Individual columns - Python data types')
[(x, type(df[x][0])) for x in df.columns] 
Populating the interactive namespace from numpy and matplotlib
----
                  country   region   tfr  contraception  educationMale  \
0             Afghanistan     Asia  6.90            NaN            NaN   
1                 Albania   Europe  2.60            NaN            NaN   
2                 Algeria   Africa  3.81             52           11.1   
3          American.Samoa     Asia   NaN            NaN            NaN   
4                 Andorra   Europe   NaN            NaN            NaN   
5                  Angola   Africa  6.69            NaN            NaN   
6                 Antigua  America   NaN             53            NaN   
7               Argentina  America  2.62            NaN            NaN   
8                 Armenia   Europe  1.70             22            NaN   
9               Australia  Oceania  1.89             76           16.3   
10                Austria   Europe  1.42             71           14.4   
11             Azerbaijan     Asia  2.30             17            NaN   
12                Bahamas  America  1.95             62           12.1   
13                Bahrain     Asia  2.97             53           12.6   
14             Bangladesh     Asia  3.14             49            NaN   
15               Barbados  America  1.73             55            NaN   
16                Belarus   Europe  1.40             50            NaN   
17                Belgium   Europe  1.62             79           15.6   
18                 Belize  America  3.66             47           10.6   
19                  Benin   Africa  5.83             16            NaN   
20                 Bhutan     Asia  5.89             19            NaN   
21                Bolivia  America  4.36             45            NaN   
22                 Bosnia   Europe  1.40            NaN            NaN   
23               Botswana   Africa  4.45             33           10.5   
24                 Brazil  America  2.17             74            NaN   
25                 Brunei     Asia  2.70            NaN           11.8   
26               Bulgaria   Europe  1.45            NaN           11.8   
27           Burkina.Faso   Africa  6.57              8            3.3   
28                Burundi   Africa  6.28              9            5.1   
29               Cambodia     Asia  4.50            NaN            NaN   
..                    ...      ...   ...            ...            ...   
177             Swaziland   Africa  4.46             20           11.5   
178                Sweden   Europe  1.80             78           13.9   
179           Switzerland   Europe  1.46             71           14.5   
180                 Syria     Asia  4.00             36            9.8   
181            Tajikistan     Asia  3.93             21            NaN   
182              Tanzania   Africa  5.48             18            NaN   
183              Thailand     Asia  1.74             74            NaN   
184                  Togo   Africa  6.08             12            NaN   
185                 Tonga  Oceania  4.02             74            NaN   
186   Trinidad.and.Tobago  America  2.10             53           10.1   
187               Tunisia   Africa  2.92             60            NaN   
188                Turkey     Asia  2.50             63           10.6   
189          Turkmenistan     Asia  3.58             20            NaN   
190                Tuvalu  Oceania   NaN            NaN            NaN   
191                Uganda   Africa  7.10             15            NaN   
192               Ukraine   Europe  1.38             23            NaN   
193  United.Arab.Emirates     Asia  3.46            NaN            9.8   
194        United.Kingdom   Europe  1.72             82           16.1   
195         United.States  America  1.96             71           15.4   
196               Uruguay  America  2.25            NaN            NaN   
197            Uzbekistan     Asia  3.48             56            NaN   
198               Vanuatu  Oceania  4.36             15            NaN   
199             Venezuela  America  2.98             52           10.2   
200              Viet.Nam     Asia  2.97             65            NaN   
201        Virgin.Islands  America  3.03            NaN            NaN   
202        Western.Sahara   Africa  3.98            NaN            NaN   
203                 Yemen     Asia  7.60              7            NaN   
204            Yugoslavia   Europe  1.80            NaN            NaN   
205                Zambia   Africa  5.49             25            7.9   
206              Zimbabwe   Africa  4.68             48            NaN   

     educationFemale  lifeMale  lifeFemale  infantMortality  GDPperCapita  \
0                NaN      45.0        46.0              154          2848   
1                NaN      68.0        74.0               32           863   
2                9.9      67.5        70.3               44          1531   
3                NaN      68.0        73.0               11           NaN   
4                NaN       NaN         NaN              NaN           NaN   
5                NaN      44.9        48.1              124           355   
6                NaN       NaN         NaN               24          6966   
7                NaN      69.6        76.8               22          8055   
8                NaN      67.2        74.0               25           354   
9               16.1      75.4        81.2                6         20046   
10              14.2      73.7        80.1                6         29006   
11               NaN      66.5        74.5               33           321   
12              13.2      70.5        77.1               14         12545   
13              13.3      71.1        75.3               18          9073   
14               NaN      58.1        58.2               78           280   
15               NaN      73.6        78.7                9          7173   
16               NaN      64.4        74.8               15           994   
17              15.4      73.9        80.6                7         26582   
18              10.4      73.4        76.1               30          2569   
19               NaN      52.4        57.2               84           391   
20               NaN      51.6        54.9              104           166   
21               NaN      59.8        63.2               66           909   
22               NaN      70.5        75.9               13           271   
23              10.7      48.9        51.7               56          3640   
24               NaN      63.4        71.2               42          4510   
25              12.1      73.4        78.1                9         16683   
26              12.5      67.8        74.9               16          1518   
27               2.0      45.1        47.0               97           165   
28               4.0      45.5        48.8              114           205   
29               NaN      52.6        55.4              102           130   
..               ...       ...         ...              ...           ...   
177             10.8      57.7        62.3               65          1389   
178             14.5      76.2        80.8                5         26253   
179             13.5      75.3        81.8                5         42416   
180              8.5      66.7        71.2               33          3573   
181              NaN      64.2        70.2               56           122   
182              NaN      50.0        52.8               80           139   
183              NaN      66.3        72.3               30          2896   
184              NaN      48.8        51.5               86           322   
185              NaN      67.0        71.0                3          1787   
186             11.3      71.5        76.2               14          4083   
187              NaN      68.4        70.7               37          2030   
188              8.7      66.5        71.7               44          2814   
189              NaN      61.2        68.0               57           321   
190              NaN       NaN         NaN              NaN           NaN   
191              NaN      40.4        42.3              113           305   
192              NaN      63.6        74.0               18           694   
193             10.3      73.9        76.5               15         17690   
194             16.6      74.5        79.8                6         18913   
195             16.2      73.4        80.1                7         26037   
196              NaN      69.6        76.1               17          5602   
197              NaN      64.3        70.7               43           435   
198              NaN      65.5        69.5               38          1289   
199             10.7      70.0        75.7               21          3496   
200              NaN      64.9        69.6               37           270   
201              NaN       NaN         NaN               12           NaN   
202              NaN      59.8        63.1               64           NaN   
203              NaN      57.4        58.4               80           732   
204              NaN      69.8        75.3               19          1487   
205              6.8      42.2        43.7              103           382   
206              NaN      47.6        49.4               68           786   

     economicActivityMale  economicActivityFemale  illiteracyMale  \
0                    87.5                     7.2          52.800   
1                     NaN                     NaN             NaN   
2                    76.4                     7.8          26.100   
3                    58.8                    42.4           0.264   
4                     NaN                     NaN             NaN   
5                     NaN                     NaN             NaN   
6                    74.4                    56.2             NaN   
7                    76.2                    41.3           3.800   
8                    65.0                    52.0           0.300   
9                    74.0                    53.8             NaN   
10                   69.5                    47.7             NaN   
11                    NaN                     NaN           0.300   
12                   81.2                    67.0           1.500   
13                   88.2                    29.2          10.900   
14                   88.8                    55.9          50.600   
15                   73.4                    61.4           2.000   
16                   76.4                    61.3           0.300   
17                    NaN                     NaN             NaN   
18                   79.0                    34.0          21.252   
19                   90.0                    57.8          51.300   
20                    NaN                     NaN          43.800   
21                   74.1                    56.3           9.500   
22                    NaN                     NaN             NaN   
23                   75.4                    41.7          19.500   
24                   84.0                    53.6          16.700   
25                   82.2                    46.4           7.400   
26                   60.7                    53.8           0.924   
27                   88.9                    79.4          70.500   
28                   90.1                    90.6          50.700   
29                   77.3                    84.7             NaN   
..                    ...                     ...             ...   
177                  64.3                    27.7          22.000   
178                  80.0                    75.6             NaN   
179                  78.5                    56.8             NaN   
180                   NaN                     NaN          14.300   
181                  75.0                    60.0           0.300   
182                   NaN                     NaN          20.600   
183                  83.8                    65.2           4.000   
184                  77.5                    50.8          33.000   
185                  74.2                    45.4           0.264   
186                  75.5                    44.9           1.200   
187                  75.4                    20.3          21.400   
188                  75.9                    30.6           8.300   
189                  78.0                    62.0           0.200   
190                   NaN                     NaN             NaN   
191                   NaN                     NaN          26.300   
192                  69.1                    57.1           0.330   
193                  92.5                    24.2          21.100   
194                  71.9                    53.5             NaN   
195                  74.9                    59.3           2.244   
196                  74.0                    46.7           3.100   
197                  75.0                    61.0           0.200   
198                  88.6                    79.3          34.914   
199                  82.1                    41.2           8.200   
200                  81.6                    74.1           3.500   
201                  72.3                    59.5             NaN   
202                   NaN                     NaN             NaN   
203                  80.6                     1.9          32.406   
204                   NaN                     NaN           1.782   
205                   NaN                     NaN          14.400   
206                  77.7                    46.7           9.600   

     illiteracyFemale  
0              85.000  
1                 NaN  
2              51.000  
3               0.360  
4                 NaN  
5                 NaN  
6                 NaN  
7               3.800  
8               0.500  
9                 NaN  
10                NaN  
11              0.500  
12              2.000  
13             20.600  
14             73.900  
15              3.200  
16              0.600  
17                NaN  
18             23.472  
19             74.200  
20             71.900  
21             24.000  
22                NaN  
23             40.100  
24             16.800  
25             16.600  
26              2.376  
27             90.800  
28             77.500  
29                NaN  
..                ...  
177            24.400  
178               NaN  
179               NaN  
180            44.200  
181             0.400  
182            43.200  
183             8.400  
184            63.000  
185             0.504  
186             3.000  
187            45.400  
188            27.600  
189             0.400  
190               NaN  
191            49.800  
192             2.160  
193            20.200  
194               NaN  
195             2.232  
196             2.300  
197             0.400  
198            46.368  
199             9.700  
200             8.800  
201               NaN  
202               NaN  
203            69.552  
204             9.072  
205            28.700  
206            20.100  

[207 rows x 14 columns]
----
Individual columns - Python data types
Out[5]:
[('country', str),
 ('region', str),
 ('tfr', numpy.float64),
 ('contraception', numpy.float64),
 ('educationMale', numpy.float64),
 ('educationFemale', numpy.float64),
 ('lifeMale', numpy.float64),
 ('lifeFemale', numpy.float64),
 ('infantMortality', numpy.float64),
 ('GDPperCapita', numpy.float64),
 ('economicActivityMale', numpy.float64),
 ('economicActivityFemale', numpy.float64),
 ('illiteracyMale', numpy.float64),
 ('illiteracyFemale', numpy.float64)]

Here we see that we have 14 columns with country and region being string types and the rest being floats. We also see that the country column has 207 values, ie this is data on 207 countries. The region columns also has 207 entries, but the rest of the columns have many missing entries, indicated by number of non-null values less than 207.

We see that tfr, lifeMale, lifeFemale and GDP, and infantMortality are the columns closest to 207. That is, if we use these columns we will only drop a few countries and not whole clusters as we might if we used educationMale and educationFemale. On the other hand were we to use educationMale and educatonFemale we would have to drop almost 2/3 of the data. So we focus on the columns with non-null values close to 207.

So our short list is now, country, region, tfr, lifeMale, lifeFemale and GDP, and infantMortality.

We suspect that there is clustering of lifeMale, lifeFemale and infantMortality according to GDP and we are going to pull out the heavy machinery of K-Means sofwtare to analyse this in detail and look at the clusters.

We don't know in advance how many clusters there will be which is different from the iris example where we had a 'species' label and there were three unique species.

So while using our KMeans software we will also look at some analytical measures to decide what the right number of clusters might be after looking at multiple such possibilities from 1 through 10 candidate clusters.

Finally, to be able to apply the KMeans modeling software we convert each field in our file to a scientific float format that the numerical algorithms expect.

Onward!

In [2]:
from IPython.core.display import HTML
def css_styling():
    styles = open("../styles/custom.css", "r").read()
    return HTML(styles)
css_styling()
Out[2]: