# HIDDEN
import numpy as np
from datascience import *
np.set_printoptions(threshold=50)
url = 'http://www.census.gov/popest/data/national/asrh/2014/files/NC-EST2014-AGESEX-RES.csv'
full = Table.read_table(url)
partial = full.select(['SEX', 'AGE', 4, 8])
simple = partial.relabeled(2, '2010').relabeled(3, '2014')
census = simple.with_columns([
'Change', simple.column('2014')-simple.column('2010'),
'Growth', (simple.column('2014')/simple.column('2010')) ** (1/4) - 1
])
census.set_format('Growth', PercentFormatter)
census.set_format([2, 3, 4], NumberFormatter)
pass
In this section, we will continue to use the US Census dataset. We will focus only on the 2014 population estimate.
census_2014 = census.select(['SEX', 'AGE', '2014']).where(census.column('AGE') != 999)
census_2014
SEX | AGE | 2014 |
---|---|---|
0 | 0 | 3,948,350 |
0 | 1 | 3,962,123 |
0 | 2 | 3,957,772 |
0 | 3 | 4,005,190 |
0 | 4 | 4,003,448 |
0 | 5 | 4,004,858 |
0 | 6 | 4,134,352 |
0 | 7 | 4,154,000 |
0 | 8 | 4,119,524 |
0 | 9 | 4,106,832 |
... (293 rows omitted)
Functions can be used to compute new columns in a table based on existing column values. For example, we can transform the codes in the SEX
column to strings that are easier to interpret.
def male_female(code):
if code == 0:
return 'Total'
elif code == 1:
return 'Male'
elif code == 2:
return 'Female'
This function takes an individual code — 0, 1, or 2 — and returns a string that describes its meaning.
male_female(0)
'Total'
male_female(1)
'Male'
male_female(2)
'Female'
We could also transform ages into age categories.
def age_group(age):
if age < 2:
return 'Baby'
elif age < 13:
return 'Child'
elif age < 20:
return 'Teen'
else:
return 'Adult'
age_group(15)
'Teen'
Apply. The apply
method of a table calls a function on each element of a column, forming a new array of return values. To indicate which function to call, just name it (without quotation marks). The name of the column of input values must still appear within quotation marks.
census_2014.apply(male_female, 'SEX')
array(['Total', 'Total', 'Total', ..., 'Female', 'Female', 'Female'], dtype='<U6')
This array, which has the same length as the original SEX
column of the population
table, can be used as the values in a new column called Male/Female
alongside the existing AGE
and Population
columns.
population = Table().with_columns([
'Male/Female', census_2014.apply(male_female, 'SEX'),
'Age Group', census_2014.apply(age_group, 'AGE'),
'Population', census_2014.column('2014')
])
population
Male/Female | Age Group | Population |
---|---|---|
Total | Baby | 3948350 |
Total | Baby | 3962123 |
Total | Child | 3957772 |
Total | Child | 4005190 |
Total | Child | 4003448 |
Total | Child | 4004858 |
Total | Child | 4134352 |
Total | Child | 4154000 |
Total | Child | 4119524 |
Total | Child | 4106832 |
... (293 rows omitted)
Groups. The group
method with a single argument counts the number of rows for each category in a column. The result contains one row per unique value in the grouped column.
population.group('Age Group')
Age Group | count |
---|---|
Adult | 243 |
Baby | 6 |
Child | 33 |
Teen | 21 |
The optional second argument names the function that will be used to aggregate values in other columns for all of those rows. For instance, sum
will sum up the populations in all rows that match each category. This result also contains one row per unique value in the grouped column, but it has the same number of columns as the original table.
totals = population.where(0, 'Total').select(['Age Group', 'Population'])
totals
Age Group | Population |
---|---|
Baby | 3948350 |
Baby | 3962123 |
Child | 3957772 |
Child | 4005190 |
Child | 4003448 |
Child | 4004858 |
Child | 4134352 |
Child | 4154000 |
Child | 4119524 |
Child | 4106832 |
... (91 rows omitted)
totals.group('Age Group', sum)
Age Group | Population sum |
---|---|
Adult | 236721454 |
Baby | 7910473 |
Child | 44755656 |
Teen | 29469473 |
The groups
method behaves in the same way, but accepts a list of columns as its first argument. The resulting table has one row for every unique combination of values that appear together in the grouped columns. Again, a single argument (a list, in this case) gives row counts.
population.groups(['Male/Female', 'Age Group'])
Male/Female | Age Group | count |
---|---|---|
Female | Adult | 81 |
Female | Baby | 2 |
Female | Child | 11 |
Female | Teen | 7 |
Male | Adult | 81 |
Male | Baby | 2 |
Male | Child | 11 |
Male | Teen | 7 |
Total | Adult | 81 |
Total | Baby | 2 |
... (2 rows omitted)
A second argument to groups
aggregates all other columns that do not appear in the list of grouped columns.
population.groups(['Male/Female', 'Age Group'], sum)
Male/Female | Age Group | Population sum |
---|---|---|
Female | Adult | 121754366 |
Female | Baby | 3869363 |
Female | Child | 21903805 |
Female | Teen | 14393035 |
Male | Adult | 114967088 |
Male | Baby | 4041110 |
Male | Child | 22851851 |
Male | Teen | 15076438 |
Total | Adult | 236721454 |
Total | Baby | 7910473 |
... (2 rows omitted)
Pivot. The pivot
method is closely related to the groups
method: it groups together rows that share a combination of values. It differs from groups
because it organizes the resulting values in a grid. The first argument to pivot
is a column that contains the values that will be used to form new columns in the result. The second argument is a column used for grouping rows. The result gives the count of all rows that share the combination of column and row values.
population.pivot('Male/Female', 'Age Group')
Age Group | Female | Male | Total |
---|---|---|---|
Adult | 81 | 81 | 81 |
Baby | 2 | 2 | 2 |
Child | 11 | 11 | 11 |
Teen | 7 | 7 | 7 |
An optional third argument indicates a column of values that will replace the counts in each cell of the grid. The fourth argument indicates how to aggregate all of the values that match the combination of column and row values.
pivoted = population.pivot('Male/Female', 'Age Group', 'Population', sum)
pivoted
Age Group | Female | Male | Total |
---|---|---|---|
Adult | 121754366 | 114967088 | 236721454 |
Baby | 3869363 | 4041110 | 7910473 |
Child | 21903805 | 22851851 | 44755656 |
Teen | 14393035 | 15076438 | 29469473 |
The advantage of pivot is that it places grouped values into adjacent columns, so that they can be combined. For instance, this pivoted table allows us to compute the proportion of each age group that is male. We find the surprising result that younger age groups are predominantly male, but among adults there are substantially more females.
pivoted.with_column('Male Percentage', pivoted.column('Male')/pivoted.column('Total')).set_format(4, PercentFormatter)
Age Group | Female | Male | Total | Male Percentage |
---|---|---|---|---|
Adult | 121754366 | 114967088 | 236721454 | 48.57% |
Baby | 3869363 | 4041110 | 7910473 | 51.09% |
Child | 21903805 | 22851851 | 44755656 | 51.06% |
Teen | 14393035 | 15076438 | 29469473 | 51.16% |