Elements of Data Science
by Allen Downey
Loading and cleaning data from Pew Research Center, Religion and Public Life.
"In U.S., Decline of Christianity Continues at Rapid Pace: An update on America's changing religious landscape", October 17, 2019.
We can download the tables directly from Pew Research:
import os
if not os.path.exists('Detailed-Tables-v1-FOR-WEB.pdf'):
!wget https://www.pewforum.org/wp-content/uploads/sites/7/2019/10/Detailed-Tables-v1-FOR-WEB.pdf
--2019-10-24 15:21:35-- https://www.pewforum.org/wp-content/uploads/sites/7/2019/10/Detailed-Tables-v1-FOR-WEB.pdf Resolving www.pewforum.org (www.pewforum.org)... 192.0.66.2 Connecting to www.pewforum.org (www.pewforum.org)|192.0.66.2|:443... connected. HTTP request sent, awaiting response... 200 OK Length: 1069037 (1.0M) [application/pdf] Saving to: ‘Detailed-Tables-v1-FOR-WEB.pdf’ Detailed-Tables-v1- 100%[===================>] 1.02M --.-KB/s in 0.07s 2019-10-24 15:21:36 (13.9 MB/s) - ‘Detailed-Tables-v1-FOR-WEB.pdf’ saved [1069037/1069037]
The data are in a table in a PDF document, so I'll use tabula-py to read them.
import sys
IN_COLAB = 'google.colab' in sys.modules
if IN_COLAB:
!pip install tabula-py
from tabula import read_pdf
import pandas as pd
I'll read the table from page 1, specifying which column to use as the index and which columns to read. I'm using data from the Aggregated Political Surveys (columns 3-12) and skipping the Religious Landscape Studies (columns 1-2).
pages = 1
pandas_options = dict(index_col=0, usecols=[0,3,4,5,6,7,8,9,10,11,12])
df = read_pdf('Detailed-Tables-v1-FOR-WEB.pdf', pages=pages, pandas_options=pandas_options)
df
2009 | 2010 | 2011 | 2012 | 2013 | 2014.1 | 2015 | 2016 | 2017 | 2019 | |
---|---|---|---|---|---|---|---|---|---|---|
Christian | 77 | 76 | 75 | 73 | 73 | 71.0 | 69.0 | 68.0 | 67.0 | 65 |
Protestant | 51 | 51 | 50 | 49 | 49 | 48.0 | 46.0 | 44.0 | 45.0 | 43 |
Catholic | 23 | 23 | 23 | 22 | 22 | 21.0 | 21.0 | 21.0 | 20.0 | 20 |
Mormon | 2 | 2 | 2 | 2 | 2 | 2.0 | 2.0 | 2.0 | 2.0 | 2 |
Orthodox Christian | 1 | 1 | 1 | 1 | 1 | 1.0 | 1.0 | 1.0 | 1.0 | <1 |
Other Christian | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Non-Christian faiths | 5 | 5 | 5 | 6 | 5 | 6.0 | 6.0 | 7.0 | 6.0 | 7 |
Jewish | 2 | 2 | 2 | 2 | 2 | 2.0 | 1.0 | 2.0 | 2.0 | 2 |
Muslim | 1 | 1 | 1 | 1 | 1 | 1.0 | 1.0 | 1.0 | 1.0 | 1 |
Buddhist | 1 | <1 | 1 | 1 | 1 | 1.0 | 1.0 | 1.0 | 1.0 | 1 |
Hindu | <1 | <1 | <1 | <1 | <1 | 1.0 | 1.0 | 1.0 | 1.0 | 1 |
Other | 2 | 2 | 2 | 2 | 2 | 2.0 | 2.0 | 3.0 | 2.0 | 3 |
Unaffiliated | 17 | 17 | 19 | 19 | 20 | 21.0 | 24.0 | 23.0 | 25.0 | 26 |
Atheist | 2 | 2 | 2 | 2 | 3 | 3.0 | 4.0 | 4.0 | 4.0 | 4 |
Agnostic | 3 | 2 | 3 | 3 | 3 | 3.0 | 4.0 | 4.0 | 5.0 | 5 |
Nothing in particular | 12 | 12 | 14 | 14 | 14 | 14.0 | 16.0 | 16.0 | 16.0 | 17 |
Don’t know/refused | 2 | 2 | 1 | 2 | 2 | 1.0 | 1.0 | 2.0 | 2.0 | 2 |
NaN | 100 | 100 | 100 | 100 | 100 | 100.0 | 100.0 | 100.0 | 100.0 | 100 |
Because there are two columns with the heading 2014
, tabula-py
makes one of them 2014.1
. So I'll rename it.
df = df.rename(columns={'2014.1': '2014'})
df
2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2019 | |
---|---|---|---|---|---|---|---|---|---|---|
Christian | 77 | 76 | 75 | 73 | 73 | 71.0 | 69.0 | 68.0 | 67.0 | 65 |
Protestant | 51 | 51 | 50 | 49 | 49 | 48.0 | 46.0 | 44.0 | 45.0 | 43 |
Catholic | 23 | 23 | 23 | 22 | 22 | 21.0 | 21.0 | 21.0 | 20.0 | 20 |
Mormon | 2 | 2 | 2 | 2 | 2 | 2.0 | 2.0 | 2.0 | 2.0 | 2 |
Orthodox Christian | 1 | 1 | 1 | 1 | 1 | 1.0 | 1.0 | 1.0 | 1.0 | <1 |
Other Christian | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Non-Christian faiths | 5 | 5 | 5 | 6 | 5 | 6.0 | 6.0 | 7.0 | 6.0 | 7 |
Jewish | 2 | 2 | 2 | 2 | 2 | 2.0 | 1.0 | 2.0 | 2.0 | 2 |
Muslim | 1 | 1 | 1 | 1 | 1 | 1.0 | 1.0 | 1.0 | 1.0 | 1 |
Buddhist | 1 | <1 | 1 | 1 | 1 | 1.0 | 1.0 | 1.0 | 1.0 | 1 |
Hindu | <1 | <1 | <1 | <1 | <1 | 1.0 | 1.0 | 1.0 | 1.0 | 1 |
Other | 2 | 2 | 2 | 2 | 2 | 2.0 | 2.0 | 3.0 | 2.0 | 3 |
Unaffiliated | 17 | 17 | 19 | 19 | 20 | 21.0 | 24.0 | 23.0 | 25.0 | 26 |
Atheist | 2 | 2 | 2 | 2 | 3 | 3.0 | 4.0 | 4.0 | 4.0 | 4 |
Agnostic | 3 | 2 | 3 | 3 | 3 | 3.0 | 4.0 | 4.0 | 5.0 | 5 |
Nothing in particular | 12 | 12 | 14 | 14 | 14 | 14.0 | 16.0 | 16.0 | 16.0 | 17 |
Don’t know/refused | 2 | 2 | 1 | 2 | 2 | 1.0 | 1.0 | 2.0 | 2.0 | 2 |
NaN | 100 | 100 | 100 | 100 | 100 | 100.0 | 100.0 | 100.0 | 100.0 | 100 |
I transpose the DataFrame so the dates are the index.
df = df.transpose()
df
Christian | Protestant | Catholic | Mormon | Orthodox Christian | Other Christian | Non-Christian faiths | Jewish | Muslim | Buddhist | Hindu | Other | Unaffiliated | Atheist | Agnostic | Nothing in particular | Don’t know/refused | NaN | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2009 | 77 | 51 | 23 | 2 | 1 | NaN | 5 | 2 | 1 | 1 | <1 | 2 | 17 | 2 | 3 | 12 | 2 | 100 |
2010 | 76 | 51 | 23 | 2 | 1 | NaN | 5 | 2 | 1 | <1 | <1 | 2 | 17 | 2 | 2 | 12 | 2 | 100 |
2011 | 75 | 50 | 23 | 2 | 1 | NaN | 5 | 2 | 1 | 1 | <1 | 2 | 19 | 2 | 3 | 14 | 1 | 100 |
2012 | 73 | 49 | 22 | 2 | 1 | NaN | 6 | 2 | 1 | 1 | <1 | 2 | 19 | 2 | 3 | 14 | 2 | 100 |
2013 | 73 | 49 | 22 | 2 | 1 | NaN | 5 | 2 | 1 | 1 | <1 | 2 | 20 | 3 | 3 | 14 | 2 | 100 |
2014 | 71 | 48 | 21 | 2 | 1 | NaN | 6 | 2 | 1 | 1 | 1 | 2 | 21 | 3 | 3 | 14 | 1 | 100 |
2015 | 69 | 46 | 21 | 2 | 1 | NaN | 6 | 1 | 1 | 1 | 1 | 2 | 24 | 4 | 4 | 16 | 1 | 100 |
2016 | 68 | 44 | 21 | 2 | 1 | NaN | 7 | 2 | 1 | 1 | 1 | 3 | 23 | 4 | 4 | 16 | 2 | 100 |
2017 | 67 | 45 | 20 | 2 | 1 | NaN | 6 | 2 | 1 | 1 | 1 | 2 | 25 | 4 | 5 | 16 | 2 | 100 |
2019 | 65 | 43 | 20 | 2 | <1 | NaN | 7 | 2 | 1 | 1 | 1 | 3 | 26 | 4 | 5 | 17 | 2 | 100 |
Now I can convert the index from string to int.
df.index = df.index.astype(int)
df
Christian | Protestant | Catholic | Mormon | Orthodox Christian | Other Christian | Non-Christian faiths | Jewish | Muslim | Buddhist | Hindu | Other | Unaffiliated | Atheist | Agnostic | Nothing in particular | Don’t know/refused | NaN | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2009 | 77 | 51 | 23 | 2 | 1 | NaN | 5 | 2 | 1 | 1 | <1 | 2 | 17 | 2 | 3 | 12 | 2 | 100 |
2010 | 76 | 51 | 23 | 2 | 1 | NaN | 5 | 2 | 1 | <1 | <1 | 2 | 17 | 2 | 2 | 12 | 2 | 100 |
2011 | 75 | 50 | 23 | 2 | 1 | NaN | 5 | 2 | 1 | 1 | <1 | 2 | 19 | 2 | 3 | 14 | 1 | 100 |
2012 | 73 | 49 | 22 | 2 | 1 | NaN | 6 | 2 | 1 | 1 | <1 | 2 | 19 | 2 | 3 | 14 | 2 | 100 |
2013 | 73 | 49 | 22 | 2 | 1 | NaN | 5 | 2 | 1 | 1 | <1 | 2 | 20 | 3 | 3 | 14 | 2 | 100 |
2014 | 71 | 48 | 21 | 2 | 1 | NaN | 6 | 2 | 1 | 1 | 1 | 2 | 21 | 3 | 3 | 14 | 1 | 100 |
2015 | 69 | 46 | 21 | 2 | 1 | NaN | 6 | 1 | 1 | 1 | 1 | 2 | 24 | 4 | 4 | 16 | 1 | 100 |
2016 | 68 | 44 | 21 | 2 | 1 | NaN | 7 | 2 | 1 | 1 | 1 | 3 | 23 | 4 | 4 | 16 | 2 | 100 |
2017 | 67 | 45 | 20 | 2 | 1 | NaN | 6 | 2 | 1 | 1 | 1 | 2 | 25 | 4 | 5 | 16 | 2 | 100 |
2019 | 65 | 43 | 20 | 2 | <1 | NaN | 7 | 2 | 1 | 1 | 1 | 3 | 26 | 4 | 5 | 17 | 2 | 100 |
And write the result to a file.
df.to_csv('pew_religion_table1.csv')