Pew Research data on religious affiliation¶

Elements of Data Science

Loading and cleaning data from Pew Research Center, Religion and Public Life.

"In U.S., Decline of Christianity Continues at Rapid Pace: An update on America's changing religious landscape", October 17, 2019.

We can download the tables directly from Pew Research:

In [1]:

import os

if not os.path.exists('Detailed-Tables-v1-FOR-WEB.pdf'):
    !wget https://www.pewforum.org/wp-content/uploads/sites/7/2019/10/Detailed-Tables-v1-FOR-WEB.pdf

--2019-10-24 15:21:35--  https://www.pewforum.org/wp-content/uploads/sites/7/2019/10/Detailed-Tables-v1-FOR-WEB.pdf
Resolving www.pewforum.org (www.pewforum.org)... 192.0.66.2
Connecting to www.pewforum.org (www.pewforum.org)|192.0.66.2|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1069037 (1.0M) [application/pdf]
Saving to: ‘Detailed-Tables-v1-FOR-WEB.pdf’

Detailed-Tables-v1- 100%[===================>]   1.02M  --.-KB/s    in 0.07s   

2019-10-24 15:21:36 (13.9 MB/s) - ‘Detailed-Tables-v1-FOR-WEB.pdf’ saved [1069037/1069037]

The data are in a table in a PDF document, so I'll use tabula-py to read them.

In [8]:

import sys
IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    !pip install tabula-py

In [9]:

from tabula import read_pdf
import pandas as pd

I'll read the table from page 1, specifying which column to use as the index and which columns to read. I'm using data from the Aggregated Political Surveys (columns 3-12) and skipping the Religious Landscape Studies (columns 1-2).

In [3]:

pages = 1
pandas_options = dict(index_col=0, usecols=[0,3,4,5,6,7,8,9,10,11,12])
df = read_pdf('Detailed-Tables-v1-FOR-WEB.pdf', pages=pages, pandas_options=pandas_options)
df

Out[3]:

	2009	2010	2011	2012	2013	2014.1	2015	2016	2017	2019
Christian	77	76	75	73	73	71.0	69.0	68.0	67.0	65
Protestant	51	51	50	49	49	48.0	46.0	44.0	45.0	43
Catholic	23	23	23	22	22	21.0	21.0	21.0	20.0	20
Mormon	2	2	2	2	2	2.0	2.0	2.0	2.0	2
Orthodox Christian	1	1	1	1	1	1.0	1.0	1.0	1.0	<1
Other Christian	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Non-Christian faiths	5	5	5	6	5	6.0	6.0	7.0	6.0	7
Jewish	2	2	2	2	2	2.0	1.0	2.0	2.0	2
Muslim	1	1	1	1	1	1.0	1.0	1.0	1.0	1
Buddhist	1	<1	1	1	1	1.0	1.0	1.0	1.0	1
Hindu	<1	<1	<1	<1	<1	1.0	1.0	1.0	1.0	1
Other	2	2	2	2	2	2.0	2.0	3.0	2.0	3
Unaffiliated	17	17	19	19	20	21.0	24.0	23.0	25.0	26
Atheist	2	2	2	2	3	3.0	4.0	4.0	4.0	4
Agnostic	3	2	3	3	3	3.0	4.0	4.0	5.0	5
Nothing in particular	12	12	14	14	14	14.0	16.0	16.0	16.0	17
Don’t know/refused	2	2	1	2	2	1.0	1.0	2.0	2.0	2
NaN	100	100	100	100	100	100.0	100.0	100.0	100.0	100

Because there are two columns with the heading 2014, tabula-py makes one of them 2014.1. So I'll rename it.

In [4]:

df = df.rename(columns={'2014.1': '2014'})
df

Out[4]:

	2009	2010	2011	2012	2013	2014	2015	2016	2017	2019
Christian	77	76	75	73	73	71.0	69.0	68.0	67.0	65
Protestant	51	51	50	49	49	48.0	46.0	44.0	45.0	43
Catholic	23	23	23	22	22	21.0	21.0	21.0	20.0	20
Mormon	2	2	2	2	2	2.0	2.0	2.0	2.0	2
Orthodox Christian	1	1	1	1	1	1.0	1.0	1.0	1.0	<1
Other Christian	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Non-Christian faiths	5	5	5	6	5	6.0	6.0	7.0	6.0	7
Jewish	2	2	2	2	2	2.0	1.0	2.0	2.0	2
Muslim	1	1	1	1	1	1.0	1.0	1.0	1.0	1
Buddhist	1	<1	1	1	1	1.0	1.0	1.0	1.0	1
Hindu	<1	<1	<1	<1	<1	1.0	1.0	1.0	1.0	1
Other	2	2	2	2	2	2.0	2.0	3.0	2.0	3
Unaffiliated	17	17	19	19	20	21.0	24.0	23.0	25.0	26
Atheist	2	2	2	2	3	3.0	4.0	4.0	4.0	4
Agnostic	3	2	3	3	3	3.0	4.0	4.0	5.0	5
Nothing in particular	12	12	14	14	14	14.0	16.0	16.0	16.0	17
Don’t know/refused	2	2	1	2	2	1.0	1.0	2.0	2.0	2
NaN	100	100	100	100	100	100.0	100.0	100.0	100.0	100

I transpose the DataFrame so the dates are the index.

In [5]:

df = df.transpose()
df

Out[5]:

	Christian	Protestant	Catholic	Mormon	Orthodox Christian	Other Christian	Non-Christian faiths	Jewish	Muslim	Buddhist	Hindu	Other	Unaffiliated	Atheist	Agnostic	Nothing in particular	Don’t know/refused	NaN
2009	77	51	23	2	1	NaN	5	2	1	1	<1	2	17	2	3	12	2	100
2010	76	51	23	2	1	NaN	5	2	1	<1	<1	2	17	2	2	12	2	100
2011	75	50	23	2	1	NaN	5	2	1	1	<1	2	19	2	3	14	1	100
2012	73	49	22	2	1	NaN	6	2	1	1	<1	2	19	2	3	14	2	100
2013	73	49	22	2	1	NaN	5	2	1	1	<1	2	20	3	3	14	2	100
2014	71	48	21	2	1	NaN	6	2	1	1	1	2	21	3	3	14	1	100
2015	69	46	21	2	1	NaN	6	1	1	1	1	2	24	4	4	16	1	100
2016	68	44	21	2	1	NaN	7	2	1	1	1	3	23	4	4	16	2	100
2017	67	45	20	2	1	NaN	6	2	1	1	1	2	25	4	5	16	2	100
2019	65	43	20	2	<1	NaN	7	2	1	1	1	3	26	4	5	17	2	100

Now I can convert the index from string to int.

In [6]:

df.index = df.index.astype(int)
df

Out[6]:

	Christian	Protestant	Catholic	Mormon	Orthodox Christian	Other Christian	Non-Christian faiths	Jewish	Muslim	Buddhist	Hindu	Other	Unaffiliated	Atheist	Agnostic	Nothing in particular	Don’t know/refused	NaN
2009	77	51	23	2	1	NaN	5	2	1	1	<1	2	17	2	3	12	2	100
2010	76	51	23	2	1	NaN	5	2	1	<1	<1	2	17	2	2	12	2	100
2011	75	50	23	2	1	NaN	5	2	1	1	<1	2	19	2	3	14	1	100
2012	73	49	22	2	1	NaN	6	2	1	1	<1	2	19	2	3	14	2	100
2013	73	49	22	2	1	NaN	5	2	1	1	<1	2	20	3	3	14	2	100
2014	71	48	21	2	1	NaN	6	2	1	1	1	2	21	3	3	14	1	100
2015	69	46	21	2	1	NaN	6	1	1	1	1	2	24	4	4	16	1	100
2016	68	44	21	2	1	NaN	7	2	1	1	1	3	23	4	4	16	2	100
2017	67	45	20	2	1	NaN	6	2	1	1	1	2	25	4	5	16	2	100
2019	65	43	20	2	<1	NaN	7	2	1	1	1	3	26	4	5	17	2	100

And write the result to a file.

In [7]:

df.to_csv('pew_religion_table1.csv')

In [ ]: