By Christine Zhang (Knight-Mozilla / Los Angeles Times) & Ryan Menezes (Los Angeles Times)
IRE Conference -- New Orleans, LA
June 18, 2016
This workshop is a basic introduction to R, a free, open-source software for data analysis and statistics.
R is a powerful tool that can help you quickly and effectively answer questions using data.
Take our host city, New Orleans, for example. Hurricane Katrina, which struck in August 2005, was a devastating natural disaster that substantially reduced the population of New Orleans. The storm's timing coincidentally falls between the U.S. Census full population counts of 2000 and 2010.
In this session, we will use the "Demographic Profile" -- a large summary file with many different demographic variables downloaded from the U.S. Census Bureau website -- from 2000 and 2010, for all census tracts in the state of Louisiana.
In this session, we will load, clean, merge and analyze these two data sets.
Basic analysis techniques like the ones you will learn in this class can help you write data-driven stories, like this one from The Times-Picayune, published shortly after the Census Bureau released its 2010 tally.
The story begins:
Five years after Hurricane Katrina emptied New Orleans and prompted the largest mass migration in modern American history, the 2010 Census counted 343,829 people living in the still-recovering city, a 29 percent drop since the last head count a decade ago, according to data released today.
Using the data we have, we will attempt to replicate the calculations in that lede.
The following code and annotations were written in a Jupyter notebook. The code is best run in RStudio version 0.99.902 using R version 3.3.0.
We'll start by loading in the 2000 data, which is stored in a CSV (comma-separated values) file. CSVs are plain-text files of data where commas separate the columns within a line. It is sometimes preferable to work with CSVs as opposed to files of a proprietary format, such as Microsoft Excel files, but the Census Bureau readily makes data available in both formats.
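To make the format concrete, here is what the first few lines of a small CSV could look like (an invented illustration of ours, not the actual census file):

fips.code,parish,population
22071001744,Orleans Parish,3250
22071001745,Orleans Parish,2987

Each line is a row, and each comma marks a column boundary.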
Let's run R's read.csv command and save the data to an object called census2000. Here, we are using assignment with <-, which tells R to run the right side and assign the result to the object named on the left.
census2000 <- read.csv('2000_census_demographic_profile.csv')
Now that this ran without incident, let's inspect the first few rows using head, which by default prints out the first six rows of a data frame (R's internal term for a spreadsheet):
head(census2000)
Upon inspection, we can see that the file came with two header rows. R, by default, takes the first row of a CSV to be the header. We clearly do not need the first one, so we can rerun the read.csv command and tell it so:
census2000 <- read.csv('2000_census_demographic_profile.csv', skip = 1)
head(census2000)
Visually, we can see that this data set is very wide. In fact, there are 195 columns.
Spaces are not allowed in R column names. That's why they've been automatically converted to periods, as in Number..Total.population.
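For the curious, this conversion is done by base R's make.names function, which read.csv applies to the header row when its check.names argument is TRUE (the default). You can watch it work on a toy header:

make.names(c('Number Total population', 'Id2'))
# returns "Number.Total.population" "Id2" -- spaces and other illegal characters become periods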
Let's keep a handful of these:

- Id2: This is what the Census Bureau calls a FIPS code. It is a unique numerical identifier for all census tracts. This will be important when we join our two data sets together.
- Geography: This is a text description of the tract, with the parish name.
- Number..Total.population: The total population of the tract.
- Number..HOUSING.OCCUPANCY...Total.housing.units, Number..HOUSING.OCCUPANCY...Total.housing.units...Occupied.housing.units and Number..HOUSING.OCCUPANCY...Total.housing.units...Vacant.housing.units: The total, occupied and vacant housing units.

To help us trim the data set to just these six columns, we are going to import a package. There are thousands of packages for R created by the open-source community, which improve on what is included in R by default.
The one we will use here is called dplyr.
## if dplyr was not installed we would have to run this
# install.packages('dplyr')
## to import the package and all of its functions
library('dplyr')
From dplyr, we will use the select function to trim the data set and save it to a new variable called census2000.trimmed:
census2000.trimmed <- select(
census2000, # name of the data frame
# list of all the six column names we want to keep
Id2,
Geography,
Number..Total.population,
Number..HOUSING.OCCUPANCY...Total.housing.units,
Number..HOUSING.OCCUPANCY...Total.housing.units...Occupied.housing.units,
Number..HOUSING.OCCUPANCY...Total.housing.units...Vacant.housing.units
)
head(census2000.trimmed)
This shows us that we were able to select the columns correctly. But one lingering issue is that these column names are long and unwieldy. Since we are going to be typing them often, let's rename them to shorter, more convenient versions:
colnames(census2000.trimmed) <- c(
'fips.code', 'geography', 'population',
'total.housing.units', 'occupied.housing.units', 'vacant.housing.units'
)
head(census2000.trimmed)
Another helpful command to run on any data set is str, which gives you the structure of the variable as defined by R:
str(census2000.trimmed)
The structure tells us that this is a data frame with 1106 rows and six columns. It further tells us the type of each column.
Notice how the FIPS code read in as a number but the other numeric columns read in as "factors"? That's R-speak for a categorical variable, and character columns are set to this type by default. This happened because the numbers in those columns contain commas. The presence of a single non-numeric character makes R treat the entire column as strings. This will be an issue later when we try to add two numbers together, as R doesn't know how to add two characters.
The solution: we need to remove the comma from all the strings, then recast the variable as a number.
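You can see the problem on a single toy value (our own illustration, not the workshop data):

as.numeric('343,829') # the comma defeats the conversion; R returns NA with a warning
as.numeric('343829')  # without the comma, R happily returns 343829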
To help with this, we are going to use another package called stringr, and a function from within it called str_replace:
# install.packages('stringr')
library('stringr')
Let's start with the population variable. First, let's remove the comma and write the result back to the original column. (The format for calling a column from a data frame in R is df.name$column.name.)
census2000.trimmed$population <- str_replace(
census2000.trimmed$population,
pattern = ',',
replacement = ''
)
Then we'll visually inspect the head:
head(census2000.trimmed)
This appeared to work. But R will still think this is a character variable unless we explicitly tell it otherwise:
census2000.trimmed$population <- as.numeric(census2000.trimmed$population)
Running str will help us ensure this worked:
str(census2000.trimmed)
For the rest of the columns, we can nest str_replace within as.numeric to do both steps at once:
census2000.trimmed$total.housing.units <- as.numeric(str_replace(census2000.trimmed$total.housing.units, pattern = ',', replacement = ''))
census2000.trimmed$occupied.housing.units <- as.numeric(str_replace(census2000.trimmed$occupied.housing.units, pattern = ',', replacement = ''))
census2000.trimmed$vacant.housing.units <- as.numeric(str_replace(census2000.trimmed$vacant.housing.units, pattern = ',', replacement = ''))
str(census2000.trimmed)
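As an aside, those three near-identical lines could also be written as a loop over the column names. This sketch is our addition, not part of the original workshop code, but it does the same work:

for (col in c('total.housing.units', 'occupied.housing.units', 'vacant.housing.units')) {
  # strip the comma, then recast the column as a number
  census2000.trimmed[[col]] <- as.numeric(
    str_replace(census2000.trimmed[[col]], pattern = ',', replacement = '')
  )
}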
By default, head will print the first six lines. But we can override the default to show as many as we want (we'll show 10 here):
head(census2000.trimmed, n = 10)
That worked!
But in the interest of full disclosure, you should know that we added those commas to the original CSVs from the Census Bureau to facilitate this exercise. “Commafied” numbers are one of the most frequent stumbling blocks to creating a cleaned data set.
For our last cleaning exercise, we'll work with the geography column. It packs a lot of information into one field, but it would be more useful if the census tract, parish name and state were separated, to help us aggregate some of these numbers.
The package tidyr has a function that helps us do just that:
# install.packages('tidyr')
library('tidyr')
Should you run into a function and not know what arguments it takes, typing a question mark, followed by the function name and a pair of empty parentheses, will pull up the documentation for that function:
# ?separate()
census2000.trimmed <- separate(
census2000.trimmed, # name of the data frame
geography, # column to split
c('tract', 'parish', 'state'), # new column names
', ' # delimiter to split on (note the space after the comma)
)
head(census2000.trimmed)
Our data set is now as clean as we need it to be.
Let's summarize it with a frequency table of the parish names:
table(census2000.trimmed$parish)
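If you would rather see the parishes ranked by their number of tracts, you can wrap the table in sort (a small addition of ours, not in the original notebook):

sort(table(census2000.trimmed$parish), decreasing = TRUE)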
Now we need to run all of the above cleaning steps on the 2010 data:
census2010 <- read.csv('2010_census_demographic_profile.csv', skip = 1)
census2010.trimmed <- select(
census2010, # name of the data frame
# list of all the column names we want to keep
Id2, Geography, Number..SEX.AND.AGE...Total.population,
Number..HOUSING.OCCUPANCY...Total.housing.units,
Number..HOUSING.OCCUPANCY...Total.housing.units...Occupied.housing.units,
Number..HOUSING.OCCUPANCY...Total.housing.units...Vacant.housing.units
)
colnames(census2010.trimmed) <- c('fips.code', 'census.tract', 'population',
'total.housing.units', 'occupied.housing.units', 'vacant.housing.units')
census2010.trimmed$population <- as.numeric(str_replace(census2010.trimmed$population, pattern = ',', replacement = ''))
census2010.trimmed$total.housing.units <- as.numeric(str_replace(census2010.trimmed$total.housing.units, pattern = ',', replacement = ''))
census2010.trimmed$occupied.housing.units <- as.numeric(str_replace(census2010.trimmed$occupied.housing.units, pattern = ',', replacement = ''))
census2010.trimmed$vacant.housing.units <- as.numeric(str_replace(census2010.trimmed$vacant.housing.units, pattern = ',', replacement = ''))
census2010.trimmed <- separate(census2010.trimmed, census.tract, c('tract', 'parish', 'state'), ', ')
orleans2010 <- filter(census2010.trimmed, parish == 'Orleans Parish')
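Repeating every step by hand works, but it invites typos. As an aside, the whole cleaning pipeline could be wrapped in a reusable function. The sketch below is our addition, not part of the original workshop; it leans on dplyr's one_of helper so the column names can be passed as plain strings:

# clean.census is a hypothetical helper: read, trim, rename, de-comma and split one profile file
clean.census <- function(path, population.col) {
  trimmed <- read.csv(path, skip = 1)
  trimmed <- select(trimmed, one_of(c(
    'Id2', 'Geography', population.col,
    'Number..HOUSING.OCCUPANCY...Total.housing.units',
    'Number..HOUSING.OCCUPANCY...Total.housing.units...Occupied.housing.units',
    'Number..HOUSING.OCCUPANCY...Total.housing.units...Vacant.housing.units'
  )))
  colnames(trimmed) <- c('fips.code', 'geography', 'population',
                         'total.housing.units', 'occupied.housing.units',
                         'vacant.housing.units')
  # strip commas and recast each count column as a number
  for (col in c('population', 'total.housing.units',
                'occupied.housing.units', 'vacant.housing.units')) {
    trimmed[[col]] <- as.numeric(str_replace(trimmed[[col]], pattern = ',', replacement = ''))
  }
  separate(trimmed, geography, c('tract', 'parish', 'state'), ', ')
}

# this call would reproduce the census2010.trimmed we just built by hand
# census2010.trimmed <- clean.census('2010_census_demographic_profile.csv',
#                                    'Number..SEX.AND.AGE...Total.population')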
Now that we've cleaned both of our data files, let's merge the 2000 and 2010 data. Merging allows you to link two data sets on values common to both. It is a powerful operation that cannot be done with nearly as much ease or versatility in a program like Excel.
In this case, we know that the FIPS code and the character names for most of the tracts should be consistent across the 10-year period.
However, census tracts are added, deleted, split and joined over the course of 10 years. We will make sure to keep all entries from both years. This is what is referred to as a "full outer join." If we were to keep only the rows common to both data frames (R's default behavior), we would lose some data.
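The difference is easy to see on toy data (an invented example of ours, not the census files):

a <- data.frame(fips = c(1, 2), pop00 = c(100, 200))
b <- data.frame(fips = c(2, 3), pop10 = c(210, 330))
merge(a, b, by = 'fips')              # default inner join: only fips 2 survives
merge(a, b, by = 'fips', all = TRUE)  # full outer join: all three rows kept, with NAs filling the gaps

Now let's run the full outer join on our real data: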
census.comparison <- merge(
census2000.trimmed, # first data frame
census2010.trimmed, # second data frame
by = c('fips.code', 'tract', 'parish', 'state'), # keys to use for join
suffixes = c('.00', '.10'), # suffixes to append to new columns
all = TRUE # specifying to keep all data from both data frames
)
Let's inspect a portion of the data frame where there are full matches and partial matches:
census.comparison[65:69, ]
Saving your intermediate work to a file is often good practice, so we will write the results of our merge to a CSV (you can do this with any data frame you create in R).
write.csv(census.comparison, 'census_comparison_result.csv', row.names = FALSE)
Let's filter our merged data frame down to just Orleans Parish. Orleans Parish and the city of New Orleans are "coterminous" (that is, they share the same boundaries), so this will isolate only the census tracts of the city.
# note the use of "==" since we are expressing a criterion
orleans <- filter(census.comparison, parish == 'Orleans Parish')
head(orleans)
Now we can do some quick calculations with our new merged data frame for New Orleans.
First question: What was the population of New Orleans in 2000?
That requires summing up the 2000 population column like so:
sum(orleans$population.00)
Why didn't this work? Let's inspect the population.00 variable using summary:
summary(orleans$population.00)
This reveals that there are 30 census tracts with NA, or missing, values for population.00. By default, R does not compute the sum of a column if there are missing values. We'll have to tell it to ignore those missing values by specifying na.rm = TRUE:
sum(orleans$population.00, na.rm = TRUE)
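You can verify this behavior on a toy vector (another illustration of ours):

sum(c(1, 2, NA))               # returns NA because of the missing value
sum(c(1, 2, NA), na.rm = TRUE) # returns 3 after dropping the NA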
Second question: What was the population of New Orleans in 2010?
sum(orleans$population.10, na.rm = TRUE)
This matches the story exactly.
Last question: What was the percent change in New Orleans population between 2000 and 2010?
To do this, we'll first save each population calculation to new objects. Then we'll create another object to store the percent change.
nola2000pop <- sum(orleans$population.00, na.rm = TRUE)
nola2010pop <- sum(orleans$population.10, na.rm = TRUE)
perc.change.nola <- (nola2010pop - nola2000pop)/nola2000pop * 100
print(paste('The percent change in New Orleans population since 2000 is ', round(perc.change.nola), '%', sep = ''))
Again, we see that this matches the 29 percent drop cited in The Times-Picayune article (our number is negative because the population fell). Yay!
This concludes our workshop, Getting started with R.
We'll use the merged data CSV we saved above to do further analysis in our next workshop, More with R. Take a sneak peek at the notebook here.
Any questions?