Getting started with R

By Christine Zhang (Knight-Mozilla / Los Angeles Times) & Ryan Menezes (Los Angeles Times)

IRE Conference -- New Orleans, LA

June 18, 2016

This workshop is a basic introduction to R, a free, open-source software for data analysis and statistics.

R is a powerful tool that can help you quickly and effectively answer questions using data.

Take our host city, New Orleans, for example. Hurricane Katrina was a devastating natural disaster that substantially affected the population of New Orleans. The hurricane took place in August 2005, which coincidentally falls between the U.S. Census full population counts in 2000 and 2010.

In this session, we will use the "Demographic Profile” -- a large summary file with many different demographic variables downloaded from the U.S. Census Bureau website -- from 2000 and 2010, for all census tracts in the state of Louisiana.

In this session, we will:

  • Load the 2000 data in R
  • Select a few variables pertaining to population counts and housing occupancy
  • Clean the data
  • Focus on the Orleans Parish in particular, which represents the city of New Orleans, and tally the population
  • Perform the same steps on the 2010 data
  • Merge the two sets
  • Write the merged data out to a CSV file for our next class, More with R, where we'll do some more in depth analysis

Basic analysis techniques like the ones you will learn in this class can help you write data-driven stories, like this one written by The Times-Picayune shortly after the census released its 2010 tally.

The story begins:

Five years after Hurricane Katrina emptied New Orleans and prompted the largest mass migration in modern American history, the 2010 Census counted 343,829 people living in the still-recovering city, a 29 percent drop since the last head count a decade ago, according to data released today.

Using the data we have, we will attempt to replicate the calculations in that lede.

The following code and annotations were written in a Jupyter notebook. The code is best run in RStudio version 0.99.902 using R version 3.3.0

Loading data

We'll start by loading in the 2000 data, which is stored in a CSV (comma-separated values) file. CSVs are plain-text files of data where commas separate the columns within a line. It is sometimes preferable to work with CSVs as opposed to files of a proprietary format, such as Microsoft Excel files, but the Census Bureau readily makes data available in both formats.

Let's run R's read.csv command and save the data to an object called census2000. Here, we are using assignment with <-, which tells R to run the right side and assign the result to the object named on the left.

In [1]:
census2000 <- read.csv('2000_census_demographic_profile.csv')

Now that this ran without incident, let's inspect the first few rows using head, which by default prints out the first six rows of a data frame (R's internal term for a spreadsheet):

In [2]:
head(census2000)
Out[2]:
GEO.idGEO.id2GEO.display.labelHC01_VC01HC02_VC01HC01_VC03HC02_VC03HC01_VC04HC02_VC04HC01_VC05HC02_VC05HC01_VC06HC02_VC06HC01_VC07HC02_VC07HC01_VC08HC02_VC08HC01_VC09HC02_VC09HC01_VC10HC02_VC10HC01_VC11HC02_VC11HC01_VC12HC02_VC12HC01_VC13HC02_VC13HC01_VC14HC02_VC14HC01_VC15HC02_VC15HC01_VC16HC02_VC16HC01_VC17HC02_VC17HC01_VC18HC02_VC18HC01_VC19HC02_VC19HC01_VC20HC02_VC20HC01_VC21HC02_VC21HC01_VC22HC02_VC22HC01_VC23HC02_VC23HC01_VC24HC02_VC24HC01_VC25HC02_VC25HC01_VC26HC02_VC26HC01_VC28HC02_VC28HC01_VC29HC02_VC29HC01_VC30HC02_VC30HC01_VC31HC02_VC31HC01_VC32HC02_VC32HC01_VC33HC02_VC33HC01_VC34HC02_VC34HC01_VC35HC02_VC35HC01_VC36HC02_VC36HC01_VC37HC02_VC37HC01_VC38HC02_VC38HC01_VC39HC02_VC39HC01_VC40HC02_VC40HC01_VC41HC02_VC41HC01_VC42HC02_VC42HC01_VC43HC02_VC43HC01_VC44HC02_VC44HC01_VC45HC02_VC45HC01_VC46HC02_VC46HC01_VC48HC02_VC48HC01_VC49HC02_VC49HC01_VC50HC02_VC50HC01_VC51HC02_VC51HC01_VC52HC02_VC52HC01_VC53HC02_VC53HC01_VC55HC02_VC55HC01_VC56HC02_VC56HC01_VC57HC02_VC57HC01_VC58HC02_VC58HC01_VC59HC02_VC59HC01_VC60HC02_VC60HC01_VC61HC02_VC61HC01_VC62HC02_VC62HC01_VC64HC02_VC64HC01_VC65HC02_VC65HC01_VC66HC02_VC66HC01_VC67HC02_VC67HC01_VC68HC02_VC68HC01_VC69HC02_VC69HC01_VC70HC02_VC70HC01_VC71HC02_VC71HC01_VC72HC02_VC72HC01_VC73HC02_VC73HC01_VC74HC02_VC74HC01_VC75HC02_VC75HC01_VC76HC02_VC76HC01_VC78HC02_VC78HC01_VC79HC02_VC79HC01_VC80HC02_VC80HC01_VC81HC02_VC81HC01_VC82HC02_VC82HC01_VC83HC02_VC83HC01_VC84HC02_VC84HC01_VC85HC02_VC85HC01_VC86HC02_VC86HC01_VC87HC02_VC87HC01_VC88HC02_VC88HC01_VC89HC02_VC89HC01_VC90HC02_VC90HC01_VC91HC02_VC91HC01_VC93HC02_VC93HC01_VC94HC02_VC94HC01_VC95HC02_VC95HC01_VC96HC02_VC96HC01_VC97HC02_VC97HC01_VC98HC02_VC98HC01_VC100HC02_VC100HC01_VC101HC02_VC101HC01_VC102HC02_VC102HC01_VC103HC02_VC103HC01_VC104HC02_VC104
1IdId2GeographyNumber; Total populationPercent; Total populationNumber; Total population - SEX AND AGE - MalePercent; Total population - SEX AND AGE - MaleNumber; Total population - SEX AND AGE - FemalePercent; Total population - SEX AND AGE - FemaleNumber; Total population - SEX AND AGE - Under 5 yearsPercent; Total population - SEX AND AGE - Under 5 yearsNumber; Total population - SEX AND AGE - 5 to 9 yearsPercent; Total population - SEX AND AGE - 5 to 9 yearsNumber; Total population - SEX AND AGE - 10 to 14 yearsPercent; Total population - SEX AND AGE - 10 to 14 yearsNumber; Total population - SEX AND AGE - 15 to 19 yearsPercent; Total population - SEX AND AGE - 15 to 19 yearsNumber; Total population - SEX AND AGE - 20 to 24 yearsPercent; Total population - SEX AND AGE - 20 to 24 yearsNumber; Total population - SEX AND AGE - 25 to 34 yearsPercent; Total population - SEX AND AGE - 25 to 34 yearsNumber; Total population - SEX AND AGE - 35 to 44 yearsPercent; Total population - SEX AND AGE - 35 to 44 yearsNumber; Total population - SEX AND AGE - 45 to 54 yearsPercent; Total population - SEX AND AGE - 45 to 54 yearsNumber; Total population - SEX AND AGE - 55 to 59 yearsPercent; Total population - SEX AND AGE - 55 to 59 yearsNumber; Total population - SEX AND AGE - 60 to 64 yearsPercent; Total population - SEX AND AGE - 60 to 64 yearsNumber; Total population - SEX AND AGE - 65 to 74 yearsPercent; Total population - SEX AND AGE - 65 to 74 yearsNumber; Total population - SEX AND AGE - 75 to 84 yearsPercent; Total population - SEX AND AGE - 75 to 84 yearsNumber; Total population - SEX AND AGE - 85 years and overPercent; Total population - SEX AND AGE - 85 years and overNumber; Total population - SEX AND AGE - Median age (years)Percent; Total population - SEX AND AGE - Median age (years)Number; Total population - SEX AND AGE - 18 years and overPercent; Total population - SEX AND AGE - 18 years and overNumber; Total population - SEX AND AGE - 18 years and over - MalePercent; Total population - SEX AND AGE - 18 years and over - MaleNumber; Total population - SEX AND AGE - 18 years and over - FemalePercent; Total population - SEX AND AGE - 18 years and over - FemaleNumber; Total population - SEX AND AGE - 21 years and overPercent; Total population - SEX AND AGE - 21 years and overNumber; Total population - SEX AND AGE - 62 years and overPercent; Total population - SEX AND AGE - 62 years and overNumber; Total population - SEX AND AGE - 65 years and overPercent; Total population - SEX AND AGE - 65 years and overNumber; Total population - SEX AND AGE - 65 years and over - MalePercent; Total population - SEX AND AGE - 65 years and over - MaleNumber; Total population - SEX AND AGE - 65 years and over - FemalePercent; Total population - SEX AND AGE - 65 years and over - FemaleNumber; Total population - RACE - One racePercent; Total population - RACE - One raceNumber; Total population - RACE - One race - WhitePercent; Total population - RACE - One race - WhiteNumber; Total population - RACE - One race - Black or African AmericanPercent; Total population - RACE - One race - Black or African AmericanNumber; Total population - RACE - One race - American Indian and Alaska NativePercent; Total population - RACE - One race - American Indian and Alaska NativeNumber; Total population - RACE - One race - AsianPercent; Total population - RACE - One race - AsianNumber; Total population - RACE - One race - Asian - Asian IndianPercent; Total population - RACE - One race - Asian - Asian IndianNumber; Total population - RACE - One race - Asian - ChinesePercent; Total population - RACE - One race - Asian - ChineseNumber; Total population - RACE - One race - Asian - FilipinoPercent; Total population - RACE - One race - Asian - FilipinoNumber; Total population - RACE - One race - Asian - JapanesePercent; Total population - RACE - One race - Asian - JapaneseNumber; Total population - RACE - One race - Asian - KoreanPercent; Total population - RACE - One race - Asian - KoreanNumber; Total population - RACE - One race - Asian - VietnamesePercent; Total population - RACE - One race - Asian - VietnameseNumber; Total population - RACE - One race - Asian - Other Asian [1]Percent; Total population - RACE - One race - Asian - Other Asian [1]Number; Total population - RACE - One race - Native Hawaiian and Other Pacific IslanderPercent; Total population - RACE - One race - Native Hawaiian and Other Pacific IslanderNumber; Total population - RACE - One race - Native Hawaiian and Other Pacific Islander - Native HawaiianPercent; Total population - RACE - One race - Native Hawaiian and Other Pacific Islander - Native HawaiianNumber; Total population - RACE - One race - Native Hawaiian and Other Pacific Islander - Guamanian or ChamorroPercent; Total population - RACE - One race - Native Hawaiian and Other Pacific Islander - Guamanian or ChamorroNumber; Total population - RACE - One race - Native Hawaiian and Other Pacific Islander - SamoanPercent; Total population - RACE - One race - Native Hawaiian and Other Pacific Islander - SamoanNumber; Total population - RACE - One race - Native Hawaiian and Other Pacific Islander - Other Pacific Islander [2]Percent; Total population - RACE - One race - Native Hawaiian and Other Pacific Islander - Other Pacific Islander [2]Number; Total population - RACE - One race - Some other racePercent; Total population - RACE - One race - Some other raceNumber; Total population - RACE - Two or more racesPercent; Total population - RACE - Two or more racesNumber; Total population - RACE - Race alone or in combination with one or more other races [3] - WhitePercent; Total population - RACE - Race alone or in combination with one or more other races [3] - WhiteNumber; Total population - RACE - Race alone or in combination with one or more other races [3] - Black or African AmericanPercent; Total population - RACE - Race alone or in combination with one or more other races [3] - Black or African AmericanNumber; Total population - RACE - Race alone or in combination with one or more other races [3] - American Indian and Alaska NativePercent; Total population - RACE - Race alone or in combination with one or more other races [3] - American Indian and Alaska NativeNumber; Total population - RACE - Race alone or in combination with one or more other races [3] - AsianPercent; Total population - RACE - Race alone or in combination with one or more other races [3] - AsianNumber; Total population - RACE - Race alone or in combination with one or more other races [3] - Native Hawaiian and Other Pacific IslanderPercent; Total population - RACE - Race alone or in combination with one or more other races [3] - Native Hawaiian and Other Pacific IslanderNumber; Total population - RACE - Race alone or in combination with one or more other races [3] - Some other racePercent; Total population - RACE - Race alone or in combination with one or more other races [3] - Some other raceNumber; HISPANIC OR LATINO AND RACE - Total populationPercent; HISPANIC OR LATINO AND RACE - Total populationNumber; HISPANIC OR LATINO AND RACE - Total population - Hispanic or Latino (of any race)Percent; HISPANIC OR LATINO AND RACE - Total population - Hispanic or Latino (of any race)Number; HISPANIC OR LATINO AND RACE - Total population - Hispanic or Latino (of any race) - MexicanPercent; HISPANIC OR LATINO AND RACE - Total population - Hispanic or Latino (of any race) - MexicanNumber; HISPANIC OR LATINO AND RACE - Total population - Hispanic or Latino (of any race) - Puerto RicanPercent; HISPANIC OR LATINO AND RACE - Total population - Hispanic or Latino (of any race) - Puerto RicanNumber; HISPANIC OR LATINO AND RACE - Total population - Hispanic or Latino (of any race) - CubanPercent; HISPANIC OR LATINO AND RACE - Total population - Hispanic or Latino (of any race) - CubanNumber; HISPANIC OR LATINO AND RACE - Total population - Hispanic or Latino (of any race) - Other Hispanic or LatinoPercent; HISPANIC OR LATINO AND RACE - Total population - Hispanic or Latino (of any race) - Other Hispanic or LatinoNumber; HISPANIC OR LATINO AND RACE - Total population - Not Hispanic or LatinoPercent; HISPANIC OR LATINO AND RACE - Total population - Not Hispanic or LatinoNumber; HISPANIC OR LATINO AND RACE - Total population - Not Hispanic or Latino - White alonePercent; HISPANIC OR LATINO AND RACE - Total population - Not Hispanic or Latino - White aloneNumber; RELATIONSHIP - Total populationPercent; RELATIONSHIP - Total populationNumber; RELATIONSHIP - Total population - In householdsPercent; RELATIONSHIP - Total population - In householdsNumber; RELATIONSHIP - Total population - In households - HouseholderPercent; RELATIONSHIP - Total population - In households - HouseholderNumber; RELATIONSHIP - Total population - In households - SpousePercent; RELATIONSHIP - Total population - In households - SpouseNumber; RELATIONSHIP - Total population - In households - ChildPercent; RELATIONSHIP - Total population - In households - ChildNumber; RELATIONSHIP - Total population - In households - Child - Own child under 18 yearsPercent; RELATIONSHIP - Total population - In households - Child - Own child under 18 yearsNumber; RELATIONSHIP - Total population - In households - Other relativesPercent; RELATIONSHIP - Total population - In households - Other relativesNumber; RELATIONSHIP - Total population - In households - Other relatives - Under 18 yearsPercent; RELATIONSHIP - Total population - In households - Other relatives - Under 18 yearsNumber; RELATIONSHIP - Total population - In households - NonrelativesPercent; RELATIONSHIP - Total population - In households - NonrelativesNumber; RELATIONSHIP - Total population - In households - Nonrelatives - Unmarried partnerPercent; RELATIONSHIP - Total population - In households - Nonrelatives - Unmarried partnerNumber; RELATIONSHIP - Total population - In group quartersPercent; RELATIONSHIP - Total population - In group quartersNumber; RELATIONSHIP - Total population - In group quarters - Institutionalized populationPercent; RELATIONSHIP - Total population - In group quarters - Institutionalized populationNumber; RELATIONSHIP - Total population - In group quarters - Noninstitutionalized populationPercent; RELATIONSHIP - Total population - In group quarters - Noninstitutionalized populationNumber; HOUSEHOLDS BY TYPE - Total householdsPercent; HOUSEHOLDS BY TYPE - Total householdsNumber; HOUSEHOLDS BY TYPE - Total households - Family households (families)Percent; HOUSEHOLDS BY TYPE - Total households - Family households (families)Number; HOUSEHOLDS BY TYPE - Total households - Family households (families) - With own children under 18 yearsPercent; HOUSEHOLDS BY TYPE - Total households - Family households (families) - With own children under 18 yearsNumber; HOUSEHOLDS BY TYPE - Total households - Family households (families) - Married-couple familyPercent; HOUSEHOLDS BY TYPE - Total households - Family households (families) - Married-couple familyNumber; HOUSEHOLDS BY TYPE - Total households - Family households (families) - Married-couple family - With own children under 18 yearsPercent; HOUSEHOLDS BY TYPE - Total households - Family households (families) - Married-couple family - With own children under 18 yearsNumber; HOUSEHOLDS BY TYPE - Total households - Family households (families) - Female householder, no husband presentPercent; HOUSEHOLDS BY TYPE - Total households - Family households (families) - Female householder, no husband presentNumber; HOUSEHOLDS BY TYPE - Total households - Family households (families) - Female householder, no husband present - With own children under 18 yearsPercent; HOUSEHOLDS BY TYPE - Total households - Family households (families) - Female householder, no husband present - With own children under 18 yearsNumber; HOUSEHOLDS BY TYPE - Total households - Nonfamily householdsPercent; HOUSEHOLDS BY TYPE - Total households - Nonfamily householdsNumber; HOUSEHOLDS BY TYPE - Total households - Nonfamily households - Householder living alonePercent; HOUSEHOLDS BY TYPE - Total households - Nonfamily households - Householder living aloneNumber; HOUSEHOLDS BY TYPE - Total households - Nonfamily households - Householder living alone - Householder 65 years and overPercent; HOUSEHOLDS BY TYPE - Total households - Nonfamily households - Householder living alone - Householder 65 years and overNumber; HOUSEHOLDS BY TYPE - Total households - Households with individuals under 18 yearsPercent; HOUSEHOLDS BY TYPE - Total households - Households with individuals under 18 yearsNumber; HOUSEHOLDS BY TYPE - Total households - Households with individuals 65 years and overPercent; HOUSEHOLDS BY TYPE - Total households - Households with individuals 65 years and overNumber; HOUSEHOLDS BY TYPE - Total households - Average household sizePercent; HOUSEHOLDS BY TYPE - Total households - Average household sizeNumber; HOUSEHOLDS BY TYPE - Total households - Average family sizePercent; HOUSEHOLDS BY TYPE - Total households - Average family sizeNumber; HOUSING OCCUPANCY - Total housing unitsPercent; HOUSING OCCUPANCY - Total housing unitsNumber; HOUSING OCCUPANCY - Total housing units - Occupied housing unitsPercent; HOUSING OCCUPANCY - Total housing units - Occupied housing unitsNumber; HOUSING OCCUPANCY - Total housing units - Vacant housing unitsPercent; HOUSING OCCUPANCY - Total housing units - Vacant housing unitsNumber; HOUSING OCCUPANCY - Total housing units - Vacant housing units - For seasonal, recreational, or occasional usePercent; HOUSING OCCUPANCY - Total housing units - Vacant housing units - For seasonal, recreational, or occasional useNumber; HOUSING OCCUPANCY - Total housing units - Homeowner vacancy rate (percent)Percent; HOUSING OCCUPANCY - Total housing units - Homeowner vacancy rate (percent)Number; HOUSING OCCUPANCY - Total housing units - Rental vacancy rate (percent)Percent; HOUSING OCCUPANCY - Total housing units - Rental vacancy rate (percent)Number; HOUSING TENURE - Occupied housing unitsPercent; HOUSING TENURE - Occupied housing unitsNumber; HOUSING TENURE - Occupied housing units - Owner-occupied housing unitsPercent; HOUSING TENURE - Occupied housing units - Owner-occupied housing unitsNumber; HOUSING TENURE - Occupied housing units - Renter-occupied housing unitsPercent; HOUSING TENURE - Occupied housing units - Renter-occupied housing unitsNumber; HOUSING TENURE - Occupied housing units - Average household size of owner-occupied unitPercent; HOUSING TENURE - Occupied housing units - Average household size of owner-occupied unitNumber; HOUSING TENURE - Occupied housing units - Average household size of renter-occupied unitPercent; HOUSING TENURE - Occupied housing units - Average household size of renter-occupied unit
21400000US2200196010022001960100Census Tract 9601, Acadia Parish, Louisiana6,1881002,920473,26853462850285419572937567281291315699113015252443372875123234(X)4,304701,957322,347384,031659961684314295554896,1741004,455721,67527120700000200010004000000000002501404,468721,6772719080003016,18810087151110003516,101994,398716,1881006,030972,236361,119182,199361,70028274414322023109215831512702,2361001,59571868391,1195057326363162371164129585263031496243569253(X)3(X)2,4101002,2369317471511(X)8(X)2,2361001,52668710323(X)3(X)
31400000US2200196020022001960200Census Tract 9602, Acadia Parish, Louisiana5,0561002,562512,4944934674168476946392986579128611770914250520342635150342134(X)3,527701,758351,769353,28965570114559217423855,0351004,79995216460100000000000001000000000001302104,81695226513060401605,05610035118000001705,021994,775945,0561005,0561001,764351,216241,791351,41328173386211226110000001,7641001,40880722411,216696173513488153562031018128778144339193(X)3(X)1,9091001,7649214583121(X)7(X)1,7641001,46183303173(X)3(X)
41400000US2200196030022001960300Census Tract 9603, Acadia Parish, Louisiana3,1491001,593511,55649209725183051026082047368125201740913148513042097104332135(X)2,233711,103351,130362,081664351434511150519563,1401003,05897672802000100000000010101000000040903,066976921302010703,1491001514000001103,1341003,049973,1491003,1481001,14536750241,09135854277323818934821000101,14510088377445397506636932938525262232282091847542247223(X)3(X)1,2461001,1459210181921(X)7(X)1,1451001,0419110493(X)3(X)
51400000US2200196040022001960400Census Tract 9604, Acadia Parish, Louisiana5,6171002,754492,8635142984067520947693536691129141668412254522244107193365134(X)3,944701,911342,033363,716668001466812302536675,583995,34795207418060101040000000000000000000503415,3819621144318000905,61710043124000001905,574995,307955,6171005,5921001,991351,291231,994361,554281964101212027012501001501,9911001,55578804401,2916564132168899543622388201891086143482243(X)3(X)2,1761001,9919218592311(X)6(X)1,9911001,63082361183(X)3(X)
61400000US2200196050022001960500Census Tract 9605, Acadia Parish, Louisiana4,9271002,461502,4665040084389439941893197704147771664413227515432345134339132(X)3,405691,675341,730353,16264499104078167324054,9011004,49891378815040000010000030001000001000502614,5249238583514010904,92710061137110002314,866994,448904,9271004,9211001,692341,068221,809371,418291744711178410026000601,6921001,32678762451,06863611361821110363662230018136880848316193(X)3(X)1,7961001,6929410462211(X)5(X)1,6921001,41984273163(X)3(X)

Upon inspection, we can see that the file came with two header rows. R, by default, takes the first row of a CSV to be the header. We clearly do not need the first one so we can rerun the read.csv command and tell it so:

In [3]:
census2000 <- read.csv('2000_census_demographic_profile.csv', skip = 1)
In [4]:
head(census2000)
Out[4]:
IdId2GeographyNumber..Total.populationPercent..Total.populationNumber..Total.population...SEX.AND.AGE...MalePercent..Total.population...SEX.AND.AGE...MaleNumber..Total.population...SEX.AND.AGE...FemalePercent..Total.population...SEX.AND.AGE...FemaleNumber..Total.population...SEX.AND.AGE...Under.5.yearsPercent..Total.population...SEX.AND.AGE...Under.5.yearsNumber..Total.population...SEX.AND.AGE...5.to.9.yearsPercent..Total.population...SEX.AND.AGE...5.to.9.yearsNumber..Total.population...SEX.AND.AGE...10.to.14.yearsPercent..Total.population...SEX.AND.AGE...10.to.14.yearsNumber..Total.population...SEX.AND.AGE...15.to.19.yearsPercent..Total.population...SEX.AND.AGE...15.to.19.yearsNumber..Total.population...SEX.AND.AGE...20.to.24.yearsPercent..Total.population...SEX.AND.AGE...20.to.24.yearsNumber..Total.population...SEX.AND.AGE...25.to.34.yearsPercent..Total.population...SEX.AND.AGE...25.to.34.yearsNumber..Total.population...SEX.AND.AGE...35.to.44.yearsPercent..Total.population...SEX.AND.AGE...35.to.44.yearsNumber..Total.population...SEX.AND.AGE...45.to.54.yearsPercent..Total.population...SEX.AND.AGE...45.to.54.yearsNumber..Total.population...SEX.AND.AGE...55.to.59.yearsPercent..Total.population...SEX.AND.AGE...55.to.59.yearsNumber..Total.population...SEX.AND.AGE...60.to.64.yearsPercent..Total.population...SEX.AND.AGE...60.to.64.yearsNumber..Total.population...SEX.AND.AGE...65.to.74.yearsPercent..Total.population...SEX.AND.AGE...65.to.74.yearsNumber..Total.population...SEX.AND.AGE...75.to.84.yearsPercent..Total.population...SEX.AND.AGE...75.to.84.yearsNumber..Total.population...SEX.AND.AGE...85.years.and.overPercent..Total.population...SEX.AND.AGE...85.years.and.overNumber..Total.population...SEX.AND.AGE...Median.age..years.Percent..Total.population...SEX.AND.AGE...Median.age..years.Number..Total.population...SEX.AND.AGE...18.years.and.overPercent..Total.population...SEX.AND.AGE...18.years.and.overNumber..Total.population...SEX.AND.AGE...18.years.and.over...MalePercent..Total.population...SEX.AND.AGE...18.years.and.over...MaleNumber..Total.population...SEX.AND.AGE...18.years.and.over...FemalePercent..Total.population...SEX.AND.AGE...18.years.and.over...FemaleNumber..Total.population...SEX.AND.AGE...21.years.and.overPercent..Total.population...SEX.AND.AGE...21.years.and.overNumber..Total.population...SEX.AND.AGE...62.years.and.overPercent..Total.population...SEX.AND.AGE...62.years.and.overNumber..Total.population...SEX.AND.AGE...65.years.and.overPercent..Total.population...SEX.AND.AGE...65.years.and.overNumber..Total.population...SEX.AND.AGE...65.years.and.over...MalePercent..Total.population...SEX.AND.AGE...65.years.and.over...MaleNumber..Total.population...SEX.AND.AGE...65.years.and.over...FemalePercent..Total.population...SEX.AND.AGE...65.years.and.over...FemaleNumber..Total.population...RACE...One.racePercent..Total.population...RACE...One.raceNumber..Total.population...RACE...One.race...WhitePercent..Total.population...RACE...One.race...WhiteNumber..Total.population...RACE...One.race...Black.or.African.AmericanPercent..Total.population...RACE...One.race...Black.or.African.AmericanNumber..Total.population...RACE...One.race...American.Indian.and.Alaska.NativePercent..Total.population...RACE...One.race...American.Indian.and.Alaska.NativeNumber..Total.population...RACE...One.race...AsianPercent..Total.population...RACE...One.race...AsianNumber..Total.population...RACE...One.race...Asian...Asian.IndianPercent..Total.population...RACE...One.race...Asian...Asian.IndianNumber..Total.population...RACE...One.race...Asian...ChinesePercent..Total.population...RACE...One.race...Asian...ChineseNumber..Total.population...RACE...One.race...Asian...FilipinoPercent..Total.population...RACE...One.race...Asian...FilipinoNumber..Total.population...RACE...One.race...Asian...JapanesePercent..Total.population...RACE...One.race...Asian...JapaneseNumber..Total.population...RACE...One.race...Asian...KoreanPercent..Total.population...RACE...One.race...Asian...KoreanNumber..Total.population...RACE...One.race...Asian...VietnamesePercent..Total.population...RACE...One.race...Asian...VietnameseNumber..Total.population...RACE...One.race...Asian...Other.Asian..1.Percent..Total.population...RACE...One.race...Asian...Other.Asian..1.Number..Total.population...RACE...One.race...Native.Hawaiian.and.Other.Pacific.IslanderPercent..Total.population...RACE...One.race...Native.Hawaiian.and.Other.Pacific.IslanderNumber..Total.population...RACE...One.race...Native.Hawaiian.and.Other.Pacific.Islander...Native.HawaiianPercent..Total.population...RACE...One.race...Native.Hawaiian.and.Other.Pacific.Islander...Native.HawaiianNumber..Total.population...RACE...One.race...Native.Hawaiian.and.Other.Pacific.Islander...Guamanian.or.ChamorroPercent..Total.population...RACE...One.race...Native.Hawaiian.and.Other.Pacific.Islander...Guamanian.or.ChamorroNumber..Total.population...RACE...One.race...Native.Hawaiian.and.Other.Pacific.Islander...SamoanPercent..Total.population...RACE...One.race...Native.Hawaiian.and.Other.Pacific.Islander...SamoanNumber..Total.population...RACE...One.race...Native.Hawaiian.and.Other.Pacific.Islander...Other.Pacific.Islander..2.Percent..Total.population...RACE...One.race...Native.Hawaiian.and.Other.Pacific.Islander...Other.Pacific.Islander..2.Number..Total.population...RACE...One.race...Some.other.racePercent..Total.population...RACE...One.race...Some.other.raceNumber..Total.population...RACE...Two.or.more.racesPercent..Total.population...RACE...Two.or.more.racesNumber..Total.population...RACE...Race.alone.or.in.combination.with.one.or.more.other.races..3....WhitePercent..Total.population...RACE...Race.alone.or.in.combination.with.one.or.more.other.races..3....WhiteNumber..Total.population...RACE...Race.alone.or.in.combination.with.one.or.more.other.races..3....Black.or.African.AmericanPercent..Total.population...RACE...Race.alone.or.in.combination.with.one.or.more.other.races..3....Black.or.African.AmericanNumber..Total.population...RACE...Race.alone.or.in.combination.with.one.or.more.other.races..3....American.Indian.and.Alaska.NativePercent..Total.population...RACE...Race.alone.or.in.combination.with.one.or.more.other.races..3....American.Indian.and.Alaska.NativeNumber..Total.population...RACE...Race.alone.or.in.combination.with.one.or.more.other.races..3....AsianPercent..Total.population...RACE...Race.alone.or.in.combination.with.one.or.more.other.races..3....AsianNumber..Total.population...RACE...Race.alone.or.in.combination.with.one.or.more.other.races..3....Native.Hawaiian.and.Other.Pacific.IslanderPercent..Total.population...RACE...Race.alone.or.in.combination.with.one.or.more.other.races..3....Native.Hawaiian.and.Other.Pacific.IslanderNumber..Total.population...RACE...Race.alone.or.in.combination.with.one.or.more.other.races..3....Some.other.racePercent..Total.population...RACE...Race.alone.or.in.combination.with.one.or.more.other.races..3....Some.other.raceNumber..HISPANIC.OR.LATINO.AND.RACE...Total.populationPercent..HISPANIC.OR.LATINO.AND.RACE...Total.populationNumber..HISPANIC.OR.LATINO.AND.RACE...Total.population...Hispanic.or.Latino..of.any.race.Percent..HISPANIC.OR.LATINO.AND.RACE...Total.population...Hispanic.or.Latino..of.any.race.Number..HISPANIC.OR.LATINO.AND.RACE...Total.population...Hispanic.or.Latino..of.any.race....MexicanPercent..HISPANIC.OR.LATINO.AND.RACE...Total.population...Hispanic.or.Latino..of.any.race....MexicanNumber..HISPANIC.OR.LATINO.AND.RACE...Total.population...Hispanic.or.Latino..of.any.race....Puerto.RicanPercent..HISPANIC.OR.LATINO.AND.RACE...Total.population...Hispanic.or.Latino..of.any.race....Puerto.RicanNumber..HISPANIC.OR.LATINO.AND.RACE...Total.population...Hispanic.or.Latino..of.any.race....CubanPercent..HISPANIC.OR.LATINO.AND.RACE...Total.population...Hispanic.or.Latino..of.any.race....CubanNumber..HISPANIC.OR.LATINO.AND.RACE...Total.population...Hispanic.or.Latino..of.any.race....Other.Hispanic.or.LatinoPercent..HISPANIC.OR.LATINO.AND.RACE...Total.population...Hispanic.or.Latino..of.any.race....Other.Hispanic.or.LatinoNumber..HISPANIC.OR.LATINO.AND.RACE...Total.population...Not.Hispanic.or.LatinoPercent..HISPANIC.OR.LATINO.AND.RACE...Total.population...Not.Hispanic.or.LatinoNumber..HISPANIC.OR.LATINO.AND.RACE...Total.population...Not.Hispanic.or.Latino...White.alonePercent..HISPANIC.OR.LATINO.AND.RACE...Total.population...Not.Hispanic.or.Latino...White.aloneNumber..RELATIONSHIP...Total.populationPercent..RELATIONSHIP...Total.populationNumber..RELATIONSHIP...Total.population...In.householdsPercent..RELATIONSHIP...Total.population...In.householdsNumber..RELATIONSHIP...Total.population...In.households...HouseholderPercent..RELATIONSHIP...Total.population...In.households...HouseholderNumber..RELATIONSHIP...Total.population...In.households...SpousePercent..RELATIONSHIP...Total.population...In.households...SpouseNumber..RELATIONSHIP...Total.population...In.households...ChildPercent..RELATIONSHIP...Total.population...In.households...ChildNumber..RELATIONSHIP...Total.population...In.households...Child...Own.child.under.18.yearsPercent..RELATIONSHIP...Total.population...In.households...Child...Own.child.under.18.yearsNumber..RELATIONSHIP...Total.population...In.households...Other.relativesPercent..RELATIONSHIP...Total.population...In.households...Other.relativesNumber..RELATIONSHIP...Total.population...In.households...Other.relatives...Under.18.yearsPercent..RELATIONSHIP...Total.population...In.households...Other.relatives...Under.18.yearsNumber..RELATIONSHIP...Total.population...In.households...NonrelativesPercent..RELATIONSHIP...Total.population...In.households...NonrelativesNumber..RELATIONSHIP...Total.population...In.households...Nonrelatives...Unmarried.partnerPercent..RELATIONSHIP...Total.population...In.households...Nonrelatives...Unmarried.partnerNumber..RELATIONSHIP...Total.population...In.group.quartersPercent..RELATIONSHIP...Total.population...In.group.quartersNumber..RELATIONSHIP...Total.population...In.group.quarters...Institutionalized.populationPercent..RELATIONSHIP...Total.population...In.group.quarters...Institutionalized.populationNumber..RELATIONSHIP...Total.population...In.group.quarters...Noninstitutionalized.populationPercent..RELATIONSHIP...Total.population...In.group.quarters...Noninstitutionalized.populationNumber..HOUSEHOLDS.BY.TYPE...Total.householdsPercent..HOUSEHOLDS.BY.TYPE...Total.householdsNumber..HOUSEHOLDS.BY.TYPE...Total.households...Family.households..families.Percent..HOUSEHOLDS.BY.TYPE...Total.households...Family.households..families.Number..HOUSEHOLDS.BY.TYPE...Total.households...Family.households..families....With.own.children.under.18.yearsPercent..HOUSEHOLDS.BY.TYPE...Total.households...Family.households..families....With.own.children.under.18.yearsNumber..HOUSEHOLDS.BY.TYPE...Total.households...Family.households..families....Married.couple.familyPercent..HOUSEHOLDS.BY.TYPE...Total.households...Family.households..families....Married.couple.familyNumber..HOUSEHOLDS.BY.TYPE...Total.households...Family.households..families....Married.couple.family...With.own.children.under.18.yearsPercent..HOUSEHOLDS.BY.TYPE...Total.households...Family.households..families....Married.couple.family...With.own.children.under.18.yearsNumber..HOUSEHOLDS.BY.TYPE...Total.households...Family.households..families....Female.householder..no.husband.presentPercent..HOUSEHOLDS.BY.TYPE...Total.households...Family.households..families....Female.householder..no.husband.presentNumber..HOUSEHOLDS.BY.TYPE...Total.households...Family.households..families....Female.householder..no.husband.present...With.own.children.under.18.yearsPercent..HOUSEHOLDS.BY.TYPE...Total.households...Family.households..families....Female.householder..no.husband.present...With.own.children.under.18.yearsNumber..HOUSEHOLDS.BY.TYPE...Total.households...Nonfamily.householdsPercent..HOUSEHOLDS.BY.TYPE...Total.households...Nonfamily.householdsNumber..HOUSEHOLDS.BY.TYPE...Total.households...Nonfamily.households...Householder.living.alonePercent..HOUSEHOLDS.BY.TYPE...Total.households...Nonfamily.households...Householder.living.aloneNumber..HOUSEHOLDS.BY.TYPE...Total.households...Nonfamily.households...Householder.living.alone...Householder.65.years.and.overPercent..HOUSEHOLDS.BY.TYPE...Total.households...Nonfamily.households...Householder.living.alone...Householder.65.years.and.overNumber..HOUSEHOLDS.BY.TYPE...Total.households...Households.with.individuals.under.18.yearsPercent..HOUSEHOLDS.BY.TYPE...Total.households...Households.with.individuals.under.18.yearsNumber..HOUSEHOLDS.BY.TYPE...Total.households...Households.with.individuals.65.years.and.overPercent..HOUSEHOLDS.BY.TYPE...Total.households...Households.with.individuals.65.years.and.overNumber..HOUSEHOLDS.BY.TYPE...Total.households...Average.household.sizePercent..HOUSEHOLDS.BY.TYPE...Total.households...Average.household.sizeNumber..HOUSEHOLDS.BY.TYPE...Total.households...Average.family.sizePercent..HOUSEHOLDS.BY.TYPE...Total.households...Average.family.sizeNumber..HOUSING.OCCUPANCY...Total.housing.unitsPercent..HOUSING.OCCUPANCY...Total.housing.unitsNumber..HOUSING.OCCUPANCY...Total.housing.units...Occupied.housing.unitsPercent..HOUSING.OCCUPANCY...Total.housing.units...Occupied.housing.unitsNumber..HOUSING.OCCUPANCY...Total.housing.units...Vacant.housing.unitsPercent..HOUSING.OCCUPANCY...Total.housing.units...Vacant.housing.unitsNumber..HOUSING.OCCUPANCY...Total.housing.units...Vacant.housing.units...For.seasonal..recreational..or.occasional.usePercent..HOUSING.OCCUPANCY...Total.housing.units...Vacant.housing.units...For.seasonal..recreational..or.occasional.useNumber..HOUSING.OCCUPANCY...Total.housing.units...Homeowner.vacancy.rate..percent.Percent..HOUSING.OCCUPANCY...Total.housing.units...Homeowner.vacancy.rate..percent.Number..HOUSING.OCCUPANCY...Total.housing.units...Rental.vacancy.rate..percent.Percent..HOUSING.OCCUPANCY...Total.housing.units...Rental.vacancy.rate..percent.Number..HOUSING.TENURE...Occupied.housing.unitsPercent..HOUSING.TENURE...Occupied.housing.unitsNumber..HOUSING.TENURE...Occupied.housing.units...Owner.occupied.housing.unitsPercent..HOUSING.TENURE...Occupied.housing.units...Owner.occupied.housing.unitsNumber..HOUSING.TENURE...Occupied.housing.units...Renter.occupied.housing.unitsPercent..HOUSING.TENURE...Occupied.housing.units...Renter.occupied.housing.unitsNumber..HOUSING.TENURE...Occupied.housing.units...Average.household.size.of.owner.occupied.unitPercent..HOUSING.TENURE...Occupied.housing.units...Average.household.size.of.owner.occupied.unitNumber..HOUSING.TENURE...Occupied.housing.units...Average.household.size.of.renter.occupied.unitPercent..HOUSING.TENURE...Occupied.housing.units...Average.household.size.of.renter.occupied.unit
11400000US2200196010022001960100Census Tract 9601, Acadia Parish, Louisiana6,1881002,920473,26853462850285419572937567281291315699113015252443372875123234(X)4,304701,957322,347384,031659961684314295554896,1741004,455721,67527120700000200010004000000000002501404,468721,6772719080003016,18810087151110003516,101994,398716,1881006,030972,236361,119182,199361,70028274414322023109215831512702,2361001,59571868391,1195057326363162371164129585263031496243569253(X)3(X)2,4101002,2369317471511(X)8(X)2,2361001,52668710323(X)3(X)
21400000US2200196020022001960200Census Tract 9602, Acadia Parish, Louisiana5,0561002,562512,4944934674168476946392986579128611770914250520342635150342134(X)3,527701,758351,769353,28965570114559217423855,0351004,79995216460100000000000001000000000001302104,81695226513060401605,05610035118000001705,021994,775945,0561005,0561001,764351,216241,791351,41328173386211226110000001,7641001,40880722411,216696173513488153562031018128778144339193(X)3(X)1,9091001,7649214583121(X)7(X)1,7641001,46183303173(X)3(X)
31400000US2200196030022001960300Census Tract 9603, Acadia Parish, Louisiana3,1491001,593511,55649209725183051026082047368125201740913148513042097104332135(X)2,233711,103351,130362,081664351434511150519563,1401003,05897672802000100000000010101000000040903,066976921302010703,1491001514000001103,1341003,049973,1491003,1481001,14536750241,09135854277323818934821000101,14510088377445397506636932938525262232282091847542247223(X)3(X)1,2461001,1459210181921(X)7(X)1,1451001,0419110493(X)3(X)
41400000US2200196040022001960400Census Tract 9604, Acadia Parish, Louisiana5,6171002,754492,8635142984067520947693536691129141668412254522244107193365134(X)3,944701,911342,033363,716668001466812302536675,583995,34795207418060101040000000000000000000503415,3819621144318000905,61710043124000001905,574995,307955,6171005,5921001,991351,291231,994361,554281964101212027012501001501,9911001,55578804401,2916564132168899543622388201891086143482243(X)3(X)2,1761001,9919218592311(X)6(X)1,9911001,63082361183(X)3(X)
51400000US2200196050022001960500Census Tract 9605, Acadia Parish, Louisiana4,9271002,461502,4665040084389439941893197704147771664413227515432345134339132(X)3,405691,675341,730353,16264499104078167324054,9011004,49891378815040000010000030001000001000502614,5249238583514010904,92710061137110002314,866994,448904,9271004,9211001,692341,068221,809371,418291744711178410026000601,6921001,32678762451,06863611361821110363662230018136880848316193(X)3(X)1,7961001,6929410462211(X)5(X)1,6921001,41984273163(X)3(X)
61400000US2200196060022001960600Census Tract 9606, Acadia Parish, Louisiana5,6541002,647473,00753464847184428460835867601387115615112434209441572414105233(X)3,999711,791322,208393,736668691576114271549095,620994,8098578214701200030200010600000000000001003414,8428679214180150002105,65410049128100002105,605994,774845,6541005,526982,073371,076191,891331,47826289516031974110212821282002,0731001,47771796381,076525412631015189959629521252431288243510253(X)3(X)2,2921002,07390219101111(X)14(X)2,0731001,47471599293(X)3(X)

Cleaning the data

Visually, we can see that this data set is very wide. In fact, there are 195 columns.

Spaces are not allowed in R column names. That's why they've been automatically converted to periods, as in Number..Total.population.

Let's keep a handful of these:

  • Id2: This is what the census bureau calls a FIPS code. It is a unique numerical identifier for all census tracts. This will be important when we join our two datasets together.
  • Geography: This is a text description of the tract, with the parish name.
  • Number..Total.population: The total population of the tract.
  • Number..HOUSING.OCCUPANCY...Total.housing.units, Number..HOUSING.OCCUPANCY...Total.housing.units...Occupied.housing.units, and Number..HOUSING.OCCUPANCY...Total.housing.units...Vacant.housing.units: The total, occupied and vacant housing units.

To help us trim the data set to just these six columns, we are going to import a package. There are thousands of packages for R created by the open-source community, which help improve on what is included in R by default.

The one we will use here is called dplyr.

In [5]:
## if dplyr was not installed we would have to run this
# install.packages('dplyr')

## to import the package and all of its functions 
library('dplyr')
Warning message:
: package ‘dplyr’ was built under R version 3.2.4
Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

From dplyr, we will use the select function to trim the data set and save it to a new variable called census2000.trimmed:

In [6]:
census2000.trimmed <- select(
    census2000, # name of the data frame
    # list of all the six column names we want to keep
    Id2, 
    Geography, 
    Number..Total.population, 
    Number..HOUSING.OCCUPANCY...Total.housing.units, 
    Number..HOUSING.OCCUPANCY...Total.housing.units...Occupied.housing.units, 
    Number..HOUSING.OCCUPANCY...Total.housing.units...Vacant.housing.units
)

head(census2000.trimmed)
Out[6]:
Id2GeographyNumber..Total.populationNumber..HOUSING.OCCUPANCY...Total.housing.unitsNumber..HOUSING.OCCUPANCY...Total.housing.units...Occupied.housing.unitsNumber..HOUSING.OCCUPANCY...Total.housing.units...Vacant.housing.units
122001960100Census Tract 9601, Acadia Parish, Louisiana6,1882,4102,236174
222001960200Census Tract 9602, Acadia Parish, Louisiana5,0561,9091,764145
322001960300Census Tract 9603, Acadia Parish, Louisiana3,1491,2461,145101
422001960400Census Tract 9604, Acadia Parish, Louisiana5,6172,1761,991185
522001960500Census Tract 9605, Acadia Parish, Louisiana4,9271,7961,692104
622001960600Census Tract 9606, Acadia Parish, Louisiana5,6542,2922,073219

This shows us that we were able to select the columns correctly. But one lingering issue is that these column names are long and unwieldy. Since we are going to be typing them often, let's rename them to shorter, more convenient versions:

In [7]:
colnames(census2000.trimmed) <- c(
    'fips.code', 'geography', 'population',
    'total.housing.units', 'occupied.housing.units', 'vacant.housing.units'
)
head(census2000.trimmed)
Out[7]:
fips.codegeographypopulationtotal.housing.unitsoccupied.housing.unitsvacant.housing.units
122001960100Census Tract 9601, Acadia Parish, Louisiana6,1882,4102,236174
222001960200Census Tract 9602, Acadia Parish, Louisiana5,0561,9091,764145
322001960300Census Tract 9603, Acadia Parish, Louisiana3,1491,2461,145101
422001960400Census Tract 9604, Acadia Parish, Louisiana5,6172,1761,991185
522001960500Census Tract 9605, Acadia Parish, Louisiana4,9271,7961,692104
622001960600Census Tract 9606, Acadia Parish, Louisiana5,6542,2922,073219

Another helpful command to run on any data set is str, which gives you the structure of the variable as defined by R:

In [8]:
str(census2000.trimmed)
'data.frame':	1106 obs. of  6 variables:
 $ fips.code             : num  2.2e+10 2.2e+10 2.2e+10 2.2e+10 2.2e+10 ...
 $ geography             : Factor w/ 1106 levels "Census Tract 10.01, Lafayette Parish, Louisiana",..: 970 978 985 993 1000 1006 1011 1015 1018 1021 ...
 $ population            : Factor w/ 1019 levels "0","1","10,248",..: 859 710 350 801 692 806 647 804 711 842 ...
 $ total.housing.units   : Factor w/ 905 levels "1","10","1,002",..: 565 404 104 499 357 531 397 522 426 594 ...
 $ occupied.housing.units: Factor w/ 876 levels "0","1","1,001",..: 512 363 74 446 338 468 347 470 374 514 ...
 $ vacant.housing.units  : Factor w/ 374 levels "0","1","10","100",..: 86 54 5 98 9 134 83 86 110 196 ...

The structure tells us that this is a data frame with 1106 rows and six columns. It further tells us the type of each column.

Notice how the FIPS code read in as a number but the other numeric columns read in as “factors”? That's R-speak for a categorical variable, and any character variables are by default set to this type. This happened because the numbers and those columns have commas. The presence of a single character within a number makes R treat the entire column as strings. This will be an issue later when we try to add two numbers together, as R doesn't know how to add two characters.

The solution: we need to remove the comma from all the strings, then recast the variable as a number.

To help with this we are going to use another package called stringr, and a function from within it called str_replace:

In [9]:
# install.packages('stringr')

library('stringr')
Warning message:
: package ‘stringr’ was built under R version 3.2.5

Let's start with the population variable. First, let's remove the comma and write the result to the original column. (The format for calling a column from a data frame in R is df.name$column.name)

In [10]:
census2000.trimmed$population <- str_replace(
    census2000.trimmed$population, 
    pattern = ',', 
    replacement = ''
)

Then we'll visually inspect the head:

In [11]:
head(census2000.trimmed)
Out[11]:
fips.codegeographypopulationtotal.housing.unitsoccupied.housing.unitsvacant.housing.units
122001960100Census Tract 9601, Acadia Parish, Louisiana61882,4102,236174
222001960200Census Tract 9602, Acadia Parish, Louisiana50561,9091,764145
322001960300Census Tract 9603, Acadia Parish, Louisiana31491,2461,145101
422001960400Census Tract 9604, Acadia Parish, Louisiana56172,1761,991185
522001960500Census Tract 9605, Acadia Parish, Louisiana49271,7961,692104
622001960600Census Tract 9606, Acadia Parish, Louisiana56542,2922,073219

This appeared to work. But R will still think this is a character variable unless we explicitly tell it otherwise:

In [12]:
census2000.trimmed$population <- as.numeric(census2000.trimmed$population)

Running str will help us ensure this worked:

In [13]:
str(census2000.trimmed)
'data.frame':	1106 obs. of  6 variables:
 $ fips.code             : num  2.2e+10 2.2e+10 2.2e+10 2.2e+10 2.2e+10 ...
 $ geography             : Factor w/ 1106 levels "Census Tract 10.01, Lafayette Parish, Louisiana",..: 970 978 985 993 1000 1006 1011 1015 1018 1021 ...
 $ population            : num  6188 5056 3149 5617 4927 ...
 $ total.housing.units   : Factor w/ 905 levels "1","10","1,002",..: 565 404 104 499 357 531 397 522 426 594 ...
 $ occupied.housing.units: Factor w/ 876 levels "0","1","1,001",..: 512 363 74 446 338 468 347 470 374 514 ...
 $ vacant.housing.units  : Factor w/ 374 levels "0","1","10","100",..: 86 54 5 98 9 134 83 86 110 196 ...

For the rest of the columns we can nest the first function within the second to speed things up:

In [14]:
census2000.trimmed$total.housing.units <- as.numeric(str_replace(census2000.trimmed$total.housing.units, pattern = ',', replacement = ''))
census2000.trimmed$occupied.housing.units <- as.numeric(str_replace(census2000.trimmed$occupied.housing.units, pattern = ',', replacement = ''))
census2000.trimmed$vacant.housing.units <- as.numeric(str_replace(census2000.trimmed$vacant.housing.units, pattern = ',', replacement = ''))
In [15]:
str(census2000.trimmed)
'data.frame':	1106 obs. of  6 variables:
 $ fips.code             : num  2.2e+10 2.2e+10 2.2e+10 2.2e+10 2.2e+10 ...
 $ geography             : Factor w/ 1106 levels "Census Tract 10.01, Lafayette Parish, Louisiana",..: 970 978 985 993 1000 1006 1011 1015 1018 1021 ...
 $ population            : num  6188 5056 3149 5617 4927 ...
 $ total.housing.units   : num  2410 1909 1246 2176 1796 ...
 $ occupied.housing.units: num  2236 1764 1145 1991 1692 ...
 $ vacant.housing.units  : num  174 145 101 185 104 219 171 174 196 284 ...

By default, head will print the first six lines. But we can override the default to show as many as we want (we'll show 10 here):

In [16]:
head(census2000.trimmed, n = 10)
Out[16]:
fips.codegeographypopulationtotal.housing.unitsoccupied.housing.unitsvacant.housing.units
122001960100Census Tract 9601, Acadia Parish, Louisiana618824102236174
222001960200Census Tract 9602, Acadia Parish, Louisiana505619091764145
322001960300Census Tract 9603, Acadia Parish, Louisiana314912461145101
422001960400Census Tract 9604, Acadia Parish, Louisiana561721761991185
522001960500Census Tract 9605, Acadia Parish, Louisiana492717961692104
622001960600Census Tract 9606, Acadia Parish, Louisiana565422922073219
722001960700Census Tract 9607, Acadia Parish, Louisiana461418941723171
822001960800Census Tract 9608, Acadia Parish, Louisiana564022542080174
922001960900Census Tract 9609, Acadia Parish, Louisiana505919781782196
1022001961000Census Tract 9610, Acadia Parish, Louisiana596525262242284

That worked!

But in the interest of full disclosure, you should know that we added those commas to the original CSVs from the Census Bureau to facilitate this exercise. “Commafied” numbers are one of the most frequent stumbling blocks to creating a cleaned data set.

For our last cleaning exercise, we'll work with the geography column. It has a lot of information in there, but it would be more useful if the census tract, parish name and state were separated, to help us aggregate some of these numbers.

The package tidyr has a function that helps us do just that:

In [17]:
# install.packages('tidyr')

library('tidyr')

Should you run into a function and not know what arguments it takes, running the function name, a pair of of empty parentheses afterwards, preceded by a question mark will allow you to access the documentation on that function:

In [18]:
# ?separate()
In [19]:
census2000.trimmed <- separate(
    census2000.trimmed, # name of the data frame
    geography, # column to split
    c('tract', 'parish', 'state'), # new column names
    ', ' # delimiter to split on (note the space after the comma)
)
In [20]:
head(census2000.trimmed)
Out[20]:
fips.codetractparishstatepopulationtotal.housing.unitsoccupied.housing.unitsvacant.housing.units
122001960100Census Tract 9601Acadia ParishLouisiana618824102236174
222001960200Census Tract 9602Acadia ParishLouisiana505619091764145
322001960300Census Tract 9603Acadia ParishLouisiana314912461145101
422001960400Census Tract 9604Acadia ParishLouisiana561721761991185
522001960500Census Tract 9605Acadia ParishLouisiana492717961692104
622001960600Census Tract 9606Acadia ParishLouisiana565422922073219

Our data set is as cleaned up as we need it to be now.

Let's summarize it with a frequency table of the county names:

In [21]:
table(census2000.trimmed$parish)
Out[21]:
              Acadia Parish                Allen Parish 
                         12                           5 
           Ascension Parish           Assumption Parish 
                         14                           6 
           Avoyelles Parish           Beauregard Parish 
                          9                           7 
           Bienville Parish              Bossier Parish 
                          5                          19 
               Caddo Parish            Calcasieu Parish 
                         64                          41 
            Caldwell Parish              Cameron Parish 
                          3                           2 
           Catahoula Parish            Claiborne Parish 
                          3                           5 
           Concordia Parish              De Soto Parish 
                          5                           7 
    East Baton Rouge Parish         East Carroll Parish 
                         89                           3 
      East Feliciana Parish           Evangeline Parish 
                          4                           8 
            Franklin Parish                Grant Parish 
                          6                           5 
              Iberia Parish            Iberville Parish 
                         15                           8 
             Jackson Parish      Jefferson Davis Parish 
                          5                           7 
           Jefferson Parish            Lafayette Parish 
                        123                          41 
           Lafourche Parish             La Salle Parish 
                         22                           3 
             Lincoln Parish           Livingston Parish 
                         10                          13 
             Madison Parish            Morehouse Parish 
                          5                           8 
        Natchitoches Parish              Orleans Parish 
                          9                         181 
            Ouachita Parish          Plaquemines Parish 
                         41                           8 
       Pointe Coupee Parish              Rapides Parish 
                          6                          34 
           Red River Parish             Richland Parish 
                          2                           6 
              Sabine Parish          St. Bernard Parish 
                          7                          17 
         St. Charles Parish           St. Helena Parish 
                         13                           2 
           St. James Parish St. John the Baptist Parish 
                          7                          11 
          St. Landry Parish           St. Martin Parish 
                         19                           9 
            St. Mary Parish          St. Tammany Parish 
                         16                          35 
          Tangipahoa Parish               Tensas Parish 
                         18                           3 
          Terrebonne Parish                Union Parish 
                         20                           6 
           Vermilion Parish               Vernon Parish 
                         10                           9 
          Washington Parish              Webster Parish 
                         10                          11 
    West Baton Rouge Parish         West Carroll Parish 
                          4                           3 
      West Feliciana Parish                 Winn Parish 
                          3                           4 

Now we need to run all of the above cleaning steps on the 2010 data:

In [22]:
census2010 <- read.csv('2010_census_demographic_profile.csv', skip = 1)

census2010.trimmed <- select(
  census2010, # name of the data frame
  # list of all the column names we want to keep
  Id2, Geography, Number..SEX.AND.AGE...Total.population, 
  Number..HOUSING.OCCUPANCY...Total.housing.units, 
  Number..HOUSING.OCCUPANCY...Total.housing.units...Occupied.housing.units, 
  Number..HOUSING.OCCUPANCY...Total.housing.units...Vacant.housing.units
)

colnames(census2010.trimmed) <- c('fips.code', 'census.tract', 'population', 
                               'total.housing.units', 'occupied.housing.units', 'vacant.housing.units')

census2010.trimmed$population <- as.numeric(str_replace(census2010.trimmed$population, pattern = ',', replacement = ''))
census2010.trimmed$total.housing.units <- as.numeric(str_replace(census2010.trimmed$total.housing.units, pattern = ',', replacement = ''))
census2010.trimmed$occupied.housing.units <- as.numeric(str_replace(census2010.trimmed$occupied.housing.units, pattern = ',', replacement = ''))
census2010.trimmed$vacant.housing.units <- as.numeric(str_replace(census2010.trimmed$vacant.housing.units, pattern = ',', replacement = ''))

census2010.trimmed <- separate(census2010.trimmed, census.tract, c('tract', 'parish', 'state'), ', ')

orleans2010 <- filter(census2010.trimmed, parish == 'Orleans Parish')

Merging

Now that we've cleaned both of our data files, let's merge the 2000 and 2010 data. Merging allows you to link two data sets on values common to both. It is a powerful operation that cannot be easily done in a program like Excel with such versatility.

In this case, we know that the FIPS code and the character names for most of the tracts should be consistent across the 10-year period.

However, census tracts are added, deleted, split and joined over the course of 10 years. We will make sure to keep all entries in both years. This is what is referred to as a "full outer join.” If we were to only keep all rows that were common to both data frames (R’s default behavior) we would lose some data.

In [23]:
census.comparison <- merge(
    census2000.trimmed, # first data frame
    census2010.trimmed, # second data frame
    by = c('fips.code', 'tract', 'parish', 'state'), # keys to use for join
    suffixes = c('.00', '.10'), # suffixes to append to new columns
    all = TRUE # specifying to keep all data from both data frames
)

Let's inspect a portion of the data frame where there are full matches and partial matches:

In [24]:
census.comparison[65:69, ]
Out[24]:
fips.codetractparishstatepopulation.00total.housing.units.00occupied.housing.units.00vacant.housing.units.00population.10total.housing.units.10occupied.housing.units.10vacant.housing.units.10
6522015010801Census Tract 108.01Bossier ParishLouisiana315915271351176335914151284131
6622015010803Census Tract 108.03Bossier ParishLouisiana536222502120130NANANANA
6722015010804Census Tract 108.04Bossier ParishLouisiana610123832282101727829432815128
6822015010805Census Tract 108.05Bossier ParishLouisianaNANANANA323815851425160
6922015010806Census Tract 108.06Bossier ParishLouisianaNANANANA40861671161061

Saving your work

Saving your intermediate work to a file is often good practice, so we will write the results of our merge to a CSV (you can do this with any data frame you create in R).

In [25]:
write.csv(census.comparison, 'census_comparison_result.csv', row.names = FALSE)

Calculating population changes in New Orleans

Let's filter our merged data frame down to just Orleans Parish. The Orleans Parish and the city of New Orleans are “coterminous” (that is, they share the same boundaries), so this will isolate only the census tracts of the city.

In [26]:
# note the use of "==" since we are expressing a criterion
orleans <- filter(census.comparison, parish == 'Orleans Parish') 
head(orleans)
Out[26]:
fips.codetractparishstatepopulation.00total.housing.units.00occupied.housing.units.00vacant.housing.units.00population.10total.housing.units.10occupied.housing.units.10vacant.housing.units.10
12.2071e+10Census Tract 1Orleans ParishLouisiana238114081145263245515131229284
22.2071e+10Census Tract 2Orleans ParishLouisiana13476914961951197738496242
32.2071e+10Census Tract 3Orleans ParishLouisiana14687195591601231641467174
42.2071e+10Census Tract 4Orleans ParishLouisiana2564103487316123281137911226
52.2071e+10Census Tract 6.01Orleans ParishLouisiana203470450619884932826959
62.2071e+10Census Tract 6.02Orleans ParishLouisiana2957110610119525341108923185

Now we can do some quick calculations with our new merged data frame for New Orleans.

First question: What was the population of New Orleans in 2000?

That requires summing up the 2000 population column like so:

In [27]:
sum(orleans$population.00)
Out[27]:
[1] NA

Why didn't this work? Let's inspect the population.00 variable using summary:

In [28]:
summary(orleans$population.00)
Out[28]:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
     52    1711    2274    2678    3141    9931      30 

This reveals that there are 30 census tracts that have NA, or missing, values for population.00. By default, R does not compute a sum of a column if there are missing values. We'll have to tell it to ignore these missing values by specifying na.rm = TRUE:

In [29]:
sum(orleans$population.00, na.rm = TRUE)
Out[29]:
484674

Second question: What was the population of New Orleans in 2010?

In [30]:
sum(orleans$population.10, na.rm = TRUE)
Out[30]:
343829

This matches the story exactly.

Last question: What was the percent change in New Orleans population between 2000 and 2010?

To do this, we'll first save each population calculation to new objects. Then we'll create another object to store the percent change.

In [31]:
nola2000pop <- sum(orleans$population.00, na.rm = TRUE)
nola2010pop <- sum(orleans$population.10, na.rm = TRUE)

perc.change.nola <- (nola2010pop - nola2000pop)/nola2000pop * 100
In [32]:
print(paste('The percent change in New Orleans population since 2000 is ', round(perc.change.nola), '%', sep =''))
[1] "The percent change in New Orleans population since 2000 is -29%"

Again, we see that this matches the 29% drop cited by The Times-Picayune article. Yay!

This concludes our workshop, Getting started with R.

We'll use the merged data CSV we saved above to do further analysis in our next workshop, More with R. Take a sneak peak at the notebook here.

Any questions?