By Christine Zhang (Knight-Mozilla / Los Angeles Times) & Ryan Menezes (Los Angeles Times)
IRE Conference -- New Orleans, LA
June 18, 2016
This workshop is a basic introduction to R, a free, open-source software for data analysis and statistics.
R is a powerful tool that can help you quickly and effectively answer questions using data.
Take our host city, New Orleans, for example. Hurricane Katrina, which struck in August 2005, was a devastating natural disaster that substantially reduced the population of New Orleans. The storm's timing coincidentally falls between the U.S. Census full population counts of 2000 and 2010.
In this session, we will use the "Demographic Profile" -- a large summary file with many different demographic variables downloaded from the U.S. Census Bureau website -- from 2000 and 2010, for all census tracts in the state of Louisiana.
In this session, we will load, clean, merge and analyze these two data sets.
Basic analysis techniques like the ones you will learn in this class can help you write data-driven stories, like this one from The Times-Picayune, published shortly after the Census Bureau released its 2010 tally.
The story begins:
Five years after Hurricane Katrina emptied New Orleans and prompted the largest mass migration in modern American history, the 2010 Census counted 343,829 people living in the still-recovering city, a 29 percent drop since the last head count a decade ago, according to data released today.
Using the data we have, we will attempt to replicate the calculations in that lede.
The following code and annotations were written in a Jupyter notebook. The code is best run in RStudio version 0.99.902 using R version 3.3.0.
We'll start by loading in the 2000 data, which is stored in a CSV (comma-separated values) file. CSVs are plain-text files of data where commas separate the columns within a line. It is sometimes preferable to work with CSVs as opposed to files of a proprietary format, such as Microsoft Excel files, but the Census Bureau readily makes data available in both formats.
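To make the format concrete, here is what the first few lines of a small CSV could look like (an invented illustration of ours, not the actual census file):

fips.code,parish,population
22071001744,Orleans Parish,3250
22071001745,Orleans Parish,2987

Each line is a row, and each comma marks a column boundary.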
Let's run R's read.csv command and save the data to an object called census2000. Here, we are using assignment with <-, which tells R to run the right side and assign the result to the object named on the left.
census2000 <- read.csv('2000_census_demographic_profile.csv')
Now that this ran without incident, let's inspect the first few rows using head, which by default prints out the first six rows of a data frame (R's internal term for a spreadsheet):
head(census2000)
Upon inspection, we can see that the file came with two header rows. R, by default, takes the first row of a CSV to be the header. We clearly do not need the first one, so we can rerun the read.csv command and tell it so:
census2000 <- read.csv('2000_census_demographic_profile.csv', skip = 1)
head(census2000)
Visually, we can see that this data set is very wide. In fact, there are 195 columns.
Spaces are not allowed in R column names. That's why they've been automatically converted to periods, as in Number..Total.population.
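For the curious, this conversion is done by base R's make.names function, which read.csv applies to the header row when its check.names argument is TRUE (the default). You can watch it work on a toy header:

make.names(c('Number Total population', 'Id2'))
# returns "Number.Total.population" "Id2" -- spaces and other illegal characters become periods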
Let's keep a handful of these:

- Id2: This is what the Census Bureau calls a FIPS code. It is a unique numerical identifier for all census tracts. This will be important when we join our two data sets together.
- Geography: This is a text description of the tract, with the parish name.
- Number..Total.population: The total population of the tract.
- Number..HOUSING.OCCUPANCY...Total.housing.units, Number..HOUSING.OCCUPANCY...Total.housing.units...Occupied.housing.units and Number..HOUSING.OCCUPANCY...Total.housing.units...Vacant.housing.units: The total, occupied and vacant housing units.

To help us trim the data set to just these six columns, we are going to import a package. There are thousands of packages for R created by the open-source community, which improve on what is included in R by default.
The one we will use here is called dplyr.
## if dplyr was not installed we would have to run this
# install.packages('dplyr')
## to import the package and all of its functions
library('dplyr')
From dplyr, we will use the select function to trim the data set and save it to a new variable called census2000.trimmed:
census2000.trimmed <- select(
census2000, # name of the data frame
# list of all the six column names we want to keep
Id2,
Geography,
Number..Total.population,
Number..HOUSING.OCCUPANCY...Total.housing.units,
Number..HOUSING.OCCUPANCY...Total.housing.units...Occupied.housing.units,
Number..HOUSING.OCCUPANCY...Total.housing.units...Vacant.housing.units
)
head(census2000.trimmed)
This shows us that we were able to select the columns correctly. But one lingering issue is that these column names are long and unwieldy. Since we are going to be typing them often, let's rename them to shorter, more convenient versions:
colnames(census2000.trimmed) <- c(
'fips.code', 'geography', 'population',
'total.housing.units', 'occupied.housing.units', 'vacant.housing.units'
)
head(census2000.trimmed)
Another helpful command to run on any data set is str, which gives you the structure of the variable as defined by R:
str(census2000.trimmed)
The structure tells us that this is a data frame with 1106 rows and six columns. It further tells us the type of each column.
Notice how the FIPS code read in as a number but the other numeric columns read in as "factors"? That's R-speak for a categorical variable, and character columns are set to this type by default. This happened because the numbers in those columns contain commas. The presence of a single non-numeric character makes R treat the entire column as strings. This will be an issue later when we try to add two numbers together, as R doesn't know how to add two characters.
The solution: we need to remove the comma from all the strings, then recast the variable as a number.
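You can see the problem on a single toy value (our own illustration, not the workshop data):

as.numeric('343,829') # the comma defeats the conversion; R returns NA with a warning
as.numeric('343829')  # without the comma, R happily returns 343829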
To help with this, we are going to use another package called stringr, and a function from within it called str_replace:
# install.packages('stringr')
library('stringr')
Let's start with the population variable. First, let's remove the comma and write the result back to the original column. (The format for calling a column from a data frame in R is df.name$column.name.)
census2000.trimmed$population <- str_replace(
census2000.trimmed$population,
pattern = ',',
replacement = ''
)
Then we'll visually inspect the head:
head(census2000.trimmed)
This appeared to work. But R will still think this is a character variable unless we explicitly tell it otherwise:
census2000.trimmed$population <- as.numeric(census2000.trimmed$population)
Running str will help us ensure this worked:
str(census2000.trimmed)
For the rest of the columns, we can nest str_replace within as.numeric to do both steps at once:
census2000.trimmed$total.housing.units <- as.numeric(str_replace(census2000.trimmed$total.housing.units, pattern = ',', replacement = ''))
census2000.trimmed$occupied.housing.units <- as.numeric(str_replace(census2000.trimmed$occupied.housing.units, pattern = ',', replacement = ''))
census2000.trimmed$vacant.housing.units <- as.numeric(str_replace(census2000.trimmed$vacant.housing.units, pattern = ',', replacement = ''))
str(census2000.trimmed)
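As an aside, those three near-identical lines could also be written as a loop over the column names. This sketch is our addition, not part of the original workshop code, but it does the same work:

for (col in c('total.housing.units', 'occupied.housing.units', 'vacant.housing.units')) {
  # strip the comma, then recast the column as a number
  census2000.trimmed[[col]] <- as.numeric(
    str_replace(census2000.trimmed[[col]], pattern = ',', replacement = '')
  )
}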
By default, head will print the first six lines. But we can override the default to show as many as we want (we'll show 10 here):
head(census2000.trimmed, n = 10)
That worked!
But in the interest of full disclosure, you should know that we added those commas to the original CSVs from the Census Bureau to facilitate this exercise. “Commafied” numbers are one of the most frequent stumbling blocks to creating a cleaned data set.
For our last cleaning exercise, we'll work with the geography column. It packs a lot of information into one field, but it would be more useful if the census tract, parish name and state were separated, to help us aggregate some of these numbers.
The package tidyr has a function that helps us do just that:
# install.packages('tidyr')
library('tidyr')
Should you run into a function and not know what arguments it takes, typing a question mark, followed by the function name and a pair of empty parentheses, will pull up the documentation for that function:
# ?separate()
census2000.trimmed <- separate(
census2000.trimmed, # name of the data frame
geography, # column to split
c('tract', 'parish', 'state'), # new column names
', ' # delimiter to split on (note the space after the comma)
)
head(census2000.trimmed)
Our data set is now as clean as we need it to be.
Let's summarize it with a frequency table of the parish names:
table(census2000.trimmed$parish)
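If you would rather see the parishes ranked by their number of tracts, you can wrap the table in sort (a small addition of ours, not in the original notebook):

sort(table(census2000.trimmed$parish), decreasing = TRUE)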
Now we need to run all of the above cleaning steps on the 2010 data:
census2010 <- read.csv('2010_census_demographic_profile.csv', skip = 1)
census2010.trimmed <- select(
census2010, # name of the data frame
# list of all the column names we want to keep
Id2, Geography, Number..SEX.AND.AGE...Total.population,
Number..HOUSING.OCCUPANCY...Total.housing.units,
Number..HOUSING.OCCUPANCY...Total.housing.units...Occupied.housing.units,
Number..HOUSING.OCCUPANCY...Total.housing.units...Vacant.housing.units
)
colnames(census2010.trimmed) <- c('fips.code', 'census.tract', 'population',
'total.housing.units', 'occupied.housing.units', 'vacant.housing.units')
census2010.trimmed$population <- as.numeric(str_replace(census2010.trimmed$population, pattern = ',', replacement = ''))
census2010.trimmed$total.housing.units <- as.numeric(str_replace(census2010.trimmed$total.housing.units, pattern = ',', replacement = ''))
census2010.trimmed$occupied.housing.units <- as.numeric(str_replace(census2010.trimmed$occupied.housing.units, pattern = ',', replacement = ''))
census2010.trimmed$vacant.housing.units <- as.numeric(str_replace(census2010.trimmed$vacant.housing.units, pattern = ',', replacement = ''))
census2010.trimmed <- separate(census2010.trimmed, census.tract, c('tract', 'parish', 'state'), ', ')
orleans2010 <- filter(census2010.trimmed, parish == 'Orleans Parish')
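Repeating every step by hand works, but it invites typos. As an aside, the whole cleaning pipeline could be wrapped in a reusable function. The sketch below is our addition, not part of the original workshop; it leans on dplyr's one_of helper so the column names can be passed as plain strings:

# clean.census is a hypothetical helper: read, trim, rename, de-comma and split one profile file
clean.census <- function(path, population.col) {
  trimmed <- read.csv(path, skip = 1)
  trimmed <- select(trimmed, one_of(c(
    'Id2', 'Geography', population.col,
    'Number..HOUSING.OCCUPANCY...Total.housing.units',
    'Number..HOUSING.OCCUPANCY...Total.housing.units...Occupied.housing.units',
    'Number..HOUSING.OCCUPANCY...Total.housing.units...Vacant.housing.units'
  )))
  colnames(trimmed) <- c('fips.code', 'geography', 'population',
                         'total.housing.units', 'occupied.housing.units',
                         'vacant.housing.units')
  # strip commas and recast each count column as a number
  for (col in c('population', 'total.housing.units',
                'occupied.housing.units', 'vacant.housing.units')) {
    trimmed[[col]] <- as.numeric(str_replace(trimmed[[col]], pattern = ',', replacement = ''))
  }
  separate(trimmed, geography, c('tract', 'parish', 'state'), ', ')
}

# this call would reproduce the census2010.trimmed we just built by hand
# census2010.trimmed <- clean.census('2010_census_demographic_profile.csv',
#                                    'Number..SEX.AND.AGE...Total.population')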
Now that we've cleaned both of our data files, let's merge the 2000 and 2010 data. Merging allows you to link two data sets on values common to both. It is a powerful operation that cannot be done with nearly as much ease or versatility in a program like Excel.
In this case, we know that the FIPS code and the character names for most of the tracts should be consistent across the 10-year period.
However, census tracts are added, deleted, split and joined over the course of 10 years. We will make sure to keep all entries from both years. This is what is referred to as a "full outer join." If we were to keep only the rows common to both data frames (R's default behavior), we would lose some data.
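The difference is easy to see on toy data (an invented example of ours, not the census files):

a <- data.frame(fips = c(1, 2), pop00 = c(100, 200))
b <- data.frame(fips = c(2, 3), pop10 = c(210, 330))
merge(a, b, by = 'fips')              # default inner join: only fips 2 survives
merge(a, b, by = 'fips', all = TRUE)  # full outer join: all three rows kept, with NAs filling the gaps

Now let's run the full outer join on our real data: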
census.comparison <- merge(
census2000.trimmed, # first data frame
census2010.trimmed, # second data frame
by = c('fips.code', 'tract', 'parish', 'state'), # keys to use for join
suffixes = c('.00', '.10'), # suffixes to append to new columns
all = TRUE # specifying to keep all data from both data frames
)
Let's inspect a portion of the data frame where there are full matches and partial matches:
census.comparison[65:69, ]
Saving your intermediate work to a file is often good practice, so we will write the results of our merge to a CSV (you can do this with any data frame you create in R).
write.csv(census.comparison, 'census_comparison_result.csv', row.names = FALSE)
Let's filter our merged data frame down to just Orleans Parish. Orleans Parish and the city of New Orleans are "coterminous" (that is, they share the same boundaries), so this will isolate only the census tracts of the city.
# note the use of "==" since we are expressing a criterion
orleans <- filter(census.comparison, parish == 'Orleans Parish')
head(orleans)
Now we can do some quick calculations with our new merged data frame for New Orleans.
First question: What was the population of New Orleans in 2000?
That requires summing up the 2000 population column like so:
sum(orleans$population.00)
Why didn't this work? Let's inspect the population.00 variable using summary:
summary(orleans$population.00)
This reveals that there are 30 census tracts with NA, or missing, values for population.00. By default, R does not compute the sum of a column if there are missing values. We'll have to tell it to ignore those missing values by specifying na.rm = TRUE:
sum(orleans$population.00, na.rm = TRUE)
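You can verify this behavior on a toy vector (another illustration of ours):

sum(c(1, 2, NA))               # returns NA because of the missing value
sum(c(1, 2, NA), na.rm = TRUE) # returns 3 after dropping the NA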
Second question: What was the population of New Orleans in 2010?
sum(orleans$population.10, na.rm = TRUE)
This matches the story exactly.
Last question: What was the percent change in New Orleans population between 2000 and 2010?
To do this, we'll first save each population calculation to new objects. Then we'll create another object to store the percent change.
nola2000pop <- sum(orleans$population.00, na.rm = TRUE)
nola2010pop <- sum(orleans$population.10, na.rm = TRUE)
perc.change.nola <- (nola2010pop - nola2000pop)/nola2000pop * 100
print(paste('The percent change in New Orleans population since 2000 is ', round(perc.change.nola), '%', sep = ''))
Again, we see that this matches the 29 percent drop cited in The Times-Picayune article (our number is negative because the population fell). Yay!
This concludes our workshop, Getting started with R.
We'll use the merged data CSV we saved above to do further analysis in our next workshop, More with R. Take a sneak peek at the notebook here.
Any questions?