This walkthrough is distributed under a Creative Commons Attribution 4.0 International (CC BY 4.0) License.
In order to use your data in R, you must import it and turn it into an R object. There are many ways to get data into R.
data.frame()
and specify your variables.readLines()
function)read.table()
function or readr
package)xlsx
package)googlesheets
package)haven
package)RMySQL
package)rvest
package.twitteR
package, or Altmetrics data with rAltmetric
, or World Bank's World Development Indicators with WDI
.readr
¶R has some base functions for reading a local data file into your R session--namely read.table()
and read.csv()
, but these have some idiosyncrasies that were improved upon in the readr
package, which is installed and loaded with tidyverse
. You can either load tidyverse
, which will automatically load readr
, or you can load readr
individually.
library(readr)
library(purrr)
library(dplyr)
For this session, we will be reading a CSV from a web connection rather than saving the data to our computer and loading it into R. However, to do that, see the below section on Loading data from a local file.
To get our sample data into our R session, we will use the read_csv()
function and connect to a CSV saved on my GitHub using the url()
function.
books_url <- url("https://raw.githubusercontent.com/ciakovx/ciakovx.github.io/master/data/books.csv")
books <- readr::read_csv(books_url)
books
You will notice a warning message telling you that because you did not specify the data type for each column, read_csv()
parsed it automatically. For example, LOCATION
was parsed as a col_character()
field. You should now have an R object called books
in the Environment pane: 5991 observations of 12 variables. We will be using this data file in the next module.
After you read in the data, you want to examine it not only to make sure it was read in correctly, but also to gather some basic information about it. Here I am working with the data file that was provided to you during the webinar session. Read this file by saving it to your computer, setting your working directory, and typing the expression found in the TRY IT YOURSELF exercise at the end of Section 2.
# Use dim() to obtain the dimensions
dim(books)
# print the column names
names(books)
# nrow() is number of rows.
# ncol() is the number of columns
nrow(books)
ncol(books)
# Use head() and tail() to get the first and last 6 observations
# View more by adding the n argument
head(books)
head(books, n = 10)
The map()
series of function from purrr
is a useful way of running a function on all variables in a data frame or list. Here we call class()
on books
using map_chr()
, which will return a character vector of the classes for each variable.
map_chr(books, class)
TOT.CHKOUT
(the total number of checkouts) was read into R as numeric
. Everything else was read in as character
.
# print the first six book titles
head(books$X245.ab)
# print the mean number of checkouts
mean(books$TOT.CHKOUT)
unique()
, table()
, and duplicated()
¶Use unique()
to see all the distinct values in a variable:
unique(books$LOCATION)
Take that one step further with table()
to get quick frequency counts on a variable:
table(books$LOCATION)
# you can use it with relational operators
# Here we find that 9 books have over 50 checkouts
table(books$TOT.CHKOUT > 50)
duplicated()
will give you the a logical vector of duplicated values.
# The books dataset doesn't have much duplication, we'll create a new vector to test this.
mydupes <- tibble("identifier" = c("111", "222", "111", "333", "444"),
"birthYear" = c(1980, 1940, 1980, 2000, 1960))
mydupes
# The second 111 is duplicated
duplicated(mydupes$identifier)
# you can put an exclamation mark before it to get non-duplicated values
!duplicated(mydupes$identifier)
# or run a table of duplicated values
table(duplicated(mydupes$identifier))
# which() is also a useful function for identifying the specific element
# in the vector that is duplicated
which(duplicated(mydupes$identifier))
You may also need to know the number of missing values:
# How many total missing values?
sum(is.na(books))
# Total missing values per column
colSums(is.na(books))
# use table() and is.na() in combination
table(is.na(books$ISN))
# Return only observations that have no missing values
booksNoNA <- na.omit(books)
TRY IT YOURSELF
# What is the class() of the TOT.CHKOUT variable?
# The publication date variable is represented with X008.Date.One.
# Use the table() function to get frequency counts of values in this variable
# Use table() and is.na() to find out how many NA values are in the ISN variable.
# Use which() and is.na() to find out which rows are NA in the ISN variable. What happened?
# Use which() and !is.na() to find out which rows are **not** NA in ISN.
# tip: !is.na() is the same thing as complete.cases()
# Use brackets [] to subset the books$ISN vector along with !is.na() to
# include only those values that are not NA.
# Call summary(books$TOT.CHKOUT).
# What can we infer when we compare the mean, median, and max?
# hist() will print a rudimentary histogram, which displays frequency counts.
# Call hist(books$TOT.CHKOUT). What is this telling us?
R contains a number of operators you can use to compare values. Use help(Comparison)
to read the R help file. Note that two equal signs (==
) are used for evaluating equality (because one equals sign (=
) is used for assigning variables).
help(Comparison)
Sometimes you need to do multiple logical tests (think Boolean logic). Use help(Logic)
to read the help file.
help(Logic)
TRY IT YOURSELF
8 == 16
8 < 16
8 == 16 - 8
8 == 8 & 8 < 16
8 == 8 | 8 > 16
8 == 9 | 8 > 16
any(8 == 8, 8 == 9, 8 == 10)
all(8 == 8, 8 == 9, 8 == 10)
8 %in% c(6, 7, 8)
8 %in% c(5, 6, 7)
!(8 %in% c(5, 6, 7))
if(8 == 8) {print("eight equals eight")}
if(8 > 16){
print("eight is greater than sixteen")
} else {
print("eight is less than sixteen")
}
We are now entering the data cleaning and transforming phase. While it is possible to do much of the following using Base R functions (in other words, without loading an external package) dplyr
makes it much easier. Like many of the most useful R packages, dplyr
was developed by [http://hadley.nz/](Hadley Wickham), a data scientist and professor at Rice University.
It is often necessary to rename variables to make them more meaningful. If you print the names of the sample books
dataset you can see that some of the vector names are not particularly helpful:
# print names of the books data frame to the console
names(books)
There are many ways to rename variables in R, but I find the rename()
function in the dplyr
package to be the easiest and most straightforward. The new variable name comes first. See help(rename)
.
# rename the X245.ab variable. Make sure you return (<-) the output to your
# variable, otherwise it will just print it to the console
books <- dplyr::rename(books,
title = X245.ab)
# rename multiple variables at once
books <- dplyr::rename(books,
author = X245.c,
callnumber = CALL...BIBLIO.,
isbn = ISN,
pubyear = X008.Date.One,
subCollection = BCODE1,
format = BCODE2)
books
Side note: where does X245.ab
come from? That is the MARC field 245|ab. However, because R variables cannot start with a number, R automatically inserted and X, and because pipes | are not allowed in variable names, R replaced it with a period.
R does this automatically when you use read_csv
, but sometimes you need to force it. The clean_names()
function from janitor
can be used to clean names of data frames. As per the help file, "Resulting names are unique and consist only of the _ character, numbers, and letters."
# print column names
names(books)
# use clean_names() to make them easier to work with
books <- janitor::clean_names(books)
names(books)
It is often necessary to recode or reclassify values in your data. For example, in the sample dataset provided to you, the sub_collection
(formerly BCODE1
) and format
(formerly BCODE2
) variables contain single characters.
You can do this easily using the recode()
function, also in the dplyr
package.
# first print to the console all of the unique values you will need to recode
unique(books$sub_collection)
unique(books$format)
# Use the recode function to assign them.
# Unlike rename, the old value comes first here.
books$sub_collection <- dplyr::recode(books$sub_collection,
"-" = "general collection",
u = "government documents",
r = "reference",
b = "k-12 materials",
j = "juvenile",
s = "special collections",
c = "computer files",
t = "theses",
a = "archives",
z = "reserves")
unique(books$sub_collection)
# you can do the same to recode formats:
books$format <- dplyr::recode(books$format,
a = "book",
e = "serial",
w = "microform",
s = "e-gov doc",
o = "map",
n = "database",
k = "cd-rom",
m = "image",
"5" = "kit/object",
"4" = "online video")
unique(books$format)
In the same way we used brackets to subset vectors, we also use them to subset dataframes. However, vectors have only one direction, but dataframes have two. Therefore we have to use two values in the brackets: the first representing the row, and the second representing the column: [[row, column]]
.
When using tibbles, single brackets will return a tibble, but double brackets will return the individual vectors or values without names.
# subsetting a vector
c("do", "re", "mi", "fa", "so") [1]
# subsetting a data frame:
# pull a single variable into a tibble with names
books[5, 2]
# return the second value in multiple columns
books[2, c("title", "format", "sub_collection")]
# leave the row space blank to return all rows.
# This will give you the titles column only
myTitles <- books[, "title"]
You can also use the relational operators (see above) to return values based on a logical condition. Below, we use complete.cases()
to subset out the NA values. This function evaluates each row and checks if it is an NA value. If it is not, a TRUE value is returned, and vice versa. Since the column space is blank in the brackets [row, column]
, it will return all columns.
completeCallnumbers <- books[complete.cases(books$callnumber), ]
completeCallnumbers
filter()
in the dplyr
package¶Subsetting using brackets is important to understand, but as with other R functions, the dplyr
package makes it much more straightforward, using the filter()
function.
# filter books to return only those items where the format is books
booksOnly <- dplyr::filter(books, format == "book")
# use multiple filter conditions,
# e.g. books to include only books with more than zero checkouts
bookCheckouts <- dplyr::filter(books,
format == "book",
tot_chkout > 0)
# How many books have checkouts?
nrow(bookCheckouts)
# Divide the number of rows of books with checkouts by the number of rows of books
# then multiply by 100 to get the percent of books with checkouts
nrow(bookCheckouts)/nrow(booksOnly) * 100
# use the str_detect() function from the stringr package to return all
# books with the word "Science" in the SUBJECT variable, published after 1999
scienceBooks <- dplyr::filter(books,
format == "book",
stringr::str_detect(subject, "Science"),
pubyear > 1999)
scienceBooks
TRY IT YOURSELF
# Run unique(books$format) and unique(books$sub_collection) to
# confirm the values in each of these fields
unique(books$format)
# Use filter() on books to create a data frame consisting only of format serials.
# Assign it to a value called serials
# How many rows are there?
# Use sum() to compute the number of total checkouts of serials
# Use filter() to create a data frame consisting of format books and sub_collection juvenile materials.
# Assign it to a value called juv
# What is the average number of checkouts tot_chkout for juvenile books?
# Filter the data frame to include books with between 10 and 20 checkouts
The select()
function allows you to keep or remove specific variables. It also provides a convenient way to reorder variables.
# specify the variables you want to keep by name
booksTitleCheckouts <- dplyr::select(books, title, tot_chkout)
# specify the variables you want to remove with a -
books <- dplyr::select(books, -call_item)
# reorder columns, combined with everything()
booksReordered <- dplyr::select(books, title, tot_chkout, loutdate, everything())
booksReordered
The arrange()
function in the dplyr
package allows you to sort your data by alphabetical or numerical order.
booksTitleArrange <- dplyr::arrange(books, title)
# use desc() to sort a variable in descending order
booksHighestChkout <- dplyr::arrange(books, desc(tot_chkout))
# order data based on multiple variables (e.g. sort first by checkout, then by publication year)
booksChkoutYear <- dplyr::arrange(books, desc(tot_chkout), desc(pubyear))
The mutate()
function allows you to create new variables.
head(books$callnumber)
# use mutate to create a new column lc_class
# use the str_sub() function from the stringr package to extract the first character of the callnumber variable (the LC Class)
booksLC <- mutate(books
, lc_class = stringr::str_sub(callnumber, 1, 1))
head(booksLC$lc_class)
The Pipe Operator %>%
is loaded with the tidyverse
. It takes the output of one statement and makes it the input of the next statement. You can think of it as "then" in natural language. So in the following example, the books
tibble is first loaded, then the format is filtered to include only books, then only the title and tot_chkout columns are selected, and finally the data is rearranged from most to least checkouts.
myBooks <- books %>%
dplyr::filter(format == "book") %>%
dplyr::select(title, tot_chkout) %>%
dplyr::arrange(desc(tot_chkout))
myBooks
TRY IT YOURSELF
Experiment with different combinations of the pipe operator.
dplyr
at https://dplyr.tidyverse.org/.library(dplyr)
, run vignette("dplyr")
to read an extremely helpful explanation of how to use it.