Data Exploration¶

This walkthrough is distributed under a Creative Commons Attribution 4.0 International (CC BY 4.0) License.

Getting data into R¶

Ways to get data into R¶

In order to use your data in R, you must import it and turn it into an R object. There are many ways to get data into R.

Manually: You can manually create it as we did at the end of last session. To create a data.frame, use the data.frame() and specify your variables.
Import it from a file Below is a very incomplete list

Text: TXT (readLines() function)
Tabular data: CSV, TSV (read.table() function or readr package)
Excel: XLSX (xlsx package)
Google sheets: (googlesheets package)
Statistics program: SPSS, SAS (haven package)
Databases: MySQL (RMySQL package)

Gather it from the web: You can connect to webpages, servers, or APIs directly from within R, or you can create a data scraped from HTML webpages using the rvest package.

For example, connect to the Twitter API with the twitteR package, or Altmetrics data with rAltmetric, or World Bank's World Development Indicators with WDI.

`readr`¶

R has some base functions for reading a local data file into your R session--namely read.table() and read.csv(), but these have some idiosyncrasies that were improved upon in the readr package, which is installed and loaded with tidyverse. You can either load tidyverse, which will automatically load readr, or you can load readr individually.

In [ ]:

library(readr)
library(purrr)
library(dplyr)

For this session, we will be reading a CSV from a web connection rather than saving the data to our computer and loading it into R. However, to do that, see the below section on Loading data from a local file.

To get our sample data into our R session, we will use the read_csv() function and connect to a CSV saved on my GitHub using the url() function.

In [ ]:

books_url <- url("https://raw.githubusercontent.com/ciakovx/ciakovx.github.io/master/data/books.csv")
books <- readr::read_csv(books_url)
books

You will notice a warning message telling you that because you did not specify the data type for each column, read_csv() parsed it automatically. For example, LOCATION was parsed as a col_character() field. You should now have an R object called books in the Environment pane: 5991 observations of 12 variables. We will be using this data file in the next module.

Data exploration¶

After you read in the data, you want to examine it not only to make sure it was read in correctly, but also to gather some basic information about it. Here I am working with the data file that was provided to you during the webinar session. Read this file by saving it to your computer, setting your working directory, and typing the expression found in the TRY IT YOURSELF exercise at the end of Section 2.

Exploring dataframes¶

In [ ]:

# Use dim() to obtain the dimensions
dim(books)

In [ ]:

# print the column names
names(books)

In [ ]:

# nrow() is number of rows. 
# ncol() is the number of columns
nrow(books)
ncol(books)

In [ ]:

# Use head() and tail() to get the first and last 6 observations
# View more by adding the n argument
head(books)
head(books, n = 10)

The map() series of function from purrr is a useful way of running a function on all variables in a data frame or list. Here we call class() on books using map_chr(), which will return a character vector of the classes for each variable.

In [ ]:

map_chr(books, class)

TOT.CHKOUT (the total number of checkouts) was read into R as numeric. Everything else was read in as character.

Exploring variables¶

Dollar sign¶

The dollar sign $ is used to distinguish a specific variable (column, in Excel-speak) in a data frame:

In [ ]:

# print the first six book titles
head(books$X245.ab)

# print the mean number of checkouts
mean(books$TOT.CHKOUT)

`unique()`, `table()`, and `duplicated()`¶

Use unique() to see all the distinct values in a variable:

In [ ]:

unique(books$LOCATION)

Take that one step further with table() to get quick frequency counts on a variable:

In [ ]:

table(books$LOCATION)

# you can use it with relational operators 
# Here we find that 9 books have over 50 checkouts
table(books$TOT.CHKOUT > 50)

duplicated() will give you the a logical vector of duplicated values.

In [ ]:

# The books dataset doesn't have much duplication, we'll create a new vector to test this.
mydupes <- tibble("identifier" = c("111", "222", "111", "333", "444"),
                  "birthYear" = c(1980, 1940, 1980, 2000, 1960))

mydupes

In [ ]:

# The second 111 is duplicated
duplicated(mydupes$identifier)

In [ ]:

# you can put an exclamation mark before it to get non-duplicated values
!duplicated(mydupes$identifier)

In [ ]:

# or run a table of duplicated values
table(duplicated(mydupes$identifier))

In [ ]:

# which() is also a useful function for identifying the specific element
# in the vector that is duplicated
which(duplicated(mydupes$identifier))

Exploring missing values¶

You may also need to know the number of missing values:

In [ ]:

# How many total missing values?
sum(is.na(books))

In [ ]:

# Total missing values per column
colSums(is.na(books))

In [ ]:

# use table() and is.na() in combination
table(is.na(books$ISN))

In [ ]:

# Return only observations that have no missing values
booksNoNA <- na.omit(books)

TRY IT YOURSELF

In [ ]:

# What is the class() of the TOT.CHKOUT variable?

In [ ]:

# The publication date variable is represented with X008.Date.One. 
# Use the table() function to get frequency counts of values in this variable

In [ ]:

# Use table() and is.na() to find out how many NA values are in the ISN variable.

In [ ]:

# Use which() and is.na() to find out which rows are NA in the ISN variable. What happened?

In [ ]:

# Use which() and !is.na() to find out which rows are **not** NA in ISN. 
# tip: !is.na() is the same thing as complete.cases()

In [ ]:

# Use brackets [] to subset the books$ISN vector along with !is.na() to 
# include only those values that are not NA.

In [ ]:

# Call summary(books$TOT.CHKOUT). 
# What can we infer when we compare the mean, median, and max?

In [ ]:

# hist() will print a rudimentary histogram, which displays frequency counts. 
# Call hist(books$TOT.CHKOUT). What is this telling us?

Logical tests¶

R contains a number of operators you can use to compare values. Use help(Comparison) to read the R help file. Note that two equal signs (==) are used for evaluating equality (because one equals sign (=) is used for assigning variables).

In [ ]:

help(Comparison)

Sometimes you need to do multiple logical tests (think Boolean logic). Use help(Logic) to read the help file.

In [ ]:

help(Logic)

TRY IT YOURSELF

Evaluate the following expressions yourself and consider the results.

In [ ]:

8 == 16

In [ ]:

8 < 16

In [ ]:

8 == 16 - 8

In [ ]:

8 == 8 & 8 < 16

In [ ]:

8 == 8 | 8 > 16

In [ ]:

8 == 9 | 8 > 16

In [ ]:

any(8 == 8, 8 == 9, 8 == 10)

In [ ]:

all(8 == 8, 8 == 9, 8 == 10)

In [ ]:

8 %in% c(6, 7, 8)

In [ ]:

8 %in% c(5, 6, 7)

In [ ]:

!(8 %in% c(5, 6, 7))

In [ ]:

if(8 == 8) {print("eight equals eight")}

In [ ]:

if(8 > 16){
  print("eight is greater than sixteen")
} else {
  print("eight is less than sixteen")
}

Data cleaning & transformation with dplyr¶

We are now entering the data cleaning and transforming phase. While it is possible to do much of the following using Base R functions (in other words, without loading an external package) dplyr makes it much easier. Like many of the most useful R packages, dplyr was developed by [http://hadley.nz/](Hadley Wickham), a data scientist and professor at Rice University.

Renaming variables¶

It is often necessary to rename variables to make them more meaningful. If you print the names of the sample books dataset you can see that some of the vector names are not particularly helpful:

In [ ]:

# print names of the books data frame to the console
names(books)

There are many ways to rename variables in R, but I find the rename() function in the dplyr package to be the easiest and most straightforward. The new variable name comes first. See help(rename).

In [ ]:

# rename the X245.ab variable. Make sure you return (<-) the output to your 
# variable, otherwise it will just print it to the console
books <- dplyr::rename(books,
                       title = X245.ab)

# rename multiple variables at once
books <- dplyr::rename(books,
                       author = X245.c,
                       callnumber = CALL...BIBLIO.,
                       isbn = ISN,
                       pubyear = X008.Date.One,
                       subCollection = BCODE1,
                       format = BCODE2)
books

Side note: where does X245.ab come from? That is the MARC field 245|ab. However, because R variables cannot start with a number, R automatically inserted and X, and because pipes | are not allowed in variable names, R replaced it with a period.

R does this automatically when you use read_csv, but sometimes you need to force it. The clean_names() function from janitor can be used to clean names of data frames. As per the help file, "Resulting names are unique and consist only of the _ character, numbers, and letters."

In [ ]:

# print column names
names(books)

# use clean_names() to make them easier to work with
books <- janitor::clean_names(books)
names(books)

Recoding values¶

It is often necessary to recode or reclassify values in your data. For example, in the sample dataset provided to you, the sub_collection (formerly BCODE1) and format (formerly BCODE2) variables contain single characters.

You can do this easily using the recode() function, also in the dplyr package.

In [ ]:

# first print to the console all of the unique values you will need to recode
unique(books$sub_collection)
unique(books$format)

In [ ]:

# Use the recode function to assign them. 
# Unlike rename, the old value comes first here. 
books$sub_collection <- dplyr::recode(books$sub_collection,
                                      "-" = "general collection",
                                      u = "government documents",
                                      r = "reference",
                                      b = "k-12 materials",
                                      j = "juvenile",
                                      s = "special collections",
                                      c = "computer files",
                                      t = "theses",
                                      a = "archives",
                                      z = "reserves")
unique(books$sub_collection)

# you can do the same to recode formats:
books$format <- dplyr::recode(books$format,
                              a = "book",
                              e = "serial",
                              w = "microform",
                              s = "e-gov doc",
                              o = "map",
                              n = "database",
                              k = "cd-rom",
                              m = "image",
                              "5" = "kit/object",
                              "4" = "online video")

unique(books$format)

Subsetting dataframes¶

Subsetting using brackets in Base R¶

In the same way we used brackets to subset vectors, we also use them to subset dataframes. However, vectors have only one direction, but dataframes have two. Therefore we have to use two values in the brackets: the first representing the row, and the second representing the column: [[row, column]].

When using tibbles, single brackets will return a tibble, but double brackets will return the individual vectors or values without names.

In [ ]:

# subsetting a vector 
c("do", "re", "mi", "fa", "so") [1]

In [ ]:

# subsetting a data frame:
# pull a single variable into a tibble with names
books[5, 2]

In [ ]:

# return the second value in multiple columns
books[2, c("title", "format", "sub_collection")]

In [ ]:

# leave the row space blank to return all rows. 
# This will give you the titles column only
myTitles <- books[, "title"]

You can also use the relational operators (see above) to return values based on a logical condition. Below, we use complete.cases() to subset out the NA values. This function evaluates each row and checks if it is an NA value. If it is not, a TRUE value is returned, and vice versa. Since the column space is blank in the brackets [row, column], it will return all columns.

In [ ]:

completeCallnumbers <- books[complete.cases(books$callnumber), ]
completeCallnumbers

Subsetting using `filter()` in the `dplyr` package¶

Subsetting using brackets is important to understand, but as with other R functions, the dplyr package makes it much more straightforward, using the filter() function.

In [ ]:

# filter books to return only those items where the format is books
booksOnly <- dplyr::filter(books, format == "book")

In [ ]:

# use multiple filter conditions, 
# e.g. books to include only books with more than zero checkouts
bookCheckouts <- dplyr::filter(books,
                               format == "book",
                               tot_chkout > 0)

In [ ]:

# How many books have checkouts?
nrow(bookCheckouts)

In [ ]:

# Divide the number of rows of books with checkouts by the number of rows of books 
# then multiply by 100 to get the percent of books with checkouts
nrow(bookCheckouts)/nrow(booksOnly) * 100

In [ ]:

# use the str_detect() function from the stringr package to return all 
# books with the word "Science" in the SUBJECT variable, published after 1999
scienceBooks <- dplyr::filter(books,
                              format == "book",
                              stringr::str_detect(subject, "Science"), 
                              pubyear > 1999) 
scienceBooks

TRY IT YOURSELF

In [ ]:

# Run unique(books$format) and unique(books$sub_collection) to 
# confirm the values in each of these fields
unique(books$format)

In [ ]:

# Use filter() on books to create a data frame consisting only of format serials.
# Assign it to a value called serials

# How many rows are there?

# Use sum() to compute the number of total checkouts of serials

In [ ]:

# Use filter() to create a data frame consisting of format books and sub_collection juvenile materials. 
# Assign it to a value called juv

# What is the average number of checkouts tot_chkout for juvenile books?

# Filter the data frame to include books with between 10 and 20 checkouts

Selecting variables¶

The select() function allows you to keep or remove specific variables. It also provides a convenient way to reorder variables.

In [ ]:

# specify the variables you want to keep by name
booksTitleCheckouts <- dplyr::select(books, title, tot_chkout)

In [ ]:

# specify the variables you want to remove with a -
books <- dplyr::select(books, -call_item)

In [ ]:

# reorder columns, combined with everything()
booksReordered <- dplyr::select(books, title, tot_chkout, loutdate, everything())
booksReordered

Ordering data¶

The arrange() function in the dplyr package allows you to sort your data by alphabetical or numerical order.

In [ ]:

booksTitleArrange <- dplyr::arrange(books, title)

In [ ]:

# use desc() to sort a variable in descending order
booksHighestChkout <- dplyr::arrange(books, desc(tot_chkout))

In [ ]:

# order data based on multiple variables (e.g. sort first by checkout, then by publication year)
booksChkoutYear <- dplyr::arrange(books, desc(tot_chkout), desc(pubyear))

Creating new variables¶

The mutate() function allows you to create new variables.

In [ ]:

head(books$callnumber)

# use mutate to create a new column lc_class
# use the str_sub() function from the stringr package to extract the first character of the callnumber variable (the LC Class)

booksLC <- mutate(books
                  , lc_class = stringr::str_sub(callnumber, 1, 1))

head(booksLC$lc_class)

Putting it all together with %>%¶

The Pipe Operator %>% is loaded with the tidyverse. It takes the output of one statement and makes it the input of the next statement. You can think of it as "then" in natural language. So in the following example, the books tibble is first loaded, then the format is filtered to include only books, then only the title and tot_chkout columns are selected, and finally the data is rearranged from most to least checkouts.

In [ ]:

myBooks <- books %>%
  dplyr::filter(format == "book") %>%
  dplyr::select(title, tot_chkout) %>%
  dplyr::arrange(desc(tot_chkout))
myBooks

TRY IT YOURSELF

Experiment with different combinations of the pipe operator.

Create a data frame with these conditions:
- filter to include subCollection juvenile & k-12 materials and format books
- select only title, call number, total checkouts, and pub year
- arrange by total checkouts in descending order

In [ ]:

Create a data frame with these conditions:
- rename the isbn column to all caps: ISBN
- filter out NA values in the call number column
- filter to include only books published after 1990
- arrange from oldest to newest publication year

In [ ]:

Help with dplyr¶

Read more about dplyr at https://dplyr.tidyverse.org/.
In your console, after loading library(dplyr), run vignette("dplyr") to read an extremely helpful explanation of how to use it.
See the [http://r4ds.had.co.nz/transform.html]("Data Transformation" chapter) in Garrett Grolemund and Hadley Wickham's book R for Data Science.
Watch this Data School video: [https://www.youtube.com/watch?v=jWjqLW-u3hc](Hands-on dplyr tutorial for faster data manipulation in R.)

Data Exploration¶

Getting data into R¶

Ways to get data into R¶

readr¶

Data exploration¶

Exploring dataframes¶

Exploring variables¶

Dollar sign¶

unique(), table(), and duplicated()¶

Exploring missing values¶

Logical tests¶

Data cleaning & transformation with dplyr¶

Renaming variables¶

Recoding values¶

Subsetting dataframes¶

Subsetting using brackets in Base R¶

Subsetting using filter() in the dplyr package¶

Selecting variables¶

Ordering data¶

Creating new variables¶

Putting it all together with %>%¶

Help with dplyr¶

`readr`¶

`unique()`, `table()`, and `duplicated()`¶

Subsetting using `filter()` in the `dplyr` package¶