Lecture 3: Data Visualization

We've now learned the basics of R and how to manipulate and clean data with it. Let's dive into a field for which R is arguably most useful: data visualization.

In this lecture we will learn:

  1. Why data visualization is important,
  2. Notable techniques used to visualize data,
  3. The challenges of data visualization, and
  4. Useful visual tools available in R.

Why is Data Visualization Important?

Visualizing data is crucial for communicating ideas. We process visual information more readily and easily than abstract information. Since much of the output of data analytics is abstract, visualization both makes complex patterns easy to digest and helps present the resulting insights to audiences from non-technical backgrounds.

Many avoid data visualization because the process can be time-consuming, and good visuals are perceived to be hard to make. But many latent trends in a dataset only become noticeable through visualization, and skipping it can lead to shortsighted model and parameter selection. When in doubt, visualize!

Note that there are typically two types of visualizations: distributional (using histograms or box plots to assess the distribution of a single variable) and correlational (using line plots or scatter plots to understand the relationship between two variables), as sketched below.
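As a quick illustration of the two types, here is a minimal sketch using the built-in mtcars dataset (an illustrative example, not part of the original notes):

In [ ]:
# Distributional vs. correlational views of the built-in mtcars data
par(mfrow = c(1, 2))                        # two plots side by side
hist(mtcars$mpg, main = "Distributional: histogram of mpg", xlab = "mpg")
plot(mtcars$wt, mtcars$mpg,                 # two variables against each other
     main = "Correlational: mpg vs. weight",
     xlab = "Weight (1000 lbs)", ylab = "mpg")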

The process of data visualization usually works in the following fashion:

  • Simple data analysis (correlations, summaries)
  • Data visualization
  • Identification of patterns
  • Secondary analysis or implementation

Let's see how this works using the built-in iris dataset in R, made famous by a classic paper of R.A. Fisher [1].

In [1]:
data(iris)      # load the built-in dataset
str(iris)       # structure: types and first few values of each column
summary(iris)   # numeric summaries; counts for the factor column
'data.frame':	150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
       Species  
 setosa    :50  
 versicolor:50  
 virginica :50  

This dataset is a good example of a classification problem: we can use it to train an algorithm that predicts Species from Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width.

First Step: Correlational Analysis

In [2]:
cor(iris[, 1:4])   # pairwise correlations of the four numeric features
             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000

We can see a weak negative correlation between the two sepal features and a strong positive correlation between the two petal features. However, linear correlation reveals little about the actual structure of the data, as will be shown below.
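When there are many features, the correlation matrix itself can be visualized as a heatmap. A minimal sketch using base R's heatmap function (an illustrative aside, not part of the original notes):

In [ ]:
# Visualize the correlation matrix as a heatmap. Rowv/Colv = NA suppress
# the default row/column reordering and dendrograms; scale = "none" keeps
# the raw correlation values.
heatmap(cor(iris[, 1:4]), Rowv = NA, Colv = NA, symm = TRUE,
        scale = "none", margins = c(8, 8), main = "Correlation heatmap")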

Second Step: Visualization

In [16]:
plot(iris$Sepal.Length, iris$Sepal.Width,
     main = "The Sepal features", xlab = "Sepal Length", ylab = "Sepal Width")

We can see that the negative linear correlation is in fact not an apt representation of the data; it is better understood in terms of clusters. To further assess the pattern present in our data, we color-code the points by species using the ggplot2 package.

In [3]:
library(ggplot2)
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point()

It is noticeable that:
1. There is a clear clustering behavior for setosa.
2. Versicolor and virginica are not clearly separated.

This kind of information is valuable in deciding what kind of model to choose and what additional analysis needs to take place before we are sure of what to do with our dataset. It is possible, however, to take the visual analysis further by adding another feature in a 3D plot. We will use the scatterplot3d package for this.

In [6]:
#install.packages("scatterplot3d", repos='http://cran.us.r-project.org')
library(scatterplot3d)
library(dplyr)

iris_mutated <- mutate(iris, Color = as.character(Species))
iris_mutated$Color <- sapply(iris_mutated$Color, function(x) {
  # map each species to a plotting color
  if (x == "setosa") {
    "red"
  } else if (x == "versicolor") {
    "blue"
  } else {
    "green"
  }
})

with(iris_mutated, {
  scatterplot3d(Sepal.Length,   # x axis
                Sepal.Width,    # y axis
                Petal.Length,   # z axis
                main = "3-D Scatterplot",
                color = Color)
})

Looking at this 3D scatterplot, we can see that versicolor and virginica are actually much more separable than our previous 2D plot indicated. We can therefore conclude that these three features are enough to implement an effective classifier.
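Before committing to a subset of features, it can also help to scan all pairwise combinations at once with a scatterplot matrix. A minimal sketch using base R's pairs function (an illustrative aside, not part of the original notes):

In [ ]:
# Scatterplot matrix of all four features, colored by species
# (species levels 1..3 index into the color vector)
pairs(iris[, 1:4],
      col = c("red", "blue", "green")[as.numeric(iris$Species)],
      main = "Pairwise feature combinations")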

Common Visualization Technique: Density Plots

While histograms are popular, density plots are often favored for two reasons:

  1. A histogram's shape varies wildly depending on the bin size, and
  2. Density plots smooth out outliers and local fluctuations.

The second point can be a weakness, however, since local fluctuations can be very important. Let's look at an example.

In [74]:
par(mfrow = c(1, 2))   # place the two plots side by side
plot(density(iris$Sepal.Width), main = "density plot")
hist(iris$Sepal.Width, breaks = 100, main = "histogram")

The sensitivity of the density plot to fluctuations can be tuned through its smoothing bandwidth, which is the bw argument of density().
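A minimal sketch of the effect (the specific bw values here are illustrative, not from the original notes):

In [ ]:
# Smaller bandwidths follow local fluctuations; larger ones smooth them out.
# By default, density() chooses the bandwidth automatically (via bw.nrd0).
par(mfrow = c(1, 3))
plot(density(iris$Sepal.Width, bw = 0.05), main = "bw = 0.05 (undersmoothed)")
plot(density(iris$Sepal.Width), main = "default bandwidth")
plot(density(iris$Sepal.Width, bw = 0.5), main = "bw = 0.5 (oversmoothed)")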

The next part of the lecture notes uses a dataset from Yelp: more specifically, businesses in Pittsburgh.

Advanced Visualization Techniques: Heatmaps, Contour Plots, and Maps

In [41]:
setwd("~/Yelp/business_data")
new_bz <- jsonlite::stream_in(file("new_bz.json"))
pitt <- filter(new_bz, city == "Pittsburgh")
opening file input connection.
 Imported 85901 records. Simplifying...
closing file input connection.

Using Maps to Visualize

The dataset above has longitude and latitude data. In cases like this, it is often worthwhile to put the data points on a map. A map provides several advantages:

  1. It gives context to patterns or clusters present within the data,
  2. It can provide information that explains outliers, and
  3. It facilitates explanations of trends that have a real-life counterpart.

Maps can be obtained using the qmap function from the ggmap package.

Note: The newest versions of ggmap and ggplot2 (ggmap 2.6.1, ggplot2 2.2.0) might cause errors when you try to plot the map. If this happens, install the ggmap package directly from the developer's repo:

In [ ]:
# installing ggmap directly from the developer's GitHub repo
install.packages("devtools")
library(devtools)                # provides install_github()
install_github("dkahle/ggmap")
In [36]:
library(ggmap)
pitt_map <- qmap("Pittsburgh", zoom = 12, maptype = "roadmap")
pitt_map
Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=Pittsburgh&zoom=12&size=640x640&scale=2&maptype=roadmap&language=en-EN&sensor=false
Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Pittsburgh&sensor=false
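With the base map in hand, the businesses can be overlaid as points. A minimal sketch, assuming pitt stores its coordinates in columns named longitude and latitude (the usual Yelp business schema):

In [ ]:
# qmap() returns a ggplot object, so ggplot2 layers can be added directly.
# Column names longitude/latitude are assumed from the Yelp business schema.
pitt_map +
  geom_point(data = pitt, aes(x = longitude, y = latitude),
             color = "red", alpha = 0.3, size = 1)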