# Lecture 3: Data Visualization

We've now learned the basics of R and how to manipulate and clean data with R. Let's dive into a field for which R is arguably most useful: data visualization.

In this lecture we will learn:

1. Why data visualization is important,
2. Notable techniques used to visualize data,
3. The challenges of data visualization, and
4. Useful visual tools available in R.

## Why is Data Visualization Important?

Visualizing data is crucial in communicating ideas. We more readily and easily process information that is visual rather than abstract in nature. Since much of the output that arises from data analytics is abstract, visualization allows both easy digestion of complex patterns and presentation of consequent insight to those from non-technical backgrounds.

Many avoid data visualization because the process can be time-consuming, and good visuals are perceived to be hard to make. But many latent trends in a dataset can only be made noticeable via visualization. Not visualizing at all can result in a lack of foresight when it comes to model and parameter selection. When in doubt, visualize!

Note that there are typically two types of visualizations: distributional (using histograms or box plots to assess the distribution of a variable) and correlational (using line plots or scatter plots to understand the relationship between two variables).
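As a minimal sketch of these two types, using simulated data (the variable names here are illustrative, not from the lecture dataset):

```r
# Two simulated variables: y depends linearly on x, plus noise
x <- rnorm(200)
y <- 2 * x + rnorm(200)

par(mfrow = c(1, 2))
hist(x, main = "Distributional: histogram of x")   # distribution of one variable
plot(x, y, main = "Correlational: y against x")    # relationship between two variables
```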

The process of data visualization usually works in the following fashion:

• Simple data analysis (correlations, summary statistics)
• Data visualization
• Identification of pattern
• Secondary analysis or implementation

Let's see how this works using the built-in iris dataset in R. This dataset is based on a famous experiment conducted by R.A. Fisher [1].

In [1]:
data(iris)
str(iris)
summary(iris)

'data.frame':	150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
 Median :5.800   Median :3.000   Median :4.350   Median :1.300
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500

       Species
 setosa    :50
 versicolor:50
 virginica :50

This dataset is a good example of a classification problem: we can use it to train an algorithm that predicts Species given the sepal and petal lengths and widths.

### First Step: Correlational Analysis

In [2]:
cor(iris[,1:4])

             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000

We can see that there is a negative correlation between the two sepal features and a positive correlation between the two petal features. However, linear correlation reveals little of the actual dynamics of the data, as will be shown below.

### Second Step: Visualization

In [16]:
plot(iris$Sepal.Length, iris$Sepal.Width,
     main = "The Sepal features",
     xlab = "Sepal Length", ylab = "Sepal Width")

We can see that the negative linear correlation is in fact not an apt representation of the data; it is better understood in terms of clusters. To further assess the pattern present in our data, we color-code the points by species using the ggplot2 package.

In [3]:
library(ggplot2)
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point()

It is noticeable that:

1. There is a clear clustering behavior for setosa.
2. Versicolor and virginica are not clearly separated.

This kind of information is valuable in deciding what kind of model to choose, and what additional analysis needs to take place before we are sure of what to do with our dataset. It is possible, however, to take the visual analysis further by adding another feature in a 3d plot. We will use the scatterplot3d package for this.

In [6]:
# install.packages("scatterplot3d", repos = "http://cran.us.r-project.org")
library(scatterplot3d)
library(dplyr)

iris_mutated <- mutate(iris, Color = as.character(Species))
iris_mutated$Color <- sapply(iris_mutated$Color, function(x) {
  if (x == "setosa") return("red")
  if (x == "versicolor") return("blue")
  else return("green")
})
with(iris_mutated, {
  scatterplot3d(Sepal.Length,   # x axis
                Sepal.Width,    # y axis
                Petal.Length,   # z axis
                main = "3-D Scatterplot",
                color = Color)
})

Looking at this 3d scatterplot, we can see that versicolor and virginica are actually much more separable than our previous 2d plot indicated. We can therefore conclude that these three features are enough to implement an effective classifier.

## Common Visualization Technique: Density Function

While histograms are popular, density plots are favored for several reasons:

1. Histogram shape varies wildly depending on the bin size.
2. Density plots smooth out outliers and local fluctuations.

The second point can be a weakness, however, since local fluctuations can be very important. Let's look at an example.

In [74]:
par(mfrow = c(1, 2))
plot(density(iris$Sepal.Width), main = "density plot")
hist(iris$Sepal.Width, breaks = 100, main = "histogram")
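To see just how much a histogram's shape depends on the bin size, a quick sketch redraws the same variable with a few different values of `breaks` (the specific values here are arbitrary, chosen for illustration):

```r
# The same variable with three different bin counts; the shape changes markedly
par(mfrow = c(1, 3))
for (b in c(5, 20, 100)) {
  hist(iris$Sepal.Width, breaks = b, main = paste(b, "breaks"))
}
```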


The amount of smoothing in the density plot can be adjusted through the smoothing bandwidth, which controls how sensitive the estimate is to local fluctuations.
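The bandwidth is set via the `bw` argument of `density()`. As a small sketch (the bandwidth values here are arbitrary, chosen for illustration), smaller bandwidths follow local fluctuations more closely, while larger ones smooth them away:

```r
# Smaller bandwidths are more sensitive to local fluctuations
par(mfrow = c(1, 3))
for (b in c(0.05, 0.2, 0.5)) {
  plot(density(iris$Sepal.Width, bw = b), main = paste("bw =", b))
}
```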

The next part of the lecture notes will use a dataset from Yelp: specifically, businesses in Pittsburgh.

## Advanced Visualization Techniques: heatmaps, contour plots, and using maps

In [41]:
setwd("~/Yelp/business_data")
new_bz <- jsonlite::stream_in(file("new_bz.json"))
pitt <- filter(new_bz, city == "Pittsburgh")

opening file input connection.

 Imported 85901 records. Simplifying...

closing file input connection.
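As a sketch of the heatmap/contour idea from this section's title: with point coordinates like these, ggplot2 can draw contours of business density with geom_density2d, or a filled heatmap with stat_density2d. This assumes the pitt data frame loaded above, with longitude and latitude columns:

```r
library(ggplot2)

# Contour plot of business density over Pittsburgh
# (assumes pitt, with longitude and latitude columns, from the cell above)
ggplot(pitt, aes(x = longitude, y = latitude)) +
  geom_point(alpha = 0.2) +   # raw locations, faded so overlaps show
  geom_density2d()            # density contours on top
```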


## Using Maps to Visualize

The dataset above has longitude and latitude data. In cases like this, it is often worthwhile to put the data points on a map. A map provides several advantages:

1. It gives context to patterns or clusters present within the data.
2. It can provide information that explains outliers.
3. It facilitates explanations of trends that have a real-life counterpart.

Maps can be obtained using the qmap function from the ggmap package.

Note: The newest versions of ggmap and ggplot2 (ggmap 2.6.1, ggplot2 2.2.0) might cause errors when you try to plot the map. If you run into such errors, install the ggmap package directly from the developer's repo:

In [ ]:
# downloading ggmap directly from developer's github repo
install.packages("devtools")
library(devtools)
install_github("dkahle/ggmap")

In [36]:
library(ggmap)
pitt_map <- qmap("Pittsburgh", zoom = 12, maptype = "roadmap")
pitt_map

Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=Pittsburgh&zoom=12&size=640x640&scale=2&maptype=roadmap&language=en-EN&sensor=false
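Since qmap returns a ggplot object, further layers can be added to it like any other ggplot. As a sketch (assuming pitt_map and the pitt data frame from the cells above), the business locations can be overlaid on the base map:

```r
library(ggplot2)

# Overlay business locations on the base map
# (assumes pitt_map and pitt from the cells above)
pitt_map +
  geom_point(data = pitt,
             aes(x = longitude, y = latitude),
             alpha = 0.3, color = "red")
```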