Lecture 1: Introduction to R, a statistical programming language

Welcome to the Cornell Data Science Training program! Through these lectures and exercises, we aim to teach you basic data science concepts and how to apply them using a data science language. The topics that we will cover will include, but will not be limited to:

  • Data Cleaning and Manipulation
  • Data Visualization
  • Supervised Learning Algorithms
  • Unsupervised Learning
  • Text Analysis
  • Big Data Tools

We'll start with the basics of the language and work up to applications of more advanced concepts. By the end of the course, you will have the foundation and basic skills to contribute to any subteam on Cornell Data Science, and to build a career in this rapidly growing field.

R vs. Other Languages

One of the questions most often asked by those starting out in data science is which programming language to use for data analysis. Several contenders are competing to become the dominant data science language in both academia and industry, including Python, R, Julia, Scala, C++, and Java. Our language of choice for this course is R. There are several reasons for our choice.

  1. R is the most popular language (and not Python!) for data science in industry. [1]
  2. For those first entering data science, the functional approach of R can prove easier. [2]
  3. The strong support R has in academia gives it one of the most diverse sets of packages and a robust support community, both online and offline.

With that said, R has some quirks.

  1. It is not very compatible with Java and other traditional OO languages. As a result, R is used more for standalone analytics (in finance, consulting, and scientific research, for example) than as part of a software product cycle.
  2. For reasons related to the point above, the computer science and tech industries tend to prefer Python.
  3. R is a high-level interpreted language whose core routines are written in C and Fortran, so code written in pure R can be much slower than the equivalent in a compiled language.
  4. The functional approach can be difficult to adapt to for people coming from an OO background.

However, we do wish to emphasize that the choice of starting language matters little. A career in data science involves knowledge of more than one language. The many areas of data science rely on a large variety of languages, and by no means should you limit yourself to learning one.

Let's get started coding! Download R (https://cran.rstudio.com/) and RStudio (https://www.rstudio.com/products/rstudio/download/)!

R Data Structures

In any programming language, we have the notion of ‘data types’ and ‘data structures’. This is because programs manipulate data, and different kinds of data require different manipulations. For example, numbers are treated differently than characters (or strings of characters) and single numbers are treated differently than lists of numbers (vectors) or arrays of numbers (matrices).

The following are some simple R data types:

  • numeric (1, 2.7, 9001)
  • character ("Hello World")
  • logical (TRUE or FALSE)
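
As a quick check of the types listed above, the class() function reports the data type of a value. This is only a minimal sketch; the variable names n, s and b are illustrative and not part of the lecture.

```r
# Illustrative values, one per basic data type
n <- 2.7            # numeric
s <- "Hello World"  # character
b <- TRUE           # logical

class(n)  # "numeric"
class(s)  # "character"
class(b)  # "logical"
```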

(Show what modes can hold with chart from http://bxhorn.com/data-modes-and-classes/)

These data types can be combined to form data structures:

Vectors

Vectors are the most basic data structure in R. Those familiar with programming can think of them as arrays; those more familiar with mathematics can think of them as ordered tuples of values. In R, a vector usually refers to a group of elements of the same data type. Vectors are usually created with the c() or vector() function.

  • c(): concatenation function
  • vector(mode, length): a function used to initialize a vector of a given data type and length
In [1]:
print("c() function example")
x <- c(1, 5, 4, 9, 0)
print(x)
[1] "c() function example"
[1] 1 5 4 9 0
In [2]:
# or
x <- vector(length=5, mode = "numeric") # preallocate a numeric vector of five zeros
x[1:5] <- c(1,5,4,9,0)                  # fill the preallocated vector
print("vector() function example")
print(x)
[1] "vector() function example"
[1] 1 5 4 9 0
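
Because a vector holds elements of a single data type, c() converts mixed inputs to one common type. The sketch below illustrates that behavior; the names mixed and nums are made up for this example.

```r
# Mixing numeric, character and logical inputs in one vector
mixed <- c(1, "a", TRUE)
class(mixed)               # "character": every element became a string
mixed                      # "1" "a" "TRUE"

# Mixing numeric and logical inputs
nums <- c(1, TRUE, FALSE)  # logical values are converted to numbers
nums                       # 1 1 0
```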

Another way to make a vector is the seq() function, which can also create more complex sequences by specifying the step size or the number of points in an interval.

In [3]:
seq(1,6)                # specify endpoints for integer sequence
[1] 1 2 3 4 5 6
In [4]:
seq(1,3, by=0.2)        # specify the step size; both endpoints, 1 and 3, are included
 [1] 1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4 2.6 2.8 3.0
In [5]:
c(1:6)                  # Specify a vector with integers 1 to 6
[1] 1 2 3 4 5 6
In [6]:
seq(1, 5, length.out=4) # specify length of the vector
[1] 1.000000 2.333333 3.666667 5.000000

Lists

A list is a data structure whose components can have mixed data types. A vector whose elements all have the same type is called an atomic vector, while a vector whose elements have different types, or are even different data structures, is a list.
We can check whether an item is a list or an atomic vector using the typeof() function.
Below, we create a list x of three components: a double, a logical, and an integer vector, respectively.

In [20]:
x <- list("a" = 2.5, "b" = TRUE, "c" = seq(1,3))

Its structure can be examined with the str() function.

In [21]:
str(x)
List of 3
 $ a: num 2.5
 $ b: logi TRUE
 $ c: int [1:3] 1 2 3
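
The typeof() check mentioned above can be sketched as follows. The vector v is an illustrative addition, and is.list() is shown as a more direct test.

```r
x <- list("a" = 2.5, "b" = TRUE, "c" = seq(1, 3))
v <- c(1, 5, 4, 9, 0)  # an atomic vector, for comparison

typeof(x)   # "list"
typeof(v)   # "double": an atomic vector reports the type of its elements
is.list(x)  # TRUE
is.list(v)  # FALSE
```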

In this example, a, b and c are called tags, which make it easier to reference the components of the list. Tags are optional: we can create the same list without them, and R will use numeric indices instead.
So how do we access components of a list? Below are some methods.

In [7]:
x <- list("a" = 2.5, "b" = TRUE, "c" = seq(1,3))
x[c(1:2)]     # index using integer vector
$a
[1] 2.5

$b
[1] TRUE
In [8]:
x[c(TRUE,FALSE,FALSE)]   # index using logical vector
$a
[1] 2.5
In [9]:
x[c("a","c")] # index using character vector
$a
[1] 2.5

$c
[1] 1 2 3

Indexing with [ as shown above returns a sublist, not the content of the component. To retrieve the content, we need to use [[ or $.

In [7]:
x["a"]
x[["a"]]
x$a
$a
[1] 2.5

[1] 2.5
[1] 2.5
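
The difference between [ and [[ can be made explicit by checking the class of each result. The sketch below also builds the same list without tags, where components are reached by position; the name y is illustrative.

```r
x <- list("a" = 2.5, "b" = TRUE, "c" = seq(1, 3))

class(x["a"])    # "list": [ keeps the list wrapper
class(x[["a"]])  # "numeric": [[ returns the content itself

# The same list without tags; components are accessed by numeric index
y <- list(2.5, TRUE, seq(1, 3))
y[[1]]           # 2.5
y[[3]][2]        # 2, the second element of the third component
```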

Matrices

A matrix is a two-dimensional data structure. Matrices are similar to vectors but additionally carry a dimension attribute. All attributes of an object can be checked with the attributes() function (the dimensions can be checked directly with the dim() function). We can check whether a variable is a matrix with the class() function.

Matrices can be created using the matrix() function. The dimensions of the matrix are defined by passing appropriate values for the arguments nrow and ncol.

Providing values for both dimensions is not necessary. If one of the dimensions is provided, the other is inferred from the length of the data.

In [38]:
matrix(1:9, nrow = 3, ncol = 3) # same as matrix(1:9, nrow = 3)
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
In [39]:
matrix(1:9, nrow=3, byrow=TRUE) # fill matrix row-wise
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
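
The checks described at the start of this section (attributes(), dim() and class()) can be sketched as follows. The name m is illustrative; is.matrix() is shown as well because the exact value of class() depends on the R version.

```r
m <- matrix(1:9, nrow = 3)

dim(m)         # 3 3
attributes(m)  # a list containing the dim attribute
is.matrix(m)   # TRUE
class(m)       # contains "matrix" (R 4.0 and later also report "array")
```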

It is possible to name the rows and columns of a matrix during creation by passing a 2-element list to the argument dimnames.

In [41]:
x <- matrix(1:9, nrow = 3, dimnames = list(c("X","Y","Z"), c("A","B","C")))
x
  A B C
X 1 4 7
Y 2 5 8
Z 3 6 9

Another way of creating a matrix is with the functions cbind() and rbind() (column bind and row bind). A matrix can also be made from a vector by setting its dimensions with dim().

In [8]:
cbind(c(1,2,3),c(4,5,6))
rbind(c(1,2,3),c(4,5,6))
x <- c(1,2,3,4,5,6)
dim(x) <- c(2,3)
x
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

Data Frames

Data frames are similar to matrices, so many of the functions we're about to use also work on matrices.

The key difference between the two is that data frames, unlike matrices, can store a different mode (numeric, character, factor, etc.) in each column. This makes data frames extremely powerful, and they are the most commonly used data structure.

Below we create a simple data frame, and analyze it using some essential R functions.

In [11]:
x <- data.frame("ID" = 1:2, "Age" = c(21,15), "Name" = c("John","Dora"), stringsAsFactors=FALSE) #make the data frame
x
str(x)   # structure of x
names(x) # names of each column
ncol(x)  # number of columns
nrow(x)  # number of rows
  ID Age Name
1  1  21 John
2  2  15 Dora
'data.frame':	2 obs. of  3 variables:
 $ ID  : int  1 2
 $ Age : num  21 15
 $ Name: chr  "John" "Dora"
[1] "ID"   "Age"  "Name"
[1] 3
[1] 2
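
To illustrate how a data frame keeps a separate type per column while a matrix cannot, the sketch below compares the two. The names df and m are illustrative; as.matrix() is used here only to show the conversion.

```r
df <- data.frame(ID = 1:2, Age = c(21, 15), Name = c("John", "Dora"),
                 stringsAsFactors = FALSE)

sapply(df, class)   # ID "integer", Age "numeric", Name "character"

m <- as.matrix(df)  # a matrix can hold only one type
typeof(m)           # "character": the numbers were converted to strings
```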

How do we access elements of a data frame? As with a matrix, we can use the square bracket [ indexing method. Elements are accessed as var[row, column], where row and column are vectors.

In [10]:
x[c(1,2),c(2,3)]    # select rows 1 & 2 and columns 2 & 3
x[1,]    # leaving the column field blank selects all columns of row 1
x[,]    # leaving both the row and column fields blank selects the entire data frame
x[-1,]    # select all rows except the first
  Age Name
1  21 John
2  15 Dora
  ID Age Name
1  1  21 John
  ID Age Name
1  1  21 John
2  2  15 Dora
  ID Age Name
2  2  15 Dora
In [93]:
x[1,"Name"]
[1] "John"
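
Since the row and column indices are vectors, a logical vector works as well, which gives a simple way to filter rows by a condition. This is a sketch on the same data; the names adults and first_name are hypothetical.

```r
x <- data.frame("ID" = 1:2, "Age" = c(21, 15), "Name" = c("John", "Dora"),
                stringsAsFactors = FALSE)

adults <- x[x$Age > 18, ]                # keep rows where Age exceeds 18
adults                                   # only John's row remains

first_name <- x[c(TRUE, FALSE), "Name"]  # logical row index, column by name
first_name                               # "John"
```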

Data frames are also easy to modify, or add components to.

In [94]:
x[1,"Age"] <- 20; x #modify John's Age
  ID Age Name
1  1  20 John
2  2  15 Dora

Rows can be added to a data frame using the rbind() function.

In [109]:
rbind(x,list(3,16,"Paul"))
  ID Age Name
1  1  20 John
2  2  15 Dora
3  3  16 Paul

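Similarly, we can add columns using cbind(). To delete a column, set it to NULL. Rows cannot be removed this way; use negative indexing instead, as in x[-1,].

In [79]:
x <- cbind(x,State=c("NY","FL")) # add a State column and keep the result
x
x$State <- NULL                  # delete the State column again
x
  ID Age Name State
1  1  20 John    NY
2  2  15 Dora    FL
  ID Age Name
1  1  20 John
2  2  15 Dora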
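
A minimal sketch of removing a row and a column from a data frame, assuming the same x as in the cells above. Rows are dropped with negative indexing and columns by assigning NULL; the results must be assigned back to x to take effect.

```r
x <- data.frame("ID" = 1:2, "Age" = c(20, 15), "Name" = c("John", "Dora"),
                stringsAsFactors = FALSE)

x <- x[-2, ]   # drop the second row (Dora)
nrow(x)        # 1

x$Age <- NULL  # drop the Age column
names(x)       # "ID" "Name"
```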