Welcome to the Cornell Data Science Training program! Through these lectures and exercises, we aim to teach you basic data science concepts and how to apply them using a data science language. The topics we will cover include, but are not limited to:
We'll start with the basics of the language and work up to applications of more advanced concepts. By the end of the course, you will have the foundation and basic skills to contribute to any subteam on Cornell Data Science, and to build a career in this rapidly growing field.
One of the questions most often asked by people getting started in data science is which programming language to use for data analysis. Several contenders are competing to become the dominant data science language in both academia and industry, including Python, R, Julia, Scala, C++ and Java, and the list goes on. Our language of choice for this course is R. There are several reasons for our choice.
With that said, R has some quirks.
However, we do wish to emphasize that the choice of starting language matters very little. A career in data science involves knowledge of more than one language. The many areas of data science rely on a wide variety of languages, and by no means should you limit yourself to learning just one.
Let's get started coding! Download R (https://cran.rstudio.com/) and RStudio (https://www.rstudio.com/products/rstudio/download/)!
In any programming language, we have the notion of ‘data types’ and ‘data structures’. This is because programs manipulate data, and different kinds of data require different manipulations. For example, numbers are treated differently than characters (or strings of characters) and single numbers are treated differently than lists of numbers (vectors) or arrays of numbers (matrices).
The following are some simple R data types:
(Show what modes can hold with chart from http://bxhorn.com/data-modes-and-classes/)
These data types can be combined to form data structures:
Vectors are the most basic data structure in R. We can understand vectors as arrays for those familiar with programming, or as ordered sequences of values for those more familiar with mathematics. In R, a vector usually refers to a group of elements of the same data type; vectors are usually created using the c() or vector() function.
print("c() function example")
x <- c(1, 5, 4, 9, 0)
print(x)
[1] "c() function example"
[1] 1 5 4 9 0
# or
x <- vector(mode = "numeric", length = 5) # numeric vector of five zeros
x[] <- c(1, 5, 4, 9, 0)                   # fill the existing vector in place
print("vector() function example")
print(x)
[1] "vector() function example"
[1] 1 5 4 9 0
Another way to make a vector is with the seq() function, which can also build more complex sequences by specifying the step size or the number of points in an interval.
seq(1, 6)                 # specify endpoints for an integer sequence
seq(1, 3, by = 0.2)       # specify the step size; both endpoints 1 and 3 are included
1:6                       # the colon operator also gives the integers 1 to 6
seq(1, 5, length.out = 4) # specify the length of the vector
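As a quick check of how the step size and the number of points relate, here is a minimal sketch (the variable names by_step and by_count are only for illustration) that builds the same sequence both ways:

```r
# Build the same sequence two ways: by step size and by number of points.
by_step  <- seq(0, 1, by = 0.25)        # step of 0.25 between 0 and 1
by_count <- seq(0, 1, length.out = 5)   # 5 evenly spaced points between 0 and 1

print(by_step)                          # 0.00 0.25 0.50 0.75 1.00
print(length(by_count))                 # 5
print(all.equal(by_step, by_count))     # TRUE
```

all.equal() is used instead of == because the values are doubles and may differ by tiny rounding errors.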
A list is a data structure whose components can have mixed data types.
A vector whose elements all have the same type is called an atomic vector, whereas a vector whose elements can differ in type, and even in data structure, is a list.
We can check whether an item is a list or vector using the typeof() function.
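As a small illustration of the difference (the names v and l are chosen just for this sketch), combining mixed values with c() coerces them to a single type, whereas list() keeps each component's own type:

```r
# c() builds an atomic vector, so mixed inputs are coerced to one common type.
v <- c(1, "a", TRUE)
print(typeof(v))           # "character"
print(v)                   # "1" "a" "TRUE"

# list() keeps every component's own type.
l <- list(1, "a", TRUE)
print(typeof(l))           # "list"
print(sapply(l, typeof))   # "double" "character" "logical"

print(is.list(v))          # FALSE
print(is.list(l))          # TRUE
```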
Below, we create a list x, of three components with data types double, logical, and integer vector respectively.
x <- list("a" = 2.5, "b" = TRUE, "c" = seq(1,3))
Its structure can be examined with the str() function.
str(x)
List of 3
 $ a: num 2.5
 $ b: logi TRUE
 $ c: int [1:3] 1 2 3
In this example, a, b and c are called tags, which make it easier to reference the components of the list. Tags are optional: we can create the same list without tags, and R will use numeric indices instead.
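A minimal sketch of the untagged version, using a hypothetical variable y so it does not overwrite x:

```r
# The same three components as x, but without tags.
y <- list(2.5, TRUE, seq(1, 3))

print(names(y))    # NULL, because no tags were given
print(y[[1]])      # 2.5
print(y[[3]])      # 1 2 3

# Tags can still be attached afterwards with names().
names(y) <- c("a", "b", "c")
print(y$a)         # 2.5
```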
So how do we access components of a list? Below are some methods.
x <- list("a" = 2.5, "b" = TRUE, "c" = seq(1,3))
x[1:2]                    # index using integer vector
x[c(TRUE, FALSE, FALSE)]  # index using logical vector
x[c("a","c")]             # index using character vector
Indexing with [ as shown above gives us a sublist, not the content inside the component. To retrieve the content, we need to use [[ or $.
x["a"]   # sublist containing the component a
x[["a"]] # content of the component a
x$a      # same content, accessed by tag
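To make the distinction concrete, this short sketch (sub and val are illustrative names) checks what each form of indexing returns:

```r
x <- list("a" = 2.5, "b" = TRUE, "c" = seq(1, 3))

sub <- x["a"]     # single bracket: still a list, of length 1
val <- x[["a"]]   # double bracket: the number stored inside

print(is.list(sub))          # TRUE
print(is.list(val))          # FALSE
print(identical(val, x$a))   # TRUE
print(val * 2)               # 5
```

Arithmetic such as val * 2 works on the extracted content, whereas sub * 2 would fail because sub is still a list.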
A matrix is a two-dimensional data structure. Matrices are similar to vectors but additionally carry a dimension attribute. All attributes of an object can be checked with the attributes() function (the dimensions can be checked directly with the dim() function). We can check whether a variable is a matrix with the class() function.
Matrices can be created using the matrix() function. The dimensions of the matrix are defined by passing appropriate values for the arguments nrow and ncol.
Providing values for both dimensions is not necessary. If one dimension is provided, the other is inferred from the length of the data.
matrix(1:9, nrow = 3, ncol = 3) # same as matrix(1:9, nrow = 3)
matrix(1:9, nrow=3, byrow=TRUE) # fill matrix row-wise
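The attribute checks mentioned earlier can be tried on a small matrix. This is a minimal sketch with an illustrative variable m; is.matrix() is used alongside class() because the output of class() differs between R versions:

```r
# A 2 x 3 matrix; ncol is inferred from the length of the data.
m <- matrix(1:6, nrow = 2)

print(attributes(m))   # a list with a single entry, $dim, holding 2 3
print(dim(m))          # 2 3
print(is.matrix(m))    # TRUE
print(class(m))        # "matrix" "array" on R 4.0 or later; just "matrix" on older versions
```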
It is possible to name the rows and columns of a matrix during creation by passing a 2-element list to the argument dimnames.
x <- matrix(1:9, nrow = 3, dimnames = list(c("X","Y","Z"), c("A","B","C")))
x
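Once a matrix has names, they can be read back and used for indexing. A short sketch using the same matrix:

```r
x <- matrix(1:9, nrow = 3, dimnames = list(c("X","Y","Z"), c("A","B","C")))

print(rownames(x))    # "X" "Y" "Z"
print(colnames(x))    # "A" "B" "C"
print(x["Y", "B"])    # 5, the element in row Y and column B
print(x[2, 2])        # 5, the same element by position
```

The value is 5 because matrix() fills column by column by default, so column B holds 4, 5, 6.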
Another way to create a matrix is with the functions cbind() and rbind() (column bind and row bind), or from a vector by setting its dimensions with dim().
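cbind(c(1,2,3),c(4,5,6)) # bind the vectors as columns: a 3 x 2 matrix
rbind(c(1,2,3),c(4,5,6)) # bind the vectors as rows: a 2 x 3 matrix
x <- c(1,2,3,4,5,6)
dim(x) <- c(2,3)         # turn the vector into a 2 x 3 matrix
x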
Data frames are similar to matrices, so many of the functions we're about to use also work on matrices.
The key difference between the two is that data frames, unlike matrices, can store a different type in each column (numeric, character, factor, etc.). This makes data frames extremely powerful, and they are the most commonly used data structure.
Below we create a simple data frame, and analyze it using some essential R functions.
x <- data.frame("ID" = 1:2, "Age" = c(21,15), "Name" = c("John","Dora"), stringsAsFactors=FALSE) # make the data frame x
str(x)   # structure of x
names(x) # names of each column
ncol(x)  # number of columns
nrow(x)  # number of rows
'data.frame': 2 obs. of 3 variables:
 $ ID  : int 1 2
 $ Age : num 21 15
 $ Name: chr "John" "Dora"
How do we access elements of a data frame? As with a matrix, we can use the square bracket [ indexing method: elements are accessed as var[row, column], where row and column can each be a vector of indices.
x[c(1,2),c(2,3)] # select rows 1 & 2 and columns 2 & 3
x[1,]            # leaving the column field blank selects all columns of row 1
x[,]             # leaving both the row and column fields blank selects the entire data frame
x[-1,]           # select all rows except the first
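Columns can also be reached by name. The sketch below rebuilds the same data frame and shows a few common access patterns, along with what kind of object each one returns:

```r
x <- data.frame("ID" = 1:2, "Age" = c(21, 15), "Name" = c("John", "Dora"),
                stringsAsFactors = FALSE)

print(x$Name)          # "John" "Dora"
print(x[["Age"]])      # 21 15
print(x[2, "Name"])    # "Dora"

print(class(x[1, ]))   # "data.frame": a whole row is still a data frame
print(class(x[, 2]))   # "numeric": a single column is returned as a vector
```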
Data frames are also easy to modify, or add components to.
x[1,"Age"] <- 20; x #modify John's Age
Rows can be added to a data frame using the rbind() function.
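A minimal sketch of adding a row, using a separate data frame called people and a made-up third person (both are assumptions for illustration, not part of the original example):

```r
people <- data.frame("ID" = 1:2, "Age" = c(21, 15), "Name" = c("John", "Dora"),
                     stringsAsFactors = FALSE)

# The new row is a one-row data frame with the same column names (hypothetical values).
new_row <- data.frame("ID" = 3L, "Age" = 16, "Name" = "Paul",
                      stringsAsFactors = FALSE)

people <- rbind(people, new_row)   # assign the result; rbind() does not modify in place
print(nrow(people))                # 3
print(people)
```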
Similarly, we can add columns using cbind(). To delete a column, simply set it to NULL; to delete a row, use negative indexing, as in x[-1,].
x <- cbind(x, State=c("NY","FL")) # assign the result so the new column is kept
x$State <- NULL                   # delete the State column again
x
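Deleting works differently for columns and rows, so here is a short sketch of both on a separate, illustrative data frame called people (the third entry is made up for the example):

```r
people <- data.frame(ID = 1:3, Age = c(20, 15, 16),
                     Name = c("John", "Dora", "Paul"),
                     stringsAsFactors = FALSE)

people$Age <- NULL        # delete a column by setting it to NULL
people <- people[-1, ]    # delete the first row with negative indexing

print(dim(people))        # 2 2
print(people$Name)        # "Dora" "Paul"
```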
code examples adapted from programiz.com