Objects

R objects are assigned using the assignment operator, '<-':

In [1]:

x <- 1

In [2]:

print(x)

[1] 1

As we can see here, we have assigned 'x' a value of 1. Thus, unless redefined to mean something else, 'x' now means '1' for all intents and purposes. Variable names can be almost anything, though spaces are prohibited.

Vectors

Atomic Types

R provides six basic types representing the value of a variable. The fundamental types are:

1 integer
2 real (called 'double' in R)
3 character
4 logical
5 complex 
6 raw
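As a quick check of the list above, the built-in typeof() function reports the atomic type of a value. The example values below are arbitrary choices made for illustration; note that typeof() reports the 'real' type as "double".

```r
# One example value for each of the six atomic types
examples <- list(
  integer   = 1L,          # the L suffix requests an integer
  real      = 1.5,         # reported by typeof() as "double"
  character = "text",
  logical   = TRUE,
  complex   = 1 + 2i,
  raw       = as.raw(255)  # a single byte
)

# typeof() returns the atomic type of each value
types <- sapply(examples, typeof)
print(types)
```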

Numeric Vectors

Objects can contain far more than one value. R is a vectorized programming language - for our purposes here, we can define 'vector' as 'a collection of values of the same type'. A collection of numbers is a vector, as is a collection of letters or words. An entire vector can easily be stored in a single object.

In [3]:

y <- c(1, 2, 3)

In [4]:

(y)

1
2
3

'y' now contains a vector composed of the numbers 1, 2, and 3. Vectors can be assembled in a number of ways, but the most straightforward way to do so is by wrapping a list of comma-separated numbers in the c() function; there are, however, many alternatives.

Using direct assignment:

In [5]:

short.vector <- 1
print(short.vector)

[1] 1

Using the c() function:

In [6]:

medium.vector <- c(1, 2, 3)
print(medium.vector)

[1] 1 2 3

Using a sequence (':' operator):

In [7]:

long.vector <- 1:10
print(long.vector)

 [1]  1  2  3  4  5  6  7  8  9 10
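Two further alternatives, sketched here with arbitrary example values, are the seq() function for sequences with a step other than 1 and the rep() function for repeated values:

```r
# seq() builds a sequence with a chosen step size
even.vector <- seq(2, 10, by = 2)
print(even.vector)

# rep() repeats a value a given number of times
repeated.vector <- rep(0, times = 5)
print(repeated.vector)
```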

It is even possible to generate an empty vector so that you can fill it later as calculations are performed. Here, we are using the vector() function to initialize an empty vector.

In [8]:

empty.vector <- vector()

print(empty.vector)

logical(0)

R reports this empty object as logical(0), meaning a logical vector of length 0; logical is the default type generated by the vector() function. Had we requested another type, R would print that type's name instead, e.g. numeric(0) or character(0).
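As a small sketch of how the type of an empty vector can be chosen, vector() accepts a type name and a length. The object names below are made up for this example.

```r
# vector() accepts a type ("mode") and a length
empty.numeric <- vector("numeric", length = 0)
empty.character <- vector("character", length = 0)

print(empty.numeric)    # numeric(0)
print(empty.character)  # character(0)

# A non-zero length gives a vector filled with that type's default value
three.zeros <- vector("numeric", length = 3)
print(three.zeros)
```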

It is trivial to add new information to existing vectors. Here we use the above-mentioned c() function to concatenate (i.e. combine) an existing vector with a new value:

In [9]:

print(short.vector)
print(c(short.vector, 2))

[1] 1
[1] 1 2

Combining an empty vector with a value will lead to a vector of length 1 containing that value:

In [10]:

print(c(empty.vector, 100))

[1] 100

Vectors can likewise be combined with each other:

In [11]:

print(short.vector)
print(medium.vector)
print(c(short.vector, medium.vector))

[1] 1
[1] 1 2 3
[1] 1 1 2 3

and multiple vectors can be combined in one step:

In [12]:

print(c(short.vector, medium.vector, long.vector))

 [1]  1  1  2  3  1  2  3  4  5  6  7  8  9 10

The results of this concatenation can be stored in an object just like any other vector:

In [13]:

combined.vector <- c(short.vector, medium.vector, long.vector)
print(combined.vector)

 [1]  1  1  2  3  1  2  3  4  5  6  7  8  9 10

The length of a vector can be extracted using the length() function:

In [14]:

length(combined.vector)

14

The 'index' of a value within a vector is the number denoting its position. The first value has an index of 1, and the last value has an index equal to the length of the vector. The index can be used to extract the corresponding value from a vector using '[' and ']':

In [15]:

print(combined.vector)
print(combined.vector[1])
print(combined.vector[14])

 [1]  1  1  2  3  1  2  3  4  5  6  7  8  9 10
[1] 1
[1] 10

You can combine these last two points to write code that will extract the last element of a vector:

In [16]:

print(combined.vector[length(combined.vector)])

[1] 10

R code evaluates from the inside out: first, the length() function will be evaluated, and that value (i.e. 14, as seen above) will be passed to the square brackets, which will use it to extract the corresponding element. Since the indices of any vector span the range from 1 to the vector's total length, this code will always extract the last element regardless of what vector is used:

In [17]:

print(short.vector)
print(short.vector[length(short.vector)])

print(long.vector)
print(long.vector[length(long.vector)])

[1] 1
[1] 1
 [1]  1  2  3  4  5  6  7  8  9 10
[1] 10

You can use indices to remove elements from vectors, too. Here, we will remove the second element from medium.vector:

In [18]:

print(medium.vector)

medium.vector <- medium.vector[-c(2)]

print(medium.vector)

[1] 1 2 3
[1] 1 3

Since R is a vectorized language, it is designed for performing mathematical operations on entire vectors. It is trivial, for example, to add a value to every element of a vector:

In [19]:

print(long.vector)

long.vector <- long.vector + 1

print(long.vector)

 [1]  1  2  3  4  5  6  7  8  9 10
 [1]  2  3  4  5  6  7  8  9 10 11

Other elementary mathematical operations can be implemented in the same way.
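For instance, subtraction, multiplication, division, and exponentiation all apply to every element in the same way. The vector 'values' is an arbitrary example:

```r
values <- c(2, 4, 6)

differences <- values - 1  # subtract 1 from every element
products <- values * 3     # multiply every element by 3
quotients <- values / 2    # divide every element by 2
squares <- values ^ 2      # square every element

print(differences)
print(products)
print(quotients)
print(squares)
```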

Standard vector addition is also possible if vectors are the same length:

In [20]:

first.vector <- c(1, 2, 3)
second.vector <- c(4, 5, 6)

print(first.vector + second.vector)

[1] 5 7 9

Do not try to perform vector addition when the vectors are not of the same length: R will attempt to add them even though doing so is not a mathematically legitimate operation, and the result is rarely what was intended.

In [21]:

third.vector <- c(7, 8)

print(first.vector + third.vector)

Warning message in first.vector + third.vector:
“longer object length is not a multiple of shorter object length”

[1]  8 10 10

Specifically, what happened here is that R began to cycle back through the shorter vector until it had enough numbers to match the longer vector's length, so the resulting calculation was (first.vector[1] + third.vector[1], first.vector[2] + third.vector[2], first.vector[3] + third.vector[1]). There are rare situations in which this behavior is actually desired, but by and large it will cause serious problems for an analysis.
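One reason this recycling is dangerous is that R only warns when the longer length is not a multiple of the shorter one. The sketch below uses made-up vectors to show the silent case, followed by a hypothetical helper (add.same.length, not a built-in function) that refuses to add vectors of different lengths:

```r
long.one <- c(1, 2, 3, 4)
short.one <- c(10, 20)

# Length 4 is a multiple of length 2, so R recycles without any warning:
# 1 + 10, 2 + 20, 3 + 10, 4 + 20
recycled <- long.one + short.one
print(recycled)

# Hypothetical helper: stop with an error unless the lengths match
add.same.length <- function(a, b) {
  stopifnot(length(a) == length(b))
  a + b
}

print(add.same.length(c(1, 2, 3), c(4, 5, 6)))
```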

Vector subtraction works in the same way as vector addition.

Thus far, we have only worked with one type of vector - the 'numeric' vector - because it is the type most commonly seen in bioinformatics. The class() function allows you to determine what type of vector a given object is:

In [22]:

class(short.vector)

'numeric'

Character Vectors

There are, however, other types of vector. Another commonly-seen vector class is the 'character' vector, a data structure used for holding text:

In [23]:

character.vector <- c("These", "words", "form", "a", "character", "vector")

class(character.vector)

'character'

Indexing works the same in all types of vectors, including the character vector:

In [24]:

print(character.vector)
print(character.vector[2])

[1] "These"     "words"     "form"      "a"         "character" "vector"   
[1] "words"

You cannot, however, perform mathematical operations on a character vector:

In [25]:

print(character.vector + 5)

Error in character.vector + 5: non-numeric argument to binary operator
Traceback:

1. print(character.vector + 5)

Sometimes, when data is read into R, its formatting causes problems with parsing and leads to it being imported as the incorrect data type. It is not uncommon to have numbers read in as a character vector instead of a numeric vector:

In [26]:

numbers.as.character.vector <- c("1", "2", "3", "4", "5")

print(numbers.as.character.vector)

[1] "1" "2" "3" "4" "5"

In [27]:

class(numbers.as.character.vector)

print(numbers.as.character.vector + 5)

'character'

Error in numbers.as.character.vector + 5: non-numeric argument to binary operator
Traceback:

1. print(numbers.as.character.vector + 5)

As can be seen above, elements of a character vector are always printed within double quotes. Seeing data surrounded by quotation marks is often the first sign that R has imported the data incorrectly (often due to cross-platform character encoding issues). Fortunately, there is a simple solution to this issue: coercion.

In many programming languages, including R, it is possible to 'coerce' one data structure into another, i.e. to force it to change characteristics. R provides a variety of functions for coercing objects, and here we will use the as.numeric() function to force an object to become a numeric vector:

In [28]:

class(numbers.as.character.vector)

'character'

Vector Conversion

In [29]:

coerced.vector <- as.numeric(numbers.as.character.vector)

class(coerced.vector)

print(coerced.vector + 5)

'numeric'

[1]  6  7  8  9 10

As demonstrated above, the vector can now be subjected to mathematical operations like any other numeric vector.
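as.numeric() is one member of a family of coercion functions. A brief sketch of three others, using arbitrary example values:

```r
# Numeric to character: each number becomes a piece of text
as.text <- as.character(c(1, 2, 3))
print(as.text)

# Character to integer
as.whole.numbers <- as.integer(c("7", "8"))
print(as.whole.numbers)

# Numeric to logical: 0 becomes FALSE, any other number becomes TRUE
as.flags <- as.logical(c(0, 1, 2))
print(as.flags)
```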

Be warned that careless object coercion will yield missing values. Here, we will try to coerce our original character vector to a numeric even though that is nonsense:

In [30]:

print(character.vector)

[1] "These"     "words"     "form"      "a"         "character" "vector"

In [31]:

nonsense.vector <- as.numeric(character.vector)

class(nonsense.vector)

print(nonsense.vector)

Warning message in eval(expr, envir, enclos):
“NAs introduced by coercion”

'numeric'

[1] NA NA NA NA NA NA

Since a value like "These" has no numerical equivalent, R is unable to coerce it and it becomes a missing value (NA - 'Not Available').
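When only some entries fail to convert, it can help to find out which ones. The following is one possible approach, sketched on a made-up vector (raw.input): compare missingness before and after the conversion.

```r
raw.input <- c("1", "2", "three", "4")

# suppressWarnings() silences the "NAs introduced by coercion" warning
converted <- suppressWarnings(as.numeric(raw.input))

# Entries that were not missing before conversion but are missing afterwards
failed <- is.na(converted) & !is.na(raw.input)

print(raw.input[failed])
```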

Missing Values

We will now take a brief interlude to expand upon missing data. In addition to resulting from an inappropriate type conversion, missing values can result from a variety of errors on the part of both the user and the system. Things like unexpected character encoding in data files - though invisible to the user - can cause R to yield NAs when reading data in.

Checking for the presence of NAs is relatively simple. Here we will present the simplest way to check for them in a vector; we will return to the topic of missing data later once we introduce the concept of the data frame.

In [32]:

vector.with.missingness <- c(1, 2, NA, 4, NA)

print(vector.with.missingness)

[1]  1  2 NA  4 NA

The is.na() function checks each element of a vector to see if it is an NA, and then states TRUE if it is and FALSE if it is not. This result is technically a logical vector, a vector wherein all elements are TRUE or FALSE relative to some condition.

In [33]:

is.na(vector.with.missingness)

FALSE
FALSE
TRUE
FALSE
TRUE

R has a function called table() that tabulates how many entries of a vector fall into each category of entry; this function is ideally-suited for tabulating the missingness of a vector:

In [34]:

table(is.na(vector.with.missingness))

FALSE  TRUE 
    3     2

As can easily be seen, we have two pieces of missing data in this vector. What happens if we call table() on a vector with no missingness?

In [35]:

table(is.na(long.vector))

any(is.na(long.vector))

FALSE 
   10

FALSE

table() still works even if there is only one category into which all elements of a vector fall, and the syntax 'table(is.na(VECTOR_NAME))' is therefore a quick and easy way to see if there are any NAs in even the longest vectors.
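A few other base R tools answer related questions about missingness. This sketch reuses the values of vector.with.missingness from above; the other object names are made up for the example.

```r
vector.with.missingness <- c(1, 2, NA, 4, NA)

# TRUE counts as 1, so summing the logical vector counts the NAs
n.missing <- sum(is.na(vector.with.missingness))

# which() gives the positions of the NAs
where.missing <- which(is.na(vector.with.missingness))

# anyNA() answers "is there at least one NA?"
has.missing <- anyNA(vector.with.missingness)

# Many summary functions return NA unless told to drop missing values
mean.with.na <- mean(vector.with.missingness)
mean.without.na <- mean(vector.with.missingness, na.rm = TRUE)

print(n.missing)
print(where.missing)
print(has.missing)
print(mean.with.na)
print(mean.without.na)
```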

There are several other types of anomalous values that can result when R attempts to perform an invalid operation; NA is the most commonly encountered, but we will briefly note the others.

NaN stands for 'not a number' and is the result of an invalid arithmetic operation:

In [36]:

0/0

NaN

NULL is the result of trying to query a parameter that is undefined for a specified object:

In [37]:

names(vector.with.missingness)

NULL

Positive and negative infinity result from dividing by zero or operations that do not converge:

In [38]:

10/0

Inf

Infinity is usually - but not always - the result of an error, and the same is true of the other types of anomalous values; if NA, NaN, or NULL is the result of a function call, chances are the function was misapplied or there is a serious problem with the data.
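Each of these anomalous values has a matching test function. A short sketch, with object names chosen for illustration:

```r
nan.value <- 0 / 0
inf.value <- 10 / 0
neg.inf.value <- -10 / 0

print(is.nan(nan.value))           # TRUE
print(is.na(nan.value))            # TRUE: is.na() also flags NaN
print(is.infinite(neg.inf.value))  # TRUE
print(is.finite(inf.value))        # FALSE
print(is.null(names(c(1, 2, 3))))  # TRUE: an unnamed vector has no names
```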

Common Functions

R has hundreds of useful functions that operate on vectors. We've seen a few of them so far, presented without much technical explanation. Now, we will briefly pause our coverage of vector types to discuss the fundamental nature of functions and then exhibit the effects of a few with widespread use-cases.

Most functions take one or more 'arguments' - pieces of information that must be provided to them in order for them to work. The most common function we've seen throughout this exercise thus far is the print() function. print() requires one argument: the thing you want to print.

In [39]:

print("This sentence is the argument to print().")

[1] "This sentence is the argument to print()."

Most functions have at least one required argument, but many also have optional arguments, including print(). It is possible to have print() produce output without the surrounding quotation marks by including a second argument, 'quote = FALSE'. Arguments are always separated by commas.

In [40]:

print("This sentence is the argument to print().", quote = FALSE)

[1] This sentence is the argument to print().

Note the absence of quotation marks as a result of this argument. Optional arguments have default values - in this case, quote = TRUE - that are consistent with the most common use case for that function. A full list of a function's arguments and an explanation of their possible values can be found on a function's manual page, accessed through the following syntax (which behaves oddly in a Jupyter environment like this one - it will look quite different in standalone R or RStudio).

Running the following line will cause the manual page to pop up separately, rather than opening in the notebook's main panel. Click the small X in the upper right corner to dismiss the popup.

In [41]:

?print
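The difference between required and optional arguments can also be sketched with round(), whose optional 'digits' argument defaults to 0. The number used here is arbitrary.

```r
# Only the required argument: 'digits' falls back to its default of 0
rounded.default <- round(3.14159)
print(rounded.default)

# Supplying the optional argument by name
rounded.named <- round(3.14159, digits = 2)
print(rounded.named)

# Arguments can also be matched by position: the second one is 'digits'
rounded.positional <- round(3.14159, 2)
print(rounded.positional)

# args() lists a function's arguments and their default values
print(args(round))
```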

sum() is a function that will sum the contents of a vector:

In [42]:

function.vector <- c(1, 1, 2, 0, 3)

sum(function.vector)

7

mean() and median() will determine the corresponding central tendency metrics of a vector:

In [43]:

mean(function.vector)

median(function.vector)

1.4

1

A particularly useful function for summary statistics is the summary() function, which includes quartiles and the mean:

In [44]:

summary(function.vector)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    0.0     1.0     1.0     1.4     2.0     3.0

Most functions are designed to 'return' some piece of data; by default, data that is returned prints to the screen and is not stored anywhere. It is possible to store the information returned by a function in a variable for future use, in which case the return value is (usually) not printed to the screen.

In [45]:

summary.return <- summary(function.vector)

Of course, we can then print it to the screen using print() if we so desire:

In [46]:

print(summary.return)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    0.0     1.0     1.0     1.4     2.0     3.0
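A stored return value can do more than be printed. As a sketch, the individual statistics inside summary.return can be pulled out by name, and dedicated functions such as range() and quantile() return the same pieces directly:

```r
function.vector <- c(1, 1, 2, 0, 3)
summary.return <- summary(function.vector)

# The stored result has named elements that can be extracted individually
print(names(summary.return))
stored.mean <- summary.return[["Mean"]]
print(stored.mean)

# Dedicated functions return the same pieces directly
value.range <- range(function.vector)               # smallest and largest values
third.quartile <- quantile(function.vector, 0.75)   # the 3rd quartile
print(value.range)
print(third.quartile)
```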

Factor Variables

The last common type of variable is a factor variable, known to statisticians as a 'categorical variable' or a 'nominal variable'. A factor is a type of variable that has a set number of distinct categories into which all observations fall - for example, in a clinical trial comparing multiple treatments, a factor variable would describe which treatment a given patient received.

Factor variables are important because, in versions of R before 4.0.0, the default behavior when reading text into a data frame was to convert that text into a factor variable rather than a character variable, which can often lead to unexpected behavior if the user is trying to e.g. search that text. Since R 4.0.0, text is kept as a character variable by default.

Here we can see a factor variable:

In [47]:

factor.vector <- as.factor(c("Metformin", "Metformin", "Acarbose", "Metformin", "Acarbose", "Acarbose", "Metformin"))

class(factor.vector)

'factor'

In [48]:

print(factor.vector)

[1] Metformin Metformin Acarbose  Metformin Acarbose  Acarbose  Metformin
Levels: Acarbose Metformin

Note that when printing out a factor variable, R tells you the 'levels' - in other words, all possible values the variable can take. Furthermore, unlike in a character vector, the vector's elements are not wrapped in quotation marks. If R prints out that 'levels' line and prints text without any quotation marks, you are dealing with a factor.
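The levels can also be inspected programmatically. A brief sketch using the same treatment data:

```r
factor.vector <- as.factor(c("Metformin", "Metformin", "Acarbose", "Metformin",
                             "Acarbose", "Acarbose", "Metformin"))

factor.levels <- levels(factor.vector)  # the categories, in alphabetical order
n.levels <- nlevels(factor.vector)      # how many categories there are
level.counts <- table(factor.vector)    # observations in each category

print(factor.levels)
print(n.levels)
print(level.counts)
```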

A factor variable can be converted into text without difficulty:

In [49]:

as.character(factor.vector)

'Metformin'
'Metformin'
'Acarbose'
'Metformin'
'Acarbose'
'Acarbose'
'Metformin'

Note that the 'levels' section is gone and the elements are all now wrapped in quotation marks.

It is also possible to convert a factor into a numeric vector; sometimes that is desirable, but often it can lead to unexpected behavior:

In [50]:

as.numeric(factor.vector)

2
2
1
2
1
1
2

To understand what has happened here, look at the 'levels' of the original factor and note their order. Converting a factor variable to a numeric variable causes each element to be replaced with the index of its level. Whether that is desirable or not is up to the analyst's own particular situation.
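This behavior is most surprising when the factor's levels are themselves numbers stored as text. The sketch below uses a made-up factor (number.factor) to contrast the level indices with the original numbers, which can be recovered by converting to character first:

```r
number.factor <- as.factor(c("10", "20", "10", "5"))

# Levels are sorted as text, so "5" comes after "20"
print(levels(number.factor))

# Direct conversion returns the index of each element's level
level.indices <- as.numeric(number.factor)
print(level.indices)

# Converting to character first recovers the numbers themselves
original.numbers <- as.numeric(as.character(number.factor))
print(original.numbers)
```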

Finally, we note that it is possible to change the order of the factor's levels using the relevel() command, which is important for certain statistical procedures that are beyond the scope of this analysis:

In [51]:

print(relevel(factor.vector, "Metformin"))

[1] Metformin Metformin Acarbose  Metformin Acarbose  Acarbose  Metformin
Levels: Metformin Acarbose

Note that Metformin is now the first level. The first level of a factor serves as the basis for comparison ('reference group') in many types of R-implemented regression; those interested in the (very important) implications of this statement can refer to the following textbook:

http://www.stat.columbia.edu/~gelman/arm/
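As an alternative to relevel(), the level order can be fixed when the factor is created by passing the 'levels' argument to factor(). A minimal sketch with a shortened, made-up treatment vector:

```r
treatments <- c("Metformin", "Acarbose", "Metformin")

# List the levels explicitly, in the desired order
explicit.factor <- factor(treatments, levels = c("Metformin", "Acarbose"))

print(levels(explicit.factor))

# The level indices follow the specified order: Metformin is 1, Acarbose is 2
explicit.indices <- as.numeric(explicit.factor)
print(explicit.indices)
```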

Data Frames

Data frames - what R calls general data tables - are the structure that a biologist will most frequently encounter in their work. Data frames are, for all intents and purposes, a group of vectors of equal length bound together into one unit. In fact, here we will literally bind three vectors together to create a data frame using the data.frame() function:

In [52]:

column.1 <- c("a", "b", "c")
column.2 <- c(1, 3, 5)
column.3 <- c(TRUE, TRUE, FALSE)
example.df <- data.frame(column.1, column.2, column.3)

example.df

column.1	column.2	column.3
a	1	TRUE
b	3	TRUE
c	5	FALSE

The data frame's column names can be changed in the following manner:

In [53]:

colnames(example.df) <- c("id", "visits", "tx.success")

example.df

id	visits	tx.success
a	1	TRUE
b	3	TRUE
c	5	FALSE
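Column names can also be supplied when the data frame is created, which avoids the separate renaming step. A sketch of the same table built that way (named.df is a name chosen for this example):

```r
# Each argument name becomes the name of the corresponding column
named.df <- data.frame(
  id = c("a", "b", "c"),
  visits = c(1, 3, 5),
  tx.success = c(TRUE, TRUE, FALSE)
)

print(named.df)
print(colnames(named.df))
```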

We can use the column names to extract a single column using the notation [data frame name]$[column name], e.g.:

In [54]:

print(example.df$visits)

is.vector(example.df$visits)

[1] 1 3 5

TRUE
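Two related notations are worth knowing, sketched below on a rebuilt copy of example.df: double square brackets extract a column as a vector (like '$', but also accepting a name stored in a variable), while single square brackets with a name return a one-column data frame.

```r
example.df <- data.frame(
  id = c("a", "b", "c"),
  visits = c(1, 3, 5),
  tx.success = c(TRUE, TRUE, FALSE)
)

# Double brackets: the column as a vector, same as example.df$visits
visits.vector <- example.df[["visits"]]
print(visits.vector)

# Single brackets with a name: a data frame with one column
visits.df <- example.df["visits"]
print(visits.df)

# Double brackets also work when the column name is stored in a variable
column.name <- "visits"
print(example.df[[column.name]])
```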

Once a data frame is constructed, it can be modified freely to include additional data. cbind() - column bind - allows you to bind a vector to the data frame, at which point it becomes a new column. By default, cbind() will name this new column after either the vector itself or the name of the object storing the vector.

In [55]:

column.4 <- c(1, 1, 1)

example.df.cbind <- cbind(example.df, column.4)

example.df.cbind

class(example.df.cbind)

id	visits	tx.success	column.4
a	1	TRUE	1
b	3	TRUE	1
c	5	FALSE	1

'data.frame'

The column names of a data frame are stored as a vector; as such, we can interact with them in the same manner as any vector using our previous knowledge.

In [56]:

colnames(example.df.cbind)

'id'
'visits'
'tx.success'
'column.4'

In [57]:

colnames(example.df.cbind)[4] <- "survival"

example.df.cbind

id	visits	tx.success	survival
a	1	TRUE	1
b	3	TRUE	1
c	5	FALSE	1

Here, we have used our knowledge of vector indexing to set the 4th element of the column name vector to 'survival' without affecting any of the other column names.

There is also an alternative method for generating new columns, which uses the same '$' notation discussed previously and lets the user specify the column name directly:

In [58]:

column.5 <- c(3, 7, 15)

example.df.cbind$total.tx <- column.5

example.df.cbind

id	visits	tx.success	survival	total.tx
a	1	TRUE	1	3
b	3	TRUE	1	7
c	5	FALSE	1	15

Please note that this new column may appear on a new line due to screen resolution; when that happens, R will automatically repeat the row name at the beginning of each new line for the sake of readability.

	1	2	3	4
1	1, 1	1, 2	1, 3	1, 4
2	2, 1	2, 2	2, 3	2, 4
3	3, 1	3, 2	3, 3	3, 4

Elements within data frames have indices just like elements within vectors. Since a data frame is two-dimensional rather than one-dimensional, each element has two coordinates, listed in the order [ROW, COLUMN] as denoted above.

Let's revisit our previous data:

In [59]:

example.df.cbind

id	visits	tx.success	survival	total.tx
a	1	TRUE	1	3
b	3	TRUE	1	7
c	5	FALSE	1	15

To extract the element in the first row and the first column, we will use the following syntax:

In [60]:

example.df.cbind[1, 1]

a

The element in the second row and the first column can be extracted using the following syntax:

In [61]:

example.df.cbind[2, 1]

b

Similarly, the element in the first row and the second column can be extracted using the following syntax:

In [62]:

example.df.cbind[1, 2]

1

Ranges of values, passed in as vectors, can be used to extract elements as well, e.g.:

In [63]:

example.df.cbind[c(1, 2), 1]

a
b

will yield both the element in the first row and the first column as well as the element in the second row and the first column.

The sequence operator ':' can be used to the same effect:

In [64]:

example.df.cbind[1:2, 1]

a
b

We can also extract entire rows or columns from a data frame using 'blank notation', e.g. the following will extract ALL rows associated with the first column (i.e. it will yield the first column):

In [65]:

example.df.cbind[, 1]

a
b
c

The following will extract ALL columns associated with the second row (i.e. it will yield the second row):

In [66]:

example.df.cbind[2, ]

	id	visits	tx.success	survival	total.tx
2	b	3	TRUE	1	7

Finally, we can use a negative index to extract all data except for specified rows or columns. Here, we will extract everything except for the second row:

In [67]:

example.df.cbind[-2, ]

	id	visits	tx.success	survival	total.tx
1	a	1	TRUE	1	3
3	c	5	FALSE	1	15

Negative indexing also works with a vector of indices. Here, we extract everything except for the first and second columns:

In [68]:

example.df.cbind[, -c(1, 2)]

tx.success	survival	total.tx
TRUE	1	3
TRUE	1	7
FALSE	1	15

Note that in this case, the rownames are not displayed. They still do exist, however, and can be queried:

In [69]:

rownames(example.df.cbind[, -c(1, 2)])

'1'
'2'
'3'

If both dimensions are queried using vectors, the resulting output will be two-dimensional, i.e. a data frame unto itself. Here we will extract elements 1,1; 1,2; 2,1; and 2,2, resulting in a 2x2 data frame:

In [70]:

example.df.cbind[1:2, 1:2]

is.data.frame(example.df.cbind[1:2, 1:2])

id	visits
a	1
b	3

TRUE

As you might be thinking, the above results of negative indexing are also data frames:

In [71]:

is.data.frame(example.df.cbind[, -c(1, 2)])

TRUE
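One caveat to the indexing rules above: selecting a single column returns a plain vector rather than a data frame. The sketch below, on a rebuilt copy of example.df, shows the 'drop = FALSE' argument that keeps the result a data frame, along with selecting columns by name instead of by position.

```r
example.df <- data.frame(
  id = c("a", "b", "c"),
  visits = c(1, 3, 5),
  tx.success = c(TRUE, TRUE, FALSE)
)

# A single column comes back as a vector by default
single.column <- example.df[, 1]
print(is.data.frame(single.column))

# drop = FALSE keeps the result as a one-column data frame
kept.as.df <- example.df[, 1, drop = FALSE]
print(is.data.frame(kept.as.df))

# Columns can also be selected by name
by.name <- example.df[, c("id", "visits")]
print(by.name)
```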

Now, we will take a look at one of the most useful and efficient features of R's data manipulation abilities: data frame merging.

Data is often spread across more than one file. Reading each file into R will result in more than one data frame. R's functions, however, generally must be applied to a single object; as such, we need to merge these separate data frames into one. These separate files will always have some sort of ID column that allows an analyst to determine how the different rows are related. For example, in this case the Patient ID column links which observations belong to which patient - here, our patient ID is encoded as id.

For a moment, we will return to our original data frame:

In [72]:

example.df

id	visits	tx.success
a	1	TRUE
b	3	TRUE
c	5	FALSE

Now, we will assemble a second data frame consisting of our other columns in a manner resembling the one analysts frequently encounter:

In [73]:

second.df <- data.frame(c("a", "i", "j"), column.4, column.5)

colnames(second.df) <- c("id", "survival", "total.tx")

second.df

id	survival	total.tx
a	1	3
i	1	7
j	1	15

As a sidenote, sometimes different files might inconsistently refer to the ID variable with multiple different names. To merge the data into a single data frame, the variables can be renamed until they are all called the same thing, or the differing names can be given to merge() through its by.x and by.y arguments.
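A minimal sketch of handling an ID column that is named differently in two data frames. The data frames (visits.df, survival.df) and the column name 'patient.id' are made up for this example; it shows both renaming and merge()'s by.x and by.y arguments.

```r
visits.df <- data.frame(id = c("a", "b"), visits = c(1, 3))
survival.df <- data.frame(patient.id = c("a", "b"), survival = c(1, 1))

# Option 1: rename the ID column so that both data frames agree
renamed.df <- survival.df
colnames(renamed.df)[1] <- "id"
merged.renamed <- merge(visits.df, renamed.df, by = "id")
print(merged.renamed)

# Option 2: tell merge() the ID column's name in each data frame
merged.by <- merge(visits.df, survival.df, by.x = "id", by.y = "patient.id")
print(merged.by)
```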

Now, we have two data frames that each have data about three patients with various IDs. How do we get R to merge these into a single data frame with each observation correctly mapped to the corresponding patient?

We can do so using the merge() function, to which the user must pass two data frames as well as information about which variable is held in common between them (id):

In [74]:

example.df

second.df

id	visits	tx.success
a	1	TRUE
b	3	TRUE
c	5	FALSE

id	survival	total.tx
a	1	3
i	1	7
j	1	15

One simple approach to merging is to merge such that only the common observations are present:

In [75]:

merge(example.df, second.df, by = "id", all = FALSE)

id	visits	tx.success	survival	total.tx
a	1	TRUE	1	3

Another simple approach is to merge all the observations and maintain consistency by adding missing values as needed:

In [76]:

merge(example.df, second.df, by = "id", all = TRUE)

id	visits	tx.success	survival	total.tx
a	1	TRUE	1	3
b	3	TRUE	NA	NA
c	5	FALSE	NA	NA
i	NA	NA	1	7
j	NA	NA	1	15

As can be seen above, merge() matches rows by the value of id rather than by their position. Patient 'a' appears in both data frames and therefore has a complete row, while patients that appear in only one data frame receive NA in the columns contributed by the other.
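Between these two extremes, merge() also accepts 'all.x = TRUE' (or 'all.y = TRUE') to keep every row of just one of the data frames. A sketch using rebuilt copies of the two example data frames:

```r
example.df <- data.frame(
  id = c("a", "b", "c"),
  visits = c(1, 3, 5),
  tx.success = c(TRUE, TRUE, FALSE)
)
second.df <- data.frame(
  id = c("a", "i", "j"),
  survival = c(1, 1, 1),
  total.tx = c(3, 7, 15)
)

# Keep every row of the first data frame; fill in NA where the second has no match
left.merge <- merge(example.df, second.df, by = "id", all.x = TRUE)
print(left.merge)
```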

Now, we will leave our simple example data frames and demonstrate some additional data exploration and manipulation features on a dataset that is still simple but nevertheless too large to look at on screen (which will be true in practice 99% of the time).

The data() function we will use simply imports an example from R's example datasets package, turning it into a data frame with the same name.

In [77]:

data(DNase)
DNase <- data.frame(DNase)

When first loading a dataset, it is a good idea to get an idea of how large it is, both to be sure it makes logical sense for your data and to figure out which functions should be used to explore it.

The dim() function simply returns the dimensions of the data frame in the order [rows, columns]. These values can also be extracted separately:

In [78]:

dim(DNase)

nrow(DNase)

ncol(DNase)

176
3

176

3

So, we have a data frame with 176 rows and 3 columns. An object of this size is too large to view on screen all at once, so to preview it, we can use the head() function.

By default, head() shows the first six rows of a data frame in their entirety (i.e. all columns).

In [79]:

head(DNase)

Run	conc	density
1	0.04882812	0.017
1	0.04882812	0.018
1	0.19531250	0.121
1	0.19531250	0.124
1	0.39062500	0.206
1	0.39062500	0.215

We can specify a second value ('argument') to this function to ask for a different number of lines:

In [80]:

head(DNase, 3)

Run	conc	density
1	0.04882812	0.017
1	0.04882812	0.018
1	0.19531250	0.121

In [81]:

head(DNase, 10)

Run	conc	density
1	0.04882812	0.017
1	0.04882812	0.018
1	0.19531250	0.121
1	0.19531250	0.124
1	0.39062500	0.206
1	0.39062500	0.215
1	0.78125000	0.377
1	0.78125000	0.374
1	1.56250000	0.614
1	1.56250000	0.609

The opposite of head() is tail():

In [82]:

tail(DNase, 10)

	Run	conc	density
167	11	0.78125	0.427
168	11	0.78125	0.411
169	11	1.56250	0.704
170	11	1.56250	0.684
171	11	3.12500	0.994
172	11	3.12500	0.980
173	11	6.25000	1.421
174	11	6.25000	1.385
175	11	12.50000	1.715
176	11	12.50000	1.721

Another useful tool for exploring a data frame is the str() - 'structure' - function:

In [83]:

str(DNase)

'data.frame':	176 obs. of  3 variables:
 $ Run    : Ord.factor w/ 11 levels "10"<"11"<"9"<..: 4 4 4 4 4 4 4 4 4 4 ...
 $ conc   : num  0.0488 0.0488 0.1953 0.1953 0.3906 ...
 $ density: num  0.017 0.018 0.121 0.124 0.206 0.215 0.377 0.374 0.614 0.609 ...

As can be seen above, str() shows a number of things: the dimensions of the data frame, the names of all the variables, their classes, and a preview of the first few values for each variable.

str() is particularly useful for exploring high-dimensional Bioconductor data, which will be seen extensively in the February Bioconductor workshops.

The summary function, which can be applied to either a vector or a data frame (in the latter case, R applies it separately to each column in the data frame) yields a variety of summary statistics about each variable:

In [84]:

summary(DNase)

      Run          conc             density      
 10     :16   Min.   : 0.04883   Min.   :0.0110  
 11     :16   1st Qu.: 0.34180   1st Qu.:0.1978  
 9      :16   Median : 1.17188   Median :0.5265  
 1      :16   Mean   : 3.10669   Mean   :0.7192  
 4      :16   3rd Qu.: 3.90625   3rd Qu.:1.1705  
 8      :16   Max.   :12.50000   Max.   :2.0030  
 (Other):80

This function is only well-behaved with numeric data like 'conc' and 'density'; the output describing a categorical (factor) variable with many levels like 'Run' is not useful.
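For a factor column such as 'Run', the table() and nlevels() functions seen earlier are one way to get a more complete picture than the truncated summary() output. A brief sketch:

```r
data(DNase)

# Count the observations in every level of the factor, not just the first six
run.counts <- table(DNase$Run)
print(run.counts)

# Number of distinct runs
n.runs <- nlevels(DNase$Run)
print(n.runs)
```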

We will now subset the first 20 rows of the 'DNase' object to make it more tractable for instructional purposes.

In [85]:

DNase.subset <- DNase[1:20, ]

dim(DNase.subset)

20
3

Now, we will demonstrate how to sort a data frame in R based on the values of a column. This action is not performed using the sort() function; instead, the order() function can be used.

By default, order() sorts in ascending order. Notably, ties remain in their initial relative order.

The syntax of the order() command is somewhat complicated, but a brief explanation is as follows: rather than returning sorted values, the function yields a vector of indices giving the position in the original vector of the smallest value, then the next smallest, and so on. First, we look at the vector in its original state:

In [86]:

print(DNase.subset$conc)

 [1]  0.04882812  0.04882812  0.19531250  0.19531250  0.39062500  0.39062500
 [7]  0.78125000  0.78125000  1.56250000  1.56250000  3.12500000  3.12500000
[13]  6.25000000  6.25000000 12.50000000 12.50000000  0.04882812  0.04882812
[19]  0.19531250  0.19531250

order() yields the indices of the above data points, arranged so that the corresponding values are in ascending order:

In [87]:

order(DNase.subset$conc)

1
2
17
18
3
4
19
20
5
6
7
8
9
10
11
12
13
14
15
16
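The output of order() is easier to read on a very short vector. In this sketch (with a made-up vector x), order() returns the positions of the values from smallest to largest, which is different from what rank() returns:

```r
x <- c(30, 10, 20)

# Positions of the smallest, middle, and largest values: 2 3 1
positions <- order(x)
print(positions)

# Indexing with those positions gives the sorted values: 10 20 30
sorted.values <- x[positions]
print(sorted.values)

# rank() instead gives the rank of each element in place: 3 1 2
ranks <- rank(x)
print(ranks)
```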

The 'decreasing = T' argument causes it to order in descending order instead of ascending order:

In [88]:

order(DNase.subset$conc, decreasing = T)

15
16
13
14
11
12
9
10
7
8
5
6
3
4
19
20
1
2
17
18

This vector is then used in conjunction with the square bracket notation and the 'blank notation' to retrieve the elements in that order.

We can assign this ordering to a vector:

In [89]:

reorder.vector <- order(DNase.subset$conc)

In [90]:

DNase.subset[reorder.vector, ]

	Run	conc	density
1	1	0.04882812	0.017
2	1	0.04882812	0.018
17	2	0.04882812	0.045
18	2	0.04882812	0.050
3	1	0.19531250	0.121
4	1	0.19531250	0.124
19	2	0.19531250	0.137
20	2	0.19531250	0.123
5	1	0.39062500	0.206
6	1	0.39062500	0.215
7	1	0.78125000	0.377
8	1	0.78125000	0.374
9	1	1.56250000	0.614
10	1	1.56250000	0.609
11	1	3.12500000	1.019
12	1	3.12500000	1.001
13	1	6.25000000	1.334
14	1	6.25000000	1.364
15	1	12.50000000	1.730
16	1	12.50000000	1.710

It is important to be mindful of the fact that the order of ties is not changed by ordering with order(), even if sorting is performed in descending order. We can demonstrate this fact by performing the above operations while sorting in descending order:

In [91]:

reorder.vector.descending <- order(DNase.subset$conc, decreasing = T)

In [92]:

DNase.subset[reorder.vector.descending, ]

	Run	conc	density
15	1	12.50000000	1.730
16	1	12.50000000	1.710
13	1	6.25000000	1.334
14	1	6.25000000	1.364
11	1	3.12500000	1.019
12	1	3.12500000	1.001
9	1	1.56250000	0.614
10	1	1.56250000	0.609
7	1	0.78125000	0.377
8	1	0.78125000	0.374
5	1	0.39062500	0.206
6	1	0.39062500	0.215
3	1	0.19531250	0.121
4	1	0.19531250	0.124
19	2	0.19531250	0.137
20	2	0.19531250	0.123
1	1	0.04882812	0.017
2	1	0.04882812	0.018
17	2	0.04882812	0.045
18	2	0.04882812	0.050
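When the order of ties matters, order() accepts additional vectors that are used to break them. The sketch below uses a small made-up data frame (scores.df) and sorts by group in ascending order, then by value in descending order within each group by negating the numeric column:

```r
scores.df <- data.frame(
  group = c("b", "a", "b", "a"),
  value = c(1, 4, 3, 2)
)

# First key: group (ascending); second key: value (descending, via the minus sign)
sorted.df <- scores.df[order(scores.df$group, -scores.df$value), ]
print(sorted.df)
```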

Data frames can be classified into two broad categories: wide format and long format.

All data frames shown thus far have been presented in wide format. A wide format data frame has each row describe a sample and each column describe a feature. Here is a short example of a data frame in wide format, tabulating counts for three arbitrary genes in three patients:

In [93]:

wide.df <- data.frame(c("A", "B", "C"), c(1, 1, 2), c(5, 6, 7), c(0, 1, 0))
colnames(wide.df) <- c("id", "gene.1", "gene.2", "gene.3")

print(wide.df)

  id gene.1 gene.2 gene.3
1  A      1      5      0
2  B      1      6      1
3  C      2      7      0

Long format stacks features on top of one another; each row is the combination of a sample and a feature. One column exists to denote the feature in question, and another column exists to denote that feature's value:

In [94]:

long.df <- data.frame(c("A", "A", "A", "B", "B", "B", "C", "C", "C"), c("gene.1", "gene.2", "gene.3", "gene.1", "gene.2", "gene.3", "gene.1", "gene.2", "gene.3"), c(1, 5, 0, 1, 6, 1, 2, 7, 0))
colnames(long.df) <- c("id", "gene", "count")

print(long.df)

  id   gene count
1  A gene.1     1
2  A gene.2     5
3  A gene.3     0
4  B gene.1     1
5  B gene.2     6
6  B gene.3     1
7  C gene.1     2
8  C gene.2     7
9  C gene.3     0

These formats both contain the exact same data but represent it in different ways. Various functions exist to convert between wide and long format but these are beyond the scope of today's discussion.

Those interested can look up the 'reshape2' package on their own:

http://seananderson.ca/2013/10/19/reshape.html
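For readers who would like a quick preview, the following is a minimal sketch of a wide-to-long conversion using the reshape() function that ships with base R, rather than the 'reshape2' package. The object names ('wide.example', 'gene.columns', 'long.example') are chosen here for illustration only:

```r
# Rebuild the wide-format example so that this block is self-contained
wide.example <- data.frame(id     = c("A", "B", "C"),
                           gene.1 = c(1, 1, 2),
                           gene.2 = c(5, 6, 7),
                           gene.3 = c(0, 1, 0))

gene.columns <- c("gene.1", "gene.2", "gene.3")

# Stack the three gene columns into a single 'count' column;
# the new 'gene' column records which column each value came from
long.example <- reshape(wide.example,
                        direction = "long",
                        varying   = gene.columns,
                        v.names   = "count",
                        timevar   = "gene",
                        times     = gene.columns,
                        idvar     = "id")

# Sort by patient and then by gene, and drop the generated row names
long.example <- long.example[order(long.example$id, long.example$gene), ]
rownames(long.example) <- NULL

print(long.example)
```

The result should contain the same nine id/gene/count combinations as the long format data frame built by hand above.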

Matrices

R also includes another type of data structure very similar to a data frame - the 'matrix'. The matrix is analogous to a data frame, but every entry must be of the same type, e.g. all entries must be numeric or all entries must be text.

If an object that is supposed to be a data frame including numeric columns suddenly begins to display quotation marks around the numeric values, that is a sign that it has been coerced to a matrix of text. This occurs, for example, when a data frame containing a text column is transposed, and it is rarely (if ever) desired behavior.
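The coercion described above can be seen in a short sketch. The data frame 'mixed.df' below is a hypothetical example created only for this demonstration:

```r
# Hypothetical data frame with one text column and two numeric columns
mixed.df <- data.frame(id     = c("A", "B", "C"),
                       gene.1 = c(1, 1, 2),
                       gene.2 = c(5, 6, 7))

# Transposing converts the data frame to a matrix; because a matrix can
# hold only one type, the numbers are converted to text
transposed <- t(mixed.df)
print(transposed)
print(is.character(transposed))

# Transposing only the numeric columns keeps the values numeric
numeric.only <- t(mixed.df[, c("gene.1", "gene.2")])
print(is.numeric(numeric.only))
```

One way to avoid the problem is therefore to set aside any text columns (for instance, by storing them as row names) before transposing.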

Extensions to R¶

As we have seen throughout this workshop, R has a wide variety of functions included by default; however, the true utility of R is its extensibility. It was designed to be upgraded by users, and there now exist over 12,000 user-created collections of functions - 'R packages' - in the default R repository alone.

At this point in time, it is often not necessary for a user to write their own functions; chances are, another author has already done so. Googling your use case and 'R' will often yield one or more packages intended to accomplish your goal - but how are these packages added to your version of R?

Fortunately, the answer is very simple: the install.packages() function exists to facilitate this process. Passing the name of a desired package in quotes will cause R to automatically download and install the package in question. Here, we will install the 'devtools' package:

In [95]:

install.packages("devtools")

Installing package into ‘/home/ubuntu/R/x86_64-pc-linux-gnu-library/3.4’
(as ‘lib’ is unspecified)

Installing packages in a Jupyter environment looks different (i.e. much worse) than it does in standalone R or RStudio; those programs show progress bars to monitor the status of the installation. The installation of devtools should take about two minutes, so rest assured that it has not frozen even though no output can be seen.

When install.packages() downloads a package, it will also download any other packages necessary for your desired package to work correctly; in this case, installing devtools also installs 9 other packages. R's modular nature means that most packages depend on other packages, using other people's functions to accomplish their own work, so it is not uncommon to end up installing 10 or more packages just to gain access to the one you actually plan to use directly.

Once a package is installed, it is necessary to activate it within your R session. Installing a package does not automatically make its functions available in that session. We can verify this fact using the search() function, which lists the packages and environments currently attached to the R session:

In [96]:

search()

'.GlobalEnv'
'jupyter:irkernel'
'package:stats'
'package:graphics'
'package:grDevices'
'package:utils'
'package:datasets'
'package:methods'
'Autoloads'
'package:base'

To load a package into R - which must be done at the beginning of each session - the library() function is used:

In [97]:

library(devtools)

In [98]:

search()

'.GlobalEnv'
'package:devtools'
'jupyter:irkernel'
'package:stats'
'package:graphics'
'package:grDevices'
'package:utils'
'package:datasets'
'package:methods'
'Autoloads'
'package:base'

The devtools package is now near the top of the list of loaded packages, immediately after '.GlobalEnv'.
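The difference between a package being installed and being loaded can also be checked programmatically. The following is a minimal sketch; 'is.attached' is a hypothetical helper written for this example, and the 'tools' package is used because it ships with every R installation and therefore requires no download:

```r
# Hypothetical helper: TRUE if the named package appears in search()
is.attached <- function(package.name) {
    paste0("package:", package.name) %in% search()
}

# requireNamespace() reports whether a package is installed,
# without adding it to the search() list
tools.installed <- requireNamespace("tools", quietly = TRUE)
print(tools.installed)

# After library(), the package appears in the search() list
library(tools)
tools.attached <- is.attached("tools")
print(tools.attached)
```

The same check could be applied to 'devtools' or any other package name.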

Most R packages are found in CRAN (the Comprehensive R Archive Network), the central repository for R packages. However, packages can also be found in other places.

Within the context of bioinformatics, the most important external repository is Bioconductor. This repository contains over 1400 packages designed for the analysis of biological data; several of these packages will be discussed in subsequent workshops. For now, we will concern ourselves with accessing this repository and downloading an example package.

There are two steps to downloading a Bioconductor package; this process is distinct from that used to download a standard R package from CRAN. First, the following command must be used, which temporarily gives R access to the 'biocLite()' function, a package downloader analogous to 'install.packages()'.

This command must be run at the beginning of each new R session.

In [99]:

source("https://bioconductor.org/biocLite.R")

Bioconductor version 3.5 (BiocInstaller 1.26.1), ?biocLite for help
A newer version of Bioconductor is available for this version of R,
  ?BiocUpgrade for help

The syntax for downloading a Bioconductor package is directly analogous to that of 'install.packages()':

In [100]:

biocLite("IRanges")

library(IRanges)

BioC_mirror: https://bioconductor.org
Using Bioconductor 3.5 (BiocInstaller 1.26.1), R 3.4.2 (2017-09-28).
Installing package(s) ‘IRanges’
installation path not writeable, unable to update packages: codetools, lattice,
  MASS, Matrix, mgcv, rpart, spatial
Old packages: 'ade4', 'ape', 'backports', 'curl', 'data.table', 'digest',
  'foreach', 'getopt', 'git2r', 'Hmisc', 'htmlTable', 'htmlwidgets', 'irlba',
  'iterators', 'knitr', 'lazyeval', 'matrixStats', 'openssl', 'pbdZMQ',
  'phangorn', 'Rcpp', 'RcppArmadillo', 'RCurl', 'registry', 'reshape2',
  'rlang', 'stringi', 'tibble', 'vegan', 'viridis', 'withr', 'yaml'
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, cbind, colMeans, colnames,
    colSums, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
    grepl, intersect, is.unsorted, lapply, lengths, Map, mapply, match,
    mget, order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
    rbind, Reduce, rowMeans, rownames, rowSums, sapply, setdiff, sort,
    table, tapply, union, unique, unsplit, which, which.max, which.min

Loading required package: S4Vectors
Loading required package: stats4

Attaching package: ‘S4Vectors’

The following object is masked from ‘package:base’:

    expand.grid
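Readers using a newer setup should be aware that the commands above reflect Bioconductor 3.5. From Bioconductor 3.8 onward, biocLite() was replaced by the 'BiocManager' package, which is installed from CRAN. The following is a minimal sketch of the newer approach; 'install.bioc' is a hypothetical helper written for this example and is not part of Bioconductor itself:

```r
# Hypothetical helper: install a Bioconductor package via BiocManager,
# first installing BiocManager from CRAN if it is not already present
install.bioc <- function(package.name) {
    if (!requireNamespace("BiocManager", quietly = TRUE)) {
        install.packages("BiocManager")
    }
    BiocManager::install(package.name)
}

# Example call (commented out so that nothing is downloaded by accident):
# install.bioc("IRanges")
```

Unlike the biocLite() script, BiocManager only needs to be installed once rather than sourced at the beginning of each session.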

Bioconductor also comes with a variety of custom object classes built specifically to handle common biological data structures such as genomic coordinates and FASTA sequences. A full list of these classes can be viewed at:

https://bioconductor.org/developers/how-to/commonMethodsAndClasses/

Notably, Bioconductor also has its own extension of the data frame that allows the inclusion of metadata and does not require row names. Documentation for this data structure is located at:

https://www.rdocumentation.org/packages/S4Vectors/versions/0.10.1/topics/DataFrame-class

Users can also create custom functions to aid in performing various tasks. If an operation must be performed many, many times in the exact same manner, it is a good candidate for conversion into a function.

Here, as a very simple example, we will demonstrate a basic function for cubing a number.

In [101]:

cube.number <- function(input.number) {
    # Multiply the input by itself three times
    output.number <- input.number * input.number * input.number
    return(output.number)
}

result <- cube.number(2)

print(result)

[1] 8

First off, functions are defined like other objects using the assignment operator '<-'. The name of a function is user-defined.

When defining a function, its arguments are declared by placing their names within the function() command; a function may have any number of arguments, including none. Here, we define one argument, 'input.number'. Because that argument has no default value, a value must be passed in when the function is invoked.

When a value is provided to an argument (e.g. 2, above), that argument's name is temporarily created as a variable containing the value in question. Thus, in our example, 'input.number' is temporarily defined as 2. This definition remains in place until the end of the function, at which point it is deleted.

Within a function, standard mathematical operations work as usual. Thus, in our example, 'input.number * input.number * input.number' = '2 * 2 * 2' = 8.

Variables created within functions, including those defined by providing the function's arguments with values, are deleted upon completion of the function. There is, however, a way to retain information from within a function: a return statement.

A return statement tells the function what it should pass back to the R session upon termination of the function. Here, 'return(output.number)' causes the function to return the cube of 'input.number' as defined in the previous line.

The value passed back by a return statement can be stored in an object for future use. If the function is called without assigning its result to an object, the returned value is printed to the screen and not stored:

In [102]:

cube.number(3)

27
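Functions can also take more than one argument, and arguments can be given default values. The following is a minimal sketch; 'raise.number' and its 'exponent' argument are hypothetical names introduced here as a generalization of cube.number():

```r
# 'exponent' defaults to 3, so the function cubes its input unless told otherwise
raise.number <- function(input.number, exponent = 3) {
    output.number <- input.number ^ exponent
    return(output.number)
}

# Using the default exponent
cubed <- raise.number(2)
print(cubed)

# Overriding the default by naming the argument
fourth.power <- raise.number(2, exponent = 4)
print(fourth.power)

# Because R is vectorized, an entire vector can be passed in at once
cubed.vector <- raise.number(1:3)
print(cubed.vector)
```

Naming the argument in the call (exponent = 4) is optional here, but it makes the intent clearer to a reader.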

Note that the temporary variables 'input.number' and 'output.number' are no longer defined, having been deleted upon termination of the function:

In [103]:

print(input.number)

Error in print(input.number): object 'input.number' not found
Traceback:

1. print(input.number)

In [104]:

print(output.number)

Error in print(output.number): object 'output.number' not found
Traceback:

1. print(output.number)
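A related consequence is that changing a variable inside a function does not change a variable of the same name outside of it. The following is a minimal sketch of this behavior; 'counter', 'add.one', and 'temporary.value' are hypothetical names used only for this example:

```r
counter <- 10

add.one <- function(counter) {
    # This modifies only the function's temporary copy of 'counter'
    counter <- counter + 1
    temporary.value <- counter * 2   # exists only while the function runs
    return(counter)
}

new.value <- add.one(counter)

print(counter)     # the original object is unchanged
print(new.value)   # the returned value was stored separately

# exists() checks for an object without raising an error
temporary.found <- exists("temporary.value")
print(temporary.found)
```

Using exists() in this way avoids the error messages shown above when checking whether an object is defined.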

Finally, the following video demonstrates how to install R and RStudio (an integrated development environment for R) locally:

https://www.youtube.com/watch?v=MFfRQuQKGYg

Support in navigating this process is available by emailing:

cbc-help@brown.edu

Doing so will generate a support ticket in our system to facilitate the issue's resolution.
