R is used for statistical computations. It implements many statistical functions, tries to show the results of evaluations in a clear and readable form, and can produce plots and diagrams.
Let us use an "R cheatsheet". https://www.rstudio.com/wp-content/uploads/2016/10/r-cheat-sheet-3.pdf
x <- 10 # an assignment
x = 10 # = also works, but prefer <-: = can lead to problems, and it is also used for naming function arguments
42 -> y # assignment
The most basic data type in R is a vector. A vector is a linear array of elements. There are several "modes": numeric, character, logical (boolean). You create a vector with the c() function:
c(1, 2, 3) # numeric
c(T, F, T) # boolean
c("abc", "xyz") #character
# in octave: c(...) -> [...]
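You should not mix modes in one vector: it holds either numbers or characters, but not both. If you do mix them, R silently converts all elements to a common mode, for example c(1, "abc") becomes a character vector.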
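To see what happens in practice when modes are mixed, here is a minimal sketch. The variable names are only illustrative; the point is that R does not report an error but converts every element to one common mode.

```r
# Mixing modes does not fail; R coerces everything to the most general mode.
mixed <- c(1, "abc", TRUE)
mode(mixed)        # "character"
mixed              # "1" "abc" "TRUE"

# Logical values mixed with numbers become numbers (TRUE -> 1, FALSE -> 0).
nums <- c(10, TRUE, FALSE)
mode(nums)         # "numeric"
nums               # 10 1 0
```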
2:6 #forward
10:1 # backwards
seq(1, 10, by=2) # from 1 to 10 with step 2 (In Octave: 1:2:10)
seq(1, 10, length.out=4) # from 1 to 10, but create 4 elements (In octave: linspace(1, 10, 4))
seq(1, 10, 4) # if you don't write anything before 4, it is "by="
Here we see that functions in R usually take named arguments. You write the name of the argument, then =, then its value.
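As a small illustration of positional versus named arguments, the sketch below calls seq in three equivalent ways. It uses the argument names from, to and by from help(seq); the variable names are arbitrary.

```r
# Arguments can be matched by position or by name.
a <- seq(1, 10, 2)                    # third positional argument is "by"
b <- seq(from = 1, to = 10, by = 2)   # the same call with every name written out
identical(a, b)                       # TRUE

# Named arguments may be given in any order.
d <- seq(by = 2, to = 10, from = 1)
identical(a, d)                       # TRUE
```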
x <- seq(10, 1000, by=10)
x
x[4] # square brackets for indexing
x[c(1, 5, 7)] # in Octave: x([1, 5, 7])
x[-4] # all except the fourth element
x[c(-4, -5)] # all except the 4th and 5th elements
x[x > 600] # all elements that are greater than 600 (In Octave: x(x > 600) )
x > 600 # this is a logical vector, and x[x > 600] is logical indexing: we select only the elements that have TRUE in the index
1 == 2
30 == 30
c(10, 20, 30, 40) == 20
c(10, 20, 30, 40) > 20
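The comparisons above produce logical vectors, and those can be stored, counted and combined before being used as an index. A short sketch, with a shorter vector than in the text so the results are easy to check by hand:

```r
x <- seq(10, 100, by = 10)

mask <- x > 60          # logical vector, one TRUE/FALSE per element
sum(mask)               # number of TRUE values: 4
x[mask]                 # 70 80 90 100

# Conditions can be combined element-wise with & (and) and | (or).
x[x > 20 & x <= 50]     # 30 40 50

# which() converts a logical vector to the positions of its TRUE values.
which(mask)             # 7 8 9 10
```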
Imagine we want to store the sex of several people: male or female. Let's try a vector:
s <- c(T, F, T, F, F, F) # not obvious, who is Male, who is Female.
s <- c("Male", "Female", "Male", "Female") # 1) we can make a typo 2) needs a lot of memory
s <- c(1, 2, 1, 2, 2, 2) # you have to specify what is what
That's why we need factors. They are used to store variables that have a finite number of possible values (Male, Female or Bad, Medium, Good). We use a function called "factor" to create factors:
sexes <- factor(c("Male", "Female", "Male", "Male")) # we create a factor from a character vector
sexes
# we can specify levels explicitly
states <- factor(c("Good", "Medium", "Good"), levels=c("Good", "Medium", "Bad"))
states
How are factors stored? A factor is just an integer vector (1, 2, 1, 1, 2) with additional metainformation attached: the levels, that is, what each number means.
Factor = Numeric vector + Metainformation. When we print a factor, we don't see numbers, we see them substituted by levels
Factors can be ordered. For the states of a patient we know that Good is better than Medium, and Medium is better than Bad, so we can declare this factor ordered. Later we will be able to sort based on this factor:
states <- factor(c("Good", "Medium", "Good"), levels=c("Bad", "Medium", "Good"), ordered=T) # levels are listed from lowest to highest
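The "numeric vector + metainformation" picture can be checked directly. The sketch below assumes the levels are listed from lowest (Bad) to highest (Good); the extra "Bad" value is added only so that every level occurs.

```r
states <- factor(c("Good", "Medium", "Good", "Bad"),
                 levels = c("Bad", "Medium", "Good"), ordered = TRUE)

levels(states)       # "Bad" "Medium" "Good"
as.integer(states)   # 3 2 3 1  (positions in levels)

# Comparisons and sorting use the level order, not alphabetical order.
states > "Medium"            # TRUE FALSE TRUE FALSE
as.character(sort(states))   # "Bad" "Medium" "Good" "Good"
```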
x <- c(1, 2, 5, 3, 5, 7, 8, 4, 3, 1, 4, 4, 5, 6, 7, 8)
summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   3.000   4.500   4.562   6.250   8.000 
summary shows information about a vector (or any other data given to it): minimum, maximum, mean, median, 1st quartile and 3rd quartile. (The median is the 2nd quartile.)
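The individual numbers reported by summary can also be computed one by one. A minimal sketch using the base functions quantile, median and mean on the same vector:

```r
x <- c(1, 2, 5, 3, 5, 7, 8, 4, 3, 1, 4, 4, 5, 6, 7, 8)

# Quartiles are the 25%, 50% and 75% quantiles.
q <- quantile(x, probs = c(0.25, 0.5, 0.75))
q                          # 3.00 4.50 6.25

median(x) == q[["50%"]]    # TRUE: the median is the 2nd quartile
mean(x)                    # 4.5625
```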
x <- c("a", "b", "a", "a", "c", "b")
summary(x)
   Length     Class      Mode 
        6 character character 
This only means that this is a vector of characters (strings) of length 6.
summary(sexes)
summary is a smart function that works differently for different types of arguments. Another example of such a function is the table function:
sexes <- factor(c("Male", "Female", "Male", "Male", "Female"))
states <- factor(c("Good", "Good", "Bad", "Medium", "Good"), levels=c("Bad", "Medium", "Good"), ordered=T)
table(sexes, states)
        states
sexes    Bad Medium Good
  Female   0      0    2
  Male     1      1    1
This works for vectors of the same length. It counts how many times each combination of values occurs and puts the counts in a table. It also works with three vectors (you may try it).
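A sketch of the three-vector case, for those who want to try it. The third vector, adult, is a hypothetical variable invented for this example; it is not part of the data used elsewhere in these notes.

```r
sexes  <- factor(c("Male", "Female", "Male", "Male", "Female"))
states <- factor(c("Good", "Good", "Bad", "Medium", "Good"),
                 levels = c("Bad", "Medium", "Good"), ordered = TRUE)
adult  <- c(TRUE, TRUE, FALSE, TRUE, FALSE)   # hypothetical third variable

table(sexes)                       # Female 2, Male 3
t3 <- table(sexes, states, adult)  # three-way table
dim(t3)                            # 2 3 2
t3["Male", "Good", "TRUE"]         # 1
```

With three vectors the result is a three-dimensional array, printed as one two-way table per value of the third variable.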
x <- c(10, 20, 30)
x
x has data 10, 20, 30 inside. But we can add names for its elements:
names(x) <- c("first", "second", "third")
x
Metainformation controls the way data is printed and manipulated
x == 10 # metainformation is not discarded
x[1]
x["first"]
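A few more things that can be done with names, as a brief sketch. Setting the names inside c() is an alternative to the names(x) <- ... form used above; unname is a base function that drops them again.

```r
x <- c(first = 10, second = 20, third = 30)  # names can be set at creation
names(x)                 # "first" "second" "third"
x[c("third", "first")]   # select by name, in the requested order
x * 2                    # names are kept by arithmetic
unname(x)                # drop names: 10 20 30
```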
A distribution is a law describing how random numbers are generated. We cannot just ask a computer to return a random number; we have to say which distribution it should come from.
runif(10) # numbers from 0 to 1, r = random generation, unif = uniform
# help(runif)
runif(10, min=-100, max=100) # from -100 to 100
# normal distribution: norm
rnorm(10) # mean=0, sd=1 (standard deviation, so the variance is also 1)
plot(rnorm(1000))
We can also use functions that start with 'd' instead of 'r', this means that we get a density function:
# plot the same way we did it on Octave
x <- seq(-5, 5, by=0.01)
y <- dnorm(x)
plot(x, y)
y <- dunif(x)
plot(x, y)
A binomial distribution:
#help(rbinom)
rbinom(20, size=3, prob=1/10) # 20 experiments: we throw a biased coin three times, 1/10 is the probability of a head; each result is the number of heads
rbinom(20, size=1, prob=1/2) # just one throw of a fair coin, so we generate 0 or 1 with probabilities 0.5/0.5
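The r-functions and d-functions can be compared against each other. The sketch below fixes the random seed (the value 1 is arbitrary and serves only reproducibility), draws samples, and prints the theoretical probabilities next to the observed counts.

```r
set.seed(1)                       # arbitrary seed, only for reproducibility

u <- runif(1000, min = -1, max = 1)
range(u)                          # always inside [-1, 1]

heads <- rbinom(1000, size = 3, prob = 1/2)   # heads in 3 fair throws
table(heads)                      # counts of 0, 1, 2, 3 heads

# d-functions give the theoretical probabilities for comparison.
dbinom(0:3, size = 3, prob = 1/2) # 0.125 0.375 0.375 0.125
```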
#help(read.csv)
cats <- read.csv("data.csv")
cats
summary(cats) # makes a summary for each column
name | age | sex | weight |
---|---|---|---|
Tom | 5 | Male | 10 |
Bob | 10 | Male | 12 |
Kitty | 12 | Female | 12 |
Lucy | 4 | Female | 7 |
Any,thing | 7 | Female | 8 |
        name        age           sex        weight    
 Any,thing:1   Min.   : 4.0   Female:3   Min.   : 7.0  
 Bob      :1   1st Qu.: 5.0   Male  :2   1st Qu.: 8.0  
 Kitty    :1   Median : 7.0              Median :10.0  
 Lucy     :1   Mean   : 7.6              Mean   : 9.8  
 Tom      :1   3rd Qu.:10.0              3rd Qu.:12.0  
               Max.   :12.0              Max.   :12.0  
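In R versions before 4.0.0, string columns are loaded as factors by default (since 4.0.0 the default is stringsAsFactors=FALSE). The output above shows both "name" and "sex" loaded as factors, which is not appropriate for the name. Let's load cats again: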
cats <- read.csv('data.csv', stringsAsFactors=F)
cats$sex <- factor(cats$sex) # $ operator allows extracting columns
cats
summary(cats)
name | age | sex | weight |
---|---|---|---|
Tom | 5 | Male | 10 |
Bob | 10 | Male | 12 |
Kitty | 12 | Female | 12 |
Lucy | 4 | Female | 7 |
Any,thing | 7 | Female | 8 |
     name                age           sex        weight    
 Length:5           Min.   : 4.0   Female:3   Min.   : 7.0  
 Class :character   1st Qu.: 5.0   Male  :2   1st Qu.: 8.0  
 Mode  :character   Median : 7.0              Median :10.0  
                    Mean   : 7.6              Mean   : 9.8  
                    3rd Qu.:10.0              3rd Qu.:12.0  
                    Max.   :12.0              Max.   :12.0  
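To try the loading step without a file on disk, read.csv also accepts the CSV text directly through its text argument. The CSV content below is an assumption reconstructed from the printed table (the real data.csv is not shown in these notes); note that a value containing a comma has to be quoted.

```r
# Assumed content of data.csv, reconstructed from the printed table.
csv_text <- 'name,age,sex,weight
Tom,5,Male,10
Bob,10,Male,12
Kitty,12,Female,12
Lucy,4,Female,7
"Any,thing",7,Female,8'

cats <- read.csv(text = csv_text, stringsAsFactors = FALSE)
cats$sex <- factor(cats$sex)   # only sex is a true categorical variable
str(cats)                      # name: chr, age: int, sex: Factor, weight: int
```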
Columns are vectors, so we can process them as we discussed previously:
cats$name # extract a column
cats[["name"]] # useful when the name of the column is not a valid identifier
cats[["col name with space"]] <- 1:5 # this column cannot be used with plain $ (it would need backticks)
cats
name | age | sex | weight | col name with space |
---|---|---|---|---|
Tom | 5 | Male | 10 | 1 |
Bob | 10 | Male | 12 | 2 |
Kitty | 12 | Female | 12 | 3 |
Lucy | 4 | Female | 7 | 4 |
Any,thing | 7 | Female | 8 | 5 |
Use [rows,cols] to specify rows and columns:
cats[2, "name"] # Bob
cats[c(2, 4), c("name", "age")] # 2nd and 4th rows, columns name and age
cats[c(2, 4),] # all columns if none are specified
cats$weight # this is weights column
cats$weight >= 10 # this is true/false column
cats[cats$weight >= 10, ]
# and so on; see other examples in the cheat sheet.
 | name | age |
---|---|---|
2 | Bob | 10 |
4 | Lucy | 4 |

 | name | age | sex | weight | col name with space |
---|---|---|---|---|---|
2 | Bob | 10 | Male | 12 | 2 |
4 | Lucy | 4 | Female | 7 | 4 |
name | age | sex | weight | col name with space |
---|---|---|---|---|
Tom | 5 | Male | 10 | 1 |
Bob | 10 | Male | 12 | 2 |
Kitty | 12 | Female | 12 | 3 |
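Row conditions and column selections can be combined in one [rows,cols] expression. The sketch below rebuilds the cats table by hand so that it runs on its own; order is a base function, not used elsewhere in these notes, that returns the row positions in sorted order.

```r
cats <- data.frame(
  name   = c("Tom", "Bob", "Kitty", "Lucy", "Any,thing"),
  age    = c(5, 10, 12, 4, 7),
  sex    = factor(c("Male", "Male", "Female", "Female", "Female")),
  weight = c(10, 12, 12, 7, 8),
  stringsAsFactors = FALSE
)

# Combine conditions with & and keep only two columns.
heavy_females <- cats[cats$sex == "Female" & cats$weight >= 8, c("name", "weight")]
heavy_females                    # Kitty 12, Any,thing 8

# Sort rows by age using order().
cats[order(cats$age), "name"]    # "Lucy" "Tom" "Any,thing" "Bob" "Kitty"
```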
There are many libraries for making plots in R. We will look only briefly at the built-in graphics.
plot(c(1, 2, 3), c(20, 10, 30), type="l") # l is for lines
plot(c(1, 2, 3), c(20, 10, 30), type="p") # p is for points, other values are in help(plot)
par(col="blue", bty="]") # graphical parameters are specified globally, so all other plots will be blue
plot(1:5, 1:5, "l")
#see other parameters in help(par)
#help(par)
How do we deal with global parameters? We should save them before modification and restore them afterwards:
opar <- par(no.readonly=T) # save all parameters to a variable opar, return only params that can be modified
opar
par(col="red")
plot(1:2)
par(opar) # restores everything, and col is 'black' again
plot(1:2)
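The save-and-restore pattern can be wrapped in a function so that the restore step is never forgotten. This is one possible sketch, not the only way to do it: plot_red is a hypothetical helper name, on.exit is a base function that runs code when the function finishes, and pdf(NULL) opens a device that draws nowhere so that the example runs without a display.

```r
# Hypothetical helper: draws a red line plot without leaking par() changes.
plot_red <- function(x, y) {
  opar <- par(col = "red")   # par() returns the previous values of what it changes
  on.exit(par(opar))         # restore them when the function exits
  plot(x, y, type = "l")
}

pdf(NULL)                    # null device so the sketch runs without a display
before <- par("col")
plot_red(1:5, 1:5)
after <- par("col")
identical(before, after)     # TRUE: the global colour is unchanged
dev.off()
```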