Statistical package R

R is used for statistical computing. It implements a large number of statistical functions and tries to present the results of evaluations in a readable form. It can also produce plots and diagrams.

We will use an R cheat sheet as a reference: https://www.rstudio.com/wp-content/uploads/2016/10/r-cheat-sheet-3.pdf

In [1]:

x <- 10 # assignment
x = 10 # = also works here, but prefer <-, because = is also used to name function arguments
42 -> y # assignment to the right

The most basic data type in R is a vector. A vector is a linear array of elements of the same kind. There are several "modes": numeric, character, logical. You create a vector with the c() function:

In [2]:

c(1, 2, 3) # numeric
c(TRUE, FALSE, TRUE) # logical (T and F are shorthands for TRUE and FALSE)
c("abc", "xyz") # character
# in Octave, c(...) corresponds to [...]

Out[2]:

1
2
3

Out[2]:

TRUE
FALSE
TRUE

Out[2]:

'abc'
'xyz'

You cannot mix modes in one vector: it holds either numbers or characters, but not both. If you combine them, R converts all elements to a common mode.
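
To see what happens when modes are mixed, here is a minimal sketch (the variable name mixed is only illustrative). It relies on c() converting its arguments to the most general mode present:

```r
# Combining a number, a string and a logical value in one vector
mixed <- c(1, "a", TRUE)

mode(mixed)   # "character": every element was converted to a string
mixed         # "1" "a" "TRUE"
```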

In [3]:

2:6 # forward
10:1 # backwards
seq(1, 10, by=2) # from 1 to 10 with step 2 (in Octave: 1:2:10)
seq(1, 10, length.out=4) # from 1 to 10 with exactly 4 elements (in Octave: linspace(1, 10, 4))
seq(1, 10, 4) # the third unnamed argument is matched to by, so this is seq(1, 10, by=4)

Out[3]:

2
3
4
5
6

Out[3]:

10
9
8
7
6
5
4
3
2
1

Out[3]:

1
3
5
7
9

Out[3]:

1
4
7
10

Out[3]:

1
5
9

Here we see that functions in R often take named arguments, written as name=value. Arguments without a name are matched by position.
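
A small sketch of how argument matching works for seq(), whose first three arguments are from, to and by. The variable names a, b and d are only illustrative:

```r
a <- seq(1, 10, by = 3)               # from and to by position, by by name
b <- seq(from = 1, to = 10, by = 3)   # everything by name
d <- seq(by = 3, to = 10, from = 1)   # named arguments may come in any order

identical(a, b)   # TRUE
identical(b, d)   # TRUE
a                 # 1 4 7 10
```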

Indexing

In [4]:

x <- seq(10, 1000, by=10)
x

Out[4]:

10
20
30
40
50
60
70
80
90
100
110
120
130
140
150
160
170
180
190
200
210
220
230
240
250
260
270
280
290
300
310
320
330
340
350
360
370
380
390
400
410
420
430
440
450
460
470
480
490
500
510
520
530
540
550
560
570
580
590
600
610
620
630
640
650
660
670
680
690
700
710
720
730
740
750
760
770
780
790
800
810
820
830
840
850
860
870
880
890
900
910
920
930
940
950
960
970
980
990
1000

In [5]:

x[4] # square brackets for indexing
x[c(1, 5, 7)]           # in Octave: x([1, 5, 7])
x[-4]    # all except the fourth element
x[c(-4, -5)] # all except the 4th and 5th elements

x[x > 600] # all elements that are greater than 600 (in Octave: x(x > 600))

x > 600 # a logical vector; x[x > 600] is logical indexing: it keeps only the elements whose index is TRUE

Out[5]:

40

Out[5]:

10
50
70

Out[5]:

10
20
30
50
60
70
80
90
100
110
120
130
140
150
160
170
180
190
200
210
220
230
240
250
260
270
280
290
300
310
320
330
340
350
360
370
380
390
400
410
420
430
440
450
460
470
480
490
500
510
520
530
540
550
560
570
580
590
600
610
620
630
640
650
660
670
680
690
700
710
720
730
740
750
760
770
780
790
800
810
820
830
840
850
860
870
880
890
900
910
920
930
940
950
960
970
980
990
1000

Out[5]:

10
20
30
60
70
80
90
100
110
120
130
140
150
160
170
180
190
200
210
220
230
240
250
260
270
280
290
300
310
320
330
340
350
360
370
380
390
400
410
420
430
440
450
460
470
480
490
500
510
520
530
540
550
560
570
580
590
600
610
620
630
640
650
660
670
680
690
700
710
720
730
740
750
760
770
780
790
800
810
820
830
840
850
860
870
880
890
900
910
920
930
940
950
960
970
980
990
1000

Out[5]:

610
620
630
640
650
660
670
680
690
700
710
720
730
740
750
760
770
780
790
800
810
820
830
840
850
860
870
880
890
900
910
920
930
940
950
960
970
980
990
1000

Out[5]:

FALSE
FALSE
FALSE
FALSE
FALSE
FALSE
FALSE
FALSE
FALSE
FALSE
FALSE
FALSE
FALSE
FALSE
FALSE
FALSE
FALSE
FALSE
FALSE
FALSE
FALSE
FALSE
FALSE
FALSE
FALSE
FALSE
FALSE
FALSE
FALSE
FALSE
FALSE
FALSE
FALSE
FALSE
FALSE
FALSE
FALSE
FALSE
FALSE
FALSE
FALSE
FALSE
FALSE
FALSE
FALSE
FALSE
FALSE
FALSE
FALSE
FALSE
FALSE
FALSE
FALSE
FALSE
FALSE
FALSE
FALSE
FALSE
FALSE
FALSE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE

In [6]:

1 == 2
30 == 30
c(10, 20, 30, 40) == 20
c(10, 20, 30, 40) > 20

Out[6]:

FALSE

Out[6]:

TRUE

Out[6]:

FALSE
TRUE
FALSE
FALSE

Out[6]:

FALSE
FALSE
TRUE
TRUE
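
Since comparisons return logical vectors, they can be combined with & (and) or | (or) before being used as an index. The sketch below reuses the vector x from the indexing example; the names big, round_hundreds, selected, n_big and pos are only illustrative:

```r
x <- seq(10, 1000, by = 10)

big <- x > 600                    # TRUE where the element exceeds 600
round_hundreds <- x %% 100 == 0   # TRUE for 100, 200, ..., 1000

selected <- x[big & round_hundreds]   # both conditions must hold: 700 800 900 1000
n_big <- sum(big)                     # TRUE counts as 1, so this counts the matches: 40
pos <- which(x == 50)                 # position of the matching element: 5
```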

Factors

Imagine that we want to store the sex of several people: Male or Female. We could try a plain vector:

In [7]:

s <- c(TRUE, FALSE, TRUE, FALSE, FALSE, FALSE) # not obvious who is Male and who is Female
s <- c("Male", "Female", "Male", "Female") # 1) we can make a typo 2) needs a lot of memory
s <- c(1, 2, 1, 2, 2, 2) # you have to remember which number means what

That's why we need factors. They store variables that have a finite number of possible values (Male, Female or Bad, Medium, Good). We use the function factor() to create them:

In [8]:

sexes <- factor(c("Male", "Female", "Male", "Male")) # we create a factor from a character vector
sexes

# we can specify levels explicitly
states <- factor(c("Good", "Medium", "Good"), levels=c("Good", "Medium", "Bad"))
states

Out[8]:

Male
Female
Male
Male

Levels:

'Female'
'Male'

Out[8]:

Good
Medium
Good

Levels:

'Good'
'Medium'
'Bad'

How are factors stored? A factor is just an integer vector (for example 1, 2, 1, 1, 2) with additional metainformation attached. The metainformation describes the levels, that is, what each number means.

Factor = integer vector + metainformation. When we print a factor, we don't see the numbers; we see the levels they stand for.
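
The two parts of a factor can be inspected separately. A minimal sketch using the sexes factor from above (codes, lv and restored are illustrative names):

```r
sexes <- factor(c("Male", "Female", "Male", "Male"))

codes <- as.integer(sexes)   # the stored numbers: 2 1 2 2
lv <- levels(sexes)          # the metainformation: "Female" "Male" (sorted alphabetically)
restored <- lv[codes]        # substituting levels for numbers gives back the strings
```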

Factors can be ordered. For the state of a patient we know that Good is better than Medium, and Medium is better than Bad, so we can declare that this factor is ordered. Later we will be able to sort and compare based on this factor:

In [9]:

states <- factor(c("Good", "Medium", "Good"), levels=c("Good", "Medium", "Bad"), ordered=TRUE) # levels are listed from lowest to highest, so here Good < Medium < Bad
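
A short sketch of what ordering makes possible. It assumes the levels are listed as Bad < Medium < Good (lowest first), which is a choice made for this example; the values themselves are made up for illustration:

```r
states <- factor(c("Good", "Medium", "Bad", "Good"),
                 levels = c("Bad", "Medium", "Good"), ordered = TRUE)

worse <- states < "Good"   # comparisons use the level order: FALSE TRUE TRUE FALSE
sorted <- sort(states)     # Bad Medium Good Good
best <- max(states)        # Good
```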

Simple statistics

In [10]:

x <- c(1, 2, 5, 3, 5, 7, 8, 4, 3, 1, 4, 4, 5, 6, 7, 8)
summary(x)

Out[10]:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   3.000   4.500   4.562   6.250   8.000

summary shows information about a vector (or any other data given to it): the minimum, the maximum, the mean, the median, and the 1st and 3rd quartiles.

(Median = 2nd quartile)
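
The same numbers can be computed one by one, which may help to see what each column of summary means. A sketch on the vector from the example (q, m, avg and r are illustrative names):

```r
x <- c(1, 2, 5, 3, 5, 7, 8, 4, 3, 1, 4, 4, 5, 6, 7, 8)

q <- quantile(x, probs = c(0.25, 0.5, 0.75))   # quartiles: 3.00 4.50 6.25
m <- median(x)                                 # 4.5, the same as the 2nd quartile
avg <- mean(x)                                 # 4.5625
r <- range(x)                                  # minimum and maximum: 1 8
```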

In [11]:

x <- c("a", "b", "a", "a", "c", "b")
summary(x)

Out[11]:

   Length     Class      Mode 
        6 character character

This only tells us that x is a vector of characters (strings) of length 6.

In [12]:

summary(sexes)

Out[12]:

Female: 1
Male: 3

summary is a generic function: it works differently for different types of arguments. Another function that adapts to its arguments is table:

In [13]:

sexes <- factor(c("Male", "Female", "Male", "Male", "Female"))
states <- factor(c("Good", "Good", "Bad", "Medium", "Good"), levels=c("Bad", "Medium", "Good"), ordered=T)
table(sexes, states)

Out[13]:

        states
sexes    Bad Medium Good
  Female   0      0    2
  Male     1      1    1

This works for vectors of the same length: table counts how many times each combination of values occurs. It even works with 3 vectors (you may try it).
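
A sketch of how the result of table can be used, including the three-vector case. The third vector group is hypothetical and exists only for this illustration:

```r
sexes <- factor(c("Male", "Female", "Male", "Male", "Female"))
states <- factor(c("Good", "Good", "Bad", "Medium", "Good"),
                 levels = c("Bad", "Medium", "Good"), ordered = TRUE)
group <- c("a", "b", "a", "b", "a")   # hypothetical third variable

counts <- table(sexes, states)
counts["Female", "Good"]   # a single cell can be read by row and column name: 2

single <- table(sexes)               # with one vector: counts per level (Female 2, Male 3)
three <- table(sexes, states, group) # with three vectors: a 2 x 3 x 2 array of counts
dim(three)
```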

A little more about metainformation

In [4]:

x <- c(10, 20, 30)
x

Out[4]:

10
20
30

x holds the data 10, 20, 30. But we can also give names to its elements:

In [2]:

names(x) <- c("first", "second", "third")
x

Out[2]:

first: 10
second: 20
third: 30

Metainformation controls the way data is printed and manipulated

In [3]:

x == 10 # metainformation is not discarded
x[1]
x["first"]

Out[3]:

first: TRUE
second: FALSE
third: FALSE

Out[3]:

first: 10

Out[3]:

first: 10
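
The names are stored as an attribute next to the data, and they can be inspected or removed. A minimal sketch (by_name and plain are illustrative names):

```r
x <- c(10, 20, 30)
names(x) <- c("first", "second", "third")

attributes(x)                       # a list with one entry, $names
by_name <- x[c("third", "first")]   # names can be used for indexing: 30 10
plain <- unname(x)                  # the same data without metainformation: 10 20 30
```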

Statistical distributions

A distribution is a law that describes how random numbers are generated. We cannot just ask a computer to return a random number: we also have to say which distribution it should come from.

A uniform distribution. The numbers are drawn from a segment $[a, b]$, and every point of the segment is equally likely.

In [6]:

runif(10) # 10 numbers from 0 to 1; r = random generation, unif = uniform

Out[6]:

0.973763826303184
0.876397028332576
0.648635848425329
0.702227446017787
0.910166483139619
0.259626370621845
0.422927958192304
0.932122326688841
0.430529864504933
0.183352742344141

In [8]:

# help(runif)
runif(10, min=-100, max=100) # from -100 to 100

Out[8]:

71.1554687470198
81.2086193356663
-79.4797552749515
90.4260812792927
81.8975884933025
-71.0195608902723
71.3210453279316
46.8834490515292
-63.6308727785945
72.136812005192
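
Every call to runif returns different numbers. If a result has to be reproducible, the generator can be seeded with set.seed() first. A sketch, where the seed value 42 is arbitrary and a, b are illustrative names:

```r
set.seed(42)                          # fix the state of the random generator
a <- runif(5, min = -100, max = 100)

set.seed(42)                          # the same seed gives the same numbers again
b <- runif(5, min = -100, max = 100)

identical(a, b)             # TRUE
all(a >= -100 & a <= 100)   # TRUE: all values lie inside the segment
```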

In [9]:

# normal distribution: norm
rnorm(10) # defaults: mean=0, sd=1 (so the variance is also 1)

Out[9]:

-1.25715949752676
0.800453385547091
0.562525167211137
0.259310133390342
-0.757579549784462
-1.07411566059952
-0.754656141318137
-0.736191117196667
-1.07123670248022
2.21039207303123

In [10]:

plot(rnorm(1000))

Out[10]:

Functions that start with 'd' instead of 'r' return the density function of the distribution rather than random numbers:

In [13]:

# plot the densities on a grid, the same way we did it in Octave
x <- seq(-5, 5, by=0.01)
y <- dnorm(x)
plot(x, y)
y <- dunif(x)
plot(x, y)

Out[13]:
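
A few values of the densities can also be checked numerically. This is only a sketch; the step 0.01 matches the grid used for the plots, and area is an illustrative name:

```r
dnorm(0)               # height of the standard normal density at 0, equal to 1/sqrt(2*pi)
dunif(c(-1, 0.5, 2))   # uniform density on [0, 1]: 0 outside the segment, 1 inside

# A density integrates to 1; approximate the integral by a sum over the grid
x <- seq(-5, 5, by = 0.01)
area <- sum(dnorm(x)) * 0.01
area                   # very close to 1
```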

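A binomial distribution. It counts the number of successes in size independent trials, each of which succeeds with probability prob.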

In [23]:

#help(rbinom)
rbinom(20, size=3, prob=1/10) # 20 experiments: in each we throw a biased coin three times, 1/10 is the probability of a head, and the result is the number of heads

rbinom(20, size=1, prob=1/2) # just one throw of a fair coin, so we generate either 0 or 1, each with probability 1/2

Out[23]:

0
1
0
0
1
0
0
1
0
0
0
0
0
2
0
0
0
2
0
1

Out[23]:

1
1
0
0
0
1
1
1
0
0
1
1
1
1
1
0
0
1
1
1
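
The d-function of the binomial distribution gives the exact probabilities behind these samples. A sketch for three throws with probability 1/10 of a head (p_heads and draws are illustrative names, and the seed is arbitrary):

```r
# Probability of getting 0, 1, 2 or 3 heads in three throws
p_heads <- dbinom(0:3, size = 3, prob = 1/10)
p_heads        # 0.729 0.243 0.027 0.001
sum(p_heads)   # the probabilities add up to 1

# The average of many samples should be close to size * prob = 0.3
set.seed(1)
draws <- rbinom(10000, size = 3, prob = 1/10)
mean(draws)
```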

Data frames

CSV file format

Let's use the function read.csv() to read a file in csv format:

In [28]:

#help(read.csv)
cats <- read.csv("data.csv")
cats
summary(cats) # makes a summary for each column

Out[28]:

name	age	sex	weight
Tom	5	Male	10
Bob	10	Male	12
Kitty	12	Female	12
Lucy	4	Female	7
Any,thing	7	Female	8

Out[28]:

        name        age           sex        weight    
 Any,thing:1   Min.   : 4.0   Female:3   Min.   : 7.0  
 Bob      :1   1st Qu.: 5.0   Male  :2   1st Qu.: 8.0  
 Kitty    :1   Median : 7.0              Median :10.0  
 Lucy     :1   Mean   : 7.6              Mean   : 9.8  
 Tom      :1   3rd Qu.:10.0              3rd Qu.:12.0  
               Max.   :12.0              Max.   :12.0

In R versions before 4.0, string columns are loaded as factors by default (since R 4.0 the default is stringsAsFactors=FALSE). So both "name" and "sex" are factors here. This is not good for the name. Let's load cats again:

In [37]:

cats <- read.csv('data.csv', stringsAsFactors=FALSE) # keep strings as character vectors
cats$sex <- factor(cats$sex)  # the $ operator extracts (or replaces) a column
cats
summary(cats)

Out[37]:

name	age	sex	weight
Tom	5	Male	10
Bob	10	Male	12
Kitty	12	Female	12
Lucy	4	Female	7
Any,thing	7	Female	8

Out[37]:

     name                age           sex        weight    
 Length:5           Min.   : 4.0   Female:3   Min.   : 7.0  
 Class :character   1st Qu.: 5.0   Male  :2   1st Qu.: 8.0  
 Mode  :character   Median : 7.0              Median :10.0  
                    Mean   : 7.6              Mean   : 9.8  
                    3rd Qu.:10.0              3rd Qu.:12.0  
                    Max.   :12.0              Max.   :12.0
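
The file data.csv itself is not shown in these notes. The sketch below assumes its contents based on the printed table and passes them to read.csv through the text argument, so it can be run without the file. Note that the name containing a comma has to be quoted in the CSV text:

```r
# Assumed contents of data.csv, reconstructed from the printed table
csv_text <- 'name,age,sex,weight
Tom,5,Male,10
Bob,10,Male,12
Kitty,12,Female,12
Lucy,4,Female,7
"Any,thing",7,Female,8'

cats <- read.csv(text = csv_text, stringsAsFactors = FALSE)
cats$sex <- factor(cats$sex)   # convert only this column to a factor

str(cats)   # name stays character, sex is a factor with 2 levels
```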

Indexing data frames

Columns are vectors, so we can process them as discussed previously.

In [40]:

cats$name # extract a column
cats[["name"]]  # the same column; [[ ]] also works when the column name is not a valid identifier

cats[["col name with space"]] <- 1:5 # with $ this name would need backquotes: cats$`col name with space`
cats

Out[40]:

'Tom'
'Bob'
'Kitty'
'Lucy'
'Any,thing'

Out[40]:

'Tom'
'Bob'
'Kitty'
'Lucy'
'Any,thing'

Out[40]:

name	age	sex	weight	col name with space
Tom	5	Male	10	1
Bob	10	Male	12	2
Kitty	12	Female	12	3
Lucy	4	Female	7	4
Any,thing	7	Female	8	5

Use [rows,cols] to specify rows and columns:

In [45]:

cats[2, "name"] # Bob
cats[c(2, 4), c("name", "age")] # 2nd and 4th rows, columns name and age
cats[c(2, 4),] # all columns if none are specified

cats$weight # this is the weight column
cats$weight >= 10 # this is a TRUE/FALSE vector
cats[cats$weight >= 10, ] # keep only the rows where the condition is TRUE

# and so on; see other examples in the cheat sheet.

Out[45]:

'Bob'

Out[45]:

	name	age
2	Bob	10
4	Lucy	4

Out[45]:

	name	age	sex	weight	col name with space
2	Bob	10	Male	12	2
4	Lucy	4	Female	7	4

Out[45]:

10
12
12
7
8

Out[45]:

TRUE
TRUE
TRUE
FALSE
FALSE

Out[45]:

name	age	sex	weight	col name with space
Tom	5	Male	10	1
Bob	10	Male	12	2
Kitty	12	Female	12	3
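
Row conditions can be combined in the same way as for vectors. The sketch below rebuilds the cats table with data.frame() so that it runs without the CSV file; heavy_females, by_age and male_mean are illustrative names:

```r
cats <- data.frame(
  name   = c("Tom", "Bob", "Kitty", "Lucy", "Any,thing"),
  age    = c(5, 10, 12, 4, 7),
  sex    = factor(c("Male", "Male", "Female", "Female", "Female")),
  weight = c(10, 12, 12, 7, 8),
  stringsAsFactors = FALSE
)

# Rows where both conditions hold, and only two of the columns
heavy_females <- cats[cats$weight >= 8 & cats$sex == "Female", c("name", "weight")]

# order() returns row positions sorted by age; use them as a row index
by_age <- cats[order(cats$age), "name"]

# A column is a vector, so it can be filtered and summarised directly
male_mean <- mean(cats$weight[cats$sex == "Male"])
```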

Plotting in R

There are many libraries to make plots in R:

Built-in graphics: simple to use for simple cases
ggplot2: very powerful and widely used, but not so easy to learn.
....

We will only look briefly at the built-in graphics.

In [51]:

plot(c(1, 2, 3), c(20, 10, 30), type="l") # l is for lines
plot(c(1, 2, 3), c(20, 10, 30), type="p") # p is for points, other values are in help(plot)

Out[51]:

In [64]:

par(col="blue", bty="]") # graphical parameters are set globally, so all later plots will be blue; bty="]" draws the box without its left side
plot(1:5, 1:5, "l")
#see other parameters in help(par)
#help(par)

Out[64]:

How do we deal with global parameters? We should save them before modification and restore them afterwards:

In [66]:

opar <- par(no.readonly=TRUE) # save the parameters to a variable opar; no.readonly=TRUE returns only the parameters that can be modified
opar

par(col="red")
plot(1:2)

par(opar) # restores everything, so col is 'black' again
plot(1:2)

Out[66]:

$xlog

FALSE

$ylog

FALSE

$adj

0.5

$ann

TRUE

$ask

FALSE

$bg

'white'

$bty

'o'

$cex

1

$cex.axis

1

$cex.lab

1

$cex.main

1.2

$cex.sub

1

$col

'black'

$col.axis

'black'

$col.lab

'black'

$col.main

'black'

$col.sub

'black'

$crt

0

$err

0

$family

''

$fg

'black'

$fig

0
1
0
1

$fin

6.66666666666667
6.66666666666667

$font

1

$font.axis

1

$font.lab

1

$font.main

2

$font.sub

1

$lab

5
5
7

$las

0

$lend

'round'

$lheight

1

$ljoin

'round'

$lmitre

10

$lty

'solid'

$lwd

1

$mai

1.02
0.82
0.82
0.42

$mar

5.1
4.1
4.1
2.1

$mex

1

$mfcol

1
1

$mfg

1
1
1
1

$mfrow

1
1

$mgp

3
1
0

$mkh

0.001

$new

FALSE

$oma

0
0
0
0

$omd

0
1
0
1

$omi

0
0
0
0

$pch

1

$pin

5.42666666666667
4.82666666666667

$plt

0.123
0.937
0.153
0.877

$ps

12

$pty

'm'

$smo

1

$srt

0

$tck

<NA>

$tcl

-0.5

$usr

0
1
0
1

$xaxp

0
1
5

$xaxs

'r'

$xaxt

's'

$xpd

FALSE

$yaxp

0
1
5

$yaxs

'r'

$yaxt

's'

$ylbias

0.2

Out[66]:
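
When only a few parameters are changed, there is a shorter variant: a call to par() that sets parameters returns their previous values, so it is enough to keep that result. A sketch, in which pdf(NULL) is used only so that the example runs without opening a window (old, col_during and col_after are illustrative names):

```r
pdf(NULL)   # a device that draws nowhere; in an interactive session this line is not needed

old <- par(col = "red", lwd = 2)   # set new values and keep the previous ones
col_during <- par("col")           # "red"
plot(1:2)

par(old)                           # restore only what was changed
col_after <- par("col")            # "black" again

invisible(dev.off())               # close the device
```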
