Logistic Regression with daru and statsample-glm

This notebook uses a few examples to show how the probability of a given outcome can be predicted with logistic regression using daru and statsample-glm.

In [1]:
require 'daru'
require 'statsample-glm'
require 'open-uri'

Out[1]:
true

For this notebook we will use a dataset that records whether students were admitted to a graduate degree program, together with their GRE scores, GPA and the rank of the institute where they did their undergraduate degree (ranked from 1 to 4).

Note that statsample-glm does not yet support categorical data, so the ranks will be treated as continuous.
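
For readers who would rather not treat rank as a continuous value, one common workaround is to expand it by hand into 0/1 indicator columns. The sketch below uses plain Ruby arrays; `dummy_code` is a hypothetical helper (it is not part of daru or statsample-glm), and the six rank values are the first six rows of the dataset shown in Out[2].

```ruby
# Hypothetical helper, not part of daru or statsample-glm: expand a
# categorical column into 0/1 indicator columns. The reference level gets
# no column of its own, so it acts as the baseline.
def dummy_code(values, prefix:, reference:)
  levels = values.uniq.sort - [reference]
  levels.each_with_object({}) do |level, columns|
    columns[:"#{prefix}_#{level}"] = values.map { |v| v == level ? 1 : 0 }
  end
end

ranks = [3, 3, 1, 4, 4, 2] # rank values of rows 0 to 5 in Out[2]
indicators = dummy_code(ranks, prefix: "rank", reference: 1)

indicators.each { |name, column| puts "#{name}: #{column.inspect}" }
```

Each resulting array could then be added to the DataFrame as an ordinary numeric column in place of rank. The rest of this notebook keeps rank as a single continuous predictor.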

In [2]:
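# Download the CSV and save it locally so that from_csv can read it.
# (On Ruby versions before 2.5, use Kernel#open instead of URI.open.)
content = URI.open('http://www.ats.ucla.edu/stat/data/binary.csv')
File.write('binary.csv', content.read)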

df = Daru::DataFrame.from_csv "binary.csv"
df.vectors = Daru::Index.new([:admit, :gpa, :gre, :rank])
df

Out[2]:
Daru::DataFrame:27633020 rows: 400 cols: 4
       admit    gpa    gre   rank
  0        0   3.61    380      3
  1        1   3.67    660      3
  2        1      4    800      1
  3        1   3.19    640      4
  4        0   2.93    520      4
  5        1      3    760      2
  6        1   2.98    560      1
  7        0   3.08    400      2
  8        1   3.39    540      3
  9        0   3.92    700      2
 10        0      4    800      4
 11        0   3.22    440      1
 12        1      4    760      1
 13        0   3.08    700      2
 14        1      4    700      1
 15        0   3.44    480      3
 16        0   3.87    780      4
 17        0   2.56    360      3
 18        0   3.75    800      2
 19        1   3.81    540      1
 20        0   3.17    500      3
 21        1   3.63    660      2
 22        0   2.82    600      4
 23        0   3.19    680      4
 24        1   3.35    760      2
 25        1   3.66    800      1
 26        1   3.61    620      1
 27        1   3.74    520      4
 28        1   3.22    780      2
 29        0   3.29    520      1
 30        0   3.78    540      4
 31        0   3.35    760      3
...      ...    ...    ...    ...
399        0   3.89    600      3

Use the Statsample::GLM.compute method for logistic regression analysis.

The first argument to compute is the DataFrame object, followed by the name of the Vector that is to be the dependent variable, and then the method that determines the link function. This can be :logistic, :probit, :poisson or :normal.

The coefficients method returns the coefficients of the fitted GLM. Passing :hash returns them as a Hash.

In [3]:
glm = Statsample::GLM.compute df, :admit, :logistic, constant: 1
c = glm.coefficients :hash

Out[3]:
{:gpa=>0.777013573719857, :gre=>0.0022939595044433273, :rank=>-0.5600313868499897, :constant=>-3.4495483976684773}

The logistic regression coefficients give the change in the log odds of the outcome for a one unit increase in the predictor variable.

Therefore, to interpret each of the above coefficients:

• For every one-unit increase in gre, the log odds of admission (versus non-admission) increase by 0.002.
• For every one-unit increase in gpa, the log odds of being admitted to graduate school increase by 0.777.
• For every one-unit increase in the rank number of the institute (that is, a decrease in the quality of the institute), the log odds of being admitted to graduate school decrease by 0.56.
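
To make the "per one-unit increase" reading concrete, here is a small sketch that computes the log odds for two applicants who differ only by one GPA point, using the coefficients from Out[3]. The helper `log_odds` and the applicant values (GRE 600, rank 2, GPA 3.0 versus 4.0) are illustrative assumptions, not part of statsample-glm.

```ruby
# Coefficients copied from Out[3].
coeffs = {
  gpa: 0.777013573719857,
  gre: 0.0022939595044433273,
  rank: -0.5600313868499897,
  constant: -3.4495483976684773
}

# Hypothetical helper: the linear predictor, which is the log odds of admission.
def log_odds(coeffs, gre:, gpa:, rank:)
  coeffs[:constant] + coeffs[:gre] * gre + coeffs[:gpa] * gpa + coeffs[:rank] * rank
end

base         = log_odds(coeffs, gre: 600, gpa: 3.0, rank: 2)
plus_one_gpa = log_odds(coeffs, gre: 600, gpa: 4.0, rank: 2)

# The difference equals the gpa coefficient (up to floating-point rounding),
# regardless of the gre and rank values chosen above.
difference = plus_one_gpa - base
puts "Change in log odds for +1 GPA: #{difference.round(3)}"
```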

Log odds are a little difficult to interpret, so we'll exponentiate each of the coefficients so that each one can be interpreted as an odds ratio.

In [4]:
Daru::Vector.new(c).exp # Calling #exp on Daru::Vector exponentiates each element of the Vector.

Out[4]:
Daru::Vector:17552980 size: 4
                           nil
gpa          2.174967177712439
gre         1.0022965926425997
rank         0.571191135676971
constant   0.03175997601913591
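
An odds ratio can also be read as a percent change in the odds per one-unit increase in the predictor. The sketch below does that conversion in plain Ruby with the coefficients from Out[3]; the variable names are illustrative only.

```ruby
# Predictor coefficients copied from Out[3] (the constant is left out because
# it is not a per-unit effect).
coeffs = {
  gpa: 0.777013573719857,
  gre: 0.0022939595044433273,
  rank: -0.5600313868499897
}

# (odds ratio - 1) * 100 gives the percent change in the odds of admission
# for a one-unit increase in the predictor, holding the others fixed.
percent_change = coeffs.transform_values { |b| (Math.exp(b) - 1.0) * 100.0 }

percent_change.each do |name, pct|
  puts format("%-5s %+.1f%% change in odds per unit increase", name, pct)
end
```

Read this way, each additional GPA point a little more than doubles the odds of admission, while each step down in institute rank cuts the odds by roughly 43%.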

We can now compute the probability of gaining admission into a graduate college based on the rank of the undergraduate college, by keeping the GRE score and GPA constant.

As you can see in the result below, the rankp Vector shows the probability of admission based on the rank. The person from the most highly rated undergrad school (rank 1) has a probability of 0.49 of getting admitted into graduate school.

In [5]:
e = Math::E
new_data = Daru::DataFrame.new({
  gre: [df[:gre].mean]*4,
  gpa: [df[:gpa].mean]*4,
  rank: df[:rank].factors
})

new_data[:rankp] = new_data.collect(:row) do |x|
  1 / (1 + e ** -(c[:constant] + x[:gre] * c[:gre] + x[:gpa] * c[:gpa] + x[:rank] * c[:rank]))
end

new_data.sort! [:rank]

Out[5]:
Daru::DataFrame:16947240 rows: 4 cols: 4
                   gpa     gre   rank                rankp
1   3.3899000000000017   587.7      1   0.4931450619837156
3   3.3899000000000017   587.7      2    0.357219500353945
0   3.3899000000000017   587.7      3    0.240948896129993
2   3.3899000000000017   587.7      4   0.1534862275970381
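
The same calculation can be reproduced without daru, which makes the formula behind rankp easier to see. This is a minimal sketch using the coefficients from Out[3] and the mean GRE and GPA shown in Out[5]; `logistic` and `admission_probability` are hypothetical helper names.

```ruby
# Coefficients copied from Out[3].
coeffs = {
  gpa: 0.777013573719857,
  gre: 0.0022939595044433273,
  rank: -0.5600313868499897,
  constant: -3.4495483976684773
}

# Hypothetical helper: the logistic function maps log odds to a probability.
def logistic(z)
  1.0 / (1.0 + Math.exp(-z))
end

# Hypothetical helper: probability of admission for one applicant.
def admission_probability(coeffs, gre:, gpa:, rank:)
  logistic(coeffs[:constant] + coeffs[:gre] * gre + coeffs[:gpa] * gpa + coeffs[:rank] * rank)
end

# Hold GRE and GPA at their means (taken from Out[5]) and vary only the rank.
probabilities = (1..4).map do |rank|
  admission_probability(coeffs, gre: 587.7, gpa: 3.3899, rank: rank)
end

probabilities.each_with_index do |prob, i|
  puts "rank #{i + 1}: #{prob.round(4)}"
end
```

The values should agree with the rankp column above, falling steadily as the rank number increases.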

To demonstrate with another example, let's create a hypothetical dataset consisting of the body weight of 20 people and whether or not they survived.

For this example we will simply assume that people with a lower body weight have a lower chance of survival.

In [6]:
require 'distribution'

# Create a normally distributed Vector with mean 30 and standard deviation 2
rng = Distribution::Normal.rng(30,2)
body_weight = Daru::Vector.new(20.times.map { rng.call }.sort)

# Populate chances of survival, assume that people with less body weight on average
# are less likely to survive.
survive = Daru::Vector.new [0,0,0,0,0,1,0,1,0,0,1,1,0,1,1,1,0,1,1,1]

df = Daru::DataFrame.new({
  body_weight: body_weight,
  survive: survive
})

Out[6]:
Daru::DataFrame:27686700 rows: 20 cols: 2
             body_weight   survive
 0    27.044364650712417         0
 1     27.47159312927552         0
 2    27.898166592007826         0
 3    28.124290182202703         0
 4    28.130559736750975         0
 5     28.75865262632871         1
 6    28.893891170932104         0
 7    29.379537173488142         1
 8    29.387455746614265         0
 9     29.73011654672403         0
10    29.732865590281754         1
11    29.804600640385086         1
12    30.854286396908076         0
13     31.10670541344917         1
14    31.466802603305748         1
15    31.520641425410044         1
16    31.933197567214524         0
17     32.11397962791281         1
18    32.760606649719776         1
19     34.33739385108647         1

Compute the logistic regression coefficients.

In [7]:
glm    = Statsample::GLM.compute df, :survive, :logistic, constant: 1
coeffs = glm.coefficients :hash

Out[7]:
{:body_weight=>0.8433486251123171, :constant=>-25.24920458377614}
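
To give a feel for where such coefficients come from, the sketch below fits the same one-predictor model in plain Ruby with Newton-Raphson, a standard way of maximizing the logistic log-likelihood. It is only an illustration under that assumption: it is not statsample-glm's code, and `sigmoid` and `fit_logistic` are hypothetical names. The data are copied from Out[6].

```ruby
# Data copied from Out[6].
body_weight = [
  27.044364650712417, 27.47159312927552, 27.898166592007826, 28.124290182202703,
  28.130559736750975, 28.75865262632871, 28.893891170932104, 29.379537173488142,
  29.387455746614265, 29.73011654672403, 29.732865590281754, 29.804600640385086,
  30.854286396908076, 31.10670541344917, 31.466802603305748, 31.520641425410044,
  31.933197567214524, 32.11397962791281, 32.760606649719776, 34.33739385108647
]
survive = [0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1]

def sigmoid(z)
  1.0 / (1.0 + Math.exp(-z))
end

# Hypothetical helper: returns [intercept, slope] for a logistic regression
# with a single predictor, found by Newton-Raphson starting from zero.
def fit_logistic(xs, ys, iterations: 25, tolerance: 1e-8)
  b0 = 0.0
  b1 = 0.0
  iterations.times do
    g0 = g1 = h00 = h01 = h11 = 0.0
    xs.zip(ys) do |x, y|
      prob = sigmoid(b0 + b1 * x)
      weight = prob * (1.0 - prob)
      g0  += y - prob          # gradient with respect to the intercept
      g1  += (y - prob) * x    # gradient with respect to the slope
      h00 += weight            # entries of the 2x2 information matrix
      h01 += weight * x
      h11 += weight * x * x
    end
    # Solve the 2x2 system (information matrix) * step = gradient.
    det = h00 * h11 - h01 * h01
    d0 = (h11 * g0 - h01 * g1) / det
    d1 = (h00 * g1 - h01 * g0) / det
    b0 += d0
    b1 += d1
    break if d0.abs < tolerance && d1.abs < tolerance
  end
  [b0, b1]
end

intercept, slope = fit_logistic(body_weight, survive)
puts "constant: #{intercept}, body_weight: #{slope}"
```

If the fit in Out[7] is the maximum-likelihood solution, the two numbers printed here should agree closely with it.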

Based on the coefficients, we compute the predicted probabilities for each number in the Vector :body_weight and store them in another Vector called :survive_pred.

In [8]:
e = Math::E
df[:survive_pred] = df[:body_weight].map { |x| 1 / (1 + e ** -(coeffs[:constant] + x*coeffs[:body_weight])) }
df

Out[8]:
Daru::DataFrame:27686700 rows: 20 cols: 3
             body_weight   survive           survive_pred
 0    27.044364650712417         0    0.08007143558819431
 1     27.47159312927552         0    0.11094995452363857
 2    27.898166592007826         0    0.15170068399992506
 3    28.124290182202703         0    0.17790253325703076
 4    28.130559736750975         0     0.1786771529208482
 5     28.75865262632871         1    0.26980060957631496
 6    28.893891170932104         0     0.2928502245475736
 7    29.379537173488142         1    0.38414006941637974
 8    29.387455746614265         0     0.3857211724501716
 9     29.73011654672403         0      0.456025989208083
10    29.732865590281754         1     0.4566011649897577
11    29.804600640385086         1     0.4716465476624143
12    30.854286396908076         0     0.6838918583579029
13     31.10670541344917         1     0.7280185490554567
14    31.466802603305748         1     0.7838559408058121
15    31.520641425410044         1     0.7914495278564925
16    31.933197567214524         0      0.843118090723654
17     32.11397962791281         1     0.8622465766953867
18    32.760606649719776         1     0.9152435218371247
19     34.33739385108647         1     0.9760883965278441

The above results can then be plotted using the plot function.

The resulting curve looks like an ideal logistic regression curve.

In [9]:
df.plot type: [:scatter,:line], x: [:body_weight]*2, y: [:survive_pred]*2 do |plot, diagram|
  plot.x_label "Body Weight"
  plot.y_label "Probability of Survival"
end
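
One way to read the S-shaped curve is to find its midpoint, the body weight at which the predicted probability of survival is exactly 0.5. That happens where the log odds are zero, so the midpoint is the negated constant divided by the body_weight coefficient. A short sketch with the coefficients from Out[7] (variable names are illustrative):

```ruby
# Coefficients copied from Out[7].
coeffs = { body_weight: 0.8433486251123171, constant: -25.24920458377614 }

# The log odds are zero (probability 0.5) where constant + slope * x = 0.
midpoint = -coeffs[:constant] / coeffs[:body_weight]

# Sanity check: plug the midpoint back into the logistic function.
probability_at_midpoint =
  1.0 / (1.0 + Math.exp(-(coeffs[:constant] + coeffs[:body_weight] * midpoint)))

puts format("Predicted survival probability is 0.5 at a body weight of about %.2f", midpoint)
```

This is consistent with Out[8], where the predicted probability crosses 0.5 between rows 11 and 12.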