In the real world, we often encounter scenarios that require decisions with a finite set of possible outcomes, such as:
Will it rain today?
Will I reach office on time today?
Would a child graduate from his/her university?
Does sedentary lifestyle increase the chances to get the heart disease?
Does smoking lead to lung cancer?
Would I wear blue, black, red outfit today?
What grade a student would get in an exam?
All the above situations reflect input-output relationships. Here the output variable takes discrete, finite values, rather than the continuous values modeled by Linear Regression. How can we model and analyze such data?
We could try to frame a rule that helps us infer the outcome from the input variables. This is called a classification problem, an important topic in statistics and machine learning. Classification, the task of assigning objects to one of several predefined categories, is a pervasive problem encompassing many diverse applications across a broad array of domains. Some examples of classification tasks are listed below:
In the medical field, the classification task could be assigning a diagnosis to a patient based on observed characteristics such as age, gender, blood pressure, body mass index, and the presence or absence of certain symptoms.
In the banking sector, one may want to categorize hundreds or thousands of new card applications, each described by attributes such as annual salary, outstanding debt, and age, into applicants with good or bad credit, enabling the credit card company to make an informed decision; or one might want to predict whether a particular credit card charge is legitimate or fraudulent.
In the social sciences, we may be interested in predicting a voter's preference for a party based on age, income, sex, race, state of residence, votes in previous elections, and so on.
In the finance sector, one may need to ascertain whether a vendor is creditworthy.
In the insurance domain, the company needs to assess whether a submitted claim is fraudulent or genuine.
In marketing, the marketer would like to figure out which segment of consumers is likely to buy.
In the business world, classification problems, where the response (dependent) variable has discrete, finite outcomes, are more prevalent than regression problems, where the response variable is continuous. Logistic Regression is one of the most common algorithms used for modeling classification problems.
Step 1: Load the data and run numerical and graphical summaries
Step 2: Split the data into training data and test data
Step 3: Fit a model using the training data
Step 4: Use the fitted model to make predictions for the test data
Step 5: Create a confusion matrix, and compute the misclassification rate
**Install and load required packages**
ipak <- function(pkg){
  new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
  if (length(new.pkg))
    install.packages(new.pkg, repos = "http://cran.us.r-project.org", dependencies = TRUE)
  sapply(pkg, require, character.only = TRUE)
}
# usage
packages <- c("DMwR", "caret", "party", "partykit", "e1071")
ipak(packages)
Sigmoid Function
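Logistic regression maps a linear combination of the predictors through the sigmoid (logistic) function, sigmoid(z) = 1 / (1 + e^-z), which squashes any real value into (0, 1) so it can be read as a probability. A minimal sketch of the function and its S-shaped curve (the function name `sigmoid` is illustrative, not from any package):

```r
# Sigmoid (logistic) function: maps any real-valued input z into (0, 1)
sigmoid <- function(z) 1 / (1 + exp(-z))

# Plot the characteristic S-shaped curve
z <- seq(-6, 6, by = 0.1)
plot(z, sigmoid(z), type = "l",
     xlab = "z", ylab = "sigmoid(z)", main = "Sigmoid Function")
abline(h = 0.5, lty = 2)  # 0.5 is the usual classification threshold
```

Note that sigmoid(0) = 0.5, so the decision boundary pred > 0.5 used later corresponds to the linear predictor crossing zero.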
# install.packages('mlbench')
data(BreastCancer, package="mlbench")
bc <- BreastCancer[complete.cases(BreastCancer), ] # keep complete rows
# remove id column
bc <- bc[,-1]
# convert to numeric
for(i in 1:9) {
bc[, i] <- as.numeric(as.character(bc[, i]))
}
# Change Y values to 1's and 0's
bc$Class <- ifelse(bc$Class == "malignant", 1, 0)
bc$Class <- factor(bc$Class, levels = c(0, 1))
library(caret)
'%ni%' <- Negate('%in%') # define 'not in' func
options(scipen=999) # prevents printing scientific notations.
set.seed(100)
trainDataIndex <- createDataPartition(bc$Class, p=0.7, list = F)
trainData <- bc[trainDataIndex, ]
testData <- bc[-trainDataIndex, ]
# Class distribution of train data
table(trainData$Class)
# Down Sample
set.seed(100)
down_train <- downSample(x = trainData[, colnames(trainData) %ni% "Class"],
y = trainData$Class)
table(down_train$Class)
# Up Sample (optional)
set.seed(100)
up_train <- upSample(x = trainData[, colnames(trainData) %ni% "Class"],
y = trainData$Class)
table(up_train$Class)
logitmod <- glm(Class ~ Cl.thickness + Cell.size + Cell.shape, family = "binomial", data=down_train)
summary(logitmod)
pred <- predict(logitmod, newdata = testData, type = "response")
pred
# Recode factors
y_pred_num <- ifelse(pred > 0.5, 1, 0)
y_pred <- factor(y_pred_num, levels=c(0, 1))
y_act <- testData$Class
# Accuracy
mean(y_pred == y_act) # 94%
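Accuracy alone can hide which class the errors come from; a confusion matrix makes that visible. A standalone sketch with toy labels (the vectors below are illustrative, not the testData objects above):

```r
# Toy actual vs. predicted labels (illustrative data)
y_act  <- factor(c(0, 0, 1, 1, 1, 0), levels = c(0, 1))
y_pred <- factor(c(0, 1, 1, 1, 0, 0), levels = c(0, 1))

# Confusion matrix: rows = predicted, columns = actual
cm <- table(Predicted = y_pred, Actual = y_act)
print(cm)

# Misclassification rate = off-diagonal count / total
1 - sum(diag(cm)) / sum(cm)
```

The diagonal holds the correct predictions; everything off the diagonal is an error, split into false positives and false negatives.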
  0   1
311 168
  0   1
168 168
  0   1
311 311
Call:
glm(formula = Class ~ Cl.thickness + Cell.size + Cell.shape,
    family = "binomial", data = down_train)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-3.0716  -0.1534  -0.0316   0.0193   2.8420

Coefficients:
             Estimate Std. Error z value        Pr(>|z|)
(Intercept)   -8.8755     1.3285  -6.681 0.0000000000237 ***
Cl.thickness   0.7293     0.1768   4.125 0.0000370162955 ***
Cell.size      1.0803     0.3334   3.240         0.00119 **
Cell.shape     0.8571     0.2920   2.935         0.00334 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 465.795  on 335  degrees of freedom
Residual deviance:  70.378  on 332  degrees of freedom
AIC: 78.378

Number of Fisher Scoring iterations: 8
# install.packages("party")
# install.packages("partykit")
library(party)
library(partykit)
print(head(readingSkills))
input.dat <- readingSkills[c(1:105),]
  nativeSpeaker age shoeSize    score
1           yes   5 24.83189 32.29385
2           yes   6 25.95238 36.63105
3            no  11 30.42170 49.60593
4           yes   7 28.66450 40.28456
5           yes  11 31.88207 55.46085
6           yes  10 30.07843 52.83124
output.tree <- ctree(
nativeSpeaker ~ age + shoeSize + score,
data = input.dat)
plot(as.simpleparty(output.tree))
library(party)
library(partykit)
# Iris data
data("iris")
str(iris)
summary(iris)
# Data partition
set.seed(555)
ind <- sample(2,
nrow(iris),
replace = TRUE,
prob = c(0.8, 0.2))
train <- iris[ind==1, ]
test <- iris[ind==2, ]
# Decision tree model
tree <- partykit::ctree(Species~.,
train,
control = ctree_control(mincriterion = .9999, minsplit =20 ))
print(tree)
# Visualization of decision trees
plot(tree)
plot(tree, type = 'simple')
# Prediction
predict(tree, train, type = 'prob')
# Misclassification error - train data
p1 <- predict(tree, train)
tab1 <- table(Predicted = p1, Actual = train$Species)
tab1
1 - sum(diag(tab1))/sum(tab1)
# Misclassification error - test data
p2 <- predict(tree, test)
tab2 <- table(Predicted = p2, Actual = test$Species)
tab2
1 - sum(diag(tab2))/sum(tab2)
'data.frame':	150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
 Median :5.800   Median :3.000   Median :4.350   Median :1.300
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
       Species
 setosa    :50
 versicolor:50
 virginica :50
Model formula:
Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width

Fitted party:
[1] root
|   [2] Petal.Length <= 1.9: setosa (n = 42, err = 0.0%)
|   [3] Petal.Length > 1.9
|   |   [4] Petal.Width <= 1.7: versicolor (n = 42, err = 11.9%)
|   |   [5] Petal.Width > 1.7: virginica (n = 34, err = 2.9%)

Number of inner nodes:    2
Number of terminal nodes: 3
Predicted class probabilities on the training data (excerpt; row names are observation indices):

    setosa versicolor virginica
1        1 0.00000000 0.0000000
3        1 0.00000000 0.0000000
4        1 0.00000000 0.0000000
...
114      0 0.02941176 0.9705882
120      0 0.88095238 0.1190476
...
149      0 0.02941176 0.9705882
            Actual
Predicted    setosa versicolor virginica
  setosa         42          0         0
  versicolor      0         37         5
  virginica       0          1        33
            Actual
Predicted    setosa versicolor virginica
  setosa          8          0         0
  versicolor      0         12         0
  virginica       0          0        12
plot(iris)
plot(iris$Sepal.Length, iris$Sepal.Width, col=iris$Species)
plot(iris$Petal.Length, iris$Petal.Width, col=iris$Species)
s <- sample(150, 100)
col <- c('Petal.Length','Petal.Width','Species')
iris_train <- iris[s,col]
iris_test <- iris[-s,col]
library(e1071)
svmfit <- svm(Species ~ ., data = iris_train, kernel="linear", cost=.1, scale = FALSE)
print(svmfit)
plot(svmfit, iris_train[, col])
Call:
svm(formula = Species ~ ., data = iris_train, kernel = "linear",
    cost = 0.1, scale = FALSE)

Parameters:
   SVM-Type:  C-classification
 SVM-Kernel:  linear
       cost:  0.1
      gamma:  0.5

Number of Support Vectors:  46
tuned <- tune(svm, Species ~ ., data = iris_train, kernel = "linear",
              ranges = list(cost = c(0.001, 0.01, 0.1, 1.1, 100))) # cross-validate to find the best cost parameter
summary(tuned)
Parameter tuning of 'svm':

- sampling method: 10-fold cross validation

- best parameters:
 cost
  0.1

- best performance: 0.04

- Detailed performance results:
     cost error  dispersion
1   0.001  0.74  0.06992059
2   0.010  0.42  0.11352924
3   0.100  0.04  0.05163978
4   1.100  0.04  0.05163978
5 100.000  0.04  0.05163978
p <- predict(svmfit, iris_test[,col], type='class')
plot(p)
table(p, iris_test[,3])
mean(p==iris_test[,3])
p            setosa versicolor virginica
  setosa         18          0         0
  versicolor      0         15         1
  virginica       0          2        14
#library(caret)
# import data from UCI Machine Learning website
# url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data"
# download.file(url = url, destfile = "wine.data")
df <- read.csv("data/wine.csv", header = TRUE)
str(df)
'data.frame':	178 obs. of  14 variables:
 $ V1 : int  1 1 1 1 1 1 1 1 1 1 ...
 $ V2 : num  14.2 13.2 13.2 14.4 13.2 ...
 $ V3 : num  1.71 1.78 2.36 1.95 2.59 1.76 1.87 2.15 1.64 1.35 ...
 $ V4 : num  2.43 2.14 2.67 2.5 2.87 2.45 2.45 2.61 2.17 2.27 ...
 $ V5 : num  15.6 11.2 18.6 16.8 21 15.2 14.6 17.6 14 16 ...
 $ V6 : int  127 100 101 113 118 112 96 121 97 98 ...
 $ V7 : num  2.8 2.65 2.8 3.85 2.8 3.27 2.5 2.6 2.8 2.98 ...
 $ V8 : num  3.06 2.76 3.24 3.49 2.69 3.39 2.52 2.51 2.98 3.15 ...
 $ V9 : num  0.28 0.26 0.3 0.24 0.39 0.34 0.3 0.31 0.29 0.22 ...
 $ V10: num  2.29 1.28 2.81 2.18 1.82 1.97 1.98 1.25 1.98 1.85 ...
 $ V11: num  5.64 4.38 5.68 7.8 4.32 6.75 5.25 5.05 5.2 7.22 ...
 $ V12: num  1.04 1.05 1.03 0.86 1.04 1.05 1.02 1.06 1.08 1.01 ...
 $ V13: num  3.92 3.4 3.17 3.45 2.93 2.85 3.58 3.58 2.85 3.55 ...
 $ V14: int  1065 1050 1185 1480 735 1450 1290 1295 1045 1045 ...
set.seed(3033)
intrain <- createDataPartition(y = df$V1, p= 0.7, list = FALSE)
training <- df[intrain,]
testing <- df[-intrain,]
anyNA(df)
summary(df)
training[["V1"]] = factor(training[["V1"]]) # V1 integer variable to factor variable
       V1              V2              V3              V4
 Min.   :1.000   Min.   :11.03   Min.   :0.740   Min.   :1.360
 1st Qu.:1.000   1st Qu.:12.36   1st Qu.:1.603   1st Qu.:2.210
 Median :2.000   Median :13.05   Median :1.865   Median :2.360
 Mean   :1.938   Mean   :13.00   Mean   :2.336   Mean   :2.367
 3rd Qu.:3.000   3rd Qu.:13.68   3rd Qu.:3.083   3rd Qu.:2.558
 Max.   :3.000   Max.   :14.83   Max.   :5.800   Max.   :3.230
       V5              V6               V7              V8
 Min.   :10.60   Min.   : 70.00   Min.   :0.980   Min.   :0.340
 1st Qu.:17.20   1st Qu.: 88.00   1st Qu.:1.742   1st Qu.:1.205
 Median :19.50   Median : 98.00   Median :2.355   Median :2.135
 Mean   :19.49   Mean   : 99.74   Mean   :2.295   Mean   :2.029
 3rd Qu.:21.50   3rd Qu.:107.00   3rd Qu.:2.800   3rd Qu.:2.875
 Max.   :30.00   Max.   :162.00   Max.   :3.880   Max.   :5.080
       V9              V10             V11              V12
 Min.   :0.1300   Min.   :0.410   Min.   : 1.280   Min.   :0.4800
 1st Qu.:0.2700   1st Qu.:1.250   1st Qu.: 3.220   1st Qu.:0.7825
 Median :0.3400   Median :1.555   Median : 4.690   Median :0.9650
 Mean   :0.3619   Mean   :1.591   Mean   : 5.058   Mean   :0.9574
 3rd Qu.:0.4375   3rd Qu.:1.950   3rd Qu.: 6.200   3rd Qu.:1.1200
 Max.   :0.6600   Max.   :3.580   Max.   :13.000   Max.   :1.7100
      V13             V14
 Min.   :1.270   Min.   : 278.0
 1st Qu.:1.938   1st Qu.: 500.5
 Median :2.780   Median : 673.5
 Mean   :2.612   Mean   : 746.9
 3rd Qu.:3.170   3rd Qu.: 985.0
 Max.   :4.000   Max.   :1680.0
trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
set.seed(3333)
knn_fit <- train(V1 ~., data = training, method = "knn",
trControl=trctrl,
preProcess = c("center", "scale"),
tuneLength = 10)
knn_fit
k-Nearest Neighbors

125 samples
 13 predictor
  3 classes: '1', '2', '3'

Pre-processing: centered (13), scaled (13)
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 113, 114, 111, 113, 112, 112, ...
Resampling results across tuning parameters:

  k   Accuracy   Kappa
   5  0.9627511  0.9436722
   7  0.9435204  0.9148116
   9  0.9490759  0.9230590
  11  0.9495421  0.9237385
  13  0.9465507  0.9194040
  15  0.9521062  0.9276424
  17  0.9548840  0.9318090
  19  0.9493284  0.9233699
  21  0.9574481  0.9356161
  23  0.9490759  0.9231172

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 5.
plot(knn_fit)
test_pred <- predict(knn_fit, newdata = testing)
test_pred
confusionMatrix(test_pred, factor(testing$V1)) # convert V1 to a factor so both arguments share levels
Confusion Matrix and Statistics

          Reference
Prediction  1  2  3
         1 15  0  0
         2  0 24  0
         3  0  0 14

Overall Statistics

               Accuracy : 1
                 95% CI : (0.9328, 1)
    No Information Rate : 0.4528
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 1
 Mcnemar's Test P-Value : NA

Statistics by Class:

                     Class: 1 Class: 2 Class: 3
Sensitivity             1.000   1.0000   1.0000
Specificity             1.000   1.0000   1.0000
Pos Pred Value          1.000   1.0000   1.0000
Neg Pred Value          1.000   1.0000   1.0000
Prevalence              0.283   0.4528   0.2642
Detection Rate          0.283   0.4528   0.2642
Detection Prevalence    0.283   0.4528   0.2642
Balanced Accuracy       1.000   1.0000   1.0000
Further Reference:
https://www.datacamp.com/community/tutorials/machine-learning-in-r
The cforest() function can be used from the party package. cforest() generally improves prediction accuracy compared to ctree(); however, it does not have the visualization capabilities that ctree() offers.
# install.packages("ipred", repos='http://cran.us.r-project.org', lib="/home/nbuser/R")
library (ipred)
library (party)
data ("GlaucomaMVF", package = "ipred")
inputData <- GlaucomaMVF
set.seed (100)
train <- sample(1:nrow(inputData), 0.7*nrow(inputData)) # random sample
trainData <- inputData[train,] # training data
testData <- inputData[-train,] # test data
Fit a conditional inference tree with the ctree() function
cTreeMod <- ctree(Class ~ ., data = trainData) # fit ctree with 'Class' as the dependent variable
actuals <- testData$Class # actuals
predicted <- predict(cTreeMod, newdata = testData) # predicted
table(true = actuals, pred = predicted) # confusion matrix
mean(testData$Class != predicted) # misclassification error rate
          pred
true       glaucoma normal
  glaucoma       21      4
  normal          8     19
Fit a random forest with the cforest() function
cForestMod <- cforest(Class ~ ., data = trainData) # random forest of conditional inference trees
actuals <- testData$Class # actuals
predicted <- predict(cForestMod, newdata = testData) # predicted
table(true = actuals, pred = predicted) # confusion matrix
mean(testData$Class != predicted) # misclassification error rate
          pred
true       glaucoma normal
  glaucoma       22      3
  normal          2     25
# Data
getwd()
data <- read.csv("binary.csv", header = TRUE)
str(data)
# Min-Max Normalization
data$gre <- (data$gre - min(data$gre))/(max(data$gre) - min(data$gre))
data$gpa <- (data$gpa - min(data$gpa))/(max(data$gpa) - min(data$gpa))
data$rank <- (data$rank - min(data$rank))/(max(data$rank)-min(data$rank))
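The same min-max formula, (x - min) / (max - min), is written out three times above; wrapping it in a small helper avoids the repetition (the name `normalize` is illustrative, not from any package):

```r
# Min-max normalization: rescale a numeric vector to the [0, 1] range
normalize <- function(x) (x - min(x)) / (max(x) - min(x))

normalize(c(2, 4, 6))   # 0.0 0.5 1.0

# Applied to the admission data, this would replace the three lines above:
# data[c("gre", "gpa", "rank")] <- lapply(data[c("gre", "gpa", "rank")], normalize)
```

Rescaling the inputs to a common range matters for neural networks, since features on very different scales can slow or destabilize training.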
# Data Partition
set.seed(222)
ind <- sample(2, nrow(data), replace = TRUE, prob = c(0.7, 0.3))
training <- data[ind==1,]
testing <- data[ind==2,]
# Neural Networks
library(neuralnet)
set.seed(333)
n <- neuralnet(admit~gre+gpa+rank,
data = training,
hidden = 5,
err.fct = "ce",
linear.output = FALSE)
plot(n)
# Prediction
output <- compute(n, training[,-1])
head(output$net.result)
training[1, ] # first training row, to compare with the first prediction above
# Node Output Calculations with Sigmoid Activation Function
in4 <- 0.0455 + (0.82344*0.7586206897) + (1.35186*0.8103448276) + (-0.87435*0.6666666667)
out4 <- 1/(1+exp(-in4))
in5 <- -7.06125 +(8.5741*out4)
out5 <- 1/(1+exp(-in5))
# Confusion Matrix & Misclassification Error - training data
output <- compute(n, training[,-1])
p1 <- output$net.result
pred1 <- ifelse(p1>0.5, 1, 0)
tab1 <- table(pred1, training$admit)
tab1
1-sum(diag(tab1))/sum(tab1)
# Confusion Matrix & Misclassification Error - testing data
output <- compute(n, testing[,-1])
p2 <- output$net.result
pred2 <- ifelse(p2>0.5, 1, 0)
tab2 <- table(pred2, testing$admit)
tab2
1-sum(diag(tab2))/sum(tab2)