Multiclass classification using scikit-learn
Approach
Types of classification algorithms in Machine Learning
Test data: the iris dataset
# importing necessary libraries
from sklearn import datasets
from sklearn.model_selection import train_test_split
# loading the iris dataset
iris = datasets.load_iris()
# X -> features, y -> label
X = iris.data
y = iris.target
# dividing X, y into train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
Using the confusion matrix:
from sklearn.metrics import confusion_matrix
y_true=[2,1,0,1,2,0]
y_pred=[2,0,0,1,2,2]
cm=confusion_matrix(y_true, y_pred)
print(cm)
[[1 0 1]
 [1 1 0]
 [0 0 2]]
In the first row [1 0 1], the third entry (1) means that one sample whose true label is 0 was predicted as 2.
In the second row [1 1 0], the first entry (1) means that one sample whose true label is 1 was predicted as 0.
It is a statistical method for analysing a data set in which there are one or more independent variables that determine an outcome. The outcome is measured with a dichotomous variable (in which there are only two possible outcomes). The goal of logistic regression is to find the best fitting model to describe the relationship between the dichotomous characteristic of interest (dependent variable = response or outcome variable) and a set of independent (predictor or explanatory) variables.
Definition: Logistic regression is a machine learning algorithm for classification. In this algorithm, the probabilities describing the possible outcomes of a single trial are modelled using a logistic function.
Advantages: Logistic regression is designed for this purpose (classification), and is most useful for understanding the influence of several independent variables on a single outcome variable.
Disadvantages: Works only when the predicted variable is binary, assumes all predictors are independent of each other, and assumes data is free of missing values.
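Since logistic regression models class probabilities with the logistic function, a minimal sketch of that function (separate from the iris example, purely illustrative) may help:

```python
import math

def logistic(z):
    """Logistic (sigmoid) function: maps any real-valued score z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Scores far below 0 map close to 0, scores far above 0 map close to 1.
print(logistic(-4))  # ~0.018
print(logistic(0))   # 0.5
print(logistic(4))   # ~0.982
```

The symmetry logistic(z) + logistic(-z) = 1 is what lets a single fitted score describe both outcomes of a dichotomous variable.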
# training a Logistic Regression classifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix  # confusion matrix
clf = LogisticRegression(random_state=0, solver='lbfgs', multi_class='multinomial')
clf.fit(X_train, y_train)
# accuracy on X_test
accuracy = clf.score(X_test,y_test)
print(accuracy)
y_predictions = clf.predict(X_test)
# creating a confusion matrix
cm = confusion_matrix(y_test, y_predictions)
print(cm)
0.9777777777777777
[[16  0  0]
 [ 0 17  1]
 [ 0  0 11]]
It is a classification technique based on Bayes’ Theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. Even if these features depend on each other or upon the existence of the other features, all of these properties independently contribute to the probability. Naive Bayes model is easy to build and particularly useful for very large data sets. Along with simplicity, Naive Bayes is known to outperform even highly sophisticated classification methods.
它是一种基于贝叶斯定理的分类技术,假设在预测变量中具有独立性。简单来说,朴素贝叶斯分类器假定类中特定特征的存在与任何其他特征的存在无关。即使这些特征彼此依赖或依赖于其他特征的存在,所有这些特性也独立地影响概率。朴素贝叶斯模型易于构建,特别适用于非常大的数据集。除简单外,Naive Bayes的表现甚至超过了高度复杂的分类方法。
Definition: The Naive Bayes algorithm is based on Bayes’ theorem with the assumption of independence between every pair of features. Naive Bayes classifiers work well in many real-world situations such as document classification and spam filtering.
Advantages: This algorithm requires a small amount of training data to estimate the necessary parameters. Naive Bayes classifiers are extremely fast compared to more sophisticated methods.
Disadvantages: Naive Bayes is known to be a bad estimator, so its probability outputs should not be taken too seriously.
The Naive Bayes classification method is based on Bayes’ theorem. It is termed ‘naive’ because it assumes independence between every pair of features in the data. Let (x1, x2, …, xn) be a feature vector and y be the class label corresponding to this feature vector.
Applying Bayes’ theorem,
P(y | x1, …, xn) = P(x1, …, xn | y) · P(y) / P(x1, …, xn)
Since x1, x2, …, xn are independent of each other,
P(y | x1, …, xn) = P(y) · ∏_i P(x_i | y) / P(x1, …, xn)
Removing P(x1, …, xn), which is constant for a given input, turns this into a proportionality:
P(y | x1, …, xn) ∝ P(y) · ∏_i P(x_i | y)
Therefore, the class label is decided by
y = argmax_y P(y) · ∏_i P(x_i | y)
P(y) is the relative frequency of class label y in the training dataset.
In the case of the Gaussian Naive Bayes classifier, P(x_i | y) is calculated as
P(x_i | y) = (1 / sqrt(2π · σ_y²)) · exp(−(x_i − μ_y)² / (2σ_y²))
where μ_y and σ_y² are the mean and variance of feature x_i among the training samples of class y.
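The argmax rule and the Gaussian likelihood above can be checked by hand. In this sketch the per-class means, variances, and priors are made-up toy numbers for a single feature, not values estimated from the iris data:

```python
import numpy as np

def gaussian_likelihood(x, mean, var):
    """P(x_i | y) under the Gaussian Naive Bayes assumption."""
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Hypothetical per-class statistics for one feature.
priors = {0: 0.5, 1: 0.5}                 # P(y): relative class frequencies
stats  = {0: (2.0, 1.0), 1: (6.0, 1.0)}   # (mean, variance) per class

x = 2.5  # new observation
scores = {y: priors[y] * gaussian_likelihood(x, m, v)
          for y, (m, v) in stats.items()}
prediction = max(scores, key=scores.get)  # argmax_y P(y) * P(x | y)
print(prediction)  # 0: x lies much closer to class 0's mean
```

With more than one feature, the likelihoods of the individual features are simply multiplied together, which is exactly the independence assumption at work.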
# training a Naive Bayes classifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix  # confusion matrix
clf = GaussianNB()
clf.fit(X_train, y_train)
# accuracy on X_test
accuracy = clf.score(X_test, y_test)
print(accuracy)
y_predictions = clf.predict(X_test)
# creating a confusion matrix
cm = confusion_matrix(y_test, y_predictions)
print(cm)
1.0
[[16  0  0]
 [ 0 18  0]
 [ 0  0 11]]
SVM (support vector machine) is an efficient classification method when the feature vector is high-dimensional. In scikit-learn we can specify the kernel function (here, linear). To learn more about kernel functions and SVM, refer to: Kernel function | scikit-learn and SVM.
Definition: Support vector machine is a representation of the training data as points in space separated into categories by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall.
Advantages: Effective in high dimensional spaces and uses a subset of training points in the decision function so it is also memory efficient.
Disadvantages: The algorithm does not directly provide probability estimates, these are calculated using an expensive five-fold cross-validation.
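To see the probability caveat in practice, SVC can be asked for probability estimates by setting probability=True; fitting then becomes slower because of the internal cross-validated calibration mentioned above. A small sketch on the full iris data:

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

iris = load_iris()
# probability=True enables probability calibration via an internal
# cross-validation -- this is the expensive step noted above.
clf = SVC(kernel='linear', C=1, probability=True, random_state=0)
clf.fit(iris.data, iris.target)

proba = clf.predict_proba(iris.data[:1])  # one row of class probabilities
print(proba)
print(proba.sum())  # each row sums to 1
```

Without probability=True, calling predict_proba on a fitted SVC raises an error; only decision_function values are available.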
# training a linear SVM classifier
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix  # confusion matrix
clf = SVC(kernel='linear', C=1)
clf.fit(X_train, y_train)
# accuracy on X_test
accuracy = clf.score(X_test, y_test)
print(accuracy)
y_predictions = clf.predict(X_test)
# creating a confusion matrix
cm = confusion_matrix(y_test, y_predictions)
print(cm)
0.9777777777777777
[[16  0  0]
 [ 0 17  1]
 [ 0  0 11]]
Decision tree classifier is a systematic approach for multiclass classification. It poses a set of questions to the dataset (related to its attributes/features). The decision tree classification algorithm can be visualized on a binary tree. On the root and each of the internal nodes, a question is posed and the data on that node is further split into separate records that have different characteristics. The leaves of the tree refer to the classes in which the dataset is split. In the following code snippet, we train a decision tree classifier in scikit-learn.
Definition: Given a data of attributes together with its classes, a decision tree produces a sequence of rules that can be used to classify the data.
Advantages: Decision Tree is simple to understand and visualise, requires little data preparation, and can handle both numerical and categorical data.
Disadvantages: Decision tree can create complex trees that do not generalise well, and decision trees can be unstable because small variations in the data might result in a completely different tree being generated.
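The “sequence of rules” a decision tree produces can be inspected directly: scikit-learn can print the fitted tree as if/else text. A small sketch on the full iris data, separate from the train/test split used in this section:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# One question is posed per internal node; leaves carry the class labels.
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

Each line of the printed tree corresponds to one of the questions posed at the root or an internal node, which makes the instability noted above easy to observe: refitting on slightly different data can change the printed rules entirely.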
# training a Decision Tree classifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix  # confusion matrix
clf = DecisionTreeClassifier(max_depth=2)
clf.fit(X_train, y_train)
# accuracy on X_test
accuracy = clf.score(X_test, y_test)
print(accuracy)
y_predictions = clf.predict(X_test)
# creating a confusion matrix
cm = confusion_matrix(y_test, y_predictions)
print(cm)
0.9111111111111111
[[16  0  0]
 [ 0 17  1]
 [ 0  3  8]]
Ensemble Method
GBDT (Gradient Boosting Decision Tree)
http://www.liuhaihua.cn/archives/57364.html
As a highly effective machine learning method, the boosted tree is one of the most commonly used algorithms in data mining and machine learning. Because it performs well and is insensitive to the scale and type of its inputs, it has become a staple tool for statisticians and data scientists alike, and it is among the tools most frequently used by winning Kaggle competitors. Finally, thanks to its strong performance and modest computational cost, it is also widely applied in industry.
Boosted trees go by many names, such as GBDT, GBRT (gradient boosted regression tree), and MART; LambdaMART is also a boosted-tree variant.
AdaBoost classifier with 100 weak learners:
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
iris = load_iris()
clf = AdaBoostClassifier(n_estimators=100)
scores = cross_val_score(clf, iris.data, iris.target, cv=5)
scores.mean()
0.9466666666666665
Gradient Tree Boosting or Gradient Boosted Regression Trees (GBRT):
The advantages of GBRT are: natural handling of data of mixed type (heterogeneous features), strong predictive power, and robustness to outliers in the output space (via robust loss functions).
The disadvantages of GBRT are: scalability, because the sequential nature of boosting means it can hardly be parallelized.
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier
X, y = make_hastie_10_2(random_state=0)
X_train, X_test = X[:2000], X[2000:]
y_train, y_test = y[:2000], y[2000:]
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,
max_depth=1, random_state=0).fit(X_train, y_train)
clf.score(X_test, y_test)
0.913
Definition: The random forest classifier is a meta-estimator that fits a number of decision trees on various sub-samples of the dataset and uses averaging to improve the predictive accuracy of the model and control over-fitting. The sub-sample size is always the same as the original input sample size, but the samples are drawn with replacement.
Advantages: Reduction in over-fitting and random forest classifier is more accurate than decision trees in most cases.
Disadvantages: Slow real-time prediction, and the algorithm is complex and harder to implement and interpret than a single decision tree.
# training a Random Forest classifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix  # confusion matrix
clf = RandomForestClassifier(n_estimators=70, oob_score=True)
clf.fit(X_train, y_train)
# accuracy on X_test
accuracy = clf.score(X_test, y_test)
print(accuracy)
y_predictions = clf.predict(X_test)
# creating a confusion matrix
cm = confusion_matrix(y_test, y_predictions)
print(cm)
0.8661
[[4160  889]
 [ 450 4501]]
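The classifier above is built with oob_score=True, but the resulting estimate is never read back. Because each tree is fit on a bootstrap sample, the samples a tree did not see provide a built-in validation score. A minimal sketch on the full iris data (random_state is added here only for repeatability):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
clf = RandomForestClassifier(n_estimators=70, oob_score=True, random_state=0)
clf.fit(iris.data, iris.target)

# Accuracy measured on the out-of-bag samples of each tree: an
# internal estimate of generalization error, no held-out set needed.
print(clf.oob_score_)
```

The out-of-bag score is usually close to what a separate test set would report, which is why it is a cheap sanity check for random forests.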
https://www.geeksforgeeks.org/ml-stochastic-gradient-descent-sgd/
What is gradient descent?
Before talking about stochastic gradient descent (SGD), let us first understand what gradient descent is. Gradient descent is a very popular optimization technique in machine learning and deep learning, and it can be used with most (if not all) learning algorithms. A gradient is essentially the slope of a function: the degree to which a parameter changes with respect to a change in another parameter. Mathematically, it can be described as the set of partial derivatives of a function with respect to its inputs; the larger the gradient, the steeper the slope. The cost function being minimized is assumed to be convex. Gradient descent can be described as an iterative method for finding the values of a function's parameters that minimize the cost function as far as possible. The parameters are initialized to particular values, and from there gradient descent runs iteratively, using calculus to move the parameters toward the values that give the smallest possible value of the cost function.
Types of gradient descent: generally, there are three types (batch, mini-batch, and stochastic).
In this article, we will discuss stochastic gradient descent, or SGD.
The word "stochastic" refers to a system or process linked with random probability. In stochastic gradient descent, therefore, a few samples are selected at random instead of the whole dataset at each iteration. In gradient descent there is a term called "batch", which denotes the total number of samples from the dataset used to compute the gradient at each iteration. In typical gradient descent optimization, such as batch gradient descent, the batch is taken to be the whole dataset. Although using the whole dataset is very useful for reaching the minimum in a less noisy, less random way, a problem arises when the dataset becomes very large. Suppose your dataset has a million samples: with a typical gradient descent optimization technique, you would have to use all one million samples to complete a single iteration, and you would have to do that at every iteration until the minimum is reached. This makes the computation very expensive to perform.
This problem is solved by stochastic gradient descent. SGD uses only a single sample (a batch size of 1) to perform each iteration; the samples are randomly shuffled and selected.
SGD algorithm:
For a randomly chosen sample (x_i, y_i), the parameters θ are updated as θ = θ − α · ∇J(θ; x_i, y_i), where α is the learning rate and J is the cost function.
So, in SGD, we find the gradient of the cost function for a single example at each iteration, instead of the sum of the gradients of the cost function over all examples.
In SGD, since only one sample from the dataset is chosen at each iteration, the path the algorithm takes to reach the minimum is usually noisier than that of the typical gradient descent algorithm. But this matters little: the path taken is unimportant as long as we reach the minimum, and the training time is significantly shorter.
Path taken by batch gradient descent:
Path taken by stochastic gradient descent:
One thing to note is that, because SGD is generally noisier than typical gradient descent, it usually takes a higher number of iterations to reach the minimum, owing to the randomness of its descent. Even though it requires more iterations, it is still computationally much cheaper per iteration than typical gradient descent. Hence, in most scenarios, SGD is preferred over batch gradient descent for optimizing a learning algorithm.
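The single-sample update described above can be sketched in plain Python for a one-parameter least-squares problem; all names and numbers here are illustrative:

```python
import random

# Toy data generated from y = 2x, so the optimal weight is w = 2.
data = [(x, 2.0 * x) for x in [1.0, 2.0, 3.0, 4.0]]

w, lr = 0.0, 0.05   # initial weight and learning rate (alpha)
random.seed(0)
for _ in range(200):
    x, y = random.choice(data)   # batch size 1: one randomly chosen sample
    grad = 2 * (w * x - y) * x   # gradient of the per-sample cost (w*x - y)^2
    w -= lr * grad               # SGD update: w = w - alpha * grad
print(w)  # converges close to 2.0
```

Each step uses only one (x, y) pair, so individual updates are noisy, but over many iterations the weight still settles at the minimizer, which is exactly the trade-off described above.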
https://scikit-learn.org/stable/modules/sgd.html
The advantages of Stochastic Gradient Descent are: efficiency, and ease of implementation (with lots of opportunities for code tuning).
The disadvantages of Stochastic Gradient Descent include: it requires a number of hyperparameters, such as the regularization parameter and the number of iterations, and it is sensitive to feature scaling.
from sklearn.linear_model import SGDClassifier
X = [[0., 0.], [1., 1.]]
y = [0, 1]
clf = SGDClassifier(loss="hinge", penalty="l2", tol=1e-3, max_iter=15)
clf.fit(X, y)
SGDClassifier(alpha=0.0001, average=False, class_weight=None, early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True, l1_ratio=0.15, learning_rate='optimal', loss='hinge', max_iter=15, n_iter=None, n_iter_no_change=5, n_jobs=None, penalty='l2', power_t=0.5, random_state=None, shuffle=True, tol=0.001, validation_fraction=0.1, verbose=0, warm_start=False)
clf.predict([[2., 2.]])
array([1])
KNN, or k-nearest neighbours, is the simplest classification algorithm. It does not depend on the structure of the data: whenever a new example is encountered, its k nearest neighbours in the training data are examined. The distance between two examples can be the Euclidean distance between their feature vectors, and the majority class among the k nearest neighbours is taken as the class of the new example.
Definition: Neighbours based classification is a type of lazy learning as it does not attempt to construct a general internal model, but simply stores instances of the training data. Classification is computed from a simple majority vote of the k nearest neighbours of each point.
Advantages: This algorithm is simple to implement, robust to noisy training data, and effective if training data is large.
Disadvantages: The value of K must be chosen, and the computational cost is high because the distance from each new instance to all the training samples must be computed.
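The distance-plus-majority-vote procedure described above can be sketched by hand before reaching for scikit-learn; the data and the helper name here are purely illustrative:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict by majority vote among the k nearest training points."""
    dists = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distances
    nearest = np.argsort(dists)[:k]                  # indices of k closest
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

# Toy data: two clusters, around (0, 0) and (5, 5).
X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X, y, np.array([4.5, 5.0])))  # 1: nearest neighbours are class 1
```

Note that every prediction scans all training points, which is exactly the computational cost mentioned in the disadvantages above.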
# training a KNN classifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix  # confusion matrix
clf = KNeighborsClassifier(n_neighbors=7)
clf.fit(X_train, y_train)
# accuracy on X_test
accuracy = clf.score(X_test, y_test)
print(accuracy)
y_predictions = clf.predict(X_test)
# creating a confusion matrix
cm = confusion_matrix(y_test, y_predictions)
print(cm)
0.6384
[[4996   53]
 [3563 1388]]