Multiclass classification using scikit-learn
Approach
Types of classification algorithms in Machine Learning
Test data: the iris dataset
# importing necessary libraries
from sklearn import datasets
from sklearn.model_selection import train_test_split
# loading the iris dataset
iris = datasets.load_iris()
# X -> features, y -> label
X = iris.data
y = iris.target
# dividing X, y into train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
Using the confusion matrix:
from sklearn.metrics import confusion_matrix
y_true=[2,1,0,1,2,0]
y_pred=[2,0,0,1,2,2]
cm=confusion_matrix(y_true, y_pred)
print(cm)
[[1 0 1]
 [1 1 0]
 [0 0 2]]
In the first row [1 0 1], the third entry (1) means that one sample whose true label is 0 was predicted as 2.
In the second row [1 1 0], the first entry (1) means that one sample whose true label is 1 was predicted as 0.
It is a statistical method for analysing a data set in which there are one or more independent variables that determine an outcome. The outcome is measured with a dichotomous variable (in which there are only two possible outcomes). The goal of logistic regression is to find the best fitting model to describe the relationship between the dichotomous characteristic of interest (dependent variable = response or outcome variable) and a set of independent (predictor or explanatory) variables.
Definition: Logistic regression is a machine learning algorithm for classification. In this algorithm, the probabilities describing the possible outcomes of a single trial are modelled using a logistic function.
Advantages: Logistic regression is designed for this purpose (classification), and is most useful for understanding the influence of several independent variables on a single outcome variable.
Disadvantages: Works only when the predicted variable is binary, assumes all predictors are independent of each other, and assumes data is free of missing values.
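Since logistic regression models class probabilities with the logistic function, a minimal sketch of that function (separate from the iris example, purely illustrative) may help:

```python
import math

def logistic(z):
    """Logistic (sigmoid) function: maps any real-valued score z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Scores far below 0 map close to 0, scores far above 0 map close to 1.
print(logistic(-4))  # ~0.018
print(logistic(0))   # 0.5
print(logistic(4))   # ~0.982
```

The symmetry logistic(z) + logistic(-z) = 1 is what lets a single fitted score describe both outcomes of a dichotomous variable.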
# training a Logistic Regression classifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix  # confusion matrix
clf = LogisticRegression(random_state=0, solver='lbfgs', multi_class='multinomial')
clf.fit(X_train, y_train)
# accuracy on X_test
accuracy = clf.score(X_test,y_test)
print(accuracy)
y_predictions = clf.predict(X_test)
# creating a confusion matrix
cm = confusion_matrix(y_test, y_predictions)
print(cm)
0.9777777777777777
[[16  0  0]
 [ 0 17  1]
 [ 0  0 11]]
It is a classification technique based on Bayes’ Theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. Even if these features depend on each other or upon the existence of the other features, all of these properties independently contribute to the probability. Naive Bayes model is easy to build and particularly useful for very large data sets. Along with simplicity, Naive Bayes is known to outperform even highly sophisticated classification methods.
它是一种基于贝叶斯定理的分类技术,假设在预测变量中具有独立性。简单来说,朴素贝叶斯分类器假定类中特定特征的存在与任何其他特征的存在无关。即使这些特征彼此依赖或依赖于其他特征的存在,所有这些特性也独立地影响概率。朴素贝叶斯模型易于构建,特别适用于非常大的数据集。除简单外,Naive Bayes的表现甚至超过了高度复杂的分类方法。
Definition: The Naive Bayes algorithm is based on Bayes’ theorem with the assumption of independence between every pair of features. Naive Bayes classifiers work well in many real-world situations such as document classification and spam filtering.
Advantages: This algorithm requires a small amount of training data to estimate the necessary parameters. Naive Bayes classifiers are extremely fast compared to more sophisticated methods.
Disadvantages: Naive Bayes is known to be a bad estimator, so its probability outputs should not be taken too seriously.
The Naive Bayes classification method is based on Bayes’ theorem. It is termed ‘naive’ because it assumes independence between every pair of features in the data. Let (x1, x2, …, xn) be a feature vector and y be the class label corresponding to this feature vector.
Applying Bayes’ theorem,
P(y | x1, …, xn) = P(x1, …, xn | y) · P(y) / P(x1, …, xn)
Since x1, x2, …, xn are independent of each other,
P(y | x1, …, xn) = P(y) · ∏_i P(x_i | y) / P(x1, …, xn)
Removing P(x1, …, xn), which is constant for a given input, turns this into a proportionality:
P(y | x1, …, xn) ∝ P(y) · ∏_i P(x_i | y)
Therefore, the class label is decided by
y = argmax_y P(y) · ∏_i P(x_i | y)
P(y) is the relative frequency of class label y in the training dataset.
In the case of the Gaussian Naive Bayes classifier, P(x_i | y) is calculated as
P(x_i | y) = (1 / sqrt(2π · σ_y²)) · exp(−(x_i − μ_y)² / (2σ_y²))
where μ_y and σ_y² are the mean and variance of feature x_i among the training samples of class y.
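The argmax rule and the Gaussian likelihood above can be checked by hand. In this sketch the per-class means, variances, and priors are made-up toy numbers for a single feature, not values estimated from the iris data:

```python
import numpy as np

def gaussian_likelihood(x, mean, var):
    """P(x_i | y) under the Gaussian Naive Bayes assumption."""
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Hypothetical per-class statistics for one feature.
priors = {0: 0.5, 1: 0.5}                 # P(y): relative class frequencies
stats  = {0: (2.0, 1.0), 1: (6.0, 1.0)}   # (mean, variance) per class

x = 2.5  # new observation
scores = {y: priors[y] * gaussian_likelihood(x, m, v)
          for y, (m, v) in stats.items()}
prediction = max(scores, key=scores.get)  # argmax_y P(y) * P(x | y)
print(prediction)  # 0: x lies much closer to class 0's mean
```

With more than one feature, the likelihoods of the individual features are simply multiplied together, which is exactly the independence assumption at work.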
# training a Naive Bayes classifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix  # confusion matrix
clf = GaussianNB()
clf.fit(X_train, y_train)
# accuracy on X_test
accuracy = clf.score(X_test, y_test)
print(accuracy)
y_predictions = clf.predict(X_test)
# creating a confusion matrix
cm = confusion_matrix(y_test, y_predictions)
print(cm)
1.0
[[16  0  0]
 [ 0 18  0]
 [ 0  0 11]]
SVM (support vector machine) is an efficient classification method when the feature vector is high-dimensional. In scikit-learn we can specify the kernel function (here, linear). To learn more about kernel functions and SVM, refer to: Kernel function | scikit-learn and SVM.
Definition: Support vector machine is a representation of the training data as points in space separated into categories by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall.
Advantages: Effective in high dimensional spaces and uses a subset of training points in the decision function so it is also memory efficient.
Disadvantages: The algorithm does not directly provide probability estimates, these are calculated using an expensive five-fold cross-validation.
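To see the probability caveat in practice, SVC can be asked for probability estimates by setting probability=True; fitting then becomes slower because of the internal cross-validated calibration mentioned above. A small sketch on the full iris data:

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

iris = load_iris()
# probability=True enables probability calibration via an internal
# cross-validation -- this is the expensive step noted above.
clf = SVC(kernel='linear', C=1, probability=True, random_state=0)
clf.fit(iris.data, iris.target)

proba = clf.predict_proba(iris.data[:1])  # one row of class probabilities
print(proba)
print(proba.sum())  # each row sums to 1
```

Without probability=True, calling predict_proba on a fitted SVC raises an error; only decision_function values are available.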
# training a linear SVM classifier
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix  # confusion matrix
clf = SVC(kernel='linear', C=1)
clf.fit(X_train, y_train)
# accuracy on X_test
accuracy = clf.score(X_test, y_test)
print(accuracy)
y_predictions = clf.predict(X_test)
# creating a confusion matrix
cm = confusion_matrix(y_test, y_predictions)
print(cm)
0.9777777777777777
[[16  0  0]
 [ 0 17  1]
 [ 0  0 11]]
Decision tree classifier is a systematic approach for multiclass classification. It poses a set of questions to the dataset (related to its attributes/features). The decision tree classification algorithm can be visualized on a binary tree. On the root and each of the internal nodes, a question is posed and the data on that node is further split into separate records that have different characteristics. The leaves of the tree refer to the classes in which the dataset is split. In the following code snippet, we train a decision tree classifier in scikit-learn.
Definition: Given a data of attributes together with its classes, a decision tree produces a sequence of rules that can be used to classify the data.
Advantages: Decision Tree is simple to understand and visualise, requires little data preparation, and can handle both numerical and categorical data.
Disadvantages: Decision tree can create complex trees that do not generalise well, and decision trees can be unstable because small variations in the data might result in a completely different tree being generated.
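The “sequence of rules” a decision tree produces can be inspected directly: scikit-learn can print the fitted tree as if/else text. A small sketch on the full iris data, separate from the train/test split used in this section:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# One question is posed per internal node; leaves carry the class labels.
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

Each line of the printed tree corresponds to one of the questions posed at the root or an internal node, which makes the instability noted above easy to observe: refitting on slightly different data can change the printed rules entirely.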
# training a Decision Tree classifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix  # confusion matrix
clf = DecisionTreeClassifier(max_depth=2)
clf.fit(X_train, y_train)
# accuracy on X_test
accuracy = clf.score(X_test, y_test)
print(accuracy)
y_predictions = clf.predict(X_test)
# creating a confusion matrix
cm = confusion_matrix(y_test, y_predictions)
print(cm)
0.9111111111111111
[[16  0  0]
 [ 0 17  1]
 [ 0  3  8]]
Ensemble Method
GBDT (Gradient Boosting Decision Tree)
http://www.liuhaihua.cn/archives/57364.html
As a highly effective machine learning method, the boosted tree is one of the most commonly used algorithms in data mining and machine learning. Because it performs well and is insensitive to the scale and type of its inputs, it has become a staple tool for statisticians and data scientists alike, and it is among the tools most frequently used by winning Kaggle competitors. Finally, thanks to its strong performance and modest computational cost, it is also widely applied in industry.
Boosted trees go by many names, such as GBDT, GBRT (gradient boosted regression tree), and MART; LambdaMART is also a boosted-tree variant.
AdaBoost classifier with 100 weak learners:
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
iris = load_iris()
clf = AdaBoostClassifier(n_estimators=100)
scores = cross_val_score(clf, iris.data, iris.target, cv=5)
scores.mean()
0.9466666666666665
Gradient Tree Boosting or Gradient Boosted Regression Trees (GBRT):
The advantages of GBRT are: natural handling of data of mixed type (heterogeneous features), strong predictive power, and robustness to outliers in the output space (via robust loss functions).
The disadvantages of GBRT are: scalability, because the sequential nature of boosting means it can hardly be parallelized.
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier
X, y = make_hastie_10_2(random_state=0)
X_train, X_test = X[:2000], X[2000:]
y_train, y_test = y[:2000], y[2000:]
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,
max_depth=1, random_state=0).fit(X_train, y_train)
clf.score(X_test, y_test)
0.913
Definition: The random forest classifier is a meta-estimator that fits a number of decision trees on various sub-samples of the dataset and uses averaging to improve the predictive accuracy of the model and control over-fitting. The sub-sample size is always the same as the original input sample size, but the samples are drawn with replacement.
Advantages: Reduction in over-fitting and random forest classifier is more accurate than decision trees in most cases.
Disadvantages: Slow real-time prediction, and the algorithm is complex and harder to implement and interpret than a single decision tree.
# training a Random Forest classifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix  # confusion matrix
clf = RandomForestClassifier(n_estimators=70, oob_score=True)
clf.fit(X_train, y_train)
# accuracy on X_test
accuracy = clf.score(X_test, y_test)
print(accuracy)
y_predictions = clf.predict(X_test)
# creating a confusion matrix
cm = confusion_matrix(y_test, y_predictions)
print(cm)
0.8661
[[4160  889]
 [ 450 4501]]
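The classifier above is built with oob_score=True, but the resulting estimate is never read back. Because each tree is fit on a bootstrap sample, the samples a tree did not see provide a built-in validation score. A minimal sketch on the full iris data (random_state is added here only for repeatability):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
clf = RandomForestClassifier(n_estimators=70, oob_score=True, random_state=0)
clf.fit(iris.data, iris.target)

# Accuracy measured on the out-of-bag samples of each tree: an
# internal estimate of generalization error, no held-out set needed.
print(clf.oob_score_)
```

The out-of-bag score is usually close to what a separate test set would report, which is why it is a cheap sanity check for random forests.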
https://www.geeksforgeeks.org/ml-stochastic-gradient-descent-sgd/
What is gradient descent?
Before talking about stochastic gradient descent (SGD), let us first understand what gradient descent is. Gradient descent is a very popular optimization technique in machine learning and deep learning, and it can be used with most (if not all) learning algorithms. A gradient is essentially the slope of a function: the degree to which a parameter changes with respect to a change in another parameter. Mathematically, it can be described as the set of partial derivatives of a function with respect to its inputs; the larger the gradient, the steeper the slope. The cost function being minimized is assumed to be convex. Gradient descent can be described as an iterative method for finding the values of a function's parameters that minimize the cost function as far as possible. The parameters are initialized to particular values, and from there gradient descent runs iteratively, using calculus to move the parameters toward the values that give the smallest possible value of the cost function.
Types of gradient descent: generally, there are three types (batch, mini-batch, and stochastic).
In this article, we will discuss stochastic gradient descent, or SGD.
The word "stochastic" refers to a system or process linked with random probability. In stochastic gradient descent, therefore, a few samples are selected at random instead of the whole dataset at each iteration. In gradient descent there is a term called "batch", which denotes the total number of samples from the dataset used to compute the gradient at each iteration. In typical gradient descent optimization, such as batch gradient descent, the batch is taken to be the whole dataset. Although using the whole dataset is very useful for reaching the minimum in a less noisy, less random way, a problem arises when the dataset becomes very large. Suppose your dataset has a million samples: with a typical gradient descent optimization technique, you would have to use all one million samples to complete a single iteration, and you would have to do that at every iteration until the minimum is reached. This makes the computation very expensive to perform.
This problem is solved by stochastic gradient descent. SGD uses only a single sample (a batch size of 1) to perform each iteration; the samples are randomly shuffled and selected.
SGD algorithm:
For a randomly chosen sample (x_i, y_i), the parameters θ are updated as θ = θ − α · ∇J(θ; x_i, y_i), where α is the learning rate and J is the cost function.
So, in SGD, we find the gradient of the cost function for a single example at each iteration, instead of the sum of the gradients of the cost function over all examples.
In SGD, since only one sample from the dataset is chosen at each iteration, the path the algorithm takes to reach the minimum is usually noisier than that of the typical gradient descent algorithm. But this matters little: the path taken is unimportant as long as we reach the minimum, and the training time is significantly shorter.
Path taken by batch gradient descent:
Path taken by stochastic gradient descent:
One thing to note is that, because SGD is generally noisier than typical gradient descent, it usually takes a higher number of iterations to reach the minimum, owing to the randomness of its descent. Even though it requires more iterations, it is still computationally much cheaper per iteration than typical gradient descent. Hence, in most scenarios, SGD is preferred over batch gradient descent for optimizing a learning algorithm.
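The single-sample update described above can be sketched in plain Python for a one-parameter least-squares problem; all names and numbers here are illustrative:

```python
import random

# Toy data generated from y = 2x, so the optimal weight is w = 2.
data = [(x, 2.0 * x) for x in [1.0, 2.0, 3.0, 4.0]]

w, lr = 0.0, 0.05   # initial weight and learning rate (alpha)
random.seed(0)
for _ in range(200):
    x, y = random.choice(data)   # batch size 1: one randomly chosen sample
    grad = 2 * (w * x - y) * x   # gradient of the per-sample cost (w*x - y)^2
    w -= lr * grad               # SGD update: w = w - alpha * grad
print(w)  # converges close to 2.0
```

Each step uses only one (x, y) pair, so individual updates are noisy, but over many iterations the weight still settles at the minimizer, which is exactly the trade-off described above.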
https://scikit-learn.org/stable/modules/sgd.html
The advantages of Stochastic Gradient Descent are: efficiency, and ease of implementation (with lots of opportunities for code tuning).
The disadvantages of Stochastic Gradient Descent include: it requires a number of hyperparameters, such as the regularization parameter and the number of iterations, and it is sensitive to feature scaling.
from sklearn.linear_model import SGDClassifier
X = [[0., 0.], [1., 1.]]
y = [0, 1]
clf = SGDClassifier(loss="hinge", penalty="l2", tol=1e-3, max_iter=15)
clf.fit(X, y)
SGDClassifier(alpha=0.0001, average=False, class_weight=None, early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True, l1_ratio=0.15, learning_rate='optimal', loss='hinge', max_iter=15, n_iter=None, n_iter_no_change=5, n_jobs=None, penalty='l2', power_t=0.5, random_state=None, shuffle=True, tol=0.001, validation_fraction=0.1, verbose=0, warm_start=False)
clf.predict([[2., 2.]])
array([1])
KNN, or k-nearest neighbours, is the simplest classification algorithm. It does not depend on the structure of the data: whenever a new example is encountered, its k nearest neighbours in the training data are examined. The distance between two examples can be the Euclidean distance between their feature vectors, and the majority class among the k nearest neighbours is taken as the class of the new example.
Definition: Neighbours based classification is a type of lazy learning as it does not attempt to construct a general internal model, but simply stores instances of the training data. Classification is computed from a simple majority vote of the k nearest neighbours of each point.
Advantages: This algorithm is simple to implement, robust to noisy training data, and effective if training data is large.
Disadvantages: The value of K must be chosen, and the computational cost is high because the distance from each new instance to all the training samples must be computed.
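The distance-plus-majority-vote procedure described above can be sketched by hand before reaching for scikit-learn; the data and the helper name here are purely illustrative:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict by majority vote among the k nearest training points."""
    dists = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distances
    nearest = np.argsort(dists)[:k]                  # indices of k closest
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

# Toy data: two clusters, around (0, 0) and (5, 5).
X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X, y, np.array([4.5, 5.0])))  # 1: nearest neighbours are class 1
```

Note that every prediction scans all training points, which is exactly the computational cost mentioned in the disadvantages above.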
# training a KNN classifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix  # confusion matrix
clf = KNeighborsClassifier(n_neighbors=7)
clf.fit(X_train, y_train)
# accuracy on X_test
accuracy = clf.score(X_test, y_test)
print(accuracy)
y_predictions = clf.predict(X_test)
# creating a confusion matrix
cm = confusion_matrix(y_test, y_predictions)
print(cm)
0.6384
[[4996   53]
 [3563 1388]]