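Python has a powerful visualization library, matplotlib, and matplotlib.pyplot can also be used for statistical visualization. The problem is that matplotlib is overly verbose. Here I recommend another statistical visualization library, Seaborn, which can draw quite attractive statistical charts in just a few lines of code.
If you also need interactive charts, I recommend the bokeh library for Python.
Note: this write-up can be downloaded and run directly in IPython notebook (now called Jupyter).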
%matplotlib inline
import numpy as np
import pandas as pd
from scipy import stats, integrate
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(color_codes=True)
np.random.seed(sum(map(ord, "distributions")))
x = np.random.normal(size=100)
sns.distplot(x)  # overlays a kernel density estimate (KDE) curve by default
<matplotlib.axes._subplots.AxesSubplot at 0x1ede3f60>
sns.distplot(x, kde=False, rug=True)  # drop the KDE curve, add a rug plot
<matplotlib.axes._subplots.AxesSubplot at 0x1f0c27f0>
sns.distplot(x, bins=20, kde=False, rug=True)  # set the number of bins to 20
<matplotlib.axes._subplots.AxesSubplot at 0x1f672390>
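As a rough sketch of what bins=20 means, the counts behind such a histogram can be reproduced with NumPy alone. The sample below is freshly generated and only stands in for x; it is an assumption, not the notebook's exact array.

```python
import numpy as np

# Hypothetical stand-in for the notebook's x: 100 draws from a standard normal.
rng = np.random.default_rng(0)
sample = rng.normal(size=100)

# bins=20 splits the observed data range into 20 equal-width intervals.
counts, edges = np.histogram(sample, bins=20)

# 20 counts, 21 bin edges, and every observation lands in exactly one bin.
print(len(counts), len(edges), counts.sum())
```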
sns.distplot(x, hist=False, rug=True)  # keep the KDE curve, drop the histogram
<matplotlib.axes._subplots.AxesSubplot at 0x1fca87f0>
sns.kdeplot(x, shade=True)  # or draw the density curve directly with kdeplot
<matplotlib.axes._subplots.AxesSubplot at 0x2038df98>
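A new parameter, bw (bandwidth), plays a role similar to the number of bins in a histogram: a small bandwidth follows the data closely, a large one smooths it out.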
sns.kdeplot(x)
sns.kdeplot(x, bw=.2, label="bw:0.2")
sns.kdeplot(x, bw=2, label="bw:2")
plt.legend()
<matplotlib.legend.Legend at 0x20f41940>
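To see why the bandwidth behaves like a bin count, here is a minimal hand-written Gaussian KDE in NumPy. It is a sketch for illustration, not seaborn's implementation: the function name gaussian_kde_1d and the generated sample are assumptions, and bandwidth here is simply the standard deviation of each Gaussian bump (how seaborn interprets a scalar bw can depend on the version and backend).

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps of width `bandwidth`, one centred on each sample."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2.0 * np.pi))
    return bumps.mean(axis=1)

def roughness(density):
    """Sum of squared second differences: larger means a wigglier curve."""
    return float(np.sum(np.diff(density, n=2) ** 2))

# Hypothetical stand-in for the notebook's x.
rng = np.random.default_rng(0)
samples = rng.normal(size=100)
grid = np.linspace(-12.0, 12.0, 2401)
step = grid[1] - grid[0]

narrow = gaussian_kde_1d(samples, grid, bandwidth=0.2)  # follows the data closely
wide = gaussian_kde_1d(samples, grid, bandwidth=2.0)    # heavily smoothed

# Both curves are densities, so each should integrate to roughly 1.
area_narrow = float(narrow.sum() * step)
area_wide = float(wide.sum() * step)
print(area_narrow, area_wide, roughness(narrow) > roughness(wide))
```

The narrow curve is taller and wigglier, the wide one flatter and smoother, which mirrors the effect of using many versus few histogram bins.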
sns.kdeplot(x, shade=True, cut=0)  # cut=0 truncates the curve at the smallest and largest data points
sns.rugplot(x)
<matplotlib.axes._subplots.AxesSubplot at 0x201fb7b8>
mean, cov = [0, 1], [(1, .5), (.5, 1)]
data = np.random.multivariate_normal(mean, cov, 200)
df = pd.DataFrame(data, columns=["x", "y"])
sns.jointplot(x="x", y="y", data=df)  # scatter plot by default
<seaborn.axisgrid.JointGrid at 0x20379898>
x, y = np.random.multivariate_normal(mean, cov, 1000).T
with sns.axes_style("white"):
    sns.jointplot(x=x, y=y, kind="hex", color="k")  # hexbin plot
sns.jointplot(x="x", y="y", data=df, kind="kde")
<seaborn.axisgrid.JointGrid at 0x226196d8>
f, ax = plt.subplots(figsize=(6, 6))
sns.kdeplot(df.x, df.y, ax=ax)  # draw the bivariate density with kdeplot
sns.rugplot(df.x, color="g", ax=ax)
sns.rugplot(df.y, vertical=True, ax=ax);
iris = sns.load_dataset("iris")
sns.pairplot(iris)
<seaborn.axisgrid.PairGrid at 0x20e45978>
np.random.seed(sum(map(ord, "regression")))
tips = sns.load_dataset('tips')
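Two functions draw regression fits: regplot() and lmplot(). regplot() accepts x and y in a variety of formats (arrays, Series, or column names), while lmplot() is stricter and requires a DataFrame passed as data, with x and y given as column names.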
sns.regplot(x="total_bill", y="tip", data=tips)
<matplotlib.axes._subplots.AxesSubplot at 0x3e32470>
sns.lmplot(x="total_bill", y="tip", data=tips)
Sometimes the scatter plot a dataset produces is not ideal, for example when x takes only a few discrete values.
sns.lmplot(x='size', y='tip', data=tips)
<seaborn.axisgrid.FacetGrid at 0x242cd438>
One way to improve it is to add some random noise ("jitter").
sns.lmplot(x='size', y='tip', data=tips, x_jitter=.05)
<seaborn.axisgrid.FacetGrid at 0x2544d4a8>
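A minimal sketch of the jitter idea: each discrete x value is shifted by a small uniform random amount so that overlapping points spread apart. The party_size values are made up for illustration. In seaborn the noise affects only how the scatter looks, since it is added after the regression has been fitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical discrete x values (party sizes); many points share the same x.
party_size = np.array([1, 2, 2, 3, 4, 4, 4, 6], dtype=float)

# Shift each point by uniform noise in [-0.05, 0.05], as x_jitter=.05 does.
jitter_width = 0.05
jittered = party_size + rng.uniform(-jitter_width, jitter_width, size=party_size.shape)

max_shift = float(np.abs(jittered - party_size).max())
print(jittered, max_shift)
```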
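Another approach is to collapse the observations at each discrete value into a central estimate (here the mean) and plot that instead.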
sns.lmplot(x='size', y='tip', data=tips, x_estimator=np.mean)
<seaborn.axisgrid.FacetGrid at 0x240b4898>
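What x_estimator=np.mean does to the points can be sketched in a few lines: group y by each distinct x and reduce every group to its mean. The arrays are invented for illustration, and the confidence interval seaborn draws around each estimate is left out.

```python
import numpy as np

# Hypothetical data: party size (discrete x) and tip (y).
size = np.array([2, 2, 3, 3, 3, 4])
tip = np.array([2.0, 3.0, 3.0, 4.0, 5.0, 6.0])

# One point per distinct x value: the mean of all y observed at that x.
levels = np.unique(size)
mean_tip = np.array([tip[size == level].mean() for level in levels])

print(levels, mean_tip)
```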
anscombe = sns.load_dataset('anscombe')
sns.lmplot(x="x", y="y", data=anscombe.query("dataset == 'I'"),
           ci=None, scatter_kws={"s": 80})
<seaborn.axisgrid.FacetGrid at 0x240a8cf8>
The linear fit above is appropriate; for the next dataset, however, a linear model is not.
sns.lmplot(x="x", y="y", data=anscombe.query("dataset == 'II'"),
           ci=None, scatter_kws={"s": 80})
<seaborn.axisgrid.FacetGrid at 0x2565a208>
Here a nonlinear model fits better; we use a quadratic (second-order polynomial).
sns.lmplot(x="x", y="y", data=anscombe.query("dataset == 'II'"),
           order=2, ci=None, scatter_kws={"s": 80})
<seaborn.axisgrid.FacetGrid at 0x259110b8>
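The gain from order=2 can be checked numerically with numpy.polyfit, which is what seaborn uses for polynomial fits. The sketch below types in the values of Anscombe's dataset II by hand so that it runs without downloading anything, then compares the residual sum of squares of a straight line and a quadratic.

```python
import numpy as np

# Anscombe's quartet, dataset II, entered by hand.
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

linear = np.polyfit(x, y, deg=1)     # [slope, intercept]
quadratic = np.polyfit(x, y, deg=2)  # [a, b, c] for a*x**2 + b*x + c

def sse(coeffs):
    """Sum of squared residuals of the polynomial given by `coeffs`."""
    return float(np.sum((y - np.polyval(coeffs, x)) ** 2))

sse_linear = sse(linear)
sse_quadratic = sse(quadratic)
print(linear, quadratic, sse_linear, sse_quadratic)
```

The quadratic leaves almost no residual, while the straight line misses the curvature entirely.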
Another problem is the influence of outliers.
sns.lmplot(x="x", y="y", data=anscombe.query("dataset == 'III'"),
           ci=None, scatter_kws={"s": 80})
<seaborn.axisgrid.FacetGrid at 0x25b202e8>
To reduce the influence of outliers, we can fit a more robust model.
sns.lmplot(x="x", y="y", data=anscombe.query("dataset == 'III'"),
           robust=True, ci=None, scatter_kws={"s": 80})
<seaborn.axisgrid.FacetGrid at 0x25d3c390>
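To illustrate how a robust fit resists the outlier, the sketch below compares ordinary least squares with a Theil-Sen estimate (the median of all pairwise slopes) on Anscombe's dataset III, typed in by hand. Theil-Sen is a deliberately swapped-in technique that needs only NumPy. It is not what robust=True runs, since seaborn delegates that to statsmodels.

```python
from itertools import combinations

import numpy as np

# Anscombe's quartet, dataset III, entered by hand; (13, 12.74) is the outlier.
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y = np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])

# Ordinary least squares: pulled upward by the single outlier.
ols_slope, ols_intercept = np.polyfit(x, y, deg=1)

# Theil-Sen: the median of the slopes through every pair of points.
pair_slopes = [
    (y[j] - y[i]) / (x[j] - x[i])
    for i, j in combinations(range(len(x)), 2)
]
robust_slope = float(np.median(pair_slopes))
robust_intercept = float(np.median(y - robust_slope * x))

print(ols_slope, robust_slope)
```

Most pairs do not involve the outlier, so the median slope follows the ten well-behaved points.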
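When the response is a binary (0/1) variable, linear regression still runs, but some of its predictions are implausible because they fall outside the range from 0 to 1.
tips["big_tip"] = (tips.tip / tips.total_bill) > .15
sns.lmplot(x="total_bill", y="big_tip", data=tips,
           y_jitter=.03)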
<seaborn.axisgrid.FacetGrid at 0x25eee240>
In this case it is better to fit a logistic model.
sns.lmplot(x="total_bill", y="big_tip", data=tips,
           logistic=True, y_jitter=.03)
<seaborn.axisgrid.FacetGrid at 0x2595a8d0>
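The difference between the two fits can be sketched on synthetic data. Everything below is an assumption made for illustration: the simulated bill and outcome arrays, and the hand-written Newton iteration, which stands in for the statsmodels fit that seaborn relies on for logistic=True.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a predictor in [0, 10] and a 0/1 outcome whose
# probability rises with the predictor.
bill = rng.uniform(0.0, 10.0, size=500)
p_true = 1.0 / (1.0 + np.exp(-1.5 * (bill - 5.0)))
outcome = (rng.uniform(size=500) < p_true).astype(float)

X = np.column_stack([np.ones_like(bill), bill])  # intercept column + predictor

# Ordinary least squares on the 0/1 outcome.
beta_linear, *_ = np.linalg.lstsq(X, outcome, rcond=None)

# Logistic regression fitted with a few Newton steps.
beta_logit = np.zeros(2)
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-X @ beta_logit))
    gradient = X.T @ (outcome - p)
    hessian = X.T @ (X * (p * (1.0 - p))[:, None])
    beta_logit += np.linalg.solve(hessian, gradient)

# Predictions over the whole predictor range.
grid = np.column_stack([np.ones(101), np.linspace(0.0, 10.0, 101)])
linear_pred = grid @ beta_linear
logit_pred = 1.0 / (1.0 + np.exp(-grid @ beta_logit))

print(linear_pred.min(), linear_pred.max())  # strays outside [0, 1]
print(logit_pred.min(), logit_pred.max())    # stays inside [0, 1]
```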
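To see how the relationship changes with a third variable, the simplest approach is to show its levels in different colors.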
sns.lmplot(x="total_bill", y="tip", hue="smoker", data=tips)
<seaborn.axisgrid.FacetGrid at 0x26393fd0>
Besides changing the color, you can also change the point markers.
sns.lmplot(x="total_bill", y="tip", hue="smoker", data=tips,
           markers=["o", "x"], palette="Set1")
<seaborn.axisgrid.FacetGrid at 0x255a74a8>
To add further variables, draw multiple plots (facets).
sns.lmplot(x="total_bill", y="tip", hue="smoker", col="time", data=tips)
<seaborn.axisgrid.FacetGrid at 0x26e5bcc0>
sns.lmplot(x="total_bill", y="tip", hue="smoker",
           col="time", row="sex", data=tips)
<seaborn.axisgrid.FacetGrid at 0x273fedd8>
f, ax = plt.subplots(figsize=(5,6))
sns.regplot(x="total_bill", y="tip", data=tips, ax=ax)
<matplotlib.axes._subplots.AxesSubplot at 0x27e6acf8>
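In lmplot(), the size of each facet is controlled with size and aspect (seaborn 0.9 and later call the size parameter height).
sns.lmplot(x="total_bill", y="tip", col="day", data=tips,
           col_wrap=2, size=3)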
<seaborn.axisgrid.FacetGrid at 0x27878fd0>
sns.lmplot(x="total_bill", y="tip", col="day", data=tips,
           aspect=0.5)
<seaborn.axisgrid.FacetGrid at 0x2c765d30>
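Because the size parameter was renamed to height in seaborn 0.9, code like the calls above may need a different keyword depending on the installed version. The helper below is a hypothetical convenience, not part of seaborn: it picks the keyword from a version string, and facet_width spells out how aspect relates width to height.

```python
def facet_size_kwargs(seaborn_version, height, aspect=1.0):
    """Return the facet-size keywords matching a seaborn version string.

    Hypothetical helper: uses `height` from 0.9 on, `size` before that.
    """
    major, minor = (int(part) for part in seaborn_version.split(".")[:2])
    key = "height" if (major, minor) >= (0, 9) else "size"
    return {key: height, "aspect": aspect}


def facet_width(height, aspect):
    """Width of one facet in inches: height multiplied by aspect."""
    return height * aspect


print(facet_size_kwargs("0.8.1", 3))        # old keyword
print(facet_size_kwargs("0.13.2", 3, 0.5))  # new keyword
print(facet_width(5, 0.5))                  # a 5-inch-tall facet with aspect=0.5
```

A call could then be written as sns.lmplot(x="total_bill", y="tip", col="day", data=tips, **facet_size_kwargs(sns.__version__, 3)).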
Using jointplot()
sns.jointplot(x="total_bill", y="tip", data=tips, kind="reg", size=10)
<seaborn.axisgrid.JointGrid at 0x2d6dff60>
Using pairplot()
sns.pairplot(tips, x_vars=["total_bill", "size"], y_vars=["tip"],
             size=5, aspect=.8, kind="reg")
<seaborn.axisgrid.PairGrid at 0x2d344470>
sns.pairplot(tips, x_vars=["total_bill", "size"], y_vars=["tip"],
             hue="smoker", markers=["o", "x"], palette="Set2",
             size=5, aspect=.8, kind="reg")
<seaborn.axisgrid.PairGrid at 0x2fa28a90>
np.random.seed(sum(map(ord, "categorical")))
tips = sns.load_dataset("tips")
iris = sns.load_dataset("iris")
titanic = sns.load_dataset("titanic")
sns.stripplot(x="day", y="total_bill", data=tips)
<matplotlib.axes._subplots.AxesSubplot at 0x2ff2cb00>
Conditioning on another variable
sns.stripplot(x="day", y="total_bill", hue="time", data=tips)
<matplotlib.axes._subplots.AxesSubplot at 0x2ffafc88>
Swapping the axes
sns.stripplot(x="total_bill", y="day", hue="time", data=tips)
<matplotlib.axes._subplots.AxesSubplot at 0x307aee80>
sns.boxplot(x="day", y="total_bill", hue="time", data=tips)
<matplotlib.axes._subplots.AxesSubplot at 0x30b9d0b8>
Violin plot: a box plot combined with a kernel density estimate
sns.violinplot(x="total_bill", y="day", hue="time", data=tips)
<matplotlib.axes._subplots.AxesSubplot at 0x30819080>
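Changing the kernel bandwidth controls how finely the density follows the data, much like changing the number of bins.
sns.violinplot(x="total_bill", y="day", hue="time", data=tips,
               bw=.1, scale="count", scale_hue=False)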
<matplotlib.axes._subplots.AxesSubplot at 0x2fc32470>
When the hue variable has exactly two levels, split=True draws each level as one half of the violin.
sns.violinplot(x="day", y="total_bill", hue="sex", data=tips, split=True)
<matplotlib.axes._subplots.AxesSubplot at 0x329ef2e8>
To see the individual observations, use inner.
sns.violinplot(x="day", y="total_bill", hue="sex", data=tips,
               split=True, inner="stick", palette="Set3")
<matplotlib.axes._subplots.AxesSubplot at 0x34b3b898>
Combining a violin plot with a strip plot
sns.violinplot(x="day", y="total_bill", data=tips, inner=None)
sns.stripplot(x="day", y="total_bill", data=tips, jitter=True, size=4)
<matplotlib.axes._subplots.AxesSubplot at 0x35896668>
sns.barplot(x="sex", y="survived", hue="class", data=titanic)
<matplotlib.axes._subplots.AxesSubplot at 0x1f38f780>
sns.countplot(x="deck", data=titanic, palette="Blues_d")
<matplotlib.axes._subplots.AxesSubplot at 0x20af9ba8>
sns.countplot(y="deck", hue="class", data=titanic, palette="Greens_d")
<matplotlib.axes._subplots.AxesSubplot at 0x20ca8470>
sns.pointplot(x="sex", y="survived", hue="class", data=titanic)
<matplotlib.axes._subplots.AxesSubplot at 0x20fac198>
sns.pointplot(x="class", y="survived", hue="sex", data=titanic,
              palette={"male": "g", "female": "m"},
              markers=["^", "o"], linestyles=["-", "--"])
<matplotlib.axes._subplots.AxesSubplot at 0x2054ab00>