Hypothesis testing is important for data scientists because it lets you check the distribution of your data and the relationships between variables.
This tutorial covers the statistical hypothesis tests that you may need in a machine learning project. After completing it, you will know:
The types of tests to use in different circumstances, such as normality checking, relationships between variables, and differences between samples.
The key assumptions for each test and how to interpret the test result.
How to implement each test in Python.
Before you can apply the statistical tests, you must know how to interpret the results.
Each test will return at least two things: a test statistic and a p-value.
Each test calculates a test-specific statistic. This statistic can aid in the interpretation of the result, although it may require a deeper proficiency with statistics and a deeper knowledge of the specific statistical test. Instead, the p-value can be used to quickly and accurately interpret the statistic in practical applications.
In the first part, we test the assumption that the sample was drawn from a Gaussian distribution. Technically, this assumption is called the null hypothesis, or H0. A threshold level called alpha, typically 5% (or 0.05), is chosen and used to interpret the p-value. Interpret the p-value as follows:
p <= alpha: reject H0, the sample is not normal.
p > alpha: fail to reject H0, the sample may be normal.
This means that, in general, we are seeking results with a larger p-value to confirm that our sample was likely drawn from a Gaussian distribution.
A p-value above 5% does not mean that the null hypothesis is true. It means only that the available evidence is not strong enough to reject it. The p-value is not the probability that the data fits a Gaussian distribution; it is the probability of seeing a test statistic at least as extreme as the observed one if the null hypothesis were true.
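The decision rule above can be sketched as a small helper. The function name `interpret_p_value` is hypothetical and only illustrates the comparison against alpha; it is not part of any library.

```python
def interpret_p_value(p, alpha=0.05):
    """Return a short decision string for a p-value at significance level alpha."""
    if p <= alpha:
        return "reject H0"
    return "fail to reject H0"


# Two made-up p-values, one on each side of the threshold.
print(interpret_p_value(0.03))
print(interpret_p_value(0.40))
```

Note that "fail to reject H0" is deliberately not phrased as "accept H0".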
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
# Draw random samples from a standard normal (Gaussian) distribution.
normal = np.random.standard_normal(1000)
uniform = np.random.uniform(size=1000)
log_normal = np.random.lognormal(size=1000)
# sns.distplot is deprecated; sns.histplot with kde=True draws a comparable plot.
sns.histplot(log_normal, kde=True)
sns.histplot(normal, kde=True);
sns.histplot(uniform, kde=True);
This section lists statistical tests that you can use to check if your data has a Gaussian distribution.
The Shapiro-Wilk test checks whether a data sample has a Gaussian (normal) distribution.
Assumptions
Observations in each sample are independent and identically distributed (iid).
Interpretation
Returns the test statistic and the p-value.
from scipy.stats import shapiro
shapiro(normal), shapiro(log_normal)
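As a sketch of how the returned values might be turned into a decision, the example below applies the alpha = 0.05 rule to each sample. The seed and the `decisions` dictionary are assumptions made for reproducibility and illustration.

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
samples = {
    "normal": rng.standard_normal(1000),
    "log_normal": rng.lognormal(size=1000),
}

alpha = 0.05
decisions = {}
for name, data in samples.items():
    stat, p = shapiro(data)
    decisions[name] = "fail to reject H0" if p > alpha else "reject H0"
    print("%s: W=%.3f, p=%.3g -> %s" % (name, stat, p, decisions[name]))
```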
The D’Agostino’s K^2 test, named for Ralph D’Agostino, calculates summary statistics from the data, namely kurtosis and skewness, to determine whether the data distribution departs from the normal distribution.
The D’Agostino’s K^2 test is available via the normaltest() SciPy function and returns the test statistic and the p-value.
The complete example of the D’Agostino’s K^2 test on the dataset is listed below.
from scipy.stats import normaltest
stat, p = normaltest(normal)
print("Statistics=%.3f, p=%.3f" % (stat, p))
alpha = 0.05
if p > alpha:
    print("Sample looks Gaussian (fail to reject H0)")
else:
    print("Sample does not look Gaussian (reject H0)")
# Same test on the log-normal sample
stat, p = normaltest(log_normal)
print("Statistics=%.3f, p=%.3f" % (stat, p))
alpha = 0.05
if p > alpha:
    print("Sample looks Gaussian (fail to reject H0)")
else:
    print("Sample does not look Gaussian (reject H0)")
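To make the link to skewness and kurtosis concrete, the sketch below rebuilds the K^2 statistic from SciPy's `skewtest` and `kurtosistest` z-scores and compares it with `normaltest`. The seed and variable names are assumptions for illustration.

```python
import numpy as np
from scipy.stats import kurtosistest, normaltest, skewtest

rng = np.random.default_rng(1)  # seed chosen only for reproducibility
data = rng.standard_normal(1000)

# z-scores of the separate skewness and kurtosis tests
z_skew, _ = skewtest(data)
z_kurt, _ = kurtosistest(data)

# K^2 combines both z-scores into a single statistic
k2_manual = z_skew**2 + z_kurt**2
k2_scipy, p = normaltest(data)
print("manual K^2=%.4f, normaltest K^2=%.4f, p=%.3f" % (k2_manual, k2_scipy, p))
```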
# Anderson-Darling test for normality: compare the statistic with the
# critical value at each significance level (given in percent).
from scipy.stats import anderson
result = anderson(normal)
print("Statistic: %.3f" % result.statistic)
for sl, cv in zip(result.significance_level, result.critical_values):
    if result.statistic < cv:
        print("%.3f: %.3f, data looks normal (fail to reject H0)" % (sl, cv))
    else:
        print("%.3f: %.3f, data does not look normal (reject H0)" % (sl, cv))
# Same test on the uniform sample
result = anderson(uniform)
print("Statistic: %.3f" % result.statistic)
for sl, cv in zip(result.significance_level, result.critical_values):
    if result.statistic < cv:
        print("%.3f: %.3f, data looks normal (fail to reject H0)" % (sl, cv))
    else:
        print("%.3f: %.3f, data does not look normal (reject H0)" % (sl, cv))
A simple and commonly used plot to quickly check the distribution of a sample of data is the histogram. A sample with a Gaussian distribution produces a histogram with the familiar bell shape.
# Histogram plot, Gaussian distribution (sns.distplot is deprecated)
sns.histplot(normal, kde=True);
# Histogram plot, uniform distribution
sns.histplot(uniform, kde=True);
Another popular plot for checking the distribution of a data sample is the quantile-quantile plot, Q-Q plot, or QQ plot for short.
This plot generates its own sample of the idealized distribution that we are comparing with, in this case the Gaussian distribution. The idealized samples are divided into groups (e.g. 5), called quantiles. Each data point in the sample is paired with a similar member from the idealized distribution at the same cumulative distribution.
The resulting points are plotted as a scatter plot with the idealized value on the x-axis and the data sample on the y-axis.
A perfect match for the distribution is shown by a line of dots at a 45-degree angle from the bottom left of the plot to the top right. Often a line is drawn on the plot to make this expectation clear. Deviations of the dots from the line show a deviation from the expected distribution.
import statsmodels.api as sm
from matplotlib import pyplot as plt
fig = sm.qqplot(normal, line="s")
plt.show() # Gaussian distribution
fig = sm.qqplot(log_normal, line="s")
plt.show() # Log Normal distribution
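The pairing behind a Q-Q plot can be sketched without plotting. This version is an assumption-laden simplification: it uses theoretical normal quantiles from `scipy.stats.norm.ppf` rather than a generated idealized sample, and the helper `qq_points` and its plotting positions are hypothetical choices, not what `sm.qqplot` necessarily does internally. The correlation between the paired values is used here only as a rough summary of how close the dots are to a straight line.

```python
import numpy as np
from scipy.stats import norm


def qq_points(sample):
    """Return (theoretical, observed) quantile pairs for a normal Q-Q plot."""
    observed = np.sort(sample)
    n = len(observed)
    # Plotting positions: one common convention, assumed here.
    probs = (np.arange(1, n + 1) - 0.5) / n
    theoretical = norm.ppf(probs)
    return theoretical, observed


rng = np.random.default_rng(2)  # seed chosen only for reproducibility
samples = {
    "normal": rng.standard_normal(1000),
    "log_normal": rng.lognormal(size=1000),
}

r = {}
for name, data in samples.items():
    theo, obs = qq_points(data)
    # Correlation close to 1 means the points lie close to a straight line.
    r[name] = np.corrcoef(theo, obs)[0, 1]
    print("%s: correlation of Q-Q points = %.4f" % (name, r[name]))
```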
This section lists statistical tests that you can use to check if two samples are related.
x = np.random.normal(0, 1, 1000)
y = (3 * x) - np.random.normal(0, 2, 1000)
Pearson's correlation coefficient tests whether two samples have a linear relationship.
from scipy.stats import pearsonr
corr, p = pearsonr(x, y)
corr, p
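For reference, the coefficient itself can be computed directly from its definition (covariance divided by the product of the standard deviations). The sketch below regenerates similar data with an assumed seed and checks the manual value against `pearsonr`.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(3)  # seed chosen only for reproducibility
x = rng.normal(0, 1, 1000)
y = 3 * x - rng.normal(0, 2, 1000)

# Center both samples, then apply the definition of Pearson's r.
xc, yc = x - x.mean(), y - y.mean()
r_manual = np.sum(xc * yc) / np.sqrt(np.sum(xc**2) * np.sum(yc**2))

r_scipy, p = pearsonr(x, y)
print("manual r=%.4f, pearsonr r=%.4f, p=%.3g" % (r_manual, r_scipy, p))
```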
Spearman's rank correlation tests whether two samples have a monotonic relationship. It assumes that observations in each sample can be ranked.
from scipy.stats import spearmanr
corr, p = spearmanr(x, y)
corr, p
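One way to see the difference from Pearson's test is a relationship that is monotonic but not linear. In this sketch (made-up data, assumed seed), Spearman's coefficient equals Pearson's coefficient computed on the ranks and reaches 1, while Pearson's coefficient on the raw values stays below 1.

```python
import numpy as np
from scipy.stats import pearsonr, rankdata, spearmanr

rng = np.random.default_rng(4)  # seed chosen only for reproducibility
x = rng.normal(0, 1, 500)
y = np.exp(x)  # monotonic in x, but not linear

rho, _ = spearmanr(x, y)
r_on_ranks, _ = pearsonr(rankdata(x), rankdata(y))  # Pearson on the ranks
r_linear, _ = pearsonr(x, y)  # Pearson on the raw values

print("spearman=%.4f, pearson on ranks=%.4f, pearson=%.4f" % (rho, r_on_ranks, r_linear))
```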
Kendall's rank correlation tests whether two samples have a monotonic relationship. As with Spearman's test, observations in each sample can be ranked.
from scipy.stats import kendalltau
corr, p = kendalltau(x, y)
corr, p
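Kendall's tau can be understood as counting concordant and discordant pairs. The sketch below does this with a plain double loop on a small sample and compares the result with `kendalltau`. It assumes continuous data without ties, in which case the simple formula agrees with the tie-corrected value SciPy reports.

```python
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(5)  # seed chosen only for reproducibility
x = rng.normal(0, 1, 60)
y = 3 * x - rng.normal(0, 2, 60)

n = len(x)
concordant = 0
discordant = 0
for i in range(n):
    for j in range(i + 1, n):
        # A pair is concordant when x and y move in the same direction.
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1

tau_manual = (concordant - discordant) / (n * (n - 1) / 2)
tau_scipy, p = kendalltau(x, y)
print("manual tau=%.4f, kendalltau=%.4f, p=%.3g" % (tau_manual, tau_scipy, p))
```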
This section lists statistical tests that you can use to compare data samples. Parametric statistical significance tests quantify the difference between the means of two or more samples of data.
Parametric statistical tests assume that a data sample was drawn from a specific population distribution.
In practice, the term often refers to tests that assume a Gaussian distribution. Because it is so common for data to fit this distribution, parametric statistical methods are more commonly used.
A typical question we may have about two or more samples of data is whether they have the same distribution. Parametric statistical significance tests are those statistical methods that assume data comes from the same Gaussian distribution, that is a data distribution with the same mean and standard deviation: the parameters of the distribution.
The p-value can be thought of as the probability of observing a difference at least as large as the one between the two data samples, given the base assumption (null hypothesis) that the two samples were drawn from a population with the same distribution.
p <= alpha: reject null hypothesis, different distribution.
p > alpha: fail to reject null hypothesis, same distribution.
x1 = 3 * np.random.randn(1000) + 20
x2 = 3 * np.random.randn(1000) + 21
print("x1: mean=%.3f stdv=%.3f" % (np.mean(x1), np.std(x1)))
print("x2: mean=%.3f stdv=%.3f" % (np.mean(x2), np.std(x2)))
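# sns.distplot is deprecated; sns.histplot with kde=True draws a comparable plot.
sns.histplot(x1, kde=True)
sns.histplot(x2, kde=True);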
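Student's t-test checks whether the means of two independent samples are significantly different. The independent samples t-test is one of the most commonly used t-tests. Use it when you want to compare the means of two independent samples on a given variable.
Assumptions
Observations in each sample are independent and identically distributed (iid), normally distributed, and have the same variance.
Interpretation
H0: the means of the samples are equal. H1: the means of the samples are unequal.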
from scipy.stats import ttest_ind
stat, p = ttest_ind(x1, x2)
print("P value:", p)
alpha = 0.05
if p > alpha:
    print("Same means (fail to reject H0)")
else:
    print("Different means (reject H0)")
The result indicates that the sample means are different at the 5% significance level.
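To show what the t statistic measures, the sketch below computes it by hand using the pooled variance (the equal-variance form, which is the default of `ttest_ind`) and compares it with SciPy's value. The data are regenerated with an assumed seed.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(6)  # seed chosen only for reproducibility
x1 = 3 * rng.standard_normal(1000) + 20
x2 = 3 * rng.standard_normal(1000) + 21

n1, n2 = len(x1), len(x2)
# Pooled variance: weighted average of the two unbiased sample variances.
pooled_var = ((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / (n1 + n2 - 2)
# Difference in means divided by its standard error.
t_manual = (x1.mean() - x2.mean()) / np.sqrt(pooled_var * (1 / n1 + 1 / n2))

t_scipy, p = ttest_ind(x1, x2)  # equal_var=True by default
print("manual t=%.4f, ttest_ind t=%.4f, p=%.3g" % (t_manual, t_scipy, p))
```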
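The paired Student's t-test checks whether the means of two paired samples are significantly different. It quantifies the difference between the means of two dependent data samples.
Assumptions
Observations in each sample are independent and identically distributed (iid), normally distributed, and have the same variance. Observations across the samples are paired.
Interpretation
H0: the means of the samples are equal. H1: the means of the samples are unequal.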
from scipy.stats import ttest_rel
stat, p = ttest_rel(x1, x2)
print("P value:", p)
alpha = 0.05
if p > alpha:
    print("Same means (fail to reject H0)")
else:
    print("Different means (reject H0)")
The result suggests that the samples have different means and therefore different distributions.
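A useful way to think about the paired test is that it is a one-sample t-test on the per-pair differences. The sketch below checks this with hypothetical paired measurements (`before` and `after` are made-up names and values).

```python
import numpy as np
from scipy.stats import ttest_1samp, ttest_rel

rng = np.random.default_rng(7)  # seed chosen only for reproducibility
before = 3 * rng.standard_normal(200) + 20
after = before + rng.normal(0.5, 1.0, 200)  # hypothetical paired measurements

t_rel, p_rel = ttest_rel(before, after)
# Same test expressed on the differences, against a mean of zero.
t_one, p_one = ttest_1samp(before - after, 0.0)
print("paired t=%.4f, one-sample t on differences=%.4f" % (t_rel, t_one))
```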
The analysis of variance (ANOVA) test checks whether the means of two or more independent samples are significantly different. ANOVA and repeated measures ANOVA are used to check the similarity or difference between the means of two or more data samples.
from scipy.stats import f_oneway
stat, p = f_oneway(x1, x2)
print("P value:", p)
alpha = 0.05
if p > alpha:
    print("Same means (fail to reject H0)")
else:
    print("Different means (reject H0)")
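With exactly two groups, one-way ANOVA and the equal-variance t-test are equivalent: the F statistic equals the square of the t statistic. The sketch below checks this numerically on regenerated data with an assumed seed.

```python
import numpy as np
from scipy.stats import f_oneway, ttest_ind

rng = np.random.default_rng(8)  # seed chosen only for reproducibility
x1 = 3 * rng.standard_normal(1000) + 20
x2 = 3 * rng.standard_normal(1000) + 21

f_stat, p_f = f_oneway(x1, x2)
t_stat, p_t = ttest_ind(x1, x2)  # equal_var=True by default
print("F=%.4f, t^2=%.4f" % (f_stat, t_stat**2))
```

The practical reason to use ANOVA is that `f_oneway` also accepts three or more samples.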
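The Mann-Whitney U test is a nonparametric test of whether the distributions of two independent samples are equal.
Assumptions
Observations in each sample are independent and identically distributed (iid) and can be ranked.
Interpretation
H0: the distributions of both samples are equal. H1: the distributions of both samples are not equal.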
x = np.random.normal(0, 1, 1000)
y = np.random.uniform(0, 1, 1000)
z = np.random.uniform(0, 1, 1000)
from scipy.stats import mannwhitneyu
stat, p = mannwhitneyu(y, z)
stat, p
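The U statistic is built from ranks rather than raw values. The sketch below ranks the pooled data and derives U for each sample. Which of the two values SciPy reports has differed between versions (recent releases report U for the first sample), so the comparison accepts either. The shifted sample is a made-up example.

```python
import numpy as np
from scipy.stats import mannwhitneyu, rankdata

rng = np.random.default_rng(9)  # seed chosen only for reproducibility
y = rng.uniform(0, 1, 300)
z = rng.uniform(0.2, 1.2, 300)  # shifted sample, a made-up example

n1, n2 = len(y), len(z)
ranks = rankdata(np.concatenate([y, z]))  # ranks in the pooled sample
u1 = ranks[:n1].sum() - n1 * (n1 + 1) / 2  # U for the first sample
u2 = n1 * n2 - u1  # U for the second sample

stat, p = mannwhitneyu(y, z, alternative="two-sided")
print("U1=%.1f, U2=%.1f, scipy statistic=%.1f, p=%.3g" % (u1, u2, stat, p))
```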
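The Wilcoxon signed-rank test checks whether the distributions of two paired samples are equal.
Assumptions
Observations in each sample are independent and identically distributed (iid) and can be ranked. Observations across the samples are paired.
Interpretation
H0: the distributions of both samples are equal. H1: the distributions of both samples are not equal.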
from scipy.stats import wilcoxon
stat, p = wilcoxon(z, y)
stat, p
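The signed-rank statistic can be sketched by ranking the absolute pairwise differences and summing the ranks separately for positive and negative differences. For the default two-sided test, SciPy documents the statistic as the smaller of the two sums. The paired data below are hypothetical, and the sketch assumes no zero differences and no ties.

```python
import numpy as np
from scipy.stats import rankdata, wilcoxon

rng = np.random.default_rng(10)  # seed chosen only for reproducibility
before = rng.uniform(0, 1, 100)
after = before + rng.normal(0.1, 0.2, 100)  # hypothetical paired measurements

d = after - before
ranks = rankdata(np.abs(d))  # rank the absolute differences
w_plus = ranks[d > 0].sum()  # rank sum of positive differences
w_minus = ranks[d < 0].sum()  # rank sum of negative differences

stat, p = wilcoxon(after, before)
print("W+=%.1f, W-=%.1f, scipy statistic=%.1f, p=%.3g" % (w_plus, w_minus, stat, p))
```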
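The Kruskal-Wallis H test checks whether the distributions of two or more independent samples are equal.
Assumptions
Observations in each sample are independent and identically distributed (iid) and can be ranked.
Interpretation
H0: the distributions of all samples are equal. H1: the distributions of one or more samples are not equal.
from scipy.stats import kruskal
z1 = np.random.uniform(0, 1, 1000)
stat, p = kruskal(y, z, z1)
stat, p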
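The Friedman test checks whether the distributions of two or more paired samples are equal.
Assumptions
Observations in each sample are independent and identically distributed (iid) and can be ranked. Observations across the samples are paired.
Interpretation
H0: the distributions of all samples are equal. H1: the distributions of one or more samples are not equal.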
from scipy.stats import friedmanchisquare
stat, p = friedmanchisquare(y, z, z1)
stat, p
Whether to use the Kolmogorov-Smirnov (K-S) test depends on what you want to determine. If you only want to know whether the distributions have the same mean, K-S is not the best choice. K-S is well suited to determining whether two samples come from the same population.
import numpy as np
from scipy.stats import ks_2samp
np.random.seed(123456)
x = np.random.normal(0, 1, 1000)
y = np.random.normal(0, 1, 1000)
z = np.random.normal(1.1, 0.9, 1000)
ks_2samp(x, y), ks_2samp(x, z)
Under the null hypothesis the two distributions are identical. If the K-S statistic is small or the p-value is high (greater than the significance level, say 5%), then we cannot reject the hypothesis that the distributions of the two samples are the same. Conversely, we can reject the null hypothesis if the p-value is low.
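The K-S statistic itself is the largest vertical gap between the two empirical cumulative distribution functions (ECDFs). The sketch below computes that gap with NumPy and compares it with `ks_2samp`. The seed is an assumption for reproducibility.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(11)  # seed chosen only for reproducibility
x = rng.normal(0, 1, 1000)
z = rng.normal(1.1, 0.9, 1000)

xs, zs = np.sort(x), np.sort(z)
grid = np.concatenate([xs, zs])  # evaluate both ECDFs at every observed value
ecdf_x = np.searchsorted(xs, grid, side="right") / len(xs)
ecdf_z = np.searchsorted(zs, grid, side="right") / len(zs)
d_manual = np.max(np.abs(ecdf_x - ecdf_z))  # largest vertical gap

d_scipy, p = ks_2samp(x, z)
print("manual D=%.4f, ks_2samp D=%.4f, p=%.3g" % (d_manual, d_scipy, p))
```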
In this tutorial, you discovered the key statistical hypothesis tests that you may need to use in a machine learning project.
Specifically, you learned:
The types of tests to use in different circumstances, such as normality checking, relationships between variables, and differences between samples.
The key assumptions for each test and how to interpret the test result.
How to implement each test in Python.