This notebook discusses fundamental concepts of descriptive statistics, such as measures of central tendency (mean, median, mode) and of dispersion or spread (variance).
We show how to compute these descriptive statistics using basic Python code (without any library) as well as NumPy functions.
A measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set. The mean, median, and mode are the most common such measures, and they are also categorized as summary statistics.
Generally, the mean is a better measure for symmetric data and the median is a better measure for data with a skewed (left- or right-heavy) distribution. For categorical data, you have to use the mode.
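To make the rule of thumb above concrete, here is a minimal sketch using Python's standard `statistics` module. The data values are made up for illustration: a single large value skews the numeric list, and the second list is categorical.

```python
import statistics

# Hypothetical right-skewed data: one very large value pulls the mean upward.
incomes = [30, 32, 35, 38, 40, 42, 45, 400]

skewed_mean = statistics.mean(incomes)      # sensitive to the extreme value
skewed_median = statistics.median(incomes)  # barely affected by it

# Hypothetical categorical data: only the mode is meaningful here.
colors = ["red", "blue", "red", "green", "red", "blue"]
most_common = statistics.mode(colors)

print("Mean:", skewed_mean)
print("Median:", skewed_median)
print("Mode:", most_common)
```

The mean lands far above most of the values, while the median stays close to the bulk of the data.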
The spread of the data measures how much the values in the dataset are likely to differ from their mean. If all the values are close together, the spread is low; if some or all of the values differ by a large amount from the mean (and from each other), the spread is large.
The variance is the average squared deviation of the values $n_i$ from their mean $\mu$, where $N$ is the number of values:
$$V = \frac{\sum_{i}{(n_i-\mu)^2}}{N}$$
NOTE: When we later build regression models, we will revisit these definitions in the context of statistical estimation. There, the sample variance will be given by a slightly different formula (the denominator will change).
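As a small worked example of the variance definition, the sketch below computes the sum of squared deviations for a handful of made-up values and divides it by N (the population form used by the variance() function in this notebook) and by N-1 (the usual sample form; reading the NOTE as pointing to this form is an assumption). It also compares both against `np.var()`, whose `ddof` argument sets the denominator to `N - ddof`.

```python
import numpy as np

# Hypothetical example values, chosen so the arithmetic is easy to follow.
data = np.array([2, 4, 4, 4, 5, 5, 7, 9])
N = data.size
mu = data.mean()                       # 40 / 8 = 5.0

sum_sq_dev = ((data - mu) ** 2).sum()  # 9+1+1+1+0+0+4+16 = 32.0

population_var = sum_sq_dev / N        # denominator N
sample_var = sum_sq_dev / (N - 1)      # denominator N-1

print("Population variance:", population_var)
print("Sample variance:", sample_var)

# np.var divides by N - ddof, so ddof=0 (the default) and ddof=1
# reproduce the two hand-computed values.
print(np.isclose(np.var(data), population_var))
print(np.isclose(np.var(data, ddof=1), sample_var))
```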
We can simply write a 'for' loop, add the numbers, and divide by the length of the array.
array = [3,4,4,7,5,6,5.5,8,5,6.5,9,7.5,6]
total = 0  # named 'total' to avoid shadowing the built-in sum()
for num in array:
    total+=num
mean = total/len(array)
print("Arithmetic Mean: ",mean)
Arithmetic Mean: 5.884615384615385
from time import time
t1 = time()
# Repeat the computation 100000 times and report the average time per run
for _ in range(100000):
    total = 0
    for num in array:
        total+=num
    mean = total/len(array)
t2 = time()
print("Mean: {}\nAverage time taken for computing the mean using for loop: {} seconds ".format(mean,(t2-t1)/100000))
Mean: 5.884615384615385
Average time taken for computing the mean using for loop: 1.1469221115112304e-06 seconds
Using the NumPy ndarray.mean() method
What is NumPy? NumPy is the fundamental package for scientific computing in Python. It is a Python library that provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more.
https://docs.scipy.org/doc/numpy-1.13.0/user/whatisnumpy.html
import numpy as np
np_array = np.array(array)
print("Mean: ",np_array.mean())
Mean: 5.884615384615385
t1 = time()
np_array = np.array(array)
for _ in range(100000):
    mean = np_array.mean()
t2 = time()
print("Mean: {}\nAverage time taken for computing the mean using NumPy: {} seconds ".format(mean,(t2-t1)/100000))
Mean: 5.884615384615385
Average time taken for computing the mean using NumPy: 3.880062103271484e-06 seconds
For a small array, the NumPy method does not offer a significant boost in performance (in the timings above it is actually slower than the plain loop). But what happens when the array is large?
from random import randint
lst = []
for _ in range(1000000):
    lst.append(randint(1,100))
len(lst)
1000000
t1 = time()
for _ in range(100):
    total = 0
    for num in lst:
        total+=num
    mean = total/len(lst)
t2 = time()
print("Mean: {}\nAverage time taken for computing the mean using for loop: {} seconds ".format(mean,(t2-t1)/100))
Mean: 50.539717
Average time taken for computing the mean using for loop: 0.06872276782989502 seconds
t1 = time()
# Note: the one-time list-to-array conversion below is included in the timed span
np_lst = np.array(lst)
for _ in range(100):
    mean = np_lst.mean()
t2 = time()
print("Mean: {}\nAverage time taken for computing the mean using NumPy: {} seconds ".format(mean,(t2-t1)/100))
Mean: 50.539717
Average time taken for computing the mean using NumPy: 0.001326603889465332 seconds
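The manual `time()` bookkeeping above can also be expressed with the standard `timeit` module, which handles the repetition for us. The following is only a sketch under assumptions: the helper `loop_mean` and the data are hypothetical, and the measured times depend entirely on the machine, so no particular speed-up is claimed.

```python
import timeit
import numpy as np

# Hypothetical data: the integers 1..100000, as a list and as a NumPy array.
values = list(range(1, 100001))
np_values = np.array(values)

def loop_mean(seq):
    """Mean computed with a plain Python loop (illustrative helper)."""
    total = 0
    for v in seq:
        total += v
    return total / len(seq)

runs = 10
# timeit.repeat returns one total time per repetition; take the best
# repetition and divide by the number of runs to get a per-call time.
loop_time = min(timeit.repeat(lambda: loop_mean(values), number=runs, repeat=3)) / runs
numpy_time = min(timeit.repeat(lambda: np_values.mean(), number=runs, repeat=3)) / runs

print("Loop mean:", loop_mean(values), "time per call:", loop_time)
print("NumPy mean:", np_values.mean(), "time per call:", numpy_time)
```

Here the array conversion happens before timing starts, so only the mean computation itself is measured.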
def random_array(num_elements,lower=1,upper=100):
    """
    Returns a list of num_elements random integers between lower and upper (both inclusive)
    """
    from random import randint
    lst = []
    for _ in range(num_elements):
        lst.append(randint(lower,upper))
    return lst
random_array(5)
[99, 68, 65, 99, 87]
random_array(10,-20,20)
[-6, -19, -17, 9, 15, -9, -4, 15, 14, -5]
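For comparison, a similar helper can be sketched with NumPy's random generator instead of a Python loop. The function name `random_np_array` and its `seed` parameter are hypothetical, introduced only for this illustration. Passing `endpoint=True` makes the upper bound inclusive, matching the behavior of `randint` above.

```python
import numpy as np

def random_np_array(num_elements, lower=1, upper=100, seed=None):
    """Return a NumPy array of random integers in [lower, upper] (inclusive)."""
    rng = np.random.default_rng(seed)
    # endpoint=True includes `upper` in the range, like random.randint does.
    return rng.integers(lower, upper, size=num_elements, endpoint=True)

sample = random_np_array(10, -20, 20, seed=0)
print(sample)
```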
Computing the median using simple Python code and np.median()
array_2 = random_array(15,10,30)
array_2
[25, 13, 11, 19, 13, 24, 27, 18, 30, 14, 30, 20, 18, 19, 23]
# Using the built-in Python 'sorted' function
array_sorted = sorted(array_2)
array_sorted
[11, 13, 13, 14, 18, 18, 19, 19, 20, 23, 24, 25, 27, 30, 30]
def median(array):
    """
    Computes median of a given numeric array
    """
    num_elements = len(array)
    array_sorted = sorted(array)
    mid = num_elements//2
    if num_elements%2==1:
        # Odd length: the single middle element
        return array_sorted[mid]
    # Even length: average of the two middle elements
    return (array_sorted[mid-1]+array_sorted[mid])/2.0
median(array_2)
19
np.median(np.array(array_2))
19.0
array_3 = random_array(16,100,200)
print(array_3)
[183, 199, 136, 101, 196, 107, 153, 173, 122, 157, 117, 118, 125, 161, 171, 169]
median(array_3)
155.0
np.median(np.array(array_3))
155.0
NOTE: Unlike mean(), a NumPy array does not have a median() method. We have to use np.median() and pass the array as the argument.
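A quick way to check the point of the NOTE is sketched below, using a small made-up array.

```python
import numpy as np

arr = np.array([25, 13, 11, 19, 13])  # hypothetical values

# mean() exists as an array method, median() does not.
print(hasattr(arr, "mean"))
print(hasattr(arr, "median"))

# The median is available as a module-level function instead.
print(np.median(arr))
```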
Computing variance and standard deviation using simple Python code and ndarray.var(), ndarray.std()
def mean(array):
    """
    Computes mean
    """
    length = len(array)
    total = 0
    for i in range(length):
        total+=array[i]
    return total/length
def variance(array):
    """
    Computes (population) variance: the mean of the squared deviations from the mean
    """
    length = len(array)
    avg = mean(array)
    sumsq = 0
    for i in range(length):
        sumsq+=(array[i]-avg)**2
    return sumsq/length
def std_dev(array):
    """
    Computes std. deviation as the square root of the variance
    """
    from math import sqrt
    return sqrt(variance(array))
array_4 = random_array(100,1,100)
print(array_4)
[12, 14, 9, 89, 31, 38, 60, 45, 18, 48, 61, 21, 80, 47, 91, 83, 57, 92, 85, 60, 43, 61, 76, 71, 100, 18, 35, 77, 27, 18, 95, 15, 71, 50, 92, 78, 64, 58, 49, 5, 9, 55, 19, 20, 36, 27, 62, 81, 42, 64, 95, 89, 40, 66, 75, 44, 54, 57, 41, 39, 34, 87, 33, 64, 61, 84, 51, 6, 1, 69, 5, 14, 54, 42, 94, 24, 34, 78, 56, 98, 35, 40, 11, 90, 7, 16, 7, 60, 32, 16, 64, 68, 85, 48, 91, 38, 34, 9, 95, 1]
variance(array_4)
790.0474999999996
std_dev(array_4)
28.10778361948874
np.var(np.array(array_4))
790.0474999999999
np.std(np.array(array_4))
28.107783619488746
Handling NaN values in the array
NumPy provides NaN-aware counterparts of the functions above: nanmean(), nanmedian(), nanstd(), and nanvar().
array = random_array(20,1,50)
print(array)
[33, 1, 7, 34, 47, 34, 2, 24, 27, 9, 23, 45, 26, 46, 1, 7, 2, 47, 49, 19]
array[2]=np.nan
array[6]=np.nan
print(array)
[33, 1, nan, 34, 47, 34, nan, 24, 27, 9, 23, 45, 26, 46, 1, 7, 2, 47, 49, 19]
array = np.array(array)
print("Mean:",array.mean())
print("Var:",array.var())
Mean: nan
Var: nan
Use the NaN-aware functions to ignore NaN. Notice that they are functions of the NumPy module (np), not methods of an individual array.
print("Mean ignoring NaN:",np.nanmean(array))
print("Var ignoring NaN:",np.nanvar(array))
print("Std. dev ignoring NaN:",np.nanstd(array))
print("Median ignoring NaN:",np.nanmedian(array))
Mean ignoring NaN: 26.333333333333332
Var ignoring NaN: 271.44444444444446
Std. dev ignoring NaN: 16.47557114167653
Median ignoring NaN: 26.5
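Conceptually, the NaN-aware functions give the same result as dropping the NaN entries first and then applying the ordinary statistic. The sketch below illustrates this with a boolean mask built from `np.isnan()`, using made-up values.

```python
import numpy as np

# Hypothetical data with two missing entries.
values = np.array([33.0, 1.0, np.nan, 34.0, np.nan, 24.0])

mask = ~np.isnan(values)     # True where the entry is a real number
clean = values[mask]         # array without the NaN entries

manual_mean = clean.mean()   # (33 + 1 + 34 + 24) / 4
manual_var = clean.var()

print("Manual mean:", manual_mean, "np.nanmean:", np.nanmean(values))
print("Manual var:", manual_var, "np.nanvar:", np.nanvar(values))
```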
array = random_array(20,1,100)
array = np.array(array)
sorted_array = sorted(array)
print(array)
[29 59 16 95 64 3 40 7 61 4 99 32 6 38 59 26 84 53 51 69]
# Using np.amax()
print("Max of the array:",np.amax(array))
# Using array.max()
print("Max of the array:",array.max())
Max of the array: 99
Max of the array: 99
# Using np.amin()
print("Min of the array:",np.amin(array))
# Using array.min()
print("Min of the array:",array.min())
Min of the array: 3
Min of the array: 3
# Compute range by using max() and min() functions
print("Range of the array: ", array.max()-array.min())
# Compute range by using ptp() function
print("Range of the array: ", np.ptp(array))
Range of the array: 96
Range of the array: 96
# Percentile
print("20th percentile of the array: ", np.percentile(array,20))
20th percentile of the array: 14.200000000000003
# Quantile
print("0.25-th quantile of the array: ", np.quantile(array,0.25))
print("0.5-th quantile of the array: ", np.quantile(array,0.5))
print("0.75-th quantile of the array: ", np.quantile(array,0.75))
0.25-th quantile of the array: 23.5
0.5-th quantile of the array: 45.5
0.75-th quantile of the array: 61.75
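Percentiles and quantiles describe the same positions on different scales (0 to 100 versus 0 to 1). The sketch below reuses the array values printed above to show the correspondence, and also derives the interquartile range (IQR). The IQR is not discussed in this notebook and is included only as an illustrative spread measure.

```python
import numpy as np

# Same values as the array printed earlier in this section.
data = np.array([29, 59, 16, 95, 64, 3, 40, 7, 61, 4,
                 99, 32, 6, 38, 59, 26, 84, 53, 51, 69])

q1 = np.quantile(data, 0.25)   # first quartile
q3 = np.quantile(data, 0.75)   # third quartile
iqr = q3 - q1                  # spread of the middle half of the data

# The q-th quantile corresponds to the (100*q)-th percentile.
print(np.isclose(np.percentile(data, 25), q1))
print(np.isclose(np.percentile(data, 75), q3))
print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
```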
# Mark the approximate positions of the three quartiles in the sorted array
sorted_array[5]='HERE'
sorted_array[10]='HERE'
sorted_array[15]='HERE'
print(sorted_array)
[3, 4, 6, 7, 16, 'HERE', 29, 32, 38, 40, 'HERE', 53, 59, 59, 61, 'HERE', 69, 84, 95, 99]