Introduction to Statistics¶

Mean, Median and Mode¶

When working with a large data set, it can be useful to summarize the entire set with a single value that describes its "middle" or "average". In statistics, that single value is called a measure of central tendency, and the mean, median and mode are all ways to describe it.

To find the mean, add up the values in the data set and then divide by the number of values that you added. To find the median, list the values of the data set in numerical order and identify which value appears in the middle of the list. To find the mode, identify which value in the data set occurs most often. Range, which is the difference between the largest and smallest value in the data set, gives a rough indication of how well the central tendency represents the data. If the range is large, the central tendency is less representative of the data than it would be if the range were small.
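As a quick illustration of the four quantities described above, the sketch below uses Python's standard `statistics` module on a small made-up sample. The values and the variable names are assumptions chosen for illustration; they are not part of the exercises that follow.

```python
import statistics

# Hypothetical sample (illustrative values only), e.g. response times in ms.
data = [12, 15, 11, 15, 48, 13, 15]

mean_value = statistics.mean(data)      # sum of the values / number of values
median_value = statistics.median(data)  # middle value of the sorted list
mode_value = statistics.mode(data)      # most frequently occurring value
range_value = max(data) - min(data)     # largest value minus smallest value

print("mean:", mean_value)
print("median:", median_value)
print("mode:", mode_value)
print("range:", range_value)
```

In this sample the single large value (48) pulls the mean above the median, which is one reason it can help to look at more than one measure.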

For example, in data centers, IT professionals need to understand the mean, median, mode and range to plan capacity and balance load, manage systems, perform maintenance and troubleshoot issues. These tasks require the administrator to calculate the mean, median, mode or range, or often some combination of them, to quantify a typical value, a trend or a deviation from the norm. Finding the mean, median, mode and range is only the start. The administrator then needs to apply this information to investigate root causes of a problem, accurately forecast future needs or set acceptable working parameters for IT systems.

Mean¶

The mean is the average of a set of numbers: a calculated "central" value. It is one way to describe the central tendency of the data in question. It is determined by adding all the data points in a population and then dividing the total by the number of points. The resulting number is known as the mean or the average.

In Python we can use NumPy's mean() function to calculate the mean of an array of values.
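To connect the definition with the library call, here is a minimal sketch that computes the mean by hand (sum divided by count) and compares it with `np.mean`. The array `values` is a made-up example, not the array used in the exercise.

```python
import numpy as np

# Illustrative values, not taken from the exercise.
values = np.array([3, 5, 8, 12])

# Definition: add all data points, then divide by how many there are.
manual_mean = values.sum() / values.size  # (3 + 5 + 8 + 12) / 4

# NumPy's built-in function gives the same result.
numpy_mean = np.mean(values)

print(manual_mean, numpy_mean)
```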

Exercise¶

Find the mean of the array below and assign it to a variable, b.

In [2]:

import numpy as np
a = np.array([1,2,3,4])

In [3]:

b=np.mean(a)
print(b)

2.5

In [1]:

ref_tmp_var = False

try:
    if b==2.5:
        ref_assert_var = True
        ref_tmp_var = True
    else:
        ref_assert_var = False
        print('Please follow the instructions given and use the same variables provided in the instructions.')
except Exception:
    print('Please follow the instructions given and use the same variables provided in the instructions.')

assert ref_tmp_var

Please follow the instructions given and use the same variables provided in the instructions.

Median¶

The median is a simple measure of central tendency. To find the median, we arrange the observations in order from smallest to largest value.

If there is an odd number of observations, the median is the middle value.
If there is an even number of observations, the median is the average of the two middle values.
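The two cases above can be sketched as a small function. `median_by_hand` is a hypothetical helper written for illustration; the standard library's `statistics.median` is used only to cross-check it.

```python
import statistics


def median_by_hand(observations):
    """Return the median by sorting and picking the middle value(s)."""
    ordered = sorted(observations)  # arrange from smallest to largest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of observations: the single middle value.
        return ordered[mid]
    # Even number of observations: average of the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_sample = [9, 1, 5]       # sorted: 1, 5, 9
even_sample = [9, 1, 5, 7]   # sorted: 1, 5, 7, 9

odd_median = median_by_hand(odd_sample)
even_median = median_by_hand(even_sample)

print(odd_median, statistics.median(odd_sample))
print(even_median, statistics.median(even_sample))
```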

In Python we can use NumPy's median() function to calculate the median of an array of values.

Exercise¶

Find the median of the array below and assign it to a variable, b.

In [5]:

a = np.array([10, 7, 4, 3, 2])

In [6]:

b=np.median(a)
print(b)

4.0

In [2]:

ref_tmp_var = False

try:
    if b==4.0:
        ref_assert_var = True
        ref_tmp_var = True
    else:
        ref_assert_var = False
        print('Please follow the instructions given and use the same variables provided in the instructions.')
except Exception:        
    print('Please follow the instructions given and use the same variables provided in the instructions.')

assert ref_tmp_var

Please follow the instructions given and use the same variables provided in the instructions.

Mode¶

The mode is the most frequently occurring number in a set of numbers. It is found by collecting and organizing the data in order to count the frequency of each result. The result with the highest number of occurrences is the mode of the set. In Python we can use the mode() function from the statistics module to calculate the mode of a set of values.
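The counting step described above can be sketched with `collections.Counter`. The sample values are made up for illustration. The second part shows `statistics.multimode` (available in Python 3.8 and later), which may be useful when several values tie for the highest frequency.

```python
from collections import Counter
from statistics import multimode

# Illustrative values, not taken from the exercise.
observations = [2, 7, 7, 3, 2, 7, 5]

# Count how often each value occurs, then take the most frequent one.
counts = Counter(observations)
most_common_value, frequency = counts.most_common(1)[0]
print(most_common_value, frequency)

# When two or more values tie, multimode returns all of them.
tied = [1, 1, 2, 2, 3]
all_modes = multimode(tied)
print(all_modes)
```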

Exercise¶

Find the mode of the array below and assign it to a variable, b.

In [8]:

from statistics import mode
a=np.array([1,2,3,3,4,4,4,5,6,6])

In [9]:

b=mode(a)
print('Mode of the array is:',b)

Mode of the array is: 4

In [3]:

ref_tmp_var = False

try:
    if b==4:
        ref_assert_var = True
        ref_tmp_var = True
    else:
        ref_assert_var = False
        print('Please follow the instructions given and use the same variables provided in the instructions.')
except Exception:        
    print('Please follow the instructions given and use the same variables provided in the instructions.')

assert ref_tmp_var

Please follow the instructions given and use the same variables provided in the instructions.

Range¶

The range is the difference between the highest and lowest values within a set of numbers. To calculate range, subtract the smallest number from the largest number in the set.
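As a small sketch of this subtraction, the example below computes the range of a made-up array in two ways: explicitly as maximum minus minimum, and with NumPy's `np.ptp` ("peak to peak") function, which performs the same calculation.

```python
import numpy as np

# Illustrative values, not taken from the exercise.
readings = np.array([14, 3, 27, 8, 19])

# Range as defined above: largest value minus smallest value.
value_range = readings.max() - readings.min()  # 27 - 3

# np.ptp computes the same difference in a single call.
ptp_range = np.ptp(readings)

print(value_range, ptp_range)
```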

Exercise¶

Find the range of the array below using the min() and max() functions and assign it to a variable, b.

In [11]:

a= np.array([2,6,8,9,3,6,2,1,7,9,0,3,8])

In [12]:

b=max(a)-min(a)
print(b)

In [13]:

ref_tmp_var = False

try:
    if b==9:
        ref_assert_var = True
        ref_tmp_var = True
    else:
        ref_assert_var = False
        print('Please follow the instructions given and use the same variables provided in the instructions.')
        
except Exception:        
    print('Please follow the instructions given and use the same variables provided in the instructions.')

assert ref_tmp_var

True

Interquartile range (IQR)¶

To understand the interquartile range, let's first understand what a quartile is.

A quartile is one of the three points that divide a range of data or a population into four equal parts. The first quartile (also called the lower quartile) is the number below which the bottom 25 percent of the data lies. The second quartile (the median) divides the range in the middle and has 50 percent of the data below it. The third quartile (also called the upper quartile) has 75 percent of the data below it and the top 25 percent of the data above it.

In other words, the first quartile (Q1) is defined as the middle number between the smallest number and the median of the data set. The second quartile (Q2) is the median of the data. The third quartile (Q3) is the middle value between the median and the highest value of the data set.

The interquartile range (IQR) is a measure of variability, based on dividing a data set into quartiles.

It is calculated by subtracting Q1 from Q3 (IQR = Q3 - Q1) and is used to measure the variability in the data. In Python we can calculate the IQR by importing iqr() from scipy.stats.
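To make the quartiles concrete, the sketch below computes Q1, Q2 and Q3 with `np.percentile` and then takes the IQR as the third quartile minus the first. The sample array is made up for illustration. Note that several conventions exist for locating quartiles; this example relies on NumPy's default, which interpolates linearly between data points, so other tools or hand methods may give slightly different values on some data sets.

```python
import numpy as np

# Illustrative values, not taken from the exercise.
sample = np.array([1, 3, 5, 7, 9, 11, 13, 15, 17])

# 25th, 50th and 75th percentiles are the three quartiles.
q1, q2, q3 = np.percentile(sample, [25, 50, 75])

# Interquartile range: spread of the middle 50 percent of the data.
iqr_value = q3 - q1

print("Q1:", q1, "Q2 (median):", q2, "Q3:", q3)
print("IQR:", iqr_value)
```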

Exercise¶

Find the IQR of the array below and assign it to a variable, b.

In [14]:

from scipy.stats import iqr
x = np.array([10, 7, 4, 3, 2, 1])

In [37]:

b=iqr(x)
print('Inter Quartile Range is:', b)

Inter Quartile Range is: 4.0

In [15]:

ref_tmp_var = False

try:
    if b==4.0:
        ref_assert_var = True
        ref_tmp_var = True
    else:
        ref_assert_var = False
        print('Please follow the instructions given and use the same variables provided in the instructions.')
except Exception:        
    print('Please follow the instructions given and use the same variables provided in the instructions.')

assert ref_tmp_var

False