#!/usr/bin/env python
# coding: utf-8

# # Understanding the regression line with standard units
# 
# This is a notebook to accompany the blog post ["Understanding the regression line with standard units"](http://composition.al/blog/2018/08/31/understanding-the-regression-line-with-standard-units/).  Read the post for additional context!

# In[1]:


import numpy as np
from datascience import *
from datetime import *
import matplotlib
get_ipython().run_line_magic('matplotlib', 'inline')
import matplotlib.pyplot as plots


# We're using the [datascience](http://data8.org/datascience/) package that the Data 8X course uses.  Everything else we're using here is pretty standard.
# 
# To start with, let's make a table of the height and weight data we'll be working with.  For now, dates are just strings, but we'll fix that later!

# In[2]:


heightweight = Table().with_columns([
    'Date',        ['07/28/2017', '08/07/2017', '08/25/2017', '09/25/2017', '11/28/2017', '01/26/2018', '04/27/2018', '07/30/2018'],
    'Height (cm)', [        53.3,         54.6,         55.9,           61,         63.5,         67.3,         71.1,         74.9],
    'Weight (kg)', [       4.204,         4.65,        5.425,         6.41,        7.985,        9.125,        10.39,       10.785],
                                    ])
heightweight


# Now we can plot weight as a function of height.

# In[3]:


heightweight.scatter('Height (cm)', 'Weight (kg)')


# Look at that!  Those dots aren't too far from being a straight line.  It looks as though height and weight are more or less linearly related!  We'll come back to that in a moment, but first, let's convert them to standard units.

# ## Converting to standard units
# 
# A value in standard units is _how many standard deviations above the mean_ it is.  So, to convert a data point to standard units, we need the mean and the standard deviation of the data set that it came from.  Then we subtract the mean from it, and then divide that by the standard deviation  The `standard_units` function below, which comes [from the Data 8 textbook](https://www.inferentialthinking.com/chapters/14/2/Variability#standard-units), does this for a whole array of numbers at once.

# In[4]:


def standard_units(nums):
    return (nums - np.mean(nums)) / np.std(nums)


# Now we can create a version of the `heightweight` table that's in standard units. 

# In[5]:


heightweight_standard = Table().with_columns(
    'Date', heightweight.column('Date'),
    'Height (standard units)', standard_units(heightweight.column('Height (cm)')),
    'Weight (standard units)', standard_units(heightweight.column('Weight (kg)')))
heightweight_standard


# And we can plot this data, just like we did with the data in original units.

# In[6]:


heightweight_standard.scatter('Height (standard units)', 'Weight (standard units)')


# The data points on our plot are exactly the same.  Only the axes have changed.

# ## The correlation coefficient
# 
# Now that we've got standard units, though, it's easy to compute the correlation coefficient $r$.  First, we need to take the product of our $x$ and $y$ values -- in this case, height and weight -- for each data point.  We'll add a new column to our table that has these products.

# In[7]:


heightweight_product = heightweight_standard.with_column(
    'Product of standardized height and weight',
    heightweight_standard.column('Height (standard units)') * heightweight_standard.column('Weight (standard units)'))
heightweight_product


# To compute $r$, we just need to take the mean of the product column.

# In[8]:


r = np.mean(heightweight_product.column('Product of standardized height and weight'))
r


# Wow. $r$ is very close to 1, which means that our data has an almost perfect linear correlation.
# 
# What does "almost perfect" look like?  Now that we have $r$, we can plot the regression line.  (To be specific, we're plotting a segment of the regression line that goes from -2 to 2 on the $x$-axis, but we could plot a longer segment if we wanted to.)

# In[9]:


heightweight_standard.scatter('Height (standard units)', 'Weight (standard units)', label="data")
x = np.array(range(-2, 3))  
y = r * x # <-- the regression equation!
plots.plot(x, y, label="regression line", linewidth=1.5)  
plots.legend();


# Since $r$ is so close to 1, the regression line is awfully close to $y = x$, a perfect linear correlation.  Let's plot that line, too.

# In[10]:


heightweight_standard.scatter('Height (standard units)', 'Weight (standard units)', label="data")
x = np.array(range(-2, 3))  
y = r * x # <-- the regression equation
plots.plot(x, y, label="regression line", linewidth=1.5)  
y = x # <-- linear correlation
plots.plot(x, y, label="linear correlation", linewidth=1.5)  
plots.legend();


# The slope of the actual regression line is just sliiiiightly smaller than the slope of the line $y = x$.  That is, the regression line is slightly higher than the red line on the left side of the plot, and slightly lower on the right side.  If we scale up the plot size and tweak the line width and range of the axes, we can see the separation between the two lines a bit more clearly.

# In[11]:


original_dpi = plots.rcParams['figure.dpi']
plots.rcParams['figure.dpi'] = plots.rcParams['figure.dpi'] * 1.5
heightweight_standard.scatter('Height (standard units)', 'Weight (standard units)', label="data")
x = np.array(range(-2, 101))  
y = r * x # <-- the regression equation
plots.plot(x, y, label="regression line", linewidth=0.5)  
y = x # <-- linear correlation
plots.plot(x, y, label="linear correlation", linewidth=0.5)  
plots.legend()
plots.rcParams['figure.dpi'] = original_dpi


# ## Secret Bonus Content&trade;
# 
# We've seen that the relationship between height and weight in this data set is very well captured by a straight line.  But what about the relationship between height and time, or weight and time?
# 
# Let's go back to our original table and look at the date column.  To make dates easier to work with, let's convert them to Python date objects instead of strings.

# In[12]:


heightweight_dates = heightweight.with_columns(
   'Date',
    [datetime.strptime(date, "%m/%d/%Y").date() for date in heightweight.column(0)])
heightweight_dates


# We can now compute the number of days since birth for each row in our data set.  Sylvia was born on July 24, 2017, so we'll define that date to be her `birthday` and work from there.

# In[13]:


birthday = date(2017,7,24)
days_since_birth = [(date - birthday).days for date in heightweight_dates.column('Date')]
heightweight_with_days = heightweight_dates.with_columns('Days since birth', days_since_birth)
heightweight_with_days


# And now we can easily plot the relationship between days since birth and height, and the relationship between days since birth and weight.

# In[14]:


heightweight_with_days.scatter('Days since birth', 'Height (cm)')
heightweight_with_days.scatter('Days since birth', 'Weight (kg)')


# How linear are these relationships?  We can find out by computing $r$ for each relationship.

# In[15]:


heightweight_with_days_standard = Table().with_columns(
    "Date", heightweight_with_days.column('Date'),
    "Height (standard units)", standard_units(heightweight_with_days.column('Height (cm)')),
    "Weight (standard units)", standard_units(heightweight_with_days.column('Weight (kg)')),
    "Days since birth (standard units)", standard_units(heightweight_with_days.column('Days since birth'))
)
heightweight_with_days_product = heightweight_with_days_standard.with_columns(
    "Product of standardized height and days since birth",
    heightweight_with_days_standard.column('Height (standard units)') * heightweight_with_days_standard.column('Days since birth (standard units)'),
    "Product of standardized weight and days since birth",
    heightweight_with_days_standard.column('Weight (standard units)') * heightweight_with_days_standard.column('Days since birth (standard units)'))
heightweight_with_days_product


# In[16]:


r_days_height = np.mean(heightweight_with_days_product.column('Product of standardized height and days since birth'))
print("r_days_height:", r_days_height)
r_days_weight = np.mean(heightweight_with_days_product.column('Product of standardized weight and days since birth'))
print("r_days_weight:", r_days_weight)


# Both of these values for $r$ are pretty close to 1, although a little less so than the relationship between height and weight.  Let's plot the regression lines:

# In[17]:


heightweight_with_days_standard.scatter("Days since birth (standard units)", "Height (standard units)", label="data")
x = np.array(range(-2, 3))  
y = r_days_height * x # <-- the regression equation
plots.plot(x, y, label="regression line", linewidth=1.5)  
y = x # <-- linear correlation
plots.plot(x, y, label="linear correlation", linewidth=1.5)  
plots.legend()

heightweight_with_days_standard.scatter("Days since birth (standard units)", "Weight (standard units)", label="data")
x = np.array(range(-2, 3))  
y = r_days_weight * x # <-- the regression equation
plots.plot(x, y, label="regression line", linewidth=1.5)  
y = x # <-- linear correlation
plots.plot(x, y, label="linear correlation", linewidth=1.5)
plots.legend();


# Although $r$ is pretty close to 1 in each case, the shapes of the scatter plots look to me like the data would be better modeled by a curved line, especially the second one.  Maybe the next part of the course will talk about those.