#!/usr/bin/env python
# coding: utf-8
# # Way number eight of looking at the correlation coefficient
#
# This is a notebook to accompany the blog post ["Way number eight of looking at the correlation coefficient"](http://composition.al/blog/2019/01/31/way-number-eight-of-looking-at-the-correlation-coefficient/). Read the post for additional context!
# In[1]:
import numpy as np
from datascience import *
from datetime import *
import matplotlib
get_ipython().run_line_magic('matplotlib', 'inline')
import matplotlib.pyplot as plots
from mpl_toolkits.mplot3d import Axes3D
import pandas as pd
import math
# ## Recap from last time
#
# As [before](http://composition.al/blog/2018/08/31/understanding-the-regression-line-with-standard-units/), we're using the [datascience](http://data8.org/datascience/) package, and everything else we're using is pretty standard.
#
# And, as before, here's the data we'll be working with, [converted to standard units](https://www.inferentialthinking.com/chapters/14/2/Variability#standard-units) and plotted:
# In[2]:
heightweight = Table().with_columns([
    'Date', ['07/28/2017', '08/07/2017', '08/25/2017', '09/25/2017', '11/28/2017', '01/26/2018', '04/27/2018', '07/30/2018'],
    'Height (cm)', [53.3, 54.6, 55.9, 61, 63.5, 67.3, 71.1, 74.9],
    'Weight (kg)', [4.204, 4.65, 5.425, 6.41, 7.985, 9.125, 10.39, 10.785],
])

def standard_units(nums):
    # Subtract the mean, then divide by the (population) standard deviation.
    return (nums - np.mean(nums)) / np.std(nums)

heightweight_standard = Table().with_columns(
    'Date', heightweight.column('Date'),
    'Height (standard units)', standard_units(heightweight.column('Height (cm)')),
    'Weight (standard units)', standard_units(heightweight.column('Weight (kg)')))
heightweight_standard
# In[3]:
heightweight_standard.scatter(
    'Height (standard units)',
    'Weight (standard units)')
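# As a quick sanity check on the conversion to standard units, here's a minimal sketch in plain numpy confirming that a standardized column ends up with mean 0 and standard deviation 1. The `sketch_`-prefixed names are made up for this cell so that it runs on its own and doesn't touch anything defined above. Note that `np.std` defaults to the population standard deviation (`ddof=0`).
# In[ ]:
import numpy as np

def sketch_standard_units(nums):
    # Same formula as standard_units above, repeated so this cell is self-contained.
    return (nums - np.mean(nums)) / np.std(nums)

sketch_heights = np.array([53.3, 54.6, 55.9, 61, 63.5, 67.3, 71.1, 74.9])
sketch_heights_su = sketch_standard_units(sketch_heights)
# The mean should be ~0 and the SD ~1, up to floating-point error.
print(np.mean(sketch_heights_su), np.std(sketch_heights_su))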
# ## Visualizing the data in "person space"
#
# So far, this is all a recap of [last time](http://composition.al/blog/2018/08/31/understanding-the-regression-line-with-standard-units/). Now, let's try turning our data sideways.
#
# My hacky way of doing this is to convert the data first to a numpy [ndarray](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.html), then to a [pandas DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html), and then [transpose](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.T.html#pandas.DataFrame.T) the DataFrame. This is kind of silly, but I don't know a better way to transpose a [structured ndarray](https://docs.scipy.org/doc/numpy/user/basics.rec.html). If you do, let me know.
# In[4]:
# First convert to a plain old numpy ndarray.
heightweight_standard_np = heightweight_standard.to_array()
# Now convert *that* to a pandas DataFrame.
df = pd.DataFrame(heightweight_standard_np)
# Get the transpose of the DataFrame.
df = df.T
df
# pandas defaults to using `RangeIndex (0, 1, 2, …, n)` for the column labels, but we want the dates from the first row to be the column headers rather than being an actual row. That's [an easy change to make](https://stackoverflow.com/questions/26147180/convert-row-to-column-header-for-pandas-dataframe), though.
# In[5]:
df.columns = df.iloc[0]
df = df.drop("Date")
df
# While we're at it, we'll convert the values in our DataFrame to numeric values, so that we can visualize them in a moment.
# In[6]:
df = df.apply(pd.to_numeric)
df
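# If you'd rather skip the structured-ndarray detour, here's one possible sketch that arrives at the same shape of DataFrame using only numpy and pandas: build the frame with the dates as its index, then transpose. It rebuilds the standardized values from the raw measurements so that it runs on its own, and the `sketch_`-prefixed names are made up for this cell; treat it as an alternative under those assumptions, not as a replacement for the cells above.
# In[ ]:
import numpy as np
import pandas as pd

sketch_dates = ['07/28/2017', '08/07/2017', '08/25/2017', '09/25/2017',
                '11/28/2017', '01/26/2018', '04/27/2018', '07/30/2018']
sketch_height = np.array([53.3, 54.6, 55.9, 61, 63.5, 67.3, 71.1, 74.9])
sketch_weight = np.array([4.204, 4.65, 5.425, 6.41, 7.985, 9.125, 10.39, 10.785])

def sketch_standard_units(nums):
    # Same formula as standard_units above.
    return (nums - np.mean(nums)) / np.std(nums)

# One row per date, one column per variable; the values are already numeric.
sketch_df = pd.DataFrame(
    {'Height (standard units)': sketch_standard_units(sketch_height),
     'Weight (standard units)': sketch_standard_units(sketch_weight)},
    index=pd.Index(sketch_dates, name='Date'))

# Transposing gives one column per observation (date) and one row per variable.
sketch_df_T = sketch_df.T
sketch_df_T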
# Eight dimensions are too many to visualize, but we can pare them down to three. We'll keep the first (07/28/2017), the last (07/30/2018), and one in the middle (01/26/2018), and drop the rest.
# In[7]:
df_3dim = df.drop(df.columns[[1, 2, 3, 4, 6]], axis=1)
df_3dim
# Now we can visualize the data with a three-dimensional scatter plot.
# In[9]:
get_ipython().run_line_magic('matplotlib', 'notebook')
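# Figure.gca() no longer accepts a projection argument in recent matplotlib releases; add_subplot does.
scatter_3d = plots.figure().add_subplot(projection='3d')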
scatter_3d.scatter(df_3dim.iloc[:, 0], df_3dim.iloc[:, 1], df_3dim.iloc[:, 2])
scatter_3d.set_xlabel(df_3dim.columns[0])
scatter_3d.set_ylabel(df_3dim.columns[1])
scatter_3d.set_zlabel(df_3dim.columns[2])
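# Each row of df_3dim is one variable, i.e. one point in "person space".
height_point = df_3dim.iloc[0]
weight_point = df_3dim.iloc[1]
# Draw an arrow from the origin to each of the two points.
origin = [0, 0, 0]
X, Y, Z = zip(origin, origin)
U, V, W = zip(height_point, weight_point)
scatter_3d.quiver(X, Y, Z, U, V, W, arrow_length_ratio=0.09)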
plots.show()
# What's going on here? We're in the "person space", where, as Rodgers and Nicewander explained, each axis represents an observation -- in this case, three observations. And there are two points, as promised -- one for each of height and weight.
#
# If we look at the difference between the two points on the z-axis -- that is, the axis for 07/30/2018 -- the darker-colored blue dot is higher up, so it must represent the height variable, with coordinates (-1.26135, 0.617255, 1.63707). That means that the other, lighter-colored blue dot, with coordinates (-1.3158, 0.728253, 1.41777), must represent the weight variable.
#
# I've also plotted vectors going from the origin to each point. These are the "variable vectors" for the two points.
# ## The angle between the variable vectors
#
# Next, we want to figure out the angle between the two vectors. There are [various ways](https://stackoverflow.com/questions/2827393/angles-between-two-n-dimensional-vectors-in-python) to do that in Python; we'll use a simple one that works for us:
# In[10]:
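def dotproduct(v1, v2):
    # Sum of the elementwise products of the two vectors.
    return sum((a*b) for a, b in zip(v1, v2))

def length(v):
    # Euclidean length of a vector.
    return math.sqrt(dotproduct(v, v))

def angle(v1, v2):
    # Angle between the two vectors, in radians.
    return math.acos(dotproduct(v1, v2) / (length(v1) * length(v2)))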
angle_between_vvs = angle(height_point, weight_point)
angle_between_vvs
# Finally, we can take the cosine of that to get the correlation coefficient $r$:
# In[11]:
math.cos(angle_between_vvs)
# Almost 1! That means that, just like [last time](http://composition.al/blog/2018/08/31/understanding-the-regression-line-with-standard-units/), we have an almost perfect linear correlation.
#
# It's a bit different from what we had last time, though, which was 0.9910523777994954. That's because, for the sake of visualization, we decided to only look at three of the observations.
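#
# For what it's worth, the same calculation can be sketched with numpy. This is only an alternative, not what the cells above use: `sketch_angle` is a made-up name, the coordinates are the rounded ones quoted earlier, and the `np.clip` is there to guard against floating-point error nudging the cosine slightly outside [-1, 1], which would make `np.arccos` return `nan`.
# In[ ]:
import numpy as np

def sketch_angle(v1, v2):
    # Angle in radians between two vectors, via the normalized dot product.
    v1 = np.asarray(v1, dtype=float)
    v2 = np.asarray(v2, dtype=float)
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return np.arccos(np.clip(cos, -1.0, 1.0))

# Rounded three-dimensional coordinates of the height and weight points.
sketch_height_3d = [-1.26135, 0.617255, 1.63707]
sketch_weight_3d = [-1.3158, 0.728253, 1.41777]
sketch_cos_3d = np.cos(sketch_angle(sketch_height_3d, sketch_weight_3d))
sketch_cos_3d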
# ## The angle between the _actual_ variable vectors
#
# We can, however, go back to all eight dimensions. We may not be able to visualize them, but we can still measure the angle between them!
# In[12]:
height_point_8dim = df.iloc[0]
weight_point_8dim = df.iloc[1]
angle_between_8dim_vvs = angle(height_point_8dim, weight_point_8dim)
angle_between_8dim_vvs
# Taking the cosine of this slightly bigger angle:
# In[13]:
math.cos(angle_between_8dim_vvs)
# This turns out to be the same as what we had previously calculated $r$ to be, modulo a little numerical imprecision. And so, that's way number eight of looking at the correlation coefficient -- as the angle between two variable vectors in "person space".
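#
# Why does this work? In standard units (with the population standard deviation), $r$ is the mean of the products of the paired standardized values, and each variable vector has length $\sqrt{n}$, so the dot product divided by the product of the two lengths comes out to exactly $r$. Here's a minimal sketch checking that against numpy's own `np.corrcoef`; it rebuilds the data so that it runs on its own, and the `sketch_`-prefixed names are made up for this cell.
# In[ ]:
import numpy as np

sketch_height = np.array([53.3, 54.6, 55.9, 61, 63.5, 67.3, 71.1, 74.9])
sketch_weight = np.array([4.204, 4.65, 5.425, 6.41, 7.985, 9.125, 10.39, 10.785])

def sketch_standard_units(nums):
    # Same formula as standard_units above.
    return (nums - np.mean(nums)) / np.std(nums)

sketch_zh = sketch_standard_units(sketch_height)
sketch_zw = sketch_standard_units(sketch_weight)

# Cosine of the angle between the two eight-dimensional variable vectors.
sketch_cos_8d = np.dot(sketch_zh, sketch_zw) / (
    np.linalg.norm(sketch_zh) * np.linalg.norm(sketch_zw))

# Pearson's r computed directly from the raw measurements, for comparison.
sketch_r = np.corrcoef(sketch_height, sketch_weight)[0, 1]
print(sketch_cos_8d, sketch_r)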