# Multiple Regression

Let's grab a data set of car values:

In [2]:
import pandas as pd

# Load the car data (the file name is assumed; the original notebook's path isn't shown)
df = pd.read_csv('cars.csv')
df.head()


Out[2]:
          Price  Mileage   Make    Model      Trim   Type  Cylinder  Liter  Doors  Cruise  Sound  Leather
0  17314.103129     8221  Buick  Century  Sedan 4D  Sedan         6    3.1      4       1      1        1
1  17542.036083     9135  Buick  Century  Sedan 4D  Sedan         6    3.1      4       1      1        0
2  16218.847862    13196  Buick  Century  Sedan 4D  Sedan         6    3.1      4       1      1        0
3  16336.913140    16342  Buick  Century  Sedan 4D  Sedan         6    3.1      4       1      0        0
4  16339.170324    19832  Buick  Century  Sedan 4D  Sedan         6    3.1      4       1      0        1
In [5]:
%matplotlib inline
import numpy as np

df1 = df[['Mileage', 'Price']]
bins = np.arange(0, 50000, 10000)

# Bin cars into 10,000-mile buckets and average price within each bucket
# (cars above 40,000 miles fall outside the bins and are dropped)
groups = df1.groupby(pd.cut(df1['Mileage'], bins)).mean()
print(groups)

groups['Price'].plot.line()

                     Mileage         Price
Mileage
(0, 10000]       5588.629630  24096.714451
(10000, 20000]  15898.496183  21955.979607
(20000, 30000]  24114.407104  20278.606252
(30000, 40000]  33610.338710  19463.670267

Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fd9d3394710>
[Line plot: mean Price falls steadily as the Mileage bucket increases.]

We can use pandas to split up this matrix into the feature vectors we're interested in, and the value we're trying to predict.

Note how we are avoiding the make and model; regressions don't work well with categorical values unless you can either convert them into numbers whose ordering actually means something, or encode each category as its own 0/1 indicator column.
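If we did want to use Make or Model, the usual fix is one-hot encoding with pandas' `get_dummies`. Here's a sketch on a tiny made-up frame (not the real car data):

```python
import pandas as pd

# Hypothetical mini-frame standing in for the car data's categorical columns
cars = pd.DataFrame({'Make': ['Buick', 'Ford', 'Buick'],
                     'Price': [17314, 15000, 16218]})

# One-hot encode 'Make' into 0/1 indicator columns a regression can use
encoded = pd.get_dummies(cars, columns=['Make'])
print(encoded.columns.tolist())  # ['Price', 'Make_Buick', 'Make_Ford']
```

Each make becomes its own column, so the regression never has to pretend that "Buick < Ford" in some numerical sense.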

Let's scale our feature data into the same range so we can easily compare the coefficients we end up with.

In [8]:
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler
scale = StandardScaler()

# .copy() gives us our own DataFrame, avoiding pandas' SettingWithCopyWarning
X = df[['Mileage', 'Cylinder', 'Doors']].copy()
y = df['Price']

# .values replaces the deprecated .as_matrix()
X[['Mileage', 'Cylinder', 'Doors']] = scale.fit_transform(X[['Mileage', 'Cylinder', 'Doors']].values)

print(X)

est = sm.OLS(y, X).fit()

est.summary()

      Mileage  Cylinder     Doors
0   -1.417485  0.527410  0.556279
1   -1.305902  0.527410  0.556279
2   -0.810128  0.527410  0.556279
3   -0.426058  0.527410  0.556279
4    0.000008  0.527410  0.556279
5    0.293493  0.527410  0.556279
6    0.335001  0.527410  0.556279
7    0.382369  0.527410  0.556279
8    0.511409  0.527410  0.556279
9    0.914768  0.527410  0.556279
10  -1.171368  0.527410  0.556279
11  -0.581834  0.527410  0.556279
12  -0.390532  0.527410  0.556279
13  -0.003899  0.527410  0.556279
14   0.430591  0.527410  0.556279
15   0.480156  0.527410  0.556279
16   0.509822  0.527410  0.556279
17   0.757160  0.527410  0.556279
18   1.594886  0.527410  0.556279
19   1.810849  0.527410  0.556279
20  -1.326046  0.527410  0.556279
21  -1.129860  0.527410  0.556279
22  -0.667658  0.527410  0.556279
23  -0.405792  0.527410  0.556279
24  -0.112796  0.527410  0.556279
25  -0.044552  0.527410  0.556279
26   0.190700  0.527410  0.556279
27   0.337442  0.527410  0.556279
28   0.566102  0.527410  0.556279
29   0.660837  0.527410  0.556279
..        ...       ...       ...
774 -0.161262 -0.914896  0.556279
775 -0.089234 -0.914896  0.556279
776 -0.040523 -0.914896  0.556279
777  0.002572 -0.914896  0.556279
778  0.236603 -0.914896  0.556279
779  0.249666 -0.914896  0.556279
780  0.357220 -0.914896  0.556279
781  0.365521 -0.914896  0.556279
782  0.434131 -0.914896  0.556279
783  0.517269 -0.914896  0.556279
784  0.589908 -0.914896  0.556279
785  0.599186 -0.914896  0.556279
786  0.793052 -0.914896  0.556279
787  1.033554 -0.914896  0.556279
788  1.045762 -0.914896  0.556279
789  1.205567 -0.914896  0.556279
790  1.541414 -0.914896  0.556279
791  1.561070 -0.914896  0.556279
792  1.725026 -0.914896  0.556279
793  1.851502 -0.914896  0.556279
794 -1.709871  0.527410  0.556279
795 -1.474375  0.527410  0.556279
796 -1.187849  0.527410  0.556279
797 -1.079929  0.527410  0.556279
798 -0.682430  0.527410  0.556279
799 -0.439853  0.527410  0.556279
800 -0.089966  0.527410  0.556279
801  0.079605  0.527410  0.556279
802  0.750446  0.527410  0.556279
803  1.932565  0.527410  0.556279

[804 rows x 3 columns]


Out[8]:
                                 OLS Regression Results
=======================================================================================
Dep. Variable:                  Price   R-squared (uncentered):                   0.064
Model:                            OLS   Adj. R-squared (uncentered):              0.060
Method:                 Least Squares   F-statistic:                              18.11
Date:                Sun, 01 Sep 2019   Prob (F-statistic):                    2.23e-11
Time:                        03:30:21   Log-Likelihood:                         -9207.1
No. Observations:                 804   AIC:                                  1.842e+04
Df Residuals:                     801   BIC:                                  1.843e+04
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Mileage    -1272.3412    804.623     -1.581      0.114   -2851.759     307.077
Cylinder    5587.4472    804.509      6.945      0.000    4008.252    7166.642
Doors      -1404.5513    804.275     -1.746      0.081   -2983.288     174.185
==============================================================================
Omnibus:                      157.913   Durbin-Watson:                   0.008
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              257.529
Skew:                           1.278   Prob(JB):                     1.20e-56
Kurtosis:                       4.074   Cond. No.                         1.03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The table of coefficients above gives us the values to plug into an equation of the form B1 * Mileage + B2 * Cylinder + B3 * Doors. (Note that we never added a constant column, so this model has no intercept term B0; that's why statsmodels reports the "uncentered" R-squared. `sm.add_constant(X)` would add one.)

In this example, it's pretty clear that the number of cylinders matters most: because we scaled all three features into the same range, the coefficients are directly comparable, and Cylinder's (about 5587) dwarfs the other two and is the only one that is statistically significant (P>|t| of 0.000).

Could we have figured that out earlier?

In [4]:
y.groupby(df.Doors).mean()

Out[4]:
Doors
2    23807.135520
4    20580.670749
Name: Price, dtype: float64

Surprisingly, more doors does not mean a higher price! (Maybe fewer doors implies a sports car in some cases?) So it's not surprising that Doors is a pretty weak predictor here. This is a fairly small data set, however, so we can't read too much meaning into it.
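The same sanity check works for Cylinder. We can't rerun it on the original data here, so this sketch uses a tiny made-up frame just to show the shape of the check:

```python
import pandas as pd

# Tiny hypothetical frame (not the real car data)
toy = pd.DataFrame({'Cylinder': [4, 4, 6, 6, 8, 8],
                    'Price': [15000, 16000, 20000, 21000, 30000, 32000]})

# Mean price per cylinder count: a quick, model-free look at the relationship
means = toy.groupby('Cylinder')['Price'].mean()
print(means)
```

If mean price climbs steadily with cylinder count, as it does in the real data, that foreshadows the large Cylinder coefficient in the regression.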

In [29]:
# Scale a hypothetical car (20,000 miles, 8 cylinders, 4 doors) with the same
# scaler fitted above, then predict its price with the fitted model
scaled = scale.transform([[20000, 8, 4]])
print(scaled)
predicted = est.predict(scaled[0])
print(predicted)

[[0.02051781 1.96971667 0.55627894]]
[10198.25991671]
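Since the model has no intercept, that prediction is just the dot product of the scaled features with the coefficients from the summary table. We can verify it by hand:

```python
import numpy as np

# Coefficients from the OLS summary above, in (Mileage, Cylinder, Doors) order
coefs = np.array([-1272.3412, 5587.4472, -1404.5513])

# The scaled feature vector printed above for a 20,000-mile, 8-cylinder, 4-door car
scaled_row = np.array([0.02051781, 1.96971667, 0.55627894])

# With no constant in the model, the prediction is just the dot product
manual = scaled_row @ coefs
print(round(manual, 2))  # matches est.predict's 10198.26
```

Working through the arithmetic like this is a good way to convince yourself the fitted model really is just a weighted sum of the (scaled) features.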