Estimate a regression using the Income data
We'll be working with a dataset of US Census income (data dictionary).
Many businesses would like to personalize their offers based on a customer's income. High-income customers could, for instance, be shown premium products. Because a customer's income is not always explicitly known, a predictive model can estimate it from other information about the person.
Our goal is to create a predictive model that outputs an estimate of a person's income.
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
# read the data from the zip archive, using the first column as the index
import zipfile
with zipfile.ZipFile('../datasets/income.csv.zip', 'r') as z:
    f = z.open('income.csv')
    income = pd.read_csv(f, index_col=0)
income.head()
| | Age | Workclass | fnlwgt | Education | Education-Num | Martial Status | Occupation | Relationship | Race | Sex | Capital Gain | Capital Loss | Hours per week | Country | Income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | 51806.0 |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | 68719.0 |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | 51255.0 |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | 47398.0 |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | 30493.0 |
income.shape
(32561, 15)
What is the relation between Age and Income?
For a one percent increase in Age, by how much does Income increase?
Using sklearn, estimate a linear regression and predict the Income when Age is 30 and 40 years.
income.plot(x='Age', y='Income', kind='scatter')
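One possible way to approach the questions above is sketched here. It assumes the `income` DataFrame loaded earlier, with its `Age` and `Income` columns. The helper names (`fit_age_model`, `predict_income`, `elasticity_at_mean`) are hypothetical and not part of the notebook. The "one percent" question is read here as an elasticity evaluated at the sample means, which is only one of several reasonable interpretations; a log-log regression would be another.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def fit_age_model(df):
    """Fit Income ~ Age with scikit-learn and return the fitted model."""
    X = df[['Age']]   # 2-D feature matrix, as sklearn expects
    y = df['Income']
    return LinearRegression().fit(X, y)


def predict_income(model, ages):
    """Predict Income for an iterable of ages."""
    X_new = pd.DataFrame({'Age': list(ages)})
    return model.predict(X_new)


def elasticity_at_mean(model, df):
    """Approximate % change in Income for a 1% increase in Age.

    Assumption: elasticity is evaluated at the sample means,
    i.e. slope * mean(Age) / mean(Income).
    """
    slope = model.coef_[0]
    return slope * df['Age'].mean() / df['Income'].mean()


# In the notebook, with `income` loaded as above:
# model = fit_age_model(income)
# predict_income(model, [30, 40])
# elasticity_at_mean(model, income)
```

The slope `model.coef_[0]` is the estimated change in Income for one additional year of Age; the elasticity rescales it into percentage terms.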
Evaluate the model using the MSE
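As a minimal sketch, the MSE can be computed with `sklearn.metrics.mean_squared_error`. The helper `in_sample_mse` is a hypothetical name, and it evaluates the model on the same data it was fitted on, which is an assumption made for brevity; a held-out test set would give a more honest estimate.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error


def in_sample_mse(X, y):
    """Fit a linear regression and return (MSE, RMSE) on the training data."""
    model = LinearRegression().fit(X, y)
    mse = mean_squared_error(y, model.predict(X))
    # RMSE is in the same units as Income, which makes it easier to read
    return mse, np.sqrt(mse)


# In the notebook:
# in_sample_mse(income[['Age']], income['Income'])
```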
Run a regression model with Age and Age$^2$ as features, using the OLS equations
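The OLS equations referred to here are usually written as $\hat{\beta} = (X^T X)^{-1} X^T y$, where $X$ has a column of ones, a column with Age, and a column with Age$^2$. The sketch below is one way to compute them with NumPy; `ols_quadratic` is a hypothetical helper name. It solves the linear system $(X^T X)\beta = X^T y$ rather than forming the inverse explicitly, which gives the same estimate and is numerically more stable.

```python
import numpy as np


def ols_quadratic(age, income):
    """Estimate Income = b0 + b1*Age + b2*Age^2 via the normal equations."""
    age = np.asarray(age, dtype=float)
    y = np.asarray(income, dtype=float)
    # Design matrix: intercept, Age, Age squared
    X = np.column_stack([np.ones_like(age), age, age ** 2])
    # Solve (X'X) beta = X'y instead of inverting X'X
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    return beta


# In the notebook:
# beta = ols_quadratic(income['Age'], income['Income'])
# Prediction at Age 30: beta @ np.array([1, 30, 30 ** 2])
```

A negative coefficient on Age$^2$ would indicate that the effect of Age on Income flattens or reverses at higher ages.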
Estimate a regression using more features.
How does the performance compare with using only Age?
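A possible sketch of this comparison follows. It assumes categorical columns are one-hot encoded with `pd.get_dummies`, and that performance is measured by in-sample MSE. The helper `compare_feature_sets` and the choice of columns in the usage comment are illustrative, not prescribed by the exercise.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error


def compare_feature_sets(df, numeric_cols, categorical_cols):
    """Return (MSE using only Age, MSE using the larger feature set)."""
    y = df['Income']

    # Baseline: Age only
    X_age = df[['Age']]
    mse_age = mean_squared_error(y, LinearRegression().fit(X_age, y).predict(X_age))

    # Larger model: numeric columns plus one-hot encoded categoricals;
    # drop_first avoids perfectly collinear dummy columns
    cols = list(numeric_cols) + list(categorical_cols)
    X_full = pd.get_dummies(df[cols], columns=list(categorical_cols), drop_first=True)
    X_full = X_full.astype(float)
    mse_full = mean_squared_error(y, LinearRegression().fit(X_full, y).predict(X_full))

    return mse_age, mse_full


# In the notebook, for example:
# compare_feature_sets(income, ['Age', 'Education-Num', 'Hours per week'], ['Sex'])
```

When Age is among the features, the in-sample MSE of the larger model cannot exceed that of the Age-only model, so a train/test split is needed to tell whether the extra features really generalize better.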