Exercise 02

Estimate a regression using the Income data

Forecast of income

We'll be working with a dataset of US Census income (data dictionary).

Many businesses would like to personalize their offers based on a customer’s income. High-income customers could, for instance, be shown premium products. Because a customer’s income is not always explicitly known, a predictive model could estimate it from other information.

Our goal is to create a predictive model that outputs an estimate of a person’s income.

In [5]:
import pandas as pd
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt

# read the data from the zipped CSV, using the first column as the index
import zipfile
with zipfile.ZipFile('../datasets/income.csv.zip', 'r') as z:
    f = z.open('income.csv')
    income = pd.read_csv(f, index_col=0)

income.head()
Out[5]:
   Age  Workclass         fnlwgt  Education  Education-Num  Martial Status      Occupation         Relationship   Race   Sex     Capital Gain  Capital Loss  Hours per week  Country        Income
0  39   State-gov         77516   Bachelors  13             Never-married       Adm-clerical       Not-in-family  White  Male    2174          0             40              United-States  51806.0
1  50   Self-emp-not-inc  83311   Bachelors  13             Married-civ-spouse  Exec-managerial    Husband        White  Male    0             0             13              United-States  68719.0
2  38   Private           215646  HS-grad    9              Divorced            Handlers-cleaners  Not-in-family  White  Male    0             0             40              United-States  51255.0
3  53   Private           234721  11th       7              Married-civ-spouse  Handlers-cleaners  Husband        Black  Male    0             0             40              United-States  47398.0
4  28   Private           338409  Bachelors  13             Married-civ-spouse  Prof-specialty     Wife           Black  Female  0             0             40              Cuba           30493.0
In [6]:
income.shape
Out[6]:
(32561, 15)

Exercise 2.1

What is the relationship between Age and Income?

By how much does Income increase for a one percent increase in Age?

Using sklearn, estimate a linear regression and predict the income when Age is 30 and 40 years.

In [3]:
income.plot(x='Age', y='Income', kind='scatter')
Out[3]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f7d835da7f0>
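
One possible way to approach Exercise 2.1 is sketched below. It is not the reference solution. To keep the cell self-contained it fits the model on a synthetic stand-in called demo, whose numbers are invented for illustration; in the notebook the income DataFrame loaded above would take its place. In a level-level linear model, a one percent increase in Age at age a changes the predicted income by slope * 0.01 * a, so the effect depends on the age at which it is evaluated (the sample mean is used here as an assumption).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the income DataFrame (hypothetical numbers, not Census data)
rng = np.random.default_rng(0)
age = rng.integers(17, 91, size=1000)
demo = pd.DataFrame({
    'Age': age,
    'Income': 30000 + 500 * age + rng.normal(0, 5000, size=1000),
})

X = demo[['Age']]   # 2-D feature matrix, as sklearn expects
y = demo['Income']

model = LinearRegression().fit(X, y)
slope = model.coef_[0]
intercept = model.intercept_

# Predicted income at ages 30 and 40
new_ages = pd.DataFrame({'Age': [30, 40]})
pred_30, pred_40 = model.predict(new_ages)

# Effect of a 1% increase in Age, evaluated at the mean age
mean_age = demo['Age'].mean()
effect_1pct = slope * 0.01 * mean_age

print('slope:', slope, 'intercept:', intercept)
print('prediction at 30:', pred_30, 'prediction at 40:', pred_40)
print('change in income for a 1% increase in Age at the mean age:', effect_1pct)
```

With the real data, replacing demo with income should be enough, since the column names 'Age' and 'Income' match the table shown above.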

Exercise 2.2

Evaluate the model using the mean squared error (MSE).
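
A minimal sketch of one way to compute the MSE, again on synthetic stand-in data rather than the Census file. The train/test split is an assumption added here so that the error is measured on rows the model has not seen; the exercise itself does not say whether to evaluate in-sample or out-of-sample.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for Age and Income (hypothetical numbers)
rng = np.random.default_rng(0)
age = rng.integers(17, 91, size=1000).reshape(-1, 1)
y = 30000 + 500 * age.ravel() + rng.normal(0, 5000, size=1000)

# Hold out 30% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    age, y, test_size=0.3, random_state=0)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# MSE from sklearn and the same quantity computed by hand
mse = mean_squared_error(y_test, y_pred)
mse_manual = np.mean((y_test - y_pred) ** 2)

# The square root is in the same units as Income, which is easier to interpret
rmse = np.sqrt(mse)

print('MSE:', mse)
print('RMSE:', rmse)
```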

Exercise 2.3

Run a regression model using Age and Age$^2$ as features, estimated with the OLS equations.
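
The sketch below assumes that "the OLS equations" refers to the normal equations, $\hat{\beta} = (X^T X)^{-1} X^T y$, with a design matrix whose columns are a constant, Age and Age$^2$. The data are synthetic with a made-up concave relation; the notebook's Age and Income columns would replace them. The system is solved with np.linalg.solve instead of forming the inverse explicitly, and np.linalg.lstsq serves as a cross-check.

```python
import numpy as np

# Synthetic stand-in with a hypothetical concave Age-Income relation
rng = np.random.default_rng(0)
age = rng.integers(17, 91, size=1000).astype(float)
y = -20000 + 3500 * age - 35 * age ** 2 + rng.normal(0, 5000, size=1000)

# Design matrix: intercept, Age, Age^2
X = np.column_stack([np.ones_like(age), age, age ** 2])

# Normal equations: (X'X) beta = X'y
beta = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check with a least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat = X @ beta
mse = np.mean((y - y_hat) ** 2)

print('beta (intercept, Age, Age^2):', beta)
print('in-sample MSE:', mse)
```

A negative coefficient on Age$^2$ would indicate that income rises with age and then levels off or declines, which a straight line cannot capture.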

Exercise 2.4

Estimate a regression using more features.

How does the performance compare with using only Age?

Exercise 2.5

Estimate a logistic regression to predict whether a person is in the United States.

What is the performance of the model?

In [10]:
income['isUS'] = (income['Country'] == 'United-States')*1.0
income['isUS'].value_counts()
Out[10]:
1.0    29170
0.0     3391
Name: isUS, dtype: int64
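
The counts above show that about 89.6% of the rows (29170 of 32561) have isUS equal to 1, so a model that always predicts 1 already reaches that accuracy. Any performance figure is worth reading against that baseline. The sketch below shows one way to fit and score a logistic regression. It uses synthetic data with a made-up dependence on Education-Num, and the feature list is an assumption, not part of the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in (hypothetical values) with an imbalanced isUS target
rng = np.random.default_rng(0)
n = 2000
demo = pd.DataFrame({
    'Age': rng.integers(17, 91, size=n),
    'Education-Num': rng.integers(1, 17, size=n),
    'Hours per week': rng.integers(10, 61, size=n),
})
logit = -0.5 + 0.35 * demo['Education-Num']
demo['isUS'] = (rng.random(n) < 1 / (1 + np.exp(-logit))) * 1.0

X = demo[['Age', 'Education-Num', 'Hours per week']]
y = demo['isUS'].astype(int)   # integer class labels 0/1

# Stratify so both splits keep the same class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

acc = accuracy_score(y_test, clf.predict(X_test))
# Accuracy of always predicting the majority class
baseline = max(y_test.mean(), 1 - y_test.mean())
# AUC is less sensitive to class imbalance than accuracy
auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])

print('accuracy:', acc)
print('majority-class baseline:', baseline)
print('ROC AUC:', auc)
```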