Exercise 02¶

Estimate a regression using the Income data

Forecast of income¶

We'll be working with a dataset of US Census income (data dictionary).

Many businesses would like to personalize their offers based on a customer’s income. High-income customers could, for instance, be shown premium products. Because a customer’s income is not always explicitly known, a predictive model could estimate a person's income from other information.

Our goal is to create a predictive model that outputs an estimate of a person's income.

In [5]:

import pandas as pd
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt

# read the zipped CSV and use its first column as the index
import zipfile
with zipfile.ZipFile('../datasets/income.csv.zip', 'r') as z:
    f = z.open('income.csv')
    income = pd.read_csv(f, index_col=0)

income.head()

Out[5]:

	Age	Workclass	fnlwgt	Education	Education-Num	Martial Status	Occupation	Relationship	Race	Sex	Capital Gain	Hours per week	Country	Income
0	39	State-gov	77516	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	Male	2174	40	United-States	51806.0
1	50	Self-emp-not-inc	83311	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	13	United-States	68719.0
2	38	Private	215646	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family	White	Male	0	40	United-States	51255.0
3	53	Private	234721	11th	7	Married-civ-spouse	Handlers-cleaners	Husband	Black	Male	0	40	United-States	47398.0
4	28	Private	338409	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife	Black	Female	0	40	Cuba	30493.0

In [6]:

income.shape

Out[6]:

(32561, 15)

Exercise 2.1¶

What is the relation between Age and Income?

For a one percent increase in Age, by how much does Income increase?

Using sklearn, estimate a linear regression and predict the Income when the Age is 30 and 40 years.
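One possible way to approach this exercise is sketched below as three small helpers that accept any DataFrame with `Age` and `Income` columns. The function names are hypothetical, and reading the "one percent" question as an elasticity evaluated at the sample means is an assumption, not the only valid interpretation.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def fit_age_model(df: pd.DataFrame) -> LinearRegression:
    """Fit Income ~ Age by ordinary least squares (hypothetical helper)."""
    X = df[['Age']].to_numpy(dtype=float)   # 2-D feature matrix, shape (n, 1)
    y = df['Income'].to_numpy(dtype=float)
    return LinearRegression().fit(X, y)


def age_elasticity_at_mean(model: LinearRegression, df: pd.DataFrame) -> float:
    """Approximate % change in Income per 1% change in Age at the sample means.

    Assumption: for a linear model the elasticity is slope * Age / Income,
    evaluated here at (mean Age, mean Income).
    """
    slope = model.coef_[0]
    return slope * df['Age'].mean() / df['Income'].mean()


def predict_income(model: LinearRegression, ages) -> np.ndarray:
    """Predict Income for a list of ages."""
    return model.predict(np.asarray(ages, dtype=float).reshape(-1, 1))
```

With the notebook's data this would be called as `model = fit_age_model(income)` followed by `predict_income(model, [30, 40])`. The slope `model.coef_[0]` is the estimated change in Income per additional year of Age.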

In [3]:

income.plot(x='Age', y='Income', kind='scatter')

Out[3]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f7d835da7f0>

In [ ]:

Exercise 2.2¶

Evaluate the model using the MSE
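As a reminder, the mean squared error is $\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$. The sketch below computes it by hand and with scikit-learn's `mean_squared_error`. The helper name `mse_by_hand` is hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_squared_error


def mse_by_hand(y_true, y_pred) -> float:
    """Average of the squared differences between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))


# Tiny illustrative arrays (made up for this sketch, not taken from the income data).
observed = [1.0, 2.0, 3.0]
predicted = [1.0, 2.0, 5.0]
manual = mse_by_hand(observed, predicted)
library = mean_squared_error(observed, predicted)
```

Both values should agree. Computing the MSE on rows held out from fitting (for example with `sklearn.model_selection.train_test_split`) usually gives a more honest estimate than the in-sample figure.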

In [ ]:

Exercise 2.3¶

Fit a regression model with Age and Age$^2$ as features, using the OLS equations.
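Assuming "the OLS equations" refers to the normal equations $\hat{\beta} = (X^\top X)^{-1} X^\top y$, a minimal NumPy sketch could look like the following. The function name `ols_age_quadratic` is hypothetical, and the design matrix includes an intercept column as an assumption.

```python
import numpy as np


def ols_age_quadratic(age, income) -> np.ndarray:
    """Return [intercept, coef_age, coef_age_squared] from the normal equations."""
    age = np.asarray(age, dtype=float)
    income = np.asarray(income, dtype=float)
    # Design matrix with columns 1, Age, Age^2.
    X = np.column_stack([np.ones_like(age), age, age ** 2])
    # Solve (X'X) beta = X'y rather than forming the inverse explicitly.
    return np.linalg.solve(X.T @ X, X.T @ income)
```

With the notebook's data this would be called as `ols_age_quadratic(income['Age'], income['Income'])`. Solving the linear system is generally more stable numerically than computing `np.linalg.inv(X.T @ X)` directly.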

In [ ]:

Exercise 2.4¶

Estimate a regression using more features.

How does the performance compare with using only the Age?
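One way to make the comparison is sketched below, restricted to the numeric columns visible in the table above. The choice of features and the helper name `in_sample_mse` are assumptions. Categorical columns such as `Workclass` would first need encoding (for example with `pd.get_dummies`).

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Assumed feature set: the numeric columns shown in income.head().
NUMERIC_FEATURES = ['Age', 'Education-Num', 'Capital Gain', 'Hours per week']


def in_sample_mse(df: pd.DataFrame, features, target: str = 'Income') -> float:
    """Fit a linear regression on `features` and return its MSE on the same rows."""
    X = df[features]
    y = df[target]
    model = LinearRegression().fit(X, y)
    return mean_squared_error(y, model.predict(X))
```

Comparing `in_sample_mse(income, ['Age'])` with `in_sample_mse(income, NUMERIC_FEATURES)` shows how much the extra features reduce the error. Since the in-sample MSE of OLS cannot increase when features are added, a held-out split is the fairer comparison.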

In [ ]:

Exercise 2.5¶

Estimate a logistic regression to predict if a person is in the United States.

What is the performance of the model?

In [10]:

income['isUS'] = (income['Country'] == 'United-States')*1.0
income['isUS'].value_counts()

Out[10]:

1.0    29170
0.0     3391
Name: isUS, dtype: int64
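The counts above show that the classes are imbalanced: predicting "United States" for everyone would already be correct for 29170 of 32561 rows, roughly 89.6%. A sketch of one possible evaluation that reports accuracy next to this majority-class baseline follows. The helper name, the 70/30 split, and the feature scaling step are all assumptions.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def evaluate_is_us_model(df: pd.DataFrame, features, target: str = 'isUS', seed: int = 0):
    """Return (test accuracy, majority-class baseline accuracy) for a logistic regression."""
    X_train, X_test, y_train, y_test = train_test_split(
        df[features], df[target],
        test_size=0.3, random_state=seed, stratify=df[target],
    )
    # Standardize the features so the solver converges more easily.
    clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    clf.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, clf.predict(X_test))
    # Accuracy of always predicting the most frequent class in the test set.
    positive_share = y_test.mean()
    baseline = max(positive_share, 1 - positive_share)
    return accuracy, baseline
```

With the notebook's data this might be called as `evaluate_is_us_model(income, ['Age', 'Education-Num', 'Hours per week', 'Income'])`. If the accuracy is not clearly above the baseline, the model is adding little beyond the class frequencies.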

In [ ]: