Estimate a regression using the Income data
We'll be working with a dataset of US Census income (data dictionary).
Many businesses would like to personalize their offers based on a customer’s income. High-income customers could, for instance, be shown premium products. Because a customer’s income is not always explicitly known, a predictive model could estimate a person’s income from other information.
Our goal is to create a predictive model that outputs an estimate of a person’s income.
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt

# Read the CSV from the zip archive, using its first column as the index
import zipfile
with zipfile.ZipFile('../datasets/income.csv.zip', 'r') as z:
    f = z.open('income.csv')
    income = pd.read_csv(f, index_col=0)

income.head()
| Age | Workclass | fnlwgt | Education | Education-Num | Martial Status | Occupation | Relationship | Race | Sex | Capital Gain | Capital Loss | Hours per week | Country | Income |
What is the relationship between Age and Income?
For a one percent increase in Age, by how much does Income increase?
Using sklearn, estimate a linear regression and predict Income when Age is 30 and 40 years.
income.plot(x='Age', y='Income', kind='scatter')
[Figure: scatter plot of Income (y-axis) against Age (x-axis)]
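One possible way to approach the three questions above is sketched below. It is only an illustration: the `demo` DataFrame is synthetic stand-in data with made-up coefficients, not the Census file, so the numbers it prints say nothing about the real dataset. To use it on the exercise, `demo` would be swapped for `income`. Note that in a level-level linear model the effect of a one percent change in Age depends on where it is evaluated; the sketch assumes evaluation at the sample means, where the elasticity is slope × mean(Age) / mean(Income).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the `income` DataFrame (hypothetical values,
# NOT the Census data); replace `demo` with `income` for the exercise.
rng = np.random.default_rng(0)
age = rng.integers(18, 80, size=500)
demo = pd.DataFrame({
    "Age": age,
    "Income": 20000 + 500 * age + rng.normal(0, 5000, size=500),
})

X = demo[["Age"]]   # 2-D feature matrix, as sklearn expects
y = demo["Income"]  # 1-D target

model = LinearRegression().fit(X, y)
slope = model.coef_[0]  # change in Income per additional year of Age

# Predicted Income at Age 30 and Age 40
new_ages = pd.DataFrame({"Age": [30, 40]})
pred_30, pred_40 = model.predict(new_ages)

# Elasticity at the sample means: approximate percent change in Income
# for a one percent change in Age
elasticity = slope * X["Age"].mean() / y.mean()

print(f"slope: {slope:.2f}")
print(f"predicted Income at 30: {pred_30:.2f}, at 40: {pred_40:.2f}")
print(f"elasticity at the means: {elasticity:.3f}")
```

Passing the new ages as a DataFrame with the same column name as the training data keeps sklearn from warning about missing feature names.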
Estimate a regression using more features.
How does the performance compare with using only Age?
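A minimal sketch of such a comparison follows, assuming that "performance" is measured by mean squared error and R² on a held-out test set (the exercise does not say which metric to use). The data is again synthetic, and the choice of `Education-Num` and `Hours per week` as extra features is an illustrative assumption; only the column names come from the table above. The helper `evaluate` is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical coefficients, NOT the Census data)
rng = np.random.default_rng(1)
n = 1000
demo = pd.DataFrame({
    "Age": rng.integers(18, 80, size=n),
    "Education-Num": rng.integers(1, 17, size=n),
    "Hours per week": rng.integers(10, 70, size=n),
})
demo["Income"] = (
    5000
    + 400 * demo["Age"]
    + 2500 * demo["Education-Num"]
    + 300 * demo["Hours per week"]
    + rng.normal(0, 5000, size=n)
)

# Hold out 30% of the rows so both models are scored on unseen data
train, test = train_test_split(demo, test_size=0.3, random_state=0)

def evaluate(features):
    """Fit on the training rows and return (MSE, R^2) on the test rows."""
    model = LinearRegression().fit(train[features], train["Income"])
    pred = model.predict(test[features])
    return mean_squared_error(test["Income"], pred), r2_score(test["Income"], pred)

mse_age, r2_age = evaluate(["Age"])
mse_all, r2_all = evaluate(["Age", "Education-Num", "Hours per week"])

print(f"Age only:     MSE={mse_age:.0f}, R2={r2_age:.3f}")
print(f"All features: MSE={mse_all:.0f}, R2={r2_all:.3f}")
```

Scoring both models on the same held-out rows makes the comparison fair; categorical columns such as Workclass would need encoding (for example with `pd.get_dummies`) before they could be added.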
Estimate a logistic regression to predict if a person is in the United States.
What is the performance of the model?
income['isUS'] = (income['Country'] == 'United-States')*1.0
income['isUS'].value_counts()
1.0    29170
0.0     3391
Name: isUS, dtype: int64
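The counts above imply that always predicting `isUS = 1` would already be right for 29170 of 32561 rows, roughly 89.6%, so accuracy alone can look flattering. The sketch below shows one way a logistic regression might be evaluated against that kind of majority-class baseline, with ROC AUC as a second metric. It runs on synthetic data with a made-up relationship between `Education-Num` and `isUS`; the feature choice and every number it prints are assumptions for illustration, not results from the Census data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical relationship, NOT the Census data)
rng = np.random.default_rng(2)
n = 2000
demo = pd.DataFrame({
    "Age": rng.integers(18, 80, size=n),
    "Education-Num": rng.integers(1, 17, size=n),
})
logit = -0.5 + 0.25 * demo["Education-Num"]
demo["isUS"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(float)

features = ["Age", "Education-Num"]
X_train, X_test, y_train, y_test = train_test_split(
    demo[features], demo["isUS"],
    test_size=0.3, random_state=0, stratify=demo["isUS"],
)

# max_iter raised because the features are not scaled
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

accuracy = accuracy_score(y_test, clf.predict(X_test))
# Accuracy of always predicting the most frequent class in the test set
baseline = max(y_test.mean(), 1 - y_test.mean())
# AUC uses the predicted probability of the positive class
auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])

print(f"accuracy: {accuracy:.3f} (majority-class baseline: {baseline:.3f})")
print(f"ROC AUC:  {auc:.3f}")
```

With a target this imbalanced, a model is only informative if it beats the baseline accuracy or shows an AUC clearly above 0.5.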