# This will include the required code snippets from the previous lesson.
# 'requirements.py' should be provided in the Lesson10 folder.
%run requirements.py
For preprocessing, we are going to make a duplicate copy of our original dataframe, duplicating adult_df to adult_df_rev, because we want to perform some imputation for missing values. Note that a plain assignment (adult_df_rev = adult_df) would only create a second reference to the same dataframe, so we use copy() to get a true duplicate.
adult_df_rev = adult_df.copy()
Before doing that, we need some summary statistics of our dataframe. For this, we can use the describe() method, which generates various summary statistics, excluding NaN values.
We pass the "include" parameter with the value "all" to specify that we want summary statistics for all the attributes, categorical as well as numeric.
#Dataset basic statistics
adult_df_rev.describe(include='all')
Running the above command displays the basic statistics about the dataset. Spend some time looking at the details of each statistic provided. For categorical columns, describe() reports count, unique, top (the most frequent value), and freq.
Now, it's time to impute the missing values. Some of our categorical attributes contain missing values, marked as "?". We are going to replace each "?" with the value from the describe() output's "top" row, i.e., the most frequent value in that column. For example, we are going to replace the "?" values of the workclass attribute with the "Private" value.
# replace '?' in each categorical column with that column's most frequent value
for value in ['workclass', 'education',
              'marital_status', 'occupation',
              'relationship', 'race', 'sex',
              'native_country', 'income']:
    top_value = adult_df_rev.describe(include='all')[value]['top']
    adult_df_rev[value] = adult_df_rev[value].replace('?', top_value)
You have successfully performed the data imputation step. 🙂
# display and verify the changes in the adult_df_rev dataframe
#print(adult_df_rev)
For Gaussian Naive Bayes, we need to convert all the data values into one format. We are going to encode all the labels with a value between 0 and n_classes-1.
We are going to use LabelEncoder from the scikit-learn library to implement this encoding.
(One-hot encoders, by contrast, encode the data into a binary format; we will not need them here.)
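To make the 0 to n_classes-1 mapping concrete, here is a toy illustration (the example values are hypothetical, not taken from the dataset):
from sklearn import preprocessing

# a hypothetical list of category labels
demo_labels = ['Private', 'State-gov', 'Private', 'Never-worked']
le_demo = preprocessing.LabelEncoder()
print(le_demo.fit_transform(demo_labels))   # [1 2 1 0]
print(le_demo.classes_)                     # ['Never-worked' 'Private' 'State-gov']
Notice that the classes are sorted alphabetically and then numbered 0 through n_classes-1.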
The first thing we need to do is initialize the label encoder. Then we use fit_transform() to assign each feature's data to a new label.
# initialize label encoder
le = preprocessing.LabelEncoder()
# create and fit new column labels with existing data
# that fits in with the code in the cell below
# example: label_cat = le.fit_transform(dataframe.column_label)
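If you get stuck, one possible completion is sketched below; it assumes the *_cat variable names that the next cell expects:
# one possible completion: encode each categorical column with the label encoder
workclass_cat = le.fit_transform(adult_df_rev.workclass)
education_cat = le.fit_transform(adult_df_rev.education)
marital_cat = le.fit_transform(adult_df_rev.marital_status)
occupation_cat = le.fit_transform(adult_df_rev.occupation)
relationship_cat = le.fit_transform(adult_df_rev.relationship)
race_cat = le.fit_transform(adult_df_rev.race)
sex_cat = le.fit_transform(adult_df_rev.sex)
native_country_cat = le.fit_transform(adult_df_rev.native_country)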
After we have successfully created new labels for our columns, it is time to add the new columns to the copied dataframe.
# initialize the encoded categorical columns
# by assigning the newly created columns to the copied dataframe
adult_df_rev['workclass_cat'] = workclass_cat
adult_df_rev['education_cat'] = education_cat
adult_df_rev['marital_cat'] = marital_cat
adult_df_rev['occupation_cat'] = occupation_cat
adult_df_rev['relationship_cat'] = relationship_cat
adult_df_rev['race_cat'] = race_cat
adult_df_rev['sex_cat'] = sex_cat
adult_df_rev['native_country_cat'] = native_country_cat
Now that we have the updated data in the copied dataframe, we need to drop the old features so they do not conflict with the encoded ones.
# determine the features you want to drop by label
dummy_fields = ['workclass', 'education', 'marital_status',
'occupation', 'relationship', 'race',
'sex', 'native_country']
# drop the old categorical columns from the copied dataframe
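One way to complete this step is a single pandas drop call (a sketch; drop with axis=1 removes columns by label):
# drop the listed columns from the copied dataframe
adult_df_rev = adult_df_rev.drop(dummy_fields, axis=1)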
Using the above code snippets, we have created multiple categorical columns like "marital_cat", "race_cat", etc. You can see the top of the dataframe using adult_df_rev.head().
If you print adult_df_rev, you will see that the new columns were appended at the end, so the columns are no longer in a sensible order and should be reindexed.
For reindexing the columns, you can use the code snippet provided below:
adult_df_rev = adult_df_rev.reindex(['age', 'workclass_cat', 'fnlwgt', 'education_cat',
                                     'education_num', 'marital_cat', 'occupation_cat',
                                     'relationship_cat', 'race_cat', 'sex_cat', 'capital_gain',
                                     'capital_loss', 'hours_per_week', 'native_country_cat',
                                     'income'], axis=1)
adult_df_rev.head()
The output of the above code snippet shows that all the columns are now in the proper order. We pass the list of column names as a parameter and axis=1 to reorder the columns. (The older reindex_axis method was removed from recent versions of pandas; reindex with axis=1 does the same job.)
You should also notice that the columns we did not drop from the original dataframe are still present.
In addition, every value is now an integer, aside from the target column, income.
Now that all of the data values in our copied dataframe are numeric, we need to convert them to a single scale; in other words, we need to standardize the values.
We can use the formula below for standardization, where mean and std are the column's mean and standard deviation:
x_scaled = (x - mean) / std
# Standardization of Data
# outline column headers/feature names
num_features = ['age', 'workclass_cat', 'fnlwgt', 'education_cat', 'education_num',
                'marital_cat', 'occupation_cat', 'relationship_cat', 'race_cat',
                'sex_cat', 'capital_gain', 'capital_loss', 'hours_per_week',
                'native_country_cat']
# create a variable to collect each column's mean and std
scaled_features = {}
# for each feature/label/column
for each in num_features:
    # compute the mean and standard deviation of the column
    mean, std = adult_df_rev[each].mean(), adult_df_rev[each].std()
    # store the mean and std for later inspection
    scaled_features[each] = [mean, std]
    # replace the column with its standardized values using the
    # feature's mean, std and the standardization formula displayed above
    adult_df_rev.loc[:, each] = (adult_df_rev[each] - mean) / std
We have converted our data values into standardized values. You can print and check the output of the adult_df_rev dataframe to verify.
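As a quick sanity check (a minimal sketch reusing num_features from above), every standardized column should now have a mean close to 0 and a standard deviation close to 1:
# means should be ~0 and standard deviations ~1 after standardization
print(adult_df_rev[num_features].mean().round(6))
print(adult_df_rev[num_features].std().round(6))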
Let’s split the dataset into training and test sets. We can easily perform this step using sklearn’s train_test_split() method.
# select and assign the predictor features (first 14 columns)
features = adult_df_rev.values[:, :14]
# select and assign the target (the income column)
target = adult_df_rev.values[:, 14]
# use train_test_split to split data into training and testing sets
features_train, features_test, target_train, target_test = train_test_split(features, target, test_size = 0.33, random_state = 10)
Using the above code snippet, we have divided the data into a feature set and a target set. The feature set consists of 14 columns, i.e., the predictor variables, and the target set consists of 1 column with the class values.
features_train & target_train contain the training data, and features_test & target_test contain the testing data.
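You can verify the split with a quick shape check (a sketch): with test_size=0.33, roughly one third of the rows should land in the test set:
# confirm the train/test row counts match the 67/33 split
print(features_train.shape, features_test.shape)
print(target_train.shape, target_test.shape)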
After completing the data preprocessing, it’s time to implement a machine learning algorithm on it. We are going to use sklearn’s GaussianNB module.
# Initialize the model
clf = GaussianNB()
# fit the training data
clf.fit(features_train, target_train)
# predict the target values for the test features
target_pred = clf.predict(features_test)
We have built a GaussianNB classifier and trained it on the training data, using the fit() method. After building the classifier, our model is ready to make predictions: we use the predict() method with the test-set features as its parameter.
# score the predictions against the true test labels
accuracy_score(target_test, target_pred, normalize = True)
Awesome! Our model is returning an accuracy of ~80%.
This is not bad with a simple implementation, and could easily be made more effective.
You can create random test datasets and test the model to see how well the trained Gaussian Naive Bayes model is performing.
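For example, here is a minimal sketch of that idea: re-split the data with several different random seeds and check that the accuracy stays stable (it reuses the features and target variables defined above):
# retrain and score the model across several random train/test splits
for seed in [1, 2, 3, 4, 5]:
    f_train, f_test, t_train, t_test = train_test_split(
        features, target, test_size=0.33, random_state=seed)
    clf_seed = GaussianNB().fit(f_train, t_train)
    print(seed, accuracy_score(t_test, clf_seed.predict(f_test)))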