Using income data of the adult US population, we want to classify who earns more than 50k USD and who does not.
General imports needed
import time
import pandas as pd
import logging
import pprint
import plotly
plotly.offline.init_notebook_mode(connected=True)
Pailab specific imports
from pailab import MLRepo, MeasureConfiguration, FIRST_VERSION, LAST_VERSION
Basic configuration
logging.basicConfig(level=logging.FATAL)
In this notebook we use the adult data from the 1994 census, which can be found in the UCI Machine Learning Repository or on Kaggle. The data set consists of 48842 observations, each with 14 attributes. The table contains both qualitative attributes (e.g. workclass, education) and quantitative attributes.
data = pd.read_csv('adult.csv')
Some basic adjustments, which are not part of the preprocessing
# encode the target as 0 (<=50K) or 1 (>50K)
data['income']=data['income'].map({'<=50K': 0, '>50K': 1, '<=50K.': 0, '>50K.': 1})
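The dictionary above lists each label twice because the labels also occur with a trailing period. A label that is missing from the dictionary is silently turned into NaN by `Series.map`. The following is a minimal sketch of how one might check for that; the rows are made up for illustration and are not taken from the real data set.

```python
import pandas as pd

# Toy labels (invented for illustration) covering both spellings
# plus one unexpected value with a leading space.
labels = pd.Series(['<=50K', '>50K', '<=50K.', '>50K.', ' >50K'])

income_map = {'<=50K': 0, '>50K': 1, '<=50K.': 0, '>50K.': 1}
encoded = labels.map(income_map)

# Series.map returns NaN for values that are not keys of the dictionary.
unmapped = labels[encoded.isna()].unique().tolist()
print(encoded.tolist())
print('unmapped labels:', unmapped)
```

If the list of unmapped labels is not empty, the dictionary (or the raw strings) needs to be cleaned up before training.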
Convert the categorical features into dummy variables.
categorical_columns = ['workclass', 'education', 'marital.status', 'occupation', 'relationship', 'race', 'sex', 'native.country']
data = pd.get_dummies(data, columns=categorical_columns)
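To make the effect of `pd.get_dummies` concrete, here is a small sketch on a toy frame (invented values, not the census data). It shows how the indicator columns are named, which explains column names such as `workclass_Private` used later in the column selection.

```python
import pandas as pd

# Toy frame (invented for illustration), not the real census data.
toy = pd.DataFrame({
    'age': [39, 50, 38],
    'workclass': ['Private', '?', 'Private'],
})

encoded = pd.get_dummies(toy, columns=['workclass'])

# Each category becomes its own indicator column named <column>_<category>;
# the original categorical column is dropped.
print(list(encoded.columns))
print(int(encoded['workclass_Private'].sum()))
```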
We first create a new repository for our task. The repository is the central object around which all functionality is built. Similar to a repository used for source control in classical software development, it contains all data and algorithms needed for the machine learning task. The repository needs storages, as the configuration further down shows: one for the repository objects (repo_store) and one for the large numerical data (numpy_store).
To keep things simple, we start with in-memory storages. Note that the in-memory storages are not a good choice for anything beyond testing and playing around, since everything is lost when the session ends.
In addition to the storages, the repository needs a reference to a JobRunner, which the platform uses to execute machine learning jobs. For this example we use the simplest one, which executes everything sequentially in the same thread the repository runs in.
# setting up the repository
config = None
# set this to True if you would like to test the git_handler together with a
# hdf handler for the big data parts instead of the memory handler
if False:
    config = {
        'user': 'test_user',
        'workspace': 'c:/temp',
        'repo_store': {
            'type': 'git_handler',
            'config': {
                'folder': 'c:/temp',
                'file_format': 'pck'
            }
        },
        'numpy_store': {
            'type': 'hdf_handler',
            'config': {
                'folder': 'c:/temp/hdf',
                'version_files': True
            }
        }
    }
ml_repo = MLRepo(user='test_user', config=config)
job_runner = ml_repo._job_runner
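The idea of a job runner that executes everything sequentially in the calling thread can be sketched in a few lines. This is not pailab's implementation: the class `SequentialJobRunner` and its methods `add` and `get_info` are hypothetical names chosen for illustration only.

```python
import itertools


class SequentialJobRunner:
    """Hypothetical sketch: runs each job immediately in the calling thread."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._info = {}

    def add(self, name, func, *args):
        """Run func(*args) right away and return a job id for later lookup."""
        job_id = next(self._ids)
        try:
            result = func(*args)
            self._info[job_id] = {'name': name, 'status': 'finished', 'result': result}
        except Exception as exc:  # keep the failure so the caller can inspect it
            self._info[job_id] = {'name': name, 'status': 'failed', 'error': repr(exc)}
        return job_id

    def get_info(self, job_id):
        """Return the stored status record of a job."""
        return self._info[job_id]


runner = SequentialJobRunner()
ok_id = runner.add('square', lambda x: x * x, 7)
bad_id = runner.add('divide', lambda x: 1 / x, 0)
print(runner.get_info(ok_id))
print(runner.get_info(bad_id)['status'])
```

Because the job has already finished when `add` returns, the status can be queried right away. A runner that works in the background would return the id first and fill in the status later.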
To navigate easily over all objects, one can add a so-called tree to the repository. The tree allows one to use auto-completion to access objects and their respective methods.
from pailab.tools.tree import MLTree
MLTree.add_tree(ml_repo)
The data in the repository is handled by two different data objects, RawData and DataSet.
Normally one adds RawData first and then defines DataSets, which are used to train or test a model. This is exactly the approach shown in the following. The target variable is 'income'. By default, all other variables are used as input variables.
# Add RawData. A convenient way to do this is the add method of the raw_data collection.
# It takes a pandas dataframe and the specification of which columns belong to the input
# and which to the targets.
ml_repo.tree.raw_data.add('adult_census_data', data, target_variables=['income'])
# based on the raw data we now define training and test sets
ml_repo.tree.training_data.add('training_sample', ml_repo.tree.raw_data.adult_census_data(), 0, 1000)
ml_repo.tree.test_data.add('test_sample', ml_repo.tree.raw_data.adult_census_data(), 1001, None)
ml_repo.tree.training_data.training_sample.history()
When creating the DataSet we have to provide two important pieces of information for the repository, given as a dictionary:
Some may wonder what is now stored in version_list. **When an object is added (regardless of whether it is a data object or some other object such as a parameter), it gets a version number. No object is ever removed; adding just creates a new version.** The add method returns a dictionary of the object names together with their version numbers.
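The versioning rule described above (adding never removes anything, it only creates a new version) can be illustrated with a small append-only store. This is a sketch only: `VersionedStore` is a hypothetical class, and the integer version numbers are a simplification; pailab's actual version identifiers may look different.

```python
class VersionedStore:
    """Hypothetical append-only store: adding never overwrites or removes."""

    def __init__(self):
        self._versions = {}  # object name -> list of stored versions

    def add(self, name, obj):
        """Append obj as a new version and return {name: version_number}."""
        history = self._versions.setdefault(name, [])
        history.append(obj)
        return {name: len(history) - 1}

    def get(self, name, version=-1):
        """Return a stored version; the default -1 means the latest one."""
        return self._versions[name][version]


store = VersionedStore()
first = store.add('model_param', {'max_iter': 100})
second = store.add('model_param', {'max_iter': 1000})
print(first, second)
print(store.get('model_param', 0), store.get('model_param'))
```

The first version stays retrievable after the second one has been added, which is what makes it possible to reproduce earlier results.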
The next step is to define a model that will be used in the repository. A model consists of the following pieces
**SKLearn models as an example**
We do not have to define the pieces described above ourselves if we use sklearn. Instead we can use the pailab.externals.sklearn_interface module, which wraps the sklearn package so that it can be used within the repository. This interface provides a simple method (add_model) to add an arbitrary sklearn model as a model that can be handled by the repository. This method adds a bunch of repo objects to the repository (according to the pieces described above):
In the following we use a LogisticRegression as our model, preceded by a column selection and a StandardScaler as preprocessors.
# import pailab.externals.pandas_interface as pandas_interface
# categorical_columns = ['workclass', 'education', 'marital.status', 'occupation', 'relationship', 'race', 'sex', 'native.country']
# pandas_interface.add_preprocessor(ml_repo, pd.get_dummies, preprocessor_name='Pandas_Get_Dummies', preprocessor_param={'columns': categorical_columns})
import pailab.externals.numpy_interface as numpy_interface
select_columns = [
    'age', 'fnlwgt', 'education.num', 'capital.gain', 'capital.loss', 'hours.per.week',
    'workclass_?', 'workclass_Local-gov', 'workclass_Private', 'workclass_Self-emp-not-inc',
    'education_HS-grad', 'education_Some-college', 'education_Bachelors', 'education_Masters',
    'marital.status_Divorced', 'marital.status_Married-civ-spouse', 'marital.status_Never-married', 'marital.status_Separated',
    'occupation_Prof-specialty', 'occupation_Craft-repair', 'occupation_Exec-managerial', 'occupation_Adm-clerical',
    'relationship_Husband', 'relationship_Not-in-family', 'relationship_Own-child', 'relationship_Unmarried',
    'race_Asian-Pac-Islander', 'race_Black', 'race_White', 'sex_Female',
    'native.country_United-States', 'native.country_Mexico', 'native.country_?', 'native.country_Philippines',
]
numpy_interface.add_preprocessor_select_columns(ml_repo, preprocessor_name='Numpy_Select_Columns',
                                                preprocessor_param={'columns': select_columns})
import pailab.externals.sklearn_interface as sklearn_interface
from sklearn.preprocessing import StandardScaler
sklearn_interface.add_preprocessor(ml_repo, StandardScaler(), preprocessor_name='SKLStandardScaler')
from sklearn.linear_model import LogisticRegression
sklearn_interface.add_model(ml_repo, LogisticRegression(), preprocessors=['Numpy_Select_Columns', 'SKLStandardScaler'], model_param={'max_iter': 1000})
Model training is now very simple, since you have defined training and test data, methods to evaluate and fit your model, and the model parameters. You can just call run_training on the repository, and the training is performed automatically. The training job is executed via the JobRunner you specified when setting up the repository. All methods of the repository that involve jobs return the job id when adding the job to the JobRunner, so that you can check the status of the task and see whether it finished successfully.
job_id = ml_repo.run_training()
print(job_id)
job_info = job_runner.get_info(job_id[0], job_id[1])
To measure errors and to provide plots the model must be evaluated on all test and training datasets.
ml_repo.tree.models.LogisticRegression.model.load()
ml_repo.tree.models.LogisticRegression.model.obj.to_dict()
job_id = ml_repo.run_evaluation()
# print information about the job
info = job_runner.get_info(job_id[0][0], job_id[0][1])
print(str(info))
ml_repo.add_measure(MeasureConfiguration.MAX)
ml_repo.add_measure(MeasureConfiguration.R2)
job_ids = ml_repo.run_measures()
ml_repo.tree.models.LogisticRegression.measures.test_sample.r2.load()
print(str(ml_repo.tree.models.LogisticRegression.measures.test_sample.r2.obj.value))
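For orientation, the two measures can be written out in plain Python. The sketch assumes that MAX denotes the maximum absolute error and that R2 is the usual coefficient of determination; how pailab computes them internally is not shown in this notebook. The function names and the toy values are invented for illustration.

```python
def max_abs_error(y_true, y_pred):
    """Largest absolute deviation between prediction and target."""
    return max(abs(t - p) for t, p in zip(y_true, y_pred))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy targets and predictions (invented for illustration).
y_true = [0, 1, 0, 1]
y_pred = [0, 1, 1, 1]
print(max_abs_error(y_true, y_pred))
print(r2_score(y_true, y_pred))
```

With 0/1 class labels the maximum absolute error is 1 as soon as a single sample is misclassified, so R2 is the more informative of the two here.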
Introducing a new version of the preprocessor to the repository.
sklearn_interface.add_preprocessor(
ml_repo, StandardScaler(), preprocessor_name='SKLStandardScaler', preprocessor_param={'with_mean': False})
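What the changed parameter does can be seen in a small sketch. With with_mean=False the scaler no longer subtracts the mean and only divides by the standard deviation. The helper below is written in plain Python for illustration and is not sklearn code; `standardize` and the toy values are assumptions.

```python
import statistics


def standardize(values, with_mean=True):
    """Scale to unit variance; subtract the mean first only if with_mean is True."""
    mean = statistics.fmean(values) if with_mean else 0.0
    std = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]


hours = [20.0, 40.0, 60.0]  # toy values, invented for illustration
centered = standardize(hours, with_mean=True)
uncentered = standardize(hours, with_mean=False)
print(centered)
print(uncentered)
```

Both results have unit variance, but only the first one is centered around zero.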
After changing the preprocessor parameter, we use the run method of the checker submodule to look for open tasks and inconsistencies. The method is called on the repository. If nothing else is specified, all labeled models are checked. Applying the method to the latest model, the output shows that the model's latest version was calibrated with a version of its inputs that differs from the current one.
import pailab.tools.checker as checker
results = checker.run(ml_repo)
pprint.pprint(results)
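The kind of consistency check described above can be pictured as a comparison between the versions a model was trained with and the latest versions in the repository. The following sketch illustrates the idea only; the function `find_outdated_inputs` and the version bookkeeping are hypothetical and do not reflect how pailab's checker is implemented.

```python
def find_outdated_inputs(trained_with, latest):
    """Return inputs whose latest version differs from the one used in training."""
    return {
        name: {'trained_with': used, 'latest': latest[name]}
        for name, used in trained_with.items()
        if latest.get(name, used) != used
    }


# Hypothetical version bookkeeping, invented for illustration.
trained_with = {'SKLStandardScaler': 0, 'training_sample': 0, 'model_param': 0}
latest = {'SKLStandardScaler': 1, 'training_sample': 0, 'model_param': 0}

issues = find_outdated_inputs(trained_with, latest)
print(issues)
```

Retraining against the latest versions makes the reported issues disappear, which is what the second checker run at the end of this notebook verifies.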
# we use run_descendants so that the issues with the measures are resolved too
job_id = ml_repo.run_training(run_descendants=True)
print(job_id)
results = checker.run(ml_repo)
pprint.pprint(results)