#!/usr/bin/env python
# coding: utf-8

# ## Step-by-step TDD in a data science task
# 
# If you are interested in a longer introduction [click here]()
# 
# I took an example dataset from Kaggle, the [House Prices dataset](https://www.kaggle.com/c/house-prices-advanced-regression-techniques). It is sufficiently easy and fun data, just right to pass the imaginary test of 'tutorial on TDD for analysis'.
# 
# Since it was a csv file, I started by reading the data with Pandas. The first thing I wanted to check is whether there are NaN/NULL values.

# In[1]:


get_ipython().run_line_magic('pylab', 'inline')
import pandas as pd

train = pd.read_csv('train.csv', index_col=['Id'])
train.head()


# In[2]:


sum_nulls = train.isnull().sum()
datatypes = train.dtypes
summary = pd.DataFrame(dict(SumOfNulls=sum_nulls, DataTypes=datatypes))
summary.sort_values(by='SumOfNulls', ascending=False)


# Yep, there are. Missing cells are usually inherent to data; if you don't find any in yours, it is probably time to start being suspicious. Nevertheless, it is a good way to start the test-driven part: come up with a test that checks whether there are NaN values in the data before we go on analysing it. It is also easy, and since missing data will not make sense for several model types, it is an insanity test. Usually, I first write the test quick and dirty in the notebook.

# In[3]:


def test_no_nan_values(data):
    assert not data.isnull().any(axis=None)

test_no_nan_values(train)


# Good, it failed. Also note that at this point I have no docstrings and this is not a proper unittest/pytest case either. It's a test. And it fails. That was the purpose of the first stage, so let's move on to the next.
# 
# Missing values are tricky. Pandas is schemaless, which makes prototyping easy but does not help in this case: e.g. missing values are filled with np.nan even in otherwise string fields. This is an intentional feature of pandas that makes it easier to use some generic DataFrame functions, but in my case I don't really want it to happen. Let's see what we can do with NaN values:
# 
# 1. Add 'Unknown' as a separate string to replace NaNs, so we can use these fields later. - What should be the limit of NaN values up to which we apply this strategy? Obviously you don't want to spend too much energy in the first iteration on columns which have 95% missing values. So 90? 75? This should be tested... So this is probably not the easiest solution to pass the test.
# 
# 2. Predict the missing values. - This would take lots of models and maybe different strategies for different fields. Again, not what I would call an easy solution.
# 
# 3. Omit columns with missing values. - Since I want to pass the test first with the least effort (and, from the higher-level perspective, bring results as soon as possible), I will choose this.
# 
# I looked at the test set and, because even more NaN values were present there, I decided to take the union of the columns with missing values in the two datasets.
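# I didn't keep that little exploration as a cell, but a minimal sketch of computing that union could look roughly like this (variable names are illustrative; it assumes the raw Kaggle csv files):
# 
# ```python
# import pandas as pd
# 
# raw_train = pd.read_csv('train.csv', index_col=['Id'])
# raw_test = pd.read_csv('test.csv', index_col=['Id'])
# 
# # columns with at least one NaN in either file
# nan_cols = set(raw_train.columns[raw_train.isnull().any()]) | \
#            set(raw_test.columns[raw_test.isnull().any()])
# 
# # fully populated feature columns (the target only exists in the train file)
# keep_cols = [col for col in raw_train.columns
#              if col not in nan_cols and col != 'SalePrice']
# ```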
# Also, since I want to do the same operations on the two sets, I decided to quickly create a common class for them.

# In[4]:


class HousePriceData(object):

    def __init__(self, filename):
        self.X_cols = ['MSSubClass', 'LotArea', 'Street', 'LotShape', 'LandContour',
                       'LotConfig', 'LandSlope', 'Condition1', 'Condition2', 'BldgType',
                       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
                       'RoofStyle', 'RoofMatl', 'ExterQual', 'ExterCond', 'Foundation',
                       'Heating', 'HeatingQC', 'CentralAir', '1stFlrSF', '2ndFlrSF',
                       'LowQualFinSF', 'GrLivArea', 'FullBath', 'HalfBath', 'BedroomAbvGr',
                       'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'PavedDrive', 'WoodDeckSF',
                       'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea',
                       'MiscVal', 'MoSold', 'YrSold', 'SaleCondition']
        self.y_col = 'SalePrice'
        data = pd.read_csv(filename, index_col=['Id'])
        self.X = data[self.X_cols]
        if self.y_col in data.columns:
            self.y = np.log(data[self.y_col])
        else:
            self.y = None

train = HousePriceData('train.csv')
test = HousePriceData('test.csv')

test_no_nan_values(train.X), test_no_nan_values(train.y), test_no_nan_values(test.X)


# Good, the test passed. Now it's time to refactor. To this end I created two python files, one for my class and one for the insanity tests. I then transformed the small test above into a proper unit test case and rechecked whether, after the transformation, it still fails for the raw data and passes for my transformed data. I was left with 44 features out of the 80+, which I considered enough to go further.
# 
# So far I have only tested insanity. It is the right time to start sanity testing as well, because it helps tremendously with staying focused and keeping the bigger picture in mind. For sanity testing, I have preferred BDD frameworks for a while (e.g. [pytest-bdd](https://github.com/pytest-dev/pytest-bdd) or [behave](https://behave.readthedocs.io/en/latest/)). The good thing about BDD is that it connects the tests to a human-readable description of the testing steps written in Gherkin. This is useful because the tests are often actual requirements from the client, and since the description is easily readable for non-programmers, it helps collecting external input on the solution. And as the developer, I really want to keep my development process in sync with the client's expectations.
# 
# The first sanity test I'm writing is going to check whether any solution we produce is better than the arithmetic average of the house prices. The Gherkin description for this looks like this:

# ```gherkin
# Feature: Sanity of the model
# 
#   Scenario: Better than the simple average
#     Given the trained model
#     When I use the arithmetic average of the outcome as a reference
#     And the rmse of the prediction of the arithmetic average on the validation data is the reference RMSE
#     And the rmse of the prediction with the trained_model on the validation data is our RMSE
#     Then we see that our RMSE is better than the reference RMSE
# 
# ```

# Of course there is a proper .py file attached to this nice description.
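# The real step definitions live in the repository; a minimal sketch of what such a file could look like with pytest-bdd is below. The feature filename, the `train_data`/`validation_data` fixtures and the function names are illustrative, and the exact wiring (in particular `target_fixture`) assumes a reasonably recent pytest-bdd version.
# 
# ```python
# import numpy as np
# from pytest_bdd import scenario, given, when, then
# from sklearn.linear_model import LinearRegression
# from sklearn.metrics import mean_squared_error
# 
# 
# @scenario('sanity.feature', 'Better than the simple average')
# def test_better_than_the_simple_average():
#     pass
# 
# 
# @given('the trained model', target_fixture='trained_model')
# def trained_model(train_data):
#     # 'train_data' is assumed to be a fixture returning a HousePriceData instance
#     return LinearRegression().fit(train_data.X, train_data.y)
# 
# 
# @when('I use the arithmetic average of the outcome as a reference', target_fixture='reference')
# def reference(validation_data):
#     return np.repeat(np.mean(validation_data.y), len(validation_data.y))
# 
# 
# @when('the rmse of the prediction of the arithmetic average on the validation data is the reference RMSE',
#       target_fixture='reference_rmse')
# def reference_rmse(reference, validation_data):
#     return np.sqrt(mean_squared_error(validation_data.y, reference))
# 
# 
# @when('the rmse of the prediction with the trained_model on the validation data is our RMSE',
#       target_fixture='our_rmse')
# def our_rmse(trained_model, validation_data):
#     return np.sqrt(mean_squared_error(validation_data.y, trained_model.predict(validation_data.X)))
# 
# 
# @then('we see that our RMSE is better than the reference RMSE')
# def rmse_is_better(our_rmse, reference_rmse):
#     assert our_rmse < reference_rmse
# ```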
# Also, this fine textual description is probably not your first iteration of the actual sanity test; maybe your first iteration is something like this:

# In[5]:


from sklearn.metrics import mean_squared_error
from sklearn.dummy import DummyRegressor

def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

def test_better_than_average(model, X, y):
    reference_score = rmse(y, np.repeat(np.mean(y), len(y)))
    y_pred = model.predict(X)
    our_score = rmse(y, y_pred)
    assert our_score < reference_score

stupid_model = DummyRegressor(strategy='constant', constant=1)  # no matter what, the predicted price of the house is 1
stupid_model = stupid_model.fit(train.X, train.y)  # this model is so stupid that despite fitting it remains dumb
test_better_than_average(stupid_model, train.X, train.y)


# Hm. I expected my test to fail, but not this way. This error actually means that my data contains non-numeric values. Although some models handle those, let's stick to numeric values for now. So I convert the data to numeric by vectorizing all text columns. But this is not as easy as calling `pd.get_dummies()` and bamm: the problem is that some categories (note, vectorizing only makes sense for categorical variables) may be present only in the train or only in the test data. So if you simply call `pd.get_dummies()` on each set, you get data of different shapes (a minimal sketch of one way to keep the dummy columns aligned follows a bit further below).
# 
# Now it's clear that I want to test that there are only numeric values in my data, so I added a test for that. To make it pass there are multiple options:
# - Use a database with a schema. - That's the good and proper way to do it, but it also takes some time to set up the schema properly for these 80-odd columns, so if there is another way, then no.
# - Use a dictionary in Python, iterate over the columns and set up the categorical values one by one. - A nightmare to maintain. It also means replicating functionality which has already been written by several people in several ways.
# - Use one of the already existing tools. - I opted for this and ran `pip install tdda`, a nice package that checks the values and validates the files. The str --> categorical conversion is not built in, but the generated JSON constraints file contains the list of allowed values, so I just added a simple function that uses it for this purpose.

# In[6]:


from data_analysis import HousePriceData
from pandas.api.types import is_numeric_dtype

def test_if_all_numeric(data):
    # .values only has a numeric dtype if every column is numeric
    assert is_numeric_dtype(data.values)

constraints_filename = 'house_prices_constraints_mod.tdda'

train = HousePriceData('train.csv', constraints=constraints_filename)
test = HousePriceData('test.csv', constraints=constraints_filename)

test_if_all_numeric(train.X), test_if_all_numeric(train.y), test_if_all_numeric(test.X)


# Now we are ready to fail our first sanity test for real. Let's see:

# In[7]:


test_better_than_average(stupid_model, train.X, train.y)


# Good, it failed. Now let's pass the test. It is tempting to grab XGBoost and call the task done, but not so fast. An XGBoost model would surely pass the test, but so would many other models, e.g. a Linear Regression. Not that fancy, but a lot simpler, which in this case means it is just enough to pass the test. Remember, when you pass a test, do not only think about the code that you yourself have to write, but also take into account the overall complexity of your solution.
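# As promised above, here is a minimal sketch of one way to keep the train and test dummies aligned. This is not the code from the repository, just an illustration; `raw_train`, `raw_test` and `categorical_cols` are placeholders.
# 
# ```python
# train_dummies = pd.get_dummies(raw_train[categorical_cols])
# test_dummies = pd.get_dummies(raw_test[categorical_cols])
# 
# # reindex the test dummies onto the train columns: categories missing from the
# # test set become all-zero columns, categories unseen in training are dropped
# test_dummies = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
# ```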
# So let's see if the Linear Regression passes the test.

# In[8]:


from sklearn.linear_model import LinearRegression

model = LinearRegression()
model = model.fit(train.X, train.y)
test_better_than_average(model, train.X, train.y)


# Without problems. In the refactoring step I consolidated the HousePriceData class, set up the feature file with the Gherkin description and added the necessary pytest files (I used pytest-bdd for this project). The next step is to have something better than Linear Regression. But actually, you don't want 'just' better, you probably want significantly better. Otherwise, how would you explain to the client why they should buy a model which predicts based on thousands of weights instead of fewer than a hundred? So when I fail my second sanity test, it is going to be because I want any complex method to have an RMSE below 75% of the Linear Regression's.
# 
# For these steps I also need a proper train and validation split, so I split the data and saved the bigger part as train_short and the smaller part as validation.

# In[9]:


from sklearn.ensemble import ExtraTreesRegressor

def test_better_than_linear_regression(model, X_train, y_train, X_validation, y_validation, percent=75):
    lm = LinearRegression().fit(X_train, y_train)
    reference_score = rmse(y_validation, lm.predict(X_validation)) * percent / 100
    y_pred = model.predict(X_validation)
    our_score = rmse(y_validation, y_pred)
    assert our_score < reference_score

train = HousePriceData('train_short.csv', constraints=constraints_filename)
validation = HousePriceData('validation.csv', constraints=constraints_filename)

model = ExtraTreesRegressor()
model = model.fit(train.X, train.y)

test_better_than_linear_regression(model, train.X, train.y, validation.X, validation.y, percent=75)


# Of course it failed. Actually, if you look at the results you can see that the Linear Regression already explains about 70% of the variance, which is pretty good. After some hyperparameter tuning I found an XGBoost model that passes this test.

# In[ ]:


from xgboost import XGBRegressor

model = XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
                     colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
                     max_depth=4, min_child_weight=1, missing=None, n_estimators=400,
                     n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
                     reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=42,
                     silent=True, subsample=0.7)
model = model.fit(train.X, train.y)

test_better_than_linear_regression(model, train.X, train.y, validation.X, validation.y, percent=75)


# The next step was again a refactoring: I took the 'test against a reference model' logic and organized it into shared functions to make the whole testing logic more robust.
# 
# So here I was with a working model, already able to answer whether it is better than an average prediction or a simple linear regression. I also had a model very early on: by starting off without handling NULL values in sophisticated ways, I had a model just ~2 hours after I downloaded the data, note taking included. I would have been able to stop at basically any time after that and still have a working model, together with a list of steps that I planned to include.
# 
# I uploaded the code to github, including the state after refactoring. In the repository you can also see that I went on to add two more sanity tests and implemented solutions to pass them. In the first one, the passing criterion was an RMSE that would have been enough to land a submission in the middle of the Kaggle leaderboard (~0.13 RMSE).
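# The exact version lives in the repository, but a minimal sketch of such a threshold test could look like this (the function name and the hard-coded 0.13 are illustrative; `rmse` is the helper defined earlier, and the targets are already on the log scale, matching the Kaggle evaluation metric):
# 
# ```python
# def test_good_enough_for_kaggle_midfield(model, X_validation, y_validation, threshold=0.13):
#     assert rmse(y_validation, model.predict(X_validation)) < threshold
# ```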
# This was easy to fail and hard to pass :). I tried combining models, searching hyperparameters and data transformations. In the end, I had to go back to the missing-value problem, add the fields that had only a small number of missing values back to the data, and handle the imputation. In the next step, I tested whether I was overfitting the training data. Overfitting is a serious problem for a production model: just imagine how much you would underperform on newly collected data if you are overfitting a test set that you split from the data you already had... I stopped after this step, but I had to fight the urge to refactor, because by now my tests ran for 8 seconds, which is way too much for TDD. That step is left for the next iteration.
# 
# In sum, I think it is quite clear at this point how useful it is to do TDD data analysis. Of course, at first you may feel that writing the tests is just extra time, and that of course your model is better than the average. I'm sure you are right, but sometimes all of us make mistakes, and tests help reduce their effect. Tests also often live longer than the version of the code that you ended up delivering. Maybe later someone takes over, and you want to make sure that his/her approach is not only different from yours but better, or at least as good - so that it passes your insanity and sanity tests.
# 
# Also, TDD helps keeping focus. Data analysis, cleaning, imputing and modeling are all huge areas. Let's say you impute the missing values with zero, but then you think maybe you should have predicted those; then you think that if you predict them, maybe you can even 'generate' extra data for training; then you start to think about how you would test the effectiveness of such data generation... and you find yourself in the middle of the forest with absolutely no idea how to get back to your original goal. Many effective individuals I know follow similar practices, and what is more important, I see that working in this framework makes teamwork so much more effective and fun.
# 
# Try it and let me know how it works in your practice!

# In[10]:


get_ipython().system('pytest -vv --gherkin-terminal-reporter --gherkin-terminal-reporter-expanded -p no:warnings')


# ```
# << FIN >>
# ```

# In[ ]: