Step-by-step TDD in a data science task

If you are interested in a longer introduction, click here.

I took an example dataset from Kaggle, the House Prices dataset. It is sufficiently easy and fun to work with, and just right to pass the imaginary test of 'tutorial on TDD for analysis'.

Since it was a CSV file, I started by reading the data with pandas. The first thing I wanted to check was whether there are any NaN/NULL values.

In [1]:
%pylab inline
import pandas as pd

train = pd.read_csv('train.csv', index_col=['Id'])
train.head()
Populating the interactive namespace from numpy and matplotlib
Out[1]:
MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities LotConfig ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
Id
1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub Inside ... 0 NaN NaN NaN 0 2 2008 WD Normal 208500
2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub FR2 ... 0 NaN NaN NaN 0 5 2007 WD Normal 181500
3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub Inside ... 0 NaN NaN NaN 0 9 2008 WD Normal 223500
4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub Corner ... 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000
5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub FR2 ... 0 NaN NaN NaN 0 12 2008 WD Normal 250000

5 rows × 80 columns

In [2]:
sum_nulls = train.isnull().sum()
datatypes = train.dtypes

summary = pd.DataFrame(dict(SumOfNulls = sum_nulls, DataTypes = datatypes))
summary.sort_values(by='SumOfNulls', ascending=False)
Out[2]:
SumOfNulls DataTypes
PoolQC 1453 object
MiscFeature 1406 object
Alley 1369 object
Fence 1179 object
FireplaceQu 690 object
LotFrontage 259 float64
GarageYrBlt 81 float64
GarageCond 81 object
GarageType 81 object
GarageFinish 81 object
GarageQual 81 object
BsmtExposure 38 object
BsmtFinType2 38 object
BsmtCond 37 object
BsmtQual 37 object
BsmtFinType1 37 object
MasVnrArea 8 float64
MasVnrType 8 object
Electrical 1 object
MSSubClass 0 int64
Fireplaces 0 int64
Functional 0 object
KitchenQual 0 object
KitchenAbvGr 0 int64
BedroomAbvGr 0 int64
HalfBath 0 int64
FullBath 0 int64
BsmtHalfBath 0 int64
TotRmsAbvGrd 0 int64
GarageCars 0 int64
... ... ...
HouseStyle 0 object
BldgType 0 object
Condition2 0 object
Condition1 0 object
LandSlope 0 object
2ndFlrSF 0 int64
LotConfig 0 object
Utilities 0 object
LandContour 0 object
LotShape 0 object
Street 0 object
LotArea 0 int64
YearBuilt 0 int64
YearRemodAdd 0 int64
RoofStyle 0 object
RoofMatl 0 object
Exterior1st 0 object
Exterior2nd 0 object
ExterQual 0 object
ExterCond 0 object
Foundation 0 object
BsmtFinSF1 0 int64
BsmtFinSF2 0 int64
BsmtUnfSF 0 int64
TotalBsmtSF 0 int64
Heating 0 object
HeatingQC 0 object
MSZoning 0 object
1stFlrSF 0 int64
SalePrice 0 int64

80 rows × 2 columns

Yep, there are. Missing cells are usually inherent to data; if you don't find any in yours, it is probably time to get suspicious. So, to start the test-driven part, let's come up with a test that checks whether there are NaN values in the data before we go on analysing it. It is also an easy one: since missing data will not make sense for several model types, it qualifies as an insanity test. I usually write the first version of a test quick and dirty in the notebook.

In [3]:
def test_no_nan_values(data):
    assert not data.isnull().any(axis=None)
    
test_no_nan_values(train)
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-3-17e4e883296c> in <module>()
      2     assert not data.isnull().any(axis=None)
      3 
----> 4 test_no_nan_values(train)

<ipython-input-3-17e4e883296c> in test_no_nan_values(data)
      1 def test_no_nan_values(data):
----> 2     assert not data.isnull().any(axis=None)
      3 
      4 test_no_nan_values(train)

AssertionError: 

Good, it failed. Also note that at this point I have neither docstrings nor a proper unittest/pytest case. It's a test. And it fails. That was the purpose of the first stage, so let's move on to the next.

Missing values are tricky. Pandas is schemaless, which makes prototyping easy but does not help in this case: missing values are filled with np.nan even in otherwise string-typed fields. This is an intentional feature of pandas that makes it easier to use generic DataFrame functions, but in my case I don't really want it to happen. Let's see what we can do with the NaN values:

  1. Add 'Unknown' as a separate string to replace NaNs, so we can still use these fields later. - But what should be the limit of NaN values up to which we apply this strategy? You obviously don't want to spend too much energy in the first iteration on columns with 95% missing values. So 90%? 75%? That threshold would itself have to be tested, so this is probably not the easiest way to pass the test.

  2. Predict the missing values. - This would take lots of models and maybe different strategies for different fields. Again, not what I would call an easy solution.

  3. Omit the columns with missing values. - Since I want to pass the test first with the least effort (and, from a higher-level perspective, bring results as soon as possible), this is what I will choose.

I looked at the test set and, because it contained even more NaN values, I decided to take the union of the columns with missing values in the two datasets. Also, since I want to apply the same operations to both sets, I quickly created a common class for them.

In [4]:
class HousePriceData(object):
    def __init__(self, filename):
        self.X_cols = ['MSSubClass', 'LotArea', 'Street', 'LotShape', 'LandContour',
                       'LotConfig', 'LandSlope', 'Condition1', 'Condition2',
                       'BldgType', 'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt',
                       'YearRemodAdd', 'RoofStyle', 'RoofMatl', 'ExterQual', 'ExterCond',
                       'Foundation', 'Heating', 'HeatingQC', 'CentralAir', 
                       '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'FullBath',
                       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd',
                       'Fireplaces', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF',
                       'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal',
                       'MoSold', 'YrSold', 'SaleCondition']
        self.y_col = 'SalePrice'
        data = pd.read_csv(filename, index_col=['Id'])
        self.X = data[self.X_cols]
        if self.y_col in data.columns:
            self.y = np.log(data[self.y_col])
        else:
            self.y = None
            
train = HousePriceData('train.csv')
test = HousePriceData('test.csv')

test_no_nan_values(train.X), test_no_nan_values(train.y), test_no_nan_values(test.X)
Out[4]:
(None, None, None)

Good, the test passed. Now it's time to refactor. To this end I created two Python files, one for my class and one for the insanity tests. I then transformed the small test above into a proper unit test case and rechecked that, after the transformation, it still fails for the raw data and passes for my transformed data. I was left with 44 of the 80+ original columns, which I considered enough to go further.
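The exact content of those two files is not shown here, but a minimal sketch of the refactored insanity test could look roughly like this (the file and class names match the pytest run at the end of the post; loading the data in setUpClass and the use of unittest.TestCase are my assumptions):

# tests/insanity_test.py -- a rough sketch, not the exact file from the repository
import unittest

from data_analysis import HousePriceData


class DataIntegrity(unittest.TestCase):

    @classmethod
    def setUpClass(cls):
        # Load the cleaned datasets once for all integrity tests.
        cls.train = HousePriceData('train.csv')
        cls.test = HousePriceData('test.csv')

    def test_no_nan_values(self):
        # None of the feature matrices or the target may contain NaN/NULL cells.
        for data in (self.train.X, self.train.y, self.test.X):
            self.assertFalse(data.isnull().any(axis=None))


if __name__ == '__main__':
    unittest.main()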

So far I have only tested insanity. It is the right time to start sanity testing as well, because it helps tremendously with staying focused and keeping the bigger picture in mind. For sanity testing I have preferred BDD frameworks for a while now (e.g. pytest-bdd or behave). The good thing about BDD is that it connects the testing to a human-readable description of the testing steps, written in Gherkin. This is useful because the tests often correspond to actual requirements from the client, and since the description is easily readable for non-programmers, it helps collect external input on the solution. And as the developer I really want to keep my development process in sync with the client's expectations.

The first sanity test I'm writing is going to check whether any solution we produce is better than the arithmetic average of the house prices. The Gherkin description for this looks as follows:

Feature: Sanity of the model

  Scenario: Better than the simple average
    Given the trained model
    When I use the arithmetic average of the outcome as a reference 
    And the rmse of the prediction of the arithmetic average on the validation data is the reference RMSE
    And the rmse of the prediction with the trained_model on the validation data is our RMSE
    Then we see that our RMSE is better than the reference RMSE

Of course there is a proper .py file attached to this nice description. Also, this polished textual description is probably not your first iteration of the actual sanity test; the first iteration is more likely something like this:

In [5]:
from sklearn.metrics import mean_squared_error
from sklearn.dummy import DummyRegressor

def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

def test_better_than_average(model, X, y):
    reference_score = rmse(y, np.repeat(np.mean(y), len(y)))
    y_pred = model.predict(X)
    our_score =  rmse(y, y_pred)
    assert our_score < reference_score
    
stupid_model = DummyRegressor(strategy='constant', constant=1) # so I'm using no matter what the price of the house is 1.
stupid_model = stupid_model.fit(train.X, train.y) # this model is so stupid that despite fitting it remains dumb

test_better_than_average(stupid_model, train.X, train.y)    
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-5-21b251f96bed> in <module>()
     14 stupid_model = stupid_model.fit(train.X, train.y) # this model is so stupid that despite fitting it remains dumb
     15 
---> 16 test_better_than_average(stupid_model, train.X, train.y)

<ipython-input-5-21b251f96bed> in test_better_than_average(model, X, y)
      7 def test_better_than_average(model, X, y):
      8     reference_score = rmse(y, np.repeat(np.mean(y), len(y)))
----> 9     y_pred = model.predict(X)
     10     our_score =  rmse(y, y_pred)
     11     assert our_score < reference_score

/opt/conda/lib/python3.6/site-packages/sklearn/dummy.py in predict(self, X)
    466         """
    467         check_is_fitted(self, "constant_")
--> 468         X = check_array(X, accept_sparse=['csr', 'csc', 'coo'])
    469         n_samples = X.shape[0]
    470 

/opt/conda/lib/python3.6/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    400         # make sure we actually converted to numeric:
    401         if dtype_numeric and array.dtype.kind == "O":
--> 402             array = array.astype(np.float64)
    403         if not allow_nd and array.ndim >= 3:
    404             raise ValueError("Found array with dim %d. %s expected <= 2."

ValueError: could not convert string to float: 'Pave'

Hm. I expected my test to fail, but not this way. This error means that my data contains non-numeric values. Although some models can handle those, let's stick to numeric values for now, and convert the data by vectorizing all text columns. But this is not as easy as calling pd.get_dummies() and, bamm, done. The problem is that some categories may be present only in the train or only in the test data (note that vectorizing only makes sense for categorical variables), so if you simply call pd.get_dummies() on each set you get data of different shapes.
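One way to avoid the shape mismatch is to dummy-encode both sets and then align them on the union of the resulting columns. This is only a sketch of the idea (the helper name and the exact alignment strategy are mine, not necessarily what ends up in the HousePriceData class):

import pandas as pd

def encode_aligned(train_df, test_df, categorical_cols):
    # Dummy-encode both frames, then reindex them against the union of the
    # resulting columns: categories that occur in only one of the two sets
    # become all-zero columns in the other, so the shapes always match.
    train_dummies = pd.get_dummies(train_df, columns=categorical_cols)
    test_dummies = pd.get_dummies(test_df, columns=categorical_cols)
    all_columns = train_dummies.columns.union(test_dummies.columns)
    return (train_dummies.reindex(columns=all_columns, fill_value=0),
            test_dummies.reindex(columns=all_columns, fill_value=0))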

Now it's clear that I want to test that my data contains only numeric values, so I added a test for that. To get there, you have multiple options:

  • Use a database with a schema. - That is the good and proper way to do it, but it also takes some time to set up the schema correctly for these 80+ columns, so if there is another way, no.
  • Use a dictionary in Python, iterate over the columns and set up the categorical values one by one. - A nightmare to maintain, and it means replicating functionality that has already been written by several people in several ways.
  • Use one of the already existing tools. - I opted for this and pip installed tdda, a nice package that checks the values and validates the files. The str --> categorical conversion is not built in, but the generated JSON file contains the list of allowed values, so I added a simple function that uses it for this purpose (see the sketch after this list).
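For reference, generating and checking the constraints file with tdda goes roughly like this (a sketch assuming tdda's discover_df/verify_df API; only the modified constraints filename below comes from the post, the rest is my reconstruction):

import pandas as pd
from tdda.constraints import discover_df, verify_df

train_raw = pd.read_csv('train.csv', index_col=['Id'])

# Discover constraints (types, min/max, allowed string values, ...) from the raw training data.
constraints = discover_df(train_raw)
with open('house_prices_constraints.tdda', 'w') as f:
    f.write(constraints.to_json())

# Any other file can later be verified against the (possibly hand-edited) constraints.
verification = verify_df(pd.read_csv('test.csv', index_col=['Id']),
                         'house_prices_constraints_mod.tdda')
print(verification)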
In [6]:
from data_analysis import HousePriceData
from pandas.api.types import is_numeric_dtype

def test_if_all_numeric(data):
    assert is_numeric_dtype(data.values)

constraints_filename = 'house_prices_constraints_mod.tdda'
train = HousePriceData('train.csv', constraints=constraints_filename)
test = HousePriceData('test.csv', constraints=constraints_filename)

test_if_all_numeric(train.X), test_if_all_numeric(train.y), test_if_all_numeric(test.X)
/home/jovyan/work/Documents/TDD/tdd_data_analysis/data_analysis.py:16: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  X['has' + col] = (X[col] != 0).astype(int)
Out[6]:
(None, None, None)

Now we are ready to fail our first sanity test for real. Let's see:

In [7]:
test_better_than_average(stupid_model, train.X, train.y)    
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-7-a7994ac2c8ca> in <module>()
----> 1 test_better_than_average(stupid_model, train.X, train.y)

<ipython-input-5-21b251f96bed> in test_better_than_average(model, X, y)
      9     y_pred = model.predict(X)
     10     our_score =  rmse(y, y_pred)
---> 11     assert our_score < reference_score
     12 
     13 stupid_model = DummyRegressor(strategy='constant', constant=1) # so I'm using no matter what the price of the house is 1.

AssertionError: 

Good, it failed. Now let's pass the test. It is tempting to grab XGBoost and call it done, but not so fast. An XGBoost model would surely pass the test, but so would many other models, e.g. a Linear Regression. Not as fancy, but a lot simpler, which in this case means it is just enough to pass the test. Remember, when you pass a test, do not only think about the code you have to write yourself, but also take into account the overall complexity of your solution. So let's see if Linear Regression passes the test.

In [8]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model = model.fit(train.X, train.y)

test_better_than_average(model, train.X, train.y)

Without problems. In the refactoring step I consolidated the HousePriceData class, set up the feature file with the Gherkin description and added the necessary pytest files (I used pytest-bdd for this project). The next step is to have something better than Linear Regression. But you don't want 'just' better, you probably want significantly better; otherwise, how would you explain to the client why they should buy a model that predicts based on thousands of weights instead of fewer than a hundred? So when I fail my second sanity test, it will be because I want any complex method to get below 75% of the Linear Regression's RMSE.
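For completeness, the pytest-bdd step definitions behind that feature file could be wired up roughly as follows. This is a sketch only: the feature file name, the context fixture and the model/validation_data fixtures are my assumptions, not the actual code from the repository.

# tests/sanity_test.py -- a rough sketch of the pytest-bdd wiring
import numpy as np
import pytest
from pytest_bdd import scenario, given, when, then
from sklearn.metrics import mean_squared_error


def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))


@pytest.fixture
def context():
    # Simple mutable container shared between the steps of one scenario.
    return {}


@scenario('sanity.feature', 'Better than the simple average')
def test_better_than_the_simple_average():
    pass


@given('the trained model')
def trained_model(context, model, validation_data):
    # `model` and `validation_data` are assumed project-level fixtures.
    context['model'] = model
    context['validation'] = validation_data


@when('I use the arithmetic average of the outcome as a reference')
def use_average(context):
    y = context['validation'].y
    context['reference_prediction'] = np.repeat(np.mean(y), len(y))


@when('the rmse of the prediction of the arithmetic average on the validation data is the reference RMSE')
def reference_rmse(context):
    context['reference_rmse'] = rmse(context['validation'].y, context['reference_prediction'])


@when('the rmse of the prediction with the trained_model on the validation data is our RMSE')
def our_rmse(context):
    context['our_rmse'] = rmse(context['validation'].y,
                               context['model'].predict(context['validation'].X))


@then('we see that our RMSE is better than the reference RMSE')
def compare(context):
    assert context['our_rmse'] < context['reference_rmse']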

For these steps I also need a proper train and validation split, so I split the data and saved the bigger part as train_short and the smaller part as validation.
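The split itself is one call to scikit-learn. A sketch of how it can be done (the 80/20 ratio and the fixed random_state are my assumptions; only the train_short.csv and validation.csv file names come from the post):

import pandas as pd
from sklearn.model_selection import train_test_split

full = pd.read_csv('train.csv', index_col=['Id'])

# Hold out a validation set and persist both parts for the tests.
train_short, validation = train_test_split(full, test_size=0.2, random_state=42)
train_short.to_csv('train_short.csv')
validation.to_csv('validation.csv')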

In [9]:
from sklearn.ensemble import ExtraTreesRegressor

def test_better_than_linear_regression(model, X_train, y_train, X_validation, y_validation, percent=75):
    lm = LinearRegression().fit(X_train, y_train)
    reference_score = rmse(y_validation, lm.predict(X_validation)) * percent /100
    y_pred = model.predict(X_validation)
    our_score =  rmse(y_validation, y_pred)
    assert our_score < reference_score
    
train = HousePriceData('train_short.csv', constraints=constraints_filename)
validation = HousePriceData('validation.csv', constraints=constraints_filename)
    
model = ExtraTreesRegressor()
model = model.fit(train.X, train.y)

test_better_than_linear_regression(model, train.X, train.y, validation.X, validation.y, percent=75)
/opt/conda/lib/python3.6/site-packages/sklearn/ensemble/weight_boosting.py:29: DeprecationWarning: numpy.core.umath_tests is an internal NumPy module and should not be imported. It will be removed in a future NumPy release.
  from numpy.core.umath_tests import inner1d
/home/jovyan/work/Documents/TDD/tdd_data_analysis/data_analysis.py:16: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  X['has' + col] = (X[col] != 0).astype(int)
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-9-6814b04c685d> in <module>()
     14 model = model.fit(train.X, train.y)
     15 
---> 16 test_better_than_linear_regression(model, train.X, train.y, validation.X, validation.y, percent=75)

<ipython-input-9-6814b04c685d> in test_better_than_linear_regression(model, X_train, y_train, X_validation, y_validation, percent)
      6     y_pred = model.predict(X_validation)
      7     our_score =  rmse(y_validation, y_pred)
----> 8     assert our_score < reference_score
      9 
     10 train = HousePriceData('train_short.csv', constraints=constraints_filename)

AssertionError: 

Of course it failed. Actually, if you look at the results, you can see that the Linear Regression already explains 70% of the variance, which is pretty good. After some hyperparameter tuning I found an XGBoost model that passes this test.

In [ ]:
from xgboost import XGBRegressor

model = XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=4, min_child_weight=1, missing=None, n_estimators=400,
       n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=42,
       silent=True, subsample=0.7)

model = model.fit(train.X, train.y)

test_better_than_linear_regression(model, train.X, train.y, validation.X, validation.y, percent=75)

The next step was again refactoring: I took the 'test against a reference model' logic and organized it into shared functions to make the whole testing logic more robust.
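A rough sketch of what that consolidation could look like (the helper name and signature are mine, not taken from the repository):

import numpy as np
from sklearn.metrics import mean_squared_error


def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))


def assert_better_than_reference(model, X_validation, y_validation,
                                 reference_prediction, margin=1.0):
    # Shared assertion: the model's RMSE must be below margin * reference RMSE.
    # `reference_prediction` is whatever the baseline predicts on the validation
    # set (a constant mean, a fitted LinearRegression's predictions, ...), so
    # every 'better than X' sanity test reduces to a single call of this helper.
    reference_score = rmse(y_validation, reference_prediction) * margin
    our_score = rmse(y_validation, model.predict(X_validation))
    assert our_score < reference_score

With this, the 'better than the average' test becomes assert_better_than_reference(model, val.X, val.y, np.repeat(val.y.mean(), len(val.y))), and the 'better than 75% of Linear Regression' test becomes assert_better_than_reference(model, val.X, val.y, lm.predict(val.X), margin=0.75).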

So here I was with a working model, already able to answer whether it is better than an average prediction or a simple linear regression. I also had a model very early on: by starting off without handling NULL values in sophisticated ways, I had a model just ~2 hours after I downloaded the data, note-taking included. I would have been able to stop at basically any time after that and still have had a working model, along with a list of the steps I planned to include next.

I uploaded the code to GitHub, which includes the state after refactoring. In the repository you can also see that I went on to add two more sanity tests and implemented solutions to pass them. In the first, the passing criterion was an RMSE that would have been enough to land a submission in the middle of the Kaggle leaderboard (~0.13 RMSE). This was easy to fail and hard to pass :). I tried combining models, searching hyperparameters and transforming the data. In the end, I had to go back to the missing value problem, add back the fields that had only a small number of missing values and handle the imputation. In the next step, I tested whether I was overfitting the training data. Overfitting is a serious problem for a production model: just imagine, if you are already overfitting the test set that you split from your own data, how much you would underperform on newly collected data. I stopped after this step, but I had to fight the urge to refactor further, because by now my tests ran for 8 seconds, which is way too much for TDD. That step is left for the next iteration.
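The overfitting scenario shown in the test run below boils down to a check of this shape (a sketch: the 150% threshold matches the Gherkin step, but the function itself is my reconstruction and reuses the rmse() helper defined earlier in the notebook):

def test_not_overfitting(model, X_train, y_train, X_validation, y_validation, percent=150):
    # The validation RMSE may be at most `percent`% of the training RMSE.
    reference_score = rmse(y_train, model.predict(X_train)) * percent / 100
    our_score = rmse(y_validation, model.predict(X_validation))
    assert our_score < reference_score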

In sum, I think it is quite clear at this point how useful it is to do TDD in data analysis. Of course, at first you may feel that writing the tests is just extra time, and that of course your model is better than the average. I'm sure you are right, but all of us make mistakes sometimes, and tests help reduce their impact. Tests also often live longer than the version of the code you end up delivering. Maybe later someone takes over, and you want to make sure that their approach is not only different from yours but better, or at least as good: that it passes your insanity and sanity tests.

TDD also helps keeping focus. Data analysis, cleaning, imputing and modelling are all huge areas. Say you impute the missing values with zero, but then you think maybe you should have predicted them; then you think that if you predict them, maybe you can even 'generate' extra data for training; then you start to think about how you could test the effectiveness of such data generation... and you find yourself in the middle of the forest with absolutely no idea how to get back to your original goal. Many effective individuals I know follow similar practices, and, more importantly, I see that working in this framework makes teamwork so much more effective and fun.

Try it and let me know how it works in your practice!

In [10]:
!pytest -vv --gherkin-terminal-reporter --gherkin-terminal-reporter-expanded -p no:warnings
============================= test session starts ==============================
platform linux -- Python 3.6.3, pytest-4.3.0, py-1.8.0, pluggy-0.9.0 -- /opt/conda/bin/python
cachedir: .pytest_cache
rootdir: /home/jovyan/work/Documents/TDD/tdd_data_analysis, inifile:
plugins: bdd-3.1.0
collected 6 items                                                              

tests/insanity_test.py::DataIntegrity::test_no_nan_values PASSED         [ 16%]
tests/insanity_test.py::DataIntegrity::test_whether_only_numeric_values PASSED [ 33%]
Feature: The model is making sense
    Scenario: Better than simple average
        Given the validation data and the trained_model
        When I claim that the simplest is to take the Average of the outcome
        And I take the train_short data
        And I get the Average of the outcome from the train_short data
        And the RMSE of the prediction of the Average of the outcome for the validation as the reference score
        And the RMSE of the prediction with the trained_model on the validation as our RMSE
        Then we see that our RMSE is lower than the reference score
    PASSED

Feature: The model is making sense
    Scenario: Better than linear_regression
        Given the validation data and the trained_model
        When I use the Linear Regression
        And I take the train_short data
        And I train the Linear Regression on the train_short data
        And the RMSE of the prediction with the Linear Regression on the validation as reference score
        And the RMSE of the prediction with trained_model on the validation as our RMSE
        And my target is less than 75% of the reference score
        Then we see that our RMSE is lower than the reference score
    PASSED

Feature: The model is making sense
    Scenario: We are providing good results
        Given the validation data and the trained_model
        When the RMSE of the prediction with trained_model on the validation as our RMSE
        And my reference score is 0.13 and I expect lower value from my model
        Then I see that our RMSE is indeed lower than the reference score
    PASSED

Feature: The model is making sense
    Scenario: We are not overfitting the training data
        Given the validation data and the trained_model
        When I take the train_short data
        And the RMSE of the prediction with trained_model on the validation as our RMSE
        And the RMSE of the prediction with trained_model on the train_short as the reference score
        And my target is max 150% of the reference score
        Then I see that our RMSE is under this reference score limit
    PASSED


=========================== 6 passed in 8.23 seconds ===========================
<< FIN >>