A Better Bank Statement

The information in a regular bank statement is presented in a very unhelpful way - simply sorted by date, with no analysis of the items to assist you to understand them.

This notebook is a quick and dirty way to process a bank statement downloaded as a .csv file and extract useful information from it so that you can:

  • scan for fraudulent items
  • find mistakes such as doubled payments
  • find incorrect amounts
  • scan for missing items such as automatic bill payments

It does this by:

  • separating credits from debits
  • grouping into "new", "recurring" and "varying" items
  • categorising items into bills, POS, PayPal etc.
  • graphically displaying varying items to allow them to be checked quickly

[email protected] / http://tompaton.com/


An iPython notebook is a good way to perform this task as all of the code is processing the data and there isn't any need to building a UI or add configuration options.

To install iPython:

$ sudo apt-get install ipython-notebook python-matplotlib python-scipy python-pandas python-sympy python-nose

Run using:

$ ipython notebook --pylab inline


First, the python environment needs to be configured:

In [25]:
import re
import matplotlib.pyplot as plt
import matplotlib.ticker as pltticker
import numpy as np
import pandas as pd
from datetime import date, timedelta

def compose(*funcs):
    return lambda x: reduce(lambda acc, f: f(acc), funcs, x)

pd.options.display.max_colwidth = 0
pd.options.display.show_dimensions = False
In [26]:
def line_break(s, n=40):  # line break every n chars
    return re.sub('(.{%d})' % n, '\\1\n', s)                                

This function will be used to plot the histograms of varying items. I'm not familiar with matplotlib so there are probably better ways to do this.

In [27]:
def expense_item_hist(row):
    amounts = pd.Series(row['amounts']).map(abs)
    # histogram of amounts
    counts, bins, patches = plt.hist(amounts, color=['lightgray'])
    # colour current amount bar red
    for patch, rightside, leftside in zip(patches, bins[1:], bins[:-1]):
        if leftside <= abs(row['Amount']) <= rightside:
    # add probability density estimate
    amounts.plot(kind="kde", secondary_y=True, style=':')
    # desc on top row
    # date - amount - catg on second row
    plt.title('%s\n%s' % (line_break(row['desc']),
                          row['Entered Date'].date()),
    plt.title('$%.2f' % abs(row['Amount']), 
    plt.title(row['catg'], loc='right')
def expense_histograms(table, ncols=3, width=16, height=3, hspace=1):
    nrows = ceil(len(table.index) / float(ncols))
    plt.figure(figsize=(width, nrows*height))
    # TODO: axis scale in $10, $100, $1000, $10000 increments
    #loc = pltticker.MaxNLocator()
    for i, (idx, row) in enumerate(table.iterrows()):
        ax = plt.subplot(nrows, ncols, i+1)

Import data

You'll need to supply a .csv file downloaded from your bank and edit the column names referenced in the code to match.

TODO: you'll probably get a new .csv file each time you check your statement, all of which will need to be loaded in and any duplicates where the files overlap will need to be removed.

In [41]:
statement_csv = "data/eg-statement.csv"
!head {statement_csv}

In [42]:
statement = pd.read_csv(statement_csv, parse_dates=[0, 1], dayfirst=True)
statement['Entered Date'].describe()
count     257
unique    171
first     2013-04-01 00:00:00
last      2014-05-08 00:00:00
top       2013-09-03 00:00:00
freq      5
Name: Entered Date, dtype: object

Process data

The processing basically strips reference and receipt numbers from the transaction descriptions so that items to the same payee can be matched together.

At the bottom of the notebook, any left over descriptions will be listed so you can see if additional clean up substitutions are required.

In [43]:
#statement['Transaction Description'].value_counts()
In [44]:
def re_sub(regex, repl): return lambda desc: re.sub(regex, repl, desc)

clean_up = [lambda desc: desc.lower(),
            re_sub('\s+', ' '),
            re_sub('atm#\d+', '<atm>'),
            re_sub('atm owner fee \d+', '<atm owner fee>'),
            re_sub('pos( including cash out)?\s*#\d+', '<pos>'),
            re_sub('receipt no \d+', '<receipt no>'),
            re_sub('crn#\d+', '<crn>'),
            re_sub('\(?ref\.?\d+\)?', '<ref>'),
            re_sub('(au|gb|us)#\d+', '<card>')]
In [45]:
statement['desc'] = statement['Transaction Description'].map(compose(*clean_up))

It also helps if similar items are grouped together, so you'll want to add your own bills etc to this list:

In [46]:
def categorise(desc):
    if '<atm>' in desc: return 'atm'
    if '<pos>' in desc: return 'pos'
    if 'paypal' in desc: return 'paypal'
    if 'bpay' in desc: return 'bpay'
    if '<atm owner fee>' in desc: return 'fee'
    if 'transaction fee' in desc: return 'fee'
    if 'visa access card fee' in desc: return 'fee'
    if 'member net transfer' in desc: return 'transfer'
    if 'visa-internode' in desc: return 'bill'
    if 'visa-linode.com' in desc: return 'bill'
    if 'visa-citylink' in desc: return 'bill'
    if 'visa-virgin mobile' in desc: return 'bill'
    return ''
In [47]:
statement['catg'] = statement['desc'].map(categorise)

Prepare for output

This code finds recurring items by grouping matching descriptions and then splits the statement into debits, credits etc.

In [48]:
stats = statement.groupby('desc')['Amount'].agg({#'total' : np.sum, 'average' : np.mean,
                                                 #'min': np.min, 'max': np.max,
                                                 'count': np.count_nonzero,
                                                 'amounts': lambda amounts: list(amounts)})
In [49]:
last_date = statement['Entered Date'].describe()['last']
start_date = last_date - timedelta(days=15)
table = statement.join(stats, on='desc')[statement['Entered Date'].gt(start_date)]
cols = ['Entered Date', 'catg', 'desc', 'Amount']
table.sort(['catg'], inplace=True)
In [50]:
new = table['count'].eq(1)  # only one instance of item description seen
not_new = new.eq(False)
debit = table['Amount'].lt(0)
credit = table['Amount'].gt(0)
In [51]:
# only consider it recurring if we've seen it more than twice,
# otherwise it could be a doubled-up transaction we want to catch
table['recurring'] = [amounts.count(amount) > 1 and len(amounts) > 2
                      for amount, amounts in zip(table['Amount'], table['amounts'])]
In [52]:
recurring = table['recurring'].eq(True)
not_recurring = table['recurring'].eq(False) & not_new

TODO: It would be nice to be able to add notes on items to avoid wasting time investigating the weird ones more than once. This could be a simple dict of dict of Entered Date+Transaction Description --> string


This is the useful bit, and you have to ask yourself: why can't the bank do something like this?

New items

The first thing to check for are debits that haven't appeared before - separating them out makes it easy.

In [53]:
table[cols][debit & new]
Entered Date catg desc Amount
1 2014-05-06 visa-masters 7722 hawthorn east<card><ref> -70.78
4 2014-05-04 visa-act*trailsplus macedon surry hills <card><ref> -50.00
8 2014-05-02 visa-target melbourne city vic <card><ref> -29.00
14 2014-04-24 visa-australia post 328874 carnegie <card><ref> -15.60
9 2014-05-01 bpay internet bpay to yarra valley water - biller code 123456 - <receipt no> <crn> -114.68
13 2014-04-24 paypal visa-paypal *hl5pts 123456789 <card><ref> -199.30
11 2014-04-30 pos <pos>-jaycar electronics melbourne au -69.95

Varying items

These payees have been seen before but the specific amount hasn't. The charts let you see at a glance which are truly problematic and which are normal variations.

In [54]:
expense_histograms(table[cols+['amounts']][debit & not_recurring])

Recurring items

These items have all been seen before, mostly here so everything is included, but it may be worth checking that recurring payments haven't been missed accidentally.

An item must be seen 3 or more times to show here, so that accidentally doubled up payments aren't ignored.

In [55]:
table[cols][debit & recurring]
Entered Date catg desc Amount
7 2014-05-02 atm <atm>-melbourne - 330 collins street -200.00
2 2014-05-06 bill visa-virgin mobile h/o macquarie prk<card><ref> -19.19
10 2014-04-30 fee visa access card fee <va -123456 > -1.65

The same breakdown but this time for credits.

New credits

In [56]:
table[cols][credit & new]

Varying credits

In [57]:
expense_histograms(table[cols+['amounts']][credit & not_recurring])

Recurring credits

In [58]:
table[cols][credit & recurring]

Item summaries

TODO: might be nice to show details of each (non-unique) item desc - average, totals, charts, frequency...

Left overs

The following are the unique item descriptions, check through these to see if any additional steps need to be added to clean_up.

In [59]:
left_overs = sorted(statement['desc'].unique())
len(left_overs), left_overs
 ['<atm owner fee> cba\\melbourne au',
  '<atm owner fee> westpacbox hill 2 o/s box hill 3 au',
  '<atm>-cba\\melbourne au',
  '<atm>-chadstone - 1341 dandenong rd',


  'visa-ww petrol 3093 camberwell <card><ref>'])