This data conversation uses car parking meter data obtained via an FOI request to the Isle of Wight Council for Pay and Display ticket machine transaction data from the ticket machines in the River Road Car Park, Yarmouth, Isle of Wight, for the financial year 2012-13.

The data includes an identifier for the ticket machine that issued the ticket, the time the ticket was issued, the tariff band (i.e. the nominal ticket value), and the amount of cash paid for the ticket.

The following conversation is a predominantly visual one, where questions are asked of the data and responses are provided in graphical form - as charts - that then need interpreting.

Several lines of questioning naturally arise:

- when are car parks actually used, based on ticket purchases?
- are different ticket types purchased at different times of day or different days of the week?
- do customers ever pay more than they need to when purchasing a ticket?

This conversation with data has been created within an interactive IPython Notebook, using the *pandas* data wrangling library and the `ggplot` graphics library.

In [15]:

```
#Import some necessary programming libraries that we'll use for the analysis
import pandas as pd
from ggplot import *
#And some housekeeping
import warnings
warnings.simplefilter(action = "ignore", category = FutureWarning)
```

In [3]:

```
#See what data files we have available
!ls data/iw_parkingMeterData/
```

In [4]:

```
#I'm going to start by just looking at data from the period Dec 2012 to March 2013.
#Read in a single spreadsheet and tidy it
df=pd.read_excel("data/iw_parkingMeterData/4_5_Transaction Report RR Dec 2012 March 2013.xls",skiprows=6)
#We need to clean the data a little, dropping empty columns, identifying timestamps as such
df.dropna(how='all',axis=1,inplace=True)
df.Date=pd.to_datetime(df.Date, format="%Y-%m-%d %H:%M:%S", errors="coerce")
df.dropna(subset=["Date"],inplace=True)
#So what does the data look like?
df[:5]
```

Out[4]:

In [133]:

```
#What are the separate tariff bands?
df['Description.1'].unique()
```

Out[133]:

In [21]:

```
#It's possibly easier to work with the Tariff code, so what code applies to which description?
from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())
dfx=df[["Tariff","Description.1"]].copy()
dfx.rename(columns=lambda x: x.replace('.','_'), inplace=True)
q="SELECT DISTINCT Tariff, Description_1 FROM dfx"
pysqldf(q)
```

Out[21]:

In [22]:

```
#We can use this information to generate a mapping from the description or tariff to the tariff price
#[Really should automate the extraction of the amount from the description]
tariffMap={'01A':0.6, '01B':1,'01C':1.9, '01D':3.4,'01E':4.5,'01F':6.6,'01G':3,'01H':10}
df["Tariff_val"]=df['Tariff'].apply(lambda x: tariffMap[x])
df[:3]
```

Out[22]:
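The hand-built `tariffMap` is a bit fragile - as the comment in the cell above says, extracting the amount from the description could be automated. Here's a minimal sketch, assuming the descriptions embed a price in a form like `£1.90` or `60p` (the exact wording in this data may differ, so the regexes would need checking against the real descriptions):

```python
import re

def price_from_description(desc):
    """Extract a price like '£1.90' or '60p' from a tariff description.
    Returns the value in pounds, or None if no price is found."""
    m = re.search(r'£(\d+(?:\.\d+)?)', desc)
    if m:
        return float(m.group(1))
    m = re.search(r'(\d+)p\b', desc)
    if m:
        return int(m.group(1)) / 100.0
    return None

# Hypothetical description strings - the real wording may differ
print(price_from_description("Up to 2 hours £1.90"))  # 1.9
print(price_from_description("30 mins 60p"))          # 0.6
```

If the patterns held up, this could then replace the lookup with something like `df['Description.1'].apply(price_from_description)`.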

In [7]:

```
#How much cash was taken over this period in total?
df[['Cash']].sum()
```

Out[7]:

In [62]:

```
#If people paid exactly the tariff price, how much would have been taken?
df[['Tariff_val']].sum()
```

Out[62]:

In [69]:

```
#So for this one car park, over four off season months, how much was overpaid?
round(float(df[['Cash']].sum())-float(df[['Tariff_val']].sum()),2)
```

Out[69]:

In [9]:

```
#How much cash was taken over this period for each machine?
df[['Machine','Cash']].groupby('Machine').sum()
```

Out[9]:

In [11]:

```
#How much cash was taken over this period for each machine and tariff?
df[['Machine','Tariff','Cash']].groupby(['Machine','Tariff']).sum()
```

Out[11]:

The total cash amounts are interesting, but if we want to know how busy the car parks were, we need to count the number of tickets issued.

In [12]:

```
#So how many tickets of each tariff type were issued by each machine?
df[["Tariff","Machine"]].groupby(['Tariff',"Machine"]).agg(len).sort_index()
```

Out[12]:

In [18]:

```
#Can you show me that graphically?
p = ggplot(aes(x='Tariff'), data=df)
p + geom_bar() + ggtitle("Number of Tickets per Tariff") + labs("Tariff Code", "Count") + facet_wrap('Machine',scales='fixed')
```

Out[18]:

It looks as if YARR02 is used slightly less - is the area of the car park it covers "further away" from where people are likely to want to go?

There's possibly a diagnostic here too - if sales from one machine fall off and the other runs at a higher rate than normal, that suggests a possible problem with the former machine. We won't explore that here, but we could in a more detailed investigation.

In [52]:

```
#Here's the same question asked another way
p = ggplot(aes(x='Tariff',fill="Machine"), data=df)
p + geom_bar() + ggtitle("Number of Tickets per Tariff") + labs("Tariff Code", "Count")
#Ideally these bars would be "dodged" - placed side-by-side, but the charting library doesn't support that at the moment
```

Out[52]:
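Since this `ggplot` port can't dodge the bars yet, one workaround is to drop down to pandas' own plotting, where a crosstab of counts plotted as a bar chart gives side-by-side bars per group. A sketch with toy data standing in for the ticket transactions:

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so this runs headless
import matplotlib.pyplot as plt

# Toy data standing in for the ticket transactions
toy = pd.DataFrame({
    'Tariff':  ['01A', '01A', '01B', '01B', '01B', '01C'],
    'Machine': ['YARR01', 'YARR02', 'YARR01', 'YARR01', 'YARR02', 'YARR01'],
})

# Count tickets per (Tariff, Machine), then plot grouped ("dodged") bars
counts = pd.crosstab(toy['Tariff'], toy['Machine'])
ax = counts.plot(kind='bar')  # one bar per Machine, side by side, per Tariff
ax.set_xlabel('Tariff Code')
ax.set_ylabel('Count')
```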

I'm now going to start exploring *when* there is most activity. One way of doing this is to summarise the data and look for activity around particular days of the week or hours of the day.

In [19]:

```
#We can derive a number of different time components as follows
# /via http://pandas-docs.github.io/pandas-docs-travis/timeseries.html
# minute: the minutes of the datetime
# weekday OR dayofweek: the day of the week with Monday=0, Sunday=6
# week: the week ordinal of the year
df['weekday']=df['Date'].apply(lambda x: x.dayofweek)
# hour: the hour of the datetime
df['hour']=df['Date'].apply(lambda x: x.hour)
#Let's just check that's worked:
df[:3]
```

Out[19]:

How many transactions are issued by day of week? Let's plot them as a bar chart.

In [20]:

```
ggplot(df, aes(x='factor(weekday)'))+geom_bar()
```

Out[20]:

In [ ]:

```
# this, or similar, should be supported at some point? +scale_x_discrete(labels=["Mon","Tues","Weds","Thurs","Fri","Sat","Sun"])
```
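Until something like that is supported, the weekday numbers can be mapped to day names before plotting. A minimal sketch using `Series.map` on a toy weekday column:

```python
import pandas as pd

day_names = {0: "Mon", 1: "Tues", 2: "Weds", 3: "Thurs",
             4: "Fri", 5: "Sat", 6: "Sun"}

# Toy weekday column standing in for df['weekday']
weekdays = pd.Series([0, 5, 5, 6, 2])
labelled = weekdays.map(day_names)
print(labelled.tolist())  # ['Mon', 'Sat', 'Sat', 'Sun', 'Weds']
```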

So Saturday appears to be the most popular day of the week, and Monday the quietest.

In [23]:

```
#How many transactions occurred by hour of day?
ggplot(df, aes(x='hour'))+geom_bar(binwidth=1)
```

Out[23]:

In [24]:

```
#Can we split that up to see whether it's different across days of the week?
ggplot(df, aes(x='hour'))+geom_bar(binwidth=1)+facet_wrap('weekday',scales='fixed')
```

Out[24]:

In distribution terms, it looks as if activity concentrates more in the middle of the day at the weekends, compared to weekdays. (We could run statistical tests to check this.)
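One way of making that comparison precise would be a two-sample Kolmogorov-Smirnov test on the hour-of-purchase distributions for weekdays versus weekends. A sketch with synthetic hour samples (assuming `scipy` is available; the real test would use the `hour` column split on `weekday`):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.RandomState(42)
# Synthetic hour-of-purchase samples: weekday hours spread out,
# weekend hours concentrated around midday
weekday_hours = rng.uniform(8, 18, 500)
weekend_hours = rng.normal(13, 1.5, 500)

stat, p = ks_2samp(weekday_hours, weekend_hours)
# A small p-value would indicate the two hourly distributions differ
print(stat, p)
```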

In [85]:

```
#Can we probe that distribution a little further, perhaps seeing how the hourly counts are made up from different tariff counts?
ggplot(df, aes(x='hour',fill='Tariff'))+geom_bar(binwidth=1)+facet_wrap('weekday',scales='fixed')
```

Out[85]:

So that's not too clear - and we need a legend. But the grey-blue band doesn't appear to be used much in the afternoon... And there's a burst of red band activity last thing on a Saturday. The light blue also seems quite popular on a Saturday?

In [68]:

```
#Let's try to dig into that a little more. For a given day of the week, how do the tariff bands get used over the day?
ggplot(df[df['weekday']==2], aes(x='hour'))+geom_bar(binwidth=1)+facet_grid('Tariff')+ggtitle('Wednesday')
```

Out[68]:

So the longer ticket 01F is bought in the morning (reasonable) and late in the day (to cover the next morning). 01B and 01C (up to an hour and 1-2 hours) are popular throughout the day. There is maybe a burst in sales of the short 30 minute 01A ticket at the end of the day?

So how does another day compare?

In [36]:

```
#Let's see what activity for Saturday looks like:
ggplot(df[df['weekday']==5], aes(x='hour'))+geom_bar(binwidth=1)+facet_grid('Tariff')+ggtitle('Saturday')
```

Out[36]:

There definitely seems to be an upswing in short term ticket sales at the end of the day: people going out for the evening?

In [39]:

```
#Let's try to look over all the data to see how the tariff bands compare by hour of day
ggplot(df[(df['Tariff']!='01H') & (df['Tariff']!='01G') ], aes(x='hour'))+geom_bar(binwidth=1)+facet_wrap('Tariff',scales='fixed')
```

Out[39]:

To what extent do people pay more for their parking than they need to - at least in terms of paying more for a ticket than its actual marked price?

In [54]:

```
#Let's plot a count of cash payments using bins that are 5 pence wide
p = ggplot(aes(x='Cash'), data=df)
p + geom_histogram(binwidth=0.05)
```

Out[54]:

Note the "echo peaks" at £2.00 and £3.50 - representing 10p overpayments on the £1.90 01C tariff and £3.40 01D tariff. Clever, eh? Set the tariff just below natural coinage, perhaps in the expectation you'll get the 'natural' amount a good proportion of the time.
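The "echo peak" arithmetic is easy to check by hand: the overpayment on each of those transactions is just the gap between the coin-friendly amount fed in and the tariff price.

```python
# Tariff prices, and the cash amounts behind the "echo peaks"
tariff = {'01C': 1.90, '01D': 3.40}
payments = [('01C', 2.00), ('01D', 3.50), ('01C', 1.90)]

# Overpayment = cash paid minus tariff price, rounded to the penny
overpaid = [(t, round(cash - tariff[t], 2)) for t, cash in payments]
print(overpaid)  # [('01C', 0.1), ('01D', 0.1), ('01C', 0.0)]
```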

In [41]:

```
#The Overpayment column is a boolean that specifies whether there was an overpayment or not
df["Overpayment"]=(df["Cash"]!=df["Tariff_val"])
#The OverpaymentVal identifies how much, if anything, was overpaid
df["OverpaymentVal"]=df["Cash"]-df["Tariff_val"]
```

In [48]:

```
df[1220:1223]
```

Out[48]:

In [49]:

```
#So how common are overpayments by tariff type?
df[["Tariff","Overpayment"]].groupby(['Tariff',"Overpayment"]).agg(len)
```

Out[49]:

Seems like 01C has an overpayment rate getting on for 50%!
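That rate can also be read straight off the data: the mean of a boolean column is the proportion of True values, so grouping the `Overpayment` flag by tariff gives the overpayment rate directly. A sketch with toy transactions:

```python
import pandas as pd

# Toy transactions: cash paid vs the tariff price
toy = pd.DataFrame({
    'Tariff': ['01C', '01C', '01C', '01B'],
    'Cash':   [2.00, 1.90, 2.00, 1.00],
    'Tariff_val': [1.90, 1.90, 1.90, 1.00],
})
toy['Overpayment'] = toy['Cash'] != toy['Tariff_val']

# Mean of a boolean column is the proportion that is True
rate = toy.groupby('Tariff')['Overpayment'].mean()
print(rate['01C'])  # two of the three 01C tickets were overpaid
```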

How does revenue come in over the data collection period?

In [56]:

```
#Let's order the data by timestamp, then add up the cumulative revenue
df.sort_values(['Date'],inplace=True)
df['Cash_cumul'] = df.Cash.cumsum()
df[:3]
```

Out[56]:

In [57]:

```
#How does it look?
g = ggplot(aes(x="Date",y="Cash_cumul"), data=df )+ geom_line()
g
```

Out[57]:

In [59]:

```
#We can also calculate the accumulated amount within each tariff band
group=df[['Tariff','Cash']].groupby('Tariff')
#For each group of rows, apply the transformation to each row in the group
#The number of rows in the response will be the same as the number of rows in the original data frame
df['Cash_cumul2']=group.transform('cumsum')['Cash']
```

In [60]:

```
#Here's how it looks:
ggplot(df,aes(x="Date",y="Cash_cumul2",colour="Tariff"))+geom_line()
```

Out[60]:

In [61]:

```
#We can also split the amounts out into separate charts
ggplot(df, aes(x="Date",y="Cash_cumul2")) + geom_line() \
+ ggtitle("Payments made over time") \
+ labs("Transaction Date", "Transaction amount (£)") \
+ facet_wrap("Tariff",scales = "fixed")
```

Out[61]:

Here's the start of a conversation with the full data set. It's a little scrappier, and rather more in quickfire note form, but you're hopefully in the swing of it now...

In [110]:

```
dfx=pd.DataFrame()
for fn in ['4_10_River Road Transaction Report April 2012.xls',
           '4_11_Transaction Report RR Aug 2012.xls',
           '4_5_Transaction Report RR Dec 2012 March 2013.xls',
           '4_6_Transaction Report RR July 2012.xls',
           '4_7_Transaction Report RR June 2012.xls',
           '4_8_Transaction Report RR May 2012.xls',
           '4_9_Transaction Report RR Sept Nov 2012.xls']:
    dfx=pd.concat([dfx,pd.read_excel('data/iw_parkingMeterData/'+fn,skiprows=6)])
dfx.dropna(how='all',axis=1,inplace=True)
dfx.Date=pd.to_datetime(dfx.Date, format="%Y-%m-%d %H:%M:%S", errors="coerce")
dfx.dropna(subset=["Date"],inplace=True)
```

In [111]:

```
dfx['weekday']=dfx['Date'].apply(lambda x: x.dayofweek)
ggplot(dfx, aes(x='factor(weekday)'))+geom_bar()
#0-Mon 6-Sun
```

Out[111]:

In [112]:

```
dfx['month']=dfx['Date'].apply(lambda x: x.month)
```

In [113]:

```
dfx['week']=dfx['Date'].apply(lambda x: x.week)
```

In [118]:

```
dfx['hour']=dfx['Date'].apply(lambda x: x.hour)
```

In [117]:

```
#How much activity is there by week of year? Note: the data was collected from a financial year,
#so the scale actually runs Jan-Mar 13, then Apr-Dec 12.
ggplot(dfx, aes(x='week'))+geom_bar()+facet_wrap('Tariff',scales='free_y')
```

Out[117]:
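To make that x-axis run in financial-year order (April 2012 through March 2013) rather than calendar order, the week number could be remapped to a "fiscal week". A rough sketch, assuming week 14 is the first week of April and ignoring 53-week years:

```python
def fiscal_week(week, start_week=14):
    """Remap a calendar week number so the year starts at `start_week`.
    Approximate: assumes a 52-week year."""
    return (week - start_week) % 52 + 1

print(fiscal_week(14))  # 1  (early April - start of the financial year)
print(fiscal_week(13))  # 52 (late March - end of the financial year)
print(fiscal_week(1))   # 40 (early January)
```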

In [176]:

```
#Is there any evidence of folk paying just what they need as it gets closer to free parking time at 18.00?
ggplot(dfx[(dfx['Tariff']=='01A') | (dfx['Tariff']=='01B')|(dfx['Tariff']=='01C')], aes(x='hour')) \
+geom_bar(binwidth=1)+facet_wrap('Tariff',scales='free_y')
#Note that the bins used to set the bars are 1 hour wide.
```

Out[176]:

In [140]:

```
tariffMap2={'01A':0.6, '01B':1,'01C':1.9, '01D':3.4,'01E':4.5,'01F':6.6,'01G':3,'01H':10,'01I':13,
'02A':0.6, '02B':1,'02C':1.9, '02D':3.4,'02E':4.5,'02F':6.6,}
dfx["Tariff_val"]=dfx['Tariff'].apply(lambda x: tariffMap2[x])
```

In [145]:

```
#Set a boolean to say whether or not a line item was an overpayment
dfx["Overpayment"]=(dfx["Cash"]!=dfx["Tariff_val"])
```

In [149]:

```
#Calculate amount of overpayment (if any) for each transaction
dfx["OverpaymentVal"]=dfx["Cash"]-dfx["Tariff_val"]
```

In [152]:

```
#What's the total amount of overpayment?
dfx["OverpaymentVal"].sum()
```

Out[152]:

In [153]:

```
#How much was overpaid at the 01C/ £1.90 tariff level?
dfx[dfx['Tariff']=='01C']["OverpaymentVal"].sum()
#Note - I think 02C is the same level and there were also overpayments at that level.
```

Out[153]:

In [183]:

```
#Total revenue over the year:
dfx["Cash"].sum()
```

Out[183]:

In [159]:

```
#How many people paid £2 on the 01C tariff?
dfx[(dfx['Tariff']=='01C') & (dfx['Cash']==2)]["OverpaymentVal"].count()
```

Out[159]:

In [175]:

```
dfx['OverpaymentValRounded']=dfx['OverpaymentVal'].apply(lambda x: round(x,2))
#This crosstab counts the occurrences of one column value or index value with respect to another
#So we can get count of the number of overpayments of a particular size by Tariff
pd.crosstab(dfx['OverpaymentValRounded'],dfx['Tariff'], margins=True)
```

Out[175]:
