Pandas is a "library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language." I use it instead of dumping data out to Excel. For me, having data manipulations well documented and repeatable is much easier to accomplish through Python code than with a set of instructions written in Word and completed in Excel. I know Excel is great, and I've used it for years, but thanks to Pandas it has become unnecessary for most of my work. Recognizing that Excel is still in heavy use, Pandas does provide easy utilities for reading and writing Excel files.
We won't get deep into the data analysis part of Pandas and what it can really do in this notebook. We are going to create a simple graph representing aged WIP for a single timekeeper. Such a graph can be published to a static report, sent via email, or embedded in a PDF as part of a timekeeper's factsheet for the compensation committee.
As before, we can easily install the latest Pandas and Matplotlib libraries if they are not already installed.
!{sys.executable} -m pip install --upgrade pandas
!{sys.executable} -m pip install --upgrade matplotlib
From our Interacting with SQL notebook, a reminder:
As we're publishing this to the web, I've placed my server name, database name, user, and password in environment variables so they don't appear here in the Notebook. I use Linux and systemd to start my Jupyter Hub and set the environment variables in my /etc/systemd/system/jupyter.service file. In my server's jupyterhub_config.py file, I add any variables I want exposed to the individual users via the c.Spawner.env_keep option. Any users I'm opening Jupyter Hub up to likely know these read-only accounts anyway, as they've had some SQL need in the past; generally this means other developers, report writers, and analysts.
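As a sketch of what that setup looks like (the host names and values below are made up for illustration; only the variable names match the code that follows):

```python
# /etc/systemd/system/jupyter.service (excerpt) -- systemd unit syntax:
#   [Service]
#   Environment=ELITE_PROD_HOST=sqlserver.example.com
#   Environment=ELITE_PROD_DB=elite
#   Environment=ELITE_RO_NAME=readonly_user
#   Environment=ELITE_RO_PASS=********

# jupyterhub_config.py (excerpt): whitelist the variables so spawned
# single-user servers inherit them from the Hub's environment
c.Spawner.env_keep = [
    'ELITE_PROD_HOST',
    'ELITE_PROD_DB',
    'ELITE_RO_NAME',
    'ELITE_RO_PASS',
]
```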
import os
import pyodbc
import pandas as pd
import matplotlib.pyplot as plt
elite_server = os.environ['ELITE_PROD_HOST']
elite_db = os.environ['ELITE_PROD_DB']
elite_user = os.environ['ELITE_RO_NAME']
elite_pass = os.environ['ELITE_RO_PASS']
conn_str = (f'DRIVER={{FreeTDS}};SERVER={elite_server};'
f'PORT=1433;DATABASE={elite_db};UID={elite_user};PWD={elite_pass};'
f'TDS_Version=8.0;ClientCharset=UTF-8')
e3_conn = pyodbc.connect(conn_str)
The SQL statement below will create some typical aging buckets. You could join out to AgingScheme and AgingSchemeBuckets and build this dynamically; however, for easier reading and potentially better performance, I'll present it like this and leave the specifics for your firm up to you.
There is one parameter in the statement, and that's to hold the timekeeper index (not number).
aged_wip_sql = """
declare @tkpr as integer = ? -- The timekeeper index [this can be optional]
declare @today as datetime = convert(datetime, floor(convert(float, getdate())))
select
timekeeper.Number,
timekeeper.DisplayName,
sum(case when aging_grouped.bucket = 1 then sum_WIPAmt else 0 end) '0-30',
sum(case when aging_grouped.bucket = 2 then sum_WIPAmt else 0 end) '31-60',
sum(case when aging_grouped.bucket = 3 then sum_WIPAmt else 0 end) '61-90',
sum(case when aging_grouped.bucket = 4 then sum_WIPAmt else 0 end) '91-120',
sum(case when aging_grouped.bucket = 5 then sum_WIPAmt else 0 end) 'Over 120',
sum(sum_WIPAmt) 'Total'
from (
select aging_detail.Timekeeper, aging_detail.bucket, sum(aging_detail.WIPAmt) sum_WIPAmt
from (
select timecard.Timekeeper, timecard.WIPAmt,
case
when DATEDIFF(day,timecard.workdate,@today) < 31 then 1
when DATEDIFF(day,timecard.workdate,@today) between 31 and 60 then 2
when DATEDIFF(day,timecard.workdate,@today) between 61 and 90 then 3
when DATEDIFF(day,timecard.workdate,@today) between 91 and 120 then 4
else 5
end 'bucket'
from timecard
where timecard.IsActive = 1 and timecard.WIPRemoveDate is null and timecard.IsNB = 0 and timecard.WorkAmt <> 0
and timecard.Timekeeper = @tkpr -- comment this line if we want all timekeepers
) aging_detail
group by aging_detail.Timekeeper, aging_detail.bucket
) aging_grouped
join timekeeper on timekeeper.TkprIndex = aging_grouped.Timekeeper
group by
timekeeper.Number,
timekeeper.DisplayName
"""
# change this to an index for a timekeeper in your firm
timekeeper_index = 1234
Here is where Pandas starts to make things easier. We create a dataframe to house the results from the query above and simultaneously query the database with the specified Timekeeper using read_sql. We set the index of the dataframe to the Timekeeper's Number.
df_agedwip = pd.read_sql(aged_wip_sql, e3_conn, params=(timekeeper_index,), index_col='Number')
If we want to see what that dataframe has in it, we can view it like this. The data may get truncated if you have many columns or many rows. This technique of displaying is mostly to ensure you have the type of data you were expecting.
df_agedwip
So there's our data.
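If the quick display isn't enough, a couple of one-liners give a fuller picture of what came back. The dataframe here is a stand-in with made-up numbers so the snippet is self-contained:

```python
import pandas as pd

# Hypothetical aged-WIP rows for two timekeepers (made-up values)
df = pd.DataFrame(
    {'0-30': [1500.0, 900.0], '31-60': [250.0, 0.0], 'Total': [1750.0, 900.0]},
    index=pd.Index(['1234', '5678'], name='Number'),
)

df.head()    # first few rows -- enough to spot obvious problems
df.dtypes    # confirm the amounts came back numeric, not as strings
df.shape     # (rows, columns)
```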
But I promised a chart. We will use matplotlib for this. Pandas' plot method is a wrapper around matplotlib's pyplot.plot(), so when looking at the documentation, the two share many interchangeable options, but some are not available in the wrapper and for those you need calls like plt.show() or plt.savefig(). There are lots of options and settings, but by example here is how a bar chart can be created. The main function, plot, manages the visualization.
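To make the wrapper relationship concrete, here is a minimal sketch with made-up data: the Pandas plot call returns a matplotlib Axes, so Axes-level methods work directly, while figure-level operations still go through pyplot:

```python
import matplotlib
matplotlib.use('Agg')            # non-interactive backend for this sketch
import matplotlib.pyplot as plt
import pandas as pd

s = pd.Series([3, 1, 2], index=['a', 'b', 'c'])  # made-up data
ax = s.plot(kind='bar')          # wrapper call: returns a matplotlib Axes
ax.set_ylabel('Count')           # Axes-level calls work on the result directly
plt.close('all')                 # figure-level calls still go through pyplot
```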
As is often the case, the data isn't exactly as we need it, so let's massage it a little. We don't want the Total amount or the DisplayName, so we drop those first. The plot method of the dataframe expects the categories as rows, so we need a call to transpose, which is much like Excel's Paste Special -> Transpose: it rotates the data. Then we plot it.
It may appear complicated at first look. We could have structured the SQL statement to return exactly the format we needed, but I didn't, as it made for a good opportunity to show how to chain some methods together to quickly alter the dataset being plotted. The beauty of Pandas is that it efficiently handles much larger datasets in the same way.
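To see what that drop/transpose chain does in isolation, here's a toy dataframe with hypothetical numbers in the same shape as the query result:

```python
import pandas as pd

# Hypothetical query result: one row per timekeeper, buckets as columns
df = pd.DataFrame(
    {'DisplayName': ['Jane Doe'], '0-30': [1500.0],
     '31-60': [250.0], 'Total': [1750.0]},
    index=pd.Index(['1234'], name='Number'),
)

# Drop the non-bucket columns, then rotate so buckets become rows
plot_ready = df.drop(['Total', 'DisplayName'], axis=1).transpose()
# plot_ready now has the buckets as the index and the timekeeper as
# the single column -- the orientation plot() expects for a bar chart
print(plot_ready)
```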
# good habit to close any existing figures
plt.close('all')
# pick some colors for the bars
colors = ['#8BC34A', '#7CB342', '#FFB300', '#E53935', '#B71C1C']
# build the plot
ax = df_agedwip.drop(['Total','DisplayName'], axis=1).transpose().plot(kind="bar",
figsize=(5,2),
color=[colors],
legend=False, rot=0, grid=True )
# label the axis
ax.set_xlabel("Days")
ax.set_ylabel("Dollars")
# put the grid lines behind the colour bars
ax.set_axisbelow(True)
# personal preference to remove the top and right axis spines
ax.spines['right'].set_color('none')
ax.spines['top'].set_color('none')
# every chart deserves a title
ax.set_title(f"Aged WIP (Days)", fontdict={'fontsize': 14, 'fontweight': 'heavy'})
# format the dollar amounts
ax.set_yticklabels([f"${label:,.0f}" for label in ax.get_yticks() ])
# add some text and kind of hide it over to the right giving the run time
ax.text(1,.5, f'As at {pd.Timestamp.now():%Y-%b-%d %I:%M %p}',
verticalalignment='center',
rotation='vertical',
fontdict={'fontstyle': 'italic', 'fontsize':7, 'color': '#cccccc'},
transform=ax.transAxes
)
# let's take a look at our work!
plt.show()
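Since the goal mentioned earlier was a graph that can go into a static report or email, the figure can also be written to disk instead of (or after) being shown. A self-contained sketch with made-up data; the filename is just an example:

```python
import matplotlib
matplotlib.use('Agg')            # headless backend, e.g. for a scheduled job
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical aged-WIP buckets for one timekeeper
wip = pd.Series([1500.0, 250.0, 0.0, 75.0, 300.0],
                index=['0-30', '31-60', '61-90', '91-120', 'Over 120'])
ax = wip.plot(kind='bar', rot=0)
ax.set_title('Aged WIP (Days)')
# bbox_inches='tight' trims surplus whitespace around the figure
ax.figure.savefig('aged_wip_1234.png', dpi=150, bbox_inches='tight')
plt.close('all')
```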
# for interest's sake. Note that this doesn't change our original, only displays the output from the `transpose` method.
df_agedwip.transpose()
In this Notebook, we've set up a connection to our Elite database server, run a SQL query using Pandas to create a dataframe with the results, performed some simple 'massaging' of the data, and created a graph from it.
-30-