I ran my column detection script across all 199 volumes and 70,000+ pages of the Sydney Stock Exchange registers, generating a CSV file for each year from 1901 to 1950. See this notebook for more details.
This notebook combines the separate CSV files into a single dataframe and then visualises the column detection results using Altair.
import pandas as pd
import os
import altair as alt
from IPython.display import display, HTML
# Comment the next line out if using in JupyterLab
alt.renderers.enable('notebook')
alt.data_transformers.enable('json')
# We're going to combine all of the CSV files into one big dataframe
# Create an empty dataframe
combined_df = pd.DataFrame()
# Loop through the range of years
for year in range(1901, 1951):
    # Open the CSV file for that year as a dataframe
    year_df = pd.read_csv('{}.csv'.format(year))
    # Add the single year df to the combined df (DataFrame.append was removed in pandas 2)
    combined_df = pd.concat([combined_df, year_df], ignore_index=True)
To visualise the results, we first need to move the data around a bit to get it into a form we can chart using Altair.
Because this loops through dataframes for each volume calculating column widths, it is a bit slow...
# Replace any NaNs with 0
combined_df['column_positions'] = combined_df['column_positions'].fillna('0')
# Group images by volume using the 'referenceCode' column
vol_groups = combined_df.groupby('referenceCode')
# Create an empty dataframe
df_volumes = pd.DataFrame()
# Just used to limit the number of volumes processed for testing
max_rows = 200
row = 0
# Loop through the volume groups
for vol, pages in vol_groups:
if row < max_rows:
# Pages is a dataframe with details of all the pages in a volume
pages = pages.copy()
# Convert the width to a string, so we concatenate with the column positions
pages['width'] = pages['width'].astype('str')
# Add the page width to the end of the column positions string
pages['column_positions'] = pages[['column_positions', 'width']].apply(lambda x: ','.join(x), axis=1)
# Create a new dataframe by exploding the column positions into separate rows, using the file name as the index
new_df = pd.DataFrame(pages['column_positions'].str.split(',').tolist(), index=pages['name']).stack()
# Change the file name from an index into a column
new_df = new_df.reset_index([0, 'name'])
# Rename the columns
new_df.columns = ['name', 'col_pos']
# Convert col_pos column to an integer
new_df['col_pos'] = new_df['col_pos'].astype('int')
# Extract the page number from the file name and save as a new field
new_df['page_num'] = new_df['name'].str.extract(r'_(\d+)\.').astype('int')
# Add the volume name
new_df['volume'] = vol
# Add the volume number
new_df['volume_num'] = int(vol[-3:])
# Sort pages by the volume number / page number
new_df = new_df.sort_values(by=['volume_num','page_num'])
# Number the columns in each row
new_df['order'] = new_df.groupby('page_num').cumcount()
# In this loop we're going to calculate the width of each column by subtracting
# the previous col_pos from the current one
# Loop through the dataframe
for i in range(1, len(new_df)):
# if it's the first column then the width is equal to the position
if new_df.loc[i, 'order'] == 0:
new_df.loc[i, 'col_width'] = new_df.loc[i, 'col_pos']
# Otherwise we have to get the previous value and subtract
else:
new_df.loc[i, 'col_width'] = new_df.loc[i, 'col_pos'] - new_df.loc[i-1, 'col_pos']
# Append to the combined volumes dataframe
df_volumes = df_volumes.append(new_df)
row += 1
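If the row-by-row loop is too slow, the same width calculation can be vectorised with pandas' grouped `diff`. A minimal sketch, using a tiny made-up sample with the same column names as above:

```python
import pandas as pd

# Made-up sample with the same columns as new_df
df = pd.DataFrame({
    'name': ['p1.tif', 'p1.tif', 'p1.tif', 'p2.tif', 'p2.tif'],
    'col_pos': [0, 1811, 3222, 205, 1600],
    'order': [0, 1, 2, 0, 1],
})

# Within each page, a column's width is the difference between successive
# positions; diff() returns NaN for the first column on each page, which we
# fill with the position itself (distance from the left edge)
df['col_width'] = df.groupby('name')['col_pos'].diff().fillna(df['col_pos'])
print(df['col_width'].tolist())  # [0.0, 1811.0, 1411.0, 205.0, 1395.0]
```

Note that this fills the very first column's width with its position rather than leaving it as NaN, so the result differs slightly from the loop above on the first row of each page group.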
Let's have a peek inside the new dataframe.
df_volumes.head()
| | name | col_pos | page_num | volume | volume_num | order | col_width |
|---|---|---|---|---|---|---|---|
| 0 | N193-001_0001.tif | 0 | 1 | N193-001 | 1 | 0 | NaN |
| 1 | N193-001_0001.tif | 1811 | 1 | N193-001 | 1 | 1 | 1811.0 |
| 2 | N193-001_0001.tif | 3222 | 1 | N193-001 | 1 | 2 | 1411.0 |
| 3 | N193-001_0001.tif | 6237 | 1 | N193-001 | 1 | 3 | 3015.0 |
| 4 | N193-001_0002.tif | 205 | 2 | N193-001 | 1 | 0 | 205.0 |
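Because we appended the page width to the end of each page's column positions, the widths for a page should add up to the width of the digitised image. A quick sanity check, sketched here on a made-up sample mirroring the first page above (in the notebook you'd group `df_volumes` instead):

```python
import pandas as pd

# Sample mirroring the first page shown in the table above
sample = pd.DataFrame({
    'name': ['p1.tif'] * 4,
    'col_pos': [0, 1811, 3222, 6237],
    'col_width': [float('nan'), 1811.0, 1411.0, 3015.0],
})

# Sum the column widths for each page; NaN is skipped when summing,
# so the total should equal the final col_pos (the image width)
totals = sample.groupby('name')['col_width'].sum()
print(totals['p1.tif'])  # 6237.0
```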
Now we have a dataframe that has one row for every column on every page in every volume. Let's visualise a single volume by creating a bar chart that stacks the column widths on top of each other to represent a page. In other words, there'll be one column in the chart for every page in the volume, and the total height of the column will represent the width of the digitised image. The widths of individual table columns will be shown by different coloured segments.
We'll also add a couple of extra features. When you hover over the chart, a tooltip will display the volume, page, and column details. And when you click, you'll download the image of that page from CloudStor (you'll need to be logged in to CloudStor for this to work).
def chart_volume(df):
    chart = alt.Chart(df).transform_calculate(
        # Create a url that will download a page image when clicked
        url='https://cloudstor.aarnet.edu.au/plus/remote.php/webdav/Shared/ANU-Library/Sydney%20Stock%20Exchange%201901-1950/AU%20NBAC%20' + alt.datum.volume + '/' + alt.datum.name
    ).mark_bar(size=6).encode(
        # X axis shows pages
        x=alt.X('page_num:O', axis=alt.Axis(labels=False, ticks=False), title='', scale=alt.Scale(rangeStep=7)),
        # Y axis stacks the table column widths together to show total page width (in pixels)
        y=alt.Y('col_width:Q', axis=alt.Axis(labels=False, ticks=False), title=''),
        # Individual table column widths are distinguished by colour
        color=alt.Color('order:N', legend=alt.Legend(title='Column')),
        # Hover to see page and column details
        tooltip=['volume', 'page_num', 'col_width'],
        # On click, download the image
        href='url:N',
        # Order the coloured segments by column order
        order=alt.Order('order', sort='ascending')
    )
    return chart
Create a chart! Change the `volume` value as desired...
volume = 'N193-001'
chart = chart_volume(df_volumes.loc[df_volumes['volume'] == volume])
display(chart)
We can visualise all 199 volumes in much the same way. To make the volumes a bit easier to read, we'll display them vertically. We'll also use Altair's 'facet' option to show lots of different charts at once, as it's more efficient than generating each chart individually. However, after a bit of trial and error, it seems there's a limit on the number of facets you can display. We'll deal with this by dividing the dataframe roughly into halves, and visualising each half separately.
You can change the number of charts in each row by changing the `num_columns` value.
# Much the same as above, but note the use of facet to generate multiple charts
def chart_volumes(df, num_columns=25):
    chart = alt.Chart(df).mark_bar(size=3).encode(
        # Pages run down the y axis, so each volume displays vertically
        y=alt.Y('page_num:O', axis=alt.Axis(labels=False, ticks=False), title='', scale=alt.Scale(rangeStep=3.4)),
        x=alt.X('col_width:Q', axis=alt.Axis(labels=False, ticks=False), title=''),
        color=alt.Color('order:N', legend=None),
        tooltip=['volume', 'page_num', 'col_width'],
        order=alt.Order('order', sort='ascending')
    ).properties(
        # Set the width
        width=100
    ).facet(
        # This creates a separate chart for each volume
        facet='volume:N',
        # Number of charts in a row
        columns=num_columns
    )
    return chart
# Display the first 100 volumes
chart1 = chart_volumes(df_volumes.loc[df_volumes['volume_num'] <= 100], num_columns=20)
# Display the chart
display(chart1)
# Save the chart as html
chart1.save('docs/chart-vols-1-100.html')
# Display the rest of the volumes
chart2 = chart_volumes(df_volumes.loc[df_volumes['volume_num'] > 100], num_columns=20)
# Display the chart
display(chart2)
# Save the chart as html
chart2.save('docs/chart-vols-101-199.html')
To make it easier to explore these charts, I've saved them as HTML files in a `docs` directory, so that I can use GitHub Pages to display them. However, for these to display properly, I also have to manually move the data files into the `docs` directory. They look something like `altair-data-...json`.
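That manual step could be scripted. A hedged sketch, assuming the data files sit in the notebook's working directory and match Altair's default `altair-data-*.json` naming:

```python
import glob
import os
import shutil

# Move Altair's JSON data files into the docs directory so the saved HTML
# charts can find them. The 'altair-data-*.json' pattern and the 'docs' path
# are assumptions based on the defaults used in this notebook.
os.makedirs('docs', exist_ok=True)
for data_file in glob.glob('altair-data-*.json'):
    shutil.move(data_file, os.path.join('docs', os.path.basename(data_file)))
```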
Here are links to the HTML files on GitHub: