Column detection results

This notebook analyses the results of running the column detection script across all of the Stock Exchange images on CloudStor.

The raw results are in CSV files, one for each year. See this notebook for more details.

See this notebook for some visualisations of this data.

In [32]:
import pandas as pd
import os
In [33]:
# We're going to combine all of the CSV files into one big dataframe

# Collect the dataframe for each year in a list
year_dfs = []

# Loop through the range of years
for year in range(1901, 1951):
    
    # Open the CSV file for that year as a dataframe
    year_df = pd.read_csv('{}.csv'.format(year))
    
    # Add the single year df to the list
    year_dfs.append(year_df)

# Combine them all (DataFrame.append was removed in pandas 2.0)
combined_df = pd.concat(year_dfs)
In [34]:
# How many images do we have data for?
combined_df.shape
Out[34]:
(72932, 11)
In [35]:
# Have a look inside
combined_df.head()
Out[35]:
directory name path referenceCode startDate endDate year width height columns column_positions
0 AU NBAC N193-001/ N193-001_0001.tif Shared/ANU-Library/Sydney Stock Exchange 1901-... N193-001 1901-01-01 1901-03-01 1901 6237 5000 3 0,1811,3222
1 AU NBAC N193-001/ N193-001_0002.tif Shared/ANU-Library/Sydney Stock Exchange 1901-... N193-001 1901-01-01 1901-03-01 1901 6266 5000 3 205,1840,3259
2 AU NBAC N193-001/ N193-001_0003.tif Shared/ANU-Library/Sydney Stock Exchange 1901-... N193-001 1901-01-01 1901-03-01 1901 6237 5000 2 286,2068
3 AU NBAC N193-001/ N193-001_0004.tif Shared/ANU-Library/Sydney Stock Exchange 1901-... N193-001 1901-01-01 1901-03-01 1901 6236 5000 3 9,1821,3219
4 AU NBAC N193-001/ N193-001_0005.tif Shared/ANU-Library/Sydney Stock Exchange 1901-... N193-001 1901-01-01 1901-03-01 1901 6236 5000 3 288,1821,3220
In [5]:
# How many pages have each number of detected columns?
combined_df['columns'].value_counts()
Out[5]:
3    41076
4    26917
2     4825
1       19
0        6
Name: columns, dtype: int64
In [6]:
# Find the images that have no width recorded
combined_df.loc[combined_df['width'] == 0]
Out[6]:
directory name path referenceCode startDate endDate year width height columns column_positions
677 AU NBAC N193-055/ N193-055_0037.tif Shared/ANU-Library/Sydney Stock Exchange 1901-... N193-055 1914-07-01 1914-09-01 1914 0 0 0 NaN
1051 AU NBAC N193-064/ N193-064_0078.tif Shared/ANU-Library/Sydney Stock Exchange 1901-... N193-064 1916-10-01 1916-12-01 1916 0 0 0 NaN
44 AU NBAC N193-173/ N193-173_0045.tif Shared/ANU-Library/Sydney Stock Exchange 1901-... N193-173 1944-01-01 1944-03-01 1944 0 0 0 NaN
50 AU NBAC N193-173/ N193-173_0051.tif Shared/ANU-Library/Sydney Stock Exchange 1901-... N193-173 1944-01-01 1944-03-01 1944 0 0 0 NaN
52 AU NBAC N193-173/ N193-173_0053.tif Shared/ANU-Library/Sydney Stock Exchange 1901-... N193-173 1944-01-01 1944-03-01 1944 0 0 0 NaN
65 AU NBAC N193-173/ N193-173_0066.tif Shared/ANU-Library/Sydney Stock Exchange 1901-... N193-173 1944-01-01 1944-03-01 1944 0 0 0 NaN

Pages with 0 or 1 columns detected

There are 25 pages with 0 or 1 columns detected. Let's see what's up with them...

In [7]:
# Get the problem pages
problems = combined_df.loc[(combined_df['columns'] == 0) | (combined_df['columns'] == 1)]
problems
Out[7]:
directory name path referenceCode startDate endDate year width height columns column_positions
677 AU NBAC N193-055/ N193-055_0037.tif Shared/ANU-Library/Sydney Stock Exchange 1901-... N193-055 1914-07-01 1914-09-01 1914 0 0 0 NaN
1051 AU NBAC N193-064/ N193-064_0078.tif Shared/ANU-Library/Sydney Stock Exchange 1901-... N193-064 1916-10-01 1916-12-01 1916 0 0 0 NaN
515 AU NBAC N193-090/ N193-090_0210.tif Shared/ANU-Library/Sydney Stock Exchange 1901-... N193-090 1923-04-01 1923-06-01 1923 4264 5000 1 355
330 AU NBAC N193-109/ N193-109_0331.tif Shared/ANU-Library/Sydney Stock Exchange 1901-... N193-109 1928-01-01 1928-03-01 1928 4032 5000 1 26
856 AU NBAC N193-111/ N193-111_0216.tif Shared/ANU-Library/Sydney Stock Exchange 1901-... N193-111 1928-07-01 1928-09-01 1928 5732 5000 1 12
1590 AU NBAC N193-163/ N193-163_0427.tif Shared/ANU-Library/Sydney Stock Exchange 1901-... N193-163 1941-07-01 1941-09-01 1941 3642 2464 1 0
44 AU NBAC N193-173/ N193-173_0045.tif Shared/ANU-Library/Sydney Stock Exchange 1901-... N193-173 1944-01-01 1944-03-01 1944 0 0 0 NaN
50 AU NBAC N193-173/ N193-173_0051.tif Shared/ANU-Library/Sydney Stock Exchange 1901-... N193-173 1944-01-01 1944-03-01 1944 0 0 0 NaN
52 AU NBAC N193-173/ N193-173_0053.tif Shared/ANU-Library/Sydney Stock Exchange 1901-... N193-173 1944-01-01 1944-03-01 1944 0 0 0 NaN
65 AU NBAC N193-173/ N193-173_0066.tif Shared/ANU-Library/Sydney Stock Exchange 1901-... N193-173 1944-01-01 1944-03-01 1944 0 0 0 NaN
414 AU NBAC N193-173/ N193-173_0415.tif Shared/ANU-Library/Sydney Stock Exchange 1901-... N193-173 1944-01-01 1944-03-01 1944 5879 4618 1 0
415 AU NBAC N193-173/ N193-173_0416.tif Shared/ANU-Library/Sydney Stock Exchange 1901-... N193-173 1944-01-01 1944-03-01 1944 5879 4618 1 0
416 AU NBAC N193-173/ N193-173_0417.tif Shared/ANU-Library/Sydney Stock Exchange 1901-... N193-173 1944-01-01 1944-03-01 1944 5879 4689 1 9
417 AU NBAC N193-173/ N193-173_0418.tif Shared/ANU-Library/Sydney Stock Exchange 1901-... N193-173 1944-01-01 1944-03-01 1944 5879 4618 1 0
418 AU NBAC N193-173/ N193-173_0419.tif Shared/ANU-Library/Sydney Stock Exchange 1901-... N193-173 1944-01-01 1944-03-01 1944 5879 4618 1 0
419 AU NBAC N193-173/ N193-173_0420.tif Shared/ANU-Library/Sydney Stock Exchange 1901-... N193-173 1944-01-01 1944-03-01 1944 5879 4618 1 0
420 AU NBAC N193-173/ N193-173_0421.tif Shared/ANU-Library/Sydney Stock Exchange 1901-... N193-173 1944-01-01 1944-03-01 1944 5879 4570 1 0
421 AU NBAC N193-173/ N193-173_0422.tif Shared/ANU-Library/Sydney Stock Exchange 1901-... N193-173 1944-01-01 1944-03-01 1944 5879 4570 1 0
422 AU NBAC N193-173/ N193-173_0423.tif Shared/ANU-Library/Sydney Stock Exchange 1901-... N193-173 1944-01-01 1944-03-01 1944 5879 4570 1 0
423 AU NBAC N193-173/ N193-173_0424.tif Shared/ANU-Library/Sydney Stock Exchange 1901-... N193-173 1944-01-01 1944-03-01 1944 5879 4558 1 0
424 AU NBAC N193-173/ N193-173_0425.tif Shared/ANU-Library/Sydney Stock Exchange 1901-... N193-173 1944-01-01 1944-03-01 1944 5879 4558 1 0
426 AU NBAC N193-173/ N193-173_0427.tif Shared/ANU-Library/Sydney Stock Exchange 1901-... N193-173 1944-01-01 1944-03-01 1944 5879 4558 1 0
427 AU NBAC N193-173/ N193-173_0428.tif Shared/ANU-Library/Sydney Stock Exchange 1901-... N193-173 1944-01-01 1944-03-01 1944 5879 4558 1 0
431 AU NBAC N193-173/ N193-173_0432.tif Shared/ANU-Library/Sydney Stock Exchange 1901-... N193-173 1944-01-01 1944-03-01 1944 5879 4558 1 0
432 AU NBAC N193-173/ N193-173_0433.tif Shared/ANU-Library/Sydney Stock Exchange 1901-... N193-173 1944-01-01 1944-03-01 1944 5879 4558 1 0
In [9]:
# If running locally need to set up Cloudstor client to download images
# DON'T RUN THIS ON SWAN (or you'll get an error because webdav is not installed)
import webdav.client as wc
from webdav.client import RemoteResourceNotFound
from credentials import * # Storing my CloudStor credentials in another file
# Set the connection options. CLOUDSTOR_USER and CLOUDSTOR_PW are stored in a separate credentials file.
options = {
    'webdav_hostname': 'https://cloudstor.aarnet.edu.au',
    'webdav_login':    CLOUDSTOR_USER,
    'webdav_password': CLOUDSTOR_PW,
    'webdav_root': '/plus/remote.php/webdav/'
}
# Ok let's initiate the client.
client = wc.Client(options)
from PIL import Image
In [10]:
def download_image(image):
    '''Download an image from CloudStor and save a resized JPEG copy.'''
    local_path = 'problems/{}'.format(image.name)
    try:
        client.download_sync(remote_path=image.path, local_path=local_path)
    except RemoteResourceNotFound:
        print('Not found: {}'.format(image.name))
    else:
        filename, ext = os.path.splitext(image.name)
        # Only make a resized copy if the file is larger than 3,000,000 bytes
        if os.path.getsize(local_path) > 3000000:
            img = Image.open(local_path)
            img.thumbnail((1000,1000), resample=Image.LANCZOS)
            img.save('problems/{}.jpg'.format(filename))
        else:
            print('Small: {}'.format(image.name))

# Make sure the local directory exists before downloading
os.makedirs('problems', exist_ok=True)

for row in problems.itertuples():
    if not os.path.exists('problems/{}'.format(row.name)):
        download_image(row)
Not found: N193-055_0037.tif

Note that 6 of the pages have no width or height recorded. This means that the script couldn't open the images. I manually checked these:

  • N193-055_0037.tif – only 31 MB, so it seems to be compressed (it also has a .tiff file extension)
  • N193-064_0078.tif – seems ok
  • N193-173_0045.tif – seems ok
  • N193-173_0051.tif – seems ok
  • N193-173_0053.tif – seems ok
  • N193-173_0066.tif – seems ok
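
A minimal sketch of how the two kinds of problem page could be separated, assuming a dataframe with the same `width`, `height`, and `columns` fields as the results above. The sample rows and the names `sample`, `unopened`, and `single_column` are illustrative only.

```python
import pandas as pd

# Illustrative rows only, using the same column names as the results above
sample = pd.DataFrame({
    'name': ['a.tif', 'b.tif', 'c.tif', 'd.tif'],
    'width': [0, 4264, 5879, 6237],
    'height': [0, 5000, 4618, 5000],
    'columns': [0, 1, 1, 3],
})

# Images the script couldn't open have no dimensions recorded
unopened = sample.loc[(sample['width'] == 0) & (sample['height'] == 0)]

# Images that were opened, but where only a single column was found
single_column = sample.loc[(sample['width'] > 0) & (sample['columns'] == 1)]

print(len(unopened), len(single_column))
```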

I downloaded the 5 that seemed ok and ran the column detection script on them; the results were as expected. So I think there must have been some temporary problem on CloudStor when the script first tried to access them.
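
If the failures really were temporary, retrying after a short pause might have avoided them. The following is a minimal sketch, not part of the original script: `retry` is a hypothetical helper, and `flaky_download` is a stand-in for whatever function actually fetches an image.

```python
import time

def retry(func, attempts=3, delay=0.0, exceptions=(Exception,)):
    """Call func(), trying up to `attempts` times if it raises."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            # Give up and re-raise once the last attempt has failed
            if attempt == attempts:
                raise
            # Wait a little longer after each failure
            time.sleep(delay * attempt)

# Stand-in for a download that fails twice, then succeeds
calls = {'count': 0}

def flaky_download():
    calls['count'] += 1
    if calls['count'] < 3:
        raise IOError('temporary problem')
    return 'downloaded'

result = retry(flaky_download, attempts=3, delay=0.0)
print(result, calls['count'])
```

In practice the `delay` would be set to a few seconds, and `exceptions` narrowed to the errors the client is known to raise.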

I downloaded the remaining pages, the ones with 1 column detected, and all of them were either rotated or not in the usual page format. These were rotated:

  • N193-090_0210.tif
  • N193-109_0331.tif

Others:

  • N193-111_0216.tif – back of page
  • N193-163_0427.tif – page of publication

All the rest from N193-173 are hand-written register pages.
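
The dimensions in the table above hint at one way to flag rotated pages automatically: the two rotated images are taller than they are wide (4264 × 5000 and 4032 × 5000), while typical pages are wider than they are tall (for example 6237 × 5000). A rough sketch of that check, assuming orientation alone is a useful signal; `looks_rotated` is a hypothetical helper, not part of the column detection script.

```python
def looks_rotated(width, height):
    """Guess that a page is rotated if it is taller than it is wide.

    Assumption: unrotated pages are landscape. Images with no
    recorded dimensions are not flagged.
    """
    if width <= 0 or height <= 0:
        return False
    return width < height

# Dimensions taken from the results above
pages = {
    'N193-001_0001.tif': (6237, 5000),
    'N193-090_0210.tif': (4264, 5000),
    'N193-109_0331.tif': (4032, 5000),
    'N193-111_0216.tif': (5732, 5000),
}

flagged = [name for name, (w, h) in pages.items() if looks_rotated(w, h)]
print(flagged)
```

This would not catch the other unusual pages, such as N193-111_0216.tif, which has the usual landscape shape.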

So, in summary, the column detector script seems to have worked as expected on all of these.
