CloudStor access via WebDAV

CloudStor is a data storage service provided by AARNet. Individual researchers at AARNet-connected institutions get 100 GB of storage space for free, and research projects can apply for additional space.

We're using CloudStor to store and share high-resolution scans of Sydney Stock Exchange records from the Noel Butlin Archives at ANU. By my reckoning, there are 72,843 TIFF files, each weighing in at about 100 MB. I'm going to be exploring ways of getting useful structured data out of the images, but as a first step I just wanted to be able to access data about the files.

CloudStor is an instance of OwnCloud, and OwnCloud provides WebDAV access, so I thought I'd have a go at using WebDAV to access file data on CloudStor.

It works, but there are a few tricks...

Software

I'm using a Python WebDAV client. I installed it using pip but ran into some problems with the dependencies. PyCurl complained that it didn't know what SSL library it was meant to use. Thanks to StackOverflow, I got it going with:

brew install curl --with-openssl
pip install --no-cache-dir --global-option=build_ext --global-option="-L/usr/local/opt/openssl/lib" --global-option="-I/usr/local/opt/openssl/include" --user pycurl
In [18]:
# Import what we need
import webdav.client as wc
import random
import pandas as pd
import time
import os
import pickle
from tqdm.auto import tqdm
from credentials import * # Storing my CloudStor credentials in another file

Configuration

This was the thing that caused me most confusion.

First of all, you have to create a password in CloudStor to use with WebDAV. This is not the password that you use to access the CloudStor web interface (via the AAF).

  • Log onto the CloudStor web interface (using your institutional credentials)
  • Click on Settings in the top menu
  • Enter your new password in the 'Password' box and click Change password

This is the password you'll use with the WebDAV client. The WebDAV username is the email address you've used to register with CloudStor.
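
The import cell above pulls CLOUDSTOR_USER and CLOUDSTOR_PW in from a separate credentials file, but what goes in that file isn't shown. Here's one possible sketch, assuming you'd rather keep the values in environment variables than in the file itself. The load_credentials helper and the environment variable names are my own invention for illustration; they're not part of CloudStor or the WebDAV client.

```python
import os


def load_credentials(env=os.environ):
    """Return (user, password) for WebDAV, read from environment variables.

    CLOUDSTOR_USER and CLOUDSTOR_PW are hypothetical variable names.
    """
    try:
        return env['CLOUDSTOR_USER'], env['CLOUDSTOR_PW']
    except KeyError as err:
        # Fail early with a readable message instead of a bare KeyError
        raise RuntimeError(
            'Missing environment variable: {}'.format(err.args[0])
        ) from None
```

A credentials.py built this way would then just contain `CLOUDSTOR_USER, CLOUDSTOR_PW = load_credentials()`, and the star import would keep working unchanged.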

On the bottom left of the CloudStor web interface is another Settings link. If you click it, it displays the URL to use with WebDAV: https://cloudstor.aarnet.edu.au/plus/remote.php/webdav/

Originally, I just plugged this link in below as the webdav_hostname and at first things seemed to work. I could list the contents of a directory, but I couldn't get resource information or download a file. Eventually, amongst the issues on the client's GitHub site, I found the answer. You have to separate the host from the path, and supply the path as webdav_root.

In [11]:
# Set the connection options. CLOUDSTOR_USER and CLOUDSTOR_PW are stored in a separate credentials file.
options = {
    'webdav_hostname': 'https://cloudstor.aarnet.edu.au',
    'webdav_login':    CLOUDSTOR_USER,
    'webdav_password': CLOUDSTOR_PW,
    'webdav_root': '/plus/remote.php/webdav/'
}
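
If you'd rather not split the URL by hand, the standard library can do it. This is just a sketch: split_webdav_url is a hypothetical helper of mine, and the only thing it assumes about the client is the two option names used above.

```python
from urllib.parse import urlsplit


def split_webdav_url(url):
    """Split a full WebDAV URL into the host and root path options."""
    parts = urlsplit(url)
    return {
        # Scheme plus host, with no trailing path
        'webdav_hostname': '{}://{}'.format(parts.scheme, parts.netloc),
        # Everything after the host goes in as the root
        'webdav_root': parts.path,
    }


url_options = split_webdav_url(
    'https://cloudstor.aarnet.edu.au/plus/remote.php/webdav/'
)
```

The resulting dictionary could be merged into the connection options with `options.update(url_options)`.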

Getting file lists

In [12]:
# Ok let's initiate the client.
client = wc.Client(options)
In [13]:
# Use .list() to get a list of resources in the directory
# In this case it's a list of subdirectories
dirs = client.list('Shared/ANU-Library/Sydney Stock Exchange 1901-1950/')
# For some reason the parent directory is included in the list, let's filter it out
# We'll also remove some old directories we don't want
dirs = [d for d in dirs if (d[:2] == 'AU') and '_old' not in d]
In [15]:
# Loop through all the subdirectories and use .list() again to get all the filenames
details = []  # One record per file
summary = []  # One record per subdirectory
for d in tqdm(dirs, desc='Directories'):
    files = [f for f in client.list('Shared/ANU-Library/Sydney Stock Exchange 1901-1950/{}'.format(d)) if f[:1] == 'N']
    # print('{}: {} files'.format(d, len(files)))
    # Save the details for each subdirectory
    summary.append({'directory': d, 'number': len(files)})
    for f in tqdm(files, desc='Files', leave=False):
        path = 'Shared/ANU-Library/Sydney Stock Exchange 1901-1950/{}{}'.format(d, f)
        # This slows things down a lot, so disable for now
        # info = client.info(path)
        info = {}
        info['name'] = f
        info['directory'] = d
        info['path'] = path
        # print(info)
        details.append(info)
    time.sleep(0.5)
In [16]:
# How many files are there?
len(details)
Out[16]:
72932
In [17]:
# Get some information on individual files
client.info('Shared/ANU-Library/Sydney Stock Exchange 1901-1950/{}/{}'.format('AU NBAC N193-001', 'N193-001_0001.tif'))
Out[17]:
{'created': None,
 'name': None,
 'size': '106240746',
 'modified': 'Wed, 13 Jun 2018 01:56:48 GMT'}
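
Note that the size comes back as a string and the modified date as an HTTP-style timestamp. If you want to sort or sum these values, something like the following sketch might help. The tidy_info function and its output keys are hypothetical names, not part of the client.

```python
from email.utils import parsedate_to_datetime


def tidy_info(info):
    """Convert the string values returned by client.info() to useful types."""
    size_bytes = int(info['size'])
    return {
        'size_bytes': size_bytes,
        # Decimal megabytes, rounded to one place
        'size_mb': round(size_bytes / 1000000, 1),
        # Parses dates like 'Wed, 13 Jun 2018 01:56:48 GMT'
        'modified': parsedate_to_datetime(info['modified']),
    }


sample_info = {
    'created': None,
    'name': None,
    'size': '106240746',
    'modified': 'Wed, 13 Jun 2018 01:56:48 GMT',
}
tidy = tidy_info(sample_info)
```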

Saving the results

I saved the results as CSV files: one for files and one for directories.

In [20]:
# Save previously downloaded data as CSV files so that I don't have to do it again
# I use Pandas for these conversions because it's easy
df_files = pd.DataFrame(details)
df_files.to_csv('files.csv', index=False)
df_dirs = pd.DataFrame(summary)
df_dirs.to_csv('directories.csv', index=False)
In [21]:
# Load previously harvested data
files = pd.read_csv('files.csv').to_dict('records')
directories = pd.read_csv('directories.csv').to_dict('records')

Getting a random sample of images

To do some testing on the images, I wanted to download a random sample.

In [22]:
# First we'll make a random selection from the list of file names.
random_files = random.sample(files, 2000)
In [24]:
# freeze this sample for reuse
with open('random_sample.pickle', 'wb') as pickle_file:
    pickle.dump(random_files, pickle_file)
In [25]:
# reload frozen sample
with open('random_sample.pickle', 'rb') as pickle_file:
    random_files = pickle.load(pickle_file)
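
An alternative to pickling the sample is to draw it from a seeded random number generator, so the same selection can be regenerated on demand. This is only a sketch: reproducible_sample and the seed value are hypothetical, and it assumes the list of records is in the same order each time and that you're running the same Python version, since the sampling algorithm isn't guaranteed to be stable across versions. That caveat is a good reason to keep the pickle as well.

```python
import random


def reproducible_sample(records, k, seed=42):
    """Return the same random selection every time for a given seed."""
    rng = random.Random(seed)  # Private generator, leaves the global one alone
    # Don't ask for more items than there are
    return rng.sample(records, min(k, len(records)))


# Stand-in records, just to show the behaviour
demo_records = [{'name': 'page_{:04d}.tif'.format(i)} for i in range(100)]
first_draw = reproducible_sample(demo_records, 10)
second_draw = reproducible_sample(demo_records, 10)
```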
In [ ]:
# Then we'll just loop through the randomly selected files and download them
for image in random_files:
    print('Downloading {}'.format(image['name']))
    # Avoid shadowing the built-in dir(), and let makedirs handle existing directories
    image_dir = '/Volumes/bigdata/mydata/stockexchange/years/{}'.format(image['directory'].replace(' ', '-'))
    os.makedirs(image_dir, exist_ok=True)
    filename = os.path.join(image_dir, image['name'])
    # Skip files that have already been downloaded
    if not os.path.exists(filename):
        client.download_sync(remote_path=image['path'], local_path=filename)
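
With files of about 100 MB each, a long run of downloads may well hit the occasional network error. One way to cope is to wrap each download in a simple retry loop. The with_retries helper below is a hypothetical sketch, not something the WebDAV client provides, and it retries on any exception because I'm not assuming which exception types the client raises.

```python
import time


def with_retries(action, attempts=3, wait=5, sleep=time.sleep):
    """Call action() until it succeeds or the attempts run out."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == attempts:
                raise  # Out of attempts, let the error surface
            # Wait a little longer after each failure
            sleep(wait * attempt)


# Usage with the client from this notebook would look something like:
# with_retries(lambda: client.download_sync(remote_path=image['path'], local_path=filename))
```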

Download by year

In [4]:
def download_page(page, year, output_dir):
    # Save each page into a subdirectory named for its year
    image_dir = os.path.join(output_dir, str(year))
    os.makedirs(image_dir, exist_ok=True)
    filename = os.path.join(image_dir, page.name)
    # Skip files that have already been downloaded
    if not os.path.exists(filename):
        client.download_sync(remote_path=page.path, local_path=filename)

def download_by_year(year, output_dir):
    # files_with_dates.csv needs 'year', 'name', and 'path' columns
    df_dates = pd.read_csv('files_with_dates.csv')
    for page in df_dates.loc[df_dates['year'] == year].itertuples():
        download_page(page, year, output_dir)
In [5]:
download_by_year(1930, '/Volumes/bigdata/mydata/stockexchange/years')

But wait there's more...

Wondering how to access a public share? Have a look here...
