#!/usr/bin/env python
# coding: utf-8

# # Using TroveHarvester to get newspaper articles in bulk
# 
# If you haven't used one of these notebooks before, they're basically web pages in which you can write, edit, and run live code. They're meant to encourage experimentation, so don't feel nervous. Just try running a few cells and see what happens!
# 
# Some tips:
# 
# * Code cells have boxes around them.
# * To run a code cell, click on the cell and then hit `Shift+Enter`. The `Shift+Enter` combo will also move you to the next cell, so it's a quick way to work through the notebook.
# * While a cell is running, a `*` appears in the square brackets next to it. Once the cell has finished running, the asterisk is replaced by a number.
# * In most cases you'll want to start from the top of the notebook and work your way down, running each cell in turn. Later cells might depend on earlier ones!
# * To edit a code cell, just click on it and type. Remember to run the cell once you've finished editing.

# The Trove Newspaper Harvester is a command line tool that helps you download large quantities of digitised newspaper articles from [Trove](http://trove.nla.gov.au/).
# 
# Instead of working your way through page after page of search results using Trove's web interface, the newspaper harvester will save the results of your search to a CSV (spreadsheet) file which you can then filter, sort, or analyse.
# 
# Even better, the harvester can save the full OCRd (and possibly corrected) text of each article to an individual file. You could, for example, collect the text of thousands of articles on a particular topic and then feed them to a text analysis engine like [Voyant](http://voyant-tools.org/) to look for patterns in the language.
# 
# If you'd like to install and run the TroveHarvester on your local system, see [the installation instructions](http://timsherratt.org/digital-heritage-handbook/docs/trove-newspaper-harvester/).
# 
# If you'd like to try before you buy, you can run a fully-functional version of the TroveHarvester from this very notebook!

# ## Getting started

# If you were running TroveHarvester on your local system, you could access the basic help information by entering this on the command line:
# 
# ```bash
# troveharvester -h
# ```
# 
# In this notebook you need to use the magic `%run` command to call the TroveHarvester script. Click on the cell below and hit `Shift+Enter` to view the TroveHarvester's basic options.

# In[ ]:

get_ipython().run_line_magic('run', '-m troveharvester -h')


# Before we go any further you should make sure you have a Trove API key. For non-commercial projects, you just fill out a simple form and your API key is generated instantly. Follow the instructions in the Trove Help to [obtain your own Trove API Key](http://help.nla.gov.au/trove/building-with-trove/api).
# 
# Once you've created a key, you can access it at any time on the 'For developers' tab of your Trove user profile.
# 
# Copy your API key now, and paste it in the cell below, between the quotes. Then hit `Shift+Enter` to save your key as a variable called `api_key`.

# In[3]:

api_key = 'ju3rgk0jp354ikmh'
print('Your API key is: {}'.format(api_key))
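
# If you'd like to check that your key is working before you start harvesting, you can send a simple request straight to the Trove API. This optional sketch isn't part of the TroveHarvester itself – it assumes the version 2 API endpoint and the `requests` library:

# In[ ]:

import requests

# Ask the API for a single newspaper result -- a successful response means the key is valid
params = {
    'q': 'wragge',        # any simple search term will do
    'zone': 'newspaper',  # search the digitised newspapers
    'encoding': 'json',
    'n': 1,               # one result is enough for a test
    'key': api_key
}
response = requests.get('https://api.trove.nla.gov.au/v2/result', params=params)
if response.ok:
    print('Your API key works!')
else:
    print('Something went wrong: {}'.format(response.status_code))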

# ## What do you want to harvest?

# The TroveHarvester translates queries from the Trove web interface into something that the API can understand. So all you need to do is construct your query using the web interface. Once you're happy with the results you're getting, just copy the url.
# 
# It's important to note that there are currently a few differences between the indexes used by the web interface and the API, so some queries won't translate directly. For example, the `state` facet doesn't exist in the API index. If you use the `state` facet, the TroveHarvester will try to replace it with a list of newspapers from that state, but there are now so many newspaper titles that this could fail. Similarly, the API index won't recognise `has:corrections`. However, most queries should translate without any problems.
# 
# Once you've constructed your query and copied the url, paste it between the quotes in the cell below and hit `Shift+Enter` to save it as a variable.

# In[4]:

query = 'https://trove.nla.gov.au/newspaper/result?q=cyclone+wragge&l-category=Article&l-decade=191'
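
# If you're curious about how the harvester sees your query, you can unpack the url's parameters with Python's standard library. This is purely illustrative – the harvester does the translation for you:

# In[ ]:

from urllib.parse import urlparse, parse_qs

# Split the query url into its individual search parameters
for param, values in parse_qs(urlparse(query).query).items():
    print('{}: {}'.format(param, values))

# For the example query this prints the keywords (q), and the category and
# decade facets (l-category, l-decade) that get translated into API parameters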

# ## Running the harvest

# By default the harvester will save all the article metadata to a CSV formatted file called `results.csv`. If you'd like to save the full OCRd text of all the articles, just add the `--text` parameter. You can also save PDFs of all the articles by adding the `--pdf` parameter, but be warned that this will slow down your harvest considerably and can consume large amounts of disk space. So use with care!
# 
# Now we're ready to start the harvest! Just run the code in the cell below. You can delete the `--text` parameter if you're not interested in saving the full text of every article.

# In[ ]:

get_ipython().run_line_magic('run', '-m troveharvester start $query $api_key --text')


# You'll know the harvest is finished when the asterisk in the square brackets of the cell above turns into a number.
# 
# If the harvest stops before it's finished, you can restart it by running the cell below.

# In[ ]:

get_ipython().run_line_magic('run', '-m troveharvester restart')


# If you want to check the details of a finished harvest, just run the cell below.

# In[ ]:

get_ipython().run_line_magic('run', '-m troveharvester report')


# ## Harvest results

# When you start a new harvest, the harvester looks for a directory called [data](data). Within this directory it creates another directory for your harvest. The name of this directory will be in the form of a unix timestamp – a very large number that represents the number of seconds since 1 January 1970. This means the directory with the largest number will contain the most recent harvest.
# 
# The harvester saves your results inside this directory. There will be at least two files created for each harvest:
# 
# * `results.csv` – a text file containing the details of all harvested articles
# * `metadata.json` – a configuration file which stores all the details of the harvest
# 
# If you've asked for PDFs or text files, there will be additional directories containing those files.
# 
# The `results.csv` file is a plain text CSV (Comma Separated Values) file. You can open it with any spreadsheet program. The details recorded for each article are:
# 
# * `article_id` – a unique identifier for the article
# * `title` – the title of the article
# * `newspaper_id` – a unique identifier for the newspaper (this can be used to retrieve more information or build a link to the web interface)
# * `newspaper_title` – the name of the newspaper
# * `page` – page number, which might also indicate that the page is part of a supplement or special section
# * `date` – in ISO format, YYYY-MM-DD
# * `category` – one of 'Article', 'Advertising', 'Detailed lists, results, guides', 'Family Notices', or 'Literature'
# * `words` – number of words in the article
# * `illustrated` – is it illustrated (values are y or n)
# * `corrections` – number of text corrections
# * `url` – the persistent url for the article
# * `page_url` – the persistent url of the page on which the article is published
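
# A quick way to get a feel for these columns is to load `results.csv` with Pandas. This is just a minimal sketch – it assumes Pandas is installed and that at least one harvest has finished:

# In[ ]:

import os
import pandas as pd

# Find the most recent harvest folder (the one with the largest timestamp)
harvests = sorted([d for d in os.listdir('data') if os.path.isdir(os.path.join('data', d))])
df = pd.read_csv(os.path.join('data', harvests[-1], 'results.csv'))

# Peek at a few articles, then count the articles from each newspaper
print(df[['date', 'newspaper_title', 'title', 'words']].head())
print(df['newspaper_title'].value_counts().head())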

# Files containing the OCRd text of the articles will be saved in a directory named `text`. These are just plain text files, stripped of any HTML. The filenames include some basic metadata – the date of the article, the id number of the newspaper, and the id number of the article. So, for example, the filename `19460104-1002-206680758.txt` tells you:
# 
# * `19460104` – the article was published on 4 January 1946 (YYYYMMDD)
# * `1002` – the article was published in [*The Tribune*](https://trove.nla.gov.au/newspaper/title/1002)
# * `206680758` – the [article's unique identifier](http://nla.gov.au/nla.news-article206680758)
# 
# As you can see, you can use the newspaper and article ids to create direct links into Trove:
# 
# * to a newspaper – `https://trove.nla.gov.au/newspaper/title/[newspaper id]`
# * to an article – `http://nla.gov.au/nla.news-article[article id]`
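
# Because the filenames follow this fixed pattern, you can rebuild these links in code. Here's a small sketch, using the example filename from above:

# In[ ]:

# Split a text filename into its parts and rebuild the Trove links
filename = '19460104-1002-206680758.txt'
date, newspaper_id, article_id = filename.replace('.txt', '').split('-')

print('Published: {}-{}-{}'.format(date[:4], date[4:6], date[6:]))
print('Newspaper: https://trove.nla.gov.au/newspaper/title/{}'.format(newspaper_id))
print('Article: http://nla.gov.au/nla.news-article{}'.format(article_id))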

# Browse the contents of [the data directory](data) to find the results of your harvest.

# ## Download your data

# If you're using this notebook through the MyBinder service (it'll say `mybinder` in the url) make sure you download your data once the harvest is finished as it will not be preserved!
# 
# Once your harvest is complete, you'll probably want to download the results. The easiest way to do this is to zip up the results folder. Run the following cell to zip up the folder containing all the data from your most recent harvest.

# In[6]:

import shutil
import os

# List all the harvest folders and sort by date
harvests = sorted([d for d in os.listdir('data') if os.path.isdir(os.path.join('data', d))])
# Get the most recent
timestamp = harvests[-1]
# Zip up the folder
shutil.make_archive(os.path.join('data', timestamp), 'zip', os.path.join('data', timestamp))


# Once your zip file has been created you can find it in the [data directory](data). Or just run the cell below to create a handy download link.

# In[ ]:

from IPython.core.display import display, HTML

# Build an HTML link to the zip file created above
display(HTML('<a href="data/{}.zip" download>Download your harvest</a>'.format(timestamp)))


# ## Explore your data

# Have a look at the [Exploring your TroveHarvester data](Exploring-your-TroveHarvester-data.ipynb) notebook for some ideas.

# ----
# 
# Created by [Tim Sherratt](https://timsherratt.org) ([@wragge](https://twitter.com/wragge)) as part of the [OzGLAM workbench](https://github.com/wragge/ozglam-workbench).
# 
# If you think this project is worthwhile you can [support it on Patreon](https://www.patreon.com/timsherratt).

# In[ ]: