#!/usr/bin/env python
# coding: utf-8

# # Finding editorial cartoons in the Bulletin
# 
# In another notebook I showed how you could [download all the front pages of a digitised periodical](Get-page-images-from-a-Trove-journal.ipynb) as images. If you harvest all the front pages of *The Bulletin* you'll find a number of full page editorial cartoons under *The Bulletin*'s masthead. But you'll also find that many of the front pages are advertising wrap arounds. The full page editorial cartoons were a consistent feature of *The Bulletin* for many decades, but they moved around between pages one and eleven. That makes them hard to find.
# 
# I wanted to try and assemble a collection of all the editorial cartoons, but how? At one stage I was thinking about training an image classifier to identify the cartoons. The problem with that was I'd have to download lots of pages (they're about 20mb each), most of which I'd just discard. That didn't seem like a very efficient way of proceeding, so I started thinking about what I could do with the available metadata.
# 
# *The Bulletin* is indexed at article level in Trove, so I thought I might be able to construct a search that would find the cartoons. I'd noticed that the cartoons tended to have the title 'The Bulletin', so a search for `title:"The Bulletin"` should find them. The problem was, of course, that there were lots of false positives. I tried adding extra words that often appeared in the masthead to the search, but the results were unreliable. At that stage I was ready to give up.
# 
# Then I realised I was going about it backwards. If my aim was to get an editorial cartoon for each issue, I should start with a list of issues and process them individually to find the cartoon. I'd already worked out how to get a complete list of issues from the web interface, and found you could [extract useful metadata](Get-text-from-a-Trove-journal.ipynb) from each issue's landing page. Putting these approaches together gave me a way forward. The basic methodology was this.
# 
# * I manually selected all the cartoons from my harvest of front pages, as downloading them again would just make things slower
# * I harvested all the issue metadata
# * Looping through the list of issues I checked to see if I already had a cartoon from each issue, if not...
# * I grabbed the metadata from the issue landing page – for issues with OCRd text this includes a list of the articles and the pages that make up the issue
# * I looked through the list of articles to see if there was one with the *exact* title 'The Bulletin'
# * I then found the page on which the article was published
# * If the page was odd (almost all the cartoons were on odd pages), I downloaded the page
# 
# This did result in some false positives, but they were easy enough to remove manually. At the end I was able to see when the editorial cartoons started appearing, and when they stopped having a page to themselves. This gave me a range of 1886 to 1952 to focus on. Looking within that range it seemed that there were only about 400 issues (out of 3452) that I had no cartoon for. The results were much better than I'd hoped!
# 
# I then repeated this process several times, changing the title string and looping through the missing issues. I gradually widened the title match from exactly 'The Bulletin', to a string containing 'Bulletin', to a case-insensitive match etc. I also found and harvested a small group of cartoons from the 1940s that were published on an even numbered page! After a few repeats I had about 100 issues left without cartoons. These were mostly issues that didn't have OCRd text, so there were no articles to find. I thought I might need to process these manually, but then I though why not just go though the odd numbers from one to thirteen, harvesting these pages from the 'missing' issues and manually discarding the misses. This was easy and effective. Before too long I had a cartoon for every issue between 4 September 1886 and 17 September 1952. Yay!
# 
# If you have a look at the results you'll see that I ended up with more cartoons than issues. This is for two reasons:
# 
# * Some badly damaged issues were digitised twice, once from the hard copy and once from microfilm. Both versions are in Trove, so I thought I might as well keep both cartoons in my dataset.
# * The cartoons stopped appearing with the masthead in the 1920s, so they became harder to uniquely identify. In the 1940s, there were sometimes *two* full page cartoons by different artists commenting on current affairs. Where my harvesting found two such cartoons, I kept them both. However, because my aim was to find at least one cartoon from each issue, there are going to be other full page political cartoons that aren't included in my collection.
# 
# Because I kept refining and adjusting the code as I went through, I'm not sure it will be very useful. But it's all here just in case.
# 

# In[1]:


import glob
import io
import json
import os
import re
import time
import zipfile
from pathlib import Path

import pandas as pd
import requests_cache
from bs4 import BeautifulSoup
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry
from sqlite_utils import Database
from tqdm.auto import tqdm

s = requests_cache.CachedSession()
retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])
s.mount("https://", HTTPAdapter(max_retries=retries))
s.mount("http://", HTTPAdapter(max_retries=retries))


# In[ ]:


def harvest_metadata(obj_id):
    """
    This calls an internal API from a journal landing page to extract a list of available issues.
    """
    start_url = "https://nla.gov.au/{}/browse?startIdx={}&rows=20&op=c"
    # The initial startIdx value
    start = 0
    # Number of results per page
    n = 20
    issues = []
    with tqdm(desc="Issues", leave=False) as pbar:
        # If there aren't 20 results on the page then we've reached the end, so continue harvesting until that happens.
        while n == 20:
            # Get the browse page
            response = s.get(start_url.format(obj_id, start), timeout=60)
            # Beautifulsoup turns the HTML into an easily navigable structure
            soup = BeautifulSoup(response.text, "lxml")
            # Find all the divs containing issue details and loop through them
            details = soup.find_all(class_="l-item-info")
            for detail in details:
                issue = {}
                title = detail.find("h3")
                if title:
                    issue["title"] = title.text
                    issue["id"] = title.parent["href"].strip("/")
                else:
                    issue["title"] = "No title"
                    issue["id"] = detail.find("a")["href"].strip("/")
                try:
                    # Get the issue details
                    issue["details"] = detail.find(
                        class_="obj-reference content"
                    ).string.strip()
                except (AttributeError, IndexError):
                    issue["details"] = "issue"
                # Get the number of pages
                try:
                    issue["pages"] = int(
                        re.search(
                            r"^(\d+)",
                            detail.find("a", attrs={"data-pid": issue["id"]}).text,
                            flags=re.MULTILINE,
                        ).group(1)
                    )
                except AttributeError:
                    issue["pages"] = 0
                issues.append(issue)
                # print(issue)
                if not response.from_cache:
                    time.sleep(0.5)
            # Increment the startIdx
            start += n
            # Set n to the number of results on the current page
            n = len(details)
            pbar.update(n)
    return issues


def get_page_number(work_data, article):
    page_number = 0
    page_id = article["existson"][0]["page"]
    for index, page in enumerate(work_data["children"]["page"]):
        if page["pid"] == page_id:
            page_number = index
            break
    # print(page_number)
    return page_number


def get_articles(work_data):
    articles = []
    for article in work_data["children"]["article"]:
        # if re.search(r'^The Bulletin\.*', article['title'], flags=re.IGNORECASE):
        if re.search(r"Bulletin", article["title"], flags=re.IGNORECASE):
            # if re.search(r'^No title\.*$', article['title']):
            articles.append(article)
    return articles


def get_work_data(url):
    """
    Extract work data in a JSON string from the work's HTML page.
    """
    response = s.get(url, allow_redirects=True, timeout=60)
    # print(response.url)
    try:
        work_data = re.search(
            r"var work = JSON\.parse\(JSON\.stringify\((\{.*\})", response.text
        ).group(1)
    except AttributeError:
        work_data = "{}"
    return json.loads(work_data)


def download_page(issue_id, page_number, image_dir):
    if not os.path.exists("{}/{}-{}.jpg".format(image_dir, issue_id, page_number + 1)):
        url = "https://nla.gov.au/{0}/download?downloadOption=zip&firstPage={1}&lastPage={1}".format(
            issue_id, page_number
        )
        # Get the file
        r = s.get(url, timeout=180)
        # The image is in a zip, so we need to extract the contents into the output directory
        try:
            z = zipfile.ZipFile(io.BytesIO(r.content))
            z.extractall(image_dir)
        except zipfile.BadZipFile:
            print("{}-{}".format(issue_id, page_number))


def download_issue_title_pages(issue_id, image_dir):
    url = "https://nla.gov.au/{}".format(issue_id)
    work_data = get_work_data(url)
    articles = get_articles(work_data)
    for article in articles:
        page_number = get_page_number(work_data, article)
        if page_number and (page_number % 2 == 0):
            download_page(issue_id, page_number, image_dir)
            time.sleep(1)


def get_downloaded_issues(image_dir):
    issues = []
    images = [i for i in os.listdir(image_dir) if i[-4:] == ".jpg"]
    for image in images:
        issue_id = re.search(r"(nla\.obj-\d+)", image).group(1)
        issues.append(issue_id)
    return issues


def download_all_title_pages(issues, image_dir):
    dl_issues = get_downloaded_issues(image_dir)
    for issue in tqdm(issues):
        if issue["issue_id"] not in dl_issues:
            download_issue_title_pages(issue["issue_id"], image_dir)
            dl_issues.append(issue["issue_id"])


# In[ ]:


# Get current list of issues
issues = harvest_metadata("nla.obj-68375465")


# In[ ]:


# First pass
download_all_title_pages(
    issues, "/Volumes/bigdata/mydata/Trove-text/Bulletin/covers/images"
)


# In[ ]:


# Change filenames of downloaded images to include date and issue number
# This makes it easier to sort and group them
df = pd.read_csv("journals/bulletin/bulletin_issues.csv")
all_issues = df.to_dict("records")

# Add dates / issue nums to file titles
for record in all_issues:
    pages = glob.glob(
        "/Volumes/bigdata/mydata/Trove-text/Bulletin/covers/images/{}-*.jpg".format(
            record["id"]
        )
    )
    for page in pages:
        page_number = re.search(r"nla\.obj-\d+-(\d+)\.jpg", page).group(1)
        date = record["date"].replace("-", "")
        os.rename(
            page,
            "/Volumes/bigdata/mydata/Trove-text/Bulletin/covers/images/{}-{}-{}-{}.jpg".format(
                date, record["number"], record["id"], page_number
            ),
        )


# In[ ]:


# Only select issues between 1886 and 1952
df = pd.read_csv(
    "journals/bulletin/bulletin_issues.csv",
    dtype={"number": "Int64"},
    keep_default_na=False,
)
df = df.rename(index=str, columns={"id": "issue_id"})
filtered = df.loc[(df["number"] >= 344) & (df["number"] <= 3788)]
filtered_issues = filtered.to_dict("records")


# In[ ]:


# After a round of harvesting, check to see what's missing
missing = []
multiple = []
for issue in tqdm(filtered_issues):
    pages = glob.glob(
        "/Volumes/bigdata/mydata/Trove-text/Bulletin/covers/images/*{}-*.jpg".format(
            issue["issue_id"]
        )
    )
    if len(pages) == 0:
        missing.append(issue)
    elif len(pages) > 1:
        multiple.append(issue)


# In[ ]:


# Adjust the matching settings and run again with just the missing issues
# Rinse and repeat
download_all_title_pages(
    missing, "/Volumes/bigdata/mydata/Trove-text/Bulletin/covers/missing"
)


# In[ ]:


# How many are mssing now?
len(missing)


# In[ ]:


# How many images have been harvested
dl = get_downloaded_issues("/Volumes/bigdata/mydata/Trove-text/Bulletin/covers/images")


# In[ ]:


len(dl)


# In[ ]:


len(multiple)


# In[ ]:


# Create links for missing issues so I can look at them on Trove
for m in missing:
    print("https://nla.gov.au/{}".format(m["issue_id"]))


# In[ ]:


# Download odd pages (they start at zero, so 0=1 etc)
for m in missing:
    download_page(
        m["issue_id"], 14, "/Volumes/bigdata/mydata/Trove-text/Bulletin/covers/missing"
    )


# ## Compile metadata
# 
# Once all the images have been saved, we can compile information about them and create a database for researchers to explore.

# In[600]:


def get_metadata(id):
    """
    Extract work data in a JSON string from the work's HTML page.
    """
    if not id.startswith("http"):
        id = "https://nla.gov.au/" + id
    response = s.get(id)
    try:
        work_data = re.search(
            r"var work = JSON\.parse\(JSON\.stringify\((\{.*\})", response.text
        ).group(1)
    except AttributeError:
        work_data = "{}"
    if not response.from_cache:
        time.sleep(0.2)
    return json.loads(work_data)


def download_text(issue_id, page):
    """
    Download the OCRd text from a page.
    """
    page_index = int(page) - 1
    url = f"https://trove.nla.gov.au/{issue_id}/download?downloadOption=ocr&firstPage={page_index}&lastPage={page_index}"
    response = s.get(url)
    response.encoding = "utf-8"
    if not response.from_cache:
        time.sleep(0.2)
    return " ".join(response.text.splitlines())


# Loop through harvested images, extracting metadata and compiling dataset
images = []
for image in tqdm(
    Path("/home/tim/mydata/Trove-text/Bulletin/covers/bulletin_cartoons").glob("*.jpg")
):
    parts = image.name[0:-4].split("-")
    date = f"{parts[0][:4]}-{parts[0][4:6]}-{parts[0][6:8]}"
    issue = parts[1]
    issue_id = f"{parts[2]}-{parts[3]}"
    page = parts[4]
    metadata = get_metadata(issue_id)
    pages = metadata["children"]["page"]
    page_id = pages[int(page) - 1]["pid"]
    record = {
        "id": page_id,
        "date": date,
        "issue_number": issue,
        "issue_id": issue_id,
        "page": page,
        "url": f"https://nla.gov.au/{page_id}",
        "download_image": f"https://nla.gov.au/{page_id}/image",
        "text": download_text(issue_id, page),
    }
    images.append(record)


# Save the metadata as a CSV file.

# In[601]:


df = pd.DataFrame(images)
df.to_csv("bulletin-editorial-cartoons.csv", index=False)


# Add the metadata to a SQLite db for use with Datasette-Lite.

# In[602]:


df.insert(
    0,
    "thumbnail",
    df["url"].apply(lambda x: f'{{"img_src": "{x}-t"}}' if not pd.isnull(x) else ""),
)

db = Database("bulletin-editorial-cartoons.db", recreate=True)
db["cartoons"].insert_all(df.to_dict(orient="records"), pk="id")
db["cartoons"].enable_fts(["text"])


# ----
# 
# Created by [Tim Sherratt](https://timsherratt.org/) for the [GLAM Workbench](https://glam-workbench.github.io/).