Pinboard is a social bookmarking site where people share links to content and tag them with words that describe it. These tags are free-form, and each user decides which ones to use.
Pinboard has a nice API for interacting with your own bookmarks, but not for getting all public bookmarks for a tag. Pinboard does make every tag page available as RSS, e.g. https://feeds.pinboard.in/rss/t:covid-19, but unfortunately the feed doesn't allow paging back in time.
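Still, the feed is handy for grabbing the most recent bookmarks. Here's a minimal sketch using only the standard library, run against a simplified, made-up RSS snippet rather than the live feed (the real feed's markup and namespaces may differ):

```python
import xml.etree.ElementTree as ET

# a made-up snippet standing in for the live feed's XML
rss = """
<rss version="2.0">
  <channel>
    <item>
      <title>US Health Weather Map by Kinsa</title>
      <link>https://healthweather.us/</link>
    </item>
  </channel>
</rss>
"""

root = ET.fromstring(rss)
items = [
    {'title': item.findtext('title'), 'url': item.findtext('link')}
    for item in root.iter('item')
]
print(items)
```

But since the feed only covers recent bookmarks, it won't get us the full history for a tag.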
So we're going to have to scrape the pages. Fortunately this isn't too difficult with the requests_html module, because Pinboard has done such a nice job of using semantic HTML.
```python
import time

import requests_html
import dateutil.parser


def pinboard(hashtag):
    """Yield all public bookmarks for a tag, paging back in time."""
    http = requests_html.HTMLSession()
    pinboard_url = 'https://pinboard.in/t:{}'.format(hashtag)
    while True:
        resp = http.get(pinboard_url)
        for b in resp.html.find('.bookmark'):
            a = b.find('.bookmark_title', first=True)
            yield {
                'url': a.attrs['href'],
                'title': a.text,
                'created': dateutil.parser.parse(b.find('.when', first=True).attrs['title'])
            }
        # the "earlier" link at the bottom of the page points to older bookmarks
        a = resp.html.find('#top_earlier', first=True)
        if not a:
            break
        next_url = 'https://pinboard.in' + a.attrs['href']
        if pinboard_url == next_url:
            break
        # be polite to pinboard's servers
        time.sleep(1)
        pinboard_url = next_url
```
```python
next(pinboard('covid-19'))
```

```
{'url': 'https://healthweather.us/', 'title': 'US Health Weather Map by Kinsa', 'created': datetime.datetime(2020, 3, 25, 10, 0, 11)}
```
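Because pinboard() is a generator it only fetches pages as you consume it, so itertools.islice is a handy way to take a fixed number of bookmarks without paging through the whole history. Since the real generator needs network access, the sketch below uses a made-up stand-in generator with the same shape of output:

```python
import itertools


def fake_pinboard(hashtag):
    # stand-in for the network-bound pinboard() generator above
    for i in itertools.count():
        yield {
            'url': 'https://example.com/{}/{}'.format(hashtag, i),
            'title': 'Bookmark {}'.format(i),
        }


# take just the first five bookmarks; the generator is never exhausted
first_five = list(itertools.islice(fake_pinboard('covid-19'), 5))
print(len(first_five))  # 5
```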
Now we can write all the results to a CSV file. But let's look for a few variants: covid-19, covid_19 and covid19. To avoid repeating the same URLs across tags, we can keep track of the ones we've seen and only write each one once.
```python
import csv

urls_seen = set()
with open('data/pinboard.csv', 'w') as fh:
    out = csv.DictWriter(fh, fieldnames=['url', 'created', 'title'])
    out.writeheader()
    for hashtag in ['covid-19', 'covid_19', 'covid19']:
        for bookmark in pinboard(hashtag):
            if bookmark['url'] not in urls_seen:
                out.writerow(bookmark)
                urls_seen.add(bookmark['url'])
```
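To sanity-check the dedup logic without hitting the network, the same loop can be dry-run against a few made-up rows, writing to an in-memory buffer instead of a file:

```python
import csv
import io

# made-up rows, with one duplicate URL
rows = [
    {'url': 'https://example.com/a', 'created': '2020-03-25 10:00:00', 'title': 'A'},
    {'url': 'https://example.com/a', 'created': '2020-03-25 11:00:00', 'title': 'A again'},
    {'url': 'https://example.com/b', 'created': '2020-03-25 12:00:00', 'title': 'B'},
]

urls_seen = set()
buf = io.StringIO()
out = csv.DictWriter(buf, fieldnames=['url', 'created', 'title'])
out.writeheader()
for row in rows:
    if row['url'] not in urls_seen:
        out.writerow(row)
        urls_seen.add(row['url'])

print(buf.getvalue())
```

Only the first occurrence of each URL makes it into the output, so the buffer ends up with a header line and two rows.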
```python
import pandas

# prevent dataframe columns from being truncated
pandas.set_option('display.max_columns', None)
pandas.set_option('display.width', None)
pandas.set_option('display.max_colwidth', None)

df = pandas.read_csv('data/pinboard.csv')
df
```
| | url | created | title |
|---|---|---|---|
| 0 | https://www.seriouseats.com/2020/03/food-safety-and-coronavirus-a-comprehensive-guide.html | 2020-03-25 10:02:34 | Food Safety and Coronavirus: A Comprehensive Guide \| Serious Eats |
| 1 | https://healthweather.us/ | 2020-03-25 10:00:11 | US Health Weather Map by Kinsa |
| 2 | https://loinc.org/sars-coronavirus-2/ | 2020-03-25 09:35:57 | SARS Coronavirus 2 – LOINC |
| 3 | https://twitter.com/katemclennan1/status/1242656904913932290?s=09 | 2020-03-25 09:22:56 | Kate McLennan on Twitter: "We were asked to deliver a PSA from the Australian govermnent… " |
| 4 | https://valor.globo.com/empresas/noticia/2020/03/25/para-dono-da-innova-crise-deixara-mais-falidos-que-falecidos.ghtml | 2020-03-25 09:20:22 | Para dono da Innova, crise deixará mais falidos que falecidos \| Empresas \| Valor Econômico |
| ... | ... | ... | ... |
| 810 | https://www.youtube.com/watch?v=mwrMtJ3DYXg&feature=youtu.be | 2020-03-23 01:23:16 | How to cope when the world is canceled: 6 critical skills - YouTube |
| 811 | https://hunch.net/?p=13762539 | 2020-03-23 01:04:58 | What is the most effective policy response to the new coronavirus pandemic? – Machine Learning (Theory) |
| 812 | https://docs.google.com/spreadsheets/d/1sJM9dFwbSluv9JsoYA9o5EP7TOcCPf83SO_p23hCyCc/edit#gid=0 | 2020-03-23 01:04:47 | Medical Mask Pattern Comparison-comment only - Google Sheets |
| 813 | https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data | 2020-03-23 01:04:36 | COVID-19/csse_covid_19_data at master · CSSEGISandData/COVID-19 |
| 814 | https://www.instructables.com/id/AB-Mask-for-a-Nurse-by-a-Nurse/ | 2020-03-23 01:01:43 | A.B. Mask - for a Nurse by a Nurse : 15 Steps (with Pictures) - Instructables |
815 rows × 3 columns
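One thing to note: read_csv loads the created column as plain strings. If you want real datetimes (e.g. for sorting or resampling by day), pandas can parse the column at read time via parse_dates. A quick sketch on a synthetic two-row CSV rather than the real file:

```python
import io

import pandas

# a tiny synthetic CSV in the same shape as data/pinboard.csv
csv_text = (
    "url,created,title\n"
    "https://example.com/a,2020-03-25 10:00:00,A\n"
    "https://example.com/b,2020-03-23 01:04:58,B\n"
)
df = pandas.read_csv(io.StringIO(csv_text), parse_dates=['created'])
print(df['created'].dtype)
```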
Just out of curiosity, is there currently any overlap with the IIPC seeds?
```python
iipc = pandas.read_csv('data/iipc.csv')
overlap = set(iipc.url).intersection(set(df.url))
overlap
```

```
{'https://gisanddata.maps.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6',
 'https://github.com/CSSEGISandData/COVID-19',
 'https://www.brookings.edu/research/the-global-macroeconomic-impacts-of-covid-19-seven-scenarios/',
 'https://www.nytimes.com/interactive/2020/world/coronavirus-maps.html'}
```
Nice, there are a few!