This notebook explores the seeds being crawled in the Novel Coronavirus (COVID-19) Archive-It collection. It uses the Archive-It Partner API, which does not seem to require a key for public collections (yay). More context for this collecting effort can be found in this IIPC blog post.
First let's import some things we're going to need later. It's useful to do them all here at the beginning in case you want to skip parts of the data collection and use the data that is already present in the repository.
import csv
import altair
import pandas
import wayback
import datetime
import requests
First let's download the seeds in the collection and save them as a CSV. If you want to use the CSV that's already here you can move on to Section 2. We're going to write out the data to a file called `iipc.csv`. You can see the type of data that is returned by looking at this API response. The Archive-It Partner API has a route for returning seeds for a given collection, selected with the `collection` parameter. We can use the `limit` and `offset` parameters to walk through the results page by page without fetching all of them at once.
url = 'https://partner.archive-it.org/api/seed'
params = {
    "collection": 13529,
    "limit": 100
}
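Since the pagination is just arithmetic on `offset`, it can be handy to compute the query for any page up front. A small sketch (the `page_params` helper is my own invention, not part of the Partner API):

```python
def page_params(collection_id, page, page_size=100):
    # Build the query for one page of results: each page advances the
    # offset by the page size, and the API returns an empty list once
    # the offset passes the last seed.
    return {
        "collection": collection_id,
        "limit": page_size,
        "offset": page * page_size,
    }

page_params(13529, 3)  # {'collection': 13529, 'limit': 100, 'offset': 300}
```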
Now we can create a loop that keeps fetching results and incrementing the offset until there are no more seeds. We could have used the CSV output, but it is useful to normalize some of the structured metadata. This will likely take a few minutes to run.
out = csv.writer(open('data/iipc.csv', 'w'))
out.writerow([
    "id",
    "url",
    "creator",
    "created",
    "updated",
    "crawl_definition",
    "title",
    "description",
    "language",
    "tld"
])
def first_val(meta, name):
    return meta[name][0]["value"] if name in meta else None
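To see what `first_val` does, here it is again run against a made-up metadata payload in the shape the Partner API returns, where each field maps to a list of `{"value": ...}` dicts (the example values are invented):

```python
# Sample metadata shaped like the Partner API's response; this payload
# is made up for illustration.
meta = {
    "Title": [{"value": "Example title"}],
    "Language": [{"value": "English"}],
}

def first_val(meta, name):
    return meta[name][0]["value"] if name in meta else None

first_val(meta, "Title")        # 'Example title'
first_val(meta, "Description")  # None (field absent)
```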
params['offset'] = 0
while True:
    resp = requests.get(url, params=params)
    seeds = resp.json()
    if len(seeds) == 0:
        break
    for seed in seeds:
        meta = seed["metadata"]
        out.writerow([
            seed["id"],
            seed["url"],
            seed["created_by"],
            seed["created_date"],
            seed["last_updated_date"],
            seed["crawl_definition"],
            first_val(meta, "Title"),
            first_val(meta, "Description"),
            first_val(meta, "Language"),
            first_val(meta, "Top-Level Domain")
        ])
    params['offset'] += 100
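The paging loop above can also be factored into a reusable generator. A sketch, where the hypothetical `fetch` callable stands in for the `requests.get` call so the logic can be tried without hitting the API:

```python
def iter_pages(fetch, page_size=100):
    # fetch(offset) returns one page of seeds (a list); iteration stops
    # at the first empty page, mirroring the while loop above.
    offset = 0
    while True:
        page = fetch(offset)
        if not page:
            return
        yield from page
        offset += len(page)

# Try it against a fake "API" backed by a plain list.
data = list(range(7))
list(iter_pages(lambda offset: data[offset:offset + 3], page_size=3))
# [0, 1, 2, 3, 4, 5, 6]
```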
# Sanity check: page through the seeds again and confirm that a known
# URL is present in the collection.
url = 'https://partner.archive-it.org/api/seed'
params = {
    "collection": 13529,
    "offset": 0,
    "limit": 100
}

while True:
    resp = requests.get(url, params=params)
    seeds = resp.json()
    if len(seeds) == 0:
        break
    for seed in seeds:
        if seed['url'] == 'https://www.health.govt.nz/our-work/diseases-and-conditions/covid-19-novel-coronavirus/':
            print(seed['url'])
    params['offset'] += len(seeds)
So now you should hopefully see an updated `iipc.csv`! First let's load our `iipc.csv` into a Pandas DataFrame where we can more easily manipulate it.
seeds = pandas.read_csv('data/iipc.csv', parse_dates=["created", "updated"])
seeds
id | url | creator | created | updated | crawl_definition | title | description | language | tld | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 2147692 | http://coronavirus.fr/ | alext | 2020-02-21 03:43:18.662353+00:00 | 2020-03-16 19:53:45.860949+00:00 | 31104294373 | Epicorem. Ecoépidémiologie | Medical/Scientific aspects | French | .fr |
1 | 2147693 | http://english.whiov.cas.cn/ | alext | 2020-02-21 03:43:18.706571+00:00 | 2020-03-16 19:52:28.575749+00:00 | 31104294373 | Wuhan Institute of Virulogy, official page in ... | Health Organisation | English | .cn |
2 | 2147694 | http://www.china-embassy.or.jp/chn/ | alext | 2020-02-21 03:43:18.739126+00:00 | 2020-03-16 19:53:03.086729+00:00 | 31104294373 | 中华人民共和国驻日本大使馆 | Embassy | Chinese | .jp |
3 | 2147695 | http://www.china-embassy.or.jp/jpn/ | alext | 2020-02-21 03:43:18.766308+00:00 | 2020-03-16 19:54:02.280945+00:00 | 31104294373 | 中華人民共和国駐日本国大使館 | Embassy | Japanese | .jp |
4 | 2147696 | https://cadenaser.com/tag/ncov/a/ | alext | 2020-02-21 03:43:18.791716+00:00 | 2020-03-16 19:54:19.694418+00:00 | 31104294373 | Coronavirus de Wuhan | Cadena Ser | Spanish | .com |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2794 | 2173031 | https://www.suntrust.com/resource-center/comme... | nicolab | 2020-03-26 15:41:06.629121+00:00 | 2020-03-26 15:41:06.629220+00:00 | 31104300763 | NaN | NaN | NaN | NaN |
2795 | 2148539 | https://www.eluniversal.com/economia/60496/cor... | alext | 2020-02-21 04:11:12.713039+00:00 | 2020-03-16 19:53:55.500654+00:00 | 31104294373 | Coronavirus afecta economía mundial y rutas co... | political aspects,economic aspects, diplomacy | Spanish | .com |
2796 | 2149377 | https://ue.delegfrance.org/coronavirus-activat... | alext | 2020-02-21 04:28:15.941569+00:00 | 2020-03-16 19:53:43.544395+00:00 | 31104294373 | Délégation France UE. Coronavirus : Activation... | Institutional website | French | .org |
2797 | 2149468 | https://www.healthdirect.gov.au/coronavirus | alext | 2020-02-21 04:29:30.095448+00:00 | 2020-03-16 19:52:16.008948+00:00 | 31104297068 | Coronavirus disease (COVID-19) | Government health information | English | .au |
2798 | 2149123 | https://www.youtube.com/watch?v=N6BAkWzrses | alext | 2020-02-21 04:18:08.093558+00:00 | 2020-03-16 19:53:35.401711+00:00 | 31104294373 | CORONAVIRUS PLANO WARNING - YouTube | video, containment efforts | Portuguese | .com |
2799 rows × 10 columns
We can sort them by created time in ascending order, and save them again. This might make it easier to compare them over time with `git diff`.
seeds = seeds.sort_values('created')
seeds.to_csv('data/iipc.csv')
seeds.head(10)
id | url | creator | created | updated | crawl_definition | title | description | language | tld | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 2147692 | http://coronavirus.fr/ | alext | 2020-02-21 03:43:18.662353+00:00 | 2020-03-16 19:53:45.860949+00:00 | 31104294373 | Epicorem. Ecoépidémiologie | Medical/Scientific aspects | French | .fr |
636 | 2147692 | http://coronavirus.fr/ | alext | 2020-02-21 03:43:18.662353+00:00 | 2020-03-16 19:53:45.860949+00:00 | 31104294373 | Epicorem. Ecoépidémiologie | Medical/Scientific aspects | French | .fr |
1 | 2147693 | http://english.whiov.cas.cn/ | alext | 2020-02-21 03:43:18.706571+00:00 | 2020-03-16 19:52:28.575749+00:00 | 31104294373 | Wuhan Institute of Virulogy, official page in ... | Health Organisation | English | .cn |
2153 | 2147693 | http://english.whiov.cas.cn/ | alext | 2020-02-21 03:43:18.706571+00:00 | 2020-03-16 19:52:28.575749+00:00 | 31104294373 | Wuhan Institute of Virulogy, official page in ... | Health Organisation | English | .cn |
849 | 2147694 | http://www.china-embassy.or.jp/chn/ | alext | 2020-02-21 03:43:18.739126+00:00 | 2020-03-16 19:53:03.086729+00:00 | 31104294373 | 中华人民共和国驻日本大使馆 | Embassy | Chinese | .jp |
2 | 2147694 | http://www.china-embassy.or.jp/chn/ | alext | 2020-02-21 03:43:18.739126+00:00 | 2020-03-16 19:53:03.086729+00:00 | 31104294373 | 中华人民共和国驻日本大使馆 | Embassy | Chinese | .jp |
1557 | 2147695 | http://www.china-embassy.or.jp/jpn/ | alext | 2020-02-21 03:43:18.766308+00:00 | 2020-03-16 19:54:02.280945+00:00 | 31104294373 | 中華人民共和国駐日本国大使館 | Embassy | Japanese | .jp |
3 | 2147695 | http://www.china-embassy.or.jp/jpn/ | alext | 2020-02-21 03:43:18.766308+00:00 | 2020-03-16 19:54:02.280945+00:00 | 31104294373 | 中華人民共和国駐日本国大使館 | Embassy | Japanese | .jp |
4 | 2147696 | https://cadenaser.com/tag/ncov/a/ | alext | 2020-02-21 03:43:18.791716+00:00 | 2020-03-16 19:54:19.694418+00:00 | 31104294373 | Coronavirus de Wuhan | Cadena Ser | Spanish | .com |
1757 | 2147697 | https://doktor.frettabladid.is/sjukdomur/27626-2/ | alext | 2020-02-21 03:43:18.814377+00:00 | 2020-03-16 19:54:20.668796+00:00 | 31104294373 | Allt sem þú þarft að vita um Kóróna veirur (co... | Health care information | Icelandic | .is |
We can see that there are a large number of Portuguese seeds. I guess because someone involved in web archiving in Portugal or Brazil got busy.
altair.Chart(seeds).mark_bar().encode(
altair.X('language', title='Language'),
altair.Y('count(id)')
)
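One thing to watch for with a chart like this: the language values are hand-entered metadata, so the same language could show up under several spellings and split across bars. A hypothetical normalization step (the sample values here are invented, and the collection's metadata may already be consistent):

```python
import pandas

# Hand-entered metadata often varies in case and whitespace; strip and
# title-case it, and give rows with no language a bucket of their own.
langs = pandas.Series([" English", "english", "English", None])
cleaned = langs.str.strip().str.title().fillna("Unknown")
cleaned.value_counts()  # English: 3, Unknown: 1
```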
We can see that the vast majority of these seeds were entered into Archive-It on February 20, 2020, presumably from the spreadsheet sitting behind the Google Form.
altair.Chart(seeds).mark_bar().encode(
altair.X('monthdate(created)', title='Created'),
altair.Y('count(id)')
)
Similarly, we can look at when each seed was last updated.
altair.Chart(seeds).mark_bar().encode(
altair.X('monthdate(updated)', title='Updates'),
altair.Y('count(id)')
)
It looks like most of the seeds were last updated a few days ago. But does this mean that was the last time they were crawled?
Oddly, I couldn't seem to get any of the crawl-related Partner API endpoints to work. Maybe I need to have created the crawls? At any rate, I can use the URL to look directly in the Wayback Machine to see what is available. The EDGI folks have created a nice wayback module that lets you easily look up URLs in the Wayback Machine (it uses the CDX API behind the scenes).
This can take some time, so I'm going to save off the results in a `crawls.csv`. If you prefer to use the stored `crawls.csv` you can skip ahead to Section 7. This will collect crawl information for these URLs from 2019-10-01 on, so we can look at their coverage before and after the project started.
out = csv.writer(open('data/crawls.csv', 'w'))
out.writerow(['timestamp', 'url', 'status_code', 'archive_url'])

wb = wayback.WaybackClient()
for index, row in seeds.iterrows():
    try:
        for crawl in wb.search(row.url, from_date=datetime.datetime(2019, 10, 1)):
            out.writerow([
                crawl.timestamp.isoformat(),
                crawl.url,
                crawl.status_code,
                crawl.view_url
            ])
    except Exception as e:
        print(e)
403 Client Error: Forbidden for url: http://web.archive.org/cdx/search/cdx?url=https%3A%2F%2Fa2larm.cz%2F2020%2F02%2Fslavoj-zizek-melancholicka-krasa-virove-pandemie%2F&from=20191001000000&showResumeKey=true&resolveRevisits=true
403 Client Error: Forbidden for url: http://web.archive.org/cdx/search/cdx?url=https%3A%2F%2Fnationalpost.com%2Fhealth%2Fbio-warfare-experts-question-why-canada-was-sending-lethal-viruses-to-china&from=20191001000000&showResumeKey=true&resolveRevisits=true
403 Client Error: Forbidden for url: http://web.archive.org/cdx/search/cdx?url=https%3A%2F%2Fwww.cagle.com%2Fdave-granlund%2F2020%2F01%2Fcoronavirus-usa&from=20191001000000&showResumeKey=true&resolveRevisits=true
403 Client Error: Forbidden for url: http://web.archive.org/cdx/search/cdx?url=https%3A%2F%2Fpoliticalcartoons.com%2F%3Fs%3Dcoronavirus&from=20191001000000&showResumeKey=true&resolveRevisits=true
It's interesting that some of the URLs are forbidden for viewing. I'm not sure what's going on there. One important thing to keep in mind is that these URLs could have been crawled by other users of Archive-It or by the Internet Archive's own crawlers.
Now let's load in the `crawls.csv` as a DataFrame and look at the number of crawls over time. It's also useful to save a sorted version of `crawls.csv` so that it can easily be diffed with previous versions.
crawls = pandas.read_csv('data/crawls.csv', parse_dates=['timestamp'])
crawls = crawls.sort_values('timestamp')
crawls.to_csv('data/crawls.csv')
crawls
timestamp | url | status_code | archive_url | |
---|---|---|---|---|
15742 | 2019-10-01 01:24:55 | http://www.dw.com/ | NaN | http://web.archive.org/web/20191001012455/http... |
15743 | 2019-10-01 01:24:55 | https://www.dw.com/ | NaN | http://web.archive.org/web/20191001012455/http... |
69764 | 2019-10-01 01:56:12 | https://www.colorado.gov/cdphe | 200.0 | http://web.archive.org/web/20191001015612/http... |
69195 | 2019-10-01 02:38:36 | https://www.healthlinkbc.ca/ | 200.0 | http://web.archive.org/web/20191001023836/http... |
26215 | 2019-10-01 03:14:00 | https://cn.ambafrance.org/ | 200.0 | http://web.archive.org/web/20191001031400/http... |
... | ... | ... | ... | ... |
67879 | 2020-03-27 15:01:46 | https://www.ecdc.europa.eu/en/novel-coronaviru... | 200.0 | http://web.archive.org/web/20200327150146/http... |
65286 | 2020-03-27 15:02:42 | https://news.ifeng.com/c/special/7tPlDSzDgVk | 200.0 | http://web.archive.org/web/20200327150242/http... |
21852 | 2020-03-27 15:03:11 | https://www.nbcnews.com/health/coronavirus | 200.0 | http://web.archive.org/web/20200327150311/http... |
65287 | 2020-03-27 15:11:04 | https://news.ifeng.com/c/special/7tPlDSzDgVk | 200.0 | http://web.archive.org/web/20200327151104/http... |
65288 | 2020-03-27 15:19:26 | https://news.ifeng.com/c/special/7tPlDSzDgVk | 200.0 | http://web.archive.org/web/20200327151926/http... |
70752 rows × 4 columns
crawls_per_day = crawls.set_index('timestamp').resample('1D')['url'].count()
crawls_per_day = crawls_per_day.reset_index()
crawls_per_day.columns = ['date', 'crawls']
crawls_per_day
date | crawls | |
---|---|---|
0 | 2019-10-01 | 22 |
1 | 2019-10-02 | 49 |
2 | 2019-10-03 | 22 |
3 | 2019-10-04 | 52 |
4 | 2019-10-05 | 37 |
... | ... | ... |
174 | 2020-03-23 | 1417 |
175 | 2020-03-24 | 1405 |
176 | 2020-03-25 | 1242 |
177 | 2020-03-26 | 1998 |
178 | 2020-03-27 | 1055 |
179 rows × 2 columns
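To make the `resample` step concrete, here is a tiny made-up example: three timestamps across two days collapse into per-day counts.

```python
import pandas

# Three crawls on two calendar days; resampling the timestamp index to
# one-day bins and counting a column gives crawls per day.
df = pandas.DataFrame({
    "timestamp": pandas.to_datetime(
        ["2020-03-01 05:00", "2020-03-01 18:30", "2020-03-02 09:15"]),
    "url": ["a", "b", "a"],
})
df.set_index("timestamp").resample("1D")["url"].count().tolist()  # [2, 1]
```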
altair.Chart(crawls_per_day, width=800).mark_bar().encode(
altair.X('date', title='Crawl Date'),
altair.Y('crawls', title='Crawls')
)
We can definitely see these URLs are being crawled a whole lot more since the start of the project. But the graph shows what has been crawled (irrespective of who did it). It also doesn't show what seed URLs have not been crawled yet.
To see what might be missing, let's first group our crawl data by URL and count how many crawls there have been for each URL.
crawls_by_url = crawls.groupby('url').count().timestamp
crawls_by_url.name = 'crawls'
crawls_by_url.head()
url
http://9news.com.au/coronavirus                                              2
http://abcnews.go.com/Health/1300-people-died-flu-year/story?id=67754182    71
http://abola.pt/africa/2020-02-01/angola-entre-os-paises-africanos-com-maior-risco-de-contagio-do-coronavirus/827264    1
http://abola.pt/nnh/2020-02-03/formula-1-coronavirus-ameaca-gp-da-china/827542    1
http://albertahealthservices.ca/                                             2
Name: crawls, dtype: int64
Next we can take our `seeds` DataFrame and index it by URL, so that we can add our `crawls_by_url` series to it, since it is also indexed by URL. It is kinda nice how pandas makes this join easy. The use of `fillna` converts any null values (where there have been no crawls yet) to 0.
seeds_by_url = seeds.set_index('url')
seeds_by_url['crawls'] = crawls_by_url
seeds_by_url.crawls = seeds_by_url.crawls.fillna(0)
seeds_by_url.head()
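The join deserves a closer look: assigning a Series to a DataFrame column aligns on the index, so URLs with no crawls come through as NaN until `fillna` zeroes them. A toy version of the same steps, with invented values:

```python
import pandas

# Seeds indexed by URL, and crawl counts for only some of those URLs.
seeds_demo = pandas.DataFrame({"title": ["A", "B", "C"]},
                              index=["u1", "u2", "u3"])
crawl_counts = pandas.Series({"u1": 4, "u3": 2}, name="crawls")

# Assignment aligns on the index; u2 has no count, so it becomes NaN,
# which fillna(0) then turns into zero.
seeds_demo["crawls"] = crawl_counts
seeds_demo["crawls"] = seeds_demo["crawls"].fillna(0)
seeds_demo["crawls"].tolist()  # [4.0, 0.0, 2.0]
```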
id | creator | created | updated | crawl_definition | title | description | language | tld | crawls | |
---|---|---|---|---|---|---|---|---|---|---|
url | ||||||||||
http://coronavirus.fr/ | 2147692 | alext | 2020-02-21 03:43:18.662353+00:00 | 2020-03-16 19:53:45.860949+00:00 | 31104294373 | Epicorem. Ecoépidémiologie | Medical/Scientific aspects | French | .fr | 4.0 |
http://coronavirus.fr/ | 2147692 | alext | 2020-02-21 03:43:18.662353+00:00 | 2020-03-16 19:53:45.860949+00:00 | 31104294373 | Epicorem. Ecoépidémiologie | Medical/Scientific aspects | French | .fr | 4.0 |
http://english.whiov.cas.cn/ | 2147693 | alext | 2020-02-21 03:43:18.706571+00:00 | 2020-03-16 19:52:28.575749+00:00 | 31104294373 | Wuhan Institute of Virulogy, official page in ... | Health Organisation | English | .cn | 70.0 |
http://english.whiov.cas.cn/ | 2147693 | alext | 2020-02-21 03:43:18.706571+00:00 | 2020-03-16 19:52:28.575749+00:00 | 31104294373 | Wuhan Institute of Virulogy, official page in ... | Health Organisation | English | .cn | 70.0 |
http://www.china-embassy.or.jp/chn/ | 2147694 | alext | 2020-02-21 03:43:18.739126+00:00 | 2020-03-16 19:53:03.086729+00:00 | 31104294373 | 中华人民共和国驻日本大使馆 | Embassy | Chinese | .jp | 306.0 |
So now we can see which seeds still need to be crawled (or need to have their crawls made public).
missing = seeds_by_url[seeds_by_url.crawls == 0.0]
print("{0} URLS are missing crawls, which is {1:.2f}% of the total seeds.".format(
    len(missing),
    len(missing) / len(seeds_by_url) * 100
))
438 URLS are missing crawls, which is 15.65% of the total seeds.
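A hypothetical next step would be to slice the uncrawled seeds by top-level domain to see where the coverage gaps cluster; sketched here on a made-up frame shaped like `missing`:

```python
import pandas

# Invented stand-in for the `missing` frame above: group the uncrawled
# seeds by TLD and rank by count.
missing_demo = pandas.DataFrame({"tld": [".com", ".fr", ".com", ".jp"]})
missing_demo.groupby("tld").size().sort_values(ascending=False)
```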