This notebook explores the seeds being crawled in the Novel Coronavirus (COVID-19) Archive-It collection. It uses the Archive-It Partner API, which does not seem to require a key for public collections (yay). More context for this collecting effort can be found in this IIPC blog post.
First let's import some things we're going to need later. It's useful to do them all here at the beginning in case you want to skip parts of the data collection and use the data that is already present in the repository.
import csv
import altair
import pandas
import wayback
import datetime
import requests
First let's download the seeds in the collection and save them as a CSV. If you want to use the CSV that's already here you can move on to Section 2. We're going to write out the data to a file called `iipc.csv`. You can see the type of data that is returned by looking at this API response. The Archive-It Partner API has a route for returning seeds for a given collection, selected with the `collection` parameter. We can use the `limit` and `offset` parameters to walk through the results page by page without fetching all of them at once.
url = 'https://partner.archive-it.org/api/seed'
params = {
    "collection": 13529,
    "limit": 100
}
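Since the pagination is just arithmetic on `offset`, it can be handy to compute the query for any page up front. A small sketch (the `page_params` helper is my own invention, not part of the Partner API):

```python
def page_params(collection_id, page, page_size=100):
    # Build the query for one page of results: each page advances the
    # offset by the page size, and the API returns an empty list once
    # the offset passes the last seed.
    return {
        "collection": collection_id,
        "limit": page_size,
        "offset": page * page_size,
    }

page_params(13529, 3)  # {'collection': 13529, 'limit': 100, 'offset': 300}
```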
Now we can create a loop that keeps fetching results and incrementing the offset until there are no more seeds. We could have used the CSV output, but it is useful to normalize some of the structured metadata. This will likely take a few minutes to run.
out = csv.writer(open('data/iipc.csv', 'w'))
out.writerow([
    "id",
    "url",
    "creator",
    "created",
    "updated",
    "crawl_definition",
    "title",
    "description",
    "language",
    "tld"
])
def first_val(meta, name):
    return meta[name][0]["value"] if name in meta else None
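To see what `first_val` does, here it is again run against a made-up metadata payload in the shape the Partner API returns, where each field maps to a list of `{"value": ...}` dicts (the example values are invented):

```python
# Sample metadata shaped like the Partner API's response; this payload
# is made up for illustration.
meta = {
    "Title": [{"value": "Example title"}],
    "Language": [{"value": "English"}],
}

def first_val(meta, name):
    return meta[name][0]["value"] if name in meta else None

first_val(meta, "Title")        # 'Example title'
first_val(meta, "Description")  # None (field absent)
```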
params['offset'] = 0
while True:
    resp = requests.get(url, params=params)
    seeds = resp.json()
    if len(seeds) == 0:
        break
    for seed in seeds:
        meta = seed["metadata"]
        out.writerow([
            seed["id"],
            seed["url"],
            seed["created_by"],
            seed["created_date"],
            seed["last_updated_date"],
            seed["crawl_definition"],
            first_val(meta, "Title"),
            first_val(meta, "Description"),
            first_val(meta, "Language"),
            first_val(meta, "Top-Level Domain")
        ])
    params['offset'] += 100
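The paging loop above can also be factored into a reusable generator. A sketch, where the hypothetical `fetch` callable stands in for the `requests.get` call so the logic can be tried without hitting the API:

```python
def iter_pages(fetch, page_size=100):
    # fetch(offset) returns one page of seeds (a list); iteration stops
    # at the first empty page, mirroring the while loop above.
    offset = 0
    while True:
        page = fetch(offset)
        if not page:
            return
        yield from page
        offset += len(page)

# Try it against a fake "API" backed by a plain list.
data = list(range(7))
list(iter_pages(lambda offset: data[offset:offset + 3], page_size=3))
# [0, 1, 2, 3, 4, 5, 6]
```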
# Sanity check: page through the seeds again and confirm that a known
# URL is present in the collection.
url = 'https://partner.archive-it.org/api/seed'
params = {
    "collection": 13529,
    "offset": 0,
    "limit": 100
}

while True:
    resp = requests.get(url, params=params)
    seeds = resp.json()
    if len(seeds) == 0:
        break
    for seed in seeds:
        if seed['url'] == 'https://www.health.govt.nz/our-work/diseases-and-conditions/covid-19-novel-coronavirus/':
            print(seed['url'])
    params['offset'] += len(seeds)
So now you should hopefully see an updated `iipc.csv`! First let's load our `iipc.csv` into a Pandas DataFrame where we can more easily manipulate it.
seeds = pandas.read_csv('data/iipc.csv', parse_dates=["created", "updated"])
seeds
id | url | creator | created | updated | crawl_definition | title | description | language | tld | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 2147692 | http://coronavirus.fr/ | alext | 2020-02-21 03:43:18.662353+00:00 | 2020-03-16 19:53:45.860949+00:00 | 31104294373 | Epicorem. Ecoépidémiologie | Medical/Scientific aspects | French | .fr |
1 | 2147693 | http://english.whiov.cas.cn/ | alext | 2020-02-21 03:43:18.706571+00:00 | 2020-03-16 19:52:28.575749+00:00 | 31104294373 | Wuhan Institute of Virulogy, official page in ... | Health Organisation | English | .cn |
2 | 2147694 | http://www.china-embassy.or.jp/chn/ | alext | 2020-02-21 03:43:18.739126+00:00 | 2020-03-16 19:53:03.086729+00:00 | 31104294373 | 中华人民共和国驻日本大使馆 | Embassy | Chinese | .jp |
3 | 2147695 | http://www.china-embassy.or.jp/jpn/ | alext | 2020-02-21 03:43:18.766308+00:00 | 2020-03-16 19:54:02.280945+00:00 | 31104294373 | 中華人民共和国駐日本国大使館 | Embassy | Japanese | .jp |
4 | 2147696 | https://cadenaser.com/tag/ncov/a/ | alext | 2020-02-21 03:43:18.791716+00:00 | 2020-03-16 19:54:19.694418+00:00 | 31104294373 | Coronavirus de Wuhan | Cadena Ser | Spanish | .com |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2794 | 2173031 | https://www.suntrust.com/resource-center/comme... | nicolab | 2020-03-26 15:41:06.629121+00:00 | 2020-03-26 15:41:06.629220+00:00 | 31104300763 | NaN | NaN | NaN | NaN |
2795 | 2148539 | https://www.eluniversal.com/economia/60496/cor... | alext | 2020-02-21 04:11:12.713039+00:00 | 2020-03-16 19:53:55.500654+00:00 | 31104294373 | Coronavirus afecta economía mundial y rutas co... | political aspects,economic aspects, diplomacy | Spanish | .com |
2796 | 2149377 | https://ue.delegfrance.org/coronavirus-activat... | alext | 2020-02-21 04:28:15.941569+00:00 | 2020-03-16 19:53:43.544395+00:00 | 31104294373 | Délégation France UE. Coronavirus : Activation... | Institutional website | French | .org |
2797 | 2149468 | https://www.healthdirect.gov.au/coronavirus | alext | 2020-02-21 04:29:30.095448+00:00 | 2020-03-16 19:52:16.008948+00:00 | 31104297068 | Coronavirus disease (COVID-19) | Government health information | English | .au |
2798 | 2149123 | https://www.youtube.com/watch?v=N6BAkWzrses | alext | 2020-02-21 04:18:08.093558+00:00 | 2020-03-16 19:53:35.401711+00:00 | 31104294373 | CORONAVIRUS PLANO WARNING - YouTube | video, containment efforts | Portuguese | .com |
2799 rows × 10 columns
We can sort them by created time in ascending order, and save them again. This might make it easier to compare them over time with `git diff`.
seeds = seeds.sort_values('created')
seeds.to_csv('data/iipc.csv')
seeds.head(10)
id | url | creator | created | updated | crawl_definition | title | description | language | tld | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 2147692 | http://coronavirus.fr/ | alext | 2020-02-21 03:43:18.662353+00:00 | 2020-03-16 19:53:45.860949+00:00 | 31104294373 | Epicorem. Ecoépidémiologie | Medical/Scientific aspects | French | .fr |
636 | 2147692 | http://coronavirus.fr/ | alext | 2020-02-21 03:43:18.662353+00:00 | 2020-03-16 19:53:45.860949+00:00 | 31104294373 | Epicorem. Ecoépidémiologie | Medical/Scientific aspects | French | .fr |
1 | 2147693 | http://english.whiov.cas.cn/ | alext | 2020-02-21 03:43:18.706571+00:00 | 2020-03-16 19:52:28.575749+00:00 | 31104294373 | Wuhan Institute of Virulogy, official page in ... | Health Organisation | English | .cn |
2153 | 2147693 | http://english.whiov.cas.cn/ | alext | 2020-02-21 03:43:18.706571+00:00 | 2020-03-16 19:52:28.575749+00:00 | 31104294373 | Wuhan Institute of Virulogy, official page in ... | Health Organisation | English | .cn |
849 | 2147694 | http://www.china-embassy.or.jp/chn/ | alext | 2020-02-21 03:43:18.739126+00:00 | 2020-03-16 19:53:03.086729+00:00 | 31104294373 | 中华人民共和国驻日本大使馆 | Embassy | Chinese | .jp |
2 | 2147694 | http://www.china-embassy.or.jp/chn/ | alext | 2020-02-21 03:43:18.739126+00:00 | 2020-03-16 19:53:03.086729+00:00 | 31104294373 | 中华人民共和国驻日本大使馆 | Embassy | Chinese | .jp |
1557 | 2147695 | http://www.china-embassy.or.jp/jpn/ | alext | 2020-02-21 03:43:18.766308+00:00 | 2020-03-16 19:54:02.280945+00:00 | 31104294373 | 中華人民共和国駐日本国大使館 | Embassy | Japanese | .jp |
3 | 2147695 | http://www.china-embassy.or.jp/jpn/ | alext | 2020-02-21 03:43:18.766308+00:00 | 2020-03-16 19:54:02.280945+00:00 | 31104294373 | 中華人民共和国駐日本国大使館 | Embassy | Japanese | .jp |
4 | 2147696 | https://cadenaser.com/tag/ncov/a/ | alext | 2020-02-21 03:43:18.791716+00:00 | 2020-03-16 19:54:19.694418+00:00 | 31104294373 | Coronavirus de Wuhan | Cadena Ser | Spanish | .com |
1757 | 2147697 | https://doktor.frettabladid.is/sjukdomur/27626-2/ | alext | 2020-02-21 03:43:18.814377+00:00 | 2020-03-16 19:54:20.668796+00:00 | 31104294373 | Allt sem þú þarft að vita um Kóróna veirur (co... | Health care information | Icelandic | .is |
We can see that there are a large number of Portuguese seeds. I guess because someone involved in web archiving in Portugal or Brazil got busy.
altair.Chart(seeds).mark_bar().encode(
altair.X('language', title='Language'),
altair.Y('count(id)')
)
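One thing to watch for with a chart like this: the language values are hand-entered metadata, so the same language could show up under several spellings and split across bars. A hypothetical normalization step (the sample values here are invented, and the collection's metadata may already be consistent):

```python
import pandas

# Hand-entered metadata often varies in case and whitespace; strip and
# title-case it, and give rows with no language a bucket of their own.
langs = pandas.Series([" English", "english", "English", None])
cleaned = langs.str.strip().str.title().fillna("Unknown")
cleaned.value_counts()  # English: 3, Unknown: 1
```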
We can see that the vast majority of these seeds were entered into Archive-It on February 20, 2020, presumably from the spreadsheet sitting behind the Google Form.
altair.Chart(seeds).mark_bar().encode(
altair.X('monthdate(created)', title='Created'),
altair.Y('count(id)')
)
Similarly, we can look at when each seed was last updated.
altair.Chart(seeds).mark_bar().encode(
altair.X('monthdate(updated)', title='Updates'),
altair.Y('count(id)')
)
It looks like most of the seeds were last updated a few days ago. But does this mean that was the last time they were crawled?
Oddly, I couldn't seem to get any of the crawl-related Partner API endpoints to work. Maybe I need to have created the crawls? At any rate, I can use the URL to look directly in the Wayback Machine to see what is available. The EDGI folks have created a nice wayback module that lets you easily look up URLs in the Wayback Machine (it uses the CDX API behind the scenes).
This can take some time, so I'm going to save off the results in a `crawls.csv`. If you prefer to use the stored `crawls.csv` you can skip ahead to Section 7. This will collect crawl information for these URLs from 2019-10-01 on, so we can look at their coverage before and after the project started.
out = csv.writer(open('data/crawls.csv', 'w'))
out.writerow(['timestamp', 'url', 'status_code', 'archive_url'])

wb = wayback.WaybackClient()
for index, row in seeds.iterrows():
    try:
        for crawl in wb.search(row.url, from_date=datetime.datetime(2019, 10, 1)):
            out.writerow([
                crawl.timestamp.isoformat(),
                crawl.url,
                crawl.status_code,
                crawl.view_url
            ])
    except Exception as e:
        print(e)
403 Client Error: Forbidden for url: http://web.archive.org/cdx/search/cdx?url=https%3A%2F%2Fa2larm.cz%2F2020%2F02%2Fslavoj-zizek-melancholicka-krasa-virove-pandemie%2F&from=20191001000000&showResumeKey=true&resolveRevisits=true
403 Client Error: Forbidden for url: http://web.archive.org/cdx/search/cdx?url=https%3A%2F%2Fnationalpost.com%2Fhealth%2Fbio-warfare-experts-question-why-canada-was-sending-lethal-viruses-to-china&from=20191001000000&showResumeKey=true&resolveRevisits=true
403 Client Error: Forbidden for url: http://web.archive.org/cdx/search/cdx?url=https%3A%2F%2Fwww.cagle.com%2Fdave-granlund%2F2020%2F01%2Fcoronavirus-usa&from=20191001000000&showResumeKey=true&resolveRevisits=true
403 Client Error: Forbidden for url: http://web.archive.org/cdx/search/cdx?url=https%3A%2F%2Fpoliticalcartoons.com%2F%3Fs%3Dcoronavirus&from=20191001000000&showResumeKey=true&resolveRevisits=true
It's interesting that some of the URLs are forbidden for viewing. I'm not sure what's going on there. One important thing to keep in mind is that these URLs could have been crawled by other users of Archive-It or by the Internet Archive's own crawlers.
Now let's load in the `crawls.csv` as a DataFrame and look at the number of crawls over time. It's also useful to save a sorted version of `crawls.csv` so that it can easily be diffed with previous versions.
crawls = pandas.read_csv('data/crawls.csv', parse_dates=['timestamp'])
crawls = crawls.sort_values('timestamp')
crawls.to_csv('data/crawls.csv')
crawls
timestamp | url | status_code | archive_url | |
---|---|---|---|---|
15742 | 2019-10-01 01:24:55 | http://www.dw.com/ | NaN | http://web.archive.org/web/20191001012455/http... |
15743 | 2019-10-01 01:24:55 | https://www.dw.com/ | NaN | http://web.archive.org/web/20191001012455/http... |
69764 | 2019-10-01 01:56:12 | https://www.colorado.gov/cdphe | 200.0 | http://web.archive.org/web/20191001015612/http... |
69195 | 2019-10-01 02:38:36 | https://www.healthlinkbc.ca/ | 200.0 | http://web.archive.org/web/20191001023836/http... |
26215 | 2019-10-01 03:14:00 | https://cn.ambafrance.org/ | 200.0 | http://web.archive.org/web/20191001031400/http... |
... | ... | ... | ... | ... |
67879 | 2020-03-27 15:01:46 | https://www.ecdc.europa.eu/en/novel-coronaviru... | 200.0 | http://web.archive.org/web/20200327150146/http... |
65286 | 2020-03-27 15:02:42 | https://news.ifeng.com/c/special/7tPlDSzDgVk | 200.0 | http://web.archive.org/web/20200327150242/http... |
21852 | 2020-03-27 15:03:11 | https://www.nbcnews.com/health/coronavirus | 200.0 | http://web.archive.org/web/20200327150311/http... |
65287 | 2020-03-27 15:11:04 | https://news.ifeng.com/c/special/7tPlDSzDgVk | 200.0 | http://web.archive.org/web/20200327151104/http... |
65288 | 2020-03-27 15:19:26 | https://news.ifeng.com/c/special/7tPlDSzDgVk | 200.0 | http://web.archive.org/web/20200327151926/http... |
70752 rows × 4 columns
crawls_per_day = crawls.set_index('timestamp').resample('1D')['url'].count()
crawls_per_day = crawls_per_day.reset_index()
crawls_per_day.columns = ['date', 'crawls']
crawls_per_day
date | crawls | |
---|---|---|
0 | 2019-10-01 | 22 |
1 | 2019-10-02 | 49 |
2 | 2019-10-03 | 22 |
3 | 2019-10-04 | 52 |
4 | 2019-10-05 | 37 |
... | ... | ... |
174 | 2020-03-23 | 1417 |
175 | 2020-03-24 | 1405 |
176 | 2020-03-25 | 1242 |
177 | 2020-03-26 | 1998 |
178 | 2020-03-27 | 1055 |
179 rows × 2 columns
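To make the `resample` step concrete, here is a tiny made-up example: three timestamps across two days collapse into per-day counts.

```python
import pandas

# Three crawls on two calendar days; resampling the timestamp index to
# one-day bins and counting a column gives crawls per day.
df = pandas.DataFrame({
    "timestamp": pandas.to_datetime(
        ["2020-03-01 05:00", "2020-03-01 18:30", "2020-03-02 09:15"]),
    "url": ["a", "b", "a"],
})
df.set_index("timestamp").resample("1D")["url"].count().tolist()  # [2, 1]
```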
altair.Chart(crawls_per_day, width=800).mark_bar().encode(
altair.X('date', title='Crawl Date'),
altair.Y('crawls', title='Crawls')
)
We can definitely see these URLs are being crawled a whole lot more since the start of the project. But the graph shows what has been crawled (irrespective of who did it). It also doesn't show what seed URLs have not been crawled yet.
To see what might be missing, let's first group our crawl data by URL and count how many crawls there have been for each URL.
crawls_by_url = crawls.groupby('url').count().timestamp
crawls_by_url.name = 'crawls'
crawls_by_url.head()
url
http://9news.com.au/coronavirus                                              2
http://abcnews.go.com/Health/1300-people-died-flu-year/story?id=67754182    71
http://abola.pt/africa/2020-02-01/angola-entre-os-paises-africanos-com-maior-risco-de-contagio-do-coronavirus/827264    1
http://abola.pt/nnh/2020-02-03/formula-1-coronavirus-ameaca-gp-da-china/827542    1
http://albertahealthservices.ca/                                             2
Name: crawls, dtype: int64
Next we can take our `seeds` DataFrame and index it by URL, so that we can add our `crawls_by_url` series to it, since it is also indexed by URL. It is kinda nice how pandas makes this join easy. The use of `fillna` converts any null values (where there have been no crawls yet) to 0.
seeds_by_url = seeds.set_index('url')
seeds_by_url['crawls'] = crawls_by_url
seeds_by_url.crawls = seeds_by_url.crawls.fillna(0)
seeds_by_url.head()
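The join deserves a closer look: assigning a Series to a DataFrame column aligns on the index, so URLs with no crawls come through as NaN until `fillna` zeroes them. A toy version of the same steps, with invented values:

```python
import pandas

# Seeds indexed by URL, and crawl counts for only some of those URLs.
seeds_demo = pandas.DataFrame({"title": ["A", "B", "C"]},
                              index=["u1", "u2", "u3"])
crawl_counts = pandas.Series({"u1": 4, "u3": 2}, name="crawls")

# Assignment aligns on the index; u2 has no count, so it becomes NaN,
# which fillna(0) then turns into zero.
seeds_demo["crawls"] = crawl_counts
seeds_demo["crawls"] = seeds_demo["crawls"].fillna(0)
seeds_demo["crawls"].tolist()  # [4.0, 0.0, 2.0]
```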
id | creator | created | updated | crawl_definition | title | description | language | tld | crawls | |
---|---|---|---|---|---|---|---|---|---|---|
url | ||||||||||
http://coronavirus.fr/ | 2147692 | alext | 2020-02-21 03:43:18.662353+00:00 | 2020-03-16 19:53:45.860949+00:00 | 31104294373 | Epicorem. Ecoépidémiologie | Medical/Scientific aspects | French | .fr | 4.0 |
http://coronavirus.fr/ | 2147692 | alext | 2020-02-21 03:43:18.662353+00:00 | 2020-03-16 19:53:45.860949+00:00 | 31104294373 | Epicorem. Ecoépidémiologie | Medical/Scientific aspects | French | .fr | 4.0 |
http://english.whiov.cas.cn/ | 2147693 | alext | 2020-02-21 03:43:18.706571+00:00 | 2020-03-16 19:52:28.575749+00:00 | 31104294373 | Wuhan Institute of Virulogy, official page in ... | Health Organisation | English | .cn | 70.0 |
http://english.whiov.cas.cn/ | 2147693 | alext | 2020-02-21 03:43:18.706571+00:00 | 2020-03-16 19:52:28.575749+00:00 | 31104294373 | Wuhan Institute of Virulogy, official page in ... | Health Organisation | English | .cn | 70.0 |
http://www.china-embassy.or.jp/chn/ | 2147694 | alext | 2020-02-21 03:43:18.739126+00:00 | 2020-03-16 19:53:03.086729+00:00 | 31104294373 | 中华人民共和国驻日本大使馆 | Embassy | Chinese | .jp | 306.0 |
So now we can see which seeds still need to be crawled (or need to have their crawls made public).
missing = seeds_by_url[seeds_by_url.crawls == 0.0]
print("{0} URLS are missing crawls, which is {1:.2f}% of the total seeds.".format(
    len(missing),
    len(missing) / len(seeds_by_url) * 100
))
438 URLS are missing crawls, which is 15.65% of the total seeds.
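A hypothetical next step would be to slice the uncrawled seeds by top-level domain to see where the coverage gaps cluster; sketched here on a made-up frame shaped like `missing`:

```python
import pandas

# Invented stand-in for the `missing` frame above: group the uncrawled
# seeds by TLD and rank by count.
missing_demo = pandas.DataFrame({"tld": [".com", ".fr", ".com", ".jp"]})
missing_demo.groupby("tld").size().sort_values(ascending=False)
```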