nCovMemory¶

The nCovMemory project is a GitHub repository where people are curating stories about COVID-19 from the media and social media. It is mentioned in a short New York Times video documentary about censorship in China: "China Is Censoring Coronavirus Stories: These Citizens Are Fighting Back" by Christoph Koettl, Muyi Xiao, Nilo Tabrizy and Dmitriy Khavin.

They make their data available on a static website and also as CSV data in their GitHub repository. We can check that data to see whether any of the URLs should be added to the IIPC collection.

GitHub Data¶

We can download their latest CSV data directly from the web.

In [13]:

import pandas

url = 'https://raw.githubusercontent.com/2019ncovmemory/nCovMemory/master/data/data.csv'
ncovmem = pandas.read_csv(url, index_col='id')
ncovmem

Out[13]:

	category	update	media	date	title	title_en	url	translation_en	is_deleted	alternative	archive
id
4241	non_fiction	2020-03-23	人间theLivings	2020-03-23	海外疫区里的中国留学生：要学位，还是保命？	NaN	https://mp.weixin.qq.com/s/HkJQ01ZBkerky7BC-xAhiA	NaN	NaN	NaN	http://archive.is/XD3Nz
4240	narrative	2020-03-23	在人间living	2020-03-23	今天，武汉封城两个月了	NaN	https://mp.weixin.qq.com/s/mrWF9nFUxtXnNnyNEf-Ibw	NaN	NaN	NaN	http://archive.is/KPoAT
4239	non_fiction	2020-03-23	中国经营报	2020-03-23	“108好汉”为何注射新冠疫苗，这位00后的回答刷屏…	NaN	https://mp.weixin.qq.com/s/GinzGhKnNHZrtlDuVIrkKw	NaN	NaN	NaN	http://archive.is/tHBxO
4238	non_fiction	2020-03-23	中国经营报	2020-03-23	新加坡、澳大利亚“封国”！意大利全国“停产”，美国确诊人数突破3万...	NaN	https://mp.weixin.qq.com/s/hE8J7D-GrkB92GoPnmpcsg	NaN	NaN	NaN	http://archive.is/QcWU5
4237	narrative	2020-03-22	WUXU	2020-03-23	[四十日谈] 条条大路“不”通罗马，老猫的曲折回意之路	NaN	https://mp.weixin.qq.com/s/fLlGjOcZcotS-QybkqZjcw	NaN	NaN	NaN	http://archive.ph/ehsAk
...	...	...	...	...	...	...	...	...	...	...	...
5	non_fiction	2020-02-06	GQ报道	2020-01-29	孝感前线医生：武汉更难，我们下面不好意思提要求	NaN	https://mp.weixin.qq.com/s/uGaFeqrqmLBQe5qdRSTeSQ	NaN	NaN	NaN	https://archive.ph/MnZrn
4	non_fiction	2020-02-06	GQ报道	2020-01-29	疫情危机中不被看见的人们：武汉周边城市百姓的自救行动	NaN	https://mp.weixin.qq.com/s/D8Ob8pNmecHKXg7yR7EWFg	NaN	NaN	NaN	https://archive.ph/vDSj5
3	non_fiction	2020-02-06	GQ报道	2020-01-28	我家离华南海鲜市场很近：返乡、封城、过年，一位武汉大学生的过去一周	NaN	https://mp.weixin.qq.com/s/n7dXGHh-79d6VEzDhhOUbQ	NaN	NaN	NaN	https://archive.ph/RSmFx
2	non_fiction	2020-02-06	GQ报道	2020-01-28	武汉隔离：疫区、信息孤岛与一辆鄂A车的漂流	NaN	https://mp.weixin.qq.com/s/M-hVivF7NQmZHlu8YMnL_w	NaN	NaN	NaN	http://archive.is/3XKZD
1	non_fiction	2020-02-06	GQ报道	2020-01-27	10000个临时发往武汉的口罩	NaN	https://mp.weixin.qq.com/s/p-uPky_zB6XKcAetthqkKg	NaN	NaN	NaN	https://archive.ph/9s1ug

4227 rows × 11 columns

Seeds¶

Now we need the IIPC seed list. We can get that right here, since we saved it when we ran the Seeds notebook.

In [15]:

seeds = pandas.read_csv('data/iipc.csv')
seeds

Out[15]:

	id	url	creator	created	updated	crawl_definition	title	description	language	tld
0	2147692	http://coronavirus.fr/	alext	2020-02-21T03:43:18.662353Z	2020-03-16T19:53:45.860949Z	31104294373	Epicorem. Ecoépidémiologie	Medical/Scientific aspects	French	.fr
1	2147693	http://english.whiov.cas.cn/	alext	2020-02-21T03:43:18.706571Z	2020-03-16T19:52:28.575749Z	31104294373	Wuhan Institute of Virulogy, official page in ...	Health Organisation	English	.cn
2	2147694	http://www.china-embassy.or.jp/chn/	alext	2020-02-21T03:43:18.739126Z	2020-03-16T19:53:03.086729Z	31104294373	中华人民共和国驻日本大使馆	Embassy	Chinese	.jp
3	2147695	http://www.china-embassy.or.jp/jpn/	alext	2020-02-21T03:43:18.766308Z	2020-03-16T19:54:02.280945Z	31104294373	中華人民共和国駐日本国大使館	Embassy	Japanese	.jp
4	2147696	https://cadenaser.com/tag/ncov/a/	alext	2020-02-21T03:43:18.791716Z	2020-03-16T19:54:19.694418Z	31104294373	Coronavirus de Wuhan	Cadena Ser	Spanish	.com

Massage¶

Some rows in the original dataset lack a value for url, so we drop them before going any further.

In [64]:

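# Keep only rows that have a URL; copy so later column assignments act on a real DataFrame, not a view.
ncovmem = ncovmem[ncovmem.url.notna()].copy()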
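Before dropping rows, it can be worth checking how many are affected. The following is a minimal sketch on a made-up three-row table (the `toy` DataFrame is hypothetical and only reuses two URLs from the output above); the same `isna().sum()` call could be run against `ncovmem.url`.

```python
import pandas

# Hypothetical miniature of the nCovMemory table; the row with id 2 has no url.
toy = pandas.DataFrame(
    {
        "id": [1, 2, 3],
        "url": [
            "https://mp.weixin.qq.com/s/p-uPky_zB6XKcAetthqkKg",
            None,
            "https://mp.weixin.qq.com/s/M-hVivF7NQmZHlu8YMnL_w",
        ],
    }
).set_index("id")

# Count the rows without a URL, then drop them.
missing = int(toy.url.isna().sum())
kept = toy.dropna(subset=["url"])

print(missing, len(kept))
```

Here `dropna(subset=["url"])` has the same effect as the boolean filter with `notna()` used in the cell above.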

Domains¶

Let's take a look at the domains that are present in this data.

In [65]:

import altair
from urllib.parse import urlparse

# Extract the host (netloc) of each URL into a new column.
ncovmem = ncovmem.assign(domain=ncovmem.url.map(lambda u: urlparse(u).netloc, na_action='ignore'))

# Bar chart of how many URLs come from each domain.
altair.Chart(ncovmem.reset_index(), title="nCovMemory URLs by Domain", width=800).mark_bar().encode(
    altair.X('domain', title='Domain'),
    altair.Y('count(id)', title='Number of URLs')
)

Out[65]:

So the dataset consists almost entirely of links to qq.com, specifically mp.weixin.qq.com, where articles from public accounts on WeChat (Tencent's messaging platform) are published.
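The same tally can be made without a chart. This is a small sketch using only the standard library; `sample_urls` is a handful of URLs copied from the tables above, and `hostname` is a helper name introduced here for illustration. Note that it counts full host names such as mp.weixin.qq.com rather than the registered domain qq.com.

```python
from collections import Counter
from urllib.parse import urlparse

# A few URLs copied from the tables above, used only as sample input.
sample_urls = [
    "https://mp.weixin.qq.com/s/HkJQ01ZBkerky7BC-xAhiA",
    "https://mp.weixin.qq.com/s/mrWF9nFUxtXnNnyNEf-Ibw",
    "http://coronavirus.fr/",
]

def hostname(url):
    """Return the lowercased host part of a URL, or '' if there is none."""
    return (urlparse(url).hostname or "").lower()

# Count how many URLs point at each host, most common first.
host_counts = Counter(hostname(u) for u in sample_urls)
print(host_counts.most_common())
# → [('mp.weixin.qq.com', 2), ('coronavirus.fr', 1)]
```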

Overlap?¶

Now let's see if any of the nCovMemory URLs are already present in the IIPC seed list.

In [66]:

len(set(seeds.url).intersection(ncovmem.url))

Out[66]:

0

There is no overlap at all between the two sets of URLs, so the nCovMemory dataset should be a useful addition to the IIPC collection.
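One caveat: the intersection above compares URLs as exact strings, so two URLs that differ only in scheme or a trailing slash count as different. The sketch below shows one possible looser comparison. The `normalize` function is a hypothetical helper, not part of the notebook, and the `https://coronavirus.fr` candidate is an invented variant of a seed URL included only to show the effect.

```python
from urllib.parse import urlsplit

def normalize(url):
    """Hypothetical normalizer: ignore scheme, port and fragment,
    lowercase the host, and strip trailing slashes from the path."""
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    path = parts.path.rstrip("/")
    query = f"?{parts.query}" if parts.query else ""
    return f"{host}{path}{query}"

# Two seeds from the IIPC table above.
seed_urls = ["http://coronavirus.fr/", "http://english.whiov.cas.cn/"]

# One invented variant of a seed URL plus one real nCovMemory URL.
candidate_urls = [
    "https://coronavirus.fr",
    "https://mp.weixin.qq.com/s/HkJQ01ZBkerky7BC-xAhiA",
]

exact = set(seed_urls) & set(candidate_urls)
loose = {normalize(u) for u in seed_urls} & {normalize(u) for u in candidate_urls}

print(len(exact), len(loose))
# → 0 1
```

Given that the nCovMemory URLs are almost all article links on a single host, a looser comparison seems unlikely to change the result here, but it is a cheap check to run before concluding there is no overlap.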

Let's save them off for use later.

In [67]:

ncovmem.to_csv('data/ncovmem.csv')
