The nCovMemory project is a GitHub repository where people are curating stories about COVID-19 in the media and social media. You can see it mentioned in a short NYTimes video documentary about censorship in China: China Is Censoring Coronavirus Stories: These Citizens Are Fighting Back by Christoph Koettl, Muyi Xiao, Nilo Tabrizy and Dmitriy Khavin.
They make their data available at this static website and also as CSV data in their GitHub repository. We can check their data to see whether any of their URLs need to be added to the IIPC collection.
We can download their latest CSV data directly from the web.
import pandas
url = 'https://raw.githubusercontent.com/2019ncovmemory/nCovMemory/master/data/data.csv'
ncovmem = pandas.read_csv(url, index_col='id')
ncovmem
id | category | update | media | date | title | title_en | url | translation_en | is_deleted | alternative | archive |
---|---|---|---|---|---|---|---|---|---|---|---|
4241 | non_fiction | 2020-03-23 | 人间theLivings | 2020-03-23 | 海外疫区里的中国留学生:要学位,还是保命? | NaN | https://mp.weixin.qq.com/s/HkJQ01ZBkerky7BC-xAhiA | NaN | NaN | NaN | http://archive.is/XD3Nz |
4240 | narrative | 2020-03-23 | 在人间living | 2020-03-23 | 今天,武汉封城两个月了 | NaN | https://mp.weixin.qq.com/s/mrWF9nFUxtXnNnyNEf-Ibw | NaN | NaN | NaN | http://archive.is/KPoAT |
4239 | non_fiction | 2020-03-23 | 中国经营报 | 2020-03-23 | “108好汉”为何注射新冠疫苗,这位00后的回答刷屏… | NaN | https://mp.weixin.qq.com/s/GinzGhKnNHZrtlDuVIrkKw | NaN | NaN | NaN | http://archive.is/tHBxO |
4238 | non_fiction | 2020-03-23 | 中国经营报 | 2020-03-23 | 新加坡、澳大利亚“封国”!意大利全国“停产”,美国确诊人数突破3万... | NaN | https://mp.weixin.qq.com/s/hE8J7D-GrkB92GoPnmpcsg | NaN | NaN | NaN | http://archive.is/QcWU5 |
4237 | narrative | 2020-03-22 | WUXU | 2020-03-23 | [四十日谈] 条条大路“不”通罗马,老猫的曲折回意之路 | NaN | https://mp.weixin.qq.com/s/fLlGjOcZcotS-QybkqZjcw | NaN | NaN | NaN | http://archive.ph/ehsAk |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
5 | non_fiction | 2020-02-06 | GQ报道 | 2020-01-29 | 孝感前线医生:武汉更难,我们下面不好意思提要求 | NaN | https://mp.weixin.qq.com/s/uGaFeqrqmLBQe5qdRSTeSQ | NaN | NaN | NaN | https://archive.ph/MnZrn |
4 | non_fiction | 2020-02-06 | GQ报道 | 2020-01-29 | 疫情危机中不被看见的人们:武汉周边城市百姓的自救行动 | NaN | https://mp.weixin.qq.com/s/D8Ob8pNmecHKXg7yR7EWFg | NaN | NaN | NaN | https://archive.ph/vDSj5 |
3 | non_fiction | 2020-02-06 | GQ报道 | 2020-01-28 | 我家离华南海鲜市场很近:返乡、封城、过年,一位武汉大学生的过去一周 | NaN | https://mp.weixin.qq.com/s/n7dXGHh-79d6VEzDhhOUbQ | NaN | NaN | NaN | https://archive.ph/RSmFx |
2 | non_fiction | 2020-02-06 | GQ报道 | 2020-01-28 | 武汉隔离:疫区、信息孤岛与一辆鄂A车的漂流 | NaN | https://mp.weixin.qq.com/s/M-hVivF7NQmZHlu8YMnL_w | NaN | NaN | NaN | http://archive.is/3XKZD |
1 | non_fiction | 2020-02-06 | GQ报道 | 2020-01-27 | 10000个临时发往武汉的口罩 | NaN | https://mp.weixin.qq.com/s/p-uPky_zB6XKcAetthqkKg | NaN | NaN | NaN | https://archive.ph/9s1ug |
4227 rows × 11 columns
Now we need the IIPC seed list. We can get that right here, since we saved it when we ran the Seeds notebook.
seeds = pandas.read_csv('data/iipc.csv')
seeds
 | id | url | creator | created | updated | crawl_definition | title | description | language | tld |
---|---|---|---|---|---|---|---|---|---|---|
0 | 2147692 | http://coronavirus.fr/ | alext | 2020-02-21T03:43:18.662353Z | 2020-03-16T19:53:45.860949Z | 31104294373 | Epicorem. Ecoépidémiologie | Medical/Scientific aspects | French | .fr |
1 | 2147693 | http://english.whiov.cas.cn/ | alext | 2020-02-21T03:43:18.706571Z | 2020-03-16T19:52:28.575749Z | 31104294373 | Wuhan Institute of Virulogy, official page in ... | Health Organisation | English | .cn |
2 | 2147694 | http://www.china-embassy.or.jp/chn/ | alext | 2020-02-21T03:43:18.739126Z | 2020-03-16T19:53:03.086729Z | 31104294373 | 中华人民共和国驻日本大使馆 | Embassy | Chinese | .jp |
3 | 2147695 | http://www.china-embassy.or.jp/jpn/ | alext | 2020-02-21T03:43:18.766308Z | 2020-03-16T19:54:02.280945Z | 31104294373 | 中華人民共和国駐日本国大使館 | Embassy | Japanese | .jp |
4 | 2147696 | https://cadenaser.com/tag/ncov/a/ | alext | 2020-02-21T03:43:18.791716Z | 2020-03-16T19:54:19.694418Z | 31104294373 | Coronavirus de Wuhan | Cadena Ser | Spanish | .com |
There are some rows in the original dataset that lack a value for the url column, so we drop them.
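# keep only rows that have a url; copy so a column can be added later without a SettingWithCopyWarning
ncovmem = ncovmem[ncovmem.url.notna()].copy()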
Let's take a look at the domains that are present in this data.
import altair
from urllib.parse import urlparse
ncovmem['domain'] = ncovmem.url.map(lambda u: urlparse(u).netloc, na_action='ignore')
altair.Chart(ncovmem.reset_index(), title="nCovMemory URLs by Domain", width=800).mark_bar().encode(
    altair.X('domain', title='Domain'),
    altair.Y('count(id)', title='Number of URLs')
)
So the dataset is almost entirely links to qq.com, the domain Tencent uses for WeChat public account articles (mp.weixin.qq.com).
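The chart groups by the full hostname that urlparse returns, so the same tally can be checked numerically and rolled up to a shorter domain. What follows is a minimal sketch on a small hand-made sample rather than the real data. The `registered_domain` helper is hypothetical and naively keeps the last two labels of the hostname, which would be wrong for suffixes such as .co.uk.

```python
import pandas
from urllib.parse import urlparse

def registered_domain(url):
    """Hypothetical helper: naively keep the last two labels of the hostname."""
    host = urlparse(url).hostname or ''
    return '.'.join(host.split('.')[-2:])

# hand-made sample standing in for ncovmem
sample = pandas.DataFrame({'url': [
    'https://mp.weixin.qq.com/s/HkJQ01ZBkerky7BC-xAhiA',
    'https://mp.weixin.qq.com/s/mrWF9nFUxtXnNnyNEf-Ibw',
    'http://coronavirus.fr/',
]})

sample['domain'] = sample.url.map(registered_domain)

# number of URLs per domain, most frequent first
domain_counts = sample.domain.value_counts()
print(domain_counts)
```

On the real data, the same two lines applied to ncovmem would give the counts behind the bar chart.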
Now let's see if any of the nCovMemory URLs are present in the IIPC seed list.
len(set(seeds.url).intersection(ncovmem.url))
0
As you can see, there's no overlap at all between the two sets of URLs. So the nCovMemory dataset should be useful to add to the IIPC collection.
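One caveat is that the comparison above is an exact string match, so two URLs that differ only in scheme, hostname capitalization, or a trailing slash would not count as overlap. Here is a rough sketch of a more forgiving comparison. The `normalize` helper and the example URLs are made up for illustration, and this is only one of several reasonable ways to normalize.

```python
from urllib.parse import urlparse

def normalize(url):
    """Hypothetical helper: drop scheme, port and fragment, lowercase the
    hostname, and strip any trailing slash from the path."""
    parts = urlparse(url.strip())
    host = parts.hostname or ''          # .hostname is already lowercased
    path = parts.path.rstrip('/')
    query = '?' + parts.query if parts.query else ''
    return host + path + query

# made-up URLs standing in for seeds.url and ncovmem.url
seed_urls = ['http://Example.org/covid/', 'https://example.com/a?x=1']
story_urls = ['https://example.org/covid', 'https://example.net/b']

overlap = {normalize(u) for u in seed_urls} & {normalize(u) for u in story_urls}
print(overlap)
```

With the real data frames this could be applied as set(seeds.url.map(normalize)) & set(ncovmem.url.map(normalize)). If that also comes back empty, the conclusion stands on firmer ground.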
Let's save them off for use later.
ncovmem.to_csv('data/ncovmem.csv')