So we've done some work in other notebooks to collect URLs related to COVID-19 from social bookmarking sites and projects. Let's use this notebook to aggregate them into a single dataset.
import pandas
reddit = pandas.read_csv('data/reddit.csv')
pinboard = pandas.read_csv('data/pinboard.csv')
ncovmem = pandas.read_csv('data/ncovmem.csv')
iipc = pandas.read_csv('data/iipc.csv')
While some of the details of these datasets differ, they all contain columns for url, title and created. In the case of ncovmem the created time is stored in a column called updated, so let's rename it.
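To see why this rename matters, here is a quick sketch with toy dataframes (reddit_demo and ncovmem_demo are illustrative stand-ins, not the real data):

```python
import pandas

# Toy stand-ins for two of the real CSVs (hypothetical columns, for illustration).
reddit_demo = pandas.DataFrame(columns=['url', 'title', 'created', 'score'])
ncovmem_demo = pandas.DataFrame(columns=['url', 'title', 'updated'])

# Columns the two frames share before any renaming.
shared = set(reddit_demo.columns) & set(ncovmem_demo.columns)
print(sorted(shared))  # 'created' only joins the intersection once 'updated' is renamed
```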
ncovmem = ncovmem.rename(columns={'updated': 'created'})
Next, let's add a column to each dataframe indicating its source, so that when we combine them we will know where each row came from.
reddit['source'] = 'reddit'
pinboard['source'] = 'pinboard'
ncovmem['source'] = 'ncovmem'
iipc['source'] = 'iipc'
def prune(df):
    # Drop every column except the four shared ones.
    for col in df.columns:
        if col not in ['url', 'title', 'created', 'source']:
            df = df.drop(columns=col)
    return df
reddit = prune(reddit)
ncovmem = prune(ncovmem)
pinboard = prune(pinboard)
iipc = prune(iipc)
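An equivalent, loop-free way to do the same pruning is plain column selection; a sketch with a hypothetical dataframe:

```python
import pandas

KEEP = ['url', 'title', 'created', 'source']

# Hypothetical dataframe standing in for the real data, with one extra column.
df = pandas.DataFrame({
    'url': ['http://example.com'],
    'title': ['Example'],
    'created': ['2020-02-21'],
    'source': ['demo'],
    'score': [42],  # extra column that prune() would drop
})

pruned = df[[col for col in KEEP if col in df.columns]]
print(list(pruned.columns))
```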
Now we are ready to combine them!
seeds = pandas.concat([iipc, ncovmem, pinboard, reddit], ignore_index=True)
seeds
| | url | created | title | source |
|---|---|---|---|---|
| 0 | http://coronavirus.fr/ | 2020-02-21T03:43:18.662353Z | Epicorem. Ecoépidémiologie | iipc |
| 1 | http://english.whiov.cas.cn/ | 2020-02-21T03:43:18.706571Z | Wuhan Institute of Virulogy, official page in ... | iipc |
| 2 | http://www.china-embassy.or.jp/chn/ | 2020-02-21T03:43:18.739126Z | 中华人民共和国驻日本大使馆 | iipc |
| 3 | http://www.china-embassy.or.jp/jpn/ | 2020-02-21T03:43:18.766308Z | 中華人民共和国駐日本国大使館 | iipc |
| 4 | https://cadenaser.com/tag/ncov/a/ | 2020-02-21T03:43:18.791716Z | Coronavirus de Wuhan | iipc |
| ... | ... | ... | ... | ... |
| 143330 | https://twitter.com/DarrenPlymouth/status/1220... | 2020-01-23 16:48:54 | Can anyone confirm if this is real? | reddit |
| 143331 | https://www.reddit.com/r/Coronavirus/comments/... | 2020-01-23 17:08:53 | Doctor at Wuhan hospital states “ the virus is... | reddit |
| 143332 | https://www.nature.com/news/inside-the-chinese... | 2020-01-23 17:18:46 | This raises a question to me as to the true or... | reddit |
| 143333 | https://www.reddit.com/r/Coronavirus/comments/... | 2020-01-23 17:26:39 | Would the flu shot provide any protection agai... | reddit |
| 143334 | https://www.reddit.com/r/Coronavirus/comments/... | 2020-01-23 16:18:13 | Package from South Korea. | reddit |
143335 rows × 4 columns
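For reference, ignore_index=True is what gives the combined frame its continuous 0–143334 index; a minimal sketch with toy frames:

```python
import pandas

# Two tiny stand-in frames (hypothetical URLs, for illustration only).
a = pandas.DataFrame({'url': ['http://a.example/'], 'source': ['iipc']})
b = pandas.DataFrame({'url': ['http://b.example/'], 'source': ['reddit']})

combined = pandas.concat([a, b], ignore_index=True)
# ignore_index=True renumbers the rows 0..n-1 instead of keeping
# each frame's original index.
print(list(combined.index))
```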
There are actually a large number of Reddit posts that don't link out to the web and are just questions and comments, so their url points back at reddit.com.
len(seeds[seeds.url.str.contains('reddit.com')])
18730
Since we are mostly interested in archiving the web, and not Reddit specifically, we can remove these.
seeds = seeds[~seeds.url.str.contains('reddit.com')]
len(seeds)
124605
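One subtlety worth noting: str.contains treats its pattern as a regular expression by default, so the '.' in 'reddit.com' matches any character. That's harmless for this dataset, but passing regex=False makes the intent explicit; a sketch with hypothetical URLs:

```python
import pandas

urls = pandas.Series([
    'https://www.reddit.com/r/Coronavirus/',  # a genuine reddit URL
    'https://redditXcom.example.org/',        # hypothetical near-miss
])

loose = urls.str.contains('reddit.com')                # '.' acts as a wildcard; both match
strict = urls.str.contains('reddit.com', regex=False)  # literal substring; only the first matches
print(loose.tolist(), strict.tolist())
```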