View this on GitHub | NBviewer | Binder
Author: Leon Yin
Updated on: 2018-10-01
This notebook will walk through using links as data with the urlexpander package, covering these functions:
urlexpander.tweet_utils.get_link
urlexpander.expand
urlexpander.tweet_utils.count_matrix
urlexpander.html_utils.get_webpage_meta
Software for this tutorial is found in this requirements.txt
file, and can be installed as follows:
pip install -r requirements.txt
Download data here:
python download_data.py
NOTE: at the time of this writing, download_data.py
does not work! Please go to OSF in the meantime.
If you have yet to look at the backend of a Tweet, here you go: https://bit.ly/tweet_anatomy_link.
In addition to hashtags and the like, Tweets contain metadata fields for urls. The code below will show you how to extract and work with links from Tweets.
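To make that concrete, here is a pared-down sketch of where link metadata lives inside a Tweet's JSON. The field names follow the Twitter API; the values are illustrative, not real data:

```python
# A pared-down sketch of a Tweet's JSON, showing only the fields
# relevant to URL extraction. Values are illustrative.
tweet = {
    "id": 976517212063322112,
    "created_at": "Wed Mar 21 17:53:40 +0000 2018",
    "user": {"id": 1089859058},
    "entities": {
        "urls": [
            {
                "url": "https://t.co/5P1JAaxwQV",        # the t.co wrapper
                "expanded_url": "http://bit.ly/2FC7bMz",  # can still be a shortener!
            }
        ]
    },
}

# each element of entities["urls"] is one link in the Tweet
for u in tweet["entities"]["urls"]:
    print(u["url"], "->", u["expanded_url"])
```

Note that `expanded_url` only unwraps the t.co layer; a bit.ly (or other shortened) link underneath stays shortened, which is the problem this notebook tackles.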
We're working with Tweets from members of congress collected by Greg Eady.
import os
import json
import glob
import itertools
from multiprocessing import Pool
from tqdm import tqdm
import pandas as pd
import urlexpander
from smappdragon import JsonCollection
from config import INTERMEDIATE_DIRECTORY, \
RAW_TWEETS_DIRECTORY, \
CONGRESS_METADATA_DIRECTORY
# config setting
pd.options.display.float_format = '{:.0f}'.format
# these are the files we'll be producing here
file_raw_links = os.path.join(INTERMEDIATE_DIRECTORY, 'links_raw.csv')
file_cache = os.path.join(INTERMEDIATE_DIRECTORY, 'cache.json')
file_expanded_links = os.path.join(INTERMEDIATE_DIRECTORY, 'links_expanded_all.csv')
# this is the raw data we're working with
files = glob.glob(os.path.join(RAW_TWEETS_DIRECTORY, '*.json.bz2'))
len(files)
1950
Let's preview one file. The file is saved as a newline-delimited json file like this
{"tweet_id" : "123", "more_data" : {"here" : "it is"}
{"tweet_id" : "124", "more_data" : {"here" : "it is again"}
and bzip2-compressed!
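For context, reading such a file requires nothing beyond the standard library — a minimal sketch (`iter_ndjson_bz2` is a hypothetical helper, not part of smappdragon):

```python
import bz2
import json

def iter_ndjson_bz2(path):
    """Yield one dict per line from a bzip2-compressed,
    newline-delimited JSON file."""
    with bz2.open(path, mode='rt') as fh:
        for line in fh:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

Libraries like smappdragon add conveniences (error handling, field filtering) on top of this same pattern.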
f = files[2]
f
'data/tweets-raw/1089859058__2018-03.json.bz2'
The file structure is newline-delimited JSON. We developed software (like smappdragon) to work with Tweets stored this way:
collect = JsonCollection(f, compression='bz2', throw_error=False, verbose=1)
smappdragon's JsonCollection
class reads through JSON files as a generator.
collect
<smappdragon.collection.json_collection.JsonCollection at 0x10d28f588>
The con is that generators are hard to interpret; the pro is that they don't store any data in memory. We access the data row by row by iterating through the collect
object. Here we only get the first row; we can see its contents by printing row:
for row in collect.get_iterator():
break
#print(json.dumps(row, indent=2))
Each Tweet can have more than one link, thus we need to unpack those values! urlexpander has a function to do just this:
?urlexpander.tweet_utils.get_link
Signature: urlexpander.tweet_utils.get_link(tweet) Docstring: Returns a generator containing tweet metadata about media. The metadata dict contains the following columns: columns = { 'link_domain' : 'the domain of the URL', 'link_url_long' : 'the URL (this can be short!)', 'link_url_short' : 'The t.co URL', 'tweet_created_at' : 'When the tweet was created', 'tweet_id' : 'The ID of the tweet', 'tweet_text' : 'The Full text of the tweet', 'user_id' : 'The Twitter ID of the tweeter' } :input tweet: a nested dictionary of a Tweet either from the streaming or search API. :returns: a generator of dictionaries File: ~/anaconda3/lib/python3.6/site-packages/urlexpander/core/tweet_utils.py Type: function
Once again we have another generator
# returns a generator, which is uninterpretable!
urlexpander.tweet_utils.get_link(row)
<generator object get_link at 0x11b8a3f10>
# we can access the data by iterating through it.
for link_meta in urlexpander.tweet_utils.get_link(row):
print(link_meta)
{'user_id': 1089859058, 'tweet_id': 976517212063322112, 'tweet_created_at': 'Wed Mar 21 17:53:40 +0000 2018', 'tweet_text': None, 'link_url_long': 'http://bit.ly/2FC7bMz', 'link_domain': 'bit.ly', 'link_url_short': 'https://t.co/5P1JAaxwQV'}
# to unwrap this we'll do this mess of code
list(itertools.chain.from_iterable([urlexpander.tweet_utils.get_link(row)]))
[{'user_id': 1089859058, 'tweet_id': 976517212063322112, 'tweet_created_at': 'Wed Mar 21 17:53:40 +0000 2018', 'tweet_text': None, 'link_url_long': 'http://bit.ly/2FC7bMz', 'link_domain': 'bit.ly', 'link_url_short': 'https://t.co/5P1JAaxwQV'}]
First, let's generalize the workflow into a function. Below is a boilerplate function you can use as a starting place for your own workflow.
def read_file_extract_links(f):
'''
This function takes in a Tweet file that is bzip2-compressed,
newline-delimited JSON, and returns a list of dictionaries
of link data.
'''
# read the json file into a generator
collection = JsonCollection(f, compression='bz2', throw_error=False)
# iterate through the json file, extract links, flatten the generator of links
# into a list, and store into a Pandas dataframe
df_ = pd.DataFrame(list(
itertools.chain.from_iterable(
[urlexpander.tweet_utils.get_link(t)
for t in collection.get_iterator()
if t]
)))
df_['file'] = f
return df_.to_dict(orient='records')
We can iterate through the files and run the function on each. From there, we can instantiate a Pandas DataFrame.
data = []
for f in tqdm(files[:2]):
# read the json file into a generator
data.extend(read_file_extract_links(f))
df_links = pd.DataFrame(data)
100%|██████████| 2/2 [00:01<00:00, 1.03s/it]
The for loop is slow! This task is not memory intensive (because we're using generators).
We can parallelize this task using the Pool
class from the multiprocessing package, having each core on our computer read a JSON file of Tweets and filter for links. The "if" statement prevents repeating work once we've already read the files. We cache this intermediate result in the file_raw_links
file path declared at the beginning of the notebook.
N_CPU = 4 # the number of cores we can use to parallelize the task
if not os.path.exists(file_raw_links):
data = []
with Pool(N_CPU) as pool:
iterable = pool.imap_unordered(read_file_extract_links, files)
for link_data in tqdm(iterable, total=len(files)):
data.extend(link_data)
df_links = pd.DataFrame(data)
df_links.to_csv(file_raw_links, index=False)
else:
df_links = pd.read_csv(file_raw_links)
df_links.head(2)
file | link_domain | link_url_long | link_url_short | tweet_created_at | tweet_id | tweet_text | user_id | |
---|---|---|---|---|---|---|---|---|
0 | /scratch/olympus/projects/mediascore/Data/json... | frc.org | https://www.frc.org/wwlivewithtonyperkins/rep-... | https://t.co/l9dXT0L7oT | Fri Mar 23 14:38:34 +0000 2018 | 977192888781168640 | nan | 2966758114 |
1 | /scratch/olympus/projects/mediascore/Data/json... | thehill.com | http://thehill.com/379188-watch-fund-governmen... | https://t.co/YbdvepWNQ3 | Thu Mar 22 15:21:32 +0000 2018 | 976841314024206339 | nan | 2966758114 |
The bulk of URLs we encounter in the wild are sent through a link shortener. Link shorteners record transactional information whenever a link is clicked. Unfortunately, this makes it hard for us to see what is being shared.
links = df_links['link_url_long'].tolist()
links[-5:]
['http://goo.gl/kDUwP', 'http://bit.ly/12clU3p', 'http://nyti.ms/Z4rdlU', 'http://goo.gl/LxkrY', 'http://www.huffingtonpost.com/rep-diana-degette/reducing-gun-violence-mea_b_3018506.html']
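As a rough sketch, a shortener check like urlexpander.is_short boils down to matching the URL's host against a list of known shortener domains. The list below is a tiny illustrative subset; urlexpander's internal list is far longer:

```python
from urllib.parse import urlparse

# a handful of well-known shortener domains -- illustrative only
SHORTENER_DOMAINS = {'bit.ly', 'goo.gl', 'ow.ly', 'tinyurl.com', 't.co', 'nyti.ms'}

def looks_short(url):
    """Return True if the URL's host is a known link-shortening service."""
    host = urlparse(url).netloc.lower()
    if host.startswith('www.'):
        host = host[4:]
    return host in SHORTENER_DOMAINS
```

Applied to the last five links above, this would flag the goo.gl, bit.ly, and nyti.ms entries but pass the huffingtonpost.com URL through untouched.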
This is why urlexpander
was made. We can run the expand
function on a single URL, as well as on a list of URLs.
urlexpander.expand(links[-5])
'https://degette.house.gov/index.php?option=com_content&view=article&id=1260:congressional-lgbt-equality-caucus-praises-re-introduction-of-employment-non-discrimination-act&catid=76:press-releases-&Itemid=227'
By default, urlexpander will expand every URL it is shown. However, you can pass a boolean function (one that returns True or False given an input string) to the filter_function
parameter.
urlexpander.expand(links[-5:], filter_function=urlexpander.is_short)
['https://degette.house.gov/index.php?option=com_content&view=article&id=1260:congressional-lgbt-equality-caucus-praises-re-introduction-of-employment-non-discrimination-act&catid=76:press-releases-&Itemid=227', 'http://www.rollcall.com/news/house_hydro_bill_tests_water_for_broad_energy_deals-224277-1.html', 'http://nyti.ms/Z4rdlU', 'http://www.civiccenterconservancy.org/history-2012-nhl-designation_25.html', 'http://www.huffingtonpost.com/rep-diana-degette/reducing-gun-violence-mea_b_3018506.html']
['abc.com/123', 'bbc.co.uk/123', 'abc.com/123', 'bit.ly/cbs23']
--> Remove duplicates
['abc.com/123', 'bbc.co.uk/123', 'bit.ly/cbs23']
--> Filter for shortened URLs
['bit.ly/cbs23']
--> Check the cache file: did we already expand this? Unshorten new URLs
[{'original_url': 'bit.ly/cbs23', 'resolved': 'cspan.com/123'}]
--> Join back in
['abc.com/123', 'bbc.co.uk/123', 'abc.com/123', 'cspan.com/123']
urlexpander parallelizes, filters and caches the input, which is essential for social media data.
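That bookkeeping can be sketched in plain Python. Here `is_short` and `resolve` are stand-in callables (not urlexpander internals), and parallelism is omitted:

```python
def expand_with_cache(urls, is_short, resolve, cache=None):
    """Sketch of the expand pipeline: dedupe, filter, check cache,
    resolve what's left, then join results back onto the input."""
    cache = {} if cache is None else cache
    # 1. deduplicate while preserving order
    unique = list(dict.fromkeys(urls))
    # 2. only shortened URLs we haven't seen before need network calls
    to_resolve = [u for u in unique if is_short(u) and u not in cache]
    # 3. resolve the remainder and record results in the cache
    for u in to_resolve:
        cache[u] = resolve(u)
    # 4. join back in: every input maps to its resolved form (or itself)
    return [cache.get(u, u) for u in urls]
```

On repeated runs, a persisted cache means step 3 does no work for URLs already seen — which is why caching matters so much at social-media scale.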
?urlexpander.expand
Signature: urlexpander.expand(links_to_unshorten, chunksize=1280, n_workers=1, cache_file=None, random_seed=303, verbose=0, filter_function=None, **kwargs) Docstring: Calls expand with multiple (``n_workers``) threads to unshorten a list of urls. Unshortens all urls by default, unless one sets a ``filter_function``. :param links_to_unshorten: (list, str) either an idividual or list (str) of urls to unshorten :param chunksize: (int) chunks links_to_unshorten, which makes computation quicker with larger inputs :param n_workers: (int) how many threads :param cache_file: (str) a path to a json file to read and write results in :param random_seed: (int) initializes the random state for shuffling the input :param verbose: (int) whether to print updates and errors. 0 is silent. 1 is progress bar. 2 is progress bar and errors. :param filter_function: (func) a boolean used to filter url shorteners out :returns: (list) a list of resolved urls File: ~/anaconda3/lib/python3.6/site-packages/urlexpander/core/api.py Type: function
The above is a toy example with 5 links; let's see how this works for 1.7 million links. For reference, this took me about an hour on an 8-core computer with a reliable internet connection.
resolved_urls = urlexpander.expand(links,
filter_function=urlexpander.is_short,
n_workers=64,
chunksize=1280,
cache_file=file_cache,
verbose=1)
len(resolved_urls)
1700150
df_links['link_resolved'] = resolved_urls
df_links['link_resolved_domain'] = df_links['link_resolved'].apply(urlexpander.get_domain)
df_links.to_csv(file_expanded_links, index=False)
With the links resolved, how can we use links as data?
df_links = pd.read_csv(file_expanded_links)
df_links.head(2)
file | link_domain | link_url_long | link_url_short | tweet_created_at | tweet_id | tweet_text | user_id | link_resolved | link_resolved_domain | |
---|---|---|---|---|---|---|---|---|---|---|
0 | /scratch/olympus/projects/mediascore/Data/json... | frc.org | https://www.frc.org/wwlivewithtonyperkins/rep-... | https://t.co/l9dXT0L7oT | Fri Mar 23 14:38:34 +0000 2018 | 977192888781168640 | nan | 2,966,758,114 | https://www.frc.org/wwlivewithtonyperkins/rep-... | frc.org |
1 | /scratch/olympus/projects/mediascore/Data/json... | thehill.com | http://thehill.com/379188-watch-fund-governmen... | https://t.co/YbdvepWNQ3 | Thu Mar 22 15:21:32 +0000 2018 | 976841314024206339 | nan | 2,966,758,114 | https://thehill.com/379188-watch-fund-governme... | thehill.com |
We can get an overview of the most frequently shared domains:
df_links['link_resolved_domain'].value_counts().head(15)
twitter.com 255532 house.gov 199218 youtube.com 93986 facebook.com 90061 senate.gov 78645 washingtonpost.com 29886 instagram.com 28460 nytimes.com 25014 thehill.com 22925 politico.com 13488 foxnews.com 12045 cnn.com 11611 wsj.com 11289 twimg.com 9633 ow.ly 9463 Name: link_resolved_domain, dtype: int64
We can also look at the contents of each URL. Twitter provides URL metadata (if you pay); we provide a workaround!
url = 'http://www.rollcall.com/news/house_hydro_bill_tests_water_for_broad_energy_deals-224277-1.html'
meta = urlexpander.html_utils.get_webpage_meta(url)
meta
OrderedDict([('url', 'http://www.rollcall.com/news/house_hydro_bill_tests_water_for_broad_energy_deals-224277-1.html'), ('title', 'House Hydro Bill Tests Water for Broad Energy Deals'), ('description', ' In February, the House did something rare: It passed an energy bill unanimously. Unlike the previous Congress’ standard fare of anti-EPA, pro-drilling measures, the first energy bill of the 113th Congress promoted small-scale hydropower projects and the electrification of existing dams.'), ('paragraphs', ['', '', 'In February, the House did something rare: It passed an energy bill unanimously. Unlike the previous Congress’ standard fare of anti-EPA, pro-drilling measures, the first energy bill of the 113th Congress promoted small-scale hydropower projects and the electrification of existing dams.', 'In other words, the Republican-controlled House passed a clean-energy bill.', 'Of course, few Republicans could object on ideological grounds to legislation that aimed to expedite or remove regulatory requirements for expanding hydropower facilities. And it certainly helped that the chamber passed similar legislation in 2012 by a large margin. ', 'Nevertheless, hydropower proponents say that developing the resource represents “low-hanging fruit” that members tired of the partisanship that has permeated energy policy can all agree is worth advancing to President Barack Obama’s desk. Hydropower represented nearly two-thirds — the largest share by far — of domestic renewable-energy production and 8 percent of total U.S. electricity generation in 2011, according to the Energy Information Administration. 
More than half of that total powered the Pacific Northwest region, and hydro is one of the few renewable resources that can provide baseload electricity — power that is available at all times — to the grid.', 'But just 3 percent of the 80,000 dams in the United States generate power, representing great potential for growing the resource, according to legislation championed by Reps. Cathy McMorris Rodgers, R-Wash., and Diana DeGette, D-Colo.', '“If you can work on regulatory reform for those projects, then you can have small hydro throughout this country,” DeGette said. ', 'The House lawmakers’ legislation would let small hydroelectric facilities generating up to 10 megawatts of power bypass Federal Energy Regulatory Commission licensing requirements that currently apply to projects producing more than 5 megawatts. The bill also would require FERC to study the feasibility of carrying out a two-year hydropower licensing pilot program at unpowered dams and would allow the commission to extend preliminary permits for two additional years. ', 'Jeff Leahey, government affairs director at the National Hydropower Association, said House leaders probably moved the bill so they could promote energy legislation that “checked the boxes” on encouraging the development of a resource that is renewable and reliable. It also helped that the bill — along with another measure passed April 10 that would designate an Interior Department agency as the lead regulator of small federal conduits — moved through the House last Congress and didn’t need much additional work, he said.', 'A significant factor in the refocused spotlight on the “original renewable” is the new leadership on the Senate Energy and Natural Resources Committee and its representation of key hydropower-producing states. 
Chairman Ron Wyden, D-Ore., promised industry representatives at the hydropower association’s annual conference this week that he plans to “quickly” mark up hydropower legislation after a panel hearing Tuesday. ', 'Wyden attributes the rise in hydro’s profile to better environmental stewardship on the part of facility operators and a more cooperative relationship between hydropower lobbyists and environmental groups focused on protecting river ecosystems. The effect of dams on fisheries and riverine habitat, as well as operational costs, has compelled organizations to promote the removal of dams in some instances.', '“Hydro’s environmental performance has improved dramatically,” Wyden said. ', 'Association President David Moller of Pacific Gas and Electric Co. acknowledged the role that historically low natural-gas prices have played in limiting hydropower expansion in recent years. But he said opportunities for hydropower still flourish because of state renewable portfolio mandates, coal-fired power plants being pushed into retirement and technological advances in powering existing dams and water channels.', '“The price of natural gas has dropped, but it will never match hydropower’s fuel price of zero, or its attributes of being renewable and non-carbon-based,” he said. ', 'Hydro proponents in the private sector and in Congress said this week that they will continue to promote hydropower development, possibly in future legislation. That could include examining additional regulatory issues that contribute to long lead times for completing electrified projects or adjusting current benefits that exist for clean power in the tax code. 
Prospects for he latter — which would involve extending the production tax credit for a multiyear period or making it permanent as Obama proposes — are dim outside a comprehensive tax code overhaul.', '“I think at the end of the day, it’s all about making sure that hydropower and the benefits that come from hydropower projects are competitive in the marketplace with other energy projects,” Leahey said. ', 'Whether the bipartisan camaraderie that has surrounded promoting hydropower can translate to moving broader energy legislation is anyone’s guess. But members close to the debate express optimism that the current spate of legislative action could beget compromise in the future.', '“I’m not sure we’re any closer to that comprehensive policy, but I’d think that common ground on these issues like hydro can only be helpful,” DeGette said.', '', '×', '$${CardTitle}', '$${CardTitle}'])])
If we want to know more about what kinds of information are being shared by members of congress, we can enrich the dataset by joining metadata about domains. In this example we will use the Local News Dataset to inspect the local media outlets that members of congress share:
local_news_url = 'https://raw.githubusercontent.com/yinleon/LocalNewsDataset/master/data/local_news_dataset_2018.csv'
df_local = pd.read_csv(local_news_url)
df_local = df_local[~(df_local.domain.isnull()) &
(df_local.domain != 'facebook.com')]
df_local.head()
name | state | website | domain | youtube | owner | medium | source | collection_date | |||
---|---|---|---|---|---|---|---|---|---|---|---|
0 | KWHE | HI | http://www.kwhe.com/ | kwhe.com | NaN | NaN | NaN | LeSea | TV station | stationindex | 2018-08-02 14:55:24.612585 |
1 | WGVK | MI | http://www.wgvu.org/ | wgvu.org | NaN | NaN | NaN | Grand Valley State University | TV station | stationindex | 2018-08-02 14:55:24.612585 |
3 | KTUU | AK | http://www.ktuu.com/ | ktuu.com | NaN | NaN | NaN | Schurz Communications | TV station | stationindex | 2018-08-02 14:55:24.612585 |
4 | KTBY | AK | http://www.ktbytv.com/ | ktbytv.com | NaN | NaN | NaN | Coastal Television Broadcasting | TV station | stationindex | 2018-08-02 14:55:24.612585 |
5 | KYES | AK | http://www.kyes.com/ | kyes.com | NaN | NaN | NaN | Fireweed Communications | TV station | stationindex | 2018-08-02 14:55:24.612585 |
# this is a SQL-like join in Pandas that merges the two datasets based on domain name!
df_links_state = df_links.merge(df_local,
left_on='link_resolved_domain',
right_on='domain')
This dataset unlocks insights regarding the locality of news articles shared, as well as media ownership.
df_links_state.state.value_counts().head(20)
TX 43962 CA 35029 NJ 33442 NY 33145 MI 28912 FL 21519 NC 19841 PA 19060 OH 18857 MD 16533 GA 13419 MO 12660 TN 12324 AZ 11052 MA 10767 WA 10325 NV 10261 LA 9519 MN 9510 OR 9404 Name: state, dtype: int64
df_links_state.owner.value_counts().head(25)
Nexstar 7543 Advance Local 4464 Tegna Media 4265 Sinclair 3457 Hearst 2571 Tribune 2086 Gray Television 1817 Fox Television Stations 1580 Hearst Television 1528 Acvance Local 1522 Raycom 1396 NBC Universal 1344 Georgia Public Broadcasting 1248 New Jersey Public Broadcasting Authority 1156 The Philadelphia Inquirer 1103 ABC 978 Evening Post Publishing 820 Meredith 750 Oregon Public Broadcasting 740 Georgia Public Telecommunications Commission 624 Graham Media Group 588 E. W. Scripps Company 540 Cox Enterprises 526 WGBH Educational Foundation 425 Piedmont Television 333 Name: owner, dtype: int64
df_links_state[df_links_state.owner == 'Sinclair']['user_id'].value_counts().head(10)
66891808 100 818948638890217472 68 1065995022 66 1058345042 64 90651198 63 368948092 56 27676828 48 2929491549 48 2987671552 44 58579942 44 Name: user_id, dtype: int64
This is one example of dataset enrichment; you can create your own categorizations and join them in. Alexa.com is a good starting place.
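As a minimal sketch, a homemade categorization can be a dict mapped over the resolved-domain column. The domains and category labels below are invented for illustration:

```python
import pandas as pd

# hypothetical domain -> category labels; build your own from
# Alexa, hand-coding, or an existing media-classification dataset
domain_category = {
    'nytimes.com': 'national news',
    'foxnews.com': 'national news',
    'youtube.com': 'video',
    'house.gov': 'government',
}

df = pd.DataFrame({'link_resolved_domain':
                   ['nytimes.com', 'youtube.com', 'example.com']})
# map known domains to a category; anything unrecognized becomes 'other'
df['category'] = df['link_resolved_domain'].map(domain_category).fillna('other')
```

The same `.map()` pattern scales to the full df_links table, and the resulting column can be grouped and counted just like the state and owner columns above.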
The frequency of domains shared per user makes for rich features. We can aggregate the data using the following utility function:
count_matrix = urlexpander.tweet_utils.count_matrix(
df_links,
user_col='user_id',
domain_col='link_resolved_domain',
min_freq=20,
)
count_matrix.head()
1011fmtheanswer.com | 10best.com | 10tv.com | 11alive.com | 123formbuilder.com | 12news.com | 12newsnow.com | 13abc.com | 13wham.com | 13wmaz.com | ... | yorkdispatch.com | youarecurrent.com | youcaring.com | youngcons.com | yourconroenews.com | yourdailyjournal.com | youtube.com | zeldinforcongress.com | zeldinforsenate.com | zpolitics.com | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
user_id | |||||||||||||||||||||
813286 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 198 | 0 | 0 | 0 |
939091 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 106 | 0 | 0 | 0 |
5496932 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 427 | 0 | 0 | 0 |
5511752 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 73 | 0 | 0 | 0 |
5558312 | 16 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 53 | 0 | 0 | 0 |
5 rows × 2376 columns
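Conceptually, count_matrix builds a user-by-domain contingency table; pandas.crosstab produces the same shape, with count_matrix's min_freq filtering and domain exclusion as extras on top. A toy example with invented data:

```python
import pandas as pd

# toy link records: one row per (user, shared domain)
df = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2],
    'link_resolved_domain': ['youtube.com', 'youtube.com',
                             'nytimes.com', 'nytimes.com', 'cnn.com'],
})

# rows are users, columns are domains, cells are share counts
matrix = pd.crosstab(df['user_id'], df['link_resolved_domain'])
```

Each row of the matrix is one user's domain-sharing profile — exactly the feature vectors fed to UMAP and the classifier below.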
Let's see how the count_matrix
features fare in machine learning. To do so, we need to enrich our data with an output variable. For our purposes, we will try to predict political affiliation.
%matplotlib inline
import umap
import matplotlib.pyplot as plt
/Users/leonyin/anaconda3/lib/python3.6/site-packages/numba/errors.py:102: UserWarning: Insufficiently recent colorama version found. Numba requires colorama >= 0.3.9 warnings.warn(msg)
These files have the affiliation of each account.
meta = []
for f in glob.glob(os.path.join(CONGRESS_METADATA_DIRECTORY, '*')):
_df = pd.read_csv(f)
_df = _df[~_df.twitter_id.isnull()]
meta.extend(_df.to_dict(orient='records'))
df_meta = pd.DataFrame(meta)
df_meta.twitter_id = df_meta.twitter_id.astype(float, errors='ignore').astype(str)
df_links.user_id = df_links.user_id.astype(float, errors='ignore').astype(str)
look_up = df_meta[df_meta['twitter_id'].isin(df_links.user_id)].drop_duplicates(subset=['twitter_id'])[['twitter_id', 'affiliation']]
df_links_ = df_links[df_links.user_id.isin(look_up['twitter_id'])]
len(df_links_.user_id.unique()), len(look_up)
(971, 971)
This will be an exploratory step, where we will try to visualize the dataset of counts of links shared by user. We will reduce the high-dimensional data (where each domain is one dimension) to two dimensions for visualization using the UMAP algorithm (much like the popular t-SNE algorithm).
def color(party):
if party == 'Democrat':
return 'blue'
elif party == 'Republican':
return 'red'
else:
return 'black'
def viz_umap_embed(count_matrix, title="Members of Congress Embedded by UMAP", threshold=5, **kwargs):
'''
Visualizes the count matrix in 2 dimensions using UMAP.
'''
if threshold:
count_matrix = count_matrix[count_matrix.sum(axis=1) >= threshold]
parties = look_up.set_index('twitter_id').loc[count_matrix.index].affiliation
embedding = umap.UMAP(n_components=2, **kwargs).fit_transform(count_matrix.values)
plt.figure(figsize=(14,8))
ax = plt.scatter(x = embedding[:,0],
y = embedding[:,1],
s = 100,
c = parties.apply(color),
alpha = .4)
plt.title(title)
plt.axis('off')
plt.show()
# domains to exclude in our count matrix
exclude = ['youtube.com', 'twitter.com', 'fb.com',
'facebook.com', 'instagram.com', 'ow.ly',
'house.gov', 'senate.gov', 'usa.gov']
count_matrix = urlexpander.tweet_utils.count_matrix(
df_links_,
user_col='user_id',
domain_col='link_resolved_domain',
min_freq=5,
exclude_domain_list=exclude,
)
count_matrix.head(2)
frc.org | thehill.com | iheart.com | c-span.org | news9.com | speaker.gov | foxnews.com | frcaction.org | koco.com | okcfox.com | ... | mullinforcongress.com | therepublicanstandard.com | barbaracomstockforcongress.com | thefriendshipchallenge.com | tomreedforcongress.com | sincomillas.com | detodopr.com | thedowneypatriot.com | garretgravesforcongress.com | about.com | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
user_id | |||||||||||||||||||||
1004855106.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1009269193.0 | 0 | 12 | 0 | 16 | 0 | 2 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 rows × 5301 columns
viz_umap_embed(count_matrix,
title="Link Sharing of Members of Congress Embedded by UMAP",
threshold=5,
# umap params
n_neighbors=50,
min_dist=0.1,
metric='dice',
random_state=303)
/Users/leonyin/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py:475: DataConversionWarning: Data with input dtype float32 was converted to bool by check_pairwise_arrays. warnings.warn(msg, DataConversionWarning)
In the unsupervised case, we already see Democrats and Republicans sectioned off. Here we will fit a logistic regression model to predict whether a congress member is a Democrat or a Republican based on the count_matrix
we just created.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
# filter out Independents, and accounts that didn't share any links!
dems_repubs = look_up[look_up.affiliation != 'Independent'].twitter_id
count_matrix_ = count_matrix[(count_matrix.index.isin(dems_repubs)) &
(count_matrix.sum(axis=1) >= 1)]
parties = look_up.set_index('twitter_id').loc[count_matrix_.index].affiliation
# create the training set
X, y = count_matrix_.values, parties
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=303, test_size=.15)
len(X_train), len(X_test)
(820, 145)
logreg = LogisticRegression(penalty='l2', C=.7,
solver='liblinear',
random_state=303)
logreg.fit(X_train, y_train)
logreg.score(X_test, y_test)
0.9517241379310345
import numpy as np
from sklearn.metrics import confusion_matrix
def plot_confusion_matrix(cm, classes,
normalize=False,
title='Confusion matrix',
cmap=plt.cm.Blues):
"""
This function prints and plots the confusion matrix.
Normalization can be applied by setting `normalize=True`.
"""
if normalize:
cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
print("Normalized confusion matrix")
else:
print('Confusion matrix, without normalization')
print(cm)
plt.imshow(cm, interpolation='nearest', cmap=cmap)
plt.title(title)
plt.colorbar()
tick_marks = np.arange(len(classes))
plt.xticks(tick_marks, classes, rotation=45)
plt.yticks(tick_marks, classes)
fmt = '.2f' if normalize else 'd'
thresh = cm.max() / 2.
for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
plt.text(j, i, format(cm[i, j], fmt),
horizontalalignment="center",
color="white" if cm[i, j] > thresh else "black")
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.tight_layout()
y_pred = logreg.predict(X_test)
class_names = logreg.classes_
# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)
np.set_printoptions(precision=2)
# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names,
title='Confusion matrix, without normalization')
# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, normalize=True,
title='Normalized confusion matrix')
plt.show()
Confusion matrix, without normalization [[49 2] [ 5 89]] Normalized confusion matrix [[0.96 0.04] [0.05 0.95]]
# these are what we got wrong!
df_meta.set_index('twitter_id').loc[
y_test[y_test != y_pred].index
][['twitter_name']]
twitter_name | |
---|---|
twitter_id | |
9.410800851211756e+17 | SenDougJones |
23820360.0 | billhuizenga |
136526394.0 | WebsterCongress |
4615689368.0 | GeneGreen29 |
19726613.0 | SenatorCollins |
242376736.0 | RepCharlieDent |
16056306.0 | JeffFlake |
indep = look_up[look_up.affiliation == 'Independent'].twitter_id
count_matrix_indep_ = count_matrix[(count_matrix.index.isin(indep)) &
(count_matrix.sum(axis=1) >= 2)]
df_ind = df_meta.set_index('twitter_id').loc[count_matrix_indep_.index][['twitter_name']]
df_ind['preds'] = logreg.predict(count_matrix_indep_)
df_ind
twitter_name | preds | |
---|---|---|
twitter_id | ||
1068481578.0 | SenAngusKing | Republican |
216776631.0 | BernieSanders | Democrat |
2915095729.0 | AkGovBillWalker | Republican |
29442313.0 | SenSanders | Democrat |
3196634042.0 | GovernorMapp | Republican |
from sklearn.model_selection import cross_val_score
logreg_cv = LogisticRegression(penalty='l2', C=.7,
solver='liblinear',
random_state=303)
scores = cross_val_score(logreg_cv, X, y, cv=5)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Accuracy: 0.96 (+/- 0.02)