View this on GitHub | NBviewer | Binder
Author: Leon Yin
Updated on: 2018-10-01
This notebook will walk through using links as data with the urlexpander package, covering these functions:
urlexpander.tweet_utils.get_link
urlexpander.expand
urlexpander.tweet_utils.count_matrix
urlexpander.html_utils.get_webpage_meta
Software for this tutorial is found in this requirements.txt
file, and can be installed as follows:
pip install -r requirements.txt
Download data here:
python download_data.py
NOTE: at the time of this writing, download_data.py
does not work! Please go to OSF in the meantime.
If you have yet to look at the backend of a Tweet, here you go: https://bit.ly/tweet_anatomy_link.
In addition to hashtags and the like, Tweets contain metadata fields for urls. The code below will show you how to extract and work with links from Tweets.
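To make that concrete, here is a pared-down sketch of where link metadata lives inside a Tweet's JSON. The field names follow the Twitter API; the values are illustrative, not real data:

```python
# A pared-down sketch of a Tweet's JSON, showing only the fields
# relevant to URL extraction. Values are illustrative.
tweet = {
    "id": 976517212063322112,
    "created_at": "Wed Mar 21 17:53:40 +0000 2018",
    "user": {"id": 1089859058},
    "entities": {
        "urls": [
            {
                "url": "https://t.co/5P1JAaxwQV",        # the t.co wrapper
                "expanded_url": "http://bit.ly/2FC7bMz",  # can still be a shortener!
            }
        ]
    },
}

# each element of entities["urls"] is one link in the Tweet
for u in tweet["entities"]["urls"]:
    print(u["url"], "->", u["expanded_url"])
```

Note that `expanded_url` only unwraps the t.co layer; a bit.ly (or other shortened) link underneath stays shortened, which is the problem this notebook tackles.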
We're working with Tweets from members of congress collected by Greg Eady.
import os
import json
import glob
import itertools
from multiprocessing import Pool
from tqdm import tqdm
import pandas as pd
import urlexpander
from smappdragon import JsonCollection
from config import INTERMEDIATE_DIRECTORY, \
RAW_TWEETS_DIRECTORY, \
CONGRESS_METADATA_DIRECTORY
# config setting
pd.options.display.float_format = '{:.0f}'.format
# these are the files we'll be producing here
file_raw_links = os.path.join(INTERMEDIATE_DIRECTORY, 'links_raw.csv')
file_cache = os.path.join(INTERMEDIATE_DIRECTORY, 'cache.json')
file_expanded_links = os.path.join(INTERMEDIATE_DIRECTORY, 'links_expanded_all.csv')
# this is the raw data we're working with
files = glob.glob(os.path.join(RAW_TWEETS_DIRECTORY, '*.json.bz2'))
len(files)
1950
Let's preview one file. The file is saved as a newline-delimited json file like this
{"tweet_id" : "123", "more_data" : {"here" : "it is"}
{"tweet_id" : "124", "more_data" : {"here" : "it is again"}
and bzip2-compressed!
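For context, reading such a file requires nothing beyond the standard library — a minimal sketch (`iter_ndjson_bz2` is a hypothetical helper, not part of smappdragon):

```python
import bz2
import json

def iter_ndjson_bz2(path):
    """Yield one dict per line from a bzip2-compressed,
    newline-delimited JSON file."""
    with bz2.open(path, mode='rt') as fh:
        for line in fh:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

Libraries like smappdragon add conveniences (error handling, field filtering) on top of this same pattern.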
f = files[2]
f
'data/tweets-raw/1089859058__2018-03.json.bz2'
The file structure is newline-delimited JSON. We developed software (like smappdragon) to work with Tweets stored this way:
collect = JsonCollection(f, compression='bz2', throw_error=False, verbose=1)
smappdragon's JsonCollection
class reads through JSON files as a generator.
collect
<smappdragon.collection.json_collection.JsonCollection at 0x10d28f588>
The con is that generators are hard to interpret; the pro is that they don't store any data in memory. We access the data row by row by iterating through the collect
object. Here we only get the first row; we can see its contents by printing row:
for row in collect.get_iterator():
break
#print(json.dumps(row, indent=2))
Each Tweet can have more than one link, thus we need to unpack those values! urlexpander has a function to do just this:
?urlexpander.tweet_utils.get_link
Signature: urlexpander.tweet_utils.get_link(tweet) Docstring: Returns a generator containing tweet metadata about media. The metadata dict contains the following columns: columns = { 'link_domain' : 'the domain of the URL', 'link_url_long' : 'the URL (this can be short!)', 'link_url_short' : 'The t.co URL', 'tweet_created_at' : 'When the tweet was created', 'tweet_id' : 'The ID of the tweet', 'tweet_text' : 'The Full text of the tweet', 'user_id' : 'The Twitter ID of the tweeter' } :input tweet: a nested dictionary of a Tweet either from the streaming or search API. :returns: a generator of dictionaries File: ~/anaconda3/lib/python3.6/site-packages/urlexpander/core/tweet_utils.py Type: function
Once again we have another generator
# returns a generator, which is uninterpretable!
urlexpander.tweet_utils.get_link(row)
<generator object get_link at 0x11b8a3f10>
# we can access the data by iterating through it.
for link_meta in urlexpander.tweet_utils.get_link(row):
print(link_meta)
{'user_id': 1089859058, 'tweet_id': 976517212063322112, 'tweet_created_at': 'Wed Mar 21 17:53:40 +0000 2018', 'tweet_text': None, 'link_url_long': 'http://bit.ly/2FC7bMz', 'link_domain': 'bit.ly', 'link_url_short': 'https://t.co/5P1JAaxwQV'}
# to unwrap this we'll do this mess of code
list(itertools.chain.from_iterable([urlexpander.tweet_utils.get_link(row)]))
[{'user_id': 1089859058, 'tweet_id': 976517212063322112, 'tweet_created_at': 'Wed Mar 21 17:53:40 +0000 2018', 'tweet_text': None, 'link_url_long': 'http://bit.ly/2FC7bMz', 'link_domain': 'bit.ly', 'link_url_short': 'https://t.co/5P1JAaxwQV'}]
First, let's generalize the workflow into a function. Below is a boilerplate function you can use as a starting place for your own workflow.
def read_file_extract_links(f):
'''
This function takes in a Tweet file that is bzip2-compressed,
newline-delimited JSON, and returns a list of dictionaries
of link data.
'''
# read the json file into a generator
collection = JsonCollection(f, compression='bz2', throw_error=False)
# iterate through the json file, extract links, flatten the generator of links
# into a list, and store into a Pandas dataframe
df_ = pd.DataFrame(list(
itertools.chain.from_iterable(
[urlexpander.tweet_utils.get_link(t)
for t in collection.get_iterator()
if t]
)))
df_['file'] = f
return df_.to_dict(orient='records')
We can iterate through the files and run the function on each. From there, we can instantiate a Pandas DataFrame.
data = []
for f in tqdm(files[:2]):
# read the json file into a generator
data.extend(read_file_extract_links(f))
df_links = pd.DataFrame(data)
100%|██████████| 2/2 [00:01<00:00, 1.03s/it]
The for loop is slow! This task is not memory intensive (because we're using generators).
We can parallelize this task using the Pool
class from the multiprocessing package, having each core on our computer read a JSON file of Tweets and filter for links. The "if" statement prevents repeating work once we've already read the files. We cache this intermediate result in the file_raw_links
file path declared at the beginning of the notebook.
N_CPU = 4 # the number of cores we can use to parallelize the task
if not os.path.exists(file_raw_links):
data = []
with Pool(N_CPU) as pool:
iterable = pool.imap_unordered(read_file_extract_links, files)
for link_data in tqdm(iterable, total=len(files)):
data.extend(link_data)
df_links = pd.DataFrame(data)
df_links.to_csv(file_raw_links, index=False)
else:
df_links = pd.read_csv(file_raw_links)
df_links.head(2)
file | link_domain | link_url_long | link_url_short | tweet_created_at | tweet_id | tweet_text | user_id | |
---|---|---|---|---|---|---|---|---|
0 | /scratch/olympus/projects/mediascore/Data/json... | frc.org | https://www.frc.org/wwlivewithtonyperkins/rep-... | https://t.co/l9dXT0L7oT | Fri Mar 23 14:38:34 +0000 2018 | 977192888781168640 | nan | 2966758114 |
1 | /scratch/olympus/projects/mediascore/Data/json... | thehill.com | http://thehill.com/379188-watch-fund-governmen... | https://t.co/YbdvepWNQ3 | Thu Mar 22 15:21:32 +0000 2018 | 976841314024206339 | nan | 2966758114 |
The bulk of URLs we encounter in the wild are sent through a link shortener. Link shorteners record transactional information whenever a link is clicked. Unfortunately, this makes it hard for us to see what is being shared.
links = df_links['link_url_long'].tolist()
links[-5:]
['http://goo.gl/kDUwP', 'http://bit.ly/12clU3p', 'http://nyti.ms/Z4rdlU', 'http://goo.gl/LxkrY', 'http://www.huffingtonpost.com/rep-diana-degette/reducing-gun-violence-mea_b_3018506.html']
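As a rough sketch, a shortener check like urlexpander.is_short boils down to matching the URL's host against a list of known shortener domains. The list below is a tiny illustrative subset; urlexpander's internal list is far longer:

```python
from urllib.parse import urlparse

# a handful of well-known shortener domains -- illustrative only
SHORTENER_DOMAINS = {'bit.ly', 'goo.gl', 'ow.ly', 'tinyurl.com', 't.co', 'nyti.ms'}

def looks_short(url):
    """Return True if the URL's host is a known link-shortening service."""
    host = urlparse(url).netloc.lower()
    if host.startswith('www.'):
        host = host[4:]
    return host in SHORTENER_DOMAINS
```

Applied to the last five links above, this would flag the goo.gl, bit.ly, and nyti.ms entries but pass the huffingtonpost.com URL through untouched.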
This is why urlexpander
was made. We can run the expand
function on a single URL, as well as on a list of URLs.
urlexpander.expand(links[-5])
'https://degette.house.gov/index.php?option=com_content&view=article&id=1260:congressional-lgbt-equality-caucus-praises-re-introduction-of-employment-non-discrimination-act&catid=76:press-releases-&Itemid=227'
By default, urlexpander will expand every URL it is shown. However, you can pass a boolean function (one that returns True or False given an input string) to the filter_function
parameter.
urlexpander.expand(links[-5:], filter_function=urlexpander.is_short)
['https://degette.house.gov/index.php?option=com_content&view=article&id=1260:congressional-lgbt-equality-caucus-praises-re-introduction-of-employment-non-discrimination-act&catid=76:press-releases-&Itemid=227', 'http://www.rollcall.com/news/house_hydro_bill_tests_water_for_broad_energy_deals-224277-1.html', 'http://nyti.ms/Z4rdlU', 'http://www.civiccenterconservancy.org/history-2012-nhl-designation_25.html', 'http://www.huffingtonpost.com/rep-diana-degette/reducing-gun-violence-mea_b_3018506.html']
['abc.com/123', 'bbc.co.uk/123', 'abc.com/123', 'bit.ly/cbs23']
--> Remove duplicates
['abc.com/123', 'bbc.co.uk/123', 'bit.ly/cbs23']
--> Filter for shortened URLs
['bit.ly/cbs23']
--> Check the cache file: did we already expand this? Unshorten new URLs
[{'original_url': 'bit.ly/cbs23', 'resolved': 'cspan.com/123'}]
--> Join back in
['abc.com/123', 'bbc.co.uk/123', 'abc.com/123', 'cspan.com/123']
urlexpander parallelizes, filters and caches the input, which is essential for social media data.
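That bookkeeping can be sketched in plain Python. Here `is_short` and `resolve` are stand-in callables (not urlexpander internals), and parallelism is omitted:

```python
def expand_with_cache(urls, is_short, resolve, cache=None):
    """Sketch of the expand pipeline: dedupe, filter, check cache,
    resolve what's left, then join results back onto the input."""
    cache = {} if cache is None else cache
    # 1. deduplicate while preserving order
    unique = list(dict.fromkeys(urls))
    # 2. only shortened URLs we haven't seen before need network calls
    to_resolve = [u for u in unique if is_short(u) and u not in cache]
    # 3. resolve the remainder and record results in the cache
    for u in to_resolve:
        cache[u] = resolve(u)
    # 4. join back in: every input maps to its resolved form (or itself)
    return [cache.get(u, u) for u in urls]
```

On repeated runs, a persisted cache means step 3 does no work for URLs already seen — which is why caching matters so much at social-media scale.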
?urlexpander.expand
Signature: urlexpander.expand(links_to_unshorten, chunksize=1280, n_workers=1, cache_file=None, random_seed=303, verbose=0, filter_function=None, **kwargs) Docstring: Calls expand with multiple (``n_workers``) threads to unshorten a list of urls. Unshortens all urls by default, unless one sets a ``filter_function``. :param links_to_unshorten: (list, str) either an idividual or list (str) of urls to unshorten :param chunksize: (int) chunks links_to_unshorten, which makes computation quicker with larger inputs :param n_workers: (int) how many threads :param cache_file: (str) a path to a json file to read and write results in :param random_seed: (int) initializes the random state for shuffling the input :param verbose: (int) whether to print updates and errors. 0 is silent. 1 is progress bar. 2 is progress bar and errors. :param filter_function: (func) a boolean used to filter url shorteners out :returns: (list) a list of resolved urls File: ~/anaconda3/lib/python3.6/site-packages/urlexpander/core/api.py Type: function
The above is a toy example with 5 links; let's see how this works for 1.7 million links. For reference, this took me about an hour on an 8-core computer with a reliable internet connection.
resolved_urls = urlexpander.expand(links,
filter_function=urlexpander.is_short,
n_workers=64,
chunksize=1280,
cache_file=file_cache,
verbose=1)
len(resolved_urls)
1700150
df_links['link_resolved'] = resolved_urls
df_links['link_resolved_domain'] = df_links['link_resolved'].apply(urlexpander.get_domain)
df_links.to_csv(file_expanded_links, index=False)
With the links resolved, how can we use links as data?
df_links = pd.read_csv(file_expanded_links)
df_links.head(2)
file | link_domain | link_url_long | link_url_short | tweet_created_at | tweet_id | tweet_text | user_id | link_resolved | link_resolved_domain | |
---|---|---|---|---|---|---|---|---|---|---|
0 | /scratch/olympus/projects/mediascore/Data/json... | frc.org | https://www.frc.org/wwlivewithtonyperkins/rep-... | https://t.co/l9dXT0L7oT | Fri Mar 23 14:38:34 +0000 2018 | 977192888781168640 | nan | 2,966,758,114 | https://www.frc.org/wwlivewithtonyperkins/rep-... | frc.org |
1 | /scratch/olympus/projects/mediascore/Data/json... | thehill.com | http://thehill.com/379188-watch-fund-governmen... | https://t.co/YbdvepWNQ3 | Thu Mar 22 15:21:32 +0000 2018 | 976841314024206339 | nan | 2,966,758,114 | https://thehill.com/379188-watch-fund-governme... | thehill.com |
We can get an overview of the most frequently shared domains:
df_links['link_resolved_domain'].value_counts().head(15)
twitter.com 255532 house.gov 199218 youtube.com 93986 facebook.com 90061 senate.gov 78645 washingtonpost.com 29886 instagram.com 28460 nytimes.com 25014 thehill.com 22925 politico.com 13488 foxnews.com 12045 cnn.com 11611 wsj.com 11289 twimg.com 9633 ow.ly 9463 Name: link_resolved_domain, dtype: int64
We can also look at the contents of each URL. Twitter provides URL metadata (if you pay); we provide a workaround!
url = 'http://www.rollcall.com/news/house_hydro_bill_tests_water_for_broad_energy_deals-224277-1.html'
meta = urlexpander.html_utils.get_webpage_meta(url)
meta
OrderedDict([('url', 'http://www.rollcall.com/news/house_hydro_bill_tests_water_for_broad_energy_deals-224277-1.html'), ('title', 'House Hydro Bill Tests Water for Broad Energy Deals'), ('description', ' In February, the House did something rare: It passed an energy bill unanimously. Unlike the previous Congress’ standard fare of anti-EPA, pro-drilling measures, the first energy bill of the 113th Congress promoted small-scale hydropower projects and the electrification of existing dams.'), ('paragraphs', ['', '', 'In February, the House did something rare: It passed an energy bill unanimously. Unlike the previous Congress’ standard fare of anti-EPA, pro-drilling measures, the first energy bill of the 113th Congress promoted small-scale hydropower projects and the electrification of existing dams.', 'In other words, the Republican-controlled House passed a clean-energy bill.', 'Of course, few Republicans could object on ideological grounds to legislation that aimed to expedite or remove regulatory requirements for expanding hydropower facilities. And it certainly helped that the chamber passed similar legislation in 2012 by a large margin. ', 'Nevertheless, hydropower proponents say that developing the resource represents “low-hanging fruit” that members tired of the partisanship that has permeated energy policy can all agree is worth advancing to President Barack Obama’s desk. Hydropower represented nearly two-thirds — the largest share by far — of domestic renewable-energy production and 8 percent of total U.S. electricity generation in 2011, according to the Energy Information Administration. 
More than half of that total powered the Pacific Northwest region, and hydro is one of the few renewable resources that can provide baseload electricity — power that is available at all times — to the grid.', 'But just 3 percent of the 80,000 dams in the United States generate power, representing great potential for growing the resource, according to legislation championed by Reps. Cathy McMorris Rodgers, R-Wash., and Diana DeGette, D-Colo.', '“If you can work on regulatory reform for those projects, then you can have small hydro throughout this country,” DeGette said. ', 'The House lawmakers’ legislation would let small hydroelectric facilities generating up to 10 megawatts of power bypass Federal Energy Regulatory Commission licensing requirements that currently apply to projects producing more than 5 megawatts. The bill also would require FERC to study the feasibility of carrying out a two-year hydropower licensing pilot program at unpowered dams and would allow the commission to extend preliminary permits for two additional years. ', 'Jeff Leahey, government affairs director at the National Hydropower Association, said House leaders probably moved the bill so they could promote energy legislation that “checked the boxes” on encouraging the development of a resource that is renewable and reliable. It also helped that the bill — along with another measure passed April 10 that would designate an Interior Department agency as the lead regulator of small federal conduits — moved through the House last Congress and didn’t need much additional work, he said.', 'A significant factor in the refocused spotlight on the “original renewable” is the new leadership on the Senate Energy and Natural Resources Committee and its representation of key hydropower-producing states. 
Chairman Ron Wyden, D-Ore., promised industry representatives at the hydropower association’s annual conference this week that he plans to “quickly” mark up hydropower legislation after a panel hearing Tuesday. ', 'Wyden attributes the rise in hydro’s profile to better environmental stewardship on the part of facility operators and a more cooperative relationship between hydropower lobbyists and environmental groups focused on protecting river ecosystems. The effect of dams on fisheries and riverine habitat, as well as operational costs, has compelled organizations to promote the removal of dams in some instances.', '“Hydro’s environmental performance has improved dramatically,” Wyden said. ', 'Association President David Moller of Pacific Gas and Electric Co. acknowledged the role that historically low natural-gas prices have played in limiting hydropower expansion in recent years. But he said opportunities for hydropower still flourish because of state renewable portfolio mandates, coal-fired power plants being pushed into retirement and technological advances in powering existing dams and water channels.', '“The price of natural gas has dropped, but it will never match hydropower’s fuel price of zero, or its attributes of being renewable and non-carbon-based,” he said. ', 'Hydro proponents in the private sector and in Congress said this week that they will continue to promote hydropower development, possibly in future legislation. That could include examining additional regulatory issues that contribute to long lead times for completing electrified projects or adjusting current benefits that exist for clean power in the tax code. 
Prospects for he latter — which would involve extending the production tax credit for a multiyear period or making it permanent as Obama proposes — are dim outside a comprehensive tax code overhaul.', '“I think at the end of the day, it’s all about making sure that hydropower and the benefits that come from hydropower projects are competitive in the marketplace with other energy projects,” Leahey said. ', 'Whether the bipartisan camaraderie that has surrounded promoting hydropower can translate to moving broader energy legislation is anyone’s guess. But members close to the debate express optimism that the current spate of legislative action could beget compromise in the future.', '“I’m not sure we’re any closer to that comprehensive policy, but I’d think that common ground on these issues like hydro can only be helpful,” DeGette said.', '', '×', '$${CardTitle}', '$${CardTitle}'])])
If we want to know more about what kinds of information are being shared by members of congress, we can enrich the dataset by joining metadata about domains. In this example we will use the Local News Dataset to inspect the local media outlets that members of congress share:
local_news_url = 'https://raw.githubusercontent.com/yinleon/LocalNewsDataset/master/data/local_news_dataset_2018.csv'
df_local = pd.read_csv(local_news_url)
df_local = df_local[~(df_local.domain.isnull()) &
(df_local.domain != 'facebook.com')]
df_local.head()
name | state | website | domain | youtube | owner | medium | source | collection_date | |||
---|---|---|---|---|---|---|---|---|---|---|---|
0 | KWHE | HI | http://www.kwhe.com/ | kwhe.com | NaN | NaN | NaN | LeSea | TV station | stationindex | 2018-08-02 14:55:24.612585 |
1 | WGVK | MI | http://www.wgvu.org/ | wgvu.org | NaN | NaN | NaN | Grand Valley State University | TV station | stationindex | 2018-08-02 14:55:24.612585 |
3 | KTUU | AK | http://www.ktuu.com/ | ktuu.com | NaN | NaN | NaN | Schurz Communications | TV station | stationindex | 2018-08-02 14:55:24.612585 |
4 | KTBY | AK | http://www.ktbytv.com/ | ktbytv.com | NaN | NaN | NaN | Coastal Television Broadcasting | TV station | stationindex | 2018-08-02 14:55:24.612585 |
5 | KYES | AK | http://www.kyes.com/ | kyes.com | NaN | NaN | NaN | Fireweed Communications | TV station | stationindex | 2018-08-02 14:55:24.612585 |
# this is a SQL-like join in Pandas that merges the two datasets based on domain name!
df_links_state = df_links.merge(df_local,
left_on='link_resolved_domain',
right_on='domain')
This dataset unlocks insights regarding the locality of news articles shared, as well as media ownership.
df_links_state.state.value_counts().head(20)
TX 43962 CA 35029 NJ 33442 NY 33145 MI 28912 FL 21519 NC 19841 PA 19060 OH 18857 MD 16533 GA 13419 MO 12660 TN 12324 AZ 11052 MA 10767 WA 10325 NV 10261 LA 9519 MN 9510 OR 9404 Name: state, dtype: int64
df_links_state.owner.value_counts().head(25)
Nexstar 7543 Advance Local 4464 Tegna Media 4265 Sinclair 3457 Hearst 2571 Tribune 2086 Gray Television 1817 Fox Television Stations 1580 Hearst Television 1528 Acvance Local 1522 Raycom 1396 NBC Universal 1344 Georgia Public Broadcasting 1248 New Jersey Public Broadcasting Authority 1156 The Philadelphia Inquirer 1103 ABC 978 Evening Post Publishing 820 Meredith 750 Oregon Public Broadcasting 740 Georgia Public Telecommunications Commission 624 Graham Media Group 588 E. W. Scripps Company 540 Cox Enterprises 526 WGBH Educational Foundation 425 Piedmont Television 333 Name: owner, dtype: int64
df_links_state[df_links_state.owner == 'Sinclair']['user_id'].value_counts().head(10)
66891808 100 818948638890217472 68 1065995022 66 1058345042 64 90651198 63 368948092 56 27676828 48 2929491549 48 2987671552 44 58579942 44 Name: user_id, dtype: int64
This is one example of dataset enrichment; you can create your own categorizations and join them in. Alexa.com is a good starting place.
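As a minimal sketch, a homemade categorization can be a dict mapped over the resolved-domain column. The domains and category labels below are invented for illustration:

```python
import pandas as pd

# hypothetical domain -> category labels; build your own from
# Alexa, hand-coding, or an existing media-classification dataset
domain_category = {
    'nytimes.com': 'national news',
    'foxnews.com': 'national news',
    'youtube.com': 'video',
    'house.gov': 'government',
}

df = pd.DataFrame({'link_resolved_domain':
                   ['nytimes.com', 'youtube.com', 'example.com']})
# map known domains to a category; anything unrecognized becomes 'other'
df['category'] = df['link_resolved_domain'].map(domain_category).fillna('other')
```

The same `.map()` pattern scales to the full df_links table, and the resulting column can be grouped and counted just like the state and owner columns above.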
The frequency of domains shared per user makes for rich features. We can aggregate the data using the following utility function:
count_matrix = urlexpander.tweet_utils.count_matrix(
df_links,
user_col='user_id',
domain_col='link_resolved_domain',
min_freq=20,
)
count_matrix.head()
1011fmtheanswer.com | 10best.com | 10tv.com | 11alive.com | 123formbuilder.com | 12news.com | 12newsnow.com | 13abc.com | 13wham.com | 13wmaz.com | ... | yorkdispatch.com | youarecurrent.com | youcaring.com | youngcons.com | yourconroenews.com | yourdailyjournal.com | youtube.com | zeldinforcongress.com | zeldinforsenate.com | zpolitics.com | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
user_id | |||||||||||||||||||||
813286 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 198 | 0 | 0 | 0 |
939091 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 106 | 0 | 0 | 0 |
5496932 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 427 | 0 | 0 | 0 |
5511752 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 73 | 0 | 0 | 0 |
5558312 | 16 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 53 | 0 | 0 | 0 |
5 rows × 2376 columns
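Conceptually, count_matrix builds a user-by-domain contingency table; pandas.crosstab produces the same shape, with count_matrix's min_freq filtering and domain exclusion as extras on top. A toy example with invented data:

```python
import pandas as pd

# toy link records: one row per (user, shared domain)
df = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2],
    'link_resolved_domain': ['youtube.com', 'youtube.com',
                             'nytimes.com', 'nytimes.com', 'cnn.com'],
})

# rows are users, columns are domains, cells are share counts
matrix = pd.crosstab(df['user_id'], df['link_resolved_domain'])
```

Each row of the matrix is one user's domain-sharing profile — exactly the feature vectors fed to UMAP and the classifier below.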
Let's see how the count_matrix
features fare in machine learning. To do so, we need to enrich our data with an output variable. For our purposes, we will try to predict political affiliation.
%matplotlib inline
import umap
import matplotlib.pyplot as plt
/Users/leonyin/anaconda3/lib/python3.6/site-packages/numba/errors.py:102: UserWarning: Insufficiently recent colorama version found. Numba requires colorama >= 0.3.9 warnings.warn(msg)
These files have the affiliation of each account.
meta = []
for f in glob.glob(os.path.join(CONGRESS_METADATA_DIRECTORY, '*')):
_df = pd.read_csv(f)
_df = _df[~_df.twitter_id.isnull()]
meta.extend(_df.to_dict(orient='records'))
df_meta = pd.DataFrame(meta)
df_meta.twitter_id = df_meta.twitter_id.astype(float, errors='ignore').astype(str)
df_links.user_id = df_links.user_id.astype(float, errors='ignore').astype(str)
look_up = df_meta[df_meta['twitter_id'].isin(df_links.user_id)].drop_duplicates(subset=['twitter_id'])[['twitter_id', 'affiliation']]
df_links_ = df_links[df_links.user_id.isin(look_up['twitter_id'])]
len(df_links_.user_id.unique()), len(look_up)
(971, 971)
This will be an exploratory step, where we will try to visualize the dataset of counts of links shared by user. We will reduce the high-dimensional data (where each domain is one dimension) to two dimensions for visualization using the UMAP algorithm (much like the popular t-SNE algorithm).
def color(party):
if party == 'Democrat':
return 'blue'
elif party == 'Republican':
return 'red'
else:
return 'black'
def viz_umap_embed(count_matrix, title="Members of Congress Embedded by UMAP", threshold=5, **kwargs):
'''
Visualizes the count matrix in 2 dimensions using UMAP.
'''
if threshold:
count_matrix = count_matrix[count_matrix.sum(axis=1) >= threshold]
parties = look_up.set_index('twitter_id').loc[count_matrix.index].affiliation
embedding = umap.UMAP(n_components=2, **kwargs).fit_transform(count_matrix.values)
plt.figure(figsize=(14,8))
ax = plt.scatter(x = embedding[:,0],
y = embedding[:,1],
s = 100,
c = parties.apply(color),
alpha = .4)
plt.title(title)
plt.axis('off')
plt.show()
# domains to exclude in our count matrix
exclude = ['youtube.com', 'twitter.com', 'fb.com',
'facebook.com', 'instagram.com', 'ow.ly',
'house.gov', 'senate.gov', 'usa.gov']
count_matrix = urlexpander.tweet_utils.count_matrix(
df_links_,
user_col='user_id',
domain_col='link_resolved_domain',
min_freq=5,
exclude_domain_list=exclude,
)
count_matrix.head(2)
frc.org | thehill.com | iheart.com | c-span.org | news9.com | speaker.gov | foxnews.com | frcaction.org | koco.com | okcfox.com | ... | mullinforcongress.com | therepublicanstandard.com | barbaracomstockforcongress.com | thefriendshipchallenge.com | tomreedforcongress.com | sincomillas.com | detodopr.com | thedowneypatriot.com | garretgravesforcongress.com | about.com | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
user_id | |||||||||||||||||||||
1004855106.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1009269193.0 | 0 | 12 | 0 | 16 | 0 | 2 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 rows × 5301 columns
viz_umap_embed(count_matrix,
title="Link Sharing of Members of Congress Embedded by UMAP",
threshold=5,
# umap params
n_neighbors=50,
min_dist=0.1,
metric='dice',
random_state=303)
/Users/leonyin/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py:475: DataConversionWarning: Data with input dtype float32 was converted to bool by check_pairwise_arrays. warnings.warn(msg, DataConversionWarning)
In the unsupervised case, we already see Democrats and Republicans sectioned off. Here we will fit a logistic regression model to predict whether a congress member is a Democrat or a Republican based on the count_matrix
we just created.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
# filter out Independents, and accounts that didn't share any links!
dems_repubs = look_up[look_up.affiliation != 'Independent'].twitter_id
count_matrix_ = count_matrix[(count_matrix.index.isin(dems_repubs)) &
(count_matrix.sum(axis=1) >= 1)]
parties = look_up.set_index('twitter_id').loc[count_matrix_.index].affiliation
# create the training set
X, y = count_matrix_.values, parties
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=303, test_size=.15)
len(X_train), len(X_test)
(820, 145)
logreg = LogisticRegression(penalty='l2', C=.7,
solver='liblinear',
random_state=303)
logreg.fit(X_train, y_train)
logreg.score(X_test, y_test)
0.9517241379310345
import numpy as np
from sklearn.metrics import confusion_matrix
def plot_confusion_matrix(cm, classes,
normalize=False,
title='Confusion matrix',
cmap=plt.cm.Blues):
"""
This function prints and plots the confusion matrix.
Normalization can be applied by setting `normalize=True`.
"""
if normalize:
cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
print("Normalized confusion matrix")
else:
print('Confusion matrix, without normalization')
print(cm)
plt.imshow(cm, interpolation='nearest', cmap=cmap)
plt.title(title)
plt.colorbar()
tick_marks = np.arange(len(classes))
plt.xticks(tick_marks, classes, rotation=45)
plt.yticks(tick_marks, classes)
fmt = '.2f' if normalize else 'd'
thresh = cm.max() / 2.
for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
plt.text(j, i, format(cm[i, j], fmt),
horizontalalignment="center",
color="white" if cm[i, j] > thresh else "black")
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.tight_layout()
y_pred = logreg.predict(X_test)
class_names = logreg.classes_
# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)
np.set_printoptions(precision=2)
# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names,
title='Confusion matrix, without normalization')
# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, normalize=True,
title='Normalized confusion matrix')
plt.show()
Confusion matrix, without normalization [[49 2] [ 5 89]] Normalized confusion matrix [[0.96 0.04] [0.05 0.95]]
# these are what we got wrong!
df_meta.set_index('twitter_id').loc[
y_test[y_test != y_pred].index
][['twitter_name']]
twitter_name | |
---|---|
twitter_id | |
9.410800851211756e+17 | SenDougJones |
23820360.0 | billhuizenga |
136526394.0 | WebsterCongress |
4615689368.0 | GeneGreen29 |
19726613.0 | SenatorCollins |
242376736.0 | RepCharlieDent |
16056306.0 | JeffFlake |
indep = look_up[look_up.affiliation == 'Independent'].twitter_id
count_matrix_indep_ = count_matrix[(count_matrix.index.isin(indep)) &
(count_matrix.sum(axis=1) >= 2)]
df_ind = df_meta.set_index('twitter_id').loc[count_matrix_indep_.index][['twitter_name']]
df_ind['preds'] = logreg.predict(count_matrix_indep_)
df_ind
twitter_name | preds | |
---|---|---|
twitter_id | ||
1068481578.0 | SenAngusKing | Republican |
216776631.0 | BernieSanders | Democrat |
2915095729.0 | AkGovBillWalker | Republican |
29442313.0 | SenSanders | Democrat |
3196634042.0 | GovernorMapp | Republican |
from sklearn.model_selection import cross_val_score
logreg_cv = LogisticRegression(penalty='l2', C=.7,
solver='liblinear',
random_state=303)
scores = cross_val_score(logreg_cv, X, y, cv=5)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Accuracy: 0.96 (+/- 0.02)