Networks and Congress


In [132]:
%matplotlib inline

import json
import pandas as pd
import numpy as np
import networkx as nx
import requests
from pattern import web
import matplotlib.pyplot as plt
from operator import itemgetter


# set some nicer defaults for matplotlib
from matplotlib import rcParams

#these colors come from colorbrewer2.org. Each is an RGB triplet
dark2_colors = [(0.10588235294117647, 0.6196078431372549, 0.4666666666666667),
                (0.8509803921568627, 0.37254901960784315, 0.00784313725490196),
                (0.4588235294117647, 0.4392156862745098, 0.7019607843137254),
                (0.9058823529411765, 0.1607843137254902, 0.5411764705882353),
                (0.4, 0.6509803921568628, 0.11764705882352941),
                (0.9019607843137255, 0.6705882352941176, 0.00784313725490196),
                (0.6509803921568628, 0.4627450980392157, 0.11372549019607843),
                (0.4, 0.4, 0.4)]

rcParams['figure.figsize'] = (10, 6)
rcParams['figure.dpi'] = 150
rcParams['axes.color_cycle'] = dark2_colors
rcParams['lines.linewidth'] = 2
rcParams['axes.grid'] = False
rcParams['axes.facecolor'] = 'lightgray'
rcParams['font.size'] = 14
rcParams['patch.edgecolor'] = 'none'

def remove_border(axes=None, top=False, right=False, left=True, bottom=True):
    """
    Minimize chartjunk by stripping out unnecessary plot borders and axis ticks
    
    The top/right/left/bottom keywords toggle whether the corresponding plot border is drawn
    """
    ax = axes or plt.gca()
    ax.spines['top'].set_visible(top)
    ax.spines['right'].set_visible(right)
    ax.spines['left'].set_visible(left)
    ax.spines['bottom'].set_visible(bottom)
    
    #turn off all ticks
    ax.yaxis.set_ticks_position('none')
    ax.xaxis.set_ticks_position('none')
    
    #now re-enable visibles
    if top:
        ax.xaxis.tick_top()
    if bottom:
        ax.xaxis.tick_bottom()
    if left:
        ax.yaxis.tick_left()
    if right:
        ax.yaxis.tick_right()

The website govtrack.us collects data on activities in the Senate and House of Representatives. It's a great source of information for making data-driven assessments about Congress.

Problem 1.

The directories at http://www.govtrack.us/data/congress/113/votes/2013 contain JSON information about every vote cast for the current (113th) Congress. Subdirectories beginning with "S" correspond to Senate votes, while subdirectories beginning with "H" correspond to House votes.

Write two functions: one that downloads and parses a single Senate vote page given the vote number, and another that repeatedly calls this function to build a full collection of Senate votes from the 113th Congress.

In [133]:
"""
Function
--------
get_senate_vote

Scrapes a single JSON page for a particular Senate vote, given by the vote number

Parameters
----------
vote : int
   The vote number to fetch
   
Returns
-------
vote : dict
   The JSON-decoded dictionary for that vote
   
Examples
--------
>>> get_senate_vote(11)['bill']
{u'congress': 113,
 u'number': 325,
 u'title': u'A bill to ensure the complete and timely payment of the obligations of the United States Government until May 19, 2013, and for other purposes.',
 u'type': u'hr'}
"""
#your code here

def get_senate_vote(vote):
    
    ## str() lets callers pass the vote number as an int (as documented above) or as a string
    url = ''.join(["https://www.govtrack.us/data/congress/113/votes/2013/s",str(vote),"/data.json"])
    data = requests.get(url).json()
    
    return data
    
    
In [134]:
"""
Function
--------
get_all_votes

Scrapes all the Senate votes from http://www.govtrack.us/data/congress/113/votes/2013,
and returns a list of dicts

Parameters
-----------
None

Returns
--------
votes : list of dicts
    List of JSON-parsed dicts for each senate vote
"""
#Your code here

## Q: Why use BeautifulSoup?
## A: pattern.web's documentation is awful. 
##    Even though it says its implementation is on top of BS, it doesn't behave perfectly.  
##    BS's 'findAll' method is great here, so i switched.

from BeautifulSoup import BeautifulSoup
import re

def get_all_votes():
    
    link = "https://www.govtrack.us/data/congress/113/votes/2013/"

    data = requests.get(link).text   
            
    soup = BeautifulSoup(data)
    
    ## Q: what does the regex mean?
    ## A: 's' and '/' are string literals
    ##    \d is shorthand for [0-9]
    ##    {1,4} means we need at least one digit, and up to 4 
    ##          (i'm pretty sure there aren't more than 999 votes, but definitely not more than 9999 votes!)
    
    pattern = re.compile(r"s\d{1,4}/")
    
    ## search for href tags with the above pattern
    
    senate_votes = soup.findAll(href=pattern)
    
    ## Q: why [1:-1] in vote['href']?
    ## A: first character (index 0) is 's', last character (index -1) is '/'
    ##    so [1:-1] to extract just the number, which is what we need to pass to get_senate_vote()
    all_senate_votes = []
    
    votes = [ int(vote['href'][1:-1]) for vote in senate_votes ]
        
    votes = np.sort(votes)
    
    ## for the first day i worked on this function, i had a hell of a time getting all the way through 
    ## without the site crashing on me.  since then, it hasn't been a problem, although my code didn't change at all.
    ## anyway, the missed[] array here is for catching errors, just in case the problem happens again.
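    missed = []
    
    for vote in votes:
        
        try:
            all_senate_votes.append(get_senate_vote(str(vote)))
        except Exception:
            ## network or JSON-decoding failure: remember the vote number and keep going
            missed.append(vote)
    
    ## report anything we failed to fetch, instead of dropping it silently
    if missed:
        print 'could not fetch votes:', missed
        
    return all_senate_votes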
In [92]:
vote_data = get_all_votes()
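
The regex step in get_all_votes() can be tried on its own, without touching the network. Below is a minimal sketch that uses only the standard library; `sample_listing` is a made-up directory listing (the real page's markup may differ) and `senate_vote_numbers` is a hypothetical helper name, not part of the assignment.

```python
import re

# Hypothetical directory-listing snippet; the real page's markup may differ.
sample_listing = '''
<a href="h10/">h10/</a>
<a href="s1/">s1/</a>
<a href="s12/">s12/</a>
<a href="s103/">s103/</a>
'''

# Matching the whole href value keeps House folders such as "h10/" out,
# and the capture group returns just the digits.
vote_folder = re.compile(r'href="s(\d{1,4})/"')

def senate_vote_numbers(html):
    """Return the sorted Senate vote numbers found in a listing page."""
    return sorted(int(number) for number in vote_folder.findall(html))

print(senate_vote_numbers(sample_listing))
```

Capturing the digits directly avoids the [1:-1] slice, at the cost of assuming the href attribute is written with double quotes.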

Problem 2

Now, turn these data into a NetworkX graph, according to the spec below. For details on using NetworkX, consult the lab materials for November 1, as well as the NetworkX documentation.

For this section I made three helper functions.
  • get_senator_display_names() - gets a set of all display names
  • fill_sen_df() - constructs a pandas dataframe with senator data (party, state, etc)
  • fill_vote_agreement() - fills the rest of the sen_df dataframe with edge weights based on voting records
I recognize that this isn't the most efficient way to produce this data. But it best matched the way I was thinking about organizing it and, since it didn't take too long to run, I went with this implementation.
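
For comparison, the pairwise counting that fill_vote_agreement() performs can be sketched with `itertools.combinations` and a `Counter`. This is only an illustration on toy data: `toy_votes` mimics the parts of the govtrack vote dicts used in this notebook (`votes`, `Yea`, `Nay`, `display_name`), the senator names are placeholders, and `count_agreements` is a hypothetical helper.

```python
from collections import Counter
from itertools import combinations

# Toy stand-in for the govtrack vote dicts: only the keys used here are included.
toy_votes = [
    {'votes': {'Yea': [{'display_name': 'A (D-XX)'}, {'display_name': 'B (D-XX)'}],
               'Nay': [{'display_name': 'C (R-XX)'}],
               'Not Voting': [{'display_name': 'D (R-XX)'}]}},
    {'votes': {'Yea': [{'display_name': 'A (D-XX)'}, {'display_name': 'C (R-XX)'}],
               'Nay': [{'display_name': 'B (D-XX)'}, {'display_name': 'D (R-XX)'}]}},
    {'votes': {'Yea': [{'display_name': 'A (D-XX)'}, {'display_name': 'B (D-XX)'},
                       {'display_name': 'C (R-XX)'}],
               'Nay': []}},
]

def count_agreements(data):
    """Count, for every pair of senators, how often both voted Yea or both voted Nay."""
    agreements = Counter()
    for vote in data:
        for vote_type in ('Yea', 'Nay'):
            voters = vote['votes'].get(vote_type, [])
            # Sorting gives each pair one canonical (first, second) key.
            names = sorted(set(senator['display_name'] for senator in voters))
            for pair in combinations(names, 2):
                agreements[pair] += 1
    return agreements

agreements = count_agreements(toy_votes)
for pair in sorted(agreements):
    print(pair, agreements[pair])
```

Because `combinations` yields each unordered pair exactly once, no "already counted" bookkeeping is needed.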
In [137]:
def get_senator_display_names(data):
    
    info = set()
    for vote in data:
        
        for vote_type in vote['votes']:
            
            ## only Yea and Nay votes count toward agreement
            if (vote_type == 'Yea') or (vote_type == 'Nay'):
                
                for datum in vote['votes'][vote_type]:
                    
                    senator_info = datum['display_name']
                    info.add(senator_info)  ## get unique list of senator names
                    
    return info

## test
## info = get_senator_display_names(vote_data)
## print list(info)[:2]
In [231]:
def fill_sen_df(info):
    
    ## set up df for all senator info (including vote agreement counts)
    ## it's a 104 row X 109 column matrix 
    ##        (the first 5 columns are string data about each senator)
    
    
    names = sorted(info)   ## one fixed ordering, shared by the rows and the agreement columns
    cols = ['display','surname','party','color','state']
    cols.extend(names)
    
    sen_df = pd.DataFrame(index=names, 
                          columns = cols, 
                          data = np.zeros((len(names),len(names)+5)))       
    
    ## change the first few columns to string dtype
    sen_df[['display','surname', 'party', 'color', 'state']] = sen_df[['display','surname', 'party', 'color', 'state']].astype(str)
    
    for senator in info:
        
        sen_df.loc[senator, 'display'] = senator
        
        surname = senator[:-7]
        sen_df.loc[senator,'surname'] = surname
        
        affiliation = re.search("\(([RDI]{1})", senator).groups()[0]
        sen_df.loc[senator,'party'] = affiliation
        
        home_state = senator[-3:-1]
        sen_df.loc[senator,'state'] = home_state
        
        col = 'r' if affiliation == 'R' else ('b' if affiliation == 'D' else 'k')
        sen_df.loc[senator,'color'] = col

    return sen_df
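
The slicing in fill_sen_df() relies on every display name ending in a 7-character " (P-ST)" suffix. A slightly more defensive sketch parses the whole name with one regex; `parse_display_name` and the name 'Doe (I-ZZ)' are hypothetical, and the "Surname (Party-State)" format is assumed from the examples in this notebook.

```python
import re

# "Surname (P-ST)": party is one or more capital letters, state is two.
DISPLAY_NAME = re.compile(r'^(?P<surname>.+) \((?P<party>[A-Z]+)-(?P<state>[A-Z]{2})\)$')

def parse_display_name(display_name):
    """Split a display name into surname, party, state and plot color."""
    match = DISPLAY_NAME.match(display_name)
    if match is None:
        raise ValueError('unexpected display name: %r' % (display_name,))
    party = match.group('party')
    # Same color rule as the notebook: red, blue, black for everyone else.
    color = {'R': 'r', 'D': 'b'}.get(party, 'k')
    return {'surname': match.group('surname'),
            'party': party,
            'state': match.group('state'),
            'color': color}

print(parse_display_name('Lee (R-UT)'))
print(parse_display_name('Doe (I-ZZ)'))   # made-up name, to show the 'k' fallback
```

A name that does not fit the pattern raises a ValueError instead of silently producing a wrong surname or state.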
In [1]:
def fill_vote_agreement(sen_df, data):
    
    ct = 0
    
    sen_names = []
    
    for vote in data:
        
        ct += 1
        #print 'this is vote #',ct,'out of',len(data),'total votes.'
        
        already_counted = []
        
        for vote_type in vote['votes']:
            
            if (vote_type == 'Yea') or (vote_type == 'Nay'):
                
                for nodeA in vote['votes'][vote_type]:
                    
                    nameA = nodeA['display_name']
                    
                    if (not nameA in sen_names):
                    
                        sen_names.append(nameA)


                    for nodeB in vote['votes'][vote_type]:
                        
                        nameB = nodeB['display_name']
                        
                        this_pair = np.sort([nameA, nameB])
                        
                        if (not (this_pair[0], this_pair[1]) in already_counted) and (nameA != nameB):
                            
                            already_counted.append((this_pair[0], this_pair[1]))
                            #if (sen_df.loc[ this_pair[0], this_pair[1] ] > 1): 
                                #print 'value:', sen_df.loc[ this_pair[0], this_pair[1] ]
                            
                            sen_df.loc[this_pair[0],this_pair[1]] = sen_df.loc[this_pair[0],this_pair[1]] + 1
                            
                            #print this_pair[0], 'and', this_pair[1], 
                            #' count: ', sen_df.loc[ this_pair[0], this_pair[1] ]
                            
    return sen_df
In [2]:
"""
Function
--------
vote_graph

Parameters
----------
data : list of dicts
    The vote database returned from get_all_votes

Returns
-------
graph : NetworkX Graph object, with the following properties
    1. Each node in the graph is labeled using the `display_name` of a Senator (e.g., 'Lee (R-UT)')
    2. Each node has a `color` attribute set to 'r' for Republicans, 
       'b' for Democrats, and 'k' for Independent/other parties.
    3. The edges between two nodes are weighted by the number of 
       times two senators have cast the same Yea or Nay vote
    4. Each edge also has a `difference` attribute, which is set to `1 / weight`.

Examples
--------
>>> graph = vote_graph(vote_data)
>>> graph.node['Lee (R-UT)']
{'color': 'r'}  # attributes for this senator
>>> len(graph['Lee (R-UT)']) # connections to other senators
101
>>> graph['Lee (R-UT)']['Baldwin (D-WI)']  # edge relationship between Lee and Baldwin
{'difference': 0.02, 'weight': 50}
"""
#Your code here

def vote_graph(data):
    
    ## in general i know it's not good to create globals like this
    ## but seeing as we have required inputs and outputs for this function, this is my workaround to having access
    ## to this df later on, when i create my graphs.
    
    global sen_df
    
    ##
    ## three helper functions
    ##
    ## get_senator_display_names retrieves a list of all the senators in "[surname] ([party]-[state])" format
    info = get_senator_display_names(data)
    
    ## i want one big pandas df to hold all my data
    
    ## fill_sen_df() just gets all the basic information for a given senator
    sen_df = fill_sen_df(info)
    
    ## fill_vote_agreement goes in and loops through each vote to collect edge weights
    ## then it dumps it in the 104x104 matrix which constitutes the tail end of my df
    df = fill_vote_agreement(sen_df, data)
    
    
    ## for use later on in graph creation
    sen_df = df
    
    ## initialize Graph object
    g = nx.Graph()

    
    ## add graph nodes (labeled by senator display names)
    for name in df['display']:
        
        g.add_node(name, surname=df.loc[name,'surname'], color=df.loc[name, 'color'])
    
    ## add edges (along with weight and difference parameters)
    
    ## we do this with a nested for loop over all senator names
    ## for each unique pair with some voting relationship, add an edge to the network
    
    for nameA in df['display']:
        
        for nameB in df['display']:
            
            this_pair = np.sort([nameA,nameB])
            
            if (nameA != nameB) and (df.loc[this_pair[0]][this_pair[1]] > 0):
            
                ## get edge weight from the agreement matrix
                weight = df.loc[this_pair[0]][this_pair[1]]
                               
                ## add edge between co-voting senators
                ## include 'weight' and 'difference' attributes
                g.add_edge(this_pair[0], this_pair[1], weight=weight, difference=(1./weight))                    
    
    return g

votes = vote_graph(vote_data)

## test
print votes.node['Lee (R-UT)']
print len(votes['Lee (R-UT)'])                    
print votes['Lee (R-UT)']['Baldwin (D-WI)']                    

How (and how not) to visualize networks

Network plots often look impressive, but creating sensible network plots is tricky. From Ben Fry, the author of the Processing program:

Usually a graph layout isn’t the best option for data sets larger than a few dozen nodes. You’re most likely to wind up with enormous spider webs or balls of string, and the mess seen so far is more often the case than not. Graphs can be a powerful way to represent relationships between data, but they are also a very abstract concept, which means that they run the danger of meaning something only to the creator of the graph. Often, simply showing the structure of the data says very little about what it actually means, even though it’s a perfectly accurate means of representing the data. Everything looks like a graph, but almost nothing should ever be drawn as one.

Let's look at bad and better ways of visualizing the senate vote network.

First, consider the "default" plot from networkx.

In [142]:
#this makes sure draw_spring results are the same at each call
np.random.seed(1)  

color = [votes.node[senator]['color'] for senator in votes.nodes()]

#determine position of each node using a spring layout
pos = nx.spring_layout(votes, iterations=200, k=2)

#plot the edges
nx.draw_networkx_edges(votes, pos, alpha = .05)

#plot the nodes
nx.draw_networkx_nodes(votes, pos, node_color=color)

#draw the labels
lbls = nx.draw_networkx_labels(votes, pos, alpha=1, font_size=8)

#coordinate information is meaningless here, so let's remove it
plt.xticks([])
plt.yticks([])
remove_border(left=False, bottom=False)

The spring layout tries to group nodes with large edge-weights near to each other. In this context, that means it tries to organize the Senate into similarly-voting cliques. However, there's simply too much going on in this plot -- we should simplify the representation.

Problem 3

Compute the Minimum Spanning Tree of this graph, using the difference edge attribute as the weight to minimize. A Minimum Spanning Tree is the subset of edges which trace at least one path through all nodes ("spanning"), with minimum total edge weight. You can think of it as a simplification of a network.

Plot this new network, making modifications as necessary to prevent the graph from becoming too busy.

In [183]:
#Your code here
spanning_tree = nx.minimum_spanning_tree(votes, weight='difference')

## recompute color assignments
color = [spanning_tree.node[senator]['color'] for senator in spanning_tree.nodes()]

##
## repeat graph display code from above
## the only difference is k parameter for spring_layout
##

#determine position of each node using a spring layout
pos = nx.spring_layout(spanning_tree, iterations=200, k=.03)

#plot the edges
nx.draw_networkx_edges(spanning_tree, pos, alpha = .02)

#plot the nodes
nx.draw_networkx_nodes(spanning_tree, pos, node_color=color)

#draw the labels
lbls = nx.draw_networkx_labels(spanning_tree, pos, alpha=1, font_size=8)

#coordinate information is meaningless here, so let's remove it
plt.xticks([])
plt.yticks([])
remove_border(left=False, bottom=False)

Problem 4

While this graph has less information, the remaining information is easier to digest. What does the Minimum Spanning Tree mean in this context? How does this graph relate to partisanship in the Senate? Which nodes in this graph are the most and least bi-partisan?

The MST keeps just enough edges to connect all of the graph's vertices together, choosing them so as to minimize the total of whatever we count as 'weight'.
Here, we've assigned weight as the difference score between two senators - in other words, the inverse of the number of times they voted the same way. The tree therefore keeps each senator's strongest voting ties and drops the edges between senators who rarely agree.

We can see this in the fact that Republican and Democratic senators are, for the most part, strung out in a line connecting to only other members on their side of the aisle. The nodes in the center of the swoosh are the most bi-partisan, and the ones out on the tails are the most partisan voters.
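
To make the role of the 'difference' weights concrete, here is a dependency-free sketch of Kruskal's algorithm on four made-up senators. It is meant as an illustration of what a minimum spanning tree is, not as a description of how networkx implements it; `kruskal_mst` and the agreement counts are hypothetical.

```python
def kruskal_mst(nodes, weighted_edges):
    """Kruskal's algorithm. weighted_edges is a list of (cost, node_a, node_b)."""
    parent = dict((node, node) for node in nodes)

    def find(node):
        # Walk up to the root of this node's component, shortening the path as we go.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    tree = []
    for cost, a, b in sorted(weighted_edges):   # cheapest edges first
        root_a, root_b = find(a), find(b)
        if root_a != root_b:                    # joining two components cannot form a cycle
            parent[root_a] = root_b
            tree.append((a, b))
    return tree

# Hypothetical agreement counts (the notebook's edge 'weight').
agreement = {('D1', 'D2'): 90, ('R1', 'R2'): 85, ('D2', 'R1'): 40,
             ('D1', 'R1'): 20, ('D1', 'R2'): 10, ('D2', 'R2'): 15}

# difference = 1 / weight, so frequent agreement means a cheap edge.
edges = [(1.0 / count, a, b) for (a, b), count in agreement.items()]
tree = kruskal_mst(['D1', 'D2', 'R1', 'R2'], edges)
print(sorted(tree))
```

In this toy case the two same-party edges are kept first, and the parties are joined only through the single strongest cross-party tie (D2 and R1), which is the same chain-like shape seen in the plot.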

Problem 5

(For this problem, use the full graph for centrality computation, and not the Minimum Spanning Tree)

Networkx can easily compute centrality measurements.

Briefly discuss what closeness_centrality means, both mathematically and in the context of the present graph -- how does the centrality relate to partisanship? Choose a way to visualize the closeness_centrality score for each member of the Senate, using edge difference as the distance measurement. Determine the 5 Senators with the highest and lowest centralities.

Comment on your results. In particular, note the outliers John Kerry (who recently resigned his Senate seat when he became Secretary of State), Mo Cowan (Kerry's interim replacement) and Ed Markey (Kerry's permanent replacement) have low centrality scores -- why?

Closeness centrality measures how near a given node is to all the other nodes in a network. Technically, it's the inverse of the node's average shortest-path distance to every other node, so a small summed distance means a high score. Nodes with high closeness centrality therefore have faster access (i.e. shorter walks) to other network nodes than do those with low closeness centrality.

In terms of partisanship, closeness centrality is positively related to bipartisan voting. The more you vote in agreement with senators across the aisle, the shorter your distance (here, measured by our 'difference' edge attribute) to any one given senator - even the ones on the fringes.

By contrast, if you're extremely partisan, you may be very close to other members of your party, but the long walks necessary to connect your node to partisan nodes of the other party essentially make you, on average, 'far' from many other nodes. This prediction is skewed somewhat by the fact that, here, we have a Democratic majority in the Senate, and so even relatively partisan Democrats will appear to have higher closeness centrality, simply by virtue of the fact that there are more of their own party members to vote alongside.
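
As a rough check on this intuition, here is a small pure-Python sketch that computes closeness as (n - 1) divided by the sum of shortest-path distances, using 1 / agreement as the edge distance. The five senators and their agreement counts are invented, and the sketch assumes a connected graph; it is not the networkx implementation.

```python
import heapq

def shortest_distances(graph, source):
    """Dijkstra's algorithm: graph maps node -> {neighbor: edge distance}."""
    distances = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if dist > distances[node]:
            continue  # stale queue entry, a shorter route was already found
        for neighbor, step in graph[node].items():
            candidate = dist + step
            if candidate < distances.get(neighbor, float('inf')):
                distances[neighbor] = candidate
                heapq.heappush(queue, (candidate, neighbor))
    return distances

def closeness(graph, node):
    """(n - 1) / (sum of shortest-path distances), assuming a connected graph."""
    distances = shortest_distances(graph, node)
    return (len(distances) - 1) / sum(distances.values())

# Invented agreement counts: M agrees moderately with everyone,
# while the D and R senators mostly agree within their own party.
agreement = {('D1', 'D2'): 100, ('R1', 'R2'): 100,
             ('M', 'D1'): 60, ('M', 'D2'): 60, ('M', 'R1'): 60, ('M', 'R2'): 60,
             ('D1', 'R1'): 10, ('D1', 'R2'): 10, ('D2', 'R1'): 10, ('D2', 'R2'): 10}

graph = {}
for (a, b), count in agreement.items():
    graph.setdefault(a, {})[b] = 1.0 / count   # difference = 1 / weight
    graph.setdefault(b, {})[a] = 1.0 / count

for name in sorted(graph):
    print(name, round(closeness(graph, name), 2))
```

The moderate senator M ends up with the highest score: every other senator is one short hop away, whereas D1 can reach the Republicans only through M or over a long direct edge.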

Now, I'm a little shaky on the math here, but I found a passage on p. 13 of Borgatti's introduction of the closeness centrality formula (linked from the networkx documentation) that helped explain the outliers. Take the first of this pair of closeness centrality equations:



where m is '# of voting instances' and n is '# of potential voting agreements'.

So while the number of senators you could vote alongside remains relatively constant, our numerator is largely impacted by the number of voting events you take part in. If for some reason you didn't take part in many votes, then you'd have a reduced closeness centrality - which wouldn't really be an appropriate measure of your degree of partisanship.

For instance, the bar graph displayed below shows a few outliers on the low end. We have: Lautenberg, Chiesa, Booker (all NJ) and Kerry, Cowan, and Markey (all MA). Here's the scoop:

  • Lautenberg (D-NJ) has fewer voting instances because he died in office in June 2013.
  • Chiesa (R-NJ) served as interim senator in Lautenberg's seat until the special election.
  • Booker (D-NJ) won that election, and was just sworn in last month. So all NJ senators had fewer voting instances per individual.
  • Kerry, Cowan, and Markey have all occupied the same seat (as described in this question's intro), so each of them individually cast fewer votes - same situation as NJ.
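
The participation effect is easy to see with a little arithmetic. The sketch below uses invented vote counts and a hypothetical helper, `difference`, that mirrors the notebook's edge attribute (1 / number of matching votes).

```python
def difference(matching_votes):
    """Edge 'difference' as defined in this notebook: 1 / weight."""
    return 1.0 / matching_votes

# Invented numbers: a full-term senator who agrees with a colleague on half of
# 200 shared votes, and a short-term senator who agrees on 36 of 40 shared votes.
full_term = difference(matching_votes=100)    # agreement rate 0.5
short_term = difference(matching_votes=36)    # agreement rate 0.9

print('full term  difference:', full_term)
print('short term difference:', short_term)
```

Even with the higher agreement rate, the short-term senator gets the larger difference, so every path through that node is longer and the closeness score drops. Dividing the matching votes by the number of votes both senators took part in would be one way to separate partisanship from attendance; that normalization is not applied anywhere in this notebook.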
In [144]:
#Your code here

## compute closeness scores
## add to sen_df

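closeness = nx.closeness_centrality(votes, distance='difference')

## pd.Series aligns on the senator names in the index,
## so there's no need to hard-code the 104 rows or to loop over the dict
sen_df['closeness'] = pd.Series(closeness)
    
## (newer pandas versions spell this sen_df.sort_values('closeness'))
close_df = sen_df.sort(columns='closeness')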

N = close_df.shape[0]
ind = np.arange(N)  ## the x locations for the groups
width = .9       ## the width of the bars

fig, ax = plt.subplots()
ax.bar(ind, close_df['closeness'], width, color=list(close_df['color'].values))

## with help from @1503
xlocs, senators = zip(*enumerate(close_df['surname']))
xticks_locs, xticks_labels = plt.xticks(xlocs, senators)
plt.xlim(min(xlocs), max(xlocs))

## add some labels
ax.set_ylabel('Closeness score')
ax.set_xlabel('Senator Surnames')
ax.set_title('Closeness Centrality of US Senators in the 113th Congress')
ax.set_xticklabels( close_df['surname'], rotation='vertical', fontsize=8)

## it's hard to get everyone on in one graph - so we make it wider
fig.set_size_inches(19,5.5)
plt.show()

Problem 6

Centrality isn't a perfect proxy for bipartisanship, since it gauges how centralized a node is to the network as a whole, and not how similar a Democrat node is to the Republican sub-network (and vice versa).

Can you come up with another measure that better captures bipartisanship than closeness centrality? Develop your own metric -- how does it differ from the closeness centrality? Use visualizations to support your points.

I recomputed our graph object with edges only drawn between bipartisan voting pairs (i.e. senators from different parties, such as D/R, but not R/R or D/D). Below are three code blocks:

  • computing the new graph

  • displaying the normal spring layout

  • displaying minimum spanning tree

Comments follow after the code blocks.

In [254]:
#your code here
def across_aisle_graph(df):

    g_aisle = nx.Graph()
    
    for name in df['display']:
        
        g_aisle.add_node(name, surname=df.loc[name,'surname'], color=df.loc[name, 'color'])
        
    for nameA in df['display']:
        
        for nameB in df['display']:
            
            this_pair = np.sort([nameA,nameB])
            if (nameA != nameB) and (df.loc[this_pair[0]][this_pair[1]] > 0) and (df.loc[nameA, 'party'] != df.loc[nameB, 'party']):
            
                ## get edge weight from the df that was passed in (not the global sen_df)
                weight = df.loc[this_pair[0]][this_pair[1]]
                
                ## add edge between co-voting senators
                ## include 'weight' and 'difference' attributes
                g_aisle.add_edge(this_pair[0], this_pair[1], weight=weight, difference=(1./weight))                    
    
    return g_aisle

cross_aisle = across_aisle_graph(sen_df)
In [255]:
np.random.seed(1)  

color = [cross_aisle.node[senator]['color'] for senator in cross_aisle.nodes()]

#determine position of each node using a spring layout
pos = nx.spring_layout(cross_aisle, iterations=200, k=2)

#plot the edges
nx.draw_networkx_edges(cross_aisle, pos, alpha = .05)

#plot the nodes
nx.draw_networkx_nodes(cross_aisle, pos, node_color=color)

#draw the labels
lbls = nx.draw_networkx_labels(cross_aisle, pos, alpha=1, font_size=8)

#coordinate information is meaningless here, so let's remove it
plt.xticks([])
plt.yticks([])
remove_border(left=False, bottom=False)
In [258]:
#Your code here
aisle_spanning_tree = nx.minimum_spanning_tree(cross_aisle, weight='difference')

## recompute color assignments
color = [aisle_spanning_tree.node[senator]['color'] for senator in aisle_spanning_tree.nodes()]

##
## repeat graph display code from above
## the only differences are the iterations and k parameters for spring_layout
##

#determine position of each node using a spring layout
pos = nx.spring_layout(aisle_spanning_tree, iterations=100, k=.4)

#plot the edges
nx.draw_networkx_edges(aisle_spanning_tree, pos, alpha = .02)

#plot the nodes
nx.draw_networkx_nodes(aisle_spanning_tree, pos, node_color=color)

#draw the labels
lbls = nx.draw_networkx_labels(aisle_spanning_tree, pos, alpha=1, font_size=8)

#coordinate information is meaningless here, so let's remove it
plt.xticks([])
plt.yticks([])
remove_border(left=False, bottom=False)

It looks like there's a cluster of (mostly) Democrats that engages in relatively bipartisan voting behavior. These guys bridge the divide between the sharply partisan camps on either side of the aisle.

The exceptions in this graph support the validity of this approach to network visualization.

For instance, Collins (R-ME) has long had a reputation in the Senate for working towards compromise with Democrats, and we see her lone red circle in the middle group of bridge-makers.
On the other side, Manchin (D-WV) behaves pretty oddly for a registered Democrat. He has long been a strong supporter of the coal mining industry, and broke rank with his party members to vote against key gay rights measures. So the fact that his blue shows up in an ocean of red makes sense, considering his history.

In general, I prefer this approach to evaluating partisanship (over closeness centrality), as it gets at the cross-aisle voting habits that define bipartisan senators. In particular, the minimum spanning tree in this version revealed the sort of clustering I've described here that suggests, for the most part, it's a group of Democratic senators who lead the effort to bridge partisan politics.
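
One possible way to put a number on this clustering, offered only as an illustration on made-up counts and not as a result from this notebook, is the share of each senator's total agreement that comes from members of other parties. The helper names `party_of` and `cross_aisle_share` are hypothetical.

```python
from collections import defaultdict

def party_of(display_name):
    """'Surname (P-ST)' -> 'P' (assumes the display-name format used in this notebook)."""
    return display_name.split('(')[1].split('-')[0]

def cross_aisle_share(agreement):
    """Fraction of each senator's agreement counts that involve another party."""
    cross = defaultdict(float)
    total = defaultdict(float)
    for (a, b), count in agreement.items():
        for name in (a, b):
            total[name] += count
            if party_of(a) != party_of(b):
                cross[name] += count
    return dict((name, cross[name] / total[name]) for name in total)

# Made-up agreement counts for four placeholder senators.
agreement = {('A (D-XX)', 'B (D-XX)'): 80, ('C (R-XX)', 'D (R-XX)'): 80,
             ('A (D-XX)', 'C (R-XX)'): 20, ('A (D-XX)', 'D (R-XX)'): 10,
             ('B (D-XX)', 'C (R-XX)'): 60, ('B (D-XX)', 'D (R-XX)'): 30}

shares = cross_aisle_share(agreement)
for name in sorted(shares, key=shares.get, reverse=True):
    print(name, round(shares[name], 3))
```

Because it is a ratio, this score does not shrink just because a senator cast fewer votes, though it does depend on how many members each party has.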

NB: The graph rendering has been a little unstable, so I'm attaching screenshots of the two displays I referred to in part 6, below. The first is before minimum spanning, the second is after.
In [250]:
from IPython.display import Image   ## Image is not imported anywhere above

path = 'images/before_minimizing.png'
Image(path)
Out[250]:
In [252]:
path = 'images/after_minimizing.png'
Image(path)
Out[252]: