In [1]:
import pandas, sqlite3
from matplotlib import pyplot
from sklearn.cluster import KMeans
import networkx

Clustering, database queries

In the previous class we looked at a dataset of movie reviews and defined "similarity" between movies as being reviewed by the same users. We then constructed a network between users and movies and searched for movies that had the most connections to a query movie, The Avengers (2012).

In today's class we're going to do something similar, but rather than look for items that are similar to one specific item, we're going to compare all items to each other using k-means clustering. We'll dig into the algorithm itself on Thursday; today we'll rely on a library implementation.

I also want to give you an example of a SQL-based interface, since that is both powerful and very common in data science as a way to manage large volumes of data. For today I am giving you a 1/30th sample of NYC CitiBike data from April 2018. The full file is around 250MB.

In [2]:
citibike = pandas.read_csv("citibike-201804.csv")
citibike.keys()
Out[2]:
Index(['tripduration', 'starttime', 'stoptime', 'start station id',
       'start station name', 'start station latitude',
       'start station longitude', 'end station id', 'end station name',
       'end station latitude', 'end station longitude', 'bikeid', 'usertype',
       'birth year', 'gender'],
      dtype='object')

These field names will cause trouble in SQL queries because of the spaces. Let's replace the spaces with underscores.

In [3]:
dict(zip(citibike.keys(), [s.replace(" ", "_") for s in citibike.keys()]))
Out[3]:
{'tripduration': 'tripduration',
 'starttime': 'starttime',
 'stoptime': 'stoptime',
 'start station id': 'start_station_id',
 'start station name': 'start_station_name',
 'start station latitude': 'start_station_latitude',
 'start station longitude': 'start_station_longitude',
 'end station id': 'end_station_id',
 'end station name': 'end_station_name',
 'end station latitude': 'end_station_latitude',
 'end station longitude': 'end_station_longitude',
 'bikeid': 'bikeid',
 'usertype': 'usertype',
 'birth year': 'birth_year',
 'gender': 'gender'}
In [4]:
citibike = citibike.rename(columns=dict(zip(citibike.keys(), [s.replace(" ", "_") for s in citibike.keys()])))
citibike.head()
Out[4]:
tripduration starttime stoptime start_station_id start_station_name start_station_latitude start_station_longitude end_station_id end_station_name end_station_latitude end_station_longitude bikeid usertype birth_year gender
0 1380 2018-04-01 13:57:02.6330 2018-04-01 14:20:03.1060 72 W 52 St & 11 Ave 40.767272 -73.993929 358 Christopher St & Greenwich St 40.732916 -74.007114 28676 Subscriber 1986 1
1 536 2018-04-02 00:51:07.9250 2018-04-02 01:00:04.3990 72 W 52 St & 11 Ave 40.767272 -73.993929 529 W 42 St & 8 Ave 40.757570 -73.990985 30143 Subscriber 1988 1
2 295 2018-04-02 17:59:25.1300 2018-04-02 18:04:20.9250 72 W 52 St & 11 Ave 40.767272 -73.993929 3236 W 42 St & Dyer Ave 40.758985 -73.993800 25631 Subscriber 1960 1
3 1225 2018-04-03 08:36:22.0150 2018-04-03 08:56:47.0800 72 W 52 St & 11 Ave 40.767272 -73.993929 382 University Pl & E 14 St 40.734927 -73.992005 18266 Subscriber 1991 1
4 384 2018-04-03 19:00:33.7030 2018-04-03 19:06:58.3780 72 W 52 St & 11 Ave 40.767272 -73.993929 514 12 Ave & W 40 St 40.760875 -74.002777 33422 Subscriber 1979 1
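For reference, the same cleanup can be done in one line using the string methods on the column Index, without building a dict first. A sketch on a tiny stand-in frame (not the real CitiBike data):

```python
import pandas

# Tiny stand-in frame with two of the space-containing column names
df = pandas.DataFrame({"start station id": [72], "birth year": [1986]})

# One-line equivalent of the rename-with-a-dict approach above
df.columns = df.columns.str.replace(" ", "_")
```

The dict-based `rename` is still useful when you only want to change some columns, or want to preview the mapping first, as we did above.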

Now write the DataFrame into a SQLite database, as a table called trips.

In [6]:
conn = sqlite3.connect("citibike.db")
citibike.to_sql("trips", conn, if_exists="replace")
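The `to_sql` / `read_sql_query` pair is a general round-trip pattern: push a DataFrame into a table, then pull query results back out as a new DataFrame. A minimal sketch using a throwaway in-memory database rather than the citibike.db file:

```python
import sqlite3
import pandas

# Throwaway in-memory database standing in for citibike.db
conn = sqlite3.connect(":memory:")
demo = pandas.DataFrame({"tripduration": [1380, 536, 295]})
demo.to_sql("trips", conn, if_exists="replace", index=False)

# Read results back with an SQL aggregate
result = pandas.read_sql_query(
    "select count(*) as n, min(tripduration) as shortest from trips", conn)
```

Passing `index=False` keeps the pandas row index from being stored as an extra column; with a real file path instead of `":memory:"`, the table persists on disk between sessions.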
In [11]:
stations = pandas.read_sql_query("select distinct start_station_id, start_station_name, start_station_latitude, start_station_longitude from trips where start_station_latitude < 45", conn)
stations.head()
Out[11]:
start_station_id start_station_name start_station_latitude start_station_longitude
0 72 W 52 St & 11 Ave 40.767272 -73.993929
1 79 Franklin St & W Broadway 40.719116 -74.006667
2 82 St James Pl & Pearl St 40.711174 -74.000165
3 83 Atlantic Ave & Fort Greene Pl 40.683826 -73.976323
4 119 Park Ave & St Edwards St 40.696089 -73.978034
In [12]:
pyplot.scatter(stations["start_station_longitude"], stations["start_station_latitude"])
pyplot.show()
In [9]:
stations.start_station_latitude.max()
Out[9]:
45.506364054011385
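That maximum latitude is why the query above filters on `start_station_latitude < 45`: a station at 45.5°N is far outside New York City, so it's almost certainly a data artifact (a depot or test record). The same inspection can be done directly on a DataFrame; a sketch on synthetic rows (the station names here are made up):

```python
import pandas

# Synthetic stand-in: two NYC-range stations and one implausible coordinate
stations = pandas.DataFrame({
    "start_station_name": ["W 52 St & 11 Ave", "Franklin St & W Broadway", "Depot"],
    "start_station_latitude": [40.767272, 40.719116, 45.506364],
})

# Flag rows whose latitude is implausible for New York City
outliers = stations[stations.start_station_latitude >= 45]
```

Eyeballing the flagged rows before dropping them is good practice; here a single bad coordinate would badly stretch the scatter plot's axes.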

Create a clustering with scikit-learn's KMeans class, using the station coordinates as features.

In [16]:
clustering = KMeans(n_clusters=10)
clustering.fit(stations[["start_station_longitude", "start_station_latitude"]])

centers = clustering.cluster_centers_
station_clusters = clustering.labels_

pyplot.figure(figsize=(4,8))
pyplot.scatter(stations["start_station_longitude"], stations["start_station_latitude"], c=station_clusters, cmap="viridis")
#pyplot.scatter(centers[:,0], centers[:,1])
pyplot.show()
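The choice of 10 clusters here is arbitrary. One common sanity check is the elbow method: fit the model for a range of k values, record each model's inertia (the within-cluster sum of squared distances, exposed as `inertia_`), and look for the k where the curve bends. A sketch on random points standing in for the station coordinates:

```python
import numpy
from sklearn.cluster import KMeans

rng = numpy.random.default_rng(0)
points = rng.normal(size=(200, 2))  # stand-in for the station lon/lat pairs

# Inertia shrinks as k grows; an "elbow" suggests a reasonable k
inertias = []
for k in range(1, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    inertias.append(model.inertia_)
```

Plotting `inertias` against k with `pyplot.plot(range(1, 8), inertias)` would show the curve; there is no single right answer, but a sharp bend is a useful hint.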
In [23]:
sql_query = "select start_station_name, end_station_name, " + \
"count(*) as trip_count from trips group by start_station_name, end_station_name"
print(sql_query)
station_trips = pandas.read_sql_query(sql_query, conn)
select start_station_name, end_station_name, count(*) as trip_count from trips group by start_station_name, end_station_name
In [24]:
station_trips.head()
Out[24]:
start_station_name end_station_name trip_count
0 1 Ave & E 110 St 1 Ave & E 78 St 1
1 1 Ave & E 110 St 1 Ave & E 94 St 1
2 1 Ave & E 110 St 2 Ave & E 96 St 5
3 1 Ave & E 110 St 5 Ave & E 93 St 1
4 1 Ave & E 110 St Central Park North & Adam Clayton Powell Blvd 1
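This `group by` with `count(*)` has a direct pandas equivalent using `groupby` and `size`, which is handy when the data already lives in memory. A sketch on a toy trips table:

```python
import pandas

# Toy trips table: three trips over two distinct station pairs
trips = pandas.DataFrame({
    "start_station_name": ["A", "A", "B"],
    "end_station_name":   ["B", "B", "A"],
})

# Equivalent of: select ..., count(*) as trip_count ... group by ...
pair_counts = (trips.groupby(["start_station_name", "end_station_name"])
                    .size()
                    .reset_index(name="trip_count"))
```

For the full 250MB file, pushing the aggregation into SQL as we did above has the advantage that the raw rows never need to fit in memory at once.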
In [25]:
station_graph = networkx.from_pandas_edgelist(station_trips, "start_station_name", "end_station_name", ["trip_count"])
# note: to_numpy_matrix was removed in networkx 3.0; use to_numpy_array there
adjacency = networkx.convert_matrix.to_numpy_matrix(station_graph, nodelist=stations.start_station_name, weight="trip_count")
In [26]:
adjacency[:10,:10]
Out[26]:
matrix([[1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 3., 1., 0., 0., 0., 0., 0., 0., 0.],
        [0., 1., 1., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 1., 2., 0., 0.],
        [0., 0., 0., 0., 0., 0., 2., 2., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 2.]])
In [40]:
network_clustering = KMeans(n_clusters=10)
In [41]:
network_clustering.fit(adjacency)
Out[41]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=10, n_init=10, n_jobs=None, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

This clustering of the CitiBike stations is based on connections between stations: how many trips started at each station and ended at each of the others. Although we are displaying the points using their lat/lon coordinates, the clustering algorithm does not actually know anything about those positions. It only knows which stations tend to connect to each other.

Many of these clusters do still look like neighborhoods. You can see Greenwich Village, Chelsea, the Financial District, and the areas around Park Slope in Brooklyn. But there are also clusters that are long and stringy, like the one that runs along the west side of Manhattan, and others that exist inside other clusters, like the stations adjacent to Central Park.
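One refinement worth trying, not part of what we did above: because busy stations have much larger row sums than quiet ones, k-means on the raw adjacency matrix can end up grouping stations by traffic volume rather than by connection pattern. Normalizing each row first makes the features compare patterns instead. A sketch on a tiny synthetic adjacency matrix with two obvious blocks:

```python
import numpy
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans

# Tiny synthetic adjacency: 6 "stations" forming two disconnected groups
adjacency = numpy.array([
    [5, 4, 3, 0, 0, 0],
    [4, 5, 3, 0, 0, 0],
    [3, 3, 5, 0, 0, 0],
    [0, 0, 0, 5, 4, 4],
    [0, 0, 0, 4, 5, 3],
    [0, 0, 0, 4, 3, 5],
], dtype=float)

# L2-normalize each row so total traffic volume doesn't dominate
rows = normalize(adjacency)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(rows)
```

On this toy matrix the two blocks land in separate clusters; on the real station data, normalization may change which stations get grouped with the high-traffic hubs.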

In [42]:
centers = network_clustering.cluster_centers_
station_clusters = network_clustering.labels_

pyplot.figure(figsize=(7,14))
pyplot.scatter(stations["start_station_longitude"], stations["start_station_latitude"], c=station_clusters, cmap="tab10")
#pyplot.scatter(centers[:,0], centers[:,1])
pyplot.show()