Notebook

A football addiction confession and English club fan catchments¶

Thursday 13th June

Update 14:46: Attempted a fix for Google Chrome browser but it does not display properly in Kyso. Open in Firefox or Safari to view map.

A football addiction confession¶

Not even two weeks since the Champions League final, today's release of the Premier League 2019/20 fixtures marks the countdown to the new English domestic football season. For fans of newly promoted clubs it's a particularly exciting day, and the more studious will begin to chart potential pathways to league survival. All fans will eagerly look out for their team's opening day, Boxing Day and final day fixtures, as well as those against rivals.

I can remember in years gone by it being my small dose of the potent drug of club football, which, in Europe at least, is otherwise unavailable at this time of year. Lower strength substitutes such as tennis, cricket, the Olympics and even international football are available, but none of these can compete with club football.

Having been brought up in the rural Wiltshire football wilderness, and with a father who did not count the beautiful game among his interests, I ended up devoting my loyalty as a kid to Manchester United - the most successful club at that time. And it would no exaggeration to say I got myself addicted to football. In what were sometimes difficult teen years, following club football provided a hobby and escape for me. I would love to have been good at playing it too, but my slightly poor co-ordination, self consciousness and lack of self confidence got in the way.

This addiction manifested itself in many different ways. Some examples include a few trips to Old Trafford (I even caught this quite remarkable Rooney goal), obsessively checking the BBC Sport transfer gossip column, regularly watching Sky Sports News for hours on end (when it was on Freeview) and a successful lobbying effort to get our house subscribed to Sky Sports.

The definition of a "true fan" is probably someone whose emotions on match days are strongly correlated with their team's showing on the pitch; they feel intensely the joy of the highs and the pain of the lows. As United's domestic and European success between 2006 and 2009 came to an end when I was about 16, it got to a point when I realised I would be happier if I took the conscious decision to "mentally detach" myself from supporting them. Although that may sound like a joke, I'm actually being dead serious!

I didn't abandon football entirely - far from it. Football has been a continued interest of mine and a fruitful source of conversation, particularly as I wouldn't count myself a big film or TV buff. And as I have moved around since school, it has been fun to watch my local teams: York City, West Ham United, Curicó Unido (in Chile), Bristol City and, most recently, Tottenham Hotspur.

I might not be a "fan", but I do feel some degree of belonging and affinity towards all of these clubs. And having worked as a steward at Tottenham since their return to N17 and told my colleagues I'm a Spurs supporter, maybe it is now appropriate to use the "f word" once more. I do appreciate most people will be reading all of this in horror!

Football club fan catchments¶

As some light relief from working on my MSc thesis, I have applied my newly developed Python programming skills to the topic of football club fan catchments: the result is a map that shows the geographic catchment of almost every team in the Premier League and EFL(1).

I created a Voronoi diagram dividing England up in to 89 thiessen polygons, where every possible point inside a given polygon is closest to a certain English Football League club stadium(2). I must acknowledge I used a spreadsheet of stadiums from a blog post written by someone who undertook the same exercise a year ago, but I still claim some novelness to my version by virtue of the fact it is interactive(3)!

I then went a step further by using Census Output Area data to estimate the populations in each of these polygons. I have also calculated the average age of each catchment and YouGov "fame ranking" of those for which it was available(4).

The interactive map shows a thiessen polygon for each club. The colours correspond by default to the population size of the catchment; the intervals for each colour can be seen in the legend. You can click the layer icon on the right-hand side to toggle to an average aged based choropleth map (note you must also deselect population). Additional information is also displayed in the tooltip - this can brought up by clicking on the desired club's map marker.

I hope you find it interesting and would welcome any feedback in the comments section.

_{(1) The Welsh clubs Swansea City, Cardiff City and Newport (who still play in the English system) are excluded, leaving 89 clubs in total.}

_{(2) Based on the teams due to participate in the EFL/EPL the 2019/20 season. Coventry City catchment calculated using coordinates of the Ricoh Arena (their former Stadium). Otherwise they would have an identical catchment due to Birmingham City thanks to their current groundshare arrangement.}

_{(3) I also could not get the co-ordinates in his data to work so I re-extracted these using the Wikipedia API}

_{(4) There are no good statistics on actual fan base sizes other than for the top clubs. Although I think the survey is of UK residents, these YouGov rankings actually contains some European teams that I have stripped out. I could also have scraped the number of Twitter followers for each club but decided life was too short!}

In [53]:

Out[53]:

Sources¶

Grounds for the 18/19 season from Safe.com(4): https://www.safe.com/blog/2018/08/fme-and-data-integration-evangelist177/
Coordinates using Wikipedia API to extract from each club page
Capacities scrapped from this Wikipedia page: https://en.wikipedia.org/wiki/List_of_football_stadiums_in_England
Fame rank from YouGov: https://yougov.co.uk/ratings/sport/popularity/football-clubs/all
Boundaries from Open Geography Portal: http://geoportal.statistics.gov.uk/
Population data from Nomis: https://www.nomisweb.co.uk/

Import packages & configuration¶

In [20]:

import pandas as pd
import numpy as np
import geopandas as gpd
import os
import requests
import pysal as ps
from pysal.contrib.viz import mapping as maps
import palettable as pltt
from seaborn import palplot
import zipfile
import folium
from shapely import geometry
from shapely.geometry import Point
import folium.plugins
import re
import wikipedia
import matplotlib.pyplot as plt
from libpysal.cg.voronoi import Voronoi, voronoi_frames
from shapely.geometry import Polygon
import requests
from bs4 import BeautifulSoup

pd.options.display.max_rows = 200

%matplotlib inline

Main club & stadium data¶

In [21]:

# Read in data sets
stadiums_data = pd.read_csv(
    "EnglishFootballGrounds/PremiershipFootballGrounds.csv")
stadiums_data = stadiums_data.append(pd.read_csv(
    "EnglishFootballGrounds/ChampionshipFootballGrounds.csv"))
stadiums_data = stadiums_data.append(pd.read_csv(
    "EnglishFootballGrounds/LeagueOneFootballGrounds.csv"))
stadiums_data = stadiums_data.append(pd.read_csv(
    "EnglishFootballGrounds/LeagueTwoFootballGrounds.csv"))

# Only using team and stadium
stadiums_data = stadiums_data[["Team", "Stadium"]]

In [22]:

# Import English regions data to get England boundary
regions = gpd.read_file("regions.shp", crs={'init': 'epsg:27700'})
regions = regions.to_crs(epsg=4326)

regions["dis"] = np.zeros

# Disolve boundaries
regionz = regions.dissolve(by='dis')

In [23]:

# Remove teams relegated from League 2 end of 2018/19 season
stadiums_data = stadiums_data[stadiums_data.Team != "Notts County"]
stadiums_data = stadiums_data[stadiums_data.Team != "Yeovil Town"]

# Add teams promoted to League 2 end of 2018/19 season
stadiums_data = stadiums_data.append(
    {"Team": "Salford City", "Stadium": "Moor Lane"}, ignore_index=True)
stadiums_data = stadiums_data.append(
    {"Team": "Leyton Orient", "Stadium": "Brisbane Road"}, ignore_index=True)
stadiums_data = stadiums_data.replace(np.nan, '', regex=True)

# Remove Welsh teams
stadiums_data = stadiums_data[stadiums_data.Team != "Swansea City"]
stadiums_data = stadiums_data[stadiums_data.Team != "Cardiff City"]
stadiums_data = stadiums_data[stadiums_data.Team != "Newport County"]

# Correct spelling errors
stadiums_data.Team = stadiums_data.Team.apply(lambda x: x.replace('and Hove', '& Hove'))
stadiums_data.Team = stadiums_data.Team.apply(lambda x: x.replace('Middlesborough', 'Middlesbrough'))
stadiums_data.Team = stadiums_data.Team.apply(lambda x: x.replace('West Bromich Albion', 'West Bromwich Albion'))

Get stadium coordinates¶

In [24]:

# Function to remove extra characters in stadium name
def rem_brack(x):
    return re.sub("[\(\[].*?[\)\]]", "", x)


stadiums_data["Stadium_name"] = stadiums_data.Stadium.apply(rem_brack)

In [25]:

# Exceptions for those where incorrect stadium name prevented match
def resolve_names(stad_name):    
        if stad_name == "Stamford Bridge":
            return "Stamford Bridge (stadium)"
        if stad_name == "White Hart Lane":
            return "Tottenham Hotspur Stadium"
        elif stad_name == "Etihad Stadium":
            return "City of Manchester Stadium"
        elif stad_name == "Ashton Gate":
            return "Ashton Gate Stadium"
        elif stad_name == "Bramall Land":
            return "Bramall Lane"
        elif stad_name == "Hillsborough":
            return "Hillsborough Stadium"
        elif stad_name == "The Valley":
            return"The Valley (London)"
        elif stad_name == "St Andrews":
            return "St Andrew's (Stadium)"
        elif stad_name == "County Ground":
            return "County Ground (Swindon)"
        elif stad_name == "St James Park":
            return "St James Park (Exeter)"
        elif stad_name == "Broadhill Way":
            return "Broadhall Way"
        elif stad_name == "Globe Arena":
            return "Globe Arena (football stadium)"
        else:
            return stad_name
        
stadiums_data.Stadium_name = stadiums_data.Stadium_name.apply(resolve_names)

In [27]:

# Get coordinates from Wikipedia API
no_match = []


def get_coords(name):
    # Exceptions where no coordinates in page name
    if name == "Vale Park":
        coords = (53.049829, -2.192612)
    elif name == "Memorial Stadium":
        coords = (51.486421, -2.583098)
    elif name == "Bloomfield Road":
        coords = (53.805080, -3.048066)
    elif name == "Moor Lane":
        coords = (53.513631, -2.276775)
    elif name == "Anfield": #Someone editing the Wikipedia page
        coords = (53.4252, -2.9565)
    else:
        try:
            coords = wikipedia.WikipediaPage(title=name).coordinates
        except:
            no_match.append(name)
            coords = "not found"
    return coords


stadiums_data["geometryA"] = stadiums_data.Stadium_name.apply(get_coords)

YouGov fame data¶

In [28]:

# https://yougov.co.uk/ratings/sport/popularity/football-clubs/all
fame = pd.read_csv("yougov_fame.csv")

In [29]:

# Clean data
fame.Team = fame.Team.apply(lambda x: x.replace(" F.C.", ""))
fame.Team = fame.Team.apply(lambda x: x.replace(" F.C", ""))
fame.Team = fame.Team.apply(lambda x: x.replace(" F. C.", ""))
fame.Team = fame.Team.apply(lambda x: x.replace(" A.F.C.", ""))
fame.Team = fame.Team.apply(lambda x: x.replace("A.F.C. ", ""))

In [30]:

# Merge with main stadiums data
stadiums_data = stadiums_data.merge(fame, how="left")
stadiums_data = stadiums_data.sort_values("rank")
stadiums_data = stadiums_data.reset_index(drop=True)
stadiums_data = stadiums_data.reset_index()
stadiums_data = stadiums_data.rename(columns={"index": "rank_en"})
stadiums_data.rank_en = stadiums_data.rank_en.apply(lambda x: x+1)

Wikipedia capacity data¶

In [31]:

# Get stadium capacity data from Wikipedia API
website_url = requests.get(
    "https://en.wikipedia.org/wiki/List_of_football_stadiums_in_England")
soup = BeautifulSoup(website_url.text)
My_table = soup.find('table', {'class': 'wikitable sortable'})

In [32]:

# Function to get the stadium name
def stad_name(texty):
    match = re.search(r'(title=").+"{1}', str(texty))
    if match:
        subbed = re.sub('"', '', str(match.group()))
        subbed = re.sub('title=', '', str(subbed))
        return subbed
    else:
        return "None"

cap_data = pd.DataFrame(pd.Series(My_table.findAll("tr")))
cap_data.columns = ["parsed"]
cap_data["Stadium_name_alt"] = cap_data.parsed.apply(stad_name)

In [33]:

# Remove header row
cap_data = cap_data.drop(0)
# Remove extra rows where ground share
cap_data = cap_data.query(
    "Stadium_name_alt not in ['Reading F.C. Women', 'Yeovil Town L.F.C.', 'Portsmouth_F.C._Ladies', 'London Bees']")

In [34]:

# Extract capacity
def cap(texty):
    match = re.search(
        r'(">\d)(\d|,)(\d|,).+(<sup class)', str(texty))
    if match:
        subbed = re.sub('</span>', '', str(match.group()))
        subbed = re.sub('♠">', '', subbed)
        subbed = re.sub('">', '', subbed)
        subbed = re.sub(' ', '', subbed)
        subbed = re.sub('<supclass', '', subbed)
        return subbed
    else:
        return "None"

cap_data["capacity"] = cap_data.parsed.apply(cap)

# Selhurst Park not parsing - add manually
cap_data["capacity"][33] = "26,125"

# Add Ricoh Arena
cap_data = cap_data.append({"parsed": "NA", "Stadium_name_alt": "Ricoh Arena", "capacity": "32,753"}, ignore_index=True)

In [35]:

# Names per wikipedia stadium list 
def alt_names(stad_name):
    stripped = stad_name.strip()
    if stripped == "St. James' Park":
        return "St James' Park"
    elif stripped == "St Andrew's (Stadium)":
        return "St Andrew's (stadium)"
    elif stripped == "Britannia Stadium":
        return "Bet365 Stadium"
    elif stripped == "Macron Stadium":
        return "University of Bolton Stadium"
    elif stripped == "St James Park (Exeter)":
        return "St James Park, Exeter"
    elif stripped == "Crown Ground":
        return "Wham Stadium"
    elif stripped == "Memorial Stadium":
        return "Memorial Stadium (Bristol)"
    elif stripped == "County Ground (Swindon)":
        return "County Ground, Swindon"
    elif stripped == "Stadium MK":
        return "Stadium mk"
    elif stripped == "Highbury Stadium":
        return "Highbury Stadium (Fleetwood)"
    elif stripped == "Spotland":
        return "Spotland Stadium"
    elif stripped == "New Meadow":
        return "Greenhous Meadow"
    else:
        return stripped
    
stadiums_data["Stadium_name_alt"] = stadiums_data.Stadium_name.apply(alt_names)

# Merge capacities with main data
stadiums_data = stadiums_data.merge(cap_data, how="left")

Make polygons¶

In [37]:

# Create new columns with latitute and longitude
stadiums_data["lat"] = stadiums_data.geometryA.apply(lambda x: x[0])
stadiums_data["lat"] = stadiums_data["lat"].apply(lambda x: float(x))
stadiums_data["lon"] = stadiums_data.geometryA.apply(lambda x: x[1])
stadiums_data["lon"] = stadiums_data["lon"].apply(lambda x: float(x))

Voronoi function taken from https://gist.github.com/pv/8036995

In [38]:

def voronoi_finite_polygons_2d(vor, radius=None):
    """
    Reconstruct infinite voronoi regions in a 2D diagram to finite
    regions.

    Parameters
    ----------
    vor : Voronoi
        Input diagram
    radius : float, optional
        Distance to 'points at infinity'.

    Returns
    -------
    regions : list of tuples
        Indices of vertices in each revised Voronoi regions.
    vertices : list of tuples
        Coordinates for revised Voronoi vertices. Same as coordinates
        of input vertices, with 'points at infinity' appended to the
        end.

    """

    if vor.points.shape[1] != 2:
        raise ValueError("Requires 2D input")

    new_regions = []
    new_vertices = vor.vertices.tolist()

    center = vor.points.mean(axis=0)
    if radius is None:
        radius = vor.points.ptp().max()

    # Construct a map containing all ridges for a given point
    all_ridges = {}
    for (p1, p2), (v1, v2) in zip(vor.ridge_points, vor.ridge_vertices):
        all_ridges.setdefault(p1, []).append((p2, v1, v2))
        all_ridges.setdefault(p2, []).append((p1, v1, v2))

    # Reconstruct infinite regions
    for p1, region in enumerate(vor.point_region):
        vertices = vor.regions[region]

        if all(v >= 0 for v in vertices):
            # finite region
            new_regions.append(vertices)
            continue

        # reconstruct a non-finite region
        ridges = all_ridges[p1]
        new_region = [v for v in vertices if v >= 0]

        for p2, v1, v2 in ridges:
            if v2 < 0:
                v1, v2 = v2, v1
            if v1 >= 0:
                # finite ridge: already in the region
                continue

            # Compute the missing endpoint of an infinite ridge

            t = vor.points[p2] - vor.points[p1]  # tangent
            t /= np.linalg.norm(t)
            n = np.array([-t[1], t[0]])  # normal

            midpoint = vor.points[[p1, p2]].mean(axis=0)
            direction = np.sign(np.dot(midpoint - center, n)) * n
            far_point = vor.vertices[v2] + direction * radius

            new_region.append(len(new_vertices))
            new_vertices.append(far_point.tolist())

        # sort region counterclockwise
        vs = np.asarray([new_vertices[v] for v in new_region])
        c = vs.mean(axis=0)
        angles = np.arctan2(vs[:, 1] - c[1], vs[:, 0] - c[0])
        new_region = np.array(new_region)[np.argsort(angles)]

        # finish
        new_regions.append(new_region.tolist())

    return new_regions, np.asarray(new_vertices)

In [39]:

# Create normal voronoi
vor = Voronoi(np.c_[stadiums_data.lon, stadiums_data.lat])

# Finite region voronoi
regions, vertices = voronoi_finite_polygons_2d(vor)
poly = geometry.Polygon(vertices)

In [40]:

# Create list of vertices in each polygon
cells = [vertices[region] for region in regions]
listy = []
for cell in cells:
    polygon = Polygon(cell)
    listy.append(polygon)

# New column in data frame with geometry
stadiums_data["geometry"] = listy
stadiums_data = stadiums_data.set_geometry('geometry')
stadiums_data.crs = {'init': 'epsg:4326'}

# Use England boundary to cut
stadiums_data = gpd.overlay(stadiums_data, regionz, how='intersection')

/Users/student/anaconda3/envs/cusplondon/lib/python3.6/site-packages/geopandas/base.py:29: UserWarning: GeoSeries crs mismatch: {'init': 'epsg:4326'} and {'init': 'epsg:4326', 'no_defs': True}
  other.crs))

Resolving multipolygon issue: https://github.com/geopandas/geopandas/issues/834

In [41]:

upcast_dispatch = {geometry.Point: geometry.MultiPoint,
                   geometry.LineString: geometry.MultiLineString,
                   geometry.Polygon: geometry.MultiPolygon}


def maybe_cast_to_multigeometry(geom):
    caster = upcast_dispatch.get(type(geom), lambda x: x[0])
    return caster([geom])

stadiums_data.geometry = stadiums_data.geometry.apply(maybe_cast_to_multigeometry)

Population attributes¶

In [42]:

# Get Census output area centroids from ONS Open Geography
r = requests.get(
    "https://opendata.arcgis.com/datasets/ba64f679c85f4563bfff7fad79ae57b1_0.geojson")
oa_centroids_data = r.json()
centroids = gpd.GeoDataFrame.from_features(oa_centroids_data['features'], crs={'init': 'epsg:27700'})

In [43]:

# Import population and age data by output area (downloaded from www.nomisweb.co.uk)
pop = pd.read_csv("oa_pop.csv", header=6)

# Merge LSOA boundaries with population
centroids = centroids.merge(pop, left_on="oa11cd", right_on="2011 output area")

/Users/student/anaconda3/envs/cusplondon/lib/python3.6/site-packages/IPython/core/interactiveshell.py:3020: DtypeWarning: Columns (1) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

In [44]:

# Spatial join to allocate output area centroids to club catchment areas
team_geom = stadiums_data[["Team", "geometry"]]
centroids = gpd.sjoin(team_geom, centroids)

# Convert population data to numeric and group by to get single instance for each club
centroids["All usual residents"] = pd.to_numeric(centroids["All usual residents"])
pop_sum = pd.DataFrame(centroids.groupby(by="Team")[
                   "All usual residents"].sum()).reset_index()

/Users/student/anaconda3/envs/cusplondon/lib/python3.6/site-packages/geopandas/tools/sjoin.py:44: UserWarning: CRS of frames being joined does not match!
  warn('CRS of frames being joined does not match!')

In [45]:

# Function to get weighted average through groupby
def weighted_average(df, df_col, weight_col, by_col):
    df['_df_times_weight'] = df[df_col]*df[weight_col]
    df['_weight_where_notnull'] = df[weight_col]*pd.notnull(df[df_col])
    g = df.groupby(by_col)
    result = g['_df_times_weight'].sum() / g['_weight_where_notnull'].sum()
    del df['_df_times_weight'], df['_weight_where_notnull']
    return result

In [46]:

# Group by function with weighted average age
mean_age = pd.DataFrame(weighted_average(
    centroids, "Mean Age", "All usual residents", "Team")).reset_index()
mean_age.columns = ["Team", "Mean age"]

In [47]:

# Combine above in to new data frame
stadiums_data2 = team_geom.merge(pop_sum, how="left")
stadiums_data2 = stadiums_data2.merge(mean_age, how="left")
stadiums_data2.columns = ["Team", "geometry", "population", "average_age"]
stadiums_data2.average_age = stadiums_data2.average_age.round(1)
stadiums_data2.to_file("club_catch.geojson", driver='GeoJSON')
club_catch = f'club_catch.geojson'

/Users/student/anaconda3/envs/cusplondon/lib/python3.6/site-packages/geopandas/io/file.py:108: FionaDeprecationWarning: Use fiona.Env() instead.
  with fiona.drivers():

Build map¶

In [48]:

# Folium map
m = folium.Map(location=[52, -1], zoom_start=5, zoom_control=False)
# Stadium marker
for i in range(0, len(stadiums_data2)):
    folium.Marker(stadiums_data.iloc[i]['geometryA'],
                  icon=folium.Icon(color="purple", prefix= 'fa', icon="fa-futbol-o"), 
                  popup=str(
        "</br> Club: " + str(stadiums_data2.iloc[i]['Team']) + "</br> Stadium name: " +
        stadiums_data.iloc[i]['Stadium_name']) +
        "</br> Capacity: " + str(stadiums_data.iloc[i]['capacity']) +
        "</br> Population: " + str(stadiums_data2.iloc[i]['population']) +
        "</br> Average age: " + str(stadiums_data2.iloc[i]['average_age'])+
        "</br> YouGov fame rank: " + str(stadiums_data.iloc[i]['rank_en'])).add_to(m)

# Choropleth polygons

folium.Choropleth(
    geo_data=club_catch,
    name='Average age',
    data=stadiums_data2[["Team", "average_age"]],
    columns=['Team', 'average_age'],
    key_on='feature.properties.Team',
    fill_color='YlGn',
    fill_opacity=0.8,
    line_opacity=0.8,
    legend_name='average_age'
).add_to(m)

folium.Choropleth(
    geo_data=club_catch,
    name='Population',
    data=stadiums_data2[["Team", "population"]],
    columns=['Team', 'population'],
    key_on='feature.properties.Team',
    fill_color='YlGn',
    fill_opacity=0.8,
    line_opacity=0.8,
    legend_name='population'
).add_to(m)

folium.LayerControl().add_to(m)

Out[48]:

<folium.map.LayerControl at 0x1a22939a20>

Google Chrome map display fix: https://github.com/python-visualization/folium/issues/812