Built-Up Area Leafiness analysis

by Robin Wilson ([email protected])

Import relevant libraries

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

Read the CSV data, which was exported from the GIS data using QGIS.

In [2]:
df = pd.read_csv('BuiltUpAreas_WithLeaf_WithArea.csv')
In [3]:
df.columns
Out[3]:
Index(['bua11cd', 'bua11nm', 'bua_id', 'has_sd', 'sd_count', 'label', 'name',
       'Leaf_mean', 'Leaf_media', 'Leaf_stdev', 'Leaf_min', 'Leaf_max', 'area',
       'perimeter'],
      dtype='object')

How many rows are there to start with?

In [4]:
len(df)
Out[4]:
5360

How many rows are left if we exclude Built-Up Areas (BUAs) under 10 km^2?

In [5]:
len(df[df.area > 10000000])
Out[5]:
161

OK, let's do the rest of our analysis with just these large areas.

In [6]:
large = df[df.area > 10000000]
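
The literal `10000000` in the filter above is 10 km^2 written in square metres. Here is a minimal sketch of spelling that conversion out, assuming the `area` column is in square metres (the notebook does not state the units; this is inferred from the threshold). The names `M2_PER_KM2`, `km2_to_m2` and `MIN_AREA_M2` are made up for illustration:

```python
# Hypothetical helper: express the area cut-off in km^2 instead of a raw literal.
# Assumes the 'area' column holds square metres (not stated in the source).
M2_PER_KM2 = 1000 * 1000  # 1 km = 1000 m, so 1 km^2 = 1,000,000 m^2


def km2_to_m2(area_km2):
    """Convert an area in square kilometres to square metres."""
    return area_km2 * M2_PER_KM2


MIN_AREA_M2 = km2_to_m2(10)
print(MIN_AREA_M2)
```

With this in place the filter could be written as `df[df.area > MIN_AREA_M2]`, which keeps the 10 km^2 cut-off visible in the code.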

What are the top areas if we sort by mean leafiness?

In [7]:
large.sort_values("Leaf_mean", ascending=False).head()
Out[7]:
        bua11cd         bua11nm  bua_id  has_sd  sd_count      label            name  Leaf_mean  Leaf_media  Leaf_stdev   Leaf_min  Leaf_max          area   perimeter
 251  E34004906  Winchester BUA    4605       Y         2  E34004906  Winchester BUA   0.232308    0.225365    0.091031   0.006093  0.579025  1.232749e+07   52200.004
4763  E34004756   Northwich BUA    5080       Y         2  E34004756   Northwich BUA   0.181535    0.176116    0.061717  -0.021819  0.478769  1.441000e+07  104700.214
4629  E34004560  Maidenhead BUA    2820       Y         5  E34004560  Maidenhead BUA   0.181376    0.175004    0.075712  -0.056126  0.528889  1.853756e+07   93600.066
  16  E34001481     Heswall BUA     724       N         0  E34001481     Heswall BUA   0.167553    0.163465    0.047651   0.037823  0.393085  1.017750e+07   42999.946
 257  E34004941   Worcester BUA    1265       Y         2  E34004941   Worcester BUA   0.160467    0.155388    0.048406  -0.008730  0.441869  2.466749e+07   96100.084

What are the top areas if we sort by median leafiness?

In [8]:
large.sort_values("Leaf_media", ascending=False).head()
Out[8]:
        bua11cd            bua11nm  bua_id  has_sd  sd_count      label               name  Leaf_mean  Leaf_media  Leaf_stdev   Leaf_min  Leaf_max          area   perimeter
 251  E34004906     Winchester BUA    4605       Y         2  E34004906     Winchester BUA   0.232308    0.225365    0.091031   0.006093  0.579025  1.232749e+07   52200.004
4763  E34004756      Northwich BUA    5080       Y         2  E34004756      Northwich BUA   0.181535    0.176116    0.061717  -0.021819  0.478769  1.441000e+07  104700.214
4629  E34004560     Maidenhead BUA    2820       Y         5  E34004560     Maidenhead BUA   0.181376    0.175004    0.075712  -0.056126  0.528889  1.853756e+07   93600.066
  16  E34001481        Heswall BUA     724       N         0  E34001481        Heswall BUA   0.167553    0.163465    0.047651   0.037823  0.393085  1.017750e+07   42999.946
 218  E34004294  Great Malvern BUA    1268       N         0  E34004294  Great Malvern BUA   0.159090    0.157340    0.046503  -0.017508  0.425216  1.352000e+07   86700.022
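
The mean and median rankings above differ only in fifth place. A small sketch that makes the comparison explicit, using the names copied from the Out[7] and Out[8] outputs (plain Python lists rather than the DataFrame, so that it runs on its own):

```python
# Top five BUA names, copied from the Out[7] (mean) and Out[8] (median) tables.
top_by_mean = ["Winchester BUA", "Northwich BUA", "Maidenhead BUA",
               "Heswall BUA", "Worcester BUA"]
top_by_median = ["Winchester BUA", "Northwich BUA", "Maidenhead BUA",
                 "Heswall BUA", "Great Malvern BUA"]

# Names that appear in one ranking but not the other.
only_mean = sorted(set(top_by_mean) - set(top_by_median))
only_median = sorted(set(top_by_median) - set(top_by_mean))

print("Only in mean top five:  ", only_mean)
print("Only in median top five:", only_median)
```

On the DataFrame itself, the same lists could be taken with something like `large.sort_values("Leaf_mean", ascending=False).head()["name"].tolist()`.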

What are the lowest areas if we sort by mean leafiness?

In [9]:
large.sort_values("Leaf_mean", ascending=True).head()
Out[9]:
        bua11cd             bua11nm  bua_id  has_sd  sd_count      label                name  Leaf_mean  Leaf_media  Leaf_stdev   Leaf_min  Leaf_max          area     perimeter
4911  E34004978           Grays BUA    5708       Y         4  E34004978           Grays BUA   0.046586    0.045719    0.025500  -0.040814  0.195052  2.635254e+07  1.115000e+05
4744  E34004846          Thanet BUA    4243       Y         3  E34004846          Thanet BUA   0.050020    0.043096    0.045293  -0.166400  0.347691  2.789248e+07  9.520007e+04
 142  E34004707  Greater London BUA    5705       Y       104  E34004707  Greater London BUA   0.061683    0.057584    0.036802  -0.095982  0.583697  1.737855e+09  3.256799e+06
4708  E34004682       Stevenage BUA    3252       Y         2  E34004682       Stevenage BUA   0.063386    0.062211    0.021998  -0.017950  0.178562  2.189499e+07  5.230007e+04
4833  E34004858          Exeter BUA       4       Y         3  E34004858          Exeter BUA   0.066929    0.062454    0.031770  -0.025321  0.275862  2.849248e+07  1.078999e+05

Which areas have the most variability in their leafiness? Each area has a very different mean leafiness value, so we can't just compare standard deviation values. Instead, we'll calculate the coefficient of variation (standard deviation as a proportion of the mean) and rank the areas by that.

In [10]:
# assign() returns a new DataFrame, which avoids the SettingWithCopyWarning
# that `large['Leaf_cv'] = ...` raises when `large` is a filtered slice of `df`
large = large.assign(Leaf_cv=large.Leaf_stdev / large.Leaf_mean)
In [11]:
large.sort_values("Leaf_cv", ascending=False).head()
Out[11]:
        bua11cd                 bua11nm  bua_id  has_sd  sd_count      label                    name  Leaf_mean  Leaf_media  Leaf_stdev   Leaf_min  Leaf_max          area     perimeter   Leaf_cv
4744  E34004846              Thanet BUA    4243       Y         3  E34004846              Thanet BUA   0.050020    0.043096    0.045293  -0.166400  0.347691  2.789248e+07  9.520007e+04  0.905490
 308  E34004644          Felixstowe BUA    4247       Y         2  E34004644          Felixstowe BUA   0.088974    0.079795    0.059855  -0.086180  0.437179  1.206250e+07  4.910002e+04  0.672721
 306  E34004640             Reading BUA    5244       Y         8  E34004640             Reading BUA   0.086041    0.076874    0.055083  -0.093819  0.477467  8.369745e+07  2.895002e+05  0.640186
4866  E34004900           Blackpool BUA     707       Y         7  E34004900           Blackpool BUA   0.076584    0.070172    0.047481  -0.052320  0.376961  6.126501e+07  1.720999e+05  0.619988
  81  E34005054  Greater Manchester BUA    5071       Y        72  E34005054  Greater Manchester BUA   0.097402    0.084551    0.059387  -0.151687  0.581677  6.302525e+08  1.630301e+06  0.609703

And which areas have the least variability?

In [12]:
large.sort_values('Leaf_cv', ascending=True).head()
Out[12]:
        bua11cd       bua11nm  bua_id  has_sd  sd_count      label          name  Leaf_mean  Leaf_media  Leaf_stdev  Leaf_min  Leaf_max          area      perimeter   Leaf_cv
  70  E34004403      Bath BUA    1266       N         0  E34004403      Bath BUA   0.129506    0.126995    0.027902  0.031190  0.307061  2.423750e+07   92099.940000  0.215446
  22  E34001962    Yeovil BUA     722       N         0  E34001962    Yeovil BUA   0.124182    0.122399    0.028439  0.023708  0.302564  1.256501e+07   42000.094000  0.229008
4653  E34004587  Stafford BUA    4492       Y         3  E34004587  Stafford BUA   0.150732    0.148805    0.037492  0.026252  0.365269  2.045752e+07   90700.098005  0.248732
 230  E34005036      York BUA    2279       Y         2  E34005036      York BUA   0.105797    0.104355    0.026749  0.004041  0.260249  3.402500e+07  118499.958002  0.252833
 175  E34004487    Durham BUA    5436       N         0  E34004487    Durham BUA   0.147859    0.144863    0.039650  0.011482  0.320897  1.286749e+07   50299.956000  0.268158
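
As a check on the coefficient of variation, it can be recomputed by hand from the rounded mean and standard deviation shown in the outputs above. A minimal sketch in plain Python follows; the function name `coefficient_of_variation` is made up here, and any small differences from the `Leaf_cv` column come from using the rounded displayed values:

```python
def coefficient_of_variation(stdev, mean):
    """Standard deviation as a proportion of the mean."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return stdev / mean


# (Leaf_stdev, Leaf_mean) pairs copied from the outputs above.
rows = {
    "Thanet BUA": (0.045293, 0.050020),  # highest Leaf_cv in Out[11]
    "Bath BUA": (0.027902, 0.129506),    # lowest Leaf_cv in Out[12]
}

for name, (stdev, mean) in rows.items():
    print(name, round(coefficient_of_variation(stdev, mean), 4))
```

Thanet's standard deviation is almost as large as its mean, whereas Bath's is roughly a fifth of its mean. Note that the ratio becomes unstable as the mean approaches zero, so areas with a very low mean leafiness will tend to rank high on this measure.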

Now let's have a look at this on a graph...

In [13]:
from bokeh.plotting import figure, ColumnDataSource
from bokeh.models import HoverTool


def scatter_with_hover(df, x, y,
                       fig=None, cols=None, name=None, marker='x',
                       fig_width=500, fig_height=500, **kwargs):
    """
    Plots an interactive scatter plot of `x` vs `y` using bokeh, with automatic
    tooltips showing columns from `df`.

    Parameters
    ----------
    df : pandas.DataFrame
        DataFrame containing the data to be plotted
    x : str
        Name of the column to use for the x-axis values
    y : str
        Name of the column to use for the y-axis values
    fig : bokeh.plotting.Figure, optional
        Figure on which to plot (if not given then a new figure will be created)
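    cols : list of str, optional
        Columns to show in the hover tooltip (default is to show all)
    name : str, optional
        Bokeh series name to give to the scattered data (default is 'main')
    marker : str, optional
        Name of marker to use for scatter plot
    fig_width, fig_height : int, optional
        Width and height of the figure created when `fig` is not given
    **kwargs
        Any further arguments to be passed to fig.scatter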

    Returns
    -------
    bokeh.plotting.Figure
        Figure (the same as given, or the newly created figure)

    Example
    -------
    fig = scatter_with_hover(df, 'A', 'B')
    show(fig)

    fig = scatter_with_hover(df, 'A', 'B', cols=['C', 'D', 'E'], marker='x', color='red')
    show(fig)

    Author
    ------
    Robin Wilson <[email protected]>
    with thanks to Max Albert for original code example
    """

    # If we haven't been given a Figure obj then create it with default
    # size etc.
    if fig is None:
        fig = figure(width=fig_width, height=fig_height, tools=['box_zoom', 'reset'])

    # We're getting data from the given dataframe
    source = ColumnDataSource(data=df)

    # We give the 'series' a name so that it can be identified on the
    # plot. You can specify it (in case it needs to be something
    # specific for other reasons), otherwise we just use 'main'
    if name is None:
        name = 'main'

    # Actually do the scatter plot - the easy bit
    # (other keyword arguments will be passed to this function)
    renderer = fig.scatter(x, y, source=source, name=name, marker=marker, **kwargs)

    # Now we create the hover tool, and make sure it is only active with
    # the series we plotted in the previous line. Recent Bokeh releases no
    # longer accept `names=[...]` here, so we pass the renderer itself.
    hover = HoverTool(renderers=[renderer])

    if cols is None:
        # Display *all* columns in the tooltips
        hover.tooltips = [(c, '@' + c) for c in df.columns]
    else:
        # Display just the given columns in the tooltips
        hover.tooltips = [(c, '@' + c) for c in cols]

    #hover.tooltips.append(('index', '$index'))

    # Finally add/enable the tool
    fig.add_tools(hover)

    return fig
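
The tooltip list built inside `scatter_with_hover` pairs each label with a Bokeh field reference of the form `@column`. Bokeh needs braces, as in `@{column name}`, when a column name contains spaces, so a slightly more defensive way to build the list might look like the sketch below. The helper name `build_tooltips` is hypothetical; it is not part of Bokeh or of the original function:

```python
def build_tooltips(columns):
    """Return (label, field) pairs in the form used for HoverTool.tooltips.

    Wrapping the name in braces keeps the field reference valid even when
    the column name contains spaces.
    """
    return [(col, "@{" + col + "}") for col in columns]


print(build_tooltips(["name", "Leaf_mean"]))
```

None of the columns in this dataset contain spaces, so the original `'@' + c` form works here; the braces only matter if the function is reused on other DataFrames.
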
In [14]:
fig = scatter_with_hover(large, 'Leaf_mean', 'Leaf_cv', cols=['name'])
fig.xaxis.axis_label = "Leafiness Mean"
fig.yaxis.axis_label = "Leafiness CV"
In [15]:
from bokeh.io import output_notebook
from bokeh.plotting import show
In [16]:
output_notebook()
Loading BokehJS ...
In [17]:
show(fig)