In [1]:

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sn
import mpld3
import pandas as pd
import numpy as np
import imp
from mpl_toolkits.basemap import Basemap
import mplleaflet

sn.set_context('talk')

ECOREG full analysis (part 1: introduction and dimensionality reduction)¶

If possible, we'd like to get a paper from the ECOREG data, but I'm struggling to make time due to other commitments. The aim of this notebook is to set things out in a more structured way, which will hopefully make it easier to produce a draft manuscript if I get the chance.

1. Background and hypotheses¶

We are interested in understanding the relationships between in-stream ecology (as represented by various PB and MZB metrics), water chemistry and hydromorphology, especially the degree to which hydromorphologic alteration does or does not influence ecology. We know that short-term hydrological fluctuations (droughts, floods etc.) have a direct impact on species composition and abundances, but here we are more interested in long-term changes in ecological assemblages driven by modifications to the hydrological regime. The main questions we would like to address are:

Is there a difference in ecology between regulated and unregulated streams?
Are there differences in water chemistry and hydromorphology between regulated and unregulated streams?
If the answers to (1) and (2) are "yes", can the differences in water chemistry and hydromorphology explain differences in ecology?

In reality, we do not expect to find straightforward answers to these questions. The ecological and water chemistry datasets available in this study represent a single point in time, rather than the long-term average conditions in each stream. It may therefore be difficult or impossible to link these "snap-shots" of water chemistry and ecology to the long-term hydrological regime, especially because we know the ecological assemblages will be most affected by recent hydrological extremes. Furthermore, even dividing our data into simple classes such as "regulated" or "unregulated" is problematic, because our sites exhibit various degrees of regulation from "almost natural" to "heavily modified". With this in mind, it seems likely that some of the "regulated" sites might be indistinguishable from, for example, natural systems downstream of large lakes. Finally, there are significant issues of spatial and temporal scale when it comes to relating catchment scale hydromorphological indices to plot scale ecology and water chemistry metrics: the ecology and water chemistry measurements were all collected in the field at the same time and place, whereas the hydrological indicators (HIs) are estimated by aggregating catchment scale flow records for the preceding several years. For this reason alone, we might expect to find stronger statistical relationships between ecology and water chemistry than between ecology and any of the HIs.

2. Literature¶

I haven't studied anything biological since I was 16. This section summarises some results and ideas from the literature that might be useful.

There are many papers that tackle the question of defining ecologically sustainable flows for regulated river reaches (e.g. Gorla & Perona, 2013; Vogel et al., 2007; Zhang et al., 2015). However, the vast majority of those using catchment scale hydrological data consider the effects on ecology implicitly, by assuming that changes in the flow regime can be used as a direct proxy for ecological impact. In other words, they calculate a statistical measure of how much the hydrology has changed, and then assume this is proportional to effects on the ecology. Based on my (admittedly limited) literature search, very few authors have managed to link their hydrological indicators to real ecology data in a convincing way. This must have been attempted (lots of times?) and Richter et al. (1997) present a good scientific argument as to why we should expect such a relationship to exist. However, I've yet to find a clear statistical demonstration of this in the literature without resorting to repeatedly sub-dividing the data, for example by estimating a wide range of different "structure indices" (e.g. Lorenz et al., 2004), which ultimately seems a bit arbitrary, and in any case would be impossible in our study without gathering more data regarding catchment properties.

2.1. Indices of hydromorphological change¶

By far the most widely used metrics for assessing hydromorphological change are the Indicators of Hydrological Alteration (IHA) first proposed by Richter et al. (1996). The authors define 32 metrics in five classes:

Magnitude of monthly water conditions
Magnitude and duration of annual extreme water conditions
Timing of annual extreme water conditions
Frequency and duration of high and low pulses
Rate and frequency of water condition changes

These 32 parameters can either be used to assess hydromorphological change at a single location, by considering changes before and after some intervention (such as building a dam), or they can be used to compare the hydrological regimes between different sites. Most of the studies I have found focusing on hydrological alteration have taken the former approach, using a "before-after" (BA) design, where flow data collected prior to alteration are compared to measurements taken afterwards (e.g. Richter et al., 1997; Vogel et al., 2007). In our study, we do not have any data prior to alteration, but we do have an approximate "control-impact" (CI) design for the Norwegian sites: as far as possible, Susi attempted to pair regulated sites with unregulated ones that have otherwise similar characteristics. I'm not sure if this is also the case for the German sites? (It looks unlikely from the map - see below - but check this). Penas et al. (2016) reviewed the statistical power and advantages/disadvantages of different experimental design strategies in the context of assessing hydromorphological alteration. Their (rather obvious) conclusion is that "before-after-control-impact" (BACI) designs are the best, as these provide ways of estimating both spatially and temporally variable confounding factors. In comparison, BA designs are limited because they are not able to detect changes due to natural variability (i.e. the before and after periods are not expected to be the same, regardless of alteration), whereas CI designs are limited because there are no true analogues for the sites (i.e. we assume that a neighbouring unmodified site reflects what the impacted location would have looked like without alteration, which involves some big assumptions regarding site similarity). Note that the approximate CI design of our study therefore places additional constraints on what we can expect to identify statistically.

The IHA metrics are very widely used, but numerous other systems have also been proposed. Olden & Poff (2003) considered around 170 hydrological metrics applied to more than 400 streams across the USA, demonstrating that the 32 IHA parameters did a reasonable job of representing the overall variability across the wider group of 170. However, they also identified considerable redundancy even within the IHA metrics, suggesting that dimensionality reduction techniques (see below) should be used to define a subset of hydrological parameters for any given analysis.

A popular alternative to the IHA approach is the eco-surplus/deficit method proposed by Vogel et al. (2007). This provides a means of summarising hydrological changes in just a small number of simple metrics, rather than using 32 indicators as required by the IHA methodology. Gao et al. (2009) have shown that the eco-surplus/deficit approach generally does a good job of encoding the information captured by the broader suite of IHA metrics. This approach is therefore appealing, as it would dramatically simplify our analysis and would also avoid issues of collinearity in our statistical tests. Unfortunately, Vogel's method is only really suitable for a BA experimental design, which will not work with our data. Nevertheless, the basic idea of summarising hydrological regimes using flow duration curves is an interesting one, and something that may be worth exploring further when testing hypothesis 2 (see section 1).

Most hydrological indicators are designed to be used with long-term monitoring records: a number of authors state that 20 years is the usual rule of thumb to ensure stable estimates of streamflow predictability (Richter et al., 1996; Olden & Poff, 2003; Gan et al., 1991). In our dataset, we have only three years of flow data prior to ecological sampling at each site, which precludes using the full suite of IHA metrics as originally proposed by Richter et al. (1996). However, given that we have ecological data for only a single time point, I do not think this will be a major limitation - linking multi-year hydrological averages to a single round of ecological sampling is likely to be difficult regardless.

2.2. Linking hydrology and ecology¶

A few previous studies have attempted to link the IHA metrics to plot scale ecological data. Monk et al. (2007) related a large number of HIs to LIFE scores at 83 locations in England and Wales, using long-term data for both metrics (20 years for hydrology and 11 years for ecology). Unfortunately, their paper does not present the relationships identified very clearly and it also uses stepwise linear regression, which means the p-values and significance tests presented are likely to be inaccurate (see e.g. here). Yang et al. (2008) used genetic programming to relate an ecological time series (annual Shannon Index of fish species diversity) from a river in Illinois to annual hydrograph summaries calculated using the IHA method. Their study is similar to ours in that the authors did not identify a time at which hydrological modification took place - they simply looked for relationships between ecological and hydrological metrics. However, a key difference is that Yang et al. had access to a long (34 year) time series of ecological data to match the hydrological measurements. In contrast, we have just a single sampling point for ecology, but our analysis is spatially (rather than temporally) distributed across 40 different sites.

More recently, Zhang et al. (2015) have applied the eco-surplus/deficit concept first proposed by Vogel et al. (2007) in a BA study focusing on four sites in China. They demonstrated correlations between their estimated degrees of hydrologic alteration and changes in the Shannon Diversity Index (an entropy-based ecological metric), but they don't say diversity of what - algae, macroinvertebrates, fish? Their study has the advantage of having before and after ecological data for multiple years, which dramatically increases their ability to detect changes compared to our dataset. Nevertheless, the relationships identified are not clear and their interpretation seems slightly bizarre e.g. their description of Fig. 7 on page 720 makes me think they must be looking at a different plot to me!

2.3. Implications¶

My overall impression, based upon statistical considerations and a brief look at the literature, is that we cannot reasonably expect to demonstrate a clear link between HIs and ecological metrics, given the data we have available. My preliminary analysis of the ECOREG data (supported by Susi's more detailed work on the Norwegian dataset?) identified a small number of weak but significant relationships between ecology and water chemistry, but nothing of interest between ecology and hydrology. Given the considerations above this is perhaps not surprising, and I'd be wary of concluding that water chemistry has a greater influence on ecological assemblages than hydrology - it seems just as likely that the result is due to limitations in our dataset.

3. Methodology¶

3.1. Sampling sites¶

Our dataset comprises 64 sites in total, of which half (32) are considered "regulated", but with a considerable gradient from "almost natural" to "substantially modified" (is this correct - are any of them heavily modified?). 24 of the sites are located in Germany (12 regulated) and the remaining 40 are in Norway (20 regulated). See maps below.

In [2]:

# Read basic datasets

# Hydro indicators
in_xls = r'C:\Data\James_Work\Staff\Susi_S\ECOREG\Stats_Input_Data\hydro_indic.xlsx'
hi_df = pd.read_excel(in_xls, sheetname='hydro_indic', index_col=0)

# Site props
in_xls = r'C:\Data\James_Work\Staff\Susi_S\ECOREG\Stats_Input_Data\site_props.xlsx'
site_df = pd.read_excel(in_xls, sheetname='site_props', index_col=0)

# MZB
in_xls = r'C:\Data\James_Work\Staff\Susi_S\ECOREG\Stats_Input_Data\mzb_chem_ecol.xlsx'
mzb_df = pd.read_excel(in_xls, sheetname='mzb_data', index_col=0)

# PB
in_xls = r'C:\Data\James_Work\Staff\Susi_S\ECOREG\Stats_Input_Data\pb_chem_ecol.xlsx'
pb_df = pd.read_excel(in_xls, sheetname='pb_data', index_col=0)

In [3]:

# Added 27/07/2017
# Daniel recommends removing German site 107002711
# from the analysis (see e-mail from Susi received 30/05/2017 at 11.20)
hi_df = hi_df[hi_df.index != 107002711]
site_df = site_df[site_df.index != 107002711]
mzb_df = mzb_df[mzb_df.index != 107002711]
pb_df = pb_df[pb_df.index != 107002711]

In [4]:

# Map
# Map of sites
fig = plt.figure(figsize=(10, 10))

# Use Albers Equal Area projection
m = Basemap(projection='aea',
            width=1500000,
            height=1500000,
            resolution='i',
            lat_1=46.5,          # 1st standard parallel
            lat_2=66.5,          # 2st standard parallel
            lon_0=4,lat_0=56.5)  # Central point

# Add map components
m.fillcontinents (color='darkkhaki', lake_color='paleturquoise')
m.drawcountries(linewidth=1)
m.drawmapboundary(fill_color='paleturquoise')

# Map (long, lat) to (x, y) for plotting
x, y = m(site_df['lon'].values, site_df['lat'].values)

# Add to df
site_df['east'] = x
site_df['north'] = y

# Get reg and unreg
reg_df = site_df.query('regulated == 1')
ureg_df = site_df.query('regulated == 0')

# Plot
plt.plot(reg_df['east'], reg_df['north'], 'or', markersize=5, label='Regulated')
plt.plot(ureg_df['east'], ureg_df['north'], 'sk', markersize=5, label='Unregulated')
plt.title('ECOREG site locations', fontsize=18)
l = plt.legend(loc='upper left', fontsize=16, 
               frameon=True, fancybox=True)
l.get_frame().set_facecolor('paleturquoise')
l.get_frame().set_edgecolor('k')

C:\Data\Anaconda2\lib\site-packages\mpl_toolkits\basemap\__init__.py:1708: MatplotlibDeprecationWarning: The axesPatch function was deprecated in version 2.1. Use Axes.patch instead.
  limb = ax.axesPatch
C:\Data\Anaconda2\lib\site-packages\mpl_toolkits\basemap\__init__.py:1711: MatplotlibDeprecationWarning: The axesPatch function was deprecated in version 2.1. Use Axes.patch instead.
  if limb is not ax.axesPatch:

In [5]:

# Interactive map - may be useful for exploring site properties?
fig = plt.figure(figsize=(10, 10))
plt.plot(reg_df['lon'], reg_df['lat'], 'or', markersize=15, label='Regulated')
plt.plot(ureg_df['lon'], ureg_df['lat'], 'sk', markersize=15, label='Unregulated')

mplleaflet.display()

Out[5]:

3.2. Water chemistry and ecology datasets¶

All 65 sites have been surveyed for PB, and 61 of the 65 have approximately contemporaneous MZB data as well. In Norway, all ecological surveys (for both PB and MZB) took place during September 2013, but the sampling dates in Germany were more variable (add details?). For PB, the metrics measured were species richness and abundance, both for the overall assemblage and subdivided according to red algae, green algae and cyanobacteria. For MZB, a wide variety of metrics were calculated using the ASTERICS software package (version 4; reference?):

Abundance [ind/m2]
Number of Taxa
German Saprobic Index (new version)
Average score per Taxon
Diversity (Shannon-Wiener-Index)
Evenness
Grazers and scrapers [%]
Miners [%]
Xylophagous Taxa [%]
Shredders [%]
Gatherers/Collectors [%]
Active filter feeders [%]
Passive filter feeders [%]
Predators [%]
Parasites [%]
Swimming/skating [%]
Swimming/diving [%]
Burrowing/boring [%]
Sprawling/walking [%]
(Semi)sessile [%]
EPT-Taxa [%]
Number of Genera
Life Index

The ecological metrics were standardised to adjust for differences in the survey strategies used in Germany and Norway (more detail needed).

During each field visit, water samples were collected at each survey location and analysed for total nitrogen (TN), total phosphorus (TP) and total organic carbon (TOC). In-situ measurements of pH and conductivity were also taken using a hand-held probe.

3.3. Hydrological data¶

Sites were selected to be coincident with flow gauging stations, such that daily average flow data are available for all 65 locations. Long discharge datasets are available for some stations, but many of the older datasets have problems with missing data in the early parts of the series. Each site has at least three years' worth of daily flow data spanning the period immediately prior to ecological sampling.

The raw data series were processed to generate 62 hydrological parameters, closely based on the IHA metrics defined by Richter et al. (1996). The IHA software for calculating these metrics was not used, however, as for this study a single set of metrics based on the three year period prior to sampling was desired for each site, rather than splitting the data according to e.g. water or calendar years as implemented by the IHA R package. Small gaps in the daily flow series (up to 7 days in length) were filled using linear interpolation prior to calculating the metrics.

For a more detailed description of the hydrological indices used, see here.

4. Statistical analysis¶

(This section will eventually need separating into "Methods" and "Results", but for now it's easier to treat them together).

Linking hydrological and ecological datasets typically involves large numbers of variables, and many studies published in the literature resort to methods that are essentially "data dredging" for significant p-values. For example, it is common to see authors applying stepwise regression algorithms to find the "most significant" model out of many possible alternatives. Unfortunately, from the point of view of significance testing, this approach is invalid - it is extremely difficult (impossible?) to meaningfully interpret the p-values and confidence intervals obtained in this way (see e.g. Flom & Cassell, 2007; Harrell, 2011; also here and here).

As far as possible, models to be tested should be associated with specific hypotheses, derived either from prior domain knowledge or from the results of preliminary data exploration. In the workflow below, I've used the following methodology to guide my exploration of the dataset:

Use Principal Component Analysis (PCA) to identify a reduced set of metrics for more detailed examination.
Also use the PCA to look for any signs of clustering between regulated and unregulated sites in the space defined by the first few principal components (PCs). Such clusters may suggest metrics demonstrating significant differences between regulated and unregulated groups.
For the metrics identified in (2), explicitly test for differences between regulated and unregulated groups. I'll use a Bayesian reformulation of the standard t-test, but other alternatives are possible.
If significant differences in both explanatory and response variables can be identified, attempt to assess the relationships between these variables. I'll focus on multiple linear regression initially to keep things simple, but more sophisticated alternatives are also possible.

My hope is that this approach will identify some interesting patterns, without resorting to iterative model selection and/or "p-value fishing", which can dramatically bias significance tests.

4.1. Catchment typologies¶

Germany and Norway have very different hydrological, ecological and climatic regimes. In particular, hydrological responses in Norwegian streams are expected to be more heavily influenced by snow melting processes during the late spring and early summer. The image below illustrates a typical hydrological year in each country, by plotting the mean flow in each month with a 95% CI representing the variability among sites. Values are plotted as a percentage of the long-term annual mean in order to allow for differences in catchment size.

In [6]:

# Extract data of interest
df = hi_df.query('(eco_dataset=="pb") & (time_per==3)')

# Extract cols of interest
cols = ['avg%02d' % mon for mon in range(1, 13)] + ['mean',]
df = df[cols]

# Calculate monthly as % of long-term mean
for col in df.columns:
    df[col] = 100.* df[col] / df['mean']

del df['mean']
    
# Join country 
df = df.join(site_df['country'])

# Calc ststs.
grpd = df.groupby('country')
df = grpd.describe(percentiles=[0.05, 0.50, 0.95])

# Plot
fig, axes = plt.subplots(nrows=2, ncols=1, 
                         sharex=True, figsize=(10, 10))

# Germany
axes[0].fill_between(range(1, 13), 
                     df.loc['D', '5%'].values,
                     df.loc['D', '95%'].values,
                     color='r', alpha=0.2, 
                     label='Germany 90% CI')
axes[0].plot(range(1, 13), 
             df.loc['D', '50%'].values,
             'r-', label='Germany median')

axes[0].set_ylabel('Flow relative to mean (%)')
axes[0].set_title('Germany')
axes[0].legend(loc='best')

# Norway
axes[1].fill_between(range(1, 13), 
                     df.loc['N', '5%'].values,
                     df.loc['N', '95%'].values,
                     color='b', alpha=0.2, 
                     label='Norway 90% CI')
axes[1].plot(range(1, 13), 
             df.loc['N', '50%'].values,
             'b-', label='Norway median')

axes[1].set_ylabel('Flow relative to mean (%)')
axes[1].set_title('Norway')
axes[1].legend(loc='best')

plt.xlim((1, 12))
plt.xlabel('Month')

Out[6]:

Text(0.5,0,u'Month')

As expected, the differences are very stark: the German sites have high autumn and winter flows, which generally decline throughout the summer, whereas Norwegian sites have very low flows during the winter, with peak discharges due to snow melting events during May and June. The sudden increase in flows in Germany during August is a little strange - perhaps investigate further?

Given the differences in hydrology and climate, it should be no surprise that there are also substantial differences in ecology and water chemistry. In this study we are not specifically interested in characterising the hydro-climatic contrasts between Germany and Norway - our focus is instead on the effects of hydromorphologic alteration on ecology and water chemistry. Based on the plots above (and my earlier preliminary analysis), if we treat the sites from Germany and Norway as a single dataset, I suspect most of our findings will relate to differences due to geography, rather than due to river regulation. To allow for this, I propose analysing the data separately for each country - essentially defining German and Norwegian catchment "typologies", which are not expected to behave in the same way. This approach is consistent with previous work. In particular, Olden & Poff (2003) divided their data according to 6 catchment typologies, and found that streams dominated by snow melt processes (as in Norway) generally responded differently to other river categories and required the use of different sets of representative HIs. Similarly, Monk et al. (2007) used "hydrological regime classification" to group their sites prior to analysis.

4.2. Dimensionality reduction¶

The raw dataset consists of more than 100 variables in total: 62 HIs, 23 MZB metrics, 8 PB metrics, 5 water chemistry variables and two geographic co-ordinates (latitude and longitude). We also have categorical variables specifying the country (N or D) and whether a river is regulated or not. The number of possible combinations to test is therefore very large and, because many of the variables are similar in nature and meaning (e.g. there are several diversity-based ecological indices), it seems likely that collinearity will be a problem for any statistical tests. The first step in the analysis is therefore to reduce the number of metrics to a more manageable subset using a combination of domain knowledge and dimensionality reduction techniques.

Principal Component Analysis (PCA) is a popular, linear technique for dimensionality reduction. Olden & Poff (2003) suggest that it offers pragmatic variable selection in the context of eco-hydrological data analysis:

Such an approach [PCA] may be particularly useful in large-scale, data-intensive studies where indices other than hydrologic indices are being examined and related to patterns in biological data.

In the following sections, I've used PCA to identify metrics for further analysis and to highlight potentially interesting patterns.

The principal components (PCs) in a PCA are linear combinations of the original variables, which can make them difficult to interpret in a physically meaningful way. As far as possible, I'd like to continue working with a subset of the original, "raw" metrics, rather than a reprojection of them into an arbitrary sub-space. I have therefore followed the approach of Gao et al. (2009) by choosing the metric with the highest (absolute) loading to represent each PC (see below for details). Because the PCs are orthogonal, choosing one metric from each of the first few PCs should reduce correlation and avoid problems with collinearity.

The remainder of this notebook consists of PCA analyses for PB, MZB, water chemistry and HIs, all performed separately for Germany and Norway. This creates a lot of plots, so to keep things manageable I've tried to follow a consistent format throughout. For each dataset (PB, MZB, water chenistry and HIs), I've done the following:

Summarise previous observations based on my earlier exploratory analyses. These comments mostly relate to differences identified between Germany and Norway, which provide context for the split analysis presented here.
For each country:

a. Screen for data anomalies. PCA is strongly affected by outliers, so it's a good idea to investigate any possible data issues before calculating the PCs.

b. Perform the PCA.

c. Following Gao et al. (2009), use the Kaiser-Guttman Criterion (KGC) to determine the number of principal components to keep. This amounts to keeping those components with eigenvalues greater than 1. This method is rather arbitrary and has been criticised in the past, but for the work presented here it provides a simple and consistent method for selecting the number of PCs. In all cases below, the KGC selects enough PCs to explain around 80 - 90% of the overall variance, which seems reasonable to me.

d. Look at plots showing the data projected into the space defined by the first 2 and 3 PCs. Consider the loadings on these PCs to see if they are suggestive of particular groups of indices (e.g. "PC1 is mainly winter and spring high flows."). If possible, interpret the distribution of the data in terms of river regulation.

e. For each of the PCs selected by the KGC, choose the metric with the highest absolute loading to represent that PC. These are the metrics that will be considered in subsequent analyses.

To keep things tidy, I've defined a convenience function for the PCA and then shifted it into a separate Python file (available here). This code is imported into this notebook below, providing a consistent interface for the PCA and (hopefully) keeping this notebook more manageable.

In [7]:

# Import custom functions
func_path = r'C:\Data\James_Work\Staff\Susi_S\ECOREG\Python\ECOREG\final_analysis\ecoreg_code.py'

ecoreg = imp.load_source('ecoreg_code', func_path)

4.2.1. PB data¶

Overview¶

Previous work in section 2.2 of this notebook compared PB distributions in Norway and Germany. In general, the patterns for abundance and richness are very similar: the species richness and abundance of green algae and cyanobacteria in Norway are greater than in Germany, whereas the abundance and richness of red algae is broadly similar in both locations. As a result, overall richness and abundance is also greater in Norway than in Germany.

Note that the abundance scores were originally evaluated on a qualitative scale, which perhaps makes them less reliable than the richness scores (?). For this reason, I'm tempted to prefer richness metrics to abundance metrics if the choice between the two is close.

Germany¶

In [8]:

# Join categorical variables to PB data
df = pb_df.join(site_df[['country', 'regulated']])

# Select country
df = df.query('country == "D"')

# Columns to consider
pb_cols = ['pb_rich', 'green_rich', 'cyano_rich', 'red_rich', 
           'pb_abund', 'green_abund', 'cyano_abund', 'red_abund']

# Run PCA
res = ecoreg.run_pca(df, cols=pb_cols)
res

Variance explained by first 3 PCs (%):

[ 47.15103586  20.57610931  15.01732106]

Total: 82.74%

Out[8]:

	1	2	3	4	5	6	7	8
pb_rich	0.470834	-0.052387	0.167972	-0.371005	0.212707	-0.037973	-0.555186	0.504772
green_rich	0.208074	-0.601117	0.298898	-0.349261	0.233626	0.332528	0.393047	-0.253769
cyano_rich	0.414141	0.308274	-0.188102	-0.306034	0.219224	-0.589740	0.241834	-0.387400
red_rich	0.307929	0.263295	0.434615	-0.153380	-0.769717	0.112540	0.055484	-0.123415
pb_abund	0.450090	-0.141856	-0.187604	0.457102	0.019783	0.230982	-0.452128	-0.524399
green_abund	0.240947	-0.588825	-0.208872	0.262196	-0.378526	-0.470102	0.177115	0.295308
cyano_abund	0.335303	0.212500	-0.608242	-0.082205	-0.069554	0.503153	0.344949	0.297840
red_abund	0.309229	0.244054	0.457593	0.582017	0.333375	-0.003660	0.347766	0.254139

In [9]:

# Get biggest loading on each PC
n = 3
for i in range(1, n+1):
    print np.argmax(np.abs(res[i]))

pb_rich
green_rich
cyano_abund

The plot of the first two PCs suggests there may be significant differences in PB assemblages between regulated and unregulated sites in Germany: a simple linear separator at $PC1 \approx 0$ does a pretty good job of dividing the two categories. This is investigated in more detail later. Looking at the loadings, PC1 is mostly inversely correlated with overall PB richness and abundance, so the plot suggests that, in general, regulated sites have greater PB abundance and richness. This seems reasonable, as we might expect regulated sites to have more stable flow conditions than unregulated ones, potentially reducing the number of scouring and drying-out events and thereby allowing PB communities to develop more fully.

The KGC implies keeping the first 3 PCs. The highest absolute loading on PC1 is overall PB richness, for PC2 it's green algae richness and for PC3 it's cyanobacteria abundance.

Norway¶

In [10]:

# Join categorical variables to PB data
df = pb_df.join(site_df[['country', 'regulated']])

# Select country
df = df.query('country == "N"')

# Columns to consider
pb_cols = ['pb_rich', 'green_rich', 'cyano_rich', 'red_rich', 
           'pb_abund', 'green_abund', 'cyano_abund', 'red_abund']

# Run PCA
res = ecoreg.run_pca(df, cols=pb_cols)
res

Variance explained by first 3 PCs (%):

[ 61.58332351  17.04691049  16.20821227]

Total: 94.84%

Out[10]:

	1	2	3	4	5	6	7	8
pb_rich	0.433529	-0.065927	0.104958	-0.461769	-0.009239	-0.027719	0.564551	0.513701
green_rich	0.358940	-0.442501	0.148385	-0.441402	0.419547	0.106534	-0.392593	-0.342057
cyano_rich	0.332540	0.493375	0.210084	-0.294316	-0.615415	0.093263	-0.272980	-0.230844
red_rich	0.313121	0.131720	-0.601687	-0.039169	0.057863	-0.707219	-0.099245	-0.087675
pb_abund	0.428939	-0.088527	0.154789	0.453416	-0.018465	0.012646	0.513187	-0.561040
green_abund	0.352080	-0.475054	0.113645	0.442309	-0.366372	-0.082903	-0.366456	0.407944
cyano_abund	0.276072	0.537365	0.366241	0.295827	0.547688	-0.064157	-0.205330	0.259113
red_abund	0.300703	0.129627	-0.624336	0.122638	0.085325	0.684015	-0.054530	0.099037

In [11]:

# Get biggest loading on each PC
n = 3
for i in range(1, n+1):
    print np.argmax(np.abs(res[i]))

pb_rich
cyano_abund
red_abund

As above, the KGC implies keeping the first three PCs, which in this case account for nearly 95% of the overall data variance. Once again, the first PC is dominated by overall PB richness and abundance, while PC2 is predominantly correlated with the cyanobacteria metrics and PC3 is mostly related red algae. Unlike Germany, there is no clear separation visible between the regulated and unregulated sites in Norway.

PB summary¶

There is some evidence for differences in PB assemblages between regulated and unregulated sites in Germany, but not in Norway. It remains to be seen whether the German differences are significant.
The loadings are usually very similar between the richness and abundance metrics for each PB class. Because I have more faith in the data quality of the richness measurements, I will choose richness as the metric for further investigation where possible, even if the abundance loadings are marginally greater in the PCA results. This does not apply for cyanobacteria on PC3 in Germany, because the absolute value of the richness score is significantly smaller than that for abundance. Update 05/12/2017: We have decided not to do this, and to instead strictly choose the variable with the highest absolute loading
The selected metrics for further analysis for PB are:

PC	Germany	Norway
1	Overall PB richness	Overall PB richness
2	Green algae richness	Cyanobacteria abundance
3	Cyanobacteria abundance	Red algae abundance

4.2.2. MZB data¶

Overview¶

The MZB dataset is more complicated than for PB, with many more possible variables to consider. The various indicators (listed in section 3.2 above) can be divided into three broad categories: metrics describing mode of life (swimming, burrowing etc.); metrics describing feeding behaviour (miners, filter-feeders etc.); and overall ecological summaries such as the LIFE Index or the Shannon-Wiener Diversity. In general, my preference is to include the overall metrics if possible, rather than focusing on just a small subset of feeding behaviours or modes of life.

Previous exploratory analysis (section 2.1 of this notebook) compared differences between the scores for Germany and Norway and the main patterns identified included:

Norway has fewer burrowing/boring taxa than Germany.
German sites generally have few parasites or miners, whereas the Norwegian sites are more variable.
Shredders and xylophages are rare in Norway, but occur in Germany.
When it comes to filter-feeding, German streams are dominated by passive taxa, whereas the opposite is true in Norway.
There is a strong positive correlation between miners and active filter feeders, but the relationship does not extend to passive filter feeders. Miners are also strongly correlated with parasites, perhaps because some miners are parasites?
There is a positive relationship between parasites and active filter-feeders, suggesting that active filter feeders are commonly parasitised?
Most German and Norwegian sites have similar overall abundances, but the histogram for Norway is strongly skewed, indicating that a few of the Norwegian sites have very high abundance indeed.
German sites tend to have a larger number of taxa and genera i.e. more diversity (and also greater "evenness", whatever that means). Both countries have very similar distributions of EPT taxa, but the LIFE index for the Norwegian streams is much more variable than for the sites in Germany, albeit with approximately the same overall mean.

Germany (all MZB metrics)¶

Note: The code below produces a warning: ComplexWarning: Casting complex values to real discards the imaginary part. This appears to be due to very small numerical errors in the solutions for the eigenvalues and eigenvectors from the matrix calculations. The result is that some of the smaller components have very tiny complex parts (e.g. $1.10^{−15}$i). This can be ignored.

In [12]:

# Join categorical variables to MZB data
df = mzb_df.join(site_df[['country', 'regulated']])

# Select country
df = df.query('country == "D"')

# Columns to consider
mzb_cols = ['abund', 'n_taxa', 'ger_sap_idx', 'av_score_p_taxon', 
            'shan_wei_idx', 'even', 'graz_scrap', 'miners', 'xylo',
            'shred', 'gath_coll', 'acti_filt_feed', 'pass_filt_feed',
            'pred', 'para', 'swim_skat', 'swim_div', 'burr_bore', 
            'sprawl_walk', 'sessil', 'ept_taxa', 'n_genera', 'life_idx']

# Run PCA
res = ecoreg.run_pca(df, cols=mzb_cols)
res

Variance explained by first 8 PCs (%):

[ 26.45329046  22.10426832   9.58696788   9.08873262   7.82347525
   6.97031186   4.78450806   4.31772432]

Total: 91.13%

C:\Data\Anaconda2\lib\site-packages\numpy\core\numeric.py:482: ComplexWarning: Casting complex values to real discards the imaginary part
  return array(a, dtype, copy=False, order=order)

Out[12]:

	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20
abund	-0.202689	0.272146	0.132588	0.104710	0.186598	-0.171022	-0.298761	-0.011861	0.267018	-0.239648	0.113434	0.043722	-0.331697	-0.007578	0.045862	-0.418734	-0.432877	-0.155100	-0.112552	-0.110359
n_taxa	-0.307511	0.223609	0.024349	0.028847	-0.075126	0.057968	-0.340178	-0.038589	0.153406	-0.015930	0.068411	-0.048523	-0.064333	-0.032909	0.114396	0.189635	0.450553	0.039800	-0.157038	0.465787
ger_sap_idx	0.331758	0.018565	0.184829	-0.114912	-0.080460	0.257233	-0.149876	0.135523	0.242463	-0.048219	0.015919	0.072377	-0.197916	0.403770	-0.068757	-0.243327	0.217988	0.494025	-0.053626	0.011032
av_score_p_taxon	-0.288042	-0.075881	0.099462	0.284433	-0.066929	-0.263660	0.092440	-0.018283	-0.220009	0.271111	0.376452	0.333376	0.068977	0.015196	-0.251293	-0.150939	-0.098432	0.249459	0.159680	0.313794
shan_wei_idx	-0.244495	0.227708	-0.315142	-0.010861	-0.231999	0.091055	-0.015646	-0.105916	0.127790	0.116427	0.109186	-0.185957	-0.075487	0.064684	-0.060794	-0.070072	0.112952	-0.146381	-0.109457	0.171977
even	-0.084443	0.172458	-0.480222	-0.027799	-0.285666	0.012253	0.259032	-0.156110	0.055773	0.178080	-0.017508	-0.343375	-0.025055	0.131290	-0.144913	-0.165737	-0.159052	0.201803	-0.079710	-0.194206
graz_scrap	-0.148121	-0.111585	-0.357536	0.050484	0.427386	0.222806	-0.023800	0.054053	0.257062	-0.267938	-0.232830	-0.032579	0.212424	-0.032048	-0.232260	-0.133380	-0.010232	0.013152	0.443162	0.257395
miners	0.300187	0.031175	-0.054586	0.366670	-0.173890	-0.173097	-0.011439	-0.152062	0.081962	-0.029367	-0.197366	0.029483	-0.257287	-0.029114	-0.095740	0.190011	0.115347	0.022103	0.346853	0.048160
xylo	-0.147314	-0.140141	0.171693	-0.029492	-0.316890	-0.190223	0.135826	0.308157	0.655627	0.255061	-0.278106	0.119362	0.041550	-0.230355	-0.047603	0.071751	-0.055068	-0.061895	0.005803	-0.040665
shred	-0.212644	0.122741	0.407057	0.088173	-0.195159	0.062181	0.183678	0.296327	-0.132667	-0.116390	-0.049457	-0.310784	0.276881	0.264198	-0.087565	-0.327812	0.153543	-0.107476	0.256555	-0.044098
gath_coll	0.050337	-0.320979	-0.054530	-0.108902	-0.295580	-0.137629	-0.324601	-0.332672	-0.049266	-0.059419	0.036287	0.002274	0.024509	-0.283819	0.012663	-0.476244	0.343662	-0.140882	0.205325	-0.187702
acti_filt_feed	0.326231	0.030840	-0.113216	0.201268	-0.133856	-0.050259	-0.049386	0.077650	0.275527	-0.179899	0.523494	0.011359	0.348672	0.244366	-0.019381	0.156465	0.011483	-0.415672	0.001989	-0.037461
pass_filt_feed	0.105569	0.207770	-0.008460	-0.468047	0.117843	-0.330173	-0.015467	0.010425	-0.049061	0.233686	-0.035652	0.057934	-0.301368	0.332487	-0.254550	0.061214	0.059427	-0.306235	0.334898	0.097194
pred	0.015278	0.372786	-0.050992	-0.000201	-0.134714	0.299346	-0.042872	-0.086077	-0.050188	0.270514	-0.167665	0.389096	0.191665	0.068005	0.542625	-0.119175	-0.118110	-0.111371	0.306082	-0.005656
para	0.201836	0.187334	0.084418	0.339166	-0.172404	-0.226570	-0.158222	-0.273890	-0.053179	-0.143148	-0.410227	-0.045178	0.188221	0.157239	-0.126132	-0.003626	-0.214395	0.094992	-0.122715	0.202359
swim_skat	0.027368	0.315289	0.110688	-0.337883	0.083276	-0.295153	-0.073126	-0.050079	-0.087843	-0.087980	-0.199079	-0.041919	0.440713	-0.155790	-0.129214	0.009057	0.106341	0.011836	-0.203228	0.003648
swim_div	-0.033867	-0.249678	0.357901	-0.216364	-0.043378	0.131022	-0.097138	-0.416037	0.147812	0.127863	0.119992	-0.438700	0.059323	0.069969	0.139472	0.176680	-0.359923	0.028884	0.213781	0.231167
burr_bore	-0.073376	-0.195084	-0.236911	-0.023010	-0.246471	-0.233259	-0.308488	0.509301	-0.286288	-0.159534	-0.114470	-0.217494	-0.119071	0.107968	0.316831	0.101772	-0.173720	0.028150	0.108603	0.122778
sprawl_walk	-0.137868	-0.015729	-0.004709	-0.351526	-0.441758	0.151953	0.203079	-0.115766	-0.047584	-0.578155	0.026633	0.379557	-0.072219	0.001433	-0.148339	0.126142	-0.184286	0.011718	0.025808	0.128035
sessil	0.258421	0.231321	-0.127506	-0.219965	0.008055	-0.247986	-0.001966	0.137436	0.143319	-0.043711	0.315762	-0.077137	0.182866	-0.306135	0.166484	-0.072787	-0.105082	0.443000	0.225180	0.109263
ept_taxa	-0.122110	-0.333494	-0.196751	-0.145126	0.024748	-0.093269	-0.336490	-0.038970	0.085508	0.185748	-0.081002	0.255059	0.328775	0.416522	-0.059816	-0.018505	-0.111510	0.090726	-0.172154	-0.081005
n_genera	-0.312772	0.217133	0.066749	0.072933	-0.033312	0.057725	-0.337964	-0.037463	-0.000327	-0.052760	0.089239	-0.006308	0.041092	-0.001029	-0.191004	0.416786	0.026411	0.234123	0.296615	-0.572284
life_idx	-0.241994	-0.083525	-0.018770	0.035386	0.167951	-0.420564	0.360878	-0.256588	0.174787	-0.250585	-0.014030	0.023473	-0.018810	0.318179	0.466046	0.040695	0.251971	0.144354	0.050576	-0.121745

In [13]:

# Get biggest loading on each PC
n = 8
for i in range(1, n+1):
    print np.argmax(np.abs(res[i]))

ger_sap_idx
pred
even
pass_filt_feed
sprawl_walk
life_idx
life_idx
burr_bore

The KGC implies keeping 8 PCs, which together explain about 91% of the variance.

The selected metrics based on the absolute loadings for each of the first eight PCs are given in the table at the end of this section. They are dominated by mode of life and feeding habit scores, rather than by the overall metrics.

Germany (overall MZB metrics only)¶

In [14]:

# Join categorical variables to MZB data
df = mzb_df.join(site_df[['country', 'regulated']])

# Select country
df = df.query('country == "D"')

# Columns to consider
mzb_cols = ['abund', 'n_taxa', 'ger_sap_idx', 'av_score_p_taxon', 
            'shan_wei_idx', 'even', 'ept_taxa', 'n_genera', 'life_idx']

# Run PCA
res = ecoreg.run_pca(df, cols=mzb_cols)
res

Variance explained by first 3 PCs (%):

[ 47.8273187   21.52295671  15.27024946]

Total: 84.62%

Out[14]:

	1	2	3	4	5	6	7	8	9
abund	0.353281	-0.207103	0.438873	-0.124395	-0.397467	-0.222834	0.643787	0.009186	-0.045132
n_taxa	0.430468	-0.222247	0.165168	0.270036	-0.027119	-0.020976	-0.352151	0.355896	0.642850
ger_sap_idx	-0.383955	-0.356627	0.159976	0.130692	-0.067341	-0.747914	-0.288984	-0.181904	-0.047033
av_score_p_taxon	0.314110	0.404829	0.133597	-0.120616	0.695173	-0.455430	0.116923	-0.003204	0.026252
shan_wei_idx	0.386603	-0.260219	-0.389250	0.113195	0.006269	-0.166414	-0.111729	0.450875	-0.612365
even	0.208254	-0.212754	-0.714568	-0.132373	-0.033402	-0.189682	0.223879	-0.403714	0.364740
ept_taxa	0.014327	0.494011	-0.141145	0.775403	-0.268802	-0.163226	0.174952	-0.070501	-0.010053
n_genera	0.429799	-0.184838	0.232770	0.235721	0.114140	0.233142	-0.266387	-0.679079	-0.268080
life_idx	0.256980	0.474822	-0.021055	-0.439122	-0.516766	-0.196066	-0.447404	-0.088101	-0.042566

In [15]:

# Get biggest loading on each PC
n = 3
for i in range(1, n+1):
    print np.argmax(np.abs(res[i]))

n_taxa
ept_taxa
even

Norway (all MZB metrics)¶

In [16]:

# Join categorical variables to MZB data
df = mzb_df.join(site_df[['country', 'regulated']])

# Select country
df = df.query('country == "N"')

# Columns to consider
mzb_cols = ['abund', 'n_taxa', 'ger_sap_idx', 'av_score_p_taxon', 
            'shan_wei_idx', 'even', 'graz_scrap', 'miners', 'xylo',
            'shred', 'gath_coll', 'acti_filt_feed', 'pass_filt_feed',
            'pred', 'para', 'swim_skat', 'swim_div', 'burr_bore', 
            'sprawl_walk', 'sessil', 'ept_taxa', 'n_genera', 'life_idx']

# Run PCA
res = ecoreg.run_pca(df, cols=mzb_cols)
res

Variance explained by first 6 PCs (%):

[ 28.21385783  19.11215492  12.55027987  10.41135812   6.44127079
   6.25385114]

Total: 82.98%

Out[16]:

	1	2	3	4	5	6	7	8	9	10	...	14	15	16	17	18	19	20	21	22	23
abund	-0.077466	0.370889	0.176995	-0.037019	0.090266	-0.067748	0.088880	-0.228839	-0.280373	0.243098	...	0.316852	0.153176	-0.268312	-0.178901	-0.019095	-0.071006	-0.002545	-0.110800	0.064899	0.055568
n_taxa	-0.132154	0.344479	-0.258413	-0.151213	-0.132740	-0.145956	0.019273	-0.073737	-0.152895	0.224488	...	-0.194034	0.079388	0.205588	0.012938	-0.063188	-0.278754	0.264099	0.609093	0.075289	-0.137450
ger_sap_idx	0.079574	0.006273	0.051912	0.506744	-0.093524	-0.354483	0.049315	0.111271	-0.297284	0.025280	...	0.340925	-0.474177	0.087848	-0.108471	-0.043332	-0.064961	0.034104	0.001376	-0.057108	0.007877
av_score_p_taxon	0.003133	0.279696	-0.022272	-0.396158	-0.249517	-0.049904	0.163241	-0.068178	0.361515	-0.235113	...	0.258178	-0.550388	-0.159752	0.135694	0.036591	0.058981	0.012706	0.096693	-0.066537	-0.026935
shan_wei_idx	-0.251686	0.122887	-0.372824	0.057992	-0.166738	0.072124	-0.128656	-0.181265	-0.002594	-0.079695	...	0.067484	0.030205	-0.111965	0.125151	0.047190	-0.601967	-0.034700	-0.460192	-0.131661	0.094861
even	-0.234437	-0.074878	-0.295350	0.247731	-0.129072	0.172251	-0.187907	-0.299062	-0.007027	-0.141481	...	0.298186	0.101461	-0.330939	-0.069229	-0.029290	0.390840	-0.107306	0.347997	0.055985	-0.081719
graz_scrap	0.220185	0.200111	-0.206736	-0.042831	0.361671	-0.017956	0.134769	0.116715	-0.337730	-0.069188	...	-0.272234	-0.261871	-0.311010	0.016522	-0.222852	0.108870	0.177230	-0.078898	-0.057529	-0.009605
miners	-0.245929	-0.241843	0.061844	-0.288685	-0.066103	-0.189850	0.190061	0.052598	-0.155569	-0.066920	...	0.060766	0.023266	-0.163743	-0.100693	0.410802	0.067725	0.398543	-0.162536	0.469526	0.009378
xylo	-0.075362	0.085187	-0.237468	0.295290	-0.255595	-0.308451	0.239389	0.517197	0.164764	-0.091052	...	0.025398	0.350602	-0.182505	0.004986	-0.070500	0.046081	0.064703	-0.016124	0.035359	-0.012296
shred	0.042639	-0.022207	-0.140647	-0.001385	-0.243760	0.644052	0.252471	0.332598	-0.339874	0.022636	...	-0.021856	-0.074131	-0.127562	0.216694	0.117440	0.017025	-0.027799	0.041793	-0.030666	0.009856
gath_coll	0.264757	-0.205146	0.017963	0.074517	-0.250867	-0.254227	-0.235926	-0.289647	-0.105474	0.074979	...	-0.165183	0.040266	-0.229753	0.595364	-0.020365	0.111298	0.300660	-0.055602	-0.070172	-0.021730
acti_filt_feed	-0.320411	-0.011753	0.300305	-0.103826	-0.020448	-0.089201	0.062031	0.092776	-0.144844	-0.074521	...	-0.001902	0.040243	-0.065390	0.243639	-0.282513	-0.043122	-0.249027	-0.097921	-0.001512	-0.718617
pass_filt_feed	-0.161938	0.315489	0.242794	0.237875	0.033116	0.107363	-0.079761	0.066864	-0.049453	-0.095298	...	0.061159	0.026662	0.314097	0.244822	0.523433	0.136266	0.235270	0.009575	-0.275299	-0.097543
pred	-0.207011	0.030783	0.087925	0.346497	-0.048564	0.034647	0.478938	-0.249423	0.380696	0.017560	...	-0.421459	-0.129588	-0.042522	-0.025265	-0.076844	0.023964	0.097610	-0.052568	0.012737	0.021172
para	-0.261832	-0.265498	0.032451	-0.252251	0.019103	-0.153664	0.207232	0.014232	-0.101087	0.068066	...	0.057769	0.120801	-0.112426	-0.113406	-0.043764	0.076990	0.114887	0.087301	-0.772490	0.177662
swim_skat	-0.186616	-0.180841	-0.195789	0.106161	0.355648	0.087039	0.238342	-0.089304	0.126366	0.605805	...	0.211561	-0.144219	0.020336	0.283119	0.065839	0.037639	0.005668	-0.018140	0.082809	-0.080214
swim_div	0.342194	0.035069	0.101561	-0.051655	-0.113609	-0.200483	0.153585	0.068458	0.084609	0.280142	...	-0.026097	0.092981	-0.268543	0.104559	0.404793	-0.152203	-0.492008	0.175832	-0.046215	0.017520
burr_bore	-0.169519	0.282674	0.272687	0.135343	0.080422	0.097970	-0.307377	0.076395	0.098207	0.136269	...	-0.277740	-0.054504	-0.499750	-0.170682	0.141423	0.026162	0.069846	0.005578	-0.057344	0.003315
sprawl_walk	0.076166	0.197812	-0.199230	-0.027861	0.560152	-0.118466	0.025119	0.101150	0.201018	-0.289272	...	0.195857	0.217224	-0.069939	0.293106	0.072552	0.020244	0.042599	0.024723	-0.015626	-0.013137
sessil	-0.308499	0.076324	0.332211	-0.005133	0.010913	-0.020762	0.009399	0.104368	-0.095702	-0.061616	...	0.041432	0.016520	0.004458	0.406497	-0.291239	-0.027954	-0.145527	0.167666	0.212764	0.619959
ept_taxa	0.261336	0.196146	0.167714	-0.067631	-0.207815	0.203037	0.004271	0.102744	0.255130	0.315777	...	0.311782	0.204951	0.018923	-0.015456	-0.332616	0.100948	0.367074	-0.183862	-0.023702	-0.036202
n_genera	-0.121048	0.338020	-0.263254	-0.146549	-0.167987	-0.190282	0.002730	-0.068839	-0.139680	0.178012	...	-0.179562	0.026119	0.241688	0.029484	0.031540	0.547041	-0.291868	-0.350509	0.004624	0.100121
life_idx	0.245870	0.117477	0.144935	0.095975	-0.022677	0.081137	0.470781	-0.440525	-0.203284	-0.286406	...	0.050725	0.275360	0.034621	-0.008010	0.014929	-0.019063	-0.024538	0.005644	0.003427	-0.010453

23 rows × 23 columns

In [17]:

# Get biggest loading on each PC
n = 6
for i in range(1, n+1):
    print np.argmax(np.abs(res[i]))

swim_div
abund
shan_wei_idx
ger_sap_idx
sprawl_walk
shred

The KGC implies keeping 6 PCs, which explain 83% of the variance.

The selected metrics based on the absolute loadings for each of the first six PCs are given in the table at the end of this section.

Norway (Overall MZB metrics only)¶

In [18]:

# Join categorical variables to MZB data
df = mzb_df.join(site_df[['country', 'regulated']])

# Select country
df = df.query('country == "N"')

# Columns to consider
mzb_cols = ['abund', 'n_taxa', 'ger_sap_idx', 'av_score_p_taxon', 
            'shan_wei_idx', 'even', 'ept_taxa', 'n_genera', 'life_idx']

# Run PCA
res = ecoreg.run_pca(df, cols=mzb_cols)
res

Variance explained by first 3 PCs (%):

[ 37.75415326  29.06685981  13.44399019]

Total: 80.27%

Out[18]:

	1	2	3	4	5	6	7	8	9
abund	-0.235830	-0.352211	0.281209	0.580435	0.397063	0.463482	0.163455	-0.025070	-0.067287
n_taxa	-0.502155	-0.167226	0.133344	0.007866	-0.163211	-0.115364	-0.354830	-0.183164	0.708967
ger_sap_idx	0.155209	0.082880	0.807473	0.027669	-0.361476	-0.203543	0.372603	-0.066470	0.030280
av_score_p_taxon	-0.280408	-0.417315	-0.302324	-0.174600	0.013996	-0.309908	0.721531	0.035773	0.084909
shan_wei_idx	-0.465562	0.259701	0.080055	-0.304134	0.099569	0.097940	0.048184	-0.665991	-0.389266
even	-0.252863	0.466792	0.147644	-0.365278	0.277124	0.341202	0.230151	0.494115	0.269448
ept_taxa	0.136221	-0.467686	0.033614	-0.463651	-0.396634	0.622025	-0.036752	-0.016138	-0.027664
n_genera	-0.495892	-0.167508	0.152684	-0.017583	-0.222722	-0.206402	-0.280110	0.521630	-0.509554
life_idx	0.210604	-0.367196	0.327229	-0.436549	0.625466	-0.280846	-0.229857	-0.008489	-0.013091

In [19]:

# Get biggest loading on each PC
n = 3
for i in range(1, n+1):
    print np.argmax(np.abs(res[i]))

n_taxa
ept_taxa
ger_sap_idx

The first 3 PCs explain a little over 80% of the variance, but the projection of the data into the space of the first 2 or 3 PCs does not suggest a clear separation. Perhaps there is a tendency for regulated sites to have higher scores on PC1 and PC3?

PC1 is strongly inversely correlated with the number of taxa, the number of genera and the Shannon-Weiner index - in other words, PC1 seems to be largely representative of diversity. PC2, meanwhile, has large contributions (but in opposite directions) from eveness and EPT taxa, but I don't know enough about these metrics to be be able to interpret this. PC3 is overwhelmingly correlated with the German saprobic index.

MZB summary¶

All of the following are very tentative and require further investigation.

For sites in Germany, there is a suggestion that regulated rivers may have fewer tax and genera.
In Norway, there is an equally weak suggestion that the opposite is true: regulated sites may have greater abundance and diversity.
Regulated sites in Norway may have a greater proportion of sessile species and active filter feeders, but fewer swimmers and divers. They may also have lower scores on the German saprobic index.
The selected metrics for further analysis for MZB are:

PC	Germany (all metrics)	Germany (overall metrics)	Norway (all metrics)	Norway (overall metrics)
1	LIFE index	Number of taxa or genera	German saprobic index	Number of taxa or genera
2	German saprobic index	Evenness	Shannon-Weiner diversity	German saprobic index
3	Evenness	EPT taxa	Abundance	EPT taxa
4	Number of taxa or genera		Shredders
5	Passive filter feeders		Sprawlers and walkers
6	Predators		Swimmers and divers
7	Sprawlers and walkers
8	Burrowers and borers

4.2.3. Water chemistry data¶

Overview¶

We have two sets of chemistry data - one matching the PB sampling and another for the MZB sampling. For Norway, the PB and MZB surveys took place at the same time so the chemistry datasets are identical, but this is not the case in Germany. We have only 5 chemical variables, so dimensionality reduction may not be necessary, but it is nevertheless worth screening the data here and perhaps looking to see if any variables are redundant. It's also worth checking for outliers.

In [20]:

# Columns to consider
chem_cols = ['tn', 'tp', 'ph', 'cond', 'toc']

pb_chem_df = pd.melt(pb_df[chem_cols])

sn.factorplot(y='value', col='variable', data=pb_chem_df,
              kind='box', sharey=False, col_wrap=3)

Out[20]:

<seaborn.axisgrid.FacetGrid at 0xcfc0da0>

It looks as though the TP record in particular may include some outliers. There are several German sites with TP concentrations in excess of 100 ug/l, so some of the larger values, despite being high, might be genuine readings. However, the highest recorded concentration of 439 ug/l is actually from a site in Norway, and the next highest concentration recorded for a Norwegian site is just 61 ug/l. The TOC and TN values for this outlying record are also fairly low, which suggests the measurement of 439 ug/l may be an error (due to sample contamination?). I will therefore remove it from further consideration in the analysis below.

Germany¶

In [21]:

# Join categorical variables to PB data
df = pb_df.join(site_df[['country', 'regulated']])

# Select country
df = df.query('country == "D"')

# Run PCA
res = ecoreg.run_pca(df, cols=chem_cols)
res

Variance explained by first 2 PCs (%):

[ 40.65730844  38.40554088]

Total: 79.06%

Out[21]:

	1	2	3	4	5
tn	0.538594	0.195796	0.655097	-0.477909	0.118457
tp	-0.209071	0.659330	0.028845	-0.102991	-0.714236
ph	0.557756	-0.047692	-0.703687	-0.399682	-0.178079
cond	0.565478	0.320326	-0.010104	0.759681	0.020222
toc	-0.188011	0.649665	-0.273384	-0.155356	0.666118

In [22]:

# Get biggest loading on each PC
n = 2
for i in range(1, n+1):
    print np.argmax(np.abs(res[i]))

cond
tp

The first 2 PCs explain approximately 80% of the variance. PC1 is dominated by conductivity, whereas PC2 is correlated with TP and TOC. There is no obvious distinction between regulated and unregulated sites, but two of the German unregulated rivers are clearly very different to the others: their large scores on PC2 imply unusually high concentrations of TP and TOC.

Norway¶

Note that the outlier ($TP > 430 \; \mu g/l$) has been removed.

In [23]:

# Join categorical variables to PB data
df = pb_df.join(site_df[['country', 'regulated']])

# Select country
df = df.query('country == "N"')

# Remove outlier
df = df.query('tp < 430')

# Run PCA
res = ecoreg.run_pca(df, cols=chem_cols)
res

Variance explained by first 3 PCs (%):

[ 38.58174579  31.96264102  19.98102125]

Total: 90.53%

Out[23]:

	1	2	3	4	5
tn	0.661465	0.216671	0.011237	0.206646	-0.687523
tp	-0.033848	0.029105	0.998535	0.030514	0.002099
ph	-0.321633	0.640416	-0.009668	-0.630791	-0.297369
cond	0.148437	0.733070	-0.030590	0.431849	0.503135
toc	0.660179	-0.068423	0.042104	-0.609898	0.430968

In [24]:

# Get biggest loading on each PC
n = 3
for i in range(1, n+1):
    print np.argmax(np.abs(res[i]))

tn
cond
tp

This time the KGC recommends keeping the first three PCs, which explain about 91% of the variance. PC1 is dominated by TN and TOC, PC2 by pH and conductivity and PC3 almost entirely comprises TP.

There is no clear distinction between regulated versus unregulated locations, but there are some notable outliers, such as the one in the lower-left corner of the 2D plot, which likely corresponds to high TOC concentrations and low pH (i.e. acidification).

Water chemistry summary¶

There are no obvious differences in water chemistry between regulated and unregulated sites, either in Norway or Germany.
The selected metrics for further analysis of water chemistry are:

PC	Germany	Norway
1	Conductivity	TN
2	TP	Conductivity
3		TP

4.2.4. Hydrological indicators¶

Overview¶

Some of the main differences in hydrology between Norway and Germany have already been discussed above, and are the main motivation for performing separate analyses in each country.

The dataset of HIs is fairly complicated: I have calculated 62 indicators based broadly on the IHA methodology of Richter et al. (1996), but modified to calculate one set of indices for the entire three year period prior to sampling. Note that, as with water chemistry, the sampling times for PB and MZB at the German sites were not always the same, so there are two slightly different sets of HIs for the German sites - one for PB and one for MZB. However, these metrics are sufficiently similar that the conclusions based on the PB dataset (described below) also apply for MZB. For a full description of the HIs calculated, see this notebook.

Germany¶

Note: The code below produces a warning: ComplexWarning: Casting complex values to real discards the imaginary part. This appears to be due to very small numerical errors in the solutions for the eigenvalues and eigenvectors from the matrix calculations. The result is that some of the smaller components have very tiny complex parts (e.g. $1.10^{-15}i$). This can be ignored.

In [25]:

# Join categorical variables to MZB data
df = hi_df.query('(eco_dataset == "pb") and (time_per == 3)')
df = df.join(site_df[['country', 'regulated']])

# Select country
df = df.query('country == "D"')

# Columns to consider
hi_cols = ['mean', 'avg01', 'avg02', 'avg03', 'avg04', 'avg05', 'avg06', 
           'avg07', 'avg08', 'avg09', 'avg10', 'avg11', 'avg12', 
           'min01', 'min02', 'min03', 'min04', 'min05', 'min06', 
           'min07', 'min08', 'min09', 'min10', 'min11', 'min12', 
           'max01', 'max02', 'max03', 'max04', 'max05', 'max06', 
           'max07', 'max08', 'max09', 'max10', 'max11', 'max12', 
           'min', 'p05', 'p25', 'p50', 'p75', 'p95', 'max', 'range', 'iqr',
           'range90', 'cv', 'days_to_max', 'days_to_min', 'days_to_p05', 
           'days_to_p95', 'ma_07_max', 'ma_07_min', 'ma_30_max', 
           'ma_30_min', 'ma_90_max', 'ma_90_min', 'revs_per_yr',
           'n_hi_pulse_yr', 'av_fall_rt', 'av_rise_rt']

# Run PCA
res = ecoreg.run_pca(df, cols=hi_cols)

Variance explained by first 6 PCs (%):

[ 71.62762368  12.01578372   4.08925178   3.24794129   2.09915735
   1.69193292]

Total: 94.77%

In [26]:

# Get biggest loading on each PC
n = 6
for i in range(1, n+1):
    print np.argmax(np.abs(res[i]))

mean
cv
n_hi_pulse_yr
days_to_p05
days_to_max
days_to_min

The fact that the eigenvalues become small and roughly constant after about PC10 implies there is lots of collinearity (and therefore redundancy) between these metrics. The KGC implies keeping just the first 6 PCs, which together explain 95% of the overall variance in the 62-dimensional raw dataset. Note that PC1 alone explains more than 70% of the total variance.

The metrics with the highest loading on PC1 are mostly concerned with average conditions: long-term mean and median flows are prominent, as are the averages in the summer and autumn (July to November). PC2, on the other hand, is dominated more by extremes and by metrics describing flow variability and catchment "flashiness": the coefficient of variation, range of flows, typical rise rate, maximum summer flow and the number of days before sampling to the last event bigger than the 95th flow percentile.

Norway¶

In [27]:

# Join categorical variables to MZB data
df = hi_df.query('(eco_dataset == "pb") and (time_per == 3)')
df = df.join(site_df[['country', 'regulated']])

# Select country
df = df.query('country == "N"')

# Columns to consider
hi_cols = ['mean', 'avg01', 'avg02', 'avg03', 'avg04', 'avg05', 'avg06', 
           'avg07', 'avg08', 'avg09', 'avg10', 'avg11', 'avg12', 
           'min01', 'min02', 'min03', 'min04', 'min05', 'min06', 
           'min07', 'min08', 'min09', 'min10', 'min11', 'min12', 
           'max01', 'max02', 'max03', 'max04', 'max05', 'max06', 
           'max07', 'max08', 'max09', 'max10', 'max11', 'max12', 
           'min', 'p05', 'p25', 'p50', 'p75', 'p95', 'max', 'range', 'iqr',
           'range90', 'cv', 'days_to_max', 'days_to_min', 'days_to_p05', 
           'days_to_p95', 'ma_07_max', 'ma_07_min', 'ma_30_max', 
           'ma_30_min', 'ma_90_max', 'ma_90_min', 'revs_per_yr',
           'n_hi_pulse_yr', 'av_fall_rt', 'av_rise_rt']

# Run PCA
res = ecoreg.run_pca(df, cols=hi_cols)

Variance explained by first 8 PCs (%):

[ 65.21751362  14.17773608   4.62583254   3.07918883   2.4029621
   1.93276433   1.86976001   1.64948328]

Total: 94.96%

In [28]:

# Get biggest loading on each PC
n = 8
for i in range(1, n+1):
    print np.argmax(np.abs(res[i]))

mean
range
max12
revs_per_yr
days_to_p95
revs_per_yr
days_to_max
max10

The first 8 PCs explain around 95% of the variance, although it is clear that the first PC here is being heavily influenced by the outlier visible in the bottom-right of the 2D plot above. This corresponds to Storsjøen (S2.611), which appears to be much larger than any of the other Norwegian stations (the mean discharge here is $102 \; m^3/s$ for the three years prior to sampling, compared to $61 \; m^3/s$ for the next largest river in the dataset). Perhaps this site is also more heavily regulated than the others - check with Susi? Refitting the PCA with this outlier removed actually makes surprisingly little difference to to results, so I've decided to leave it in for now.

As with the German data, PC1 in Norway is dominated by metrics representing average flow conditions, such as the mean and median long-term flow and the average autumn flow (August to November). The results for PC2 are also similar, with measures of extremes, variability and rates of change being most prominent: the range, long-term maximum, spring maximum (May and June) and the typical rise rate are all weighted highly.

Looking at the plots, there is some suggestion that the regulated stations have higher values on the first PC, which broadly corresponds to generally higher flows. This is perhaps unsurprising, as it seems quite likely that the regulated rivers will be larger.

Summary of HIs¶

For both Norway and Germany, it is possible to group HIs on the first few PCs according to type: PC1 generally represents average flow conditions, whereas PC2 predominantly reflects flow variability and catchment "flashiness". Such findings are consistent with previous work. For example, Gao et al. (2009) found that:

PC1 can be interpreted as being dominated by base flow magnitude and monthly flow; PC2 can be interpreted as being dominated by high flow magnitude; PC3 can be interpreted as monthly flow, rate of change and frequency; etc.
The is a suggestion in the Norwegian HI data that regulated rivers are larger than unregulated ones. This can be investigated further, but probably isn't very interesting.
The selected metrics for further analysis of hydrology are:

PC	Germany	Norway
1	Mean	Mean
2	Coefficient of variation	Range
3	Number of high pulses	December maximum
4	Days to p05	Days to p95
5	Days to maximum	Number of reversals
6	Days to minimum	Days to maximum
7		October maximum

4.2.5. Summary of dimensionality reduction¶

This notebook has produced a lot of plots, most of which aren't especially interesting. However, the original dataset of more than 100 dimensions has been reduced to around 15 to 20 (depending on the choices made), and a number of potentially interesting patterns have been identified for subsequent analysis.

Points for further investigation are summarised below:

Test for differences in PB assemblages between regulated and unregulated sites in Germany.
For Germany, test whether regulated rivers have fewer MZB taxa and genera.
For Norway, test whether regulated sites have greater MZB abundance and diversity (and compare this with the results from Germany - see above).
Test whether regulated sites in Norway have a greater proportion of sessile species and active filter feeders, but fewer swimmers and divers.
Test whether regulated sites in Norway have lower scores on the German saprobic index.
Test whether regulated sites in Norway are larger than unregulated ones. Not very interesting, but a useful sense-check of the method.

ECOREG full analysis (part 1: introduction and dimensionality reduction)¶

1. Background and hypotheses¶

2. Literature¶

2.1. Indices of hydromorphological change¶

2.2. Linking hydrology and ecology¶

2.3. Implications¶

3. Methodology¶

3.1. Sampling sites¶

3.2. Water chemistry and ecology datasets¶

3.3. Hydrological data¶

4. Statistical analysis¶

4.1. Catchment typologies¶

4.2. Dimensionality reduction¶

4.2.1. PB data¶

Overview¶

Germany¶

Norway¶

PB summary¶

4.2.2. MZB data¶

Overview¶

Germany (all MZB metrics)¶

Germany (overall MZB metrics only)¶

Norway (all MZB metrics)¶

Norway (Overall MZB metrics only)¶

MZB summary¶

4.2.3. Water chemistry data¶

Overview¶

Germany¶

Norway¶

Water chemistry summary¶

4.2.4. Hydrological indicators¶

Overview¶

Germany¶

Norway¶

Summary of HIs¶

4.2.5. Summary of dimensionality reduction¶

This analysis is continued in the next notebook.¶