In this recipe we'll perform several analyses that were originally published in Lauber et al. (2009) (commonly referred to as the 88 Soils study). We'll focus on beta diversity analyses using scikit-bio, numpy, scipy, pandas, and matplotlib.

Background

The 88 Soils study analyzed bacterial communities in 88 soil samples taken from various locations in North and South America. The key finding of the study was that soil pH was significantly correlated with bacterial community composition. The study did not find strong correlation based on geographic location (e.g., latitude).

In this notebok, we'll reproduce several of the beta diversity analyses that lead to these conclusions. Our analyses will center around two key pieces of data: an unweighted UniFrac distance matrix (e.g., as produced by QIIME) containing distances between soil microbiome samples, and metadata about each soil sample (e.g., pH, latitude, soil moisture deficit, etc.; this is typically referred to as a mapping file in QIIME-related studies).

We won't include full details on how to generate a distance matrix from amplicon sequence data in this notebook, nor will we provide details about microbial ecology concepts and methodology (e.g., UniFrac, alpha/beta diversity, ordination techniques). For further reading, see the Studying microbial diversity chapter of An Introduction to Applied Bioinformatics and the tutorials on the QIIME website.

The original sequence data and metadata were obtained from the QIIME database (study 103), which is currently being re-vamped as Qiita, and processed as follows:

  1. The OTU table was generated using closed-reference OTU picking with QIIME 1.7.0 and Greengenes 13_8 97% OTUs. uclust was used as the underlying clustering method in pick_closed_reference_otus.py with the default 97% similarity.

  2. The unweighted UniFrac distance matrix was generated using MacQIIME 1.8.0 with the following command:

beta_diversity_through_plots.py -i table.biom -m map.txt -o bdiv-even500 -t /macqiime/greengenes/gg_13_8_otus/trees/97_otus.tree -e 500 -aO 6

Note: The distance matrix and per-sample metadata used here are not identical to the data used in the 88 Soils publication. The data used here were derived from the publication's original sequencing data, but different processing steps were used here (e.g., different software and/or versions, parameters, reference database, etc.). Thus, the results presented here do not exactly match the publication's results, but we expect to arrive at the same overall biological conclusions.

Getting started: loading the data

Before performing our analyses, we need to load the unweighted UniFrac distance matrix and metadata mapping file. Both are stored on disk as tab-separated text. We'll load the distance matrix into an skbio.DistanceMatrix object and the metadata into a pandas.DataFrame:

In [1]:
import pandas as pd
from skbio import DistanceMatrix

unifrac_dm = DistanceMatrix.read('data/88-soils/dm.txt')
df = pd.read_csv('data/88-soils/map.txt', sep='\t', index_col=0)

Let's take a quick look at our distance matrix:

In [2]:
print(unifrac_dm)
85x85 distance matrix
IDs:
'LQ2.141729', 'IT2.141720', 'HI3.141676', 'MD2.141689', 'RT1.141654', ...
Data:
[[ 0.          0.72303965  0.75669608 ...,  0.75618327  0.71983899
   0.80723333]
 [ 0.72303965  0.          0.67136327 ...,  0.71516397  0.6448562
   0.7136385 ]
 [ 0.75669608  0.67136327  0.         ...,  0.67475805  0.75112105
   0.67771871]
 ..., 
 [ 0.75618327  0.71516397  0.67475805 ...,  0.          0.76704384
   0.68621643]
 [ 0.71983899  0.6448562   0.75112105 ...,  0.76704384  0.          0.77375735]
 [ 0.80723333  0.7136385   0.67771871 ...,  0.68621643  0.77375735  0.        ]]

Notice that there are only 85 soil samples -- we're a few samples shy of the original 88 (actually, there are 89 total samples, but only 88 were used in the original paper). We have fewer samples than the published paper because our even sampling depth of 500 sequences per sample caused samples with too few sequences to be dropped.

Now let's take a look at our metadata:

In [3]:
df
Out[3]:
BarcodeSequence LinkerPrimerSequence TARGET_SUBFRAGMENT ASSIGNED_FROM_GEO EXPERIMENT_CENTER TITLE RUN_PREFIX TEXTURE EXPERIMENT_ALIAS DEPTH ... PRIMER_READ_GROUP_TAG PCR_PRIMERS LIBRARY_CONSTRUCTION_PROTOCOL ANNUAL_SEASON_PRECPT LATITUDE STUDY_REF SOIL_MOISTURE_DEFICIT REGION STUDY_CENTER Description
#SampleID
IT2.141720 ACGTGCCGTAGA CATGCTGCCTCCCGTAGGAGT V2 n CCME Pyrosequencing-Based Assessment of Soil pH as ... FA6P1OK loamy sand lauber_88_soils 0-0.05 ... V2 FWD:GCCTTGCCAGCCCGCTCAGTCAGAGTTTGATCCTGGCTCAG;... 19502440 750.0 47.183333 lauber_88_soils -262 0 CCME Pyrosequencing_based_assessment_of_soil_pH_as_...
HI3.141676 ACGCTATCTGGA CATGCTGCCTCCCGTAGGAGT V2 n CCME Pyrosequencing-Based Assessment of Soil pH as ... FA6P1OK loamy sand lauber_88_soils 0-0.05 ... V2 FWD:GCCTTGCCAGCCCGCTCAGTCAGAGTTTGATCCTGGCTCAG;... 19502440 1000.0 20.083333 lauber_88_soils 108 0 CCME Pyrosequencing_based_assessment_of_soil_pH_as_...
MD2.141689 ACTCGATTCGAT CATGCTGCCTCCCGTAGGAGT V2 n CCME Pyrosequencing-Based Assessment of Soil pH as ... FA6P1OK sandy loam lauber_88_soils 0-0.05 ... V2 FWD:GCCTTGCCAGCCCGCTCAGTCAGAGTTTGATCCTGGCTCAG;... 19502440 150.0 34.900000 lauber_88_soils 1054 0 CCME Pyrosequencing_based_assessment_of_soil_pH_as_...
CA1.141704 ACACGAGCCACA CATGCTGCCTCCCGTAGGAGT V2 n CCME Pyrosequencing-Based Assessment of Soil pH as ... FA6P1OK silt loam lauber_88_soils 0-0.05 ... V2 FWD:GCCTTGCCAGCCCGCTCAGTCAGAGTTTGATCCTGGCTCAG;... 19502440 400.0 36.050000 lauber_88_soils 198 0 CCME Pyrosequencing_based_assessment_of_soil_pH_as_...
PE5.141692 AGACTGCGTACT CATGCTGCCTCCCGTAGGAGT V2 n CCME Pyrosequencing-Based Assessment of Soil pH as ... FA6P1OK clay loam lauber_88_soils 0-0.05 ... V2 FWD:GCCTTGCCAGCCCGCTCAGTCAGAGTTTGATCCTGGCTCAG;... 19502440 5000.0 -12.633333 lauber_88_soils -1900 0 CCME Pyrosequencing_based_assessment_of_soil_pH_as_...
CO1.141714 ACATGATCGTTC CATGCTGCCTCCCGTAGGAGT V2 n CCME Pyrosequencing-Based Assessment of Soil pH as ... FA6P1OK sand lauber_88_soils 0-0.05 ... V2 FWD:GCCTTGCCAGCCCGCTCAGTCAGAGTTTGATCCTGGCTCAG;... 19502440 600.0 40.400000 lauber_88_soils -309 0 CCME Pyrosequencing_based_assessment_of_soil_pH_as_...
DF3.141696 ACCGCAGAGTCA CATGCTGCCTCCCGTAGGAGT V2 n CCME Pyrosequencing-Based Assessment of Soil pH as ... FA6P1OK loamy sand lauber_88_soils 0-0.05 ... V2 FWD:GCCTTGCCAGCCCGCTCAGTCAGAGTTTGATCCTGGCTCAG;... 19502440 1100.0 35.966667 lauber_88_soils -328 0 CCME Pyrosequencing_based_assessment_of_soil_pH_as_...
PE1.141715 ACTTGTAGCAGC CATGCTGCCTCCCGTAGGAGT V2 n CCME Pyrosequencing-Based Assessment of Soil pH as ... FA6P1OK sandy loam lauber_88_soils 0-0.05 ... V2 FWD:GCCTTGCCAGCCCGCTCAGTCAGAGTTTGATCCTGGCTCAG;... 19502440 2100.0 -13.083333 lauber_88_soils -2500 0 CCME Pyrosequencing_based_assessment_of_soil_pH_as_...
SP2.141678 AGCGCTGATGTG CATGCTGCCTCCCGTAGGAGT V2 n CCME Pyrosequencing-Based Assessment of Soil pH as ... FA6P1OK loamy sand lauber_88_soils 0-0.05 ... V2 FWD:GCCTTGCCAGCCCGCTCAGTCAGAGTTTGATCCTGGCTCAG;... 19502440 750.0 36.616667 lauber_88_soils -402 0 CCME Pyrosequencing_based_assessment_of_soil_pH_as_...
CO3.141651 ACATTCAGCGCA CATGCTGCCTCCCGTAGGAGT V2 n CCME Pyrosequencing-Based Assessment of Soil pH as ... FA6P1OK sandy loam lauber_88_soils 0-0.05 ... V2 FWD:GCCTTGCCAGCCCGCTCAGTCAGAGTTTGATCCTGGCTCAG;... 19502440 322.0 40.800000 lauber_88_soils 231 0 CCME Pyrosequencing_based_assessment_of_soil_pH_as_...
SA2.141687 AGATCGGCTCGA CATGCTGCCTCCCGTAGGAGT V2 n CCME Pyrosequencing-Based Assessment of Soil pH as ... FA6P1OK sand lauber_88_soils 0-0.05 ... V2 FWD:GCCTTGCCAGCCCGCTCAGTCAGAGTTTGATCCTGGCTCAG;... 19502440 400.0 35.366667 lauber_88_soils 198 0 CCME Pyrosequencing_based_assessment_of_soil_pH_as_...
CM1.141723 ACATCACTTAGC CATGCTGCCTCCCGTAGGAGT V2 n CCME Pyrosequencing-Based Assessment of Soil pH as ... FA6P1OK silty clay lauber_88_soils 0-0.05 ... V2 FWD:GCCTTGCCAGCCCGCTCAGTCAGAGTTTGATCCTGGCTCAG;... 19502440 850.0 33.300000 lauber_88_soils 155 0 CCME Pyrosequencing_based_assessment_of_soil_pH_as_...
LQ2.141729 ACTCACGGTATG CATGCTGCCTCCCGTAGGAGT V2 n CCME Pyrosequencing-Based Assessment of Soil pH as ... FA6P1OK silty clay loam lauber_88_soils 0-0.05 ... V2 FWD:GCCTTGCCAGCCCGCTCAGTCAGAGTTTGATCCTGGCTCAG;... 19502440 3500.0 18.300000 lauber_88_soils -2454 0 CCME Pyrosequencing_based_assessment_of_soil_pH_as_...
SR2.141673 AGCTATCCACGA CATGCTGCCTCCCGTAGGAGT V2 n CCME Pyrosequencing-Based Assessment of Soil pH as ... FA6P1OK sandy loam lauber_88_soils 0-0.05 ... V2 FWD:GCCTTGCCAGCCCGCTCAGTCAGAGTTTGATCCTGGCTCAG;... 19502440 500.0 34.683333 lauber_88_soils 324 0 CCME Pyrosequencing_based_assessment_of_soil_pH_as_...
CR1.141682 ACCACATACATC CATGCTGCCTCCCGTAGGAGT V2 n CCME Pyrosequencing-Based Assessment of Soil pH as ... FA6P1OK loam lauber_88_soils 0-0.05 ... V2 FWD:GCCTTGCCAGCCCGCTCAGTCAGAGTTTGATCCTGGCTCAG;... 19502440 850.0 33.933333 lauber_88_soils 160 0 CCME Pyrosequencing_based_assessment_of_soil_pH_as_...
VC1.141722 AGGTGTGATCGC CATGCTGCCTCCCGTAGGAGT V2 n CCME Pyrosequencing-Based Assessment of Soil pH as ... FA6P1OK sandy loam lauber_88_soils 0-0.05 ... V2 FWD:GCCTTGCCAGCCCGCTCAGTCAGAGTTTGATCCTGGCTCAG;... 19502440 500.0 35.900000 lauber_88_soils -175 0 CCME Pyrosequencing_based_assessment_of_soil_pH_as_...
IE2.141655 ACGTCTGTAGCA CATGCTGCCTCCCGTAGGAGT V2 n CCME Pyrosequencing-Based Assessment of Soil pH as ... FA6P1OK sandy loam lauber_88_soils 0-0.05 ... V2 FWD:GCCTTGCCAGCCCGCTCAGTCAGAGTTTGATCCTGGCTCAG;... 19502440 1200.0 41.800000 lauber_88_soils -610 0 CCME Pyrosequencing_based_assessment_of_soil_pH_as_...
RT2.141710 AGAGTCCTGAGC CATGCTGCCTCCCGTAGGAGT V2 n CCME Pyrosequencing-Based Assessment of Soil pH as ... FA6P1OK silty clay loam lauber_88_soils 0-0.05 ... V2 FWD:GCCTTGCCAGCCCGCTCAGTCAGAGTTTGATCCTGGCTCAG;... 19502440 840.0 31.466667 lauber_88_soils 135 0 CCME Pyrosequencing_based_assessment_of_soil_pH_as_...
BB1.141690 AAGAGATGTCGA CATGCTGCCTCCCGTAGGAGT V2 n CCME Pyrosequencing-Based Assessment of Soil pH as ... FA6P1OK sandy loam lauber_88_soils 0-0.05 ... V2 FWD:GCCTTGCCAGCCCGCTCAGTCAGAGTTTGATCCTGGCTCAG;... 19502440 1200.0 44.870000 lauber_88_soils -680 0 CCME Pyrosequencing_based_assessment_of_soil_pH_as_...
CC1.141721 ACACTAGATCCG CATGCTGCCTCCCGTAGGAGT V2 n CCME Pyrosequencing-Based Assessment of Soil pH as ... FA6P1OK sand lauber_88_soils 0-0.05 ... V2 FWD:GCCTTGCCAGCCCGCTCAGTCAGAGTTTGATCCTGGCTCAG;... 19502440 720.0 45.400000 lauber_88_soils -143 0 CCME Pyrosequencing_based_assessment_of_soil_pH_as_...
TL2.141706 AGGACGCACTGT CATGCTGCCTCCCGTAGGAGT V2 n CCME Pyrosequencing-Based Assessment of Soil pH as ... FA6P1OK silt loam lauber_88_soils 0-0.05 ... V2 FWD:GCCTTGCCAGCCCGCTCAGTCAGAGTTTGATCCTGGCTCAG;... 19502440 400.0 68.633333 lauber_88_soils -212 0 CCME Pyrosequencing_based_assessment_of_soil_pH_as_...
PE6.141700 AGAGAGCAAGTG CATGCTGCCTCCCGTAGGAGT V2 n CCME Pyrosequencing-Based Assessment of Soil pH as ... FA6P1OK clay lauber_88_soils 0-0.05 ... V2 FWD:GCCTTGCCAGCCCGCTCAGTCAGAGTTTGATCCTGGCTCAG;... 19502440 4000.0 -12.650000 lauber_88_soils -1600 0 CCME Pyrosequencing_based_assessment_of_soil_pH_as_...
HI1.141693 ACGCGATACTGG CATGCTGCCTCCCGTAGGAGT V2 n CCME Pyrosequencing-Based Assessment of Soil pH as ... FA6P1OK loam lauber_88_soils 0-0.05 ... V2 FWD:GCCTTGCCAGCCCGCTCAGTCAGAGTTTGATCCTGGCTCAG;... 19502440 250.0 20.083333 lauber_88_soils 858 0 CCME Pyrosequencing_based_assessment_of_soil_pH_as_...
PE7.141734 AGAGCAAGAGCA CATGCTGCCTCCCGTAGGAGT V2 n CCME Pyrosequencing-Based Assessment of Soil pH as ... FA6P1OK silty clay lauber_88_soils 0-0.05 ... V2 FWD:GCCTTGCCAGCCCGCTCAGTCAGAGTTTGATCCTGGCTCAG;... 19502440 4000.0 -12.650000 lauber_88_soils -1600 0 CCME Pyrosequencing_based_assessment_of_soil_pH_as_...
BF1.141647 AATCAGTCTCGT CATGCTGCCTCCCGTAGGAGT V2 n CCME Pyrosequencing-Based Assessment of Soil pH as ... FA6P1OK loam lauber_88_soils 0-0.05 ... V2 FWD:GCCTTGCCAGCCCGCTCAGTCAGAGTTTGATCCTGGCTCAG;... 19502440 1000.0 41.583333 lauber_88_soils -451 0 CCME Pyrosequencing_based_assessment_of_soil_pH_as_...
TL1.141653 AGCTTGACAGCT CATGCTGCCTCCCGTAGGAGT V2 n CCME Pyrosequencing-Based Assessment of Soil pH as ... FA6P1OK loam lauber_88_soils 0-0.05 ... V2 FWD:GCCTTGCCAGCCCGCTCAGTCAGAGTTTGATCCTGGCTCAG;... 19502440 400.0 68.633333 lauber_88_soils -212 0 CCME Pyrosequencing_based_assessment_of_soil_pH_as_...
KP1.141713 ACTACAGCCTAT CATGCTGCCTCCCGTAGGAGT V2 n CCME Pyrosequencing-Based Assessment of Soil pH as ... FA6P1OK silt loam lauber_88_soils 0-0.05 ... V2 FWD:GCCTTGCCAGCCCGCTCAGTCAGAGTTTGATCCTGGCTCAG;... 19502440 835.0 39.100000 lauber_88_soils -81 0 CCME Pyrosequencing_based_assessment_of_soil_pH_as_...
CL3.141664 ACAGTGCTTCAT CATGCTGCCTCCCGTAGGAGT V2 n CCME Pyrosequencing-Based Assessment of Soil pH as ... FA6P1OK loamy sand lauber_88_soils 0-0.05 ... V2 FWD:GCCTTGCCAGCCCGCTCAGTCAGAGTTTGATCCTGGCTCAG;... 19502440 1250.0 34.616667 lauber_88_soils -420 0 CCME Pyrosequencing_based_assessment_of_soil_pH_as_...
SK3.141716 AGCAGTCGCGAT CATGCTGCCTCCCGTAGGAGT V2 n CCME Pyrosequencing-Based Assessment of Soil pH as ... FA6P1OK loam lauber_88_soils 0-0.05 ... V2 FWD:GCCTTGCCAGCCCGCTCAGTCAGAGTTTGATCCTGGCTCAG;... 19502440 467.0 53.600000 lauber_88_soils -13 0 CCME Pyrosequencing_based_assessment_of_soil_pH_as_...
SF1.141728 AGATGTTCTGCT CATGCTGCCTCCCGTAGGAGT V2 n CCME Pyrosequencing-Based Assessment of Soil pH as ... FA6P1OK clay loam lauber_88_soils 0-0.05 ... V2 FWD:GCCTTGCCAGCCCGCTCAGTCAGAGTTTGATCCTGGCTCAG;... 19502440 250.0 35.383333 lauber_88_soils 494 0 CCME Pyrosequencing_based_assessment_of_soil_pH_as_...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
LQ1.141701 ACTATTGTCACG CATGCTGCCTCCCGTAGGAGT V2 n CCME Pyrosequencing-Based Assessment of Soil pH as ... FA6P1OK silt loam lauber_88_soils 0-0.05 ... V2 FWD:GCCTTGCCAGCCCGCTCAGTCAGAGTTTGATCCTGGCTCAG;... 19502440 5000.0 18.300000 lauber_88_soils -4112 0 CCME Pyrosequencing_based_assessment_of_soil_pH_as_...
MD4.141660 ACTCTTCTAGAG CATGCTGCCTCCCGTAGGAGT V2 n CCME Pyrosequencing-Based Assessment of Soil pH as ... FA6P1OK sandy loam lauber_88_soils 0-0.05 ... V2 FWD:GCCTTGCCAGCCCGCTCAGTCAGAGTTTGATCCTGGCTCAG;... 19502440 150.0 35.200000 lauber_88_soils 1087 0 CCME Pyrosequencing_based_assessment_of_soil_pH_as_...
IT1.141711 ACGTGAGAGAAT CATGCTGCCTCCCGTAGGAGT V2 n CCME Pyrosequencing-Based Assessment of Soil pH as ... FA6P1OK sandy loam lauber_88_soils 0-0.05 ... V2 FWD:GCCTTGCCAGCCCGCTCAGTCAGAGTTTGATCCTGGCTCAG;... 19502440 750.0 47.166667 lauber_88_soils -262 0 CCME Pyrosequencing_based_assessment_of_soil_pH_as_...
BF2.141708 AATCGTGACTCG CATGCTGCCTCCCGTAGGAGT V2 n CCME Pyrosequencing-Based Assessment of Soil pH as ... FA6P1OK loam lauber_88_soils 0-0.05 ... V2 FWD:GCCTTGCCAGCCCGCTCAGTCAGAGTTTGATCCTGGCTCAG;... 19502440 1000.0 41.583333 lauber_88_soils -451 0 CCME Pyrosequencing_based_assessment_of_soil_pH_as_...
SN3.141650 AGCGACTGTGCA CATGCTGCCTCCCGTAGGAGT V2 n CCME Pyrosequencing-Based Assessment of Soil pH as ... FA6P1OK loamy sand lauber_88_soils 0-0.05 ... V2 FWD:GCCTTGCCAGCCCGCTCAGTCAGAGTTTGATCCTGGCTCAG;... 19502440 600.0 36.450000 lauber_88_soils -252 0 CCME Pyrosequencing_based_assessment_of_soil_pH_as_...
MP2.141695 ACTGTACGCGTA CATGCTGCCTCCCGTAGGAGT V2 n CCME Pyrosequencing-Based Assessment of Soil pH as ... FA6P1OK sandy loam lauber_88_soils 0-0.05 ... V2 FWD:GCCTTGCCAGCCCGCTCAGTCAGAGTTTGATCCTGGCTCAG;... 19502440 2200.0 49.466667 lauber_88_soils -1721 0 CCME Pyrosequencing_based_assessment_of_soil_pH_as_...
AR3.141705 AACTGTGCGTAC CATGCTGCCTCCCGTAGGAGT V2 n CCME Pyrosequencing-Based Assessment of Soil pH as ... FA6P1OK clay lauber_88_soils 0-0.05 ... V2 FWD:GCCTTGCCAGCCCGCTCAGTCAGAGTTTGATCCTGGCTCAG;... 19502440 1400.0 -27.733333 lauber_88_soils -206 0 CCME Pyrosequencing_based_assessment_of_soil_pH_as_...
JT1.141699 ACGTTAGCACAC CATGCTGCCTCCCGTAGGAGT V2 n CCME Pyrosequencing-Based Assessment of Soil pH as ... FA6P1OK loamy sand lauber_88_soils 0-0.05 ... V2 FWD:GCCTTGCCAGCCCGCTCAGTCAGAGTTTGATCCTGGCTCAG;... 19502440 90.0 33.966667 lauber_88_soils 1032 0 CCME Pyrosequencing_based_assessment_of_soil_pH_as_...
BZ1.141724 ACACATGTCTAC CATGCTGCCTCCCGTAGGAGT V2 n CCME Pyrosequencing-Based Assessment of Soil pH as ... FA6P1OK silt loam lauber_88_soils 0-0.05 ... V2 FWD:GCCTTGCCAGCCCGCTCAGTCAGAGTTTGATCCTGGCTCAG;... 19502440 260.0 64.800000 lauber_88_soils 133 0 CCME Pyrosequencing_based_assessment_of_soil_pH_as_...
HI4.141735 ACGCTCATGGAT CATGCTGCCTCCCGTAGGAGT V2 n CCME Pyrosequencing-Based Assessment of Soil pH as ... FA6P1OK loam lauber_88_soils 0-0.05 ... V2 FWD:GCCTTGCCAGCCCGCTCAGTCAGAGTTTGATCCTGGCTCAG;... 19502440 1500.0 20.083333 lauber_88_soils -392 0 CCME Pyrosequencing_based_assessment_of_soil_pH_as_...
AR2.141703 AACTCGTCGATG CATGCTGCCTCCCGTAGGAGT V2 n CCME Pyrosequencing-Based Assessment of Soil pH as ... FA6P1OK clay lauber_88_soils 0-0.05 ... V2 FWD:GCCTTGCCAGCCCGCTCAGTCAGAGTTTGATCCTGGCTCAG;... 19502440 1400.0 -27.733333 lauber_88_soils -206 0 CCME Pyrosequencing_based_assessment_of_soil_pH_as_...
SV2.141666 AGCTGACTAGTC CATGCTGCCTCCCGTAGGAGT V2 n CCME Pyrosequencing-Based Assessment of Soil pH as ... FA6P1OK loamy sand lauber_88_soils 0-0.05 ... V2 FWD:GCCTTGCCAGCCCGCTCAGTCAGAGTTTGATCCTGGCTCAG;... 19502440 210.0 34.333333 lauber_88_soils 444 0 CCME Pyrosequencing_based_assessment_of_soil_pH_as_...
MD3.141707 ACTCGCACAGGA CATGCTGCCTCCCGTAGGAGT V2 n CCME Pyrosequencing-Based Assessment of Soil pH as ... FA6P1OK loamy sand lauber_88_soils 0-0.05 ... V2 FWD:GCCTTGCCAGCCCGCTCAGTCAGAGTTTGATCCTGGCTCAG;... 19502440 150.0 34.900000 lauber_88_soils 1054 0 CCME Pyrosequencing_based_assessment_of_soil_pH_as_...
HF2.141686 ACGCAACTGCTA CATGCTGCCTCCCGTAGGAGT V2 n CCME Pyrosequencing-Based Assessment of Soil pH as ... FA6P1OK sandy loam lauber_88_soils 0-0.05 ... V2 FWD:GCCTTGCCAGCCCGCTCAGTCAGAGTTTGATCCTGGCTCAG;... 19502440 1100.0 42.500000 lauber_88_soils -484 0 CCME Pyrosequencing_based_assessment_of_soil_pH_as_...
HJ2.141717 ACGGTGAGTGTC CATGCTGCCTCCCGTAGGAGT V2 n CCME Pyrosequencing-Based Assessment of Soil pH as ... FA6P1OK sandy loam lauber_88_soils 0-0.05 ... V2 FWD:GCCTTGCCAGCCCGCTCAGTCAGAGTTTGATCCTGGCTCAG;... 19502440 2000.0 44.216667 lauber_88_soils -1507 0 CCME Pyrosequencing_based_assessment_of_soil_pH_as_...
SR3.141674 AGCTCCATACAG CATGCTGCCTCCCGTAGGAGT V2 n CCME Pyrosequencing-Based Assessment of Soil pH as ... FA6P1OK loam lauber_88_soils 0-0.05 ... V2 FWD:GCCTTGCCAGCCCGCTCAGTCAGAGTTTGATCCTGGCTCAG;... 19502440 500.0 34.683333 lauber_88_soils 324 0 CCME Pyrosequencing_based_assessment_of_soil_pH_as_...
CF3.141691 ACAGAGTCGGCT CATGCTGCCTCCCGTAGGAGT V2 n CCME Pyrosequencing-Based Assessment of Soil pH as ... FA6P1OK silt loam lauber_88_soils 0-0.05 ... V2 FWD:GCCTTGCCAGCCCGCTCAGTCAGAGTTTGATCCTGGCTCAG;... 19502440 1300.0 42.116667 lauber_88_soils -859 0 CCME Pyrosequencing_based_assessment_of_soil_pH_as_...
SK2.141662 AGCAGCACTTGT CATGCTGCCTCCCGTAGGAGT V2 n CCME Pyrosequencing-Based Assessment of Soil pH as ... FA6P1OK sandy loam lauber_88_soils 0-0.05 ... V2 FWD:GCCTTGCCAGCCCGCTCAGTCAGAGTTTGATCCTGGCTCAG;... 19502440 467.0 53.983333 lauber_88_soils -13 0 CCME Pyrosequencing_based_assessment_of_soil_pH_as_...
AR1.141727 AACGCACGCTAG CATGCTGCCTCCCGTAGGAGT V2 n CCME Pyrosequencing-Based Assessment of Soil pH as ... FA6P1OK clay lauber_88_soils 0-0.05 ... V2 FWD:GCCTTGCCAGCCCGCTCAGTCAGAGTTTGATCCTGGCTCAG;... 19502440 1400.0 -27.733333 lauber_88_soils -206 0 CCME Pyrosequencing_based_assessment_of_soil_pH_as_...
BB2.141659 AAGCTGCAGTCG CATGCTGCCTCCCGTAGGAGT V2 n CCME Pyrosequencing-Based Assessment of Soil pH as ... FA6P1OK sandy loam lauber_88_soils 0-0.05 ... V2 FWD:GCCTTGCCAGCCCGCTCAGTCAGAGTTTGATCCTGGCTCAG;... 19502440 1200.0 44.866667 lauber_88_soils -680 0 CCME Pyrosequencing_based_assessment_of_soil_pH_as_...
GB3.141652 ACGACGTCTTAG CATGCTGCCTCCCGTAGGAGT V2 n CCME Pyrosequencing-Based Assessment of Soil pH as ... FA6P1OK clay loam lauber_88_soils 0-0.05 ... V2 FWD:GCCTTGCCAGCCCGCTCAGTCAGAGTTTGATCCTGGCTCAG;... 19502440 400.0 39.316667 lauber_88_soils -112 0 CCME Pyrosequencing_based_assessment_of_soil_pH_as_...
GB2.141732 ACCTGTCTCTCT CATGCTGCCTCCCGTAGGAGT V2 n CCME Pyrosequencing-Based Assessment of Soil pH as ... FA6P1OK loam lauber_88_soils 0-0.05 ... V2 FWD:GCCTTGCCAGCCCGCTCAGTCAGAGTTTGATCCTGGCTCAG;... 19502440 400.0 39.316667 lauber_88_soils -112 0 CCME Pyrosequencing_based_assessment_of_soil_pH_as_...
PE3.141731 AGACCGTCAGAC CATGCTGCCTCCCGTAGGAGT V2 n CCME Pyrosequencing-Based Assessment of Soil pH as ... FA6P1OK sandy loam lauber_88_soils 0-0.05 ... V2 FWD:GCCTTGCCAGCCCGCTCAGTCAGAGTTTGATCCTGGCTCAG;... 19502440 5500.0 -13.083333 lauber_88_soils -3270 0 CCME Pyrosequencing_based_assessment_of_soil_pH_as_...
MT1.141719 ACTGTCGAAGCT CATGCTGCCTCCCGTAGGAGT V2 n CCME Pyrosequencing-Based Assessment of Soil pH as ... FA6P1OK sandy loam lauber_88_soils 0-0.05 ... V2 FWD:GCCTTGCCAGCCCGCTCAGTCAGAGTTTGATCCTGGCTCAG;... 19502440 450.0 46.800000 lauber_88_soils 70 0 CCME Pyrosequencing_based_assessment_of_soil_pH_as_...
SP1.141656 AGCGAGCTATCT CATGCTGCCTCCCGTAGGAGT V2 n CCME Pyrosequencing-Based Assessment of Soil pH as ... FA6P1OK sandy loam lauber_88_soils 0-0.05 ... V2 FWD:GCCTTGCCAGCCCGCTCAGTCAGAGTTTGATCCTGGCTCAG;... 19502440 650.0 36.500000 lauber_88_soils 17 0 CCME Pyrosequencing_based_assessment_of_soil_pH_as_...
CL4.141667 ACAGTTGCGCGA CATGCTGCCTCCCGTAGGAGT V2 n CCME Pyrosequencing-Based Assessment of Soil pH as ... FA6P1OK sandy loam lauber_88_soils 0-0.05 ... V2 FWD:GCCTTGCCAGCCCGCTCAGTCAGAGTTTGATCCTGGCTCAG;... 19502440 1250.0 34.616667 lauber_88_soils -420 0 CCME Pyrosequencing_based_assessment_of_soil_pH_as_...
KP4.141733 ACTAGCTCCATA CATGCTGCCTCCCGTAGGAGT V2 n CCME Pyrosequencing-Based Assessment of Soil pH as ... FA6P1OK silt loam lauber_88_soils 0-0.05 ... V2 FWD:GCCTTGCCAGCCCGCTCAGTCAGAGTTTGATCCTGGCTCAG;... 19502440 835.0 39.100000 lauber_88_soils -81 0 CCME Pyrosequencing_based_assessment_of_soil_pH_as_...
VC2.141694 AGTACGCTCGAG CATGCTGCCTCCCGTAGGAGT V2 n CCME Pyrosequencing-Based Assessment of Soil pH as ... FA6P1OK loam lauber_88_soils 0-0.05 ... V2 FWD:GCCTTGCCAGCCCGCTCAGTCAGAGTTTGATCCTGGCTCAG;... 19502440 500.0 35.900000 lauber_88_soils -175 0 CCME Pyrosequencing_based_assessment_of_soil_pH_as_...
CL2.141671 ACAGCTAGCTTG CATGCTGCCTCCCGTAGGAGT V2 n CCME Pyrosequencing-Based Assessment of Soil pH as ... FA6P1OK sandy loam lauber_88_soils 0-0.05 ... V2 FWD:GCCTTGCCAGCCCGCTCAGTCAGAGTTTGATCCTGGCTCAG;... 19502440 1250.0 34.616667 lauber_88_soils -420 0 CCME Pyrosequencing_based_assessment_of_soil_pH_as_...
SK1.141669 AGCACGAGCCTA CATGCTGCCTCCCGTAGGAGT V2 n CCME Pyrosequencing-Based Assessment of Soil pH as ... FA6P1OK loamy sand lauber_88_soils 0-0.05 ... V2 FWD:GCCTTGCCAGCCCGCTCAGTCAGAGTTTGATCCTGGCTCAG;... 19502440 467.0 53.900000 lauber_88_soils -13 0 CCME Pyrosequencing_based_assessment_of_soil_pH_as_...

89 rows × 58 columns

Notice that there are 89 samples in the metadata DataFrame, and that the order of sample IDs in the distance matrix and metadata are not the same. This is okay: scikit-bio generally allows metadata to be a superset of the IDs present in the distance matrix (or ordination results, or whatever else you're working with). In other words, extra sample IDs in the metadata are okay, and they don't have to be in the same order as they are in the distance matrix.

Note: scikit-bio currently only allows the metadata to be a superset of the IDs in the distance matrix. There are plans to support other modes, e.g., intersecting the metadata and distance matrix, or ignoring distance matrix IDs that are not found in the mapping file. See #756 to track progress on this.

Visualizing beta diversity with ordination techniques

Now that we have our data loaded, let's visually explore the beta diversity of the 85 soils samples in our distance matrix, in the context of our per-sample metadata.

The authors presented NMDS ordination plots of the soil samples colored by pH and biome. scikit-bio doesn't have the NMDS ordination technique yet -- there's an open issue (see #99) to have this functionality added. However, we can produce similar results using principal coordinates analysis (PCoA):

In [4]:
from skbio.stats.ordination import pcoa

pcoa_results = pcoa(unifrac_dm)

Plot the PCoA results, coloring each soil sample (i.e., each point) with its latitude:

In [5]:
# display matplotlib figures within the notebook
%matplotlib inline

fig = pcoa_results.plot(df=df,
                        column='LATITUDE',
                        axis_labels=['PC 1', 'PC 2', 'PC 3'],
                        title='Samples colored by latitude',
                        cmap='Reds',
                        s=35)

There isn't much of a gradient pattern in the plot; this confirms the findings of the paper.

If we color the soil samples by their pH, we see a clear gradient pattern from left (basic) to right (acidic). This visually confirms the correlation between soil bacterial community composition and change in pH that was the key finding in the paper.

In [6]:
fig = pcoa_results.plot(df=df,
                        column='PH',
                        axis_labels=['PC 1', 'PC 2', 'PC 3'],
                        title='Samples colored by pH',
                        cmap='Reds',
                        s=35)

The only other environmental variable that was found to have significant correlation with community composition was soil moisture deficit:

In [7]:
fig = pcoa_results.plot(df=df,
                        column='SOIL_MOISTURE_DEFICIT',
                        axis_labels=['PC 1', 'PC 2', 'PC 3'],
                        title='Samples colored by soil moisture deficit',
                        cmap='Reds',
                        s=35)

We see a pattern that's somewhat similar to the pH plot above. As the paper points out, pH and soil moisture deficit are strongly correlated environmental variables, which we confirm here by computing Pearson's correlation coefficient $r$:

In [8]:
from scipy.stats import pearsonr

pearsonr(df['PH'], df['SOIL_MOISTURE_DEFICIT'])[0]
Out[8]:
0.65196220317643994

Note: The ordination plotting functionalty available in sckit-bio creates basic plots, and is intended to provide a quick look at your results in the context of metadata. For more interactivity, customization, and to generate publication-quality figures, we recommend EMPeror (paper | website).

To write the ordination results to file so that they can be loaded in EMPeror, run:

In [9]:
pcoa_results.write('data/88-soils/pcoa_results.txt')
Out[9]:
'data/88-soils/pcoa_results.txt'

Comparing distance matrices with the Mantel test

Now that we've visually explored beta diversity in the context of latitude, pH, and soil moisture deficit, let's quantify the strength of the relationships between community composition and environmental variables. There are a variety of techniques that can accomplish this; the original paper used the Mantel test to compute the correlation between the unweighted UniFrac distance matrix and distance matrices derived from environmental variables. We will replicate those analyses here.

Let's define a convenience function to compute a DistanceMatrix from a numeric environmental variable (stored as a column in a DataFrame). We'll use the Euclidean distance metric:

In [10]:
import numpy as np
import scipy.spatial.distance

def dm_from_env(df, column):
    distances = scipy.spatial.distance.pdist(df[column].values[:, np.newaxis], metric='euclidean')
    return DistanceMatrix(distances, ids=df.index)

Let's create three distance matrices from our metadata (pH, soil moisture deficit, and latitude):

In [11]:
ph_dm = dm_from_env(df, 'PH')
soil_mois_dm = dm_from_env(df, 'SOIL_MOISTURE_DEFICIT')
latitude_dm = dm_from_env(df, 'LATITUDE')

We can use the Mantel test in scikit-bio to compare the UniFrac distance matrix with each of the environmental distance matrices. We use Spearman correlation instead of Pearson correlation (the default) to match the original publication:

In [12]:
from skbio.stats.distance import mantel

mantel(unifrac_dm, ph_dm, method='spearman', strict=False)
Out[12]:
(0.77870091183431622, 0.001, 85)

We receive three values: the correlation coefficient (Spearman's $\rho$), p-value, and number of samples that matched between the two distance matrices.

Note that 85 samples were used in the test. The metadata has 89 samples but our UniFrac distance matrix only has 85 samples, so the Mantel test only considered samples that were in both distance matrices (controlled via strict=False). Just as scikit-bio doesn't require the metadata and distance matrix to be in the same order, the Mantel test reorders the samples in each distance matrix before comparing them. See the scikit-bio mantel documentation for more details.

Let's run the Mantel test on the other two environmental distance matrices we created:

In [13]:
mantel(unifrac_dm, soil_mois_dm, method='spearman', strict=False)
Out[13]:
(0.42167110480728659, 0.001, 85)
In [14]:
mantel(unifrac_dm, latitude_dm, method='spearman', strict=False)
Out[14]:
(0.22958919478208206, 0.001, 85)

pH has strong correlation ($\rho$=0.7787) with soil microbial community composition, which agrees with the major points made in the original 88 Soils publication. Soil moisture deficit has moderate correlation ($\rho=0.4217$), while latitude has weak correlation ($\rho=0.2296$). These findings are also in line with the original study results.

Note: The p-value reported for UniFrac versus latitude differs from the p-value reported in the original publication ($P > 0.15$), which was computed using PRIMER v6. We are actively investigating the discrepancy between p-values, as the paper did not find significant results, while our notebook finds significant results.

Instead of running each Mantel test separately, scikit-bio provides a convenience function for running Mantel tests for all pairs of distance matrices:

In [15]:
from skbio.stats.distance import pwmantel

pwmantel([unifrac_dm, ph_dm, soil_mois_dm, latitude_dm],
         labels=['UniFrac', 'pH', 'Soil moisture deficit', 'Latitude'],
         method='spearman', strict=False)
Out[15]:
statistic p-value n method permutations alternative
dm1 dm2
UniFrac pH 0.778701 0.001 85 spearman 999 two-sided
Soil moisture deficit 0.421671 0.001 85 spearman 999 two-sided
Latitude 0.229589 0.001 85 spearman 999 two-sided
pH Soil moisture deficit 0.348139 0.001 89 spearman 999 two-sided
Latitude 0.069542 0.037 89 spearman 999 two-sided
Soil moisture deficit Latitude 0.318524 0.001 89 spearman 999 two-sided

We obtain the same results as above, but this time they are presented in a DataFrame.

**Warning:** pwmantel doesn't correct the p-values for multiple comparisons (e.g., Bonferroni correction). It is the responsibility of the user to correct the p-values, though this feature may be added to scikit-bio in the future.

Linking evironmental variables to community structure

Besides pH, soil moisture deficit, and latitude, there are many more environmental variables in our DataFrame that we haven't explored yet. Table 2 in the original paper has Mantel test results for several other evironmental variables, with a note stating that many more were analyzed but no significant correlations were found. The paper also states that multivariate correlations did not increase the strength of the correlation coefficients; thus, single environmental variables had higher correlation than combinations of two or more environmental variables.

When there are a large number of environmental variables of potential interest, the bioenv method (originally called BIO-ENV, also known as BEST) can be helpful in finding subsets of variables that are maximally rank-correlated with the community distance matrix. See Clarke & Ainsworth (1993) for the original method publication and the scikit-bio bioenv documentation for more details.

Let's define a list of 11 numeric environmental variables of potential interest. The names match the column names in the DataFrame:

In [16]:
env_vars = [
    'TOT_ORG_CARB', 'SILT_CLAY', 'ELEVATION', 'SOIL_MOISTURE_DEFICIT',
    'CARB_NITRO_RATIO', 'ANNUAL_SEASON_TEMP', 'ANNUAL_SEASON_PRECPT',
    'PH', 'CMIN_RATE', 'LONGITUDE', 'LATITUDE'
]

Let's run bioenv with our UniFrac distance matrix, metadata, and environmental variables we defined above:

In [17]:
from skbio.stats.distance import bioenv

res = bioenv(unifrac_dm, df, env_vars)
res
Out[17]:
size correlation
vars
PH 1 0.778703
SOIL_MOISTURE_DEFICIT, PH 2 0.706704
SOIL_MOISTURE_DEFICIT, ANNUAL_SEASON_TEMP, PH 3 0.666790
SILT_CLAY, SOIL_MOISTURE_DEFICIT, ANNUAL_SEASON_TEMP, PH 4 0.629411
SILT_CLAY, SOIL_MOISTURE_DEFICIT, CARB_NITRO_RATIO, ANNUAL_SEASON_TEMP, PH 5 0.594918
SILT_CLAY, SOIL_MOISTURE_DEFICIT, CARB_NITRO_RATIO, ANNUAL_SEASON_TEMP, PH, LONGITUDE 6 0.575953
SILT_CLAY, SOIL_MOISTURE_DEFICIT, CARB_NITRO_RATIO, ANNUAL_SEASON_TEMP, ANNUAL_SEASON_PRECPT, PH, LONGITUDE 7 0.553162
TOT_ORG_CARB, SILT_CLAY, SOIL_MOISTURE_DEFICIT, CARB_NITRO_RATIO, ANNUAL_SEASON_TEMP, ANNUAL_SEASON_PRECPT, PH, LONGITUDE 8 0.535103
TOT_ORG_CARB, SILT_CLAY, SOIL_MOISTURE_DEFICIT, CARB_NITRO_RATIO, ANNUAL_SEASON_TEMP, ANNUAL_SEASON_PRECPT, PH, CMIN_RATE, LONGITUDE 9 0.509554
TOT_ORG_CARB, SILT_CLAY, SOIL_MOISTURE_DEFICIT, CARB_NITRO_RATIO, ANNUAL_SEASON_TEMP, ANNUAL_SEASON_PRECPT, PH, CMIN_RATE, LONGITUDE, LATITUDE 10 0.482862
TOT_ORG_CARB, SILT_CLAY, ELEVATION, SOIL_MOISTURE_DEFICIT, CARB_NITRO_RATIO, ANNUAL_SEASON_TEMP, ANNUAL_SEASON_PRECPT, PH, CMIN_RATE, LONGITUDE, LATITUDE 11 0.458250

We see that the "best" subset of environmental variables is a subset of size 1, which only includes pH. This result confirms the findings of the paper: not only is pH strongly correlated with community composition, but including multiple environmental variables doesn't improve the correlation over using pH alone.

Also note that pH and soil moisture deficit are the "best" subset for subset size 2. This intuitively makes sense because pH and soil moisture deficit are strongly correlated (we calculated this earlier).

The results are stored as a DataFrame, so we can easily find the subset of environmental variables with the highest correlation. In this example, it's easy to find the largest correlation coefficient (it's pH), but the following technique can be handy if the results table is large:

In [18]:
res['correlation'].idxmax()
Out[18]:
'PH'

bioenv was not used in the original paper, but it can be a useful technique when there are a large number of environmental variables.