%matplotlib inline
import pandas as pd
import numpy as np
import pylab as plt
The Shannon-Weiner index was developed from information theory and is based on measuring uncertainty. The degree of uncertainty of predicting the species of a random sample is related to the diversity of a community. If a community is dominated by one species (low diversity), the uncertainty of prediction is low; a randomly-sampled species is most likely going to be the dominant species. However, if diversity is high, uncertainty is high.
$$H' = \sum_i p_i \log(p_i)$$This function calculates Shannon diversity:
# Shannon index
shannon = lambda x: -np.sum([p*np.log(p) for p in x/float(sum(x)) if p])
The Simpson index (1949) measures the degree of concentration when individuals are classified into types. It is based on the expected outcome of two random samples from the same community:
$$D = \frac{N(N-1)}{\sum_i n_i (n_i-1)}$$This function calculates Simpson's index of diversity:
simpsons = lambda x: sum(x)*(sum(x)-1)/np.sum([float(n-1)*n for n in x if n])
genus_raw = pd.read_excel('OTU.xlsx', 'L5', index_col=0).unstack().reset_index()
genus_raw.head()
level_0 | Taxon | 0 | |
---|---|---|---|
0 | N3s | Bacteria;Acidobacteria;Acidobacteria_Gp16;Gp16... | 0 |
1 | N3s | Bacteria;Acidobacteria;Acidobacteria_Gp6;Gp6;O... | 0 |
2 | N3s | Bacteria;Actinobacteria;Actinobacteria;Acidimi... | 0 |
3 | N3s | Bacteria;Actinobacteria;Actinobacteria;Acidimi... | 0 |
4 | N3s | Bacteria;Actinobacteria;Actinobacteria;Actinom... | 0 |
taxa = genus_raw.Taxon.unique()
taxa_dict = dict(zip(taxa, range(len(taxa))))
genus_raw.Taxon.replace(taxa_dict, inplace=True)
genus_raw['nec'] = genus_raw.level_0.apply(lambda x: x.startswith('N'))
genus_raw['stool'] = genus_raw.level_0.apply(lambda x: x.endswith('s'))
genus_raw['patient'] = genus_raw.level_0.apply(lambda x: int(''.join([s for s in x if s.isdigit()])))
genus_data = genus_raw.drop('level_0', axis=1)
genus_data.columns = 'taxon', 'count', 'nec', 'stool', 'patient'
genus_data.tail()
taxon | count | nec | stool | patient | |
---|---|---|---|---|---|
16425 | 305 | 0 | True | False | 2 |
16426 | 306 | 0 | True | False | 2 |
16427 | 307 | 0 | True | False | 2 |
16428 | 308 | 0 | True | False | 2 |
16429 | 309 | 0 | True | False | 2 |
tissue_data = genus_data[genus_data.stool==False]
simpsons_tissue = tissue_data.groupby(['patient', 'nec'])['count'].aggregate(simpsons)
simpsons_tissue.unstack().rename(columns={False: 'Control', True: 'NEC'}).boxplot(grid=False)
plt.ylabel('Simpson\'s index');
/usr/local/lib/python3.4/site-packages/pandas/tools/plotting.py:2625: FutureWarning: The default value for 'return_type' will change to 'axes' in a future release. To use the future behavior now, set return_type='axes'. To keep the previous behavior and silence this warning, set return_type='dict'. warnings.warn(msg, FutureWarning)
shannon_tissue = tissue_data.groupby(['patient', 'nec'])['count'].aggregate(shannon)
shannon_tissue.unstack().rename(columns={False: 'Control', True: 'NEC'}).boxplot(grid=False)
plt.ylabel('Shannon index');
/usr/local/lib/python3.4/site-packages/pandas/tools/plotting.py:2625: FutureWarning: The default value for 'return_type' will change to 'axes' in a future release. To use the future behavior now, set return_type='axes'. To keep the previous behavior and silence this warning, set return_type='dict'. warnings.warn(msg, FutureWarning)
# Chao 1 index
chao = lambda x: len(x) + ((x==1).sum())**2/(2 * (x==2).sum())
chao_tissue = tissue_data.groupby(['patient', 'nec'])['count'].aggregate(chao)
chao_tissue.unstack().rename(columns={False: 'Control', True: 'NEC'}).boxplot(grid=False)
plt.ylabel('Chao index');
/usr/local/lib/python3.4/site-packages/pandas/tools/plotting.py:2625: FutureWarning: The default value for 'return_type' will change to 'axes' in a future release. To use the future behavior now, set return_type='axes'. To keep the previous behavior and silence this warning, set return_type='dict'. warnings.warn(msg, FutureWarning)
stool_data = genus_data[genus_data.stool==True]
simpsons_stool = stool_data.groupby(['patient', 'nec'])['count'].aggregate(simpsons)
simpsons_stool.unstack().rename(columns={False: 'Control', True: 'NEC'}).boxplot(grid=False)
plt.ylabel('Simpson\'s index');
/usr/local/lib/python3.4/site-packages/pandas/tools/plotting.py:2625: FutureWarning: The default value for 'return_type' will change to 'axes' in a future release. To use the future behavior now, set return_type='axes'. To keep the previous behavior and silence this warning, set return_type='dict'. warnings.warn(msg, FutureWarning)
shannon_stool = stool_data.groupby(['patient', 'nec'])['count'].aggregate(shannon)
shannon_stool.unstack().rename(columns={False: 'Control', True: 'NEC'}).boxplot(grid=False)
plt.ylabel('Shannon index');
/usr/local/lib/python3.4/site-packages/pandas/tools/plotting.py:2625: FutureWarning: The default value for 'return_type' will change to 'axes' in a future release. To use the future behavior now, set return_type='axes'. To keep the previous behavior and silence this warning, set return_type='dict'. warnings.warn(msg, FutureWarning)
chao_stool = stool_data.groupby(['patient', 'nec'])['count'].aggregate(chao)
chao_stool.unstack().rename(columns={False: 'Control', True: 'NEC'}).boxplot(grid=False)
plt.ylabel('Chao index');
/usr/local/lib/python3.4/site-packages/pandas/tools/plotting.py:2625: FutureWarning: The default value for 'return_type' will change to 'axes' in a future release. To use the future behavior now, set return_type='axes'. To keep the previous behavior and silence this warning, set return_type='dict'. warnings.warn(msg, FutureWarning)
simpsons_nec = genus_data[genus_data.nec==True].groupby(['patient','stool'])['count'].aggregate(simpsons)
simpsons_nec.unstack().rename(columns={False: 'Tissue', True: 'Stool'}).dropna()
stool | Tissue | Stool |
---|---|---|
patient | ||
1 | 3.186187 | 2.501635 |
2 | 1.145031 | 1.240411 |
3 | 13.310545 | 1.860451 |
4 | 4.025034 | 2.144758 |
5 | 3.065582 | 3.627015 |
7 | 1.052006 | 1.123837 |
8 | 9.209903 | 3.456548 |
9 | 2.006315 | 1.995825 |
10 | 2.105499 | 1.054506 |
11 | 2.077362 | 3.399880 |
len(simpsons_nec)
22
def plot_bipartite(data, label):
axes = data.unstack().rename(columns={False: 'Tissue', True: 'Stool'}).dropna().T.plot(grid=False,
color='k',
style='.-',
legend=False)
axes.set_xlim(-0.1, 1.1)
axes.set_xticks([0,1])
axes.set_xticklabels(['Tissue', 'Stool'])
axes.set_ylabel(label)
axes.set_xlabel('')
plot_bipartite(simpsons_nec, 'Simpson\'s index')
shannon_nec = genus_data[genus_data.nec==True].groupby(['patient','stool'])['count'].aggregate(shannon)
shannon_nec.unstack().rename(columns={False: 'Tissue', True: 'Stool'}).dropna()
stool | Tissue | Stool |
---|---|---|
patient | ||
1 | 1.553132 | 1.148086 |
2 | 0.370677 | 0.431641 |
3 | 3.026228 | 1.194941 |
4 | 1.942193 | 0.907746 |
5 | 1.983316 | 1.532137 |
7 | 0.150379 | 0.266457 |
8 | 2.678588 | 1.609806 |
9 | 0.787423 | 1.099322 |
10 | 0.875270 | 0.125509 |
11 | 0.988443 | 1.674464 |
plot_bipartite(shannon_nec, 'Shannon index')
chao_nec = genus_data[genus_data.nec==True].groupby(['patient','stool'])['count'].aggregate(chao)
chao_nec.unstack().rename(columns={False: 'Tissue', True: 'Stool'}).dropna()
stool | Tissue | Stool |
---|---|---|
patient | ||
1 | 324.000000 | inf |
2 | 314.166667 | 342.000000 |
3 | 340.250000 | 320.285714 |
4 | 322.500000 | 350.500000 |
5 | 322.500000 | 311.125000 |
7 | inf | 314.000000 |
8 | 347.785714 | 322.000000 |
9 | 314.166667 | 318.000000 |
10 | 334.500000 | 310.000000 |
11 | 314.000000 | 656.125000 |
Chao was very unstable, so I am not attempting to generate a graph.