Definition: A microbial community is just many small organisms living together.
Most often, we study bacteria and archaea (together often called prokaryotes, though some people avoid that word because prokaryotes are not a monophyletic group).
Less often, microeukaryotes.
What are the (wet-)lab tools to analyse microbial communities?
The biggest reason why we study prokaryotes more than eukaryotes is that they are easier to work with.
(Most of this also applies to amplicon and metatranscriptomics sequencing)
@HWI-D00647:25:C4P50ANXX:1:1101:1914:1994:0:N:0:AGGCAGAAAAGGCTAT
TAAAAGCATCAAAGTTTTTATAAATTTTTTATCCAACTTAAACATTTTTTTCCTCCTAA
+
<>BGGGGGCGGGGGGGGGGG>GGGGGGGGGGGEGGGGEGGGGGGGGGGGGGG>GGBGGG
@HWI-D00647:25:C4P50ANXX:1:1101:4595:1998:0:N:0:AGGCAGAAAAGGCTAT
CTCTATCGCACGCTCCAGGCGCGCTATCCCCAGGCGATAATCGATGTGATGGCACCGGC
+
:>BGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGDGGGGFGGGGGGGGGGGGGGGG
How would you process these data?
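A first, purely mechanical step is reading the FASTQ format itself: each record spans four lines (a header starting with @, the sequence, a + separator, and per-base quality characters). A minimal parsing sketch, assuming well-formed four-line records (this is an illustration, not a production parser):

```python
# a FASTQ record spans four lines: @header, sequence, + separator, qualities
fastq = """@HWI-D00647:25:C4P50ANXX:1:1101:1914:1994:0:N:0:AGGCAGAAAAGGCTAT
TAAAAGCATCAAAGTTTTTATAAATTTTTTATCCAACTTAAACATTTTTTTCCTCCTAA
+
<>BGGGGGCGGGGGGGGGGG>GGGGGGGGGGGEGGGGEGGGGGGGGGGGGGG>GGBGGG
"""

def parse_fastq(text):
    lines = text.strip().split('\n')
    for i in range(0, len(lines), 4):
        header, seq, _sep, quals = lines[i:i + 4]
        yield header[1:], seq, quals  # drop the leading '@'

records = list(parse_fastq(fastq))
```

In practice you would use an established library rather than hand-rolling this.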
More will come in the future.
Feature tables: (samples × features)
Note: all of these are relative abundances. Sequencing can never give you absolute abundances by itself.
It is the feature tables that are the input to the machine learning and other analyses.
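A minimal sketch of what such a table looks like (the counts and names below are made up; rows are features and columns are samples, matching the orientation of the data we load below):

```python
import pandas as pd

# hypothetical feature table: rows are features (species), columns are samples
counts = pd.DataFrame(
    {'sample1': [120, 30, 50],
     'sample2': [80, 0, 20]},
    index=['species_A', 'species_B', 'species_C'])

# converting to relative abundances: divide each sample (column) by its total
relab = counts / counts.sum()
```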
This section is based on work from (Zeller et al., 2014).
Problem:
Can we use the microbiome present in a fecal sample as a test for CRC?
Background:
(These slides are adapted from Georg Zeller's work).
Type ? after an object/function for help.
print("Hello World")
You can run the cells out of order, but variables are preserved.
Another way of saying it: you are manipulating an environment underneath.
name = "Luis"
print("Hello {}".format(name))
name = 'Anna'
import pandas as pd
import numpy as np
These are general imports for data analysis in Python
from matplotlib import pyplot as plt
from matplotlib import style
style.use('seaborn-white')
A magic command to make plots appear inline:
%matplotlib inline
from sklearn import metrics
from sklearn import linear_model
from sklearn import cross_validation
from sklearn import ensemble
In this case, the data is already in the form of feature tables.
How to obtain these feature tables from raw data is outside the scope of this tutorial. You can, however, read a tutorial on how to do so on the NGLess website.
features = pd.read_table('http://www.bork.embl.de/~zeller/mlbc_data/FR-CRC_N114_feat_specIclusters.tsv')
print(features.shape)
features
Here is what we do: divide each sample by its total count, converting to relative abundances.
Why?
Because the number of sequences you obtain does not depend on any property of the sample. It is purely technical.
features.sum()
features /= features.sum()
min_abundance = 1e-3
print("Nr features before filtering: {}".format(features.shape[0]))
features = features[features.max(1) > min_abundance]
print("Nr features after filtering: {}".format(features.shape[0]))
Filtering is a heuristic process in any case; different reasonable thresholds will give you similar results.
features
fig,ax = plt.subplots()
ax.set_xscale('log')
ax.set_yscale('log')
ax.scatter(features.mean(1),features.var(1))
What is the big issue with log-transforms?
What to do about zeros in the counts table?
Empirical pseudo-count: use one order of magnitude smaller than the smallest detected value.
pseudo_count = features.values[features.values > 0].min()/10.
features = np.log10(pseudo_count + features)
fig,ax = plt.subplots()
ax.scatter(features.mean(1), features.std(1))
Not perfect, but look at the Y-axis scale: 0-3 (previously, it spanned several orders of magnitude).
Finally, we "whiten" the data: remove the mean and make the standard deviation be 1.
features = features.T
features = (features - features.mean(0))/features.std(0)
Note we also transposed the matrix to a more natural format.
labels = pd.read_table('http://www.bork.embl.de/~zeller/mlbc_data/FR-CRC_N114_label.tsv',
index_col=0, header=None, squeeze=True)
print(labels)
raw_features = features
raw_labels = labels
features = raw_features.values
labels = raw_labels.values
features
rf = ensemble.RandomForestClassifier(n_estimators=101)
prediction = cross_validation.cross_val_predict(rf, features, labels)
cross_validation.cross_val_score?
(prediction == labels).mean()
Actually, we did not need to normalize the features for a random forest! Why?
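One way to see why: decision-tree splits depend only on the ordering of values within each feature, so any monotone per-feature transformation (like our log-transform and whitening) leaves a Random Forest with a fixed random seed unchanged. A small synthetic check (all data here is made up):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = (X[:, 0] + X[:, 1] > 1).astype(int)

# same seed, once on the raw data and once on a monotone transform of it
rf_raw = RandomForestClassifier(n_estimators=25, random_state=42).fit(X, y)
rf_mono = RandomForestClassifier(n_estimators=25, random_state=42).fit(np.exp(X), y)

# exp() preserves the ordering of every feature, so every split partitions
# the data identically and the predictions are the same
agree = (rf_raw.predict(X) == rf_mono.predict(np.exp(X))).all()
```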
What about a penalized logistic regression?
clf = linear_model.LogisticRegressionCV()
lg_predictions = cross_validation.cross_val_predict(clf, features, labels)
print((lg_predictions == labels).mean())
lg_predictions
Instead of just "the best guess", often it is useful to obtain a probability for each prediction (how confident is the classifier?).
lg_predictions = np.zeros(len(labels), float)
for train, test in cross_validation.KFold(len(labels), n_folds=3, shuffle=True):
    clf.fit(features[train], labels[train])
    lg_predictions[test] = clf.predict_proba(features[test]).T[0]
lg_fpr, lg_tpr, _ = metrics.roc_curve(labels, -lg_predictions)
fig,ax = plt.subplots()
ax.plot(lg_fpr, lg_tpr)
Do you understand what the curve above is?
X-axis: false positive rate (1 − specificity): what fraction of the truly negative objects are incorrectly classified as positive?
Y-axis: sensitivity, recall, true positive rate: what fraction of the truly positive objects are indeed classified as positive?
Not all errors are the same: context matters.
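A tiny worked example makes the curve concrete: with two negatives and two positives, roc_curve recovers the curve point by point (the toy values below are not from our data):

```python
import numpy as np
from sklearn import metrics

# two negatives (0) and two positives (1), with classifier scores
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# sweeping a threshold over the scores traces out (fpr, tpr) pairs
fpr, tpr, thresholds = metrics.roc_curve(y_true, scores)
auc = metrics.roc_auc_score(y_true, scores)  # area under that curve: 0.75
```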
fpr, tpr, _ = metrics.roc_curve(labels, -lg_predictions)
auc_roc = metrics.roc_auc_score(labels, -lg_predictions)
fig,ax = plt.subplots()
ax.plot(fpr, tpr)
print("AUC: {}".format(auc_roc))
We can use these models for biomarker discovery:
We use the models to infer which features are important (you probably talked about this yesterday).
Many classifiers can also output a feature relevance measure.
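For example, Random Forests expose an impurity-based measure as feature_importances_. A sketch on synthetic data where only one feature carries signal (names and data made up):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.rand(200, 10)
y = (X[:, 3] > 0.5).astype(int)  # only feature 3 is informative

rf = RandomForestClassifier(n_estimators=101, random_state=0).fit(X, y)
importances = rf.feature_importances_  # impurity-based; normalized to sum to 1
top = int(np.argmax(importances))      # should point at the informative feature
```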
Blackboard diagram:
clf = linear_model.LogisticRegressionCV(penalty='l1', solver='liblinear', Cs=100)
clf.fit(features, labels)
feature_values = pd.Series(np.abs(clf.coef_).ravel(), index=raw_features.columns)
feature_values.sort_values(inplace=True)
fig,ax = plt.subplots()
_=ax.plot(clf.coefs_paths_[1].mean(0))
feature_values
So, which classifier is better?
Let's run the same evaluation scheme as before:
rf = ensemble.RandomForestClassifier(n_estimators=101)
lasso_predictions = np.zeros(len(labels), float)
rf_predictions = np.zeros(len(labels), float)
for train, test in cross_validation.KFold(len(labels), n_folds=3, shuffle=True):
    clf.fit(features[train], labels[train])
    lasso_predictions[test] = clf.predict_proba(features[test]).T[0]
    rf.fit(features[train], labels[train])
    rf_predictions[test] = rf.predict_proba(features[test]).T[0]
fig,ax = plt.subplots()

lg_fpr, lg_tpr, _ = metrics.roc_curve(labels, -lg_predictions)
lg_auc_roc = metrics.roc_auc_score(labels, -lg_predictions)
ax.plot(lg_fpr, lg_tpr)
print("Logistic regression (L2) AUC: {}".format(lg_auc_roc))

la_fpr, la_tpr, _ = metrics.roc_curve(labels, -lasso_predictions)
la_auc_roc = metrics.roc_auc_score(labels, -lasso_predictions)
ax.plot(la_fpr, la_tpr)
print("Logistic regression (L1) AUC: {}".format(la_auc_roc))

rf_fpr, rf_tpr, _ = metrics.roc_curve(labels, -rf_predictions)
rf_auc_roc = metrics.roc_auc_score(labels, -rf_predictions)
ax.plot(rf_fpr, rf_tpr)
print("RF AUC: {}".format(rf_auc_roc))
Which classifier should I use?
For small to moderate sized problems, use Random Forests.
Other classifiers can have some advantages: Lasso is sparse, for example.
Let's continue looking at using the Lasso for biomarker discovery:
We can also think of decreasing C (i.e., strengthening the penalty) as "decreasing the budget" of the classifier.
clf = linear_model.LogisticRegression(C=0.05, penalty='l1', solver='liblinear')
clf.fit(features, labels)
feature_values = pd.Series(np.abs(clf.coef_).ravel(), index=raw_features.columns)
feature_values.sort_values(inplace=True)
feature_values
clf = linear_model.LogisticRegression(C=0.1, penalty='l1', solver='liblinear')
fvalues = []
for _ in range(100):
    selected = (np.random.random(len(features)) < 0.9)
    clf.fit(features[selected], labels[selected])
    feature_values = pd.Series(np.abs(clf.coef_).ravel(), index=raw_features.columns)
    fvalues.append(feature_values)
fvalues = pd.concat(fvalues, axis=1)
With the code above, we fit the Lasso 100 times, each time on a different random subset of the data (~90% of the samples each time).
frac_selected = (fvalues > 0).mean(1)
frac_selected.sort_values(inplace=True)
frac_selected
The first two hypotheses are much more likely, but we cannot rule out the third.
One issue is: how do you define the features?
Remember we have a jumble of genes and for many species (most in the open environment) we have no reference genomes.
Genes from the same species should correlate across samples.
We can attempt to cluster together the genes from the same species!
We can go one level deeper and cluster at below species level (subspecies/strain level).
(Costea, Coelho, et al., in review)
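The clustering idea can be sketched on synthetic data: genes from the same (hypothetical) species track that species' abundance across samples, so clustering with a correlation distance groups them together. Everything below is a made-up illustration, not the actual method of the papers cited:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.RandomState(0)
n_samples = 50

# two hypothetical species with independent abundance profiles across samples
abundance_a = rng.rand(n_samples)
abundance_b = rng.rand(n_samples)

# each species contributes 10 genes whose profiles track its abundance (plus noise)
genes = np.vstack(
    [abundance_a * rng.uniform(0.5, 2.0) + 0.01 * rng.randn(n_samples) for _ in range(10)]
    + [abundance_b * rng.uniform(0.5, 2.0) + 0.01 * rng.randn(n_samples) for _ in range(10)])

# hierarchical clustering of genes using correlation distance (1 - Pearson r)
Z = linkage(genes, method='average', metric='correlation')
clusters = fcluster(Z, t=0.5, criterion='distance')
```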
Load the metadata (sheet index 8 is the right one). The original data is at http://ocean-microbiome.embl.de/companion.html.
tara_data = pd.read_excel('http://ocean-microbiome.embl.de/data/OM.CompanionTables.xlsx', sheetname=['Table W1', 'Table W8'], index_col=0)
meta = tara_data['Table W8']
meta
Remove the non-data lines at the bottom:
meta = meta.select(lambda ix: ix.startswith('TARA'))
Now, we need to do some ID manipulation
(Let's look at the tables in the supplement first, before we look at this obscure code)
samples = tara_data['Table W1']
pangea = samples['PANGAEA sample identifier']
meta.index = meta.index.map(dict([(p,t) for t,p in pangea.to_dict().items()]).get)
A lot of bioinformatics is just converting file types and identifiers.
We are going to focus on the surface samples, in the prokaryotic size fraction:
meta = meta.select(lambda sid: '_SRF_' in sid)
meta = meta.select(lambda sid: '0.22-3' in sid or '0.22-1.6' in sid)
raw_mOTUs = pd.read_table('http://ocean-microbiome.embl.de/data/mOTU.linkage-groups.relab.release.tsv',
index_col=0)
mOTUs = raw_mOTUs.copy()
Get just the relevant rows:
mOTUs
mOTUs = mOTUs[meta.index]
mOTUs.shape
Select abundant ones
mOTUs = mOTUs[mOTUs.max(1) > 1e-2]
(mOTUs > 0).mean().mean()
Transpose
mOTUs = mOTUs.T
print(mOTUs.shape)
I.e., if $x_i$, $x_j$ are the original (high dimensional) vectors, and $p_i$, $p_j$ are their transformed counterparts (low dimensional), then
$|x_i - x_j| \approx |p_i - p_j|$
PCA is one of the simplest methods, very classical (matrix operations) and very deep (shows up in many forms).
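We can sanity-check the distance-preservation idea numerically on synthetic data (a sketch; with real data the preservation is only approximate):

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn import decomposition

rng = np.random.RandomState(0)
X = rng.randn(100, 50)
X[:, :2] *= 10  # make two directions dominate the variance

proj = decomposition.PCA(2).fit_transform(X)

# pairwise distances before and after projection should be highly correlated
d_high = pdist(X)
d_low = pdist(proj)
corr = np.corrcoef(d_high, d_low)[0, 1]
```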
from sklearn import decomposition
pca = decomposition.PCA(2)
We can perform it in a single call:
pca_decomp = pca.fit_transform(mOTUs)
fig,ax = plt.subplots()
ax.scatter(pca_decomp.T[0], pca_decomp.T[1], s=60)
regions = samples['Ocean and sea regions (IHO General Sea Areas 1953) [MRGID registered at www.marineregions.com]']
regions = regions.reindex(meta.index)
regions = pd.Categorical(regions)
COLORS = np.array([
'#7fc97f',
'#beaed4',
'#fdc086',
'#ffff99',
'#386cb0',
'#f0027f',
'#bf5b17',
'#666666',
])
fig,ax = plt.subplots()
ax.scatter(pca_decomp.T[0], pca_decomp.T[1], c=COLORS[regions.codes],s=60)
Hmm. We "forgot" to log transform.
What is the largest feature?
mOTUs.mean()
fig,ax = plt.subplots()
ax.scatter(pca_decomp.T[0], mOTUs['unassigned'], c=COLORS[regions.codes],s=60)
So, yeah, without log-normalization, PC1 is almost completely determined by the single largest element.
In our case, this is even worse, because the largest element is the unassigned fraction!
pc = mOTUs.values[mOTUs.values > 0].min() / 10.
mOTUs = np.log10(mOTUs + pc)
pca_decomp = pca.fit_transform(mOTUs)
fig,ax = plt.subplots()
ax.scatter(pca_decomp.T[0], pca_decomp.T[1], c=COLORS[regions.codes],s=60)
It is, in general, impossible to get a perfect low-dimensional representation of a high-dimensional space.
pca_decomp = pca.fit_transform(mOTUs)
fig,ax = plt.subplots()
ax.scatter(pca_decomp.T[0], pca_decomp.T[1], c=COLORS[regions.codes],s=60)
ax.set_xlabel('Explained variance: {:.1%}'.format(pca.explained_variance_ratio_[0]))
ax.set_ylabel('Explained variance: {:.1%}'.format(pca.explained_variance_ratio_[1]))
You always want to report the explained variance.
Ideally, 50% or more should be explained by the two axes.
fig,ax = plt.subplots()
temperature = meta['Mean_Temperature [deg C]*']
ax.scatter(temperature, pca_decomp.T[0])
We can check with standard statistics that this is indeed a strong correlation:
from scipy import stats
stats.spearmanr(temperature, pca_decomp.T[0])
We are predicting a continuous output, not just a single class.
OLS: $\beta = \arg\min_\beta \| X\beta - y \|^2$
Lasso: $\beta = \arg\min_\beta \| X\beta - y \|^2 + \alpha \sum_i |\beta_i|$
Ridge: $\beta = \arg\min_\beta \| X\beta - y \|^2 + \alpha \sum_i \beta_i^2$
Elastic net: $\beta = \arg\min_\beta \| X\beta - y \|^2 + \alpha_1 \sum_i |\beta_i| + \alpha_2 \sum_i \beta_i^2$
Ideally, elastic net is a "soft Lasso": still sparse, but not as greedy.
If you just want a regression method, use elastic nets.
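In scikit-learn, ElasticNet exposes this through the l1_ratio parameter, which interpolates between Ridge-like (0.0) and Lasso (1.0) penalties. A quick check on synthetic data that l1_ratio=1.0 recovers the Lasso exactly:

```python
import numpy as np
from sklearn import linear_model

rng = np.random.RandomState(0)
X = rng.randn(50, 10)
y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.randn(50)

# elastic net with l1_ratio=1.0 is pure L1, i.e., exactly the Lasso
enet = linear_model.ElasticNet(alpha=0.1, l1_ratio=1.0).fit(X, y)
lasso = linear_model.Lasso(alpha=0.1).fit(X, y)
same_as_lasso = np.allclose(enet.coef_, lasso.coef_)
```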
We use much of the same procedure as before, except we used a different normalization: rank transform.
from scipy import stats
ranked = np.array([stats.rankdata(mOTUs.iloc[i]) for i in range(len(mOTUs))])
predictor = linear_model.ElasticNetCV(n_jobs=4)
cv = cross_validation.LeaveOneOut(len(temperature))
prediction = cross_validation.cross_val_predict(predictor, ranked, temperature, cv=cv)
fig,ax = plt.subplots()
ax.scatter(temperature, prediction)
ax.plot([0,30], [0,30], 'k:')
print("R2: {}".format(metrics.r2_score(temperature, prediction)))
predictor
It can be negative!
$R^2 = 1 - \frac{\sum_i | y_i - \hat{y}_i|^2}{\sum_i |y_i - \bar{y}|^2}$
It can be very sensitive to outliers!
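A tiny numeric example of a negative $R^2$: a constant prediction that is worse than just predicting the mean:

```python
from sklearn import metrics

y_true = [1, 2, 3]
y_pred = [3, 3, 3]  # constant prediction, worse than predicting the mean (2)

# residual sum of squares = 4 + 1 + 0 = 5; total sum of squares = 1 + 0 + 1 = 2
# R^2 = 1 - 5/2 = -1.5
r2 = metrics.r2_score(y_true, y_pred)
```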
What is spatial auto-correlation?
How do we solve the issue of spatial auto-correlation?
http://luispedro.org/files/Coelho2013_Bioinformatics_extra/crossvalidation.html
cv = cross_validation.LeaveOneLabelOut(labels=regions)
prediction = cross_validation.cross_val_predict(predictor, ranked, temperature, cv=cv)
print("R2: {}".format(metrics.r2_score(temperature, prediction)))
fig, ax = plt.subplots()
ax.scatter(temperature, prediction)
ax.plot([0,30], [0,30],'k:')
Models will often regress to the mean.
This is OK.
References
Big issues
Suggestions?
Here is what we did:
Worked well.
Questions before we switch topics again?