#!/usr/bin/env python
# coding: utf-8

# # Machine learning and metagenomics to study microbial communities
#
# ## Luis Pedro Coelho
#
# - [@luispedrocoelho](https://twitter.com/luispedrocoelho)
# - http://luispedro.org

# ## Microbial communities
#
# _Definition_: A microbial community is just many small organisms living together.
#
# Most often, we study _bacteria_ and _archaea_ (prokaryotes, although some people avoid that word as it does not denote a monophyletic group).
#
# Less often, microeukaryotes.

# ![Q](Q.png) What are the (wet-)lab tools to analyse microbial communities?

# # Wetlab methods
#
# - Sequencing
#     - WGS (metagenomics)
#     - Amplicon (16S, typically; but other genes too)
# - Transcriptomics
# - Imaging
# - Metaproteomics
# - ...
#
# The biggest reason why we study prokaryotes more than eukaryotes is that it is easier.

# # Which environments are typically analysed?
#
# - Host-associated microbiomes (with clinical implications)
# - Environmental samples (ocean, lakes, soil)
# - Wastewater treatment plants
# - Cheese
# - ...

# ## Introduction to metagenomics analysis
#
# (Most of this also applies to amplicon and metatranscriptomics sequencing.)
#
# 1. Collect sample
# 2. Extract DNA (or RNA)
# 3. Perhaps more manipulations (amplicon, RNA -> cDNA, ...)
# 4. Send sample to sequencing centre
# 5. Sequencing happens
# 6. You get FASTQ files
#
# #### Alternative pipeline
#
# 1. Download FASTQ files from ENA (EBI) or SRA

# # Raw metagenomics data (FASTQ files)
#
# - We cannot process metagenomics data in a 1-day course: too much computation, too many tools to introduce, not enough time.
# - We can, however, do some functional data analysis if we start with preprocessed data.
# - Still, let's discuss what can be done with these data and how they are processed in general.
#
# What does a FASTQ file look like?
#
#     @HWI-D00647:25:C4P50ANXX:1:1101:1914:1994:0:N:0:AGGCAGAAAAGGCTAT
#     TAAAAGCATCAAAGTTTTTATAAATTTTTTATCCAACTTAAACATTTTTTTCCTCCTAA
#     +
#     <>BGGGGGCGGGGGGGGGGG>GGGGGGGGGGGEGGGGEGGGGGGGGGGGGGG>GGBGGG
#     @HWI-D00647:25:C4P50ANXX:1:1101:4595:1998:0:N:0:AGGCAGAAAAGGCTAT
#     CTCTATCGCACGCTCCAGGCGCGCTATCCCCAGGCGATAATCGATGTGATGGCACCGGC
#     +
#     :>BGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGDGGGGFGGGGGGGGGGGGGGGG

# ![Q](Q.png)
# How would you process these data?

# # Gene catalog approach
#
# ![Gene catalog](./MOCAT_TARA.png)
#
# Tools for this process:
#
# - [MOCAT](http://vm-lux.embl.de/~kultima/MOCAT/)
# - [ngless](https://ngless.rtfd.io)

# # Existing catalogs
#
# - Ocean
# - Host-associated: human gut, mouse gut, pig gut, human skin
# - ...
#
# More will come in the future.

# # Outputs of metagenomics processing
#
# Feature tables: (sample x feature)
#
# - taxonomic: features are species abundances, genus abundances, ...
# - functional: features are gene groups or gene families
#
# Note: all of these are **relative abundances**. Sequencing can never give you absolute abundances by itself.
#
# It is these feature tables that are the input to the machine learning and other analyses.
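# To make the shape of these inputs concrete, here is a tiny, entirely made-up feature table
# (hypothetical taxa and values, **not** real data): samples as columns, taxa as features,
# and relative abundances that sum to 1 within each sample.

# In[ ]:

import pandas as pd

toy = pd.DataFrame({
    'sample_1': [0.60, 0.30, 0.10],
    'sample_2': [0.05, 0.80, 0.15],
}, index=['Bacteroides', 'Prevotella', 'Faecalibacterium'])
print(toy)
print(toy.sum())  # each column (sample) sums to 1.0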
# # Problems
#
# - [supervised] Colorectal cancer detection from fecal samples
# - [unsupervised] Automated inference of species existence (mOTUs/CAGs/...)
# - [supervised/unsupervised] As a **scientific tool** for relationship inference
# - ...

# # Colorectal Cancer (CRC) Detection
#
# This section is based on work from ([Zeller et al., 2014](http://onlinelibrary.wiley.com/doi/10.15252/msb.20145645/abstract)).
#
# Problem:
#
# > Can we use the microbiome present in a fecal sample as a test for CRC?
#
# Background:
#
# - CRC has a low survival rate if detected late, but a high one if detected early enough.
# - Current tests are either invasive and uncomfortable (colonoscopy) or weak in power (occult blood tests).
# - Some in vitro research had already hinted at a possible microbiome involvement.

# # Classification problem
#
# 1. Get samples from healthy individuals and patients
# 2. Generate data from both groups
# 3. Build a classifier

# # Practical I
#
# (These slides are adapted from Georg Zeller's work.)

# ### Mechanics of Jupyter notebooks
#
# - You can run this *cell* by typing CTRL+ENTER or from the menu.
# - Type `?` after an object/function to get help.
# - *Restart* when stuck.

# In[ ]:

print("Hello World")


# You can run the cells out of order, but **variables** are preserved.
#
# Another way of saying it is that you are manipulating an environment underneath.

# In[ ]:

name = "Luis"


# In[ ]:

print("Hello {}".format(name))


# In[ ]:

name = 'Anna'


# In[ ]:

import pandas as pd
import numpy as np


# ## Some basic imports
#
# These are general imports for data analysis in Python.

# In[ ]:

from matplotlib import pyplot as plt
from matplotlib import style
style.use('seaborn-white')


# A magic command to make plots appear *inline*:

# In[ ]:

get_ipython().run_line_magic('matplotlib', 'inline')


# In[ ]:

from sklearn import metrics
from sklearn import linear_model
from sklearn import cross_validation
from sklearn import ensemble


# # Getting the data
#
# In this case, the data are already in the form of feature tables.
#
# How to obtain these feature tables from raw data is outside the scope of this tutorial. You can, however, read a tutorial on how to do so on the [NGLess website](http://ngless.embl.de/_static/gut-metagenomics-tutorial-presentation/gut_specI_tutorial.html).

# In[ ]:

features = pd.read_table('http://www.bork.embl.de/~zeller/mlbc_data/FR-CRC_N114_feat_specIclusters.tsv')


# In[ ]:

print(features.shape)


# In[ ]:

features


# # Feature normalization
#
# Here is what we do:
#
# 1. Convert to relative abundances
# 2. Remove low abundance features
# 3. Log transform the features
# 4. Whiten the data
#
# ![Q](Q.png) **Why?**

# # Why convert to relative abundances?
#
# Because the **number of sequences you obtain does not depend on any property of the sample**. It is purely technical.

# In[ ]:

features.sum()


# In[ ]:

features /= features.sum()


# # Why remove low abundance features?
#
# 1. Removing some features is good for classification (dimensionality reduction). **We are also doing this without looking at the classes** (i.e., it is unsupervised).
# 2. Low abundance features are noisier (close to the detection limit, the Law of Small Numbers applies: _small groups have large variance_).

# In[ ]:

min_abundance = 1e-3
print("Nr features before filtering: {}".format(features.shape[0]))
features = features[features.max(1) > min_abundance]
print("Nr features after filtering: {}".format(features.shape[0]))


# # Unsupervised feature selection alternatives
#
# - Use the average value (mean instead of max) or the prevalence (i.e., the fraction of samples where the feature is > 0)
# - Use the methods that you learned about yesterday
#
# It is a somewhat heuristic process in any case; these alternatives will give you similar results.

# # Why log transform the features?
#
# 1. Variance normalization.
# 2. Biologically, we care about **fold changes**, not about absolute changes.
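# A tiny numeric illustration (made-up abundances, not from the dataset): on a log scale,
# a doubling is the same shift whether the taxon is rare or abundant, which is exactly the
# fold-change view we care about.

# In[ ]:

import numpy as np

rare = np.array([1e-4, 2e-4])       # a rare taxon doubles
abundant = np.array([1e-1, 2e-1])   # an abundant taxon doubles
print(np.diff(rare), np.diff(abundant))                      # very different absolute changes
print(np.diff(np.log10(rare)), np.diff(np.log10(abundant)))  # identical log10 fold changes (~0.30)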
# # Variable normalization

# In[ ]:

features


# In[ ]:

fig, ax = plt.subplots()
ax.set_xscale('log')
ax.set_yscale('log')
ax.scatter(features.mean(1), features.var(1))


# # Log transforming the data
#
# ![Q](Q.png)
#
# What is the big issue with log transforms?
# What do we do about zeros in the counts table?
#
# Empirical pseudo-count: use one order of magnitude smaller than the smallest detected value.

# In[ ]:

pseudo_count = features.values[features.values > 0].min()/10.
features = np.log10(pseudo_count + features)


# In[ ]:

fig, ax = plt.subplots()
ax.scatter(features.mean(1), features.std(1))


# Not perfect, but look at the Y-axis scale: 0-3 (previously, it spanned several orders of magnitude).

# # Whitening the data
#
# Finally, we "whiten" the data: remove the mean and make the standard deviation be 1.

# In[ ]:

features = features.T
features = (features - features.mean(0))/features.std(0)


# Note that we also transposed the matrix to a more natural format (samples as rows).

# # Now, can we use a classifier to identify cancerous samples?

# In[ ]:

labels = pd.read_table('http://www.bork.embl.de/~zeller/mlbc_data/FR-CRC_N114_label.tsv', index_col=0, header=None, squeeze=True)
print(labels)


# In[ ]:

raw_features = features
raw_labels = labels

features = raw_features.values
labels = raw_labels.values


# In[ ]:

features


# In[ ]:

rf = ensemble.RandomForestClassifier(n_estimators=101)
prediction = cross_validation.cross_val_predict(rf, features, labels)


# In[ ]:

get_ipython().run_line_magic('pinfo', 'cross_validation.cross_val_score')


# In[ ]:

(prediction == labels).mean()


# ![Q](Q.png) Actually, we did not need to normalize the features for a random forest! Why?

# # Let's try other classifiers
#
# What about a penalized logistic regression?

# In[ ]:

clf = linear_model.LogisticRegressionCV()
lg_predictions = cross_validation.cross_val_predict(clf, features, labels)


# In[ ]:

print((lg_predictions == labels).mean())


# In[ ]:

lg_predictions


# # Can we get a soft prediction?
#
# Instead of just "the best guess", it is often useful to get a probability (a score) for each class:

# In[ ]:

lg_predictions = np.zeros(len(labels), float)
for train, test in cross_validation.KFold(len(labels), n_folds=3, shuffle=True):
    clf.fit(features[train], labels[train])
    lg_predictions[test] = clf.predict_proba(features[test]).T[0]


# In[ ]:

lg_fpr, lg_tpr, _ = metrics.roc_curve(labels, lg_predictions)
fig, ax = plt.subplots()
ax.plot(lg_tpr, lg_fpr)


# ![Q](Q.png) Do you understand what the curve above is?

# # ROC Curves
#
# X-axis: false positive rate, i.e., (1 - specificity): what fraction of the truly negative objects are wrongly classified as positive?
#
# Y-axis: sensitivity, recall, true positive rate: what fraction of the truly positive objects are indeed classified as positive?
#
# ## Alternative way to think about it
#
# 1. Order all the examples by the classifier score
# 2. Start at (0,0)
# 3. Go down the list of examples as ordered by the classifier
#     - if true, go up
#     - if false, go right
#
# Not all errors are the same: context matters.

# # Area under the curve

# In[ ]:

fpr, tpr, _ = metrics.roc_curve(labels, lg_predictions)
auc_roc = metrics.roc_auc_score(labels, -lg_predictions)
fig, ax = plt.subplots()
ax.plot(tpr, fpr)
print("AUC: {}".format(auc_roc))


# # Biomarker discovery
#
# We can use these models for biomarker discovery:
#
# We use the models to infer which features are important (you probably talked about this yesterday).
#
# Many classifiers can also output a feature relevance measure.
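# As a concrete example of such a feature relevance measure (a minimal sketch, not part of the
# original analysis; it reuses the `features`, `labels` and `raw_features` objects defined above):
# a fitted random forest exposes `feature_importances_`, which can be used to rank candidate biomarkers.

# In[ ]:

rf_full = ensemble.RandomForestClassifier(n_estimators=101)
rf_full.fit(features, labels)
rf_importances = pd.Series(rf_full.feature_importances_, index=raw_features.columns)
print(rf_importances.sort_values(ascending=False).head(10))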
# # Lasso is sparse
#
# Blackboard diagram:

# # Lasso in scikit-learn

# In[ ]:

clf = linear_model.LogisticRegressionCV(penalty='l1', solver='liblinear', Cs=100)


# In[ ]:

clf.fit(features, labels)


# In[ ]:

feature_values = pd.Series(np.abs(clf.coef_).ravel(), index=raw_features.columns)


# In[ ]:

feature_values.sort_values(inplace=True)


# In[ ]:

fig, ax = plt.subplots()
_ = ax.plot(clf.coefs_paths_[1].mean(0))


# In[ ]:

feature_values


# # Comparing classifiers
#
# So, which classifier is better?
#
# Let's run the same evaluation scheme as before:

# In[ ]:

rf = ensemble.RandomForestClassifier(n_estimators=101)

lasso_predictions = np.zeros(len(labels), float)
rf_predictions = np.zeros(len(labels), float)
for train, test in cross_validation.KFold(len(labels), n_folds=3, shuffle=True):
    clf.fit(features[train], labels[train])
    lasso_predictions[test] = clf.predict_proba(features[test]).T[0]

    rf.fit(features[train], labels[train])
    rf_predictions[test] = rf.predict_proba(features[test]).T[0]


# In[ ]:

fig, ax = plt.subplots()

lg_fpr, lg_tpr, _ = metrics.roc_curve(labels, lg_predictions)
lg_auc_roc = metrics.roc_auc_score(labels, -lg_predictions)
ax.plot(lg_tpr, lg_fpr)
print("Logistic regression (L2) AUC: {}".format(lg_auc_roc))

la_fpr, la_tpr, _ = metrics.roc_curve(labels, lasso_predictions)
la_auc_roc = metrics.roc_auc_score(labels, -lasso_predictions)
ax.plot(la_tpr, la_fpr)
print("Logistic regression (L1) AUC: {}".format(la_auc_roc))

rf_fpr, rf_tpr, _ = metrics.roc_curve(labels, rf_predictions)
rf_auc_roc = metrics.roc_auc_score(labels, -rf_predictions)
ax.plot(rf_tpr, rf_fpr)
print("RF AUC: {}".format(rf_auc_roc))


# # For a classifier, use Random Forests
#
# Which classifier should I use?
#
# For small to moderately sized problems, use Random Forests.
#
# - _Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?_ Manuel Fernández-Delgado, Eva Cernadas, Senén Barro, Dinani Amorim in [JMLR](http://www.jmlr.org/papers/v15/delgado14a.html)
# - _I Tried a Bunch of Things: The Dangers of Unexpected Overfitting in Classification_ by Michael Skocik, John Collins, Chloe Callahan-Flintoft, Howard Bowman [bioRxiv preprint](http://biorxiv.org/content/early/2016/10/03/078816)
#
# Other classifiers can have some advantages: the Lasso is sparse, for example.
#
# Let's continue looking at using the Lasso for biomarker discovery.

# # Increasing the penalty makes the model sparser
#
# We can also think of it as "decreasing the budget" of the classifier.

# In[ ]:

clf = linear_model.LogisticRegression(C=0.05, penalty='l1', solver='liblinear')
clf.fit(features, labels)
feature_values = pd.Series(np.abs(clf.coef_).ravel(), index=raw_features.columns)
feature_values.sort_values(inplace=True)
feature_values


# # Lasso is not very stable
#
# Lasso is _winner takes all_.
#
# Back to the blackboard:

# In[ ]:

clf = linear_model.LogisticRegression(C=0.1, penalty='l1', solver='liblinear')
fvalues = []
for _ in range(100):
    selected = (np.random.random(len(features)) < 0.9)
    clf.fit(features[selected], labels[selected])
    feature_values = pd.Series(np.abs(clf.coef_).ravel(), index=raw_features.columns)
    fvalues.append(feature_values)
fvalues = pd.concat(fvalues, axis=1)


# With the code above, we fit the Lasso 100 times on different subsets of the data (using ~90% of the samples each time).

# In[ ]:

frac_selected = (fvalues > 0).mean(1)
frac_selected.sort_values(inplace=True)
frac_selected
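# A small follow-up sketch (not in the original notebook): one way to summarize the resampling
# above is to count how many features are selected in at least half of the runs, i.e., the
# reasonably stable ones.

# In[ ]:

stable = frac_selected[frac_selected >= 0.5]
print("Features selected in >= 50% of resamplings: {}".format(len(stable)))
print(stable.sort_values(ascending=False).head(10))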
# # Final notes on CRC
#
# - There is no causality model:
#     - Maybe the tumour is a good environment for certain microbes
#     - Maybe certain microbes cause cancer
#     - Maybe people with subclinical symptoms subtly change their diet
#
# The first two hypotheses are much more likely, but we cannot rule out the third.

# ## The same approach can be applied in other contexts
#
# Issues are how you ...

# # Discovery of species using clustering
#
# Remember that we have a jumble of genes and, for many species (most, in the open environment), we have no reference genomes.
#
# Genes from the same species should correlate across samples.
#
# We can attempt to cluster together the genes from the same species!
#
# ## Many related approaches
#
# - CAGs
# - mOTUs
# - ...

# # We have several genes which we don't know where they belong
#
# ![](metagenomics-species-nbt.png)
# ![](nbt.2939-F1.jpg)

# # Subspecies/strain clustering
#
# We can go one level deeper and cluster below the species level (subspecies/strain level).
#
# ![](subspecies-3a.png)
#
# (Costea, Coelho, et al., in review)

# ![And now for something completely different](https://static1.squarespace.com/static/5342b8d7e4b0cc3fc1bc6079/t/577af0fe197aea2767b508bf/1467674907628/?format=500w)

# # Tutorial 2
#
# Load the metadata (sheet index 8 is the right one). The original data is at [http://ocean-microbiome.embl.de/companion.html](http://ocean-microbiome.embl.de/companion.html).

# In[ ]:

tara_data = pd.read_excel('http://ocean-microbiome.embl.de/data/OM.CompanionTables.xlsx',
                          sheetname=['Table W1', 'Table W8'],
                          index_col=0)


# In[ ]:

meta = tara_data['Table W8']
meta


# Remove the non-data lines at the bottom:

# In[ ]:

meta = meta.select(lambda ix: ix.startswith('TARA'))


# # Now, we need to do some ID manipulation
#
# (Let's look at the tables in the supplement first, before we look at this obscure code.)

# In[ ]:

samples = tara_data['Table W1']
pangea = samples['PANGAEA sample identifier']
meta.index = meta.index.map(dict([(p, t) for t, p in pangea.to_dict().items()]).get)


# A lot of bioinformatics is just converting file types and identifiers.
#
# We are going to focus on the surface samples, in the prokaryotic size fraction:

# In[ ]:

meta = meta.select(lambda sid: '_SRF_' in sid)
meta = meta.select(lambda sid: '0.22-3' in sid or '0.22-1.6' in sid)


# # Taxonomic tables

# In[ ]:

raw_mOTUs = pd.read_table('http://ocean-microbiome.embl.de/data/mOTU.linkage-groups.relab.release.tsv', index_col=0)
mOTUs = raw_mOTUs.copy()


# Get just the relevant columns (the samples we selected above):

# In[ ]:

mOTUs


# In[ ]:

mOTUs = mOTUs[meta.index]


# In[ ]:

mOTUs.shape


# Select the abundant ones:

# In[ ]:

mOTUs = mOTUs[mOTUs.max(1) > 1e-2]


# In[ ]:

(mOTUs > 0).mean().mean()


# Transpose:

# In[ ]:

mOTUs = mOTUs.T


# In[ ]:

print(mOTUs.shape)


# # PCA Plot
#
# - Principal component analysis (PCA)
# - Principal coordinate analysis (PCoA)
# - Multidimensional scaling (MDS)
# - Multidimensional non-euclidean analysis
#
# These methods look for a low dimensional representation that approximately preserves distances.
# I.e., if $x_i$, $x_j$ are the original (high dimensional) vectors, and $p_i$, $p_j$ are their transformed counterparts (low dimensional), then
#
# $|x_i - x_j| \approx |p_i - p_j|$
#
# PCA is one of the simplest methods, very classical (matrix operations) and very deep (it shows up in many forms).
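# A minimal sketch (not part of the original tutorial) of the "matrix operations" view: PCA is an
# SVD of the mean-centred data matrix; the scikit-learn PCA used below computes the same projection
# (up to the sign of each component).

# In[ ]:

X_centred = mOTUs.values - mOTUs.values.mean(0)
U, S, Vt = np.linalg.svd(X_centred, full_matrices=False)
svd_scores = U[:, :2] * S[:2]  # sample coordinates on the first two principal components
print(svd_scores.shape)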
# [Wikipedia article on PCA](https://en.wikipedia.org/wiki/Principal_component_analysis)
#
# ![PCA](https://upload.wikimedia.org/wikipedia/commons/thumb/f/f5/GaussianScatterPCA.svg/330px-GaussianScatterPCA.svg.png)

# # PCA in scikit-learn

# In[ ]:

from sklearn import decomposition
pca = decomposition.PCA(2)


# We can perform it in a single call:

# In[ ]:

pca_decomp = pca.fit_transform(mOTUs)


# In[ ]:

fig, ax = plt.subplots()
ax.scatter(pca_decomp.T[0], pca_decomp.T[1], s=60)


# # Adding metadata to the plot

# In[ ]:

regions = samples['Ocean and sea regions (IHO General Sea Areas 1953) [MRGID registered at www.marineregions.com]']
regions = regions.reindex(meta.index)
regions = pd.Categorical(regions)


# In[ ]:

COLORS = np.array([
    '#7fc97f',
    '#beaed4',
    '#fdc086',
    '#ffff99',
    '#386cb0',
    '#f0027f',
    '#bf5b17',
    '#666666',
])


# In[ ]:

fig, ax = plt.subplots()
ax.scatter(pca_decomp.T[0], pca_decomp.T[1], c=COLORS[regions.codes], s=60)


# Hmm. We "forgot" to log transform.
#
# What is the largest feature?

# In[ ]:

mOTUs.mean()


# In[ ]:

fig, ax = plt.subplots()
ax.scatter(pca_decomp.T[0], mOTUs['unassigned'], c=COLORS[regions.codes], s=60)


# So, yeah, without log normalization, PC1 is almost completely determined by the single largest element.
#
# In our case, this is even worse, because the largest element is the unassigned fraction!

# In[ ]:

pc = mOTUs.values[mOTUs.values > 0].min() / 10.
mOTUs = np.log10(mOTUs + pc)


# In[ ]:

pca_decomp = pca.fit_transform(mOTUs)

fig, ax = plt.subplots()
ax.scatter(pca_decomp.T[0], pca_decomp.T[1], c=COLORS[regions.codes], s=60)


# # Explained variance in PCA
#
# It is, in general, impossible to get a perfect low dimensional representation of a high dimensional space.

# In[ ]:

pca_decomp = pca.fit_transform(mOTUs)

fig, ax = plt.subplots()
ax.scatter(pca_decomp.T[0], pca_decomp.T[1], c=COLORS[regions.codes], s=60)
ax.set_xlabel('Explained variance: {:.1%}'.format(pca.explained_variance_ratio_[0]))
ax.set_ylabel('Explained variance: {:.1%}'.format(pca.explained_variance_ratio_[1]))


# You **always want to show the explained variance**.
#
# Ideally, 50% or more should be explained by the two axes.

# # Correlation of temperature with PC1

# In[ ]:

fig, ax = plt.subplots()
temperature = meta['Mean_Temperature [deg C]*']
ax.scatter(temperature, pca_decomp.T[0])


# We can check with standard statistics that this is indeed a strong correlation:

# In[ ]:

from scipy import stats
stats.spearmanr(temperature, pca_decomp.T[0])


# # How about predicting temperature?
#
# ### Regression, not classification
#
# We are predicting a continuous output, not just a single class.

# # Elastic net regression
#
# OLS: $\beta = \arg\min | X\beta - y |^2$
#
# Lasso: $\beta = \arg\min | X\beta - y |^2 + \alpha \sum_i |\beta_i|$
#
# Ridge: $\beta = \arg\min | X\beta - y |^2 + \alpha \sum_i \beta_i^2$
#
# Elastic net: $\beta = \arg\min | X\beta - y |^2 + \alpha_1 \sum_i |\beta_i| + \alpha_2 \sum_i \beta_i^2$
#
# Intuitively, the elastic net is a "soft Lasso": still sparse, but not as greedy.
#
# **If you just want a regression method, _use elastic nets_.**

# # Feature normalization
#
# We use much of the same procedure as before, except we use _a different normalization_: a rank transform.
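# A tiny illustration (made-up values) of what the rank transform does: each value is replaced
# by its rank within the sample, which makes the result insensitive to outliers and to the exact
# abundance scale.

# In[ ]:

from scipy import stats

print(stats.rankdata([0.0, 0.2, 0.01, 0.5]))  # -> [ 1.  3.  2.  4.]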
# In[ ]:

from scipy import stats
ranked = np.array([stats.rankdata(mOTUs.iloc[i]) for i in range(len(mOTUs))])


# In[ ]:

predictor = linear_model.ElasticNetCV(n_jobs=4)
cv = cross_validation.LeaveOneOut(len(temperature))
prediction = cross_validation.cross_val_predict(predictor, ranked, temperature, cv=cv)


# In[ ]:

fig, ax = plt.subplots()
ax.scatter(temperature, prediction)
ax.plot([0, 30], [0, 30], 'k:')


# In[ ]:

print("R2: {}".format(metrics.r2_score(temperature, prediction)))


# In[ ]:

predictor


# # R² as a measure of prediction
#
# - R²_CV_, R², Q², or coefficient of determination: the nomenclature is not 100% standard.
# - It is the variance explained.
# - Alternatively: it is the improvement over a _null model_.
#
# It can be negative!
#
# $R^2 = 1 - \frac{\sum_i | y_i - \hat{y}_i|^2}{\sum_i |y_i - \bar{y}|^2}$
#
# It can be very sensitive to outliers!

# # What about spatial auto-correlation?
#
# ![Q](Q.png) What is spatial auto-correlation?
# ![Q](Q.png) How do we solve the issue of spatial auto-correlation?

# # Cross-validation as a solution to auto-correlation
#
# - Cross-validation can be a very powerful scientific tool
#
# http://luispedro.org/files/Coelho2013_Bioinformatics_extra/crossvalidation.html

# In[ ]:

cv = cross_validation.LeaveOneLabelOut(labels=regions)
prediction = cross_validation.cross_val_predict(predictor, ranked, temperature, cv=cv)
print("R2: {}".format(metrics.r2_score(temperature, prediction)))


# In[ ]:

fig, ax = plt.subplots()
ax.scatter(temperature, prediction)
ax.plot([0, 30], [0, 30], 'k:')


# # Regression to the mean in penalized linear models
#
# Models will often regress to the mean.
#
# This is OK.

# # Be very careful with group/batch effects
#
# - They are pervasive
# - They can kill your generalization
# - They can trick you into thinking your system works better than it does

# # References
#
# - _Determining the subcellular location of new proteins from microscope images using local features_ by Coelho et al. (2013) Bioinformatics [DOI: 10.1093/bioinformatics/btt392](http://bioinformatics.oxfordjournals.org/content/29/18/2343.short)
# - _Assessing and tuning brain decoders: Cross-validation, caveats, and guidelines_ by Varoquaux et al. NeuroImage (2016) [DOI: 10.1016/j.neuroimage.2016.10.038](http://www.sciencedirect.com/science/article/pii/S105381191630595X)

# # Problem: can we build a model that can generalize across studies?
#
# Big issues:
#
# - Not the same technology (Illumina vs 454, and different library preps)
# - Not the same sequencing depth
#
# ![Q](Q.png) Suggestions?
#
# Here is what we did:
#
# 1. Downsample (randomly) our data to the GOS depth
# 2. Only use presence/absence (encoded as 0.0/1.0)
#
# It worked well.
#
# Questions before we switch topics again?