This notebook is for the analysis of 15 16S PCR libraries that were produced from DNA extracted from meals, collected by blending a plate of food in the blender. The meals were prepared to be typical of three diet types (Average American, USDA-recommended, and Vegan)
Before launching this ipython notebook, I typed the macqiime command to configure the shell. I'm using macqiime 1.8.0 http://www.wernerlab.org/software/macqiime/macqiime-installation

I copied the stuff below from a QIIME/iPython notebook tutorial: http://nbviewer.ipython.org/github/qiime/qiime/blob/1.8.0/examples/ipynb/illumina_overview_tutorial.ipynb
I'm not sure what all of it does...

In [1]:

from os import chdir, mkdir
from os.path import join
#the following are only available in the current development branch of IPython
from IPython.display import FileLinks, FileLink

Substitute the file paths below, and then there is no need to change any of the code in the rest of the notebook.
The version of macqiime that I'm using does not install the greengenes 99% cutoff OTUs and taxonomy, so did that manually as per the instructions on the MacQIIME Installation site. I just substituted the gg_13_8_otus folder that has all of the otu cutoffs for the one included in macqiime/greengenes/

In [2]:

project_name = "MicrobesWeEat"
sequence_file = "./MicrobesWeEat.fasta"
non_chimeric_sequence_file = "./non_chimeric_sequences.fasta"
mapping_file = "./MicrobesWeEat.txt"
otu_base = "/macqiime/greengenes/gg_13_8_otus/"
reference_seqs_99 = join(otu_base,"/macqiime/greengenes/gg_13_8_otus/rep_set/99_otus.fasta")
reference_tree_99 = join(otu_base,"/macqiime/greengenes/gg_13_8_otus/trees/99_otus.tree")
reference_tax_99 = join(otu_base,"/macqiime/greengenes/gg_13_8_otus/taxonomy/99_otu_taxonomy.txt")
reference_seqs_97 = join(otu_base,"/macqiime/greengenes/gg_13_8_otus/rep_set/97_otus.fasta")
reference_tree_97 = join(otu_base,"/macqiime/greengenes/gg_13_8_otus/trees/97_otus.tree")
reference_tax_97 = join(otu_base,"/macqiime/greengenes/gg_13_8_otus/taxonomy/97_otu_taxonomy.txt")

Make sure mapping file is good to go. This also provides a quick check for macqiime.¶

In [3]:

!validate_mapping_file.py -m $mapping_file

No errors or warnings were found in mapping file.

In [43]:

#!ls RawSequenceData/
#Listed are the raw sequence files.

Demultiplex and join.¶

Gotta unzip them to work with them.¶

In [44]:

#!gunzip RawSequenceData/*.gz
#!ls RawSequenceData/

After consulting with a QIIME developer via the forum, I know that there is an issue with the trailing *:N:0 in all of the fastq files. A fix before proceeding is below, but this should not be necessary in future releases.¶

In [28]:

!sed 's/ 1:N:0://g' RawSequenceData/HMSB_AZ_35_NoIndex_L001_R1_001.fastq > HMSB_AZ_35_NoIndex_L001_R1_001_fixed.fastq
!sed 's/ 4:N:0://g' RawSequenceData/HMSB_AZ_35_NoIndex_L001_R4_001.fastq > HMSB_AZ_35_NoIndex_L001_R4_001_fixed.fastq
!sed 's/ 1:N:0://g' RawSequenceData/HMSB_AZ_35_NoIndex_L001_R1_002.fastq > HMSB_AZ_35_NoIndex_L001_R1_002_fixed.fastq
!sed 's/ 4:N:0://g' RawSequenceData/HMSB_AZ_35_NoIndex_L001_R4_002.fastq > HMSB_AZ_35_NoIndex_L001_R4_002_fixed.fastq
!sed 's/ 1:N:0://g' RawSequenceData/HMSB_AZ_35_NoIndex_L001_R1_003.fastq > HMSB_AZ_35_NoIndex_L001_R1_003_fixed.fastq
!sed 's/ 4:N:0://g' RawSequenceData/HMSB_AZ_35_NoIndex_L001_R4_003.fastq > HMSB_AZ_35_NoIndex_L001_R4_003_fixed.fastq
!sed 's/ 1:N:0://g' RawSequenceData/HMSB_AZ_35_NoIndex_L001_R1_004.fastq > HMSB_AZ_35_NoIndex_L001_R1_004_fixed.fastq
!sed 's/ 4:N:0://g' RawSequenceData/HMSB_AZ_35_NoIndex_L001_R4_004.fastq > HMSB_AZ_35_NoIndex_L001_R4_004_fixed.fastq

Create a new barcode file in which the forward and reverse barcodes are concatenated into a single barcode.¶

In [25]:

!extract_barcodes.py -r RawSequenceData/HMSB_AZ_35_NoIndex_L001_R2_001.fastq -f RawSequenceData/HMSB_AZ_35_NoIndex_L001_R3_001.fastq -c barcode_paired_end --bc1_len 8 --bc2_len 8 -o Lane1.barcodes
!extract_barcodes.py -r RawSequenceData/HMSB_AZ_35_NoIndex_L001_R2_002.fastq -f RawSequenceData/HMSB_AZ_35_NoIndex_L001_R3_002.fastq -c barcode_paired_end --bc1_len 8 --bc2_len 8 -o Lane2.barcodes
!extract_barcodes.py -r RawSequenceData/HMSB_AZ_35_NoIndex_L001_R2_003.fastq -f RawSequenceData/HMSB_AZ_35_NoIndex_L001_R3_003.fastq -c barcode_paired_end --bc1_len 8 --bc2_len 8 -o Lane3.barcodes
!extract_barcodes.py -r RawSequenceData/HMSB_AZ_35_NoIndex_L001_R2_004.fastq -f RawSequenceData/HMSB_AZ_35_NoIndex_L001_R3_004.fastq -c barcode_paired_end --bc1_len 8 --bc2_len 8 -o Lane4.barcodes

Again remove the trailing *:N:0 in all of the barcode fastq files.¶

In [26]:

!sed 's/ 3:N:0://g' Lane1.barcodes/barcodes.fastq > Lane1.barcodes.fastq
!sed 's/ 3:N:0://g' Lane2.barcodes/barcodes.fastq > Lane2.barcodes.fastq
!sed 's/ 3:N:0://g' Lane3.barcodes/barcodes.fastq > Lane3.barcodes.fastq
!sed 's/ 3:N:0://g' Lane4.barcodes/barcodes.fastq > Lane4.barcodes.fastq

Now, "join" the paired ends. This will create sequences that have the forward barcode at the beginning and the reverse barcode at the end.¶

In [33]:

!join_paired_ends.py -r HMSB_AZ_35_NoIndex_L001_R1_001_fixed.fastq -f HMSB_AZ_35_NoIndex_L001_R4_001_fixed.fastq -b Lane1.barcodes.fastq -o Lane1_joined
!join_paired_ends.py -r HMSB_AZ_35_NoIndex_L001_R1_002_fixed.fastq -f HMSB_AZ_35_NoIndex_L001_R4_002_fixed.fastq -b Lane2.barcodes.fastq -o Lane2_joined
!join_paired_ends.py -r HMSB_AZ_35_NoIndex_L001_R1_003_fixed.fastq -f HMSB_AZ_35_NoIndex_L001_R4_003_fixed.fastq -b Lane3.barcodes.fastq -o Lane3_joined
!join_paired_ends.py -r HMSB_AZ_35_NoIndex_L001_R1_004_fixed.fastq -f HMSB_AZ_35_NoIndex_L001_R4_004_fixed.fastq -b Lane4.barcodes.fastq -o Lane4_joined

Now, time to demultiplex!¶

In [111]:

!split_libraries_fastq.py -q 5 -r 9 -p 0.5 -i Lane1_joined/fastqjoin.join.fastq -o Demultiplexed_Lane1 -m MicrobesWeEat.txt -b Lane1_joined/fastqjoin.join_barcodes.fastq --retain_unassigned_reads --barcode_type 16 --max_barcode_errors 2
!split_libraries_fastq.py -q 5 -r 9 -p 0.5 -i Lane2_joined/fastqjoin.join.fastq -o Demultiplexed_Lane2 -m MicrobesWeEat.txt -b Lane2_joined/fastqjoin.join_barcodes.fastq --retain_unassigned_reads --barcode_type 16 --max_barcode_errors 2
!split_libraries_fastq.py -q 5 -r 9 -p 0.5 -i Lane3_joined/fastqjoin.join.fastq -o Demultiplexed_Lane3 -m MicrobesWeEat.txt -b Lane3_joined/fastqjoin.join_barcodes.fastq --retain_unassigned_reads --barcode_type 16 --max_barcode_errors 2
!split_libraries_fastq.py -q 5 -r 9 -p 0.5 -i Lane4_joined/fastqjoin.join.fastq -o Demultiplexed_Lane4 -m MicrobesWeEat.txt -b Lane4_joined/fastqjoin.join_barcodes.fastq --retain_unassigned_reads --barcode_type 16 --max_barcode_errors 2

I want to know how many reads per library I have.¶

In [42]:

#!cat Demultiplexed_Lane1/split_library_log.txt
#!cat Demultiplexed_Lane2/split_library_log.txt
#!cat Demultiplexed_Lane3/split_library_log.txt
#!cat Demultiplexed_Lane4/split_library_log.txt

At this point, I want to concatenate all of the lanes, but there will be multiple sequences with the same ID (e.g., Unassigned_0) and I'm assuming that will cause problems. So, I have 2 choices: concatenate before running split_libraries_fastq.py OR figure out if QIIME will take more than one input fasta file. I searched for a bit to find out if the second option will work, but I ran out of patience with the search before I found an answer. So, I'm going concatenate the output from join_paired_ends.py and run split_libraries_fastq.py again.¶

In [118]:

!cat Lane*_joined/fastqjoin.join.fastq > All_Lanes_joined.fastq
!cat Lane*_joined/fastqjoin.join_barcodes.fastq > All_barcodes_joined.fastq

The parameters that I'm using for split_libraries.py were optimized through trial and error. Either pretty much all sequences passed the quality filter or none of them did, so tuning the parameters was pretty straightforward.¶

In [119]:

!split_libraries_fastq.py -q 5 -r 9 -p 0.5 -i All_Lanes_joined.fastq -o Demultiplexed_All -m MicrobesWeEat.txt -b All_barcodes_joined.fastq --retain_unassigned_reads --barcode_type 16 --max_barcode_errors 2

In [79]:

!cat Demultiplexed_All/split_library_log.txt

cat: Demultiplexed_All/split_library_log.txt: No such file or directory

In [45]:

#!ls Demultiplexed_All/

Pick closed reference OTUs for PICRUSt¶

While there are good reasons NOT to use closed-reference otu-picking in general, because one of the things I'd like to do is to use PICRUSt to predict a metagenome, I will start by using the 99% cutoff greengenes reference files for closed-reference picking¶

In [6]:

!pick_closed_reference_otus.py -p 99parameters.txt -o greengenes_99_otus -i Demultiplexed_All/seqs.fna -r $reference_seqs_99 -t $reference_tax_99 -a -O 2 -f

In [9]:

!pick_closed_reference_otus.py -p 97parameters.txt -o greengenes_97_otus -i Demultiplexed_All/seqs.fna -r $reference_seqs_97 -t $reference_tax_97 -a -O 2 -f

In [25]:

#!cat greengenes_99_otus/uclust_ref_picked_otus/seqs_otus.log

In [64]:

#!biom summarize-table -i greengenes_99_otus/otu_table.biom -o 99_otu_table_summary.txt
#!cat 99_otu_table_summary.txt

In [1]:

#!biom summarize-table -i greengenes_97_otus/otu_table.biom -o 97_otu_table_summary.txt
#!cat 97_otu_table_summary.txt

cat: 97_otu_table_summary.txt: No such file or directory

In [6]:

#!cat greengenes_97_otus/uclust_ref_picked_otus/seqs_otus.log

Now for chimera-checking (I suppose this could be done earlier, actually.) Next time, I would do it just fter split_libraries_fastq.py). The input must be unaligned fasta sequences in the same orientation as the reference database. That's why I'm using the output from split_libraries_fastq.py, as recommended.¶

In [21]:

#!identify_chimeric_seqs.py -i Demultiplexed_All/seqs.fna -m usearch61 -o 97_usearch61_chimera_detection/ -r $reference_seqs_97
#!identify_chimeric_seqs.py -i Demultiplexed_All/seqs.fna -m usearch61 -o 99_usearch61_chimera_detection/ -r $reference_seqs_99

Remove the chloroplast and mitochondrial sequences. I'm also removing the "Unassigned" sequences. There are 795 of them in 39 OTUs. I manually blasted the representative sequences from each OTU, and they were all chloroplasts. I am also removing all of the remaining singletons.¶

In [4]:

!filter_taxa_from_otu_table.py -i greengenes_97_otus/otu_table.biom -o closed_ref_97_otu_table_no_euks.biom -n c__Chloroplast,f__mitochondria
!filter_taxa_from_otu_table.py -i closed_ref_97_otu_table_no_euks.biom -o closed_ref_97_otu_table_no_euks_no_unassigned.biom -n Unassigned
!filter_otus_from_otu_table.py -i closed_ref_97_otu_table_no_euks_no_unassigned.biom -o closed_ref_97_otu_table_filtered.biom  -n 2

In [37]:

!filter_taxa_from_otu_table.py -i greengenes_99_otus/otu_table.biom -o closed_ref_99_otu_table_no_euks.biom -n c__Chloroplast,f__mitochondria
!filter_taxa_from_otu_table.py -i closed_ref_99_otu_table_no_euks.biom -o closed_ref_99_otu_table_no_euks_no_unassigned.biom -n Unassigned
!filter_otus_from_otu_table.py -i closed_ref_99_otu_table_no_euks_no_unassigned.biom -o closed_ref_99_otu_table_filtered.biom  -n 2

In [4]:

#!biom summarize-table -i closed_ref_99_otu_table_filtered.biom -o closed_ref_99_otu_table_filtered.summary
#!cat closed_ref_99_otu_table_filtered.summary

This is a reminder to look at the biom summary before proceeding. In my case, I still have all of the PhiX and other unassigned reads in my data. So, I created a file called samples_to_keep.txt with a list of my actual sample IDs in it, and I will use this list to remove the Unassigned reads.¶

In [5]:

!filter_samples_from_otu_table.py -i closed_ref_97_otu_table_filtered.biom -o closed_ref_97_otu_table_final.biom --sample_id_fp samples_to_keep.txt

In [68]:

!filter_samples_from_otu_table.py -i closed_ref_99_otu_table_filtered.biom -o closed_ref_99_otu_table_final.biom --sample_id_fp samples_to_keep.txt

Run PICRUSt¶

Normalize otu table by 16S copy number¶

In [6]:

!/Applications/picrust-1.0.0/scripts/normalize_by_copy_number.py -i closed_ref_97_otu_table_final.biom -o normalized_closed_ref_97_otu_table.biom

In [69]:

!/Applications/picrust-1.0.0/scripts/normalize_by_copy_number.py -i closed_ref_99_otu_table_final.biom -o normalized_closed_ref_99_otu_table.biom

Predict the metagenome, note that I am using the -a flag to calculate NSTI.¶

In [7]:

!/Applications/picrust-1.0.0/scripts/predict_metagenomes.py -a normalized_closed_ref_97_NSTI.tab -i normalized_closed_ref_97_otu_table.biom -o metagenome_prediction_from_normalized_closed_ref_97_otu_table.biom

In [70]:

!/Applications/picrust-1.0.0/scripts/predict_metagenomes.py -a normalized_closed_ref_99_NSTI.tab -i normalized_closed_ref_99_otu_table.biom -o metagenome_prediction_from_normalized_closed_ref_99_otu_table.biom

In [ ]:

#!more normalized_closed_ref_97_NSTI.tab

In [72]:

#!more normalized_closed_ref_99_NSTI.tab

Collapse the thousands of predicted functions into higher categories (e.g KOs into KEGG Pathways) and generate STAMP profiles¶

In [8]:

!/Applications/picrust-1.0.0/scripts/categorize_by_function.py -f -i metagenome_prediction_from_normalized_closed_ref_97_otu_table.biom -c KEGG_Pathways -l 3 -o 97_MWE_predicted_metagenomes.L3.txt
!/Applications/picrust-1.0.0/scripts/categorize_by_function.py -f -i metagenome_prediction_from_normalized_closed_ref_97_otu_table.biom -c KEGG_Pathways -l 2 -o 97_MWE_predicted_metagenomes.L2.txt

In [73]:

!/Applications/picrust-1.0.0/scripts/categorize_by_function.py -f -i metagenome_prediction_from_normalized_closed_ref_99_otu_table.biom -c KEGG_Pathways -l 3 -o 99_MWE_predicted_metagenomes.L3.txt
!/Applications/picrust-1.0.0/scripts/categorize_by_function.py -f -i metagenome_prediction_from_normalized_closed_ref_99_otu_table.biom -c KEGG_Pathways -l 2 -o 99_MWE_predicted_metagenomes.L2.txt

In [9]:

!sed '1d' 97_MWE_predicted_metagenomes.L3.txt | rev | cut -f 2- | rev > 97_MWE_predicted_metagenome.L3.spf
!sed '1d' 97_MWE_predicted_metagenomes.L2.txt | rev | cut -f 2- | rev > 97_MWE_predicted_metagenome.L2.spf

In [74]:

!sed '1d' 99_MWE_predicted_metagenomes.L3.txt | rev | cut -f 2- | rev > 99_MWE_predicted_metagenome.L3.spf
!sed '1d' 99_MWE_predicted_metagenomes.L2.txt | rev | cut -f 2- | rev > 99_MWE_predicted_metagenome.L2.spf

The most significant difference between Diet Types is "Other Glycan Degradation" (K00511) at L3 (see figure).
There are no significant differences at L2.

Identify which OTUs are contributing to K00511¶

In [13]:

!/Applications/picrust-1.0.0/scripts/metagenome_contributions.py -l K00511 -i normalized_closed_ref_97_otu_table.biom -o 97_metagenome_contribution_K00511 -g 13_5

Filtering the genome table to include only user-specified functions: ['K00511']

In [14]:

!/Applications/picrust-1.0.0/scripts/metagenome_contributions.py -l K00511 -i normalized_closed_ref_99_otu_table.biom -o 99_metagenome_contribution_K00511 -g 13_5

Filtering the genome table to include only user-specified functions: ['K00511']

In [15]:

!cat 97_metagenome_contribution_K00511

Gene	Sample	OTU	GeneCountPerGenome	OTUAbundanceInSample	CountContributedByOTU	ContributionPercentOfSample	ContributionPercentOfAllSamples
K00511	USDAdinner	4366579	1.0	1.0	1.0	1.0	1.0

In [16]:

!cat 99_metagenome_contribution_K00511

Gene	Sample	OTU	GeneCountPerGenome	OTUAbundanceInSample	CountContributedByOTU	ContributionPercentOfSample	ContributionPercentOfAllSamples
K00511	USDAdinner	106978	1.0	1.0	1.0	1.0	1.0

So, OTU 106978, which is Corallococcus exiguus, is abundant in the USDA dinner, and is the contributor to the "Other Glycan Degradation" in this metagenome¶

Pick OTUs¶

The QIIME developers prefer the open reference OTU-picking approach, and I have no reason not to go with the default 97% cutoff, so that's what I'll do here. The pick_open_reference_otus.py script will cluster all of the sequences, assign taxonomy to the OTUs (when possible, a greengenes ID will be assigned,) choose a representative sequence from each OTU (rep_set), and align and build a phylogenetic tree from the representative sequences.¶

This time I'll remember to get rid of the Unassigned reads before I pick the OTUs!¶

In [81]:

!filter_fasta.py -f MicrobesWeEat.fasta -o MicrobesWeEat_NoUnassigned.fasta --sample_id_fp samples_to_keep.txt

In [96]:

!pick_open_reference_otus.py -p 97parameters.txt -r $reference_seqs_97 -i MicrobesWeEat_NoUnassigned.fasta -o 97_open_reference_otus -f

Sanity check for the otu table.¶

In [8]:

#!biom summarize-table -i 97_open_reference_otus/otu_table_mc2_w_tax_no_pynast_failures.biom -o otu_table_summary_before_cleanup.txt
#!cat otu_table_summary_before_cleanup.txt

Filter out singletons, mitochondria, and chloroplasts.¶

In [88]:

!filter_taxa_from_otu_table.py -i 97_open_reference_otus/otu_table_mc2_w_tax_no_pynast_failures.biom -o open_ref_97_otu_table_no_euks.biom -n c__Chloroplast,f__mitochondria
!filter_taxa_from_otu_table.py -i open_ref_97_otu_table_no_euks.biom -o open_ref_97_otu_table_no_euks_no_unassigned.biom -n Unassigned
!filter_otus_from_otu_table.py -i open_ref_97_otu_table_no_euks_no_unassigned.biom -o open_ref_97_otu_table_no_euks_no_unassigned_no_singletons.biom  -n 2

In [173]:

#sanity check
#!biom summarize-table -i open_ref_97_otu_table_no_euks_no_unassigned_no_singletons.biom -o open_ref_97_otu_table_no_euks_no_unassigned_no_singletons.summary
#!cat open_ref_97_otu_table_no_euks_no_unassigned_no_singletons.summary

In [104]:

#!identify_chimeric_seqs.py -m ChimeraSlayer -i rep_set_aligned.fasta -a core_set_aligned.fasta.imputed -o chimeric_seqs_open_97.txt

I couldn't get ChimeraSlayer to run. QIIME support thinks that the problem might be that I do not have enough memory because he was able to run it on my file. He was nice enough to send me the output, so I can proceed.¶

Now that I have my list of chimeric OTUs, I need to remove them from the OTU table, remove them from my aligned representative sequences, build a new phylogenetic tree from the new rep set alignment (I think, but all of this may not be necessary.)¶

In [170]:

#remove chimeras from aligned rep_set 
!filter_fasta.py -f 97_open_reference_otus/pynast_aligned_seqs/rep_set_aligned.fasta -o non_chimeric_rep_set_aligned.fasta -s chimeric_seqs_open_97.txt -n

In [108]:

!make_phylogeny.py -i non_chimeric_rep_set_aligned.fasta

In [172]:

!filter_otus_from_otu_table.py -i open_ref_97_otu_table_no_euks_no_unassigned_no_singletons.biom -o open_ref_97_otu_table_no_euks_no_unassigned_no_singletons_no_chimeras.biom -e chimeric_seqs_open_97.txt

In [175]:

#sanity check
#!biom summarize-table -i open_ref_97_otu_table_no_euks_no_unassigned_no_singletons_no_chimeras.biom -o open_ref_97_otu_table_no_euks_no_unassigned_no_singletons_no_chimeras.summary
#!cat open_ref_97_otu_table_no_euks_no_unassigned_no_singletons_no_chimeras.summary

The following steps won't work without the metadata added to the biom file, so I'll do that first. I had to dig the observation metadata mapping file out of here: uclust_assigned_taxonomy/rep_set_tax_assignments.txt AND I had to add a header row to make it look like the example files in http://biom-format.org/documentation/adding_metadata.html ¶

In [26]:

!rm 97_open_otu_table_wmd.biom
!biom add-metadata -i open_ref_97_otu_table_no_euks_no_unassigned_no_singletons_no_chimeras.biom -o 97_open_otu_table_wmd.biom --sample-metadata-fp MicrobesWeEat.txt --observation-metadata-fp rep_set_tax_assignments_header.txt 

Core Diversity Analyses¶

Running core_diversity_analyses.py with libraries subsampled to 893 sequences and categorized by DietType. Phylogeny-based analyses will use the new tree with the chimeric sequences removed.¶

In [35]:

#sanity check
#!biom summarize-table -i 97_open_otu_table_wmd.biom -o 97_open_otu_table_wmd.summary
#!cat 97_open_otu_table_wmd.summary

In [37]:

#unique counts instead of totals
#!biom summarize-table -i 97_open_otu_table_wmd.biom --qualitative -o 97_open_otu_table_wmd.qualitative.summary
#!cat 97_open_otu_table_wmd.qualitative.summary

In [38]:

#unique counts instead of totals, but with rarefied OTU table
!biom summarize-table -i core_diversity_analyses_open_ref_97/table_even771.biom --qualitative -o 97_open_otu_table_771.qualitative.summary
!cat 97_open_otu_table_771.qualitative.summary

Num samples: 15
Num observations: 729
Table md5 (unzipped): 0f5f8fd0fc772b57d5af6c39cacea84b

Observations/sample summary:
 Min: 13
 Max: 203
 Median: 116.000
 Mean: 111.067
 Std. dev.: 60.793
 Sample Metadata Categories: Iron_bin; Carotene_beta; phinchID; Fermented; Sodium; Carotene_beta_bin; Protein_bin; Dairy; Forward; Sodium_bin; Sugars_bin; Fiber; Description; BarcodeSequence; PC1_bin; Fruit; Energy_bin; Vitamin_A_bin; Vitamin_A; Fatty_acids; Carbohydrate_bin; Fiber_bin; Cholesterol_bin; DietType; Reverse; Calcium_bin; Cooked; LinkerPrimerSequence; Iron; Cholesterol; HFCS; Meal; Total_lipid_bin; Total_lipid; Calcium; Vitamin_C; Energy; Sugars; Carbohydrate; PC1; Protein; Fatty_Acids_bin; Vitamin_C_bin
 Observation Metadata Categories: taxonomy; confidence

Observations/sample detail:
 USDAsnack1: 13
 USDAsnack2: 26
 AMERICANsnack: 41
 AMERICANlunch: 42
 VEGANsnack1: 58
 VEGANlunch: 94
 VEGANbreakfast: 106
 USDAlunch: 116
 AMERICANbreakfast: 123
 USDAdinner: 154
 VEGANdinner: 167
 USDAbreakfast: 168
 AMERICANdinner: 170
 VEGANsnack3: 185
 VEGANsnack2: 203

In [181]:

#!rm -r DietType_core_diversity_analyses_open_ref_97
#!rm -r core_diversity_analyses_open_ref_97

In [18]:

!core_diversity_analyses.py -p 97parameters.txt -i 97_open_otu_table_wmd.biom -o DietType_core_diversity_analyses_open_ref_97 -m MicrobesWeEat.txt -e 771 -c DietType -t non_chimeric_rep_set_aligned.tre

In [17]:

!core_diversity_analyses.py -p 97parameters.txt -i 97_open_otu_table_wmd.biom -o core_diversity_analyses_open_ref_97 -m MicrobesWeEat.txt -e 771 -t non_chimeric_rep_set_aligned.tre

In [ ]:

The first time I ran this, it did not work. The issue is that my taxonomy metadata in the biom file is included as strings rather than as lists. I'm not sure why this is the case, but I tracked down the fix, which is to add this line to my parameters file:
summarize_taxa:md_as_string True

In [19]:

!nmds.py -i core_diversity_analyses_open_ref_97/bdiv_even771/weighted_unifrac_dm.txt -o NMDS_output

In [20]:

!cat NMDS_output

samples	NMDS1	NMDS2	NMDS3
AMERICANsnack	-1.17027261939	0.160180891398	0.117604120355
VEGANsnack1	0.728025005467	1.73753280116	-0.0163669886395
AMERICANbreakfast	0.933966186973	-0.193126733697	-0.971073635433
VEGANsnack2	0.626862241432	-0.247210344627	0.486249440075
AMERICANlunch	-1.10856720972	0.055161785597	-0.261550124104
VEGANsnack3	0.694201915455	-0.0584857699766	0.392490895575
VEGANdinner	0.791638204242	-0.318713695544	0.16089612955
VEGANlunch	-0.549011849737	-0.571419325572	0.499809089166
USDAdinner	0.692697963831	-0.347314151775	-0.222069120523
AMERICANdinner	0.737339108169	-0.211959856119	0.021761404915
USDAlunch	-0.337601140052	0.372012620336	0.557853344909
VEGANbreakfast	-0.0863078410135	-0.177500463084	-0.459990243303
USDAbreakfast	0.549193082536	-0.253528600159	0.133269140252
USDAsnack2	-1.32177734003	-0.0617229557549	-0.222565931975
USDAsnack1	-1.18038570816	0.116093797813	-0.216317520821

stress	0.0260933547181	0	0
% variation explained	0	0	0

Statistical Analyses¶

want to look at colelctor's curves without rarefaction¶

In [21]:

!alpha_rarefaction.py -i 97_open_otu_table_wmd.biom -m $mapping_file -o 97_alpha_uneven -t 97_open_reference_otus/rep_set.tre -f

test for the significance of dietary pattern on the overall microbial community composition¶

In [24]:

!rm -r permanova_DietTypes
!compare_categories.py --method permanova -m $mapping_file -c DietType -i core_diversity_analyses_open_ref_97/bdiv_even771/weighted_unifrac_dm.txt -o permanova_DietTypes

test for significant differences in taxonomic richness across dietary patterns¶

In [25]:

!compare_alpha_diversity.py -i core_diversity_analyses_open_ref_97/arare_max771/alpha_div_collated/PD_whole_tree.txt -m $mapping_file -c DietType -o compare_alpha_div_DietType_PD

test for the significant variation in frequency of individual OTUs across dietary patterns¶

In [28]:

!biom add-metadata -i core_diversity_analyses_open_ref_97/taxa_plots/table_mc771_sorted_L3.biom -o core_diversity_analyses_open_ref_97/taxa_plots/table_mc771_sorted_L2_wmd.biom --sample-metadata-fp MicrobesWeEat.txt #--observation-metadata-fp rep_set_tax_assignments_header.txt 
!biom add-metadata -i core_diversity_analyses_open_ref_97/taxa_plots/table_mc771_sorted_L3.biom -o core_diversity_analyses_open_ref_97/taxa_plots/table_mc771_sorted_L3_wmd.biom --sample-metadata-fp MicrobesWeEat.txt #--observation-metadata-fp rep_set_tax_assignments_header.txt
!biom add-metadata -i core_diversity_analyses_open_ref_97/taxa_plots/table_mc771_sorted_L4.biom -o core_diversity_analyses_open_ref_97/taxa_plots/table_mc771_sorted_L4_wmd.biom --sample-metadata-fp MicrobesWeEat.txt #--observation-metadata-fp rep_set_tax_assignments_header.txt 
!biom add-metadata -i core_diversity_analyses_open_ref_97/taxa_plots/table_mc771_sorted_L5.biom -o core_diversity_analyses_open_ref_97/taxa_plots/table_mc771_sorted_L5_wmd.biom --sample-metadata-fp MicrobesWeEat.txt #--observation-metadata-fp rep_set_tax_assignments_header.txt 

In [29]:

!group_significance.py -i core_diversity_analyses_open_ref_97/taxa_plots/table_mc771_sorted_L2_wmd.biom -m $mapping_file -c DietType -o group_significance_L2
!group_significance.py -i core_diversity_analyses_open_ref_97/taxa_plots/table_mc771_sorted_L3_wmd.biom -m $mapping_file -c DietType -o group_significance_L3
!group_significance.py -i core_diversity_analyses_open_ref_97/taxa_plots/table_mc771_sorted_L4_wmd.biom -m $mapping_file -c DietType -o group_significance_L4
!group_significance.py -i core_diversity_analyses_open_ref_97/taxa_plots/table_mc771_sorted_L5_wmd.biom -m $mapping_file -c DietType -o group_significance_L5

No metadata in biom table.
No metadata in biom table.
No metadata in biom table.
No metadata in biom table.

In [30]:

!make_otu_heatmap.py -i 97_open_otu_table_wmd.biom

In [39]:

!make_otu_heatmap.py -i core_diversity_analyses_open_ref_97/table_even771.biom

In [34]:

!per_library_stats.py -i 97_open_otu_table_wmd.biom-m $mapping_file -o 97_open_otu_table_wmd.perlibstats

/bin/sh: per_library_stats.py: command not found

Export for Use with Phinch¶

In [4]:

!biom add-metadata -i otu_tables/open_ref_97_otu_table_no_euks_no_singletons.biom -o otu_tables/open_ref_97_otu_table_no_euks_no_singletons_with_metadata.biom -m MicrobesWeEat.txt

I copied the stuff below from a QIIME/iPython notebook tutorial: http://nbviewer.ipython.org/github/qiime/qiime/blob/1.8.0/examples/ipynb/illumina_overview_tutorial.ipynbI'm not sure what all of it does...

Make sure mapping file is good to go. This also provides a quick check for macqiime.¶

Demultiplex and join.¶

Gotta unzip them to work with them.¶

After consulting with a QIIME developer via the forum, I know that there is an issue with the trailing *:N:0 in all of the fastq files. A fix before proceeding is below, but this should not be necessary in future releases.¶

Create a new barcode file in which the forward and reverse barcodes are concatenated into a single barcode.¶

Again remove the trailing *:N:0 in all of the barcode fastq files.¶

Now, "join" the paired ends. This will create sequences that have the forward barcode at the beginning and the reverse barcode at the end.¶

Now, time to demultiplex!¶

I want to know how many reads per library I have.¶

The parameters that I'm using for split_libraries.py were optimized through trial and error. Either pretty much all sequences passed the quality filter or none of them did, so tuning the parameters was pretty straightforward.¶

Pick closed reference OTUs for PICRUSt¶

While there are good reasons NOT to use closed-reference otu-picking in general, because one of the things I'd like to do is to use PICRUSt to predict a metagenome, I will start by using the 99% cutoff greengenes reference files for closed-reference picking¶

Remove the chloroplast and mitochondrial sequences. I'm also removing the "Unassigned" sequences. There are 795 of them in 39 OTUs. I manually blasted the representative sequences from each OTU, and they were all chloroplasts. I am also removing all of the remaining singletons.¶

This is a reminder to look at the biom summary before proceeding. In my case, I still have all of the PhiX and other unassigned reads in my data. So, I created a file called samples_to_keep.txt with a list of my actual sample IDs in it, and I will use this list to remove the Unassigned reads.¶

Run PICRUSt¶

Normalize otu table by 16S copy number¶

Predict the metagenome, note that I am using the -a flag to calculate NSTI.¶

Collapse the thousands of predicted functions into higher categories (e.g KOs into KEGG Pathways) and generate STAMP profiles¶

The most significant difference between Diet Types is "Other Glycan Degradation" (K00511) at L3 (see figure). There are no significant differences at L2.

Identify which OTUs are contributing to K00511¶

So, OTU 106978, which is Corallococcus exiguus, is abundant in the USDA dinner, and is the contributor to the "Other Glycan Degradation" in this metagenome¶

Pick OTUs¶

This time I'll remember to get rid of the Unassigned reads before I pick the OTUs!¶

Sanity check for the otu table.¶

Filter out singletons, mitochondria, and chloroplasts.¶

I couldn't get ChimeraSlayer to run. QIIME support thinks that the problem might be that I do not have enough memory because he was able to run it on my file. He was nice enough to send me the output, so I can proceed.¶

Now that I have my list of chimeric OTUs, I need to remove them from the OTU table, remove them from my aligned representative sequences, build a new phylogenetic tree from the new rep set alignment (I think, but all of this may not be necessary.)¶

Core Diversity Analyses¶

Running core_diversity_analyses.py with libraries subsampled to 893 sequences and categorized by DietType. Phylogeny-based analyses will use the new tree with the chimeric sequences removed.¶

The first time I ran this, it did not work. The issue is that my taxonomy metadata in the biom file is included as strings rather than as lists. I'm not sure why this is the case, but I tracked down the fix, which is to add this line to my parameters file:summarize_taxa:md_as_string True

Statistical Analyses¶

want to look at colelctor's curves without rarefaction¶

test for the significance of dietary pattern on the overall microbial community composition¶

test for significant differences in taxonomic richness across dietary patterns¶

test for the significant variation in frequency of individual OTUs across dietary patterns¶

Export for Use with Phinch¶

I copied the stuff below from a QIIME/iPython notebook tutorial: http://nbviewer.ipython.org/github/qiime/qiime/blob/1.8.0/examples/ipynb/illumina_overview_tutorial.ipynb
I'm not sure what all of it does...

The most significant difference between Diet Types is "Other Glycan Degradation" (K00511) at L3 (see figure).
There are no significant differences at L2.

The first time I ran this, it did not work. The issue is that my taxonomy metadata in the biom file is included as strings rather than as lists. I'm not sure why this is the case, but I tracked down the fix, which is to add this line to my parameters file:
summarize_taxa:md_as_string True