Notebook

In this assignment, you'll be using QIIME to analyze two data sets. First, you'll work through a tiny test data set, which contains about 100 sequences from a few different human body sites collected over a few days. This will familiarize you with running QIIME through the IPython Notebook, and working with QIIME output. You'll then work through an analysis of a real-world data set, where human-associated microbial communities were shown to have forensic potential, potentially allowing investigators to determine who touched an object based on the "microbial fingerprint" they leave behind.

Before starting this assignment, you'll work through some reading that will provide you with the necessary background to complete this assignment.

You should begin by reading Fierer et al (2010), where this forensic study was initially published. You should then review the QIIME Illumina Overview Tutorial, which will cover some of the commands that you'll be using here. Finally, as you work through the assignment, any time you use a QIIME command you should look up that command in QIIME script index to understand what it does and how to use it.

IMPORTANT: This assignment includes some steps that will take some time to run (possibly up to 20 minutes). You should avoid starting this assignment close to the deadline as the server will be backed up at that time if a lot of people are running their analyses at the same time. You are responsible for starting this in a timely manner! If your assignment is late because it did not complete in time, that is your fault, not ours!

IMPORTANT: If you create a file that later needs to be removed (e.g., because it was created incorrectly, and you get an error when trying to re-run the command because the file already exists), you can remove it using !rm or !rmdir (where the ! tells IPython that you're issuing a shell/bash command, rather than a python command). For example, if I needed to remove a directory call misc_junk, I could run:

!rmdir misc_junk

If I needed to remove a file called junk.txt, I could run:

!rm junk.txt

Getting started: notebook configuration¶

IMPORTANT: If you start your assignment, and then leave the notebook server, you will need to re-run the cells in this section to configure your IPython Notebook session.

In [ ]:

from os import chdir, mkdir, rename
from os.path import exists, join, expandvars
from qiime.test import write_test_data
from IPython.display import FileLinks, FileLink

raw_data_base = "/home/jgc53/public_html/qiime_assignment_materials"
reference_seqs = join(raw_data_base, "88_otus.fasta")
reference_tree = join(raw_data_base, "88_otus.tree")
reference_tax = join(raw_data_base, "88_otu_taxonomy.txt")

forensic_seqs = join(raw_data_base, "forensic-seqs.fna")
forensic_map = join(raw_data_base, "forensic-map.txt")

personal_dir = './'

tiny_test_dir = "tiny-test"
tiny_test_path = join(personal_dir, tiny_test_dir)
forensic_dir = "forensic"
forensic_path = join(personal_dir, forensic_dir)

if not exists(tiny_test_path):
    mkdir(tiny_test_path)
    
if not exists(forensic_path):
    mkdir(forensic_path)
    
write_test_data(tiny_test_path)
rename(join(tiny_test_path, "map"), join(tiny_test_path, "map.txt"))
rename(join(tiny_test_path, "seqs"), join(tiny_test_path, "seqs.fna"))

Part 1: Running example commands with the QIIME "Tiny Test" data set¶

You can always see what files are present, and view or download them, using FileLinks (or FileLink, if providing a single file path). Execute this cell and open the map.txt file to view the sample metadata associated with this study.

In [ ]:

FileLinks(tiny_test_dir)

The first step in a QIIME analysis is to prepare the mapping file and validate it. In this case, the mapping file has been prepared for you, so you just need to validate it. We do that with validate_mapping_file.py. You can run that as follows:

In [ ]:

input_map_fp = join(tiny_test_dir, 'map.txt')
output_dir = join(tiny_test_dir, 'vmf_out')

In [ ]:

!validate_mapping_file.py -m $input_map_fp -o $output_dir

Note that there were errors and/or warnings generated during mapping file validation. Open and view the map.html file in the vmf output directory. We see that the only issues were with barcodes, and we are not using the barcodes in this example, so we can continue on.

In [ ]:

FileLinks(output_dir)

Next we're going to run open reference OTU picking at 88% identity using pick_open_reference_otus.py. You should read OTU picking strategies in QIIME to understand what this process does, and why we prefer this to other strategies. This will take a few minutes to run.

In [ ]:

input_seqs_fp = join(tiny_test_dir, 'seqs.fna')
output_dir = join(tiny_test_dir, 'or_otus')

In [ ]:

!pick_open_reference_otus.py -i $input_seqs_fp -o $output_dir -r $reference_seqs -s 0.88

You can now review the output generated by this command. Much of these are intermediary data files, so not useful to view directly, though do open the log file (listed at the top as log...txt) to get an idea of what information is stored in the log file (e.g., how can you compute the run time from the information in the log file?

In [ ]:

or_index = join(output_dir, "index.html")
FileLink(or_index)

We'll create variables pointing to several of the files that we're interested in, and generate a summary of the output.

In [ ]:

otu_table_fp = join(tiny_test_dir, 'or_otus', 'otu_table_mc2_w_tax_no_pynast_failures.biom')
output_fp = join(tiny_test_dir, 'or_otus', 'otu_table_mc2_w_tax_no_pynast_failures_summ.txt')
tree_fp = join(tiny_test_dir, 'or_otus', 'rep_set.tre')

In [ ]:

!biom summarize-table -i $otu_table_fp -o $output_fp

We can look at that summary (via the link below) to see how many sequences were obtained per sample. One thing we need to do here is choose an even sampling depth for our diversity analyses. Any samples with fewer than that number of sequences will be discarded, and any samples with more than that number of sequences will be randomly subsampled (without replacement, i.e., rarefied) to contain that number of sequences. In this case, I'll choose 17, but that number is very study specific. You do not want to use a sampling depth of 17 for the forensic study.

In [ ]:

FileLink(output_fp)

We'll next run core_diversity_analyses.py, which runs several different diversity analysis commands. The primary parameters that will vary for this command are the even sampling depth (mentioned above) and the categories (or mapping file headers). Both of these are study specific. Avoid passing categories that have more than about six different values in the mapping file, as these will take a very, very long time to run.

In [ ]:

even_sampling_depth = 17
output_dir = join(tiny_test_dir, "cd_even%d" % even_sampling_depth)

In [ ]:

!core_diversity_analyses.py -i $otu_table_fp -o $output_dir -t $tree_fp -e $even_sampling_depth -c "SampleType,days_since_epoch" -m $input_map_fp

We can now view the output of our diversity analyses. The file of interest here is index.html, so we'll link directly to that. Review the taxonomic summaries, the alpha diversity results, and the beta diversity results. For the forensic data set, you'll turn in the Rarified BIOM table, which is linked from the index.html.

In [ ]:

cd_index = join(output_dir, "index.html")
FileLink(cd_index)

Part 2: Microbiome-based forensics¶

Next, you'll adapt the commands above to perform a more interesting analysis and answer some questions. The commands will be largely the same as above, but instead you'll use the following variables:

forensic_dir
forensic_seqs
forensic_map

For example, here is the validate_mapping_file.py command:

In [ ]:

input_map_fp = forensic_map
output_dir = join(forensic_dir, 'vmf_out')
!validate_mapping_file.py -m $input_map_fp -o $output_dir

In [ ]:

FileLinks(output_dir)

Next, run pick_open_reference_otus.py (hint: the reference sequences are the same as for the previous data set).

In [ ]:

Next, run biom summarize-table, and choose an even sampling depth (hint: it should probably be over 100).

In [ ]:

Finally, run core_diversity_analyses.py, and be sure to set the even_sampling_depth variable before you execute the command. This may take about 20 minutes. DO NOT pass mapping file headers that contain more than about six values in the column, or this will take much, much longer.

In [ ]:

even_sampling_depth = None

In [ ]:

Part 2: Questions¶

Answer all of the questions below in the notebook, and turn in a copy of your notebook with your rarefied BIOM table (linked from the core_diversity_analyses.py index.html file that you created (right click the link, and Save as...). All of these questions refer to the forensic data set, not the tiny test data set!

Question 1: What was the minimum number of sequences per sample? What was the maximum number of sequences per sample? What even sampling depth did you choose, and why?

Answer Question 1 in this cell.

Question 2: How long did pick_open_reference_otus.py take to run (to the second). Review the log file, and compute the run time from information in that file. How long did core_diversity_analyses.py take to run (again, to the second).

Answer Question 2 in this cell.

Question 3: The focus of the Fierer 2010 paper was to show that it is possible to match an individual to the objects they touch based on the microbial communities that the individual leaves behind. The Unweighted UniFrac emperor plots (linked from the core_diversity_analyses.py index.html file) will allow you to figure out which subject (M2, M3, or M9) touched which keyboard (K1, K2, or K3). Match the individuals to the keyboard they touched, and explain how you came to this answer. There is one right answer to this question.

Answer Question 3 in this cell.

To turn in:

Gzipped Rarified BIOM table, available in the core_diversity_analyses.py output.
The IPython Notebook containing containing the full list of commands that you ran to generate the above data, including answers to the questions in Part 2, noting any problems that you ran into along the way

Part 3: Interpreting and reporting microbiome results¶

This section will focus on working with the results generated by the core_diversity_analyses.py command that you ran above.

For this section, you will write a 2.5 to 3 page paper describing your analysis. Your paper should not be any shorter than 2.5 pages nor any longer than 3 pages. It must have 1.5 line spacing, 1.25" margins, and be written in 12 point Times New Roman font. Figures and tables are included in the page count, though the total space taken by these should be a maximum of one page.

Write this as if you’re submitting to a journal, so your paper should contain:

A title and list of authors (just you, in this case). This should be in no more than 14 point font, and follow the margin/spacing requirements listed above.
An Introduction section describing the goals of the study (you can refer to Fierer et al (2010));
A Methods section containing a brief description of your bioinformatics methods (e.g., what version of QIIME, what type of OTU picking was used) and how the data was generated (e.g., sequencing platform);
And a Results and Conclusions section describing the results of your analysis.

You should address several specific questions in your Results and Conclusions:

Which keyboard belonged to each of the human subjects? Back this up with an Unweighted UniFrac PCoA figure that supports your answer, as well as the beta diversity boxplots that support your answer. Describe what each of these plots tells you that supports your answer. See Beta diversity results in the core_diversity_analyses.py output.
Which bacterial phyla appear to be most useful for matching each individual to their keyboard? See Taxonomic summary results in the core_diversity_analyses.py output.
Which ten OTUs were most significantly different across the subjects (see Group significance results in the core_diversity_analyses.py output). Present a table containing these ten OTUs and their taxonomy.
Which subjects had the most OTUs, on average? Was the number of OTUs across subjects significantly different? See Alpha diversity results in the core_diversity_analyses.py output. Include a plot that illustrates the number of OTUs observed for each subject.