#!/usr/bin/env python
# coding: utf-8

# # Introduction
# This notebook is a companion to the paper published in Nature by Tao Ding & Pat Schloss entitled ["Dynamics and associations of microbial community types across the human body"](http://dx.doi.org/10.1038/nature13178). Dr. Ding was a postdoc in the lab of Dr Schloss within the Department of Microbiology & Immunology at the University of Michigan. If any questions should arise, please feel free to contact Dr. Schloss at pschloss@umich.edu. This notebook was originally developed using mothur ~v.1.30. Since then we have tweaked the running of trim.flows which may have a small effect on the number of samples with 1000 reads. Since then all commands have been re-validated using mothur v.1.33.3. You can find all of the raw data and derivative files generated below (except for the dbGaP data) in the [dingschlossnature bucket](https://dingschlossnature.s3-us-west-2.amazonaws.com/index.html) on the AWS S3 server
# 
# 
# ## The HMP Clinical Sequencing Project
# 
# The underlying data were developed as part of the Human Microbiome Project to understand the structure of the microbiome in healthy individuals [Peterson et al. 2009](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2792171/). The folks at PLoS have been very helpful in collecting the papers published as part of the overall HMP clinical study as well as the other projects associated with HMP funding as the [The Human Microbiome Project Collection](http://www.ploscollections.org/article/browseIssue.action?issue=info:doi/10.1371/issue.pcol.v01.i13) (nb: there are other papers from the HMP that don't appear in this collection). As mentioned in our paper, the individuals in the main clinical study were **very** healthy. A talmudic question for you to consider is whether heatlhy == normal. We contend that if 80% of American's have a cavity in their mouth, they are *normal*, but not healthy. In this study, people with cavities were excluded from the entire study. You can learn more about the exclusion criteria at the [dbGaP project site](http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000228.v3.p1) and [Aagaard et al. ,2013](http://www.ncbi.nlm.nih.gov/pubmed/23165986).
# 
# Overall, the study sampled 15 bodysites from 150 women and 18 bodysites from 150 men. Of the 300 subjects, 150 were enrolled in Houston, TX and 150 were enrolled in St. Louis, MO. The goal was to sample all subjects twice and 100 subjects a third time. All samples were to be characterized by 16S rRNA gene sequencing and a decent number were sequenced via metagenomics shotgun sequencing. The current study only concerned itself with the 16S rRNA gene sequencing. There were several stages to the study. The first consisted of a "Pilot Production Study" where a set of 24 subjects (12 men and 12 women) were recruited in Houston. The samples were processed there and then split with one replicate sent to one sequencing center and another replicate sent to another sequencing center. There were four sequencing centers in total: The Baylor College of Medicine (BCM), The Washington University Genome Sequencing Center (WUGSC), The Broad Institute (BI), and the J Craig Venter Institute (JCVI). These samples have already been described in earlier studies by us and others ([Schloss et al. 2011](http://www.ploscollections.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0027310), [JCHMPDGWG 2012](http://www.ploscollections.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0039315)).
# 
# Next, the project scaled up to the full 300 subject pool. As we discuss in the Supplementary Information for the paper, the city of origin for each subject was a complicating varaible. The samples from the 150 subjects from St. Louis were extracted and sequenced at WUGSC. The samples from the 150 subjects in Houston were sent to BCM, JCVI, and BI for extraction and sequencing. Thus the city of origin was perfectly confounded with the location of sequencing. As was discussed elsewhere, the largest effect observed in the data was from city of origin ([HMPC 2012](http://www.nature.com/nature/journal/v486/n7402/full/nature11234.html). Although we would love to believe that people vary in their microbiome between Houston and St. Louis and can concoct many reasons why this would be so, the easiest is that there were small differences in how samples were obtained, handled, and processed that affected the microbiome structure. Regardless, prior to our paper, the only data that had been published was from the so-called "May 1, 2010" data freeze. The samples from this data freeze primarily only contained the first sampling point for each subject. Although the second sample was available for some of these subjects by that freeze, the papers published to date (e.g. [HMPC 2012](http://www.nature.com/nature/journal/v486/n7402/full/nature11234.html), [HMPC 2012](http://www.nature.com/nature/journal/v486/n7402/full/nature11209.html), and [others](http://www.ploscollections.org/article/browseIssue.action?issue=info:doi/10.1371/issue.pcol.v01.i13)). Thus, the goal of our paper was to present an analysis of all of the data generated by the HMP clinical study.
# 
# 
# 
# 
# # Data curation
# ## The data
# **Sequence data.** The 16S rRNA gene sequencing data was generated by sequencing the V35 region of gene on the 454 Titanium platform. Initially the project sequenced the V13 and V69 regions; however, these were eventually dropped from the analysis plan. Thus, we focused our effort on the V35 region. All of the 16S rRNA gene sequence data are available at the NCBI Short Read Archive (SRA): the Clinical Pilot Project (accession [SRP002012](http://www.ncbi.nlm.nih.gov/Traces/sra/?study=SRP002012)), Phase I (the May 1 freeze; accession [SRP002395](http://www.ncbi.nlm.nih.gov/Traces/sra/?study=SRP002395)), and Phase II (accession [SRP002860](http://www.ncbi.nlm.nih.gov/Traces/sra/?study=SRP002860)). Because we were part of the original Data Analysis Working Group (DAWG), we originally obtained some of the data from Jonathan Crabtree at the [HMP Data Analsyis Coordination Center (DACC)](http://www.hmpdacc.org). Thus, there may be slight variations in the data that we used and the data available throgh the SRA.
# 
# The 16S rRNA gene sequence analysis pipeline used in this study is based on the results of our previous work analyzing the mock community data generated as part of the PPS ([Schloss et al. 2011](http://www.ploscollections.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0027310)). We used the PyroNoise-based approach with *de novo* chimera checking using the UCHIME algorithm ([Quince et al. 2011](http://www.biomedcentral.com/1471-2105/12/38) and [Edgar et al. 2011](http://bioinformatics.oxfordjournals.org/content/27/16/2194.long)).  All analysis steps were performed within mothur ([Schloss et al. 2009](http://aem.asm.org/content/75/23/7537.long)). As we described there, this curation pipeline yields an error rate of ~0.02%. The mothur package can be obtained from the [mothur website](http://www.mothur.org) or our [GitHub repository](https://github.com/mothur/mothur). The wiki also has a shorter description of our [SOP for processing 454 sequencing data](http://www.mothur.org/wiki/454_SOP).
# 
# **Clinical data.** The clinical data for this project is available through the [dbGaP project site](http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000228.v3.p1). Unfortunately, because this is considered PHI, I am not at liberty to distribute these data. You may obtain the data from dbGaP by going through the appropriate channels. *Don't ask us to send the files to you!*
# 
# 
# ## Sequence curation
# ### Preliminaries
# To organize the files we created several files in our directory:

# In[ ]:


get_ipython().run_cell_magic('bash', '', 'mkdir scripts oligos_files references sff_files sffinfo trim.flows shhh.flows trim.seqs v13 v35 v69 dbGaP\n')


# You can obtain the oligos_file files from the GitHub repository for this project. If you are trying to run this, you will need to go in to the file and decompress the files (see oligos_files.tgz):

# In[ ]:


get_ipython().run_cell_magic('bash', '', 'cd oligos_files\ntar xvzf oligos_files.tgz\ncd ../\n')


# These oligos files will tell mothur the barcode sequence that corresponds to each sample so that we can map reads to samples. If you open one of these files you'll see something that looks like this:
# 

# In[ ]:


forward	ATTACCGCGGCTGCTGG	v13
forward	CCGTCAATTCMTTTRAGT	v35
forward	TACGGYTACCTTGTTAYGACTT	v69
barcode	ACGAGAAC	phase1.SRR040576.BI.700034678


# The first three lines are the "forward" PCR primer that were adjacent to the sequencing primer. By putting all three possible primers in the file we let mothur figure out which region the seuqences belong to. The barcode line gives the barcode sequence and then the sample name. The "phase 1" indicates that the sample was from before the May 1 data freeze. The "SRR040576" indicates the sequencing run that the sample came from. The "BI" indicates that the sample was sequenced at the Broad Institute". Finally, the "700034678" corresponds to the anonymized sample id. Together this string represents a unique identifier for every sequencing collection (some samples were sequenced twice, some at two centers, and some in multiple phases of the study).
# 
# You will also want to download the sequence data files from the SRA and convert them into sff format using the SRA's tools. Put the resulting sff files into the sff_files folder and return to the parent directory. Because we obtained several of the PPS datasets before they were publically available, the names of our files are slightly different from those on the SRA. At this point you'll need to double check that you have an oligos file for every sff file. Runing the following two commands should give the same number. The third command will tell you what you're missing. If you're using the same files that we used, there will be 3 files extra: F5672XE02, F5672XE02.v13, and F5672XE02.v35. There was somethign screwy about how these were sequenced and so later we will replace the F5672XE02 with the v13 and v35 versions and all will be well with the world.

# In[ ]:


get_ipython().run_cell_magic('bash', '', 'ls sff_files | wc -l\nls oligos_files | wc -l\nls oligos_files/* sff_files/* | cut -f 2 -d "/" | cut -f 1 -d "." | sort | uniq -u\n')


# Next, we want to get the reference files that we'll be using to process the data. We chose to use the SILVA-based reference alignment ([Pruesse et al. 2007](http://nar.oxfordjournals.org/lookup/pmid?view=long&pmid=17947321)) because of its superiority to the greengenes-based alignment. Previous work from our lab has demonstrated this ([Schloss 2010](http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1000844)). In addition, we used a modified version of the RDP's training set ([Wang et al. 2007](http://aem.asm.org/content/73/16/5261.long); v.9). Our modifications included adding several sequences from the Archaea, Eukarya, chloroplasts, and mitochondria.

# In[ ]:


get_ipython().run_cell_magic('bash', '', 'cd references\nwget http://www.mothur.org/w/images/5/59/Trainset9_032012.pds.zip\nunzip Trainset9_032012.pds.zip\n\nwget http://www.mothur.org/w/images/9/98/Silva.bacteria.zip\nunzip Silva.bacteria.zip\nmv silva.bacteria/silva.bacteria.fasta ./\n\ncd ../\n')


# Now we want to extract the flowgram data from the sff files. Let's move to the sff_info folder. Then we'll want to run sffinfo from mothur on each of the sff files. When we actually ran this we submitted these jobs in parallel using pbs rather than in series and we are limited to about 100 jobs per user. You can find the actual sff files that we processed in sff_files.txt. We had 12943 sff files:

# In[ ]:


get_ipython().run_cell_magic('bash', '', 'cd sffinfo\n\n> sffinfo.batch\n\nfor SFF in $(ls ../sff_files/*sff)\ndo\n    #make a big file with all of the sffinfo commands in the file\n    echo "mothur \\"#sffinfo(sff=$SFF, outputdir=./)\\"" >> sffinfo.batch    \ndone\n\n#split sffinfo.batch in to files with 130 lines each and named with sffinfo_??\nsplit -l 130 sffinfo.batch sffinfo_\n\ncat > header.batch << "_EOF_"\n#!/bin/sh\n#PBS -l nodes=1:ppn=1\n#PBS -l walltime=500:00:00\n#PBS -l mem=1gb\n#PBS -j oe\n#PBS -m abe\n#PBS -V\n#PBS -M your@email.here\n#PBS -q first\n\necho "ncpus-2.pbs"\ncat $PBS_NODEFILE\nqstat -f $PBS_JOBID\n\ncd $PBS_O_WORKDIR\n\nNCPUS=`wc -l $PBS_NODEFILE | awk \'{print $1}\'`\n\n_EOF_\n\n\ncat > tail.batch << "_EOF_"\n\necho "qsub working directory absolute is"\necho $PBS_O_WORKDIR\nexit\n_EOF_\n\nfor BATCH in $(ls sffinfo_*)\ndo\n    cat header.batch $BATCH tail.batch > $BATCH.batch\n    qsub $BATCH.batch\ndone\n\n#there were those screwy files above that needed to be copied for v13 and v35 and removed\n#be sure to run these only after the scripts are done\ncp F5672XE02.flow F5672XE02.v13.flow\ncp F5672XE02.flow F5672XE02.v35.flow\nrm F5672XE02.flow\n\n#check to make sure we have the right number of files. If not, some baby sitting may be needed\nls ../sff_files/*sff | wc -l    #should equal 12943\nls *flow | wc -l                #should equal 12944\n')


# As mentioned above there was a PPS stage of the project and then the actual production stages. Accidentally, some of the data from WUGSC was uploaded in the production stage even though it had already been uploaded in the PPS stage. To keep track of what was from the PPS stage and what was from the production stage, we ran the following Perl script (modPPSFlowFiles.pl) on the flow gram data from the PPS stage go ahead and save a copy to the scripts folder:

# In[ ]:


#!perl
use strict;
use warnings;

foreach my $file (@ARGV){
	
	open(INPUT, $file) or die;
	
	$file =~ /(.*).flow/;
	my $outfile = "$1.mod.flow";
	
	open(OUTPUT, ">$outfile") or die;
	
	my $header = <INPUT>;
	print OUTPUT $header;
	
	while(<INPUT>){
		$_ =~ s/^(\S*)/$1p/;
		print OUTPUT $_;
	}
	close(INPUT);
	close(OUTPUT);
}


# This generates new files ending in "mod.flow" where the sequence name has a "p" appended to the end. The SRA files all start with "SRR" and since the PPS files weren't in the SRA, we'll use that to our advantage to pick those files out...

# In[ ]:


get_ipython().run_cell_magic('bash', '', 'perl ../scripts/modPPSFlowFiles.pl `ls *flow | grep -v "^SRR"`\n')


# Now we want to move over to the trim.flows folder and run trim.flows. As described in [Schloss et al. 2009](http://aem.asm.org/content/75/23/7537.long) we will allow one mismatch to the barcode sequence, two mismatches to the primer sequence, no ambiguous base calls, and a maximum homopolymer length of 8 nt. Also, by default, mothur will trim all reads to 450 flows and cull any reads with fewer than 450 flows. This results in reads that are about 270 bp. Again we parallelize this and fire everything off in batches of 130 flow files at a time.

# In[ ]:


get_ipython().run_cell_magic('bash', '', 'cd ../trim.flows\n\n> trimflows.batch\n\nfor FLOW in $(ls ../sffinfo/SRR*flow)\ndo\n    OLIGOS=${FLOW//sffinfo/oligos_files}\n    OLIGOS=${OLIGOS//flow/oligos}\n\n    echo "mothur \\"#trim.flows(flow=$FLOW, oligos=$OLIGOS, pdiffs=2, bdiffs=1, maxhomop=8, outputdir=./)\\"" >> trimflows.batch\ndone\n\nfor FLOW in $(ls ../1_sffinfo/*.mod.flow)\ndo\n    OLIGOS=${FLOW//1_sffinfo/oligos_files}\n    OLIGOS=${OLIGOS//mod.flow/oligos}\n\n    echo "mothur \\"#trim.flows(flow=$FLOW, oligos=$OLIGOS, pdiffs=2, bdiffs=1, maxhomop=8, processors=8, outputdir=./)\\"" >> trimflows.batch\ndone\n\n\nsplit -l 130 trimflows.batch trimflows_   #split trimflows.batch in to files with 130 lines each and named with trimflows_??\n\ncat > header.batch << "_EOF_"\n#!/bin/sh\n#PBS -l nodes=1:ppn=1\n#PBS -l walltime=500:00:00\n#PBS -l mem=1gb\n#PBS -j oe\n#PBS -m abe\n#PBS -V\n#PBS -M your@email.here\n#PBS -q first\n\necho "ncpus-2.pbs"\ncat $PBS_NODEFILE\nqstat -f $PBS_JOBID\n\ncd $PBS_O_WORKDIR\n\nNCPUS=`wc -l $PBS_NODEFILE | awk \'{print $1}\'`\n\n_EOF_\n\n\ncat > tail.batch << "_EOF_"\n\necho "qsub working directory absolute is"\necho $PBS_O_WORKDIR\nexit\n_EOF_\n\nfor BATCH in $(ls trimflows_*)\ndo\n    cat header.batch $BATCH tail.batch > $BATCH.batch\n    qsub $BATCH.batch\ndone\n')


# This generates files ending in \*v13.flow, \*v35.flow, and \*v69.flow as well as a file ending in \*.flow.files that corresponds to each sff file that we started with. After running trim.flows, we had 12944 \*.flow.files files, which represented 42,768 samples. Of these \*.flow.files files 176 had no samples in them indicating that trim.flows was unable to find any sequences with the right barcode or primer. Manual inspection of the original flow files showed that these were samples with very few sequences.

# In[ ]:


get_ipython().run_cell_magic('bash', '', 'wc -l *files > flow.files.counts\n')


# In[ ]:


get_ipython().run_cell_magic('R', '', 'a <-read.table(file="flow.files.counts"); colnames(a) <- c("nsamples", "sample_id")\nsum(a$nsamples)                     #total number of samples remaining\nsum(a$nsamples==0)                  #number of flow.files files without any samples\nnrow(a) - sum(a$nsamples==0)        #number of flow.files with samples in them\nb <- a[a$nsamples!=0,]\nwrite.table(b$sample_id, "good.flow.files", row.names=F, col.name=F, quote=F)    #output the names of the good files\n')


# We pressed on with the 12,768 \*.flow.files files that had samples in them. At this point we could have separated the three 16S rRNA gene sequence regions, but we just plugged ahead with everything together. The next step was to denoise the sequences using the PyroNoise algorithm, which is implemented in mothur as shhh.flows. Investigators interested in repeating our analysis should note that this step involves a random number generator to break ties during the initial clustering step - the effects should be minor. This step may need a bit of baby siting depending on how much RAM you allocate and how much is needed for each job. Some may crap out and you'll have to figure out whether anything crapped out. We'll show how this was done at the end of the step.

# In[ ]:


get_ipython().run_cell_magic('bash', '', 'cd ../shhh.flows\n\n> shhhflows.batch\n\nfor FLOWFILE in $(ls ../trim.flows/*flow.files)\ndo\n    echo "mothur \\"#shhh.flows(file=$FLOWFILE, outputdir=./)\\"" >> shhhflows.batch\ndone\n\n\n\nsplit -l 20 shhhflows.batch shhhflows_   #split trimflows.batch in to files with 20 lines each and named with shhhflows_??\n\ncat > header.batch << "_EOF_"\n#!/bin/sh\n#PBS -l nodes=1:ppn=1\n#PBS -l walltime=500:00:00\n#PBS -l mem=8gb\n#PBS -j oe\n#PBS -m abe\n#PBS -V\n#PBS -M your@email.here\n#PBS -q first\n\necho "ncpus-2.pbs"\ncat $PBS_NODEFILE\nqstat -f $PBS_JOBID\n\ncd $PBS_O_WORKDIR\n\nNCPUS=`wc -l $PBS_NODEFILE | awk \'{print $1}\'`\n\n_EOF_\n\n\ncat > tail.batch << "_EOF_"\n\necho "qsub working directory absolute is"\necho $PBS_O_WORKDIR\nexit\n_EOF_\n\nfor BATCH in $(ls shhhflows_*)\ndo\n    cat header.batch $BATCH tail.batch > $BATCH.batch\n    qsub $BATCH.batch\ndone\n')


# We found that this step required some handholding as certain files needed more RAM devoted to it and others would unexpectedly fail. To make sure everything went right we used a few tricks to make sure we had everything

# In[ ]:


#how many shhh.fasta files were created (ignoring the indivdiual ones for each sample)?
ls *shhh.fasta | grep -cv "v"
#12766 - three short of what we found above...

ls *shhh.fasta | grep -v "v" > shhh.files
cut -c 9- ../trim.flows/flow.files.counts > flow.files
cat shhh.files ../trim.flows/good.flow.files  | rev | cut -f 3- -d "." | rev | sort | uniq -u

#F5672XE02.v13.mod
#F5672XE02.v35.mod
#total


# So this makes sense - the two files that end in mod were inadvertantly left out in the line with "grep -v" and we had a total after running the wc command before. So we're all here adn ready to move on! Running shhh.flows results in a fasta file and a name file. The name file indicates the names of the sequences that were duplicates and were original sequence after denoising. We'll have to include this in all of the commands going forward so that we can have a proper accounting of the sequences.
# 
# Now we were ready to remove the barcodes and primers using trim.seqs. This step also generates a "groups" file, that indicate which sample each sequence belongs to. We will use many of the same parameters as we used in running trim.flows and we will use allfiles=T and flip=T. The former is so that we get a separate fasta, name, and group file for each sample in the original sff file and the latter is to return the reverse complement of the sequences since they were sequenced from the V5 region towards the V3 region

# In[ ]:


get_ipython().run_cell_magic('bash', '', 'cd ../trim.seqs\n\n> trimseqs.batch\n\nfor FASTA in $(ls ../shhh.flows/SRR??????.shhh.fasta)\ndo\n    OLIGOS=${FASTA//shhh.flows/oligos_files}\n    OLIGOS=${OLIGOS//shhh.fasta/oligos}\n\n    NAMES=${FASTA//fasta/names}\n    echo "mothur \\"#trim.seqs(fasta=$FASTA, name=$NAMES, oligos=$OLIGOS, pdiffs=2, bdiffs=1, maxhomop=8, maxambig=0, allfiles=T, flip=T, outputdir=./)\\"" >> trimseqs.batch\ndone\n\nfor FASTA in $(ls ../shhh.flows/*mod.shhh.fasta)\ndo\n    OLIGOS=${FASTA//shhh.flows/oligos_files}\n    OLIGOS=${OLIGOS//mod.shhh.fasta/oligos}\n\n    NAMES=${FASTA//fasta/names}\n    echo "mothur \\"#trim.seqs(fasta=$FASTA, name=$NAMES, oligos=$OLIGOS, pdiffs=2, bdiffs=1, maxhomop=8, maxambig=0, allfiles=T, flip=T, outputdir=./)\\"" >> trimseqs.batch\ndone\n\n\nsplit -l 20 trimseqs.batch trimseqs_   #split trimseqs.batch in to files with 20 lines each and named with trimseqs_??\n\ncat > header.batch << "_EOF_"\n#!/bin/sh\n#PBS -l nodes=1:ppn=1\n#PBS -l walltime=500:00:00\n#PBS -l mem=1gb\n#PBS -j oe\n#PBS -m abe\n#PBS -V\n#PBS -M your@email.here\n#PBS -q first\n\necho "ncpus-2.pbs"\ncat $PBS_NODEFILE\nqstat -f $PBS_JOBID\n\ncd $PBS_O_WORKDIR\n\nNCPUS=`wc -l $PBS_NODEFILE | awk \'{print $1}\'`\n\n_EOF_\n\n\ncat > tail.batch << "_EOF_"\n\necho "qsub working directory absolute is"\necho $PBS_O_WORKDIR\nexit\n_EOF_\n\nfor BATCH in $(ls trimseqs_*)\ndo\n    cat header.batch $BATCH tail.batch > $BATCH.batch\n    qsub $BATCH.batch\ndone\n\nls ls ../trim.flows/*v??.flow | cut -f 3 -d "/" > flow.files; wc -l flow.files   #21383\nls *.v??.fasta > fasta.files; wc -l fasta.files                                  #21383\n')


# Running trim.seqs returns a fasta, name, and group file for each sample. Because of how we named our barcodes in the original oligos files we now have files named like: SRR350309.shhh.phase2.SRR350309.WUGSC.700172315.v35.fasta in our directory. Everything in this file name from phase2 through v35 comes from the oligos file. At this point we will separate our data into separate folders according to the region of the gene that the data were derived from.

# In[ ]:


get_ipython().run_cell_magic('batch', '', '\ncd ..\n\nfor REGION in v13 v35 v69\ndo\n    cd $REGION\n    cat ../trim.seqs/*$REGION.fasta > $REGION.fasta\n    cat ../trim.seqs/*$REGION.names > $REGION.names\n    cat ../trim.seqs/*$REGION.groups > $REGION.groups\n    cd ../\ndone\n\ncut -f 2 v13/v13.groups | sort | uniq | wc -l   #7194\ncut -f 2 v35/v35.groups | sort | uniq | wc -l   #14015\ncut -f 2 v69/v69.groups | sort | uniq | wc -l   #171\n')


# As mentioned above and indicated in running the last three commands, the v35 dataset has the most samples and that is what we based our analysis on for the paper. We'll go ahead and move into the v35 folder, but before proceeding, let's customize the reference files. To improve the analysis and reduce file sizes, we want to limit the reference files to the v35 region that was sequenced. As we are using the SILVA reference alignment ([Pruesse et al. 2007](http://nar.oxfordjournals.org/lookup/pmid?view=long&pmid=17947321)) we figured out that the primers that were used to amplify the DNA meant that a full length amplicon would start at position 6,426 in the alignment space and end at position 27,654 (the full alignment is 50,000 characters long). So we want to filter the silva.bacteria.fasta file based on those coordinates and then align the trainset9 fasta data to that region and filter it to remove the bases outside of the v35 region ([Wang et al. 2007](http://aem.asm.org/content/73/16/5261.long)). 

# In[ ]:


get_ipython().run_cell_magic('bash', '', 'cd v35\nmothur "#pcr.seqs(fasta=../references/silva.bacteria.fasta, start=6426, end=27654, outputdir=./, processors=8);filter.seqs(fasta=silva.bacteria.pcr.fasta, trump=., processors=8)"\ncp silva.bacteria.pcr.filter.fasta silva.v35.fasta\n\nmothur "#align.seqs(fasta=../references/trainset9_032012.pds.fasta, reference=../references/silva.bacteria.fasta, outputdir=./, processors=8); pcr.seqs(fasta=trainset9_032012.pds.align, taxonomy=../references/trainset9_032012.pds.tax, start=6426, end=27654, outputdir=./); degap.seqs(fasta=trainset9_032012.pds.pcr.align)"\ncp trainset9_032012.pds.pcr.tax trainset9_032012.three.tax\ncp trainset9_032012.pds.pcr.ng.fasta trainset9_032012.three.fasta\n')


# Now we are ready to run mothur with the sequences as a single dataset. What follows are the commands to be run within mothur or placed within a batch file (i.e. v35.mothur.batch; shown below) and run from the command line. Relevant references for these commands are available: aligner ([Schloss 2009](http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0008230) and [DeSantis et al. 2006](http://nar.oxfordjournals.org/content/34/suppl_2/W394.full)), pre-clustering ([Schloss et al. 2011](http://dx.plos.org/10.1371/journal.pone.0027310)), UChime ([Edgar et al. 2011](http://bioinformatics.oxfordjournals.org/content/27/16/2194.long)), and the classifier ([Schloss & Westcott 2011](http://aem.asm.org/content/77/10/3219.long) and [Wang et al. 2007](http://aem.asm.org/content/73/16/5261.long)). We'll first load the mothur magic module:

# In[2]:


get_ipython().run_line_magic('install_ext', 'https://raw.github.com/kdiverson/ipython-mothurmagic/master/mothurmagic.py')
get_ipython().run_line_magic('load_ext', 'mothurmagic')


# In[ ]:


get_ipython().run_cell_magic('mothur', '', "\n#unique the sequences because there are duplicate sequences between samples\nunique.seqs(fasta=v35.fasta, name=v35.names)\n\n\n#align the sequences against the customized SILVA reference alignment for the v35 region\n#the user should alter the processors option to however many processors they have available\nalign.seqs(fasta=current, reference=silva.v35.fasta, processors=8)\n\n\n#find the start and end coordinates for the aligned sequences\nsummary.seqs(fasta=current, name=current)\n\n\n#remove the sequences that don't map to the correct region and then remove\n#any vertical gaps or columns that contain missing data\nscreen.seqs(fasta=current, name=current, group=v35.groups, start=10010, end=21219)\nfilter.seqs(fasta=current, vertical=T, trump=.)\n\n\n#unique again because the end trimming creates duplicate sequences\nunique.seqs(fasta=current, name=current)\n\n#pre-cluster sequences to remove residual errors and simplify the dataset.\npre.cluster(fasta=current, group=current, name=current, diffs=2, processors=8)\n\n\n#use chimera.uchime to identify chimeras in the de novo detection mode. to do this, we\n#will process each sample separately and will will only consider a sequence as being \n#chimeric if it is flagged as chimeric in that sample. we then remove the chimeras from\n#the samples they were found in\nchimera.uchime(fasta=current, group=current, name=current, dereplicate=T, processors=8)\nremove.seqs(fasta=current, group=current, name=current, accnos=current, dups=F)\n\n\n#classify the sequences against our customized taxonomy reference based on the RDP\n#training set. we require that an assignment have a confidence score of at least 80%\nclassify.seqs(fasta=current, reference=trainset9_032012.three.fasta, taxonomy=trainset9_032012.three.tax, cutoff=80, processors=8)\n\n\n#remove sequences that classify as either Chloroplasts, Mitochondria, Archaea, Eukaryota,\n#or as being unknown\nremove.lineage(fasta=current, group=current, name=current, taxonomy=current, taxon=Cyanobacteria_Chloroplast-Mitochondria-unknown-Archaea-Eukaryota)\n\n\n#assign sequences to bins / phylotypes based on their taxonomic assignment\nphylotype(taxonomy=current, name=current)\n\n\n#count the number of times each phylotype was observed in each sample at the genus level\n#(label=1)\nmake.shared(list=current, group=current, label=1)\n\n\n#finally, output the taxonomic assignment for each phylotype\nclassify.otu(list=current, group=current, name=current, taxonomy=current, label=1)\n")


# Recall above we gave the samples some pretty funky names to keep every sequencing collection separate. We need to figure out which samples came from which subject and bodysite. We pooled datasets for each subject and bodysite where a sample was sequenced multiple times. Because WUGSC submitted their PPS data for the Phase 1 data, there is some redundancy. Now we need to remove the redundant reads with the help of a perl script (removeRedundant_WUGSC.pl). Go ahead and copy and paste the following into a file in the scripts folder.

# In[ ]:


#!perl

use strict;
use warnings;


#	first we want to get the names of the WUGSC groups that were sequenced in
#	the pps.  we'll create the %groups hash and store the group names in the
#	hash...

my %groups;

open(GROUPS, $ARGV[0]) or die;
while(<GROUPS>){
	chomp(my $group = $_);
	$groups{$group} = 1;
}

close(GROUPS);


#	next, we want to look through the shared file and find any group name that
#	starts "phase1" and ends with the group name in the hash.  if it isn't in
#	the hash then we'll keep it.  if not, we'll remove it.

open(ORIG_SHARED, $ARGV[1]) or die;
while(<ORIG_SHARED>){
	if($_ =~ /phase1\..*\.(WUGSC\.\S*)\t/){
		if(!exists($groups{$1})){
			print "$_";
		}
	}
	else{
		print "$_";
	}
}

close(ORIG_SHARED);


# In[ ]:


get_ipython().run_cell_magic('bash', '', '#get the names of the samples from the shared file that came from WUGSC in the PPS phase\n#return the sequencing center (WUGSC; f=3) and the anonymized sample id (f=4)\ncut -f 2 v35.unique.good.filter.unique.precluster.unique.pick.three.wang.pick.tx.shared | grep "pps.*WUGSC" | cut -f 3,4 -d "." | sort | uniq > wugsc.pps.groups\n\n#remove any samples that start phase1 and end with the fields returned above\nperl ../scripts/removeRedundant_WUGSC.pl wugsc.pps.groups v35.unique.good.filter.unique.precluster.unique.pick.three.wang.pick.tx.shared > v35.three_phases.tx.shared\n')


# There were a number of "control" samples included in the analysis from sequencing mock communities, generous donor samples, and water blanks. Unfortunately, once Phase 1 and 2 started, the sequencing centers became inconsistent in running these controls. Therefore, they don't really tell us anything other than what we've already established in our earlier efforts to build this pipeline. So we'll remove them using a perl script (removeControlSamples.pl)
#!perl

use strict;
use warnings;

open(SHARED, $ARGV[0]) or die;

#the control samples all had underscores and none of the other samples did
while(<SHARED>){
	if(!($_ =~ /_/)){
		print $_;
	}
}
close(SHARED);
# In[ ]:


get_ipython().run_cell_magic('bash', '', 'perl ../scripts/removeControlSamples.pl v35.three_phases.tx.shared > v35.three_phases_nocontrol.tx.shared\n')


# Now we want to figure out who the anonymized sample id corresponds to and which body site it came from as well as the visit number that the sample was taken from. To do this, it is essential that you have the dbGaP files. I'll assume that you have the following files in a folder named dbGaP that is located in the parent directory of v35 (the current directory):
# 
# > phs000228.v3.pht001184.v3.p1.EMMES_HMP_Subject.MULTI.txt  
# > phs000228.v3.pht001185.v3.p1.EMMES_HMP_Sample.MULTI.txt  
# > phs000228.v3.pht001187.v3.p1.c1.EMMES_HMP_DCM.HMP.txt  
# > phs000228.v3.pht001193.v3.p1.c1.EMMES_HMP_GTV.HMP.txt  
# > phs000228.v3.pht002156.v1.p1.c1.EMMES_HMP_DSU.HMP.txt  
# > phs000228.v3.pht002157.v1.p1.c1.EMMES_HMP_DTP_DHX_DVD.HMP.txt  
# > phs000228.v3.pht002158.v1.p1.c1.EMMES_HMP_DEM_ENR.HMP.txt  
# 
# Again, don't ask me for these files, you must get them through dbGaP. For this operation you will need phs000228.v3.pht001193.v3.p1.c1.EMMES_HMP_GTV.HMP.txt and our perl script (mapSamplesToMetadata.pl), which should be in the scripts folder.

# In[ ]:


#!perl

use strict;
use warnings;


my %tracker;

open(METADATA, $ARGV[1]) or die;
while(<METADATA>){
	chomp($_);
	$_ =~ s/ /_/g;

	if($_ =~ /^\d/){
		my @line = split /\t/, $_;

		$line[7] =~ s/(\d\d).*/$1/;
		$line[9] =~ s/.*DNA_(.*)/$1/;

		$tracker{$line[6]} = "$line[13].$line[7].$line[9]";
	}
}
close(METADATA);


open(SAMPLE_IDS, $ARGV[0]) or die;
while(<SAMPLE_IDS>){
	chomp($_);
	if($_ =~ /(\d{9})/){

		print "$_\t$tracker{$1}\n";

	}
}
close(SAMPLE_IDS);


# In[ ]:


get_ipython().run_cell_magic('bash', '', '\n#get the names of all of the samples we have\ncut -f 2 v35.three_phases_nocontrol.tx.shared > sample.ids\n\n\n#map the sample.ids to the subject id, body site, and visit number\nperl ../scripts/mapSamplesToMetadata.pl sample.ids ../dbGaP/phs000228.v3.pht001193.v3.p1.c1.EMMES_HMP_GTV.HMP.txt > sample.design\n\n\n#the output of the perl script is a file that maps the sample names to more \n#meaningful names. using mothur, we can merge the frequency of each phylotype\n#for samples that need to be pooled\nmothur "#merge.groups(shared=v35.three_phases_nocontrol.tx.shared, design=sample.design)"\n')


# At this point we have sample names that look like this:  
# > 160461578.01.Throat  
# > 158256496.02.Anterior_nares  
# > 160643649.01.Mid_vagina  
# > 159005010.02.Tongue_dorsum  
# > 159753524.01.Tongue_dorsum  
# > 160481808.01.R_Antecubital_fossa  
# > 159753524.01.Anterior_nares  
# > 160481808.01.Tongue_dorsum  
# > 159207311.01.R_Retroauricular_crease  
# > 159713063.02.Tongue_dorsum  

# We'd like to separate these samples now so that each body site has its own shared file:

# In[ ]:


get_ipython().run_cell_magic('bash', '', 'head -n 1 v35.three_phases_nocontrol.tx.merge.shared > header.txt\ncp header.txt Anterior_nares.tx.shared\ncp header.txt Buccal_mucosa.tx.shared\ncp header.txt Hard_palate.tx.shared\ncp header.txt Keratinized_gingiva.tx.shared\ncp header.txt L_Antecubital_fossa.tx.shared\ncp header.txt L_Retroauricular_crease.tx.shared\ncp header.txt Mid_vagina.tx.shared\ncp header.txt Palatine_Tonsils.tx.shared\ncp header.txt Posterior_fornix.tx.shared\ncp header.txt R_Antecubital_fossa.tx.shared\ncp header.txt R_Retroauricular_crease.tx.shared\ncp header.txt Saliva.tx.shared\ncp header.txt Stool.tx.shared\ncp header.txt Subgingival_plaque.tx.shared\ncp header.txt Supragingival_plaque.tx.shared\ncp header.txt Throat.tx.shared\ncp header.txt Tongue_dorsum.tx.shared\ncp header.txt Vaginal_introitus.tx.shared\n\ngrep "Anterior_nares" v35.three_phases_nocontrol.tx.merge.shared >> Anterior_nares.tx.shared\ngrep "Buccal_mucosa" v35.three_phases_nocontrol.tx.merge.shared >> Buccal_mucosa.tx.shared\ngrep "Hard_palate" v35.three_phases_nocontrol.tx.merge.shared >> Hard_palate.tx.shared\ngrep "Keratinized_gingiva" v35.three_phases_nocontrol.tx.merge.shared >> Keratinized_gingiva.tx.shared\ngrep "L_Antecubital_fossa" v35.three_phases_nocontrol.tx.merge.shared >> L_Antecubital_fossa.tx.shared\ngrep "L_Retroauricular_crease" v35.three_phases_nocontrol.tx.merge.shared >> L_Retroauricular_crease.tx.shared\ngrep "Mid_vagina" v35.three_phases_nocontrol.tx.merge.shared >> Mid_vagina.tx.shared\ngrep "Palatine_Tonsils" v35.three_phases_nocontrol.tx.merge.shared >> Palatine_Tonsils.tx.shared\ngrep "Posterior_fornix" v35.three_phases_nocontrol.tx.merge.shared >> Posterior_fornix.tx.shared\ngrep "R_Antecubital_fossa" v35.three_phases_nocontrol.tx.merge.shared >> R_Antecubital_fossa.tx.shared\ngrep "R_Retroauricular_crease" v35.three_phases_nocontrol.tx.merge.shared >> R_Retroauricular_crease.tx.shared\ngrep "Saliva" v35.three_phases_nocontrol.tx.merge.shared >> Saliva.tx.shared\ngrep "Stool" v35.three_phases_nocontrol.tx.merge.shared >> Stool.tx.shared\ngrep "Subgingival_plaque" v35.three_phases_nocontrol.tx.merge.shared >> Subgingival_plaque.tx.shared\ngrep "Supragingival_plaque" v35.three_phases_nocontrol.tx.merge.shared >> Supragingival_plaque.tx.shared\ngrep "Throat" v35.three_phases_nocontrol.tx.merge.shared >> Throat.tx.shared\ngrep "Tongue_dorsum" v35.three_phases_nocontrol.tx.merge.shared >> Tongue_dorsum.tx.shared\ngrep "Vaginal_introitus" v35.three_phases_nocontrol.tx.merge.shared >> Vaginal_introitus.tx.shared\n')


# In addition several of the body sites will be combined to facilitate their analysis (i.e. left and right antecubital fossa and retroauricular crease and the vaginal introitus, mid vagina, and posterior fornix).

# In[ ]:


get_ipython().run_cell_magic('bash', '', 'cp header.txt Vagina.tx.shared\ncp header.txt Retroauricular_crease.tx.shared\ncp header.txt Antecubital_fossa.tx.shared\n\ngrep "Vaginal_introitus" v35.three_phases_nocontrol.tx.merge.shared >> Vagina.tx.shared\ngrep "Mid_vagina" v35.three_phases_nocontrol.tx.merge.shared >> Vagina.tx.shared\ngrep "Posterior_fornix" v35.three_phases_nocontrol.tx.merge.shared >> Vagina.tx.shared\ngrep "L_Antecubital_fossa" v35.three_phases_nocontrol.tx.merge.shared >> Antecubital_fossa.tx.shared\ngrep "R_Antecubital_fossa" v35.three_phases_nocontrol.tx.merge.shared >> Antecubital_fossa.tx.shared\ngrep "L_Retroauricular_crease" v35.three_phases_nocontrol.tx.merge.shared >> Retroauricular_crease.tx.shared\ngrep "R_Retroauricular_crease" v35.three_phases_nocontrol.tx.merge.shared >> Retroauricular_crease.tx.shared\n')


# All of our next steps will depend on our ability to subsample the data to a common number of sequences. To figure out what our threshold should be, we will calculate the number of reads per sample and bodysite.

# In[ ]:


get_ipython().run_cell_magic('bash', '', 'mothur "#summary.single(shared=Anterior_nares.tx.shared, calc=nseqs);"\nmothur "#summary.single(shared=Buccal_mucosa.tx.shared, calc=nseqs);"\nmothur "#summary.single(shared=Hard_palate.tx.shared, calc=nseqs);"\nmothur "#summary.single(shared=Keratinized_gingiva.tx.shared, calc=nseqs);"\nmothur "#summary.single(shared=L_Antecubital_fossa.tx.shared, calc=nseqs);"\nmothur "#summary.single(shared=L_Retroauricular_crease.tx.shared, calc=nseqs);"\nmothur "#summary.single(shared=Mid_vagina.tx.shared, calc=nseqs);"\nmothur "#summary.single(shared=Palatine_Tonsils.tx.shared, calc=nseqs);"\nmothur "#summary.single(shared=Posterior_fornix.tx.shared, calc=nseqs);"\nmothur "#summary.single(shared=R_Antecubital_fossa.tx.shared, calc=nseqs);"\nmothur "#summary.single(shared=R_Retroauricular_crease.tx.shared, calc=nseqs);"\nmothur "#summary.single(shared=Saliva.tx.shared, calc=nseqs);"\nmothur "#summary.single(shared=Stool.tx.shared, calc=nseqs);"\nmothur "#summary.single(shared=Subgingival_plaque.tx.shared, calc=nseqs);"\nmothur "#summary.single(shared=Supragingival_plaque.tx.shared, calc=nseqs);"\nmothur "#summary.single(shared=Throat.tx.shared, calc=nseqs);"\nmothur "#summary.single(shared=Tongue_dorsum.tx.shared, calc=nseqs);"\nmothur "#summary.single(shared=Vaginal_introitus.tx.shared, calc=nseqs);"\n')


# So let's use some R to figure out what percentage of the samples at each body site will be kept for various thresholds

# In[ ]:


get_ipython().run_cell_magic('R', '', '\nthreshold <- 1000\t#change this to see how the fraction of samples changes\n\nsummary.files <- c("Anterior_nares.tx.groups.summary", "Buccal_mucosa.tx.groups.summary", \n\t\t\t\t"Hard_palate.tx.groups.summary", "Keratinized_gingiva.tx.groups.summary",\n\t\t\t\t"L_Antecubital_fossa.tx.groups.summary", "L_Retroauricular_crease.tx.groups.summary",\n\t\t\t\t"Mid_vagina.tx.groups.summary", "Palatine_Tonsils.tx.groups.summary",\n\t\t\t\t"Posterior_fornix.tx.groups.summary", "R_Antecubital_fossa.tx.groups.summary",\n\t\t\t\t"R_Retroauricular_crease.tx.groups.summary", "Saliva.tx.groups.summary",\n\t\t\t\t"Stool.tx.groups.summary", "Subgingival_plaque.tx.groups.summary",\n\t\t\t\t"Supragingival_plaque.tx.groups.summary", "Throat.tx.groups.summary",\n\t\t\t\t"Tongue_dorsum.tx.groups.summary", "Vaginal_introitus.tx.groups.summary")\n\t\npercent <- numeric()\n\nfor(file in summary.files){\n\tdata <- read.table(file=file, header=T)\n\tpercent[file] <- sum(data$nseqs >= threshold)/nrow(data)\n}\n\npercent\nhist(percent)\n')


# If you play around with a few thresholds and look at the percentages you'll see that the antecubital foss and vaginal samples have a small number of reads in general. But if you start increasing the threshold above 1000 we start to lose a good number of samples from the other bodysites. We chose to go with 1000 because it gave us a good number of samples and reads per body site. With this information, we subsampled all of the bodysites to 1000 reads per sample. Any samples with fewer than 1000 reads were removed from the analysis.

# In[ ]:


get_ipython().run_cell_magic('mothur', '', 'sub.sample(shared=Antecubital_fossa.tx.shared, size=1000)\nsub.sample(shared=Anterior_nares.tx.shared, size=1000)\nsub.sample(shared=Buccal_mucosa.tx.shared, size=1000)\nsub.sample(shared=Hard_palate.tx.shared, size=1000)\nsub.sample(shared=Keratinized_gingiva.tx.shared, size=1000)\nsub.sample(shared=Palatine_Tonsils.tx.shared, size=1000)\nsub.sample(shared=Retroauricular_crease.tx.shared, size=1000)\nsub.sample(shared=Saliva.tx.shared, size=1000)\nsub.sample(shared=Stool.tx.shared, size=1000)\nsub.sample(shared=Subgingival_plaque.tx.shared, size=1000)\nsub.sample(shared=Supragingival_plaque.tx.shared, size=1000)\nsub.sample(shared=Throat.tx.shared, size=1000)\nsub.sample(shared=Tongue_dorsum.tx.shared, size=1000)\nsub.sample(shared=Vagina.tx.shared, size=1000)\n')


# ### Assigning samples from each body site to community types
# Now we'd like to assign the samples in each body site to community types. We do this using the get.communitytype function in mothur ([Holmes et al. 2012](http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0030126)). This is a function that utilizes a random number generator and so there may be run-to-run variation. In our experience, with complicated datasets such as these, there is very little variation in the number of community types or the assignments. The label of the community type is most likely to vary between runs (e.g. in one run a group of samples might be labeled as "Partition_1" and in another "Partition_3"). We ran the everything through get.communitytype five times in separate folders:

# In[ ]:


get_ipython().run_cell_magic('bash', '', '\nmkdir dmm_analysis\ncd dmm_analysis\nmkdir rep1 rep2 rep3 rep4 rep5\n\n\ncat > header.batch << "_EOF_"\n#!/bin/sh\n#PBS -l nodes=1:ppn=1\n#PBS -l walltime=500:00:00\n#PBS -l mem=1gb\n#PBS -j oe\n#PBS -m abe\n#PBS -V\n#PBS -M your@email.here\n#PBS -q first\n\necho "ncpus-2.pbs"\ncat $PBS_NODEFILE\nqstat -f $PBS_JOBID\n\ncd $PBS_O_WORKDIR\n\nNCPUS=`wc -l $PBS_NODEFILE | awk \'{print $1}\'`\n\n_EOF_\n\n\ncat > tail.batch << "_EOF_"\n\necho "qsub working directory absolute is"\necho $PBS_O_WORKDIR\nexit\n_EOF_\n\n\n\nfor REP in rep1 rep2 rep3 rep4 rep5\ndo\n    cd $REP\n\n    for SHARED in $(ls ../../*subsample.shared)\n    do\n\t\tSITE=${SHARED//..\\/..\\//}\n\t    SITE=${SITE//.tx.1.subsample.shared/}\n\n        cat ../header.batch > $REP.$SITE.batch\n        echo "mothur \\"#get.communitytype(shared=$SHARED, outputdir=./)\\"" >> $REP.$SITE.batch\n        cat ../tail.batch >> $REP.$SITE.batch\n        \n        qsub $REP.$SITE.batch\n        \n    done\n    cd ../\ndone\n')


# Next we needed to go through, find the number of community types that gave the best Laplace value for each body site and move the files to a single folder.

# In[ ]:


get_ipython().run_cell_magic('R', '', '\nbodysites <- c("Antecubital_fossa", "Anterior_nares", "Buccal_mucosa", "Hard_palate", "Keratinized_gingiva", "Palatine_Tonsils", "Retroauricular_crease", "Saliva", "Stool", "Subgingival_plaque", "Supragingival_plaque", "Throat", "Tongue_dorsum", "Vagina")\n\nbs.scores <- matrix(rep(NA, rep(14*3)), ncol=3)\ncolnames(bs.scores) <- c("min.iteration", "min.k", "min.laplace")\nrownames(bs.scores) <- bodysites\n\ndir.create("dmm_best")\n\nfor(bs in bodysites){\n\treps <- c("rep1", "rep2", "rep3", "rep4", "rep5")\n\tscores <- matrix(rep(NA, 5*10), ncol=5)\n\tcolnames(scores) <- reps\n\trownames(scores) <- 1:10\n\n\tfor(r in reps){\n\t\tfilename <- paste("dmm_analysis/", r,"/",bs, ".tx.1.subsample.1.dmm.mix.fit", sep="")\n\t\tdata <- read.table(file=filename, header=T, row.names=1)\n\t\tscores[1:nrow(data),r] <- data$Laplace\n\t}\n\t\n\tmin.index <- which.min(scores)\n\tmin.k <- (min.index) %% 10\n\tmin.index <- min.index - min.k\n\tmin.iteration <- (min.index / 10) + 1\n\t\n\tc(min.k, min.iteration, scores[min.k, min.iteration])\n\n\tbs.scores[bs, ] <- c(min.iteration, min.k, scores[min.k, min.iteration])\n\t\n\tsource.filename <- list.files(path=paste("dmm_analysis/", colnames(scores)[min.iteration], sep=""), pattern=glob2rx(paste(bs, ".*", sep="")))\n\twrite(paste("dmm_analysis/",colnames(scores)[min.iteration], "/", source.filename, sep=""), "")\n\tfile.copy(paste("dmm_analysis/",colnames(scores)[min.iteration], "/", source.filename, sep=""), to="dmm_best/")\n}\n')


# The get.communitytype outputs a number of files that we will use. For now, let's worry about the design file, which will tell us which community type each sample belongs to. Let's get the body specific design files for those bodysites that we pooled

# In[ ]:


get_ipython().run_cell_magic('bash', '', 'cd dmm_best\ngrep "L_Retroauricular_crease" Retroauricular_crease.tx.1.subsample.1.dmm.mix.design > L_Retroauricular_crease.tx.1.subsample.1.dmm.mix.design\ngrep "R_Retroauricular_crease" Retroauricular_crease.tx.1.subsample.1.dmm.mix.design > R_Retroauricular_crease.tx.1.subsample.1.dmm.mix.design\ngrep "L_Antecubital_fossa" Antecubital_fossa.tx.1.subsample.1.dmm.mix.design > L_Antecubital_fossa.tx.1.subsample.1.dmm.mix.design \ngrep "R_Antecubital_fossa" Antecubital_fossa.tx.1.subsample.1.dmm.mix.design > R_Antecubital_fossa.tx.1.subsample.1.dmm.mix.design \ngrep "Vaginal_introitus" Vagina.tx.1.subsample.1.dmm.mix.design > Vaginal_introitus.tx.1.subsample.1.dmm.mix.design\ngrep "Mid_vagina" Vagina.tx.1.subsample.1.dmm.mix.design > Mid_vagina.tx.1.subsample.1.dmm.mix.design\ngrep "Posterior_fornix" Vagina.tx.1.subsample.1.dmm.mix.design > Posterior_fornix.tx.1.subsample.1.dmm.mix.design\n')


# One last thing we'd like to do is to create separate files where each file is a table with the subject along the rows and the body site is along the columns.

# In[ ]:


get_ipython().run_cell_magic('R', '', 'body.design <- c("Anterior_nares.tx.1.subsample.1.dmm.mix.design", "Buccal_mucosa.tx.1.subsample.1.dmm.mix.design",\n\t\t\t\t "Hard_palate.tx.1.subsample.1.dmm.mix.design", "Keratinized_gingiva.tx.1.subsample.1.dmm.mix.design",\n\t\t\t\t "L_Antecubital_fossa.tx.1.subsample.1.dmm.mix.design", "L_Retroauricular_crease.tx.1.subsample.1.dmm.mix.design",\n\t\t\t\t "Mid_vagina.tx.1.subsample.1.dmm.mix.design", "Palatine_Tonsils.tx.1.subsample.1.dmm.mix.design",\n\t\t\t\t "Posterior_fornix.tx.1.subsample.1.dmm.mix.design", "R_Antecubital_fossa.tx.1.subsample.1.dmm.mix.design",\n\t\t\t\t "R_Retroauricular_crease.tx.1.subsample.1.dmm.mix.design", "Saliva.tx.1.subsample.1.dmm.mix.design",\n\t\t\t\t "Stool.tx.1.subsample.1.dmm.mix.design", "Subgingival_plaque.tx.1.subsample.1.dmm.mix.design",\n\t\t\t\t "Supragingival_plaque.tx.1.subsample.1.dmm.mix.design", "Throat.tx.1.subsample.1.dmm.mix.design",\n\t\t\t\t "Tongue_dorsum.tx.1.subsample.1.dmm.mix.design", "Vaginal_introitus.tx.1.subsample.1.dmm.mix.design")\n\t\t\t\t \nsubjects <- character()\n\n#i know this isn\'t efficient...\nfor(design in body.design){\n\td <- read.table(file=design)\n\ts <- gsub("^(\\\\d*).*", "\\\\1", d$V1)\n\tsubjects <- c(subjects, s)\n}\n\nsubjects <- sort(unique(subjects))\nnsubjects <- length(subjects)\t#should be ~300\n\nvisit1 <- matrix(rep(NA, nsubjects*18), nrow=nsubjects)\nrownames(visit1) <- subjects\ncolnames(visit1) <- gsub("^([^.]*).*", "\\\\1", body.design)\n\nvisit2 <- visit1\nvisit3 <- visit1\n\nfor(design in body.design){\t\n\td <- read.table(file=design)\n\ts <- gsub("^(\\\\d*).*", "\\\\1", d$V1)\t\t\t\t#get the subject id\n\tv <- gsub("^\\\\d*.(\\\\d\\\\d).*", "\\\\1", d$V1)\t\t#get the visit number\n\tb <- gsub("^\\\\d*.\\\\d\\\\d.(.*)", "\\\\1", d$V1)\t\t#get the visit number\n\n\tvisit1[as.character(s[v == "01"]), b] <- as.character(d[v == "01", "V2"])\n\tvisit2[as.character(s[v == "02"]), b] <- as.character(d[v == "02", "V2"])\n\tvisit3[as.character(s[v == "03"]), b] <- as.character(d[v == "03", "V2"])\n}\n\nsum.na <- function(x){\n\treturn(sum(is.na(x)))\n}\n\nkeep <- apply(visit1, 1, sum.na) != 18\t#get rid of subject ids that only have NAs\nvisit1 <- visit1[keep,]\n\nkeep <- apply(visit2, 1, sum.na) != 18\t#get rid of subject ids that only have NAs\nvisit2 <- visit2[keep,]\n\nkeep <- apply(visit3, 1, sum.na) != 18\t#get rid of subject ids that only have NAs\nvisit3 <- visit3[keep,]\n\nwrite.table(visit1, "community_types.v1.txt", quote=F)\nwrite.table(visit2, "community_types.v2.txt", quote=F)\nwrite.table(visit3, "community_types.v3.txt", quote=F)\n')


# In[ ]:


get_ipython().run_cell_magic('bash', '', 'mkdir ../../analysis\nmv community_types.v* ../../analysis\n')


# ## Clinical metadata

# ***A lot*** of clinical metadata were collected on the 300 subjects in this study. Unfortunately, you always think of more metadata that you would like to have after you start to look at the data. Another issue with clinical metadata, especially in the case of the HMP dataset was that because the subjects were so young and healthy, there often isn't a lot of variation and the metadata don't turn out to be as interesting as originally hoped. For instance, if only a few women have had a baby, it is difficult to see whether having a baby has affected her microbiome. Ditto for a bunch of other variables. Again, all of the clinical metadata are available through dbGaP and unfortunately, you have to get them there. There is a process to getting the data that is tedious but not horrible. But whatever you do, don't ask me for them. I'll say no. Finally, once you get the dbGaP files, unencrypt them and throw them into the dbgap folder that we made way back at the top of this notebook.
# 

# In[ ]:


get_ipython().run_cell_magic('bash', '', 'cd ../../dbGaP\nls *txt\n\n#should produce...\n#phs000228.v3.pht001184.v3.p1.EMMES_HMP_Subject.MULTI.txt\n#phs000228.v3.pht001185.v3.p1.EMMES_HMP_Sample.MULTI.txt\n#phs000228.v3.pht001187.v3.p1.c1.EMMES_HMP_DCM.HMP.txt\n#phs000228.v3.pht001193.v3.p1.c1.EMMES_HMP_GTV.HMP.txt\n#phs000228.v3.pht002156.v1.p1.c1.EMMES_HMP_DSU.HMP.txt\n#phs000228.v3.pht002157.v1.p1.c1.EMMES_HMP_DTP_DHX_DVD.HMP.txt\n#phs000228.v3.pht002158.v1.p1.c1.EMMES_HMP_DEM_ENR.HMP.txt\n')


# Before we do science, we need to sift through the clinical metadata and see what of it is useful and what isn't. Recall that you can go to the [dbGaP project site](http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000228.v3.p1) to figure out what the various files contain and what the variable names mean. Because some of these txt files contain apostrophes that were wrecking havoc with R, I used TextWrangeler to zap all of them away. A smarter person could probalby have done it with sed. Good for you. The first thing we'll do is recreate the data in Extended Data Table 1 from the paper.
# 
# Three of these files contain useful categorical metadata that we'll read in, sort, and merge into one table. Also, because patients provided the data at different visits in some of the files, we will have to be careful about picking the right visit (VISNO=="01")...

# In[ ]:


get_ipython().run_cell_magic('R', '', 'categ.1 <-read.table(file="phs000228.v3.pht002157.v1.p1.c1.EMMES_HMP_DTP_DHX_DVD.HMP.txt", header=T, sep="\\t",\n                     na.string=c("", NA))\ncateg.1 <- categ.1[categ.1$VISNO == "01",]\nrownames(categ.1) <- categ.1$RANDSID\ncateg.1 <- categ.1[,-c(ncol(categ.1))]\t#the RANDSID column is the last one\ncateg.1 <- categ.1[sort(rownames(categ.1)),]\n\n\ncateg.2 <-read.table(file="phs000228.v3.pht002158.v1.p1.c1.EMMES_HMP_DEM_ENR.HMP.txt", header=T, sep="\\t", row.names=2,\n                     na.string=c("", NA))\ncateg.2 <- categ.2[sort(rownames(categ.2)),]\n\nduplicates <- intersect(colnames(categ.1), colnames(categ.2)) #the columns that are in both files\ncateg.2 <- categ.2[,-which(colnames(categ.2) %in% duplicates)]\t#get rid of the duplicates\nsum(rownames(categ.1) == rownames(categ.2)) == 300\t#make sure the rows are in the same order\n\ncategorical <- cbind(categ.1, categ.2)\n\n\n#using textwrangler I first zapped all of the apostrophes\ncateg.3 <- read.table(file="phs000228.v3.pht002156.v1.p1.c1.EMMES_HMP_DSU.HMP.txt", header=T, sep="\\t", na.string=c("", NA))\nrownames(categ.3) <- categ.3$RANDSID\ncateg.3 <- categ.3[,-c(ncol(categ.3))]\t#the RANDSID column is the last column\nsetdiff(rownames(categ.3), rownames(categorical))\ndim(categ.3)\t#there weren\'t 300 subjects here and so some are missing\n\nmissing <- rownames(categorical)[!(rownames(categorical) %in% rownames(categ.3))]\ncateg.3[as.character(missing),] <- rep(NA, ncol(categ.3))\ncateg.3 <- categ.3[sort(rownames(categ.3)),]\nduplicates <- intersect(colnames(categorical), colnames(categ.3)) #the columns that are in both files\ncateg.3 <- categ.3[,-which(colnames(categ.3) %in% duplicates)]\t#get rid of the duplicates\nsum(rownames(categ.1) == rownames(categ.3)) == 300\t#make sure the rows are in the same order\n\ncategorical <- cbind(categorical, categ.3)\n\nncats <- ncol(categorical)\t#196\nnpeople <- nrow(categorical)\n')


# So now we have our massive table of 300 subjects by 196 metadata variables. Now we will convert "vague" answers to "NA" and then identify those metadata columns that were NA for every subject:

# In[ ]:


get_ipython().run_cell_magic('R', '', 'for(i in colnames(categorical)){\n\tcategorical[grepl("Dont know/remember",categorical[,i]),i] <- NA\n\tcategorical[grepl("NotEvaluated",categorical[,i]),i] <- NA\n\tcategorical[grepl("Forgot",categorical[,i]),i] <- NA\n}\n\n#there\'s a lot of crap in here, let\'s figure out how to remove the variables that are all NAs\nsum.na <- function(x) sum(is.na(x) | x == "")\nna.count <- apply(categorical, 2, sum.na)\nnotall.na <- na.count <= 250\t#empirically determined number\n\n#things with more than 250 NAs - ethnicities, tobacco usage, medical history, some drug usage\n\ncategorical <- categorical[,notall.na]\nncats <- ncol(categorical)\t#101\n    \nsummary(categorical)\n')


# So we effectively reduced the amount of metadata in half. With the output of the summary command we see the various metadata fields. Looking through the dbGaP site we get a sense of what they are. A couple things pop out.  First, many of the variables are numerically coded categoritcal varaibles (e.g. SITE) and some are the same thing but coded with a string (e.g. SITE_C). We only need one of each of these. Second, some variables aren't useful (e.g. PROT). Third, some variables are continuous (e.g. AGEENR) and will be dealt with separately. Finally, some variables merely indicate whether a sample was taken at that bodysite (e.g. DVDINTRO_C). Here are the variables we decided to remove and why...
# 
# 
# **Variables to remove: redundant / non-informative**  
# > SITE - redundant with SITE_C  
# > DVDEDLVL_C - redundant with DVDEDLVL  
# > DVDINSRH - redundant with DVDINSRH_C  
# > DVDINSRD - redundant with DVDINSRD_C  
# > DVDOCPTN - redundant with DVDOCPTN_C  
# > DVDTOBC - redundant with DVDTOBC_C  
# > DVDPGRES - Pregnant? no one was pregnant  
# > DVDPGRES_C - Pregnant? no one was pregnant  
# > START - starting quarter of the study  
# > GENDER - redundnat with GENDER_C  
# > ETHNIC - redundant with ETHNIC_C  
# > AMERIND - american indian (only 1)  
# > AMERIND_C - american indian (only 1)  
# > ASIAN - redundant with ASIAN_C  
# > PACIFC - pacific islander (none)  
# > PACIFC_C - pacific islander (none)  
# > WHITE - redundant with WHITE_C  
# > BRTHCTRY - redundant with BRTHCTRY_C  
# > MOMCTRY - redundant with MOMCTRY_C  
# > DADCTRY - redundant with DADCTRY_C  
# > DSUSCRN - redundant with DSUSCRN_C  
# > DSUCON - redundant with DSUCON_C  
# > DSUDIET_C - redundant with DSUDIET  
# > DSUBIR - redunant with DSUBIR_C  
# > DSUBFED - redundant with DSUBFED_C  
# > BLACK - redundant with BLACK_C  
# > DVDVIRRT / DVDVIRRT_C - vaginal or vulval irritation - only 8 yes
# 
# 
# **not useful**  
# > dbGaP.SubjID - dbGaP Subject ID  
# > PROT - Protocol  
# > PROTSEG - Protocol   
# > protseg - Protocol  
# > VISNO - Visit number  
# > study_day - day of the study  
# > dsuotrpd - time since last oral treatment - eh.  
# > dsuvspd - time since enrollment  
# > IDSUZIP - zip code  
# > DSUSCRN_C - screened more than once  
# > DSUCON_C - subject contact  
# 
# 
# **Variables to remove: continuous variables**  
# > DVDINTPH - vagina introitus pH  
# > DVDPFPH - posterior fornix pH  
# > DVDTMPF - temperature in F  
# > DVDTMPC - temperature in C  
# > AGEENR - age  
# 
# 
# **Variables to remove: indicate whether various body sites were sampled**  
# > DVDBLOOD / DVDBLOOD_C - blood  
# > DVDSERUM / DVDSERUM_C - serum  
# > DVDSTOOL / DVDSTOOL_C - stool  
# > DVDSAL / DVDSAL_C - saliva  
# > DVDGING / DVDGING_C - gingiva  
# > DVDTONG / DVDTONG_C - tongue  
# > DVDPTON / DVDPTON_C - tonsils  
# > DVDHPAL / DVDHPAL_C - hard palate  
# > DVDTHRO / DVDTHRO_C - throat   
# > DVDBUCC / DVDBUCC_C - buccal mucosa  
# > DVDSUPRA / DVDSUPRA_C - supragingival plaque  
# > DVDSUB / DVDSUB_C - subgingial plaque  
# > DVDRACR / DVDRACR_C - right retroauricular crease  
# > DVDRACL / DVDRACL_C - left retroauricular crease  
# > DVDACRR / DVDACRR_C - right antecuibtal fossa  
# > DVDACRL / DVDACRL_C - left antecuibtal fossa  
# > DVDNASAL / DVDNASAL_C - anterior nares  
# > DVDINTRO / DVDINTRO_C - vaginal introitus  
# > DVDPFORN / DVDPFORN_C - posterior fornix  
# > DVDMIDVG / DVDMIDVG_C - mid vagina  

# In[ ]:


get_ipython().run_cell_magic('R', '', 'to.remove <- c("dbGaP.SubjID", "PROT", "PROTSEG", "SITE_C", "protseg", "VISNO",\n\t\t\t   "study_day", "DVDEDLVL_C", "DVDINSRH", "DVDINSRD", "DVDOCPTN",\n\t\t\t   "DVDTOBC", "DVDPGRES", "DVDPGRES_C", "START", "GENDER", "ETHNIC",\n\t\t\t   "AMERIND", "AMERIND_C", "ASIAN", "PACIFC", "PACIFC_C", "WHITE",\n\t\t\t   "BRTHCTRY", "MOMCTRY", "DADCTRY", "DSUSCRN", "DSUCON", "BLACK",\n\t\t\t   "DSUDIET_C", "DSUBIR", "dsuotrpd", "dsuvspd", "DSUZIP",\n\t\t\t   "DVDINTPH", "DVDPFPH", "DVDTMPF", "DVDTMPC", "AGEENR", "DSUBFED",\n\t\t\t   "DVDBLOOD", "DVDBLOOD_C", "DVDSERUM", "DVDSERUM_C", "DVDSTOOL",\n\t\t\t   "DVDSTOOL_C", "DVDSAL", "DVDSAL_C", "DVDGING", "DVDGING_C",\n\t\t\t   "DVDTONG", "DVDTONG_C", "DVDPTON", "DVDPTON_C", "DVDHPAL",\n\t\t\t   "DVDHPAL_C", "DVDTHRO", "DVDTHRO_C ", "DVDBUCC", "DVDBUCC_C",\n\t\t\t   "DVDSUPRA", "DVDSUPRA_C", "DVDSUB", "DVDSUB_C", "DVDRACR",\n\t\t\t   "DVDRACR_C", "DVDRACL", "DVDRACL_C", "DVDACRR", "DVDACRR_C",\n\t\t\t   "DVDACRL", "DVDACRL_C", "DVDNASAL", "DVDNASAL_C", "DVDVIRRT",\n\t\t\t   "DVDVIRRT_C", "DVDINTRO", "DVDINTRO_C", "DVDPFORN", "DVDPFORN_C",\n\t\t\t   "DVDMIDVG", "DVDMIDVG_C", "DVDTHRO_C", "DSUSCRN_C", "DSUCON_C")\n\t\t\t   \ncategorical <- categorical[,-which(colnames(categorical) %in% to.remove)]\n\ndim(categorical) #17 categories remain\nsummary(categorical)\n')


# So we're left with 17 variables at this point Looking at the output of the summary command we see that many of them are now binary variables:  
# 
# > SITE - city of origin - Houston / StLouis  
# > GENDER_C - gender - Male / Female  
# > DVDTOBC_C - tobacco user - yes / no  
# > BLACK_C - yes / no  
# > WHITE_C - yes /no  
# > DSUBFED_C - Were you breastfed as a child? (don't know was pooled with NA)  
# > DVDINSRH_C - health insurance?  
# > DVDINSRD_C - dental insurance?  
# 
# and others have numerous levels without much heft to any of them. Therefore we re-keyed the other variables to make them binary:

# In[ ]:


get_ipython().run_cell_magic('R', '', '\n#DVDEDLVL_C - education level - 8 different levels\n#turn it into BS or higher\ncategorical$DVDEDLVL <- categorical$DVDEDLVL == 8 | categorical$DVDEDLVL == 9 | categorical$DVDEDLVL == 10 |\n                        categorical$DVDEDLVL == 11 \n\n\n#DVDOCPTN_C - Occupation - 8 different levels\n#turn it into Student or not\ncategorical$DVDOCPTN_C <- categorical$DVDOCPTN_C == "42 Student"\n\n\n#BRTHCTRY_C, MOMCTRY_C, DADCTRY_C - birth country of subject and parents\n#14 different levels - turn into whether or not "Canada/US"\ncategorical$BRTHCTRY_C <- categorical$BRTHCTRY_C == "Canada/US"\ncategorical$MOMCTRY_C <- categorical$MOMCTRY_C == "Canada/US"\ncategorical$DADCTRY_C <- categorical$DADCTRY_C == "Canada/US"\n\n\n#DSUDIET - four levels of meat consumption - two with meat and two without\n#turn it into meat or not\ncategorical$DSUDIET <- categorical$DSUDIET == 1 | categorical$DSUDIET == 2\n\n\n#DSUBIR_C - whether subject had given birth never, once or more than once\n#turn it into whether the subject had given birth\ncategorical$DSUBIR_C <- categorical$DSUBIR_C == "More than once" | categorical$DSUBIR_C == "Once"\n\n\n#ETHNIC_C - Hispanic / latino / spanish?\ncategorical$ETHNIC_C <- categorical$ETHNIC_C == "Hispanic or Latino or Spanish"\n\nsummary(categorical)\n')


# When we looked through the medication data we noticed that a number of people were chronically on medications. So we decided to make this a categorical varible. We found that the most commonly used medications were the following:  
# 
#  - 13=Contraceptives (oral, implant, injectable)(N=102)
#  - 21=Vitamins, minerals, food supplements (N=54)
#  - 9=Antidepressants/mood-altering drugs (N=37)
#  - 8=Antibiotics/(anti)-infectives, parasitics, microbials (N=33)
#  - 10=Antihistamines/Decongestants (N=28)
#  
# To get going, we want to get the start and stop dates for each subject's medication usage as well as the dates for each subject's visit:

# In[ ]:


get_ipython().run_cell_magic('R', '', '\nmedication <- read.table(file="phs000228.v3.pht001187.v3.p1.c1.EMMES_HMP_DCM.HMP.txt", header=T, sep="\\t", na.string=c("", NA))\n\nsort(summary(as.factor(medication$DCMCODE_C)))\n\n#the most commonly used medications in the study...\n#13=Contraceptives (oral, implant, injectable) \t\t\t\t102\n#21=Vitamins, minerals, food supplements      \t\t\t\t54\n#9=Antidepressants/mood-altering drugs \t\t\t\t\t\t37\n#8=Antibiotics/(anti)-infectives, parasitics, microbials    33\n#10=Antihistamines/Decongestants\t\t\t\t\t\t    28 \n\n#pull out thse five medication classes\nmedication <- medication[,c("RANDSID", "DCMCODE", "MSTART", "MSTOP")]\npopular.user <- medication$DCMCODE == 13 | medication$DCMCODE == 21 | medication$DCMCODE == 9 |\n                    medication$DCMCODE == 8 | medication$DCMCODE == 10\nmedication <- medication[popular.user,] #only want those lines from the file with these medications\nmedication <- medication[order(medication$RANDSID),]   #make sure everything is in the right order\n\n\n#figure out the study date that each person visited\nmetadata <- read.table(file="phs000228.v3.pht002157.v1.p1.c1.EMMES_HMP_DTP_DHX_DVD.HMP.txt", header=T, sep="\\t",\n                       na.string=c("", NA))\nnormalvisit <- metadata$VISNO == "01" | metadata$VISNO == "02" | metadata$VISNO == "03"\nmetadata <- metadata[normalvisit,]  #only get those rows that corresponded to a normal visit\nsubjects <- levels(as.factor(metadata$RANDSID))\nsubjects == sort(subjects)   #make sure everything is in the right order\n\n\n#now we want to populate a table with the date for each person\'s visits\nvisit.dates <- matrix(rep(NA, length(subjects)*3), ncol=3) #start with everyone at NA\nrownames(visit.dates) <- subjects #subject ids in the rows\ncolnames(visit.dates) <- c("Visit1", "Visit2", "Visit3")  #visits in the columns\n\n\n#loop through and fill in the study dates for each person\'s first, second, and third visit\nfor(r in rownames(metadata)){\n\tif(metadata[r, "VISNO"] == "01"){\n\t\tvisit.dates[as.character(metadata[r,"RANDSID"]), "Visit1"] = metadata[r, "study_day"]\n\t}\n\telse if(metadata[r, "VISNO"] == "02"){\n\t\tvisit.dates[as.character(metadata[r,"RANDSID"]), "Visit2"] = metadata[r, "study_day"]\n\t}\n\telse if(metadata[r, "VISNO"] == "03"){\n\t\tvisit.dates[as.character(metadata[r,"RANDSID"]), "Visit3"] = metadata[r, "study_day"]\n\t}\n}\n')


# Note that at this point if someone didn't have a second or third visit, they will be listed as NA. Now we want to figure out whether a visit corresponded to the subjects being on the medications. We defined subjects that last took the medication within 30 days of a visit as "being on the medication" for that visit. Then we defined chronic usage as being on the medication for all of that subject's visits. With these criteria the time from taking the medication didn't have a big effect on the number of chronic users.

# In[ ]:


get_ipython().run_cell_magic('R', '', '\n#this function asks whether the subject was on the medication at their visit\ngetDrugVisitData <- function(drug.metadata){\n\n\tdrug <- matrix(rep(NA, length(subjects)*3), ncol=3)\n\tcolnames(drug) <- c("Visit1", "Visit2", "Visit3")\n\trownames(drug) <- rownames(visit.dates)\n\tdrug[!is.na(visit.dates)] <- FALSE  #if the subject visited, then we assume they\'re negative to start\n\tdrug[is.na(visit.dates)] <- NA      #if they didn\'t visit, we don\'t care\n\n\n\tvisitwindow <- 30\t#the person discontinued the drug in the last 30 days\n\n\tfor(r in rownames(drug.metadata)){\n\t\ts <- drug.metadata[as.character(r), "RANDSID"]\n\t\tstart <- drug.metadata[as.character(r), "MSTART"]\n\t\tstop <- drug.metadata[as.character(r), "MSTOP"]\n\t\tif(is.na(stop))\tstop = 100000   #if the stop date is NA then assume that they were on the medication through the end\n                                        #of the study\n\t\n            \n        #these work like so... - subject is assumed false initially and if they\'re subsequently changed to true, then we can\'t\n        #change them to false. if the start date is before the visit date adn the stop date is earlier than the stop date plus\n        #the visitwindow (30 days) then we consider them true. otherwise, keep them as false.\n            \n\t\tif(drug[as.character(s), "Visit1"] == FALSE & start <= visit.dates[as.character(s), "Visit1"] &\n           stop+visitwindow >= visit.dates[as.character(s), "Visit1"] & !is.na(visit.dates[as.character(s), "Visit1"])){\n\t\t\tdrug[as.character(s), "Visit1"] = TRUE\n\t\t} else if(drug[as.character(s), "Visit1"] == FALSE & !is.na(visit.dates[as.character(s), "Visit1"]))\n\t\t\tdrug[as.character(s), "Visit1"] = FALSE\n\n\t\tif(drug[as.character(s), "Visit2"] == FALSE & start <= visit.dates[as.character(s), "Visit2"] &\n           stop+visitwindow >= visit.dates[as.character(s), "Visit2"] & !is.na(visit.dates[as.character(s), "Visit2"])){\n\t\t\tdrug[as.character(s), "Visit2"] = TRUE\n\t\t} else if(drug[as.character(s), "Visit2"] == FALSE & !is.na(visit.dates[as.character(s), "Visit2"]))\n\t\t\tdrug[as.character(s), "Visit2"] = FALSE\n\n\t\tif(drug[as.character(s), "Visit3"] == FALSE & start <= visit.dates[as.character(s), "Visit3"] &\n           stop+visitwindow >= visit.dates[as.character(s), "Visit3"] & !is.na(visit.dates[as.character(s), "Visit3"])){\n\t\t\tdrug[as.character(s), "Visit3"] = TRUE\n\t\t} else if(drug[as.character(s), "Visit3"] == FALSE & !is.na(visit.dates[as.character(s), "Visit3"]))\n\t\t\tdrug[as.character(s), "Visit3"] = FALSE\n\t}\n\t\n    #send back a table with visits as columns, subjects as rows, and values indicate whether the subject was on the medication\n\treturn(drug)\n}\n\n            \n#this function defines whether the subject was a chronic user or not\ngetChronic <- function(drug){\n\tchronic <- rep(FALSE, nrow(drug))\n    \n    #subject had to be on medication for all of their visits to be considered a chronic user\n\tchronic[((drug[,"Visit1"] == T & is.na(drug[,"Visit2"]) & is.na(drug[,"Visit3"])) |\n\t\t\t (drug[,"Visit1"] == T & drug[,"Visit2"] == T & is.na(drug[,"Visit3"])) |\n\t\t\t (drug[,"Visit1"] == T & drug[,"Visit2"] == T & drug[,"Visit3"] == T))] <- T\n\n    #returns a vector indicating whether each subject was chronic\n\treturn(chronic)\n\t\n}\n\n\ncontra <- getDrugVisitData(medication[medication$DCMCODE == 13,])\nvit_min <- getDrugVisitData(medication[medication$DCMCODE == 21,])\nantidep <- getDrugVisitData(medication[medication$DCMCODE == 9,])\nantibiot <- getDrugVisitData(medication[medication$DCMCODE == 8,])\nantihist <- getDrugVisitData(medication[medication$DCMCODE == 10,])\n\ncontra.chronic <- getChronic(contra)\nvit_min.chronic <- getChronic(vit_min)\nantidep.chronic <- getChronic(antidep)\nantihist.chronic <- getChronic(antihist)\nantibiot.chronic <- getChronic(antibiot)\ncontra.chronic[categorical$GENDER_C=="Male"] <- NA\n\n            \nsum(contra.chronic, na.rm=T); sum(apply(contra, 1, sum, na.rm=T) > 0)  #72; 87\nsum(vit_min.chronic); sum(apply(vit_min, 1, sum, na.rm=T) > 0)         #34; 44\nsum(antidep.chronic); sum(apply(antidep, 1, sum, na.rm=T) > 0)         #22; 31\nsum(antihist.chronic); sum(apply(antihist, 1, sum, na.rm=T) > 0)       #9; 22\nsum(antibiot.chronic); sum(apply(antibiot, 1, sum, na.rm=T) > 0)       #2; 11\n#not enough antibiotic.chronic subjects to do anything\n\n#finally, append chronic contraception, vitamin/mineral, antidepressents, and antihistamine users to categorical table\ncategorical <- cbind(categorical, contra.chronic, vit_min.chronic, antidep.chronic, antihist.chronic)\n')


# There's one more piece of categorical data we'd like to get - the subjects' BMI expressed as a categorical varaible.

# In[ ]:


cont <- read.table(file="phs000228.v3.pht002157.v1.p1.c1.EMMES_HMP_DTP_DHX_DVD.HMP.txt", header=T, sep="\t",
                   na.string=c("", NA))
cont <- cont[cont$VISNO == "00",]  #BMI was recorded based on the intake visit
rownames(cont) <- cont$RANDSID
cont <- cont[order(rownames(cont)),]

bmi <- cont[,"DTPBMI"] #this is the only variable we need from the table

BMI_C <- character()
BMI_C[bmi<18.5] <- "underweight"
BMI_C[bmi>=18.5 & bmi < 25] <- "normal"
BMI_C[bmi>=25 & bmi < 30] <- "overweight"
BMI_C[bmi>=30] <- "obese"
BMI_C <- factor(BMI_C)
rownames(cont) == rownames(categorical) #double check that our rows are in the correct order


#let's append the BMI_C varaiable to the end of categorical
categorical <- cbind(categorical, BMI_C)
for(c in colnames(categorical)){	categorical[,c] <- factor(categorical[,c])	}


#output the table to a file
write.table(categorical, "categorical.metadata", quote=F)


# With that last command we have outputted the categorical data to a file. Let's move on to the continuous data. We found the following continuous data via dbGaP:  
# 
# > DVDINTPH - pH - vaginal introitus  
# > DVDPFPH - pH - posterior fornix  
# > DTPPULSE - pulse  
# > DTPBMI - BMI  
# > DTPSYSTL - diastolic pressure  
# > DTPDIAST - systolic pressure  
# 

# In[ ]:


#Get continuous data...
cont.1 <- read.table(file="phs000228.v3.pht002157.v1.p1.c1.EMMES_HMP_DTP_DHX_DVD.HMP.txt", sep="\t",
                     header=T, na.string=c(NA, ""))

#some variables were taken at the intake appointment
continuous <- cont.1[cont.1$VISNO=="00",c("DTPPULSE","DTPBMI","DTPSYSTL","DTPDIAST", "RANDSID")]
rownames(continuous) <- continuous$RANDSID

#define a continuous table variable
continuous <- continuous[order(rownames(continuous)),]


#vaginal pH was taken at each visit. we created a new file - vaginal_ph.txt - that contains
#the two pH values for each woman at each visit:
cont.1$VISNO <- gsub("^(\\d\\d).*", "\\1", cont.1$VISNO)
pH <- cont.1[cont.1$VISNO == "01" | cont.1$VISNO == "02" | cont.1$VISNO == "03",]
pH <- pH[,c("RANDSID", "VISNO", "SITE", "DVDINTPH", "DVDPFPH")]
pH$VISNO <- as.numeric(gsub("^0", "", pH$VISNO))
pH <- pH[!is.na(pH$DVDINTPH) | !is.na(pH$DVDPFPH),]
write.table(pH, file="vaginal_ph.txt", quote=F, row.names=F, sep="\t")


#let's add the subjects' ages
cont.2 <- read.table(file="phs000228.v3.pht002158.v1.p1.c1.EMMES_HMP_DEM_ENR.HMP.txt", sep="\t", header=T, na.string=c(NA, ""))
rownames(cont.2) <- cont.2$RANDSID
cont.2 <- cont.2[order(rownames(cont.2)),]
rownames(cont.2) == rownames(continuous)  #make sure the rows are in the right order
#The following was the only continuous data avaialble in this metadata file
#AGEENR - Age in years
AGEENR <- cont.2$AGEENR
continuous <- cbind(continuous, AGEENR) 

#remove the RANDSID column from the table
continuous <- continuous[,-which(colnames(continuous) %in% c("RANDSID"))]

#write continuous data out to a file
write.table(continuous, "continuous.metadata", quote=F)


# At this point all of the categorical and continuous data have been identified and stored to a file. We have enverything that was used to generate Extended Data Table 1:

# In[ ]:


summary(categorical)

#count bfed "not remember"
bfed <- categ.3$DSUBFED_C
bfed[bfed == ""] <- NA
summary(factor(bfed))

#count transients
sum(apply(contra, 1, sum, na.rm=T) > 0)-sum(contra.chronic, na.rm=T); 
sum(apply(vit_min, 1, sum, na.rm=T) > 0)-sum(vit_min.chronic, na.rm=T); 
sum(apply(antidep, 1, sum, na.rm=T) > 0)-sum(antidep.chronic, na.rm=T); 
sum(apply(antihist, 1, sum, na.rm=T) > 0)-sum(antihist.chronic, na.rm=T); 

summary(continuous)

pH.summary <- pH[,c("DVDINTPH", "DVDPFPH")]
apply(pH.summary, 2, summary)


# Let's copy these metadata files to our analysis folder and head over there to do our analysis.

# In[ ]:


get_ipython().run_cell_magic('bash', '', 'cp categorical.metadata continuous.metadata vaginal_ph.txt ../analysis/\ncd ../analysis/\n')


# # Analysis
# Now we have the sequence data and clinical metadata curated and into a form that we can handle. Let's do some analysis. We have several tasks ahead of us...  
# 
# * Describe community types
# * Identify associations between body sites and clinical metadata
# * Identify associations between body sites
# * Quantify rate of change between community types for each body site
# * Effect of time on stability of community types
# * Validate DMM models as a community type approach relative to the PAM-based approach
# 
# 
# ##Describe community types
# First we wanted to inspect the DMM model fits and see the top taxa in each community type. This analysis was performed using the data found in the dmm_best folder. Let's generate the types of plots shown in Figure 1A and 1B. First we'll make the line plot of the Laplace fit metric vs. the number of community types (e.g. Figure 1A). For fun we'll bold and color in red the x-axis label that corresponds to the minimum Laplace metric. These will be outputted as pdfs:

# In[ ]:


get_ipython().run_cell_magic('R', '', 'dmm_best_path <- "../v35/dmm_best/"\nfit.files <- c("Antecubital_fossa.tx.1.subsample.1.dmm.mix.fit", "Anterior_nares.tx.1.subsample.1.dmm.mix.fit",\n\t\t\t\t"Buccal_mucosa.tx.1.subsample.1.dmm.mix.fit", "Hard_palate.tx.1.subsample.1.dmm.mix.fit",\n\t\t\t\t"Keratinized_gingiva.tx.1.subsample.1.dmm.mix.fit", "Palatine_Tonsils.tx.1.subsample.1.dmm.mix.fit",\n\t\t\t\t"Retroauricular_crease.tx.1.subsample.1.dmm.mix.fit", "Saliva.tx.1.subsample.1.dmm.mix.fit",\n\t\t\t\t"Stool.tx.1.subsample.1.dmm.mix.fit", "Subgingival_plaque.tx.1.subsample.1.dmm.mix.fit",\n\t\t\t\t"Supragingival_plaque.tx.1.subsample.1.dmm.mix.fit", "Throat.tx.1.subsample.1.dmm.mix.fit",\n\t\t\t\t"Tongue_dorsum.tx.1.subsample.1.dmm.mix.fit", "Vagina.tx.1.subsample.1.dmm.mix.fit")\n\nfor(bs in fit.files){\n\tpdf.file <- paste(bs, ".pdf", sep="")\n\tfit <- read.table(file=paste(dmm_best_path, bs, sep=""), header=T)\n\n\tpdf(file=pdf.file)\n\t\n\tbs <- gsub("([^.]*).*", "\\\\1", bs)\n\tbs <- gsub("_", " ", bs)\n\t\n\tpar(mar=c(5, 5, 0.5, 0.5))\n\tplot(fit$Laplace~fit$K, xlab="Number of Dirichlet\\nComponents", ylab=paste("Model Fit\\n(",bs, ")", sep=""), type="l")\n\tpoints(fit$Laplace~fit$K, pch=19)\n\t\n\taxis(1, at=which.min(fit$Laplace), label=which.min(fit$Laplace), col.axis="red", font=2)\n\t\n\tdev.off()\n}\n')


# Now let's make the box plots that show the relative abundance for the top 5 genera in each community type for each body site (e.g. Figure 1B)

# In[ ]:


get_ipython().run_cell_magic('R', '', '\ndmm_best_path <- "../v35/dmm_best/"\n\nsites <- c("Antecubital_fossa.tx.1.subsample.1.dmm.mix.design", "Anterior_nares.tx.1.subsample.1.dmm.mix.design",\n\t\t"Buccal_mucosa.tx.1.subsample.1.dmm.mix.design", "Hard_palate.tx.1.subsample.1.dmm.mix.design",\n\t\t"Keratinized_gingiva.tx.1.subsample.1.dmm.mix.design", "Palatine_Tonsils.tx.1.subsample.1.dmm.mix.design",\n\t\t"Retroauricular_crease.tx.1.subsample.1.dmm.mix.design", "Saliva.tx.1.subsample.1.dmm.mix.design",\n\t\t"Stool.tx.1.subsample.1.dmm.mix.design", "Subgingival_plaque.tx.1.subsample.1.dmm.mix.design",\n\t\t"Supragingival_plaque.tx.1.subsample.1.dmm.mix.design", "Throat.tx.1.subsample.1.dmm.mix.design",\n\t\t"Tongue_dorsum.tx.1.subsample.1.dmm.mix.design", "Vagina.tx.1.subsample.1.dmm.mix.design")\n\n#need to get the last level that was given a classification for each phylotype\ntaxonomy <- read.table(file="../v35/v35.unique.good.filter.unique.precluster.unique.pick.three.wang.pick.tx.1.cons.taxonomy",\n                       header=T, row.names=1)\ntaxonomy$Taxonomy <- gsub("\\\\(\\\\d*\\\\)", "", taxonomy$Taxonomy)\ntaxonomy$Taxonomy <- gsub("unclassified;", "", taxonomy$Taxonomy)\ntaxonomy$Taxonomy <- gsub(";$", "", taxonomy$Taxonomy)\ntaxonomy$Taxonomy <- gsub(".*;", "", taxonomy$Taxonomy)\ntaxonomy$Taxonomy <- gsub(\'"\', "", taxonomy$Taxonomy)\n\nfor(bs in sites){\n\tdesign.fname <- bs #need the design file to know which sample belongs to each community type\n   \tdesign <- read.table(file=paste(dmm_best_path, design.fname, sep=""))\n    \n\tshared.fname <- gsub("1.dmm.mix.design", "shared", bs) #need the relative abundance data to know\n\tshared <- read.table(file=paste("../v35/", shared.fname, sep=""), header=T, row.names=2)#the abundance of each\n\tshared <- shared[,-c(1,2)]                         #OTU in each sample\n\n    summary.fname <- gsub("design", "summary", bs) #need the summary file to know the top 5 OTUs\n\tsummary <- read.table(file=paste(dmm_best_path, summary.fname, sep=""), header=T, row.names=1)#for defining each commmunity type\n    \n    site <- gsub("^([^.]*).*", "\\\\1", design.fname)\n\tsite <- gsub("_", " ", site)\n\t\n\tntypes <- length(levels(design$V2)) #get the number of community types\n\ttop.otus <- summary[1:5,]\n\totu.ids <- rownames(top.otus) #get the names of the top 5 phylotypes\n\totu.tax <- taxonomy[otu.ids,"Taxonomy"] #get their taxonomic affiliation\n\n\totu.shared <- shared[,otu.ids]/1000\t#get the relative abundance\n\trownames(otu.shared) == design$V1\t#check they\'re in the right order\n\n\tpdf(file=paste(summary.fname, ".pdf", sep=""))  #let\'s save these as pdfs\n\tpar(mar=c(10, 4, 0.5, 0.5))\n\tplot(x=c(0,5*(ntypes+2)), y=c(0,1), type="n", axes=F, xlab="", ylab="Relative abundance (%)")\n\n\twidth = 0.5\n\tleft <- 1\n\tclrs <- rainbow(ntypes)\n\n\tfor(i in otu.ids){\n        #calculate the intra quartile ranges\n\t\tiqr <- aggregate(otu.shared[,i], by=list(design$V2), quantile, probs=c(0.025, 0.25, 0.50, 0.75, 0.975))\n\t\n        #error bars represent the 95% confindence interval\n\t\tarrows(x0=left:(left+ntypes-1), x1=left:(left+ntypes-1), y0=iqr$x[,"50%"], y1=iqr$x[,"97.5%"], angle=90,\n               length=0.05, lwd=2)\n\t\tarrows(x0=left:(left+ntypes-1), x1=left:(left+ntypes-1), y0=iqr$x[,"50%"], y1=iqr$x[,"2.5%"], angle=90,\n               length=0.05, lwd=2)\n\n        #rectangles represent the 50% confidence interval with a line in the middle for the median\n        rect(xleft=(left:(left+ntypes-1))-width, xright=(left:(left+ntypes-1))+width, ytop=iqr$x[,"75%"],\n             ybottom=iqr$x[,"50%"], col=clrs, lwd=3)\n\t\trect(xleft=(left:(left+ntypes-1))-width, xright=(left:(left+ntypes-1))+width, ytop=iqr$x[,"25%"],\n             ybottom=iqr$x[,"50%"], col=clrs, lwd=3)\n\n        #bounce down the x-axis for the next phylotype\n\t\tleft <- left + ntypes + 2\n\t}\n\n\taxis(2, label=seq(0,100,20), at=seq(0,1,0.2), las=2)\n\tstart <- (ntypes+1) / 2\n\tincrement <- ntypes + 2\n\n    #label the phylotypes on the x-axis\n\taxis(1, label=otu.tax, at=c(start, start+increment, start+2*increment, start+3*increment, start+4*increment),\n         cex=0.5, las=2)\n\tbox()\n\n    #get a pretty legend\n\tlegend("topright", legend=paste("Community Type", LETTERS[1:ntypes]), fill=clrs, pt.cex=1.5, cex=1, bty="n")\n\tmtext(1, line=8.5, text=site, font=2)  #tell us what the body site was\n\tdev.off()\n}\n#expect some warning messages: these indicate that the median for some phylotypes was zero and so\n#the bottom rectangle for the IQR is zero. whatever.\n')


# ## Identify associations between body sites and clinical metadata
# ### Categorical metadata
# In attempting to deal with the community type data we have a small difficulty that we have up to 3 community type assignments for each person at each body site but the clinical metadata did not change. Our strategy was to use an iterative procedure. Our basic method for each body site was the following. First, we randomly picked one visit from each subject and used Fisher's exact test to test for an association with each of the clinical variables. We corrected for multiple comparisons using the Benjamini-Hochberg algorithm. Second, we again randomly selected one visit from each subject and tested. We did a total of 1000 iterations and quantified the median P-value and the fraction of iterations that yielded a significant test. The output from this section is discussed in the manuscript and provdied in SourceDataFigure3.xlsx

# In[ ]:


get_ipython().run_cell_magic('R', '', 'totalIters <- 1000\n\n#Set up the community type data - recall we generated these above to describe the community type\n#each subject at each visit\nv1<-read.table(file="community_types.v1.txt", header=T, row.names=1)\nv2<-read.table(file="community_types.v2.txt", header=T, row.names=1)\nv3<-read.table(file="community_types.v3.txt", header=T, row.names=1)\n\n\n#get the collection of the ids for all 300 subjects\nrandsids <- union(rownames(v1), union(rownames(v2), rownames(v3)))\n\n\n#Set up the categorical data\ncateg <- read.table(file="categorical.metadata", header=T)\ncateg.good <- categ[randsids,]\t#only want those metadata that we have microbiome data for\n\nvariables <- colnames(categ.good)\nbodysites <- colnames(v1)\n\n#this table will hold the fraction of iterations that a bodysite had a significant test\nsig.rate <- matrix(rep(0, length(variables)*length(bodysites)), nrow=length(variables))\nrownames(sig.rate) <- variables\ncolnames(sig.rate) <- bodysites\n\n\n#this table will hold the median P-value across the iterations for each bodysite\nmedians <- matrix(rep(0, length(variables)*length(bodysites)), nrow=length(variables))\nrownames(medians) <- variables\ncolnames(medians) <- bodysites\n\nfor(i in bodysites){    \n\tprint(i)\n\ttimes.sig <- rep(0, length(variables))  #set up the counting vector for the number of sig iters\n\tnames(times.sig) <- variables\n\ttotal <- 0\n\n\tsig <- rep(NA, length(variables))       #\n\tnames(sig) <- variables\n\n    #the record of the community type for each subject (rows) at each visit (column)\n\tvisits <- data.frame(matrix(rep(NA, length(randsids)*3), ncol=3))\n\tcolnames(visits) <- c("v1", "v2", "v3")\n\trownames(visits) <- randsids\n\n\tvisits[rownames(v1), "v1"] <- as.character(v1[,i])\n\tvisits[rownames(v2), "v2"] <- as.character(v2[,i])\n\tvisits[rownames(v3), "v3"] <- as.character(v3[,i])\n\n    \n    #set up the table of p-values for each body site and iteration\n\tmedian.iter <- matrix(rep(NA, length(variables)*totalIters), nrow=length(variables))\n\trownames(median.iter) <- variables\n\tcolnames(median.iter) <- 1:totalIters\n    \n\tfor(iter in 1:totalIters){  #start looping over the iterations\n        iter <- 1\n\t\ttype <- as.character()\n\t\tfor(j in randsids){     #for each subject, randomly select the community type at one visit\n\t\t\ttype[j] <- sample(na.omit(as.character(visits[j,])))[1] #omits those visits that didn\'t happen\n\t\t}\n\n\t\tfor(j in variables){  #now loop through the categorical variables\n            contingency <- table(type, categ.good[,j]) #make a contingency table\n\n\t\t\tif(sum(apply(contingency, 2, sum)>10) > 1){ #if either value of the metadata was seen less than 10 times, move on\n\t\t\t\tif(ncol(contingency)>2 | nrow(contingency) > 2){ \t#if we have more than 2 communitytypes or variable levels\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t#we\'ll simulate the p-values\n\t\t\t\t\tsig[j] <- fisher.test(contingency, simulate.p.value=T, B=1e5)$p.value\n\t\t\t\t} else{\n\t\t\t\t\tsig[j] <- fisher.test(contingency)$p.value\n\t\t\t\t}\n\t\t\t} else{\n\t\t\t\tsig[j] <- NA\n\t\t\t}\t\t\n\t\t}\n\n\t\tadjusted <- p.adjust(sig, method="BH")\n\t\tcorrect <- adjusted < 0.05\n\n\t\ttimes.sig <- correct + times.sig\n\t\ttotal <- total + 1\n\t\tmedian.iter[, iter] <- sig\n\t}\n\tsig.rate[,i] <- times.sig / total\t#calculate the fraction of significant tests\n\tmedians[,i] <- apply(median.iter, 1, median)\t#calculate the median p-value\n}\n\nwrite.table(sig.rate, file="categorical.sig.rate", quote=F, sep="\\t")\nwrite.table(medians, file="categorical.medians", quote=F, sep="\\t")\n\ninteresting <- apply(sig.rate, 1, max, na.rm=T) > 0.5  #cut through the data to see which data had any signficance\nsig.rate[interesting,]\n')


# In the following cell we see that there are 9 bodysite/metadata combinations that were significant in more than 75% of the iterations. We inadvertantly left out the observation that there is a significant association between Asians and their community type. We go ahead and make bar plots representing the contingency tables (e.g. Figure 1C and D and Extended Figures 2 and 3) 

# In[ ]:


threshold <- 0.75   #what's the minimum fraction of iterations that need to be significant
totalIters <- 100   #number of iterations to calculate the average contingency table for
                    #we dont need a lot of precision here

#we read back in the categorical significance rates but will ignore the SITE data becuase
#of the confounding described above and chronic Antihistamine usage since all the tests
#resuled in NAs.
sig.rate <- read.table(file="categorical.sig.rate", header=T)
sig.rate <- sig.rate[-which(rownames(sig.rate) %in% c("SITE", "antihist.chronic")),]

#Let's get the most frequently significant bodysites and metadata
row.max <- apply(sig.rate, 1, max, na.rm=T) > threshold
col.max <- apply(sig.rate, 2, max, na.rm=T) > threshold
most.sig <- sig.rate[row.max, col.max]
#there were 8 combinations


#we're essentially going to recreate the radonomized contingency tables like we did to 
#run the Fisher's exact test
v1<-read.table(file="community_types.v1.txt", header=T, row.names=1)
v2<-read.table(file="community_types.v2.txt", header=T, row.names=1)
v3<-read.table(file="community_types.v3.txt", header=T, row.names=1)
good.ids <- unique(c(rownames(v1), rownames(v2), rownames(v3)))

categ <- read.table(file="categorical.metadata", header=T)
categ.good <- categ[good.ids,]

for(md in rownames(most.sig)){      #scan over the metadata in most.sig
	for(bs in colnames(most.sig)){  #and scan over the bodysites in most.sig

		if(!is.na(sig.rate[md, bs] ) & sig.rate[md, bs] > threshold){#if it isn't an NA and
                                                                     #is over our threshold...
            #let's set up a visits matrix like before
            visits <- data.frame(matrix(rep(NA, length(good.ids)*3), ncol=3))
			colnames(visits) <- c("v1", "v2", "v3")
			rownames(visits) <- good.ids
			visits[rownames(v1), "v1"] <- as.character(v1[,bs])
			visits[rownames(v2), "v2"] <- as.character(v2[,bs])
			visits[rownames(v3), "v3"] <- as.character(v3[,bs])
			
            #get the possible community types that were observed at the bodysite 
			types <- levels(factor(c(visits[,1], visits[,2], visits[,3])))
			ntypes <- length(types)
			
            #get the possible values for the metadata
			levels <- levels(as.factor(categ.good[,md]))
			nlevels <- length(levels)
            
            #initialize the contingency table
			contingency <- as.table(matrix(rep(0, ntypes*nlevels), nrow=ntypes))
			rownames(contingency) <- types
			colnames(contingency) <- levels
			
            #let it rip...
			for(iter in 1:totalIters){
				type <- as.character()
				for(k in good.ids){	#four nested for loops!
					type[k] <- sample(na.omit(as.character(visits[k,])))[1]
				}
				type <- factor(type)
				contingency <- contingency + table(type, categ.good[,md])
			}
            
            #add up the contingency tables
			contingency <- contingency / totalIters    #get the proportions of each community type
			p.contingency <- prop.table(contingency, 2)#in each metadata class

            #make a pretty barplot and save it as a pdf
			pdf(file=paste(bs, ".", md, ".pdf", sep="")) 
			par(mar=c(6,5, 0.5,0.5))
			barplot(as.matrix(t(p.contingency)), beside=T, yaxt="n", names=LETTERS[1:nrow(p.contingency)], 
                    legend.text=colnames(p.contingency), args.legend=c(cex=1.2), ylim=c(0,max(p.contingency)+0.05))

			axis(2, at=seq(0,max(p.contingency+0.05),0.10), label=seq(0,max(p.contingency+0.05)*100, 10), las=1)

			mtext(side=1, line=2.5, text=paste(gsub("_", " ", bs), "Community Type"), cex=1.2)
			mtext(side=2, line=2.5, text="Percentage of Samples", cex=1.2)
			mtext(side=1, line=4, text=md)
			box()
			dev.off()
		}
		
	}
}


# In[ ]:


totalIters <- 1000

#Set up the community type data
v1<-read.table(file="community_types.v1.txt", header=T, row.names=1)
v2<-read.table(file="community_types.v2.txt", header=T, row.names=1)
v3<-read.table(file="community_types.v3.txt", header=T, row.names=1)

randsids <- union(rownames(v1), union(rownames(v2), rownames(v3)))


#Set up the continuous data
cont <- read.table(file="continuous.metadata", header=T, na.string=c("", NA))

#recall that we need to bring in the pH values from each visit - let's just initialize it for now
cont <- cbind(cont, "DVDINTPH"=rep(NA, nrow(cont)), "DVDPFPH"=rep(NA, nrow(cont)))

cont.good <- cont[randsids,]
variables <- colnames(cont.good)[2:ncol(cont.good)]
bodysites <- colnames(v1)

#now we'll get that pH data
ph.file <- read.table(file="vaginal_ph.txt", header=T)

#get the subject ids that we have pH data for
v1.names <- as.character(ph.file[ph.file$VISNO==1,1])
v2.names <- as.character(ph.file[ph.file$VISNO==2,1])
v3.names <- as.character(ph.file[ph.file$VISNO==3,1])
ph.names <- union(v1.names, union(v2.names, v3.names)) #get the unique list

#let's make a table of vaginal introitus pH values for each woman at each visit
ph.int <- data.frame("visit1"=rep(NA, length(ph.names)), "visit2"=rep(NA, length(ph.names)),
                     "visit3"=rep(NA, length(ph.names)), row.names=ph.names)
ph.int[v1.names, "visit1"] <- ph.file[ph.file$VISNO==1, "DVDINTPH"]
ph.int[v2.names, "visit2"] <- ph.file[ph.file$VISNO==2, "DVDINTPH"]
ph.int[v3.names, "visit3"] <- ph.file[ph.file$VISNO==3, "DVDINTPH"]

#let's make a table of posterior fornix pH values for each woman at each visit
ph.pf <- data.frame("visit1"=rep(NA, length(ph.names)), "visit2"=rep(NA, length(ph.names)),
                    "visit3"=rep(NA, length(ph.names)), row.names=ph.names)
ph.pf[v1.names, "visit1"] <- ph.file[ph.file$VISNO==1, "DVDPFPH"]
ph.pf[v2.names, "visit2"] <- ph.file[ph.file$VISNO==2, "DVDPFPH"]
ph.pf[v3.names, "visit3"] <- ph.file[ph.file$VISNO==3, "DVDPFPH"]


#set up a table with the frequency of significant tests for each variable / bodysite combinatino
sig.rate <- matrix(rep(0, (length(variables))*length(bodysites)), nrow=(length(variables)))
rownames(sig.rate) <- c(variables)
colnames(sig.rate) <- bodysites

#set up a table with the median p-value for each variable / bodysite combinatino
medians <- matrix(rep(0, (length(variables))*length(bodysites)), nrow=(length(variables)))
rownames(medians) <- c(variables)
colnames(medians) <- bodysites


for(i in bodysites){
	print(i)
	times.sig <- rep(0, length(variables))
	names(times.sig) <- variables
	total <- 0

	sig <- rep(NA, length(variables))
	names(sig) <- variables

	visits <- data.frame(matrix(rep(NA, length(randsids)*3), ncol=3))
	colnames(visits) <- c("v1", "v2", "v3")
	rownames(visits) <- randsids

	visits[rownames(v1), "v1"] <- as.character(v1[,i])
	visits[rownames(v2), "v2"] <- as.character(v2[,i])
	visits[rownames(v3), "v3"] <- as.character(v3[,i])

	median.iter <- matrix(rep(NA, length(variables)*totalIters), nrow=length(variables))
	rownames(median.iter) <- variables
	
	for(iter in 1:totalIters){
		type <- character()

		for(j in randsids){
			x<-1:3
			
			rand.vis <- sample(x)
			if(!is.na(visits[j,rand.vis[1]])){
				rand.vis <- rand.vis[1]
			} else if(!is.na(visits[j,rand.vis[2]])){
				rand.vis <- rand.vis[2]
			} else {
				rand.vis <- rand.vis[3]
			}		
			type[j] <- visits[j,rand.vis]
			cont.good[j, "DVDINTPH"] <- ph.int[j, rand.vis]#fill in the pH values for that visit
			cont.good[j, "DVDPFPH"] <- ph.pf[j, rand.vis] 
		}

		for(j in variables){  #do the anova
			sig[j] <- anova(lm(cont.good[,j]~factor(type)))$Pr[1]
		}

		adjusted <- p.adjust(sig, method="BH") #correct for multiple comparisons
		correct <- adjusted < 0.05  #significant?

		times.sig <- correct + times.sig #add 1 if significant
		total <- total + 1
		median.iter[, iter] <- sig #keep track of p-values
	}
	sig.rate[,i] <- times.sig / total #get rate of significance
	medians[,i] <- apply(median.iter, 1, median) #get median p-values
}

write.table(sig.rate, file="continuous.sig.rate", quote=F, sep="\t")
write.table(medians, file="continuous.medians", quote=F, sep="\t")

interesting <- apply(sig.rate, 1, max, na.rm=T) > 0.5  #cut through the data to see which data had any signficance
sig.rate[interesting,]  #maybe not all that interesting afterall


# There were weak associations between vaginal pH and the vaginal community types

# ### Continuous metadata
# Now, let's use a similar procedure to look at the coninuous data. Instead of using Fisher's exact test, we'll use ANOVA and we'll use the pH values that were obtained from that day's sampling.

# In[ ]:


rm(list=ls())
#set up the community type data
#first we get the subject ids
categ <- read.table(file="categorical.metadata", header=T, row.names=1)
randsids <- rownames(categ)

#next we read in the community type assignments
v1<-read.table(file="community_types.v1.txt", header=T, row.names=1)
v2<-read.table(file="community_types.v2.txt", header=T, row.names=1)
v3<-read.table(file="community_types.v3.txt", header=T, row.names=1)

#now we build the visit tables for the three vaginal sites - intialize
vi.types <- matrix(rep(NA, length(randsids)*3), nrow=length(randsids))
rownames(vi.types) <- randsids
colnames(vi.types) <- c("visit1", "visit2", "visit3")
mv.types <- vi.types
pf.types <- vi.types
pf.ph <- vi.types
int.ph <- vi.types

#now we build the visit tables for the three vaginal sites - fill
vi.types[rownames(v1), "visit1"] <- as.character(v1[,"Vaginal_introitus"])
vi.types[rownames(v2), "visit2"] <- as.character(v2[,"Vaginal_introitus"])
vi.types[rownames(v3), "visit3"] <- as.character(v3[,"Vaginal_introitus"])

mv.types[rownames(v1), "visit1"] <- as.character(v1[,"Mid_vagina"])
mv.types[rownames(v2), "visit2"] <- as.character(v2[,"Mid_vagina"])
mv.types[rownames(v3), "visit3"] <- as.character(v3[,"Mid_vagina"])

pf.types[rownames(v1), "visit1"] <- as.character(v1[,"Posterior_fornix"])
pf.types[rownames(v2), "visit2"] <- as.character(v2[,"Posterior_fornix"])
pf.types[rownames(v3), "visit3"] <- as.character(v3[,"Posterior_fornix"])


#now we'll get that pH data 
ph.file <- read.table(file="vaginal_ph.txt", header=T)
ph1 <- ph.file[ph.file$VISNO==1, c("RANDSID", "DVDINTPH", "DVDPFPH")]
ph2 <- ph.file[ph.file$VISNO==2, c("RANDSID", "DVDINTPH", "DVDPFPH")]
ph3 <- ph.file[ph.file$VISNO==3, c("RANDSID", "DVDINTPH", "DVDPFPH")]

rownames(ph1) <- ph1$RANDSID; ph1 <- ph1[,-1]
rownames(ph2) <- ph2$RANDSID; ph2 <- ph2[,-1]
rownames(ph3) <- ph3$RANDSID; ph3 <- ph3[,-1]

#introitus pH
int.ph[rownames(ph1),"visit1"] <- ph1[, "DVDINTPH"]
int.ph[rownames(ph2),"visit2"] <- ph2[, "DVDINTPH"]
int.ph[rownames(ph3),"visit3"] <- ph3[, "DVDINTPH"]

#posterior pH
pf.ph[rownames(ph1),"visit1"] <- ph1[, "DVDPFPH"]
pf.ph[rownames(ph2),"visit2"] <- ph2[, "DVDPFPH"]
pf.ph[rownames(ph3),"visit3"] <- ph3[, "DVDPFPH"]

#concatenate the commmunity type data from the three visits into a single vector
vi <- c(vi.types[,"visit1"], vi.types[,"visit2"], vi.types[,"visit3"])
mv <- c(mv.types[,"visit1"], mv.types[,"visit2"], mv.types[,"visit3"])
pf <- c(pf.types[,"visit1"], pf.types[,"visit2"], pf.types[,"visit3"])

#concatenate the pH data from the three visits into a single vector
intph <- c(int.ph[,"visit1"], int.ph[,"visit2"], int.ph[,"visit3"])
pfph <- c(pf.ph[,"visit1"], pf.ph[,"visit2"], pf.ph[,"visit3"])

#build the stripcharts...
pdf(file="Vagina_introitus.DVDINTPH.pdf")
	par(mar=c(5,4,0.5,0.5))
	good <- !is.na(vi) & !is.na(intph)
	stripchart(intph[good]~factor(vi[good]), pch=19, vertical=T, method="jitter", xlab="", ylab="",
                               ylim=c(3,7), yaxt="n", xaxt="n", col="red")
	title(ylab="Vaginal Introitus pH")
	mtext(1, line=3, text="Vagina Introitus Community Type")
	ave <- aggregate(intph[good], by=list(factor(vi[good])), mean)
	segments(x0=seq(0.8,4.8,1), x1=seq(1.2,5.2, 1), y0=ave$x, lwd=4)
	axis(1, at=1:5, label=paste(LETTERS[1:5]))
	axis(2, las=2)
dev.off()

pdf(file="Vagina_introitus.DVDPFPH.pdf")
	par(mar=c(5,4,0.5,0.5))
	good <- !is.na(vi) & !is.na(pfph)
	stripchart(pfph[good]~factor(vi[good]), pch=19, vertical=T, method="jitter", xlab="", ylab="",
                               ylim=c(3,7), yaxt="n", xaxt="n", col="red")
	title(ylab="Posterior Fornix pH")
	mtext(1, line=3, text="Vagina Introitus Community Type")
	ave <- aggregate(pfph[good], by=list(factor(vi[good])), mean)
	segments(x0=seq(0.8,4.8,1), x1=seq(1.2,5.2, 1), y0=ave$x, lwd=4)
	axis(1, at=1:5, label=paste(LETTERS[1:5]))
	axis(2, las=2)
dev.off()


pdf(file="Mid_vagina.DVDINTPH.pdf")
	par(mar=c(5,4,0.5,0.5))
	good <- !is.na(mv) & !is.na(intph)
	stripchart(intph[good]~factor(mv[good]), pch=19, vertical=T, method="jitter", xlab="", ylab="",
                               ylim=c(3,7), yaxt="n", xaxt="n", col="red")
	title(ylab="Vaginal Introitus pH")
	mtext(1, line=3, text="Mid Vagina Community Type")
	ave <- aggregate(intph[good], by=list(factor(mv[good])), mean)
	segments(x0=seq(0.8,4.8,1), x1=seq(1.2,5.2, 1), y0=ave$x, lwd=4)
	axis(1, at=1:5, label=paste(LETTERS[1:5]))
	axis(2, las=2)
dev.off()

pdf(file="Mid_vagina.DVDPFPH.pdf")
	par(mar=c(5,4,0.5,0.5))
	good <- !is.na(mv) & !is.na(pfph)
	stripchart(pfph[good]~factor(mv[good]), pch=19, vertical=T, method="jitter", xlab="", ylab="",
                               ylim=c(3,7), yaxt="n", xaxt="n", col="red")
	title(ylab="Posterior Fornix pH")
	mtext(1, line=3, text="Mid Vagina Community Type")
	ave <- aggregate(pfph[good], by=list(factor(mv[good])), mean)
	segments(x0=seq(0.8,4.8,1), x1=seq(1.2,5.2, 1), y0=ave$x, lwd=4)
	axis(1, at=1:5, label=paste(LETTERS[1:5]))
	axis(2, las=2)
dev.off()


pdf(file="Posterior_fornix.DVDINTPH.pdf")
	par(mar=c(5,4,0.5,0.5))
	good <- !is.na(pf) & !is.na(intph)
	stripchart(intph[good]~factor(pf[good]), pch=19, vertical=T, method="jitter", xlab="", ylab="",
                               ylim=c(3,7), yaxt="n", xaxt="n", col="red")
	title(ylab="Vaginal Introitus pH")
	mtext(1, line=3, text="Posterior Fornix Community Type")
	ave <- aggregate(intph[good], by=list(factor(pf[good])), mean)
	segments(x0=seq(0.8,4.8,1), x1=seq(1.2,5.2, 1), y0=ave$x, lwd=4)
	axis(1, at=1:5, label=paste(LETTERS[1:5]))
	axis(2, las=2)
dev.off()

pdf(file="Posterior_fornix.DVDPFPH.pdf")
	par(mar=c(5,4,0.5,0.5))
	good <- !is.na(pf) & !is.na(pfph)
	stripchart(pfph[good]~factor(pf[good]), pch=19, vertical=T, method="jitter", xlab="", ylab="",
                               ylim=c(3,7), yaxt="n", xaxt="n", col="red")
	title(ylab="Posterior Fornix pH")
	mtext(1, line=3, text="Posterior Fornix Community Type")
	ave <- aggregate(pfph[good], by=list(factor(pf[good])), mean)
	segments(x0=seq(0.8,4.8,1), x1=seq(1.2,5.2, 1), y0=ave$x, lwd=4)
	axis(1, at=1:5, label=paste(LETTERS[1:5]))
	axis(2, las=2)
dev.off()


# ### The SITE variable
# The variable that was most frequently detected as being sigificantly associated with the various community types was SITE - the city that the subject was recruited from - Houston and St. Louis. This was also seen in the original Nature paper from 2012. As described way back at the beginning, the SITE varaible was perfectly confounded with where the DNA extraction, PCR and sequencing were done. Although it's entirely possible that people in Houston and St Louis have very different microbiomes, we wouldn't want to rest on this dataset for that comparison. Let's see whether any of our other metadata are associated with the SITE variable. Let's start with the categorical data.

# In[ ]:


categ <- read.table(file="categorical.metadata", header=T)
site <- categ$SITE
categ <- categ[,-1]

p.values <- rep(0, ncol(categ))
names(p.values) <- colnames(categ)

for(md in colnames(categ)){
	p.values[md] <- fisher.test(site, categ[,md])$p.value
}

p.adjust(sort(p.values), method="BH")
p.adjust(sort(p.values), method="BH") < 0.05


# What we glean from this is that their mom and dad's country of origin, whether they were Asian, white, or hispanic, whether they chronically use vitamins and mineral supplements, whether they had dental insurance, and their education level were significantly associated with the city of origin. This may cause us to not over interpret the result relating the vaginal community type to education level; however, considering that association was among the strongest observed, it is hard to discount.  
#   
# Now we want to turn to the continuous metadata:

# In[ ]:


cont <- read.table(file="continuous.metadata", header=T)
rownames(cont) == rownames(categ)

p.values <- rep(0, ncol(cont))
names(p.values) <- colnames(cont)

for(md in colnames(cont)){
	p.values[md] <- anova(lm(cont[,md]~factor(site)))$Pr[1]
}

p.adjust(p.values, method="BH")
p.adjust(p.values, method="BH") < 0.05


# For some reason the diastolic blood pressure was significantly different between both cities. None of these variable were significantly associated with our community types so we'll move on.

# ## Identify associations between body sites
# Next we wanted to know whether the community type at one bodysite was more likely to be associated with the community type at a different bodysite. While we expected things like the left and right antecubital fossa to be associated, we were also interested in associations from disparate types of bodysites. Here we again used a method very similar to what we did for associating communitytypes with categorical metadata. An important difference is that instead of picking a random visit for each bodysite, we picked a random visit for each person and used all of the bodysite data for that person from that visit. We then iterated a bunch. These data became the basis for the heatmap in Figure 2.

# In[ ]:


get_ipython().run_cell_magic('R', '', '\ntotalIters <- 1000\n\n#Set up the community type data\nv1<-read.table(file="community_types.v1.txt", header=T, row.names=1)\nv2<-read.table(file="community_types.v2.txt", header=T, row.names=1)\nv3<-read.table(file="community_types.v3.txt", header=T, row.names=1)\n\nrandsids <- unique(c(rownames(v1), rownames(v2), rownames(v3)))\nbodysites <- colnames(v1)\n\n#make a table to keep track of when a subject visited. we will want\n#to take a person\'s entire microbiome, not just individual sites\nvisits <- matrix(as.numeric(rep(NA, length(randsids)*3)), ncol=3)\nrownames(visits) <- randsids\ncolnames(visits) <- c("v1", "v2", "v3")\nvisits[rownames(v1),1] <- 1\nvisits[rownames(v2),2] <- 2\nvisits[rownames(v3),3] <- 3\n\n\ntotal <- 0\ntimes.sig <- matrix(rep(0, 18*18), ncol=18)\nrownames(times.sig) <- bodysites\ncolnames(times.sig) <- bodysites\n\nmedian.iter <- array(NA, dim=c(18,18,totalIters))\n\nfor(iter in 1:totalIters){  #this is going to take awhile (triple loop!)\n\tif(iter %% 10 == 0){\n\t\tprint(iter)\n\t}\n\t\n\t#create a random visit across all individuals\n\trandom.visit <- matrix(ncol=18, nrow=0)\n\tcolnames(random.visit) <- bodysites\n\n\tfor(i in randsids){#for each individual, randomly pick one of their visits...\n\t\tvisit <- as.numeric(sample(na.omit(as.character(visits[i,])))[1])\n\t\n\t\t#and once we\'ve got that random visit, take all of their bodysite\n\t\t#data for that visit\n\t\tif(visit == 1){\n\t\t\trandom.visit <- rbind(random.visit, v1[i,])\n\t\t} else if(visit ==2){\n\t\t\trandom.visit <- rbind(random.visit, v2[i,])\n\t\t} else if(visit == 3){\n\t\t\trandom.visit <- rbind(random.visit, v3[i,])\n\t\t} else {\n\t\t\trandom.visit <- rbind(random.visit, rep(NA, length(bodysites)))\n\t\t}\n\t}\n\n\t#let\'s build a p-value matrix for each bodysite vs. each bodysite\n\tp.values <- matrix(rep(NA, 18*18), ncol=18)\n\t\n\tfor(i in 2:18){\n\t\tfor(j in 1:i){\n\t\t\tif(j<i){ #no need to do everything twice...\n\t\t\t\tcontingency <- table(random.visit[,i], random.visit[,j])\n\t\t\t\tnrows <- sum(apply(contingency, 1, sum)>0)\n\t\t\t\tncols <- sum(apply(contingency, 2, sum)>0)\n\t\t\t\t\n\t\t\t\tif(nrows == 1 | ncols == 1){#not sure this would happen, but...\n\t\t\t\t\tp.values[i,j] <- NA\n\t\t\t\t}\n\t\t\t\telse if(nrows > 2 | ncols > 2){ #if it\'s more than 2x2, bootstrap\n\t\t\t\t\tp.values[i,j] <- fisher.test(contingency, simulate.p.value=T, B=1e5)$p.value\n\t\t\t\t} else{\t#but if it\'s 2x2 just do the normal method\n\t\t\t\t\tp.values[i,j] <- fisher.test(contingency)$p.value\n\t\t\t\t}\n\t\t\t}\t\t\n\t\t}\n\t}\n\n\tadjusted <- p.adjust(p.values, method="BH")\t#adjust the p.values\n\tcorrect <- adjusted < 0.05\t#significant?\n\n\ttimes.sig <- correct + times.sig #increment if significant\n\ttotal <- total + 1\t\t\t\t\t#increment number of times through\n\tmedian.iter[, , iter] <- p.values\t#keep track of p.values\n}\n\nsig.rate <- times.sig / total\t\t#get the fraction significant\nsig.rate <- as.matrix(as.dist(sig.rate, upper=T))#make the matrix symmetric\ndiag(sig.rate) <- 1\t\t\t\t\t\t\t\t#make the diagonal 1\nrownames(sig.rate) <- bodysites\t#clean things up a bit\ncolnames(sig.rate) <- bodysites #clean thigns up a bit\nwrite.table(sig.rate, file="body_site.sig.rate", quote=F, sep="\\t")#output!\n\n#ditto on the median p-values\nmedians <- as.matrix(as.dist(apply(median.iter, c(1,2), median)))\ndiag(medians) <- 1e-6\nrownames(medians) <- bodysites\ncolnames(medians) <- bodysites\nwrite.table(medians, file="body_site.medians", quote=F, sep="\\t")\n\n\n#now to build the heatmap of p-values...\nbsite_ordered <- c(\t\t#want to sort the bodysites to make sense\n\t4,\t#"Keratinized_gingiva",\n\t15,\t#"Supragingival_plaque",\n\t14,\t#"Subgingival_plaque",\n\t2,\t#"Buccal_mucosa",\n\t3,\t#"Hard_palate",\n\t17,\t#"Tongue_dorsum",\n\t16,\t#"Saliva",\n\t8,\t#"Palatine_tonsils",\n\t16,\t#"Throat",\n\t13,\t#"Stool",\n\t1,\t#"Anterior_nares",\n\t5,\t#"L_Antecubital_fossa",\n\t10,\t#"R_Antecubital_fossa",\n\t6,\t#"L_Retroauricular_crease",\n\t11,\t#"R_Retroauricular_crease",\n\t18,\t#"Vaginal_introitus",\n\t7,\t#"Mid_vagina",\n\t9\t#"Posterior_fornix"\n)\n\n#scale the values between 1e-5 and 1\nscaled.values <- 7 - ceiling(as.matrix((-log10(medians[bsite_ordered,bsite_ordered]))))\nscaled.values[scaled.values == 0] <- 1\nscaled.values[is.na(scaled.values)] <- 6\n\npdf(file="body_site.pdf")\nheatmap(scaled.values, Rowv=NA, Colv=NA, scale="none", \n\tlabRow=gsub("_", " ", colnames(medians)[bsite_ordered]),\n    labCol=gsub("_", " ", colnames(medians)[bsite_ordered]),margins=c(10,10), revC=T,\n    col=c(heat.colors(5, alpha=0.8), "white"), font=2)\n\npar(xpd=T)\t#let\'s write anywhere!\n#loc<-locator()\nloc <- list("x"=0.83, "y"=0.1)\n\nlegend(loc, legend=c(expression(paste("P<10"^"-1")), expression(paste("P<10"^"-2")),\n\t\t\texpression(paste("P<10"^"-3")), expression(paste("P<10"^"-4")),\n\t\t\texpression(paste("P<10"^"-5"))), fill=rev(heat.colors(5, alpha=0.8)))\npar(xpd=F)\t#no more writing anywhere :(\ndev.off()\n')


# ## Quantify rate of change between community types for each body site
# Now we turned to the question of whether some bodysites were more stable than others. Since the subjects didn't come in on a regular interval, it was difficult to quantify true rates of change or to model the change. We tried several things all of which pointed to the fact that time alone could not explain the change. So we just present the rates of change for each community type and body site (see Figure 3). First we'll start by generating transition matrices for each bodysite

# In[ ]:


get_ipython().run_cell_magic('R', '', '\nv1 <- read.table(file="community_types.v1.txt", sep=" ", header=T)\nv2 <- read.table(file="community_types.v2.txt", sep=" ", header=T)\nv3 <- read.table(file="community_types.v3.txt", sep=" ", header=T)\n\nbodysites <- colnames(v1)\n\nfor(bs in bodysites){#for each bodysite\n\tv1.samples <- rownames(v1)#get the the randsid\n\tv1.parts <- as.character(v1[, grepl(bs, colnames(v1))])#get the community type\n\tv1.samples <- v1.samples[!is.na(v1.parts)]\t#exclude those sampes where we don\'t have a\n\tv1.parts <- v1.parts[!is.na(v1.parts)]\t    #community type\n\t\n\tv2.samples <- rownames(v2)\n\tv2.parts <- as.character(v2[, grepl(bs, colnames(v2))])\n\tv2.samples <- v2.samples[!is.na(v2.parts)]\t\n\tv2.parts <- v2.parts[!is.na(v2.parts)]\t\n\n\tv3.samples <- rownames(v3)\n\tv3.parts <- as.character(v3[, grepl(bs, colnames(v3))])\n\tv3.samples <- v3.samples[!is.na(v3.parts)]\t\n\tv3.parts <- v3.parts[!is.na(v3.parts)]\n    \n    #now we know the samples that have community types\n\n\tsamples <- unique(sort(c(v1.samples, v2.samples, v3.samples))) #the list of samples with types\n\tvisits <- data.frame(v1=rep(NA, length(samples)), v2=rep(NA, length(samples)),\n                         v3=rep(NA, length(samples)), row.names=samples)#initialize a frame\n\t\n\tvisits[(v1.samples), 1] <- v1.parts #fill the data frame with community type data\n\tvisits[(v2.samples), 2] <- v2.parts #from each visit\n\tvisits[(v3.samples), 3] <- v3.parts\n\n\tfirst <- c(visits[,1], visits[,2]) #create a vector for visits 1 and 2\n\tsecond <- c(visits[,2], visits[,3])#create a vector for visits 2 and 3\n\t#now we can compare first and second to see whether a subject changed between visits 1 and 2\n    #and between visits 2 and 3\n    \n\t#need to make sure that we have all the community types in the rows and columns\n\ttypes <- levels(factor(c(first, second)))\t\n\tntypes <- length(types)\n\ttransition <- matrix(rep(0, ntypes*ntypes), nrow=ntypes)\n\trownames(transition) <- types\n\tcolnames(transition) <- types\n\t\n\tt <- table(first, second)\n\ttransition[rownames(t), colnames(t)] <- t #now we should have a square transition matrix\n\t\n    #output\n\twrite.table(transition, paste(bs,".transition.matrix", sep=""), sep="\\t", quote=F)\n}\n')


# Now that we have all of the transition matrices for each bodysite, let's generate a bubble plot indicating the stability of each bodysite along with its relative frequency (e.g. Figure 3A). Along the x-axis we'll plot the stability of each community type, along the y-axis we'll plot the bodysite (bunched into general types) and the icon will be sized proportionally to the fraction of samples in that community type. A vertical line will be drawn for each bodysite to indicate the average community type stability across all subjects.

# In[ ]:


get_ipython().run_cell_magic('R', '', '\n#need to read in the transition matrix and calculate the stability values and fraction\n#of subjects\ngetStabProp <- function(bodysite){\n\tdata <- read.table(file=bodysite, header=T)\n\tparts <- unique(c(colnames(data), rownames(data)))\n\n    #get the proportions - what fraction of the rows go to columns\n\tprop <- prop.table(as.matrix(data), 1) \n\t\n\t#define stability as 1 - the proportion of change\n    stability <- 100*(1-diag(prop)) #get stability values\n    \n    #what fraction of people are in each community type\n\tproportion <- rowSums(data)/sum(data)\t\n\tdata.frame(row.names=rownames(data), stability=stability, proportion=proportion)\n}\n\n\n#need to scale the size of the blip to be proportional to the proportion of people in\n#that community type\ngetCEX <- function(proportion){\n\tmax.cex=5\n\tmin.cex=0.25\n\tcex <- (max.cex-min.cex)*proportion+min.cex\n\treturn(cex)\n}\n\n\nll=0.25  #length of the line indicating the average above the y-position\njitter =0.1 #amount to jitter points if there are duplicate stability values\n\n#need to plot the stability data along a line for each chunk of bodysites. If there are\n#duplicate stability values for a bodysite, need to jitter the position on the y-axis\nplotStability <- function(site.vector, pos){\n\n\tfor(site in site.vector){#do on all of the bodysites within a bodytype chunk\n\n\t\tdata <- getStabProp(site)#get the stability / proportion data\n\t\n\t\typos <- rep(pos, nrow(data))#set the default position on the y-axis\n        \n        #if there are duplicate stability values jitter them to either position on the y-axis\n\t\typos[duplicated(data$stability)] <- ypos[duplicated(data$stability)] + jitter\n\t\typos[duplicated(data$stability, fromLast=T)] <- ypos[duplicated(data$stability, fromLast=T)] - jitter\n\n        #plot the blips on the line sized in proportion to relative abundance\n\t\tpoints(x=data$stability, y=ypos, pch=21, bg="gray", cex=getCEX(data$proportion))\n\n        #calculate the average stability for the bodysite\n        avg <- sum(data$stability*data$proportion)\n        \n        #make a vertical line at the average\n\t\tlines(x=c(avg, avg), y=c(pos-ll, pos+ll), col="black", lwd=3)\n\n        #decrement the position on the y-axis\n        pos <- pos-1\n\t}\n\n}\n\n\n\nheader.line = 11  #position left of the y-axis for the body type heading\nindent.line = 9   #position left of the y-axis for the bodysite heading\n\npdf("bodysite.change.rate.pdf", height=11, width=8.5)\npar(mar=c(5, header.line+0.5, 0.5, 0.5)) #need padding on the right\n\nplot(x=c(0,100), y=c(1,27), type="n", main="", xlab="% Change", ylab="", axes=F)\naxis(1)\nbox()\n\n\n#gastrointestinal\nsites <- c("Stool.transition.matrix")\nnames <- gsub(".transition.matrix", "", sites)\nnames <- gsub("_", " ", names)\nmtext(side=2, line=header.line, at=27, text="Gastrointestinal", las=1, adj=0)\nmtext(side=2, line=indent.line, at=26, text="Stool", las=1, adj=0)\nabline(h=26, col="gray")#make horizontal gray line \n\nstart <- 26\nplotStability(sites, start)\n\n\n#oral\nsites <- c("Supragingival_plaque.transition.matrix", "Palatine_Tonsils.transition.matrix", \n\t\t\t"Subgingival_plaque.transition.matrix", "Throat.transition.matrix",\n\t\t\t"Tongue_dorsum.transition.matrix", "Keratinized_gingiva.transition.matrix", \n\t\t\t"Buccal_mucosa.transition.matrix", "Hard_palate.transition.matrix", "Saliva.transition.matrix")\nnames <- gsub(".transition.matrix", "", sites)\nnames <- gsub("_", " ", names)\nmtext(side=2, line=header.line, at=24, text="Oral", las=1, adj=0)\nmtext(side=2, line=indent.line, at=15:23, text=rev(names), las=1, adj=0)\nabline(h=15:23, col="gray")\n\nstart <- 23\nplotStability(sites, start)\n\n\n#pulmonary\nsites <- c("Anterior_nares.transition.matrix")\nnames <- gsub(".transition.matrix", "", sites)\nnames <- gsub("_", " ", names)\nmtext(side=2, line=header.line, at=13, text="Pulmonary", las=1, adj=0)\nmtext(side=2, line=indent.line, at=12, text=rev(names), las=1, adj=0)\nabline(h=12, col="gray")\n\nstart <- 12\nplotStability(sites, start)\n\n\n#skin\nsites <- c("R_Retroauricular_crease.transition.matrix", "L_Retroauricular_crease.transition.matrix",\n\t\t\t"R_Antecubital_fossa.transition.matrix", "L_Antecubital_fossa.transition.matrix")\nnames <- gsub(".transition.matrix", "", sites)\nnames <- gsub("_", " ", names)\nmtext(side=2, line=header.line, at=10, text="Skin", las=1, adj=0)\nmtext(side=2, line=indent.line, at=6:9, text=rev(names), las=1, adj=0)\nabline(h=6:9, col="gray")\n\nstart <- 9\nplotStability(sites, start)\n\n\n#vagina\nsites <- c("Vaginal_introitus.transition.matrix","Mid_vagina.transition.matrix", \n\t\t\t"Posterior_fornix.transition.matrix")\nnames <- gsub(".transition.matrix", "", sites)\nnames <- gsub("_", " ", names)\nmtext(side=2, line=header.line, at=4, text="Vagina", las=1, adj=0)\nmtext(side=2, line=indent.line, at=1:3, text=rev(names), las=1, adj=0)\nabline(h=1:3, col="gray")\n\nstart <- 3\nplotStability(sites, start)\n\n\n#make a legend. for Figure 3A we separated the points a bit so they were easier to see\nlegend(x=80, y=13, legend=c("1%", "10%", "25%", "50%", "75%"), pch=21, pt.bg="gray",\n                    bg="white", pt.cex=getCEX(c(0.01, 0.10, 0.25, 0.50, 0.75)))\n\ndev.off()\n')


# Now we'd like to graphically depict the transition matrices as we did in Figure 3B. These will allow us to see how frequently one community type changes into another or stays the same. This took a fair amount of messing with parameters to get the formatting to look decent...

# In[ ]:


get_ipython().run_cell_magic('R', '', '\ninstall.packages("diagram")\nlibrary(diagram)\n\n#set the layouts to use in plotmat for different number of community types\ngetPos <- function(n) {\n\tif(n==2)\t\t{\t\treturn(c(2))\t\t}\t\t\t\t\t\n\telse if(n==3)\t{\t\treturn(c(1,2))\t\t}\t\t\t\t\t\n\telse if(n==4)\t{\t\treturn(c(2,2))\t\t}\t\t\t\t\t\n\telse if(n==5)\t{\t\treturn(c(2,1, 2))\t}\t\t\t\t\t\n\telse if(n==6)\t{\t\treturn(c(2,2,2))\t}\n\telse if(n==7)\t{\t\treturn(c(2,3,2))\t}\n}\n\n        \n#describe the curvature when there are different numbers of community types\ngetCurves <- function(n) {\n\tif(n==2){\n\t\treturn(matrix(c(0,\t0.1,\n\t\t\t\t\t\t0.1,\t0), nrow=2))\n\t}\t\t\t\t\t\n\telse if(n==3){\n\t\treturn(matrix(c(0,\t0.08,\t0.08,\n\t\t\t\t\t\t0.08,\t0,\t0.08,\n\t\t\t\t\t\t0.08,\t0.08,\t0), nrow=3))\n\t}\t\t\t\t\t\n\telse if(n==4){\n\t\treturn(matrix(c(0.00,\t0.08,\t0.08,\t0.65,\n\t\t\t\t\t\t0.08,\t0.00,\t0.65,\t0.08,\n\t\t\t\t\t\t0.08,\t0.65,\t0.00,\t0.08,\n\t\t\t\t\t\t0.65,\t0.08,\t0.08,\t0.00), nrow=4))\n\t}\t\t\t\t\t\n\telse if(n==5){\n\t\treturn(matrix(c(0,\t0.05,\t0.1,\t0.05,\t0.6,\n\t\t\t\t\t\t0.05,\t0,\t0.1,\t0.6,\t0.05,\n\t\t\t\t\t\t0.1,\t0.1,\t0,\t0.1,\t0.1,\n\t\t\t\t\t\t0.05,\t0.6,\t0.1,\t0,\t0.05,\n\t\t\t\t\t\t0.6,\t0.05,\t0.1,\t0.05,\t0), nrow=5))\n\t}\n\telse if(n==6){\n\t\treturn(matrix(c(0,\t0.05,\t0.05,\t0.1,\t0.2,\t0.15,\n\t\t\t\t\t\t0.05,\t0,\t0.1,\t0.05,\t0.15,\t0.2,\n\t\t\t\t\t\t0.05,\t0.1,\t0,\t0.05,\t0.05,\t0.1,\n\t\t\t\t\t\t0.1,\t0.05,\t0.05,\t0,\t0.1,\t0.05,\n\t\t\t\t\t\t0.2,\t0.15,\t0.05,\t0.1,\t0,\t0.05,\n\t\t\t\t\t\t0.15,\t0.2,\t0.1,\t0.05,\t0.05,\t0), nrow=6))\n\t}\t\t\t\t\t\n\telse if(n==7){\n\t\treturn(matrix(c(0,\t0.15,\t0.15,\t0.15,\t0.1,\t0.1,\t0.25,\n\t\t\t\t\t\t0.15,\t0,\t0.1,\t0.15,\t0.15,\t0.25,\t0.1,\n\t\t\t\t\t\t0.15,\t0.1,\t0,\t0.15,\t0.25,\t0.15,\t0.1,\n\t\t\t\t\t\t0.15,\t0.15,\t0.15,\t0,\t0.15,\t0.15,\t0.15,\n\t\t\t\t\t\t0.1,\t0.15,\t0.25,\t0.15,\t0,\t0.1,\t0.15,\n\t\t\t\t\t\t0.1,\t0.25,\t0.15,\t0.15,\t0.1,\t0,\t0.15,\n\t\t\t\t\t\t0.25,\t0.1,\t0.1,\t0.15,\t0.15,\t0.15,\t0), nrow=7))\n\t}\t\t\t\t\t\n}\n\n        \n#describe on which side (left/right) and how big self arrows should be\ngetShiftX <- function(n) {\n\tif(n==2)\t\t{\treturn(c(-0.09, 0.09))\t\t\t\t\t\t\t\t\t}\t\n\telse if(n==3)\t{\treturn(c(0,-0.09,0.09))\t\t\t\t\t\t\t\t\t}\t\n\telse if(n==4)\t{\treturn(c(-0.09, 0.09, -0.09, 0.09))\t\t\t\t\t\t}\n\telse if(n==5)\t{\treturn(c(-0.09, 0.09, 0.09, -0.09, 0.09))\t\t\t\t}\n\telse if(n==6)\t{\treturn(c(-0.09, 0.09, -0.09, 0.09, -0.09, 0.09))\t\t}\n\telse if(n==7)\t{\treturn(c(-0.09, 0.09, -0.09, 0.09, 0.09,-0.09, 0.09))\t}\n}\n\n\n#describe on which side (top/bottom) and how big self arrows should be\ngetShiftY <- function(n) {\n\tif(n==2)\t\t{ return(0) }\n\telse if(n==3)\t{ return(c(0.09,-0.04,-0.04)) }\n\telse if(n==4)\t{ return(0) }\n\telse if(n==5)\t{ return(0) }\n\telse if(n==6)\t{ return(0) }\n\telse if(n==7)\t{ return(0) }\n}\n\n\n\nmatrices <- c("Anterior_nares.transition.matrix", "Buccal_mucosa.transition.matrix",\n\t\t\t"Hard_palate.transition.matrix", "Keratinized_gingiva.transition.matrix",\n\t\t\t"L_Antecubital_fossa.transition.matrix", "L_Retroauricular_crease.transition.matrix",\n\t\t\t"Mid_vagina.transition.matrix", "Palatine_Tonsils.transition.matrix",\n\t\t\t"Posterior_fornix.transition.matrix", "R_Antecubital_fossa.transition.matrix",\n\t\t\t"R_Retroauricular_crease.transition.matrix", "Saliva.transition.matrix",\n\t\t\t"Stool.transition.matrix", "Subgingival_plaque.transition.matrix",\n\t\t\t"Supragingival_plaque.transition.matrix", "Throat.transition.matrix",\n\t\t\t"Tongue_dorsum.transition.matrix", "Vaginal_introitus.transition.matrix")\n\nfor(m in matrices){#go through all of the bodysites\n\n\ttrans <- read.table(file=m, header=T)        #want to know the fraction of\n\ttrans <- 100*prop.table(as.matrix(trans), 1) #community type row that become community type column\t\n\ttrans <- format(trans, nsmall=1, digits=1)   #try to make the numbers pretty\n\tntypes <- nrow(trans) #get the number of types\n\n\tbs <- gsub("^([^.]*).*", "\\\\1", m) #get the bodysite\n\tbs <- gsub("_", " ", bs)\n\t\n    #plot the transition state diagram\n\tpdf(file=gsub("matrix", "pdf", m))\n\tplotmat(\n\t\tt(trans),\n\t\tpos = getPos(ntypes),\n\t\tcurve = getCurves(ntypes),\n\t\tself.shiftx = getShiftX(ntypes),\n\t\tself.shifty = getShiftY(ntypes),\n\n\t\tname = LETTERS[1:ntypes],\n\t\tlwd = 1.5,\n\t\tbox.lwd = 2,\n\t\tcex.txt = 0.75,\n\t\tbox.cex = 1.2,\n\t\tbox.size = 0.06,\n\t\tarr.length = 0.5,\n\t\tbox.type = "circle",\n\t\tbox.prop = 1,\n\t\tshadow.size = 0,\n\t\tself.cex = 0.6,\n\t\tmy = -0.075,\n\t\tmx = -0.01,\n\t\trelsize = 0.95,\n\t\tcex.main=1.5,\n\t\tbox.col="gray",\n\t\ttxt.col="black",\n\t\tarr.col="black",\n\t\tdtext=0.5,\n\t\tabsent=-0.1,\n\t\tmain = bs, \n        latex=T\n\t)\n\tdev.off()\n}\n')


# ## Effect of time on stability of community types
# Since the community types were not fixed for each person and some community types were not stable, we next endeavored to determine whether we could associate changes in community types with the duration between samplings. As we mention in the paper, it was not possible to identify any significant relationships.

# In[ ]:


get_ipython().run_cell_magic('R', '', '\n#get the community type data for each visit\nv1 <- read.table(file="community_types.v1.txt", sep=" ", header=T)\nv2 <- read.table(file="community_types.v2.txt", sep=" ", header=T)\nv3 <- read.table(file="community_types.v3.txt", sep=" ", header=T)\nbodysites <- colnames(v1)\n\n#get the dates of the various visits\nmd <- read.table(file="../dbGaP/phs000228.v3.pht002157.v1.p1.c1.EMMES_HMP_DTP_DHX_DVD.HMP.txt", header=T, sep="\\t", na.string=c(NA, ""))\nmd1 <- md[md$VISNO == "01", c("RANDSID", "study_day")]\nmd2 <- md[md$VISNO == "02", c("RANDSID", "study_day")]\nmd3 <- md[md$VISNO == "03", c("RANDSID", "study_day")]\n\nrandsids <- levels(factor(md$RANDSID))\ndates <- matrix(rep(NA, length(randsids)*3), nrow=length(randsids))\nrownames(dates) <- randsids\ncolnames(dates) <- c("visit1", "visit2", "visit3")\n\n#create a dates matrix so we know when each visit occured\ndates[as.character(md1$RANDSID), "visit1"] <- md1$study_day\ndates[as.character(md2$RANDSID), "visit2"] <- md2$study_day\ndates[as.character(md3$RANDSID), "visit3"] <- md3$study_day\n\n\nfor(bs in bodysites){#go through the bodysites\n\ttypes <- matrix(rep(NA, length(randsids)*3), ncol=3) #create a types matrix for the bodysite so we know the type\n\tcolnames(types) <- c("visit1", "visit2", "visit3")   #at each visit\n\trownames(types) <- randsids\n\t\n\ttypes[rownames(v1), "visit1"] <- v1[,bs] #fill the matirx\n\ttypes[rownames(v2), "visit2"] <- v2[,bs]\n\ttypes[rownames(v3), "visit3"] <- v3[,bs]\n\t\n    #figure out who had the same community type at visits 1 and 2 and at visits 2 and 3\n\tv12.same <- !is.na(types[,"visit1"]) & !is.na(types[,"visit2"]) & types[,"visit1"] == types[,"visit2"]\n\tv23.same <- !is.na(types[,"visit2"]) & !is.na(types[,"visit3"]) & types[,"visit2"] == types[,"visit3"]\n\n    #get the number of days between visits that had the same community type\n    dates.same <- c(dates[v12.same, "visit2"] - dates[v12.same, "visit1"], dates[v23.same, "visit3"] -\n                    dates[v23.same, "visit2"])\t\n\t\n\n    #figure out who had different community type at visits 1 and 2 and at visits 2 and 3\n    v12.different <- !is.na(types[,"visit1"]) & !is.na(types[,"visit2"]) & types[,"visit1"] != types[,"visit2"]\n\tv23.different <- !is.na(types[,"visit2"]) & !is.na(types[,"visit3"]) & types[,"visit2"] != types[,"visit3"]\n\n    #get the number of days between visits that had different community types\n    dates.different <- c(dates[v12.different, "visit2"] - dates[v12.different, "visit1"], dates[v23.different, "visit3"] -\n                         dates[v23.different, "visit2"])\t\n\n    \n    #test for a significant difference in the days between visits with different community types vs. those that were the same\n\tpooled.dates <- c(dates.same, dates.different)\n\tcategory <- c(rep("Same\\nCommunity Type", length(dates.same)), rep("Different\\nCommunity Type", length(dates.different)))\n\tp.value <- format(wilcox.test(pooled.dates~category)$p.value, digits=3)\n\n\t\n    #plot the data\n    pdf(file=paste(bs, ".time.pdf", sep=""))\n\tpar(mar=c(5, 8, 0.5, 0.5))\n\tstripchart(pooled.dates~category, method="jitter", yaxt="n", xaxt="n", ylab="", xlab="Days between samplings",\n               xlim=c(0,max(md3$study_day)))\n\taxis(1)\n\taxis(2, at=c(1,2), label=levels(factor(category)), tick=F, las=2)\n\ttext(x=450, y=2.18, label=paste(bs, "\\n(P=", p.value, ")", sep=""), pos=2)\n\tdev.off()\n}\n')


# Running this and looking at the resulting pdfs makes it clear that none of the times between samplimgs for the community types that changed and those that didn't change were significantly different. The P-value in the upper right corner of each plot is from a non-parametric Wilcox Test.

# ## Validate DMM models as a community type approach relative to the PAM-based approach
# Nearly every study to perform community typing has used partitioning around the medoid with distances between samples as the basis for identifying community types. Metrics (e.g. CH and Silouette) are then used to pick the optimal number of community types. For some reason [Koren et al. 2013](http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1002863) did not use the DMM approach eventhough it was published well before their study. We did our best to replicate their methods of simulating a community with four types and evaluate the PAM and DMM-based approach. We also, simulated a community with only one community type to look at the likelihood of falsely detecting a community type. Finally, we generated PAM-based community types for each bodysite and evaluated their Laplace score for those types. These results are summarized in the Supplemental Data section of our paper. 
# 
# ### Simulating communities with one and four community types
# The above mentioned [Koren et al. 2013](http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1002863) study simulated four community types. Here's their methods:  
# 
# > "We generated a synthetic dataset of 100 communities each containing 3,000 “sequences” belonging to 500 mock OTUs. For each synthetic community, 90% (2,700 sequences, or OTU observations) was drawn from the same randomly generated lognormal abundance distribution (shared across all communities) and the remaining 10% (300 sequences) drawn from one of four unique lognormal distributions, forcing the data into four clusters. We then applied the enterotyping methods as described above."
# 
# This was a bit vague (shape parameters?). Plus it didn't include a negative control, which would be a single community type. Plus they didn't do the DMM analysis. So we attempated to replicate and improve upon their analysis with two sets of imulations:
# 
#  * **Single community type.**  We constructed a truncated lognormal distribution with a normal mean of 1.25 and a normal standard deviation of 3.00 and a maximum richness of 500.  For each of 100 simulated communities we drew 3,000 random integers (i.e. sequences) from this distribution where each integer represented a specific OTU.  
#  * **Four community types.**  We used the same truncated lognormal distribution described for the single cluster simulation to draw 2,700 integers sequences for each of 25 samples.  We then permuted the distribution and drew an additional 300 integers from the permuted distribution and added those counts to non-permuted distribution.  We repeated this procedure three additional times so that we had effectively simulated sampling 3,000 sequences from 100 samples representing 4 community types.  
#  
# In the following code chunk we'll generate the **quad**ruple simulated data and the single simulated data and output them as shared files so we can analyze them using the PAM and DMM-based approaches

# In[ ]:


get_ipython().run_cell_magic('bash', '', 'cd ..\nmkdir simulation\ncd simulation/\n')


# In[3]:


get_ipython().run_cell_magic('R', '', '\ngetTruncatedLogNormal <- function(nseqs, limit, mu, sd){\n    mu <- 1.25\n    sd <- 3.00\n    \n\tdist <- round(rlnorm(nseqs*10, meanlog=mu, sdlog=sd))+1   #pull 10-fold more sequences than we need\n\tdist <- dist[dist < limit]                                #in case some go over limit (maximum richness)\n\tdist <- dist[0:nseqs]                                     #return the first nsequences\n}\n\n\niterations <- 100\n\nlower <- 1\nupper <- 500\n\nfrac.z <- 0.90                       #fraction to draw from the common distribution - z = common distribution\nnseqs <- 3000                        #number of sequences we want to draw\nncommunities <- 100                  #number of communities to sample from\nsamples_per <- ncommunities * 0.25   #we want each community to have 25% of the samples\n\na.sample <- sample(1:upper)          #here is the permutation of the ordering of the OTUs that corresponds\nb.sample <- sample(1:upper)          #to each of the four community types. this way the "extra" distribution\nc.sample <- sample(1:upper)          #will be the same shaped distribution, just in a different order\nd.sample <- sample(1:upper)\n\nfor(iter in 1:iterations){\n    #we\'re going to fill each of the 4 community types with data\n    \n\ta<-matrix(nrow=samples_per, ncol=upper)  \n\tfor(i in 1:samples_per){   #now looping over each of the 25 samples...\n\t\ta.dist <- getTruncatedLogNormal(nseqs*(1-frac.z), upper, mu, sd) #the a component draws from the LN distro and reorders\n\t\tz.dist <- getTruncatedLogNormal(nseqs*frac.z, upper, mu, sd)     #the z component is the regular ordered LN distro\n\t\ta.table <- table(c(a.sample[a.dist], z.dist))                    #get the number of times each OTU/integer shows up\n\t\tx <- rep(0, upper); names(x) <-1:500       #make sure everything is on the same OTU naming scheme\n\t\tx[names(a.table)] <- a.table\n\t\ta[i,] <- x     #viola - the a component of the shared file - repeat for b, c, and d\n\t}\n\n\tb<-matrix(nrow=samples_per, ncol=upper)\n\tfor(i in 1:samples_per){\n\t\tb.dist <- getTruncatedLogNormal(nseqs*(1-frac.z), upper, mu, sd)\n\t\tz.dist <- getTruncatedLogNormal(nseqs*frac.z, upper, mu, sd)\n\t\tb.table <- table(c(b.sample[b.dist], z.dist))\n\t\tx <- rep(0, upper); names(x) <-1:500\n\t\tx[names(b.table)] <- b.table\n\t\tb[i,] <- x\n\t}\n\n\tc<-matrix(nrow=samples_per, ncol=upper)\n\tfor(i in 1:samples_per){\n\t\tc.dist <- getTruncatedLogNormal(nseqs*(1-frac.z), upper, mu, sd)\n\t\tz.dist <- getTruncatedLogNormal(nseqs*frac.z, upper, mu, sd)\n\t\tc.table <- table(c(c.sample[c.dist], z.dist))\n\t\tx <- rep(0, upper); names(x) <-1:500\n\t\tx[names(c.table)] <- c.table\n\t\tc[i,] <- x\n\t}\n\n\td<-matrix(nrow=samples_per, ncol=upper)\n\tfor(i in 1:samples_per){\n\t\td.dist <- getTruncatedLogNormal(nseqs*(1-frac.z), upper, mu, sd)\n\t\tz.dist <- getTruncatedLogNormal(nseqs*frac.z, upper, mu, sd)\n\t\td.table <- table(c(d.sample[d.dist], z.dist))\n\t\tx <- rep(0, upper); names(x) <-1:500\n\t\tx[names(d.table)] <- d.table\n\t\td[i,] <- x\n\t}\n\n\tshared <- rbind(a, b, c, d)  #make shared file\n\n\tshared <- shared[, apply(shared, 2, sum) > 0]  #remove columns that only have zeros\n\tnotus <- ncol(shared) #build the shared file...\n\tlabel <- rep(1, ncommunities)  #the first column has a common label\n    \n    #the second column has the sample name - here we indicate the community type for each\n\tgroups <- c(paste(1:samples_per, ".A", sep=""),\n                paste(1:samples_per, ".B", sep=""),\n                paste(1:samples_per, ".C", sep=""),\n                paste(1:samples_per, ".D", sep=""))\n\tnotus.col <- rep(notus, ncommunities)  # the third column has the number of OTUs\n\tfull.shared <- cbind(label, groups, notus.col, shared)  #...build the shared file\n\tcolnames(full.shared) <- c("label", "Group", "numOTUs", 1:notus)\n\twrite.table(file=paste("quad_", iter, ".shared", sep=""), full.shared, row.names=F, quote=F, sep="\\t")\n    #done!\n}\n\n\n#generate the single case...\nfrac.z <- 1.0000  #now we want to build a shared file where there\'s only one community type\nfor(iter in 1:iterations){\n\tz<-matrix(nrow=ncommunities, ncol=upper)\n\tfor(i in 1:ncommunities){\n\t\tz.dist <- getTruncatedLogNormal(nseqs*frac.z, upper, mu, sd)  #again draw from our lognormal disribution\n\t\tz.table <- table(z.dist)                  #everything else is pretty much the same...\n\t\tx <- rep(0, upper); names(x) <-1:500\n\t\tx[names(z.table)] <- z.table\n\t\tz[i,] <- x\n\t}\n\tshared <- z\n\tshared <- shared[, apply(shared, 2, sum) > 0]\n\tnotus <- ncol(shared)\n\tlabel <- rep(1, ncommunities)\n\tgroups <- c(paste(1:ncommunities, ".A", sep=""))\n\tnotus.col <- rep(notus, ncommunities)\n\n\tfull.shared <- cbind(label, groups, notus.col, shared)\n\tcolnames(full.shared) <- c("label", "Group", "numOTUs", 1:notus)\n\n\twrite.table(file=paste("single_", iter, ".shared", sep=""), full.shared, row.names=F, quote=F, sep="\\t")\n}\n')


# Now that we've built the single and quadruple community type shared files we want to know whether we can detect those community types. First we'll start with the PAM-based approach. As we mention in the supplement, the Silhouette Index (SI) is prefered over the Calinski-Harabasz (CH) statistic because it had a more objective basis. They asserted that SI values below 0.25 have a lack of substantial structure to the clusters. Let's test this out...

# In[ ]:


require(cluster)
require(fpc)

n.simulations <- 100

dist.JSD <- function(inMatrix, pseudocount=0.000001, ...) {
	
	KLD <- function(x,y) sum(x *log(x/y))
	
#nb:	the line below is from the MetaHit group, which is actually the 
#		root JSD, not the straight up JSD
#	JSD <- function(x,y) sqrt(0.5 * KLD(x, (x+y)/2) + 0.5 * KLD(y, (x+y)/2))
	JSD <- function(x,y) (0.5 * KLD(x, (x+y)/2) + 0.5 * KLD(y, (x+y)/2))
	
	matrixColSize <- length(colnames(inMatrix))
	matrixRowSize <- length(rownames(inMatrix))
	colnames <- colnames(inMatrix)
	resultsMatrix <- matrix(0, matrixColSize, matrixColSize)

	#I'm not sure I "believe" in JSD, but it's what all the enterotype papers
	#have used so I use it here. There seems to be a funkiness here in that
	#you have to add a very small number to the relative abundances since you 
	#can easily have a divide by zero. Sketchy for such sparse data matrices. IMHO.
	inMatrix = apply(inMatrix,1:2,function(x) ifelse (x==0,pseudocount,x))

	for(i in 1:matrixColSize) {
		for(j in 1:matrixColSize) { 
			if(j > i){	#no need to do everything twice.
				resultsMatrix[i,j] <- JSD(as.vector(inMatrix[,i]),
											as.vector(inMatrix[,j]))
				resultsMatrix[j,i] <- resultsMatrix[i,j]
			}
		}
	}
	colnames -> colnames(resultsMatrix) -> rownames(resultsMatrix)
	as.dist(resultsMatrix)->resultsMatrix
	attr(resultsMatrix, "method") <- "dist"
	return(resultsMatrix) 
}

#use the built in pam clustering algoirthm
pam.clustering <- function(x,k) { # x is a distance matrix and k the number of clusters
	cluster <- as.vector(pam(as.dist(x), k, diss=TRUE)$clustering)
	return(cluster)
}


pam.approach <- function(prefix){
    max.ch <- vector()
    max.silhouette <- vector()
    
    write(paste("simulation", "k_CH", "max_CH", "k_Silhouette", "max_Silhoutte"),
          paste(prefix, "enterotype_output.txt", sep=""), append=F)
    
    for(i in 1:n.simulations){
    
        #read back in the simulated shared file
        data <- read.table(paste(prefix, i, ".shared", sep=""), header=T, row.names=2)
        data <- data[,-c(1,2)]
        data <- data / apply(data, 1, sum)
        data <- t(data)#need OTUs in rows and samples in columns
    
        data.dist <- dist.JSD(data)#get the distance matrix
    
        ch <- vector()
        width <- vector()
    
        for (k in 2:10) { #calcualte the ch and silhouette for k=2:10 (can't do k=1)
            data.cluster <- pam.clustering(data.dist, k)
            stats <- cluster.stats(data.dist, data.cluster)
            ch[k] <- stats$ch
            width[k] <- stats$avg.silwidth
        }
        
        max.ch[i] <- which.max(ch)	#get the maximum
        max.silhouette[i] <- which.max(width) #get the maximum
    
        write(paste(i, max.ch[i], max(ch, na.rm=T), max.silhouette[i], max(width, na.rm=T)),
              paste(prefix, "enterotype_output.txt", sep=""), append=T)
    }
}

pam.approach("single_")
pam.approach("quad_")


single <- read.table(file="single_enterotype_output.txt", header=T, row.names=1)
single$k_CH <- factor(single$k_CH)
single$k_Silhouette <- factor(single$k_Silhouette)
summary(single)

quadruple <- read.table(file="quad_enterotype_output.txt", header=T, row.names=1)
quadruple$k_CH <- factor(quadruple$k_CH)
quadruple$k_Silhouette <- factor(quadruple$k_Silhouette)
summary(quadruple)


# And so we see from these final commands that the PAM-based methods can't really call a single community type case (although the scores are admittedly low) and while it can call the four community type case, the scores are pretty mediocre by Koren et al.'s standards. Next we want to run the DMM models on the community type data. Like we did for the microbiome data, we will run these in parallel to control for picking the wrong starting condition and the natural variation in the algorithm. 

# In[4]:


get_ipython().run_cell_magic('bash', '', '\nmkdir dmm.sim.r1 dmm.sim.r2 dmm.sim.r3 dmm.sim.r4 dmm.sim.r5\n\n>simulated.batch\n\nfor SHARED in $(ls single_*.shared quad_*.shared)\ndo\n    echo "mothur \\"#get.communitytype(shared=../$SHARED, outputdir=./)\\"" >> simulated.batch\ndone\n\nsplit -l 10 simulated.batch simulated_\n\n\ncat > header.batch << "_EOF_"\n#!/bin/sh\n#PBS -l nodes=1:ppn=1\n#PBS -l walltime=500:00:00\n#PBS -l mem=1gb\n#PBS -j oe\n#PBS -m abe\n#PBS -V\n#PBS -M your@email.here\n#PBS -q first\n\necho "ncpus-2.pbs"\ncat $PBS_NODEFILE\nqstat -f $PBS_JOBID\n\ncd $PBS_O_WORKDIR\n\nNCPUS=`wc -l $PBS_NODEFILE | awk \'{print $1}\'`\n\n_EOF_\n\n\ncat > tail.batch << "_EOF_"\n\necho "qsub working directory absolute is"\necho $PBS_O_WORKDIR\nexit\n_EOF_\n\n\nfor DIR in dmm.sim.r1 dmm.sim.r2 dmm.sim.r3 dmm.sim.r4 dmm.sim.r5\ndo\n    cd $DIR\n    \n    cp ../simulated_?? ./\n    for BATCH in $(ls simulated_??)\n    do\n        cat ../header.batch $BATCH ../tail.batch > $BATCH.$DIR.batch\n        qsub $BATCH.$DIR.batch\n    done\n    \n    cd ../\ndone\n')


# Now that those jobs have completed we want to find the best fit for each simulation

# In[ ]:


get_ipython().run_cell_magic('R', '', '\nniters <- 100\n\ngetNTypes <- function(prefix){\n\tntypes <- rep(NA, niters)\n\n\tfor(i in 1:niters){\n\t\tr1 <- read.table(file=paste("dmm.sim.r1/", prefix, i, ".1.dmm.mix.fit", sep=""), header=T, row.names=1)\n\t\tr2 <- read.table(file=paste("dmm.sim.r2/", prefix, i, ".1.dmm.mix.fit", sep=""), header=T, row.names=1)\n\t\tr3 <- read.table(file=paste("dmm.sim.r3/", prefix, i, ".1.dmm.mix.fit", sep=""), header=T, row.names=1)\n\t\tr4 <- read.table(file=paste("dmm.sim.r4/", prefix, i, ".1.dmm.mix.fit", sep=""), header=T, row.names=1)\n\t\tr5 <- read.table(file=paste("dmm.sim.r5/", prefix, i, ".1.dmm.mix.fit", sep=""), header=T, row.names=1)\n\n\t\tr1[is.infinite(as.matrix(r1))] <- NA\n\t\tr2[is.infinite(as.matrix(r2))] <- NA\n\t\tr3[is.infinite(as.matrix(r3))] <- NA\n\t\tr4[is.infinite(as.matrix(r4))] <- NA\n\t\tr5[is.infinite(as.matrix(r5))] <- NA\n\n\t\tmaxK <- max(c(nrow(r1), nrow(r2), nrow(r3), nrow(r4), nrow(r5)))\n\t\tlaplace <- matrix(rep(NA, 5*maxK), nrow=maxK)\n\t\tcolnames(laplace) <- c("r1", "r2", "r3", "r4", "r5")\n\t\trownames(laplace) <- 1:maxK\n\t\t\n\t\tlaplace[1:nrow(r1),"r1"] <- r1[, "Laplace"]\n\t\tlaplace[1:nrow(r2),"r2"] <- r2[, "Laplace"]\n\t\tlaplace[1:nrow(r3),"r3"] <- r3[, "Laplace"]\n\t\tlaplace[1:nrow(r4),"r4"] <- r4[, "Laplace"]\n\t\tlaplace[1:nrow(r5),"r5"] <- r5[, "Laplace"]\n\t\t\n\t\tlaplace[apply(is.na(laplace), 1, sum) ==5 ,] <- rep(1e6,5)\n\n\t\tcomposite <- apply(laplace, 1, min, na.rm=T)\n\t\tntypes[i] <- which.min(composite)\n\t}\n\treturn(ntypes)\n}\n\nn.singletypes <- getNTypes("single_")\ntable(n.singletypes)\n\nn.quadtypes <- getNTypes("quad_")\ntable(n.quadtypes)\n')


# As these results indicate, the DMM-based approach correctly called all of the single and quadruple community type iterations.

# ### Applying PAM-based approach to HMP data
# Next we wanted to see how the PAM-based approach did on the observed community data. First, we assigned all of the bodysites to community types for varying numbers of clusters and found the best assignment using SI and CH. Next, we used those assignments to calculate the Laplace metric. This allowed us to compare the PAM and DMM-based assignments directly. Note that a key difference between the PAM and DMM-based approaches is that PAM-based methods require the use of a distance calculator whereas the DMM-based approach does not.

# In[ ]:


get_ipython().run_cell_magic('R', '', '\nshared <- c("Antecubital_fossa.tx.1.subsample.shared", "Anterior_nares.tx.1.subsample.shared", \n\t"Buccal_mucosa.tx.1.subsample.shared", "Hard_palate.tx.1.subsample.shared", \n\t"Keratinized_gingiva.tx.1.subsample.shared", "Palatine_Tonsils.tx.1.subsample.shared", \n\t"Retroauricular_crease.tx.1.subsample.shared", "Saliva.tx.1.subsample.shared", \n\t"Stool.tx.1.subsample.shared", "Subgingival_plaque.tx.1.subsample.shared", \n\t"Supragingival_plaque.tx.1.subsample.shared", \t"Throat.tx.1.subsample.shared", \n\t"Tongue_dorsum.tx.1.subsample.shared", "Vagina.tx.1.subsample.shared")\n\npath <- "../v35/"\n\n#set up a summary table where the rows are the bodysites an the columns are the silhouette width, the number\n#of clusters based on SI, the CH index, and the number of clusters based on the CH index\nsummary <- data.frame(matrix(rep(NA, 14*4), nrow=14, ncol=4))\nrownames(summary) <- gsub(".tx.1.subsample.shared", "", shared)\ncolnames(summary) <- c("width", "w_clusters", "ch", "ch_clusters")\n\t\nfor(s in shared){\n\twrite(s, "")\n\tstub <- gsub("shared", "", s)\n    file <- paste(path, s, sep="")\n\tdata <- read.table(file, header=T, row.names=2)  #get the shared file and turn it into relative abundance data\n\tdata <- data[,-c(1,2)]\n\tdata <- data / apply(data, 1, sum) \n\tdata <- t(data) #transpose it\n\n\tdata.dist <- dist.JSD(data)  #get the distance matrix\n\n\tch <- vector()\n\twidth <- vector()\n\n\tfor (k in 2:10) {  #get the si and ch indices for 2 to 10 clusters\n\t\tdata.cluster <- pam.clustering(data.dist, k)\n\t\tstats <- cluster.stats(data.dist, data.cluster)\n\t\tch[k] <- stats$ch\n\t\twidth[k] <- stats$avg.silwidth\n\t}\n\t\n\tpdf(paste(stub, "enterotype.pdf", sep=""), height=10, width=5)  #generate a spike plot for the SI and CH indices\n\tpar(mfrow=c(2,1))\n\tplot(ch, type="h", xlab="k clusters", ylab="CH index")\n\tplot(width~seq(1:10), type="h", xlab="kclusters", ylab="Average Silhouette Width")\n\tdev.off()\n\t\n\tmax.ch <- which.max(ch)  #get the number of types by CH\n\tmax.silhouette <- which.max(width)  #get the number of types by SI\n\t\n\tsummary[gsub(".tx.1.subsample.shared", "", s),] <- c(max(width, na.rm=T), max.silhouette, max(ch, na.rm=T), max.ch)#update the summary file\n\twrite.table(cbind(colnames(data), paste("Partition_", pam.clustering(data.dist, max.ch), sep="")), #output a design file\n                paste(stub, "pam.ch.enterotype", sep=""), quote=F, row.names=F, col.names=F) #based on the CH index\n\n    write.table(cbind(colnames(data), paste("Partition_", pam.clustering(data.dist, max.silhouette), sep="")), #output a design\n                paste(stub, "pam.sw.enterotype", sep=""), quote=F, row.names=F, col.names=F) #file based on the SI index\n}\t\n\nwrite.table(summary, file="pam.enterotype.summary", quote=F, sep="\\t") #output the summary file\n')


# Now that we have the community types by the PAM-based approach, let's run the community type assignments into the DMM model and calculate the Laplace value for each bodysite based on the number of clusters found using the SI and CH indices. To do this, we use a homebrew C++ program that will take in a shared file and a design file. You can get our code from [GitHub](https://github.com/SchlossLab/pds_dmm):

# In[ ]:


get_ipython().run_cell_magic('bash', '', '\ngit clone git@github.com:SchlossLab/pds_dmm.git \ncd pds_dmm\nmake\ncd ..\n\nfor ENTEROTYPE in $(ls *enterotype)\ndo\n    SHARED=${ENTEROTYPE//pam.*/shared}\n\tpds_dmm/pds_dmm -shared ../v35/$SHARED -design $ENTEROTYPE\ndone\n')


# Now we want to synthesize everything to make a table that looks like Extended Data Table 2.

# In[ ]:


body_stubs <- c("Antecubital_fossa", "Anterior_nares", "Buccal_mucosa", "Hard_palate", 
				"Keratinized_gingiva", "Palatine_Tonsils", "Retroauricular_crease",
				"Saliva", "Stool", "Subgingival_plaque", "Supragingival_plaque", "Throat", 
				"Tongue_dorsum", "Vagina")

sausage <- ".tx.1.subsample."

mix_suffix <- "1.dmm.mix.fit"
sw_suffix <- "pam.sw.fit"
ch_suffix <- "pam.ch.fit"

pam.summary <- read.table(file="pam.enterotype.summary", header=T)

extended.table.2 <- matrix(rep(NA, 8*length(body_stubs)), ncol=8)
rownames(extended.table.2) <- body_stubs
colnames(extended.table.2) <- c("SI.clusters", "SI.score", "SI.Laplace", "CH.clusters",
								"CH.score", "CH.Laplace", "DMM.clusters", "DMM.Laplace")

for(bs in body_stubs){
	dmm.fit <- read.table(file=paste("../v35/dmm_best/", bs, sausage, mix_suffix, sep=""), header=T)
	dmm.K <- which.min(dmm.fit$Laplace)
	dmm.Laplace <- min(dmm.fit$Laplace, na.rm=T)
	
	sw.Laplace <- read.table(file=paste(bs, sausage, sw_suffix, sep=""), header=T)$Laplace
	ch.Laplace <- read.table(file=paste(bs, sausage, ch_suffix, sep=""), header=T)$Laplace

	pam <- pam.summary[bs,]
	extended.table.2[bs,] <- c(pam$w_clusters, pam$width, sw.Laplace, pam$ch_clusters, pam$ch, ch.Laplace, dmm.K, dmm.Laplace)
}

write.table(file="extended_table_2.txt", extended.table.2, quote=F)


# ### Representing community types in ordinations
# We strongly resisted the reviewer's request to put in an ordination of the samples colored by the community types. Why? Well, because we know that many OTUs are responsible for dividing the samples amongst the community types and an ordination really only allows you to use a couple of variables. Futhermore, construction of an ordination requires filtering the data through a distance matrix, which is one of the problems with the PAM-based approach. Whatever, here we present code generating the ordinations for each bodysite with coloring for each community type. First we use mothur to calculate the JSD distances between samples. We do this by rarefying the samples to 1000 sequences 100 times. After all of the iterations we calculate the average JSD matrix:

# In[ ]:


get_ipython().run_cell_magic('bash', '', '\nfor SHARED in $(ls ../v35/*subsample.shared)\ndo\n    DIST=${SHARED//shared/jsd.1.lt.dist}\n    DIST=${DIST//..\\/v35\\//}\n\tmothur "#dist.shared(shared=$SHARED, calc=jsd, processors=8, outputdir=./); nmds(phylip=$DIST, maxdim=2)"\ndone\n')


# Now we build the NMDS ordination figures and output them as PDFs.

# In[ ]:


bodysites <- c("Antecubital_fossa", "Anterior_nares", "Buccal_mucosa", "Hard_palate", "Keratinized_gingiva",
               "Palatine_Tonsils", "Retroauricular_crease", "Saliva", "Stool", "Subgingival_plaque",
               "Supragingival_plaque", "Throat", "Tongue_dorsum", "Vagina")


for(bs in bodysites){
    axes <- read.table(file=paste(bs, ".tx.1.subsample.jsd.1.lt.nmds.axes", sep=""), header=T, row.names=1)
    axes <- axes[order(rownames(axes)),]
    
    stress <- read.table(file=paste(bs, ".tx.1.subsample.jsd.1.lt.nmds.stress", sep=""), header=T)
    iter <- which.min(stress$Stress)
    stress.score <- format(stress[iter, "Stress"], digits=2)
    r.squared <- format(stress[iter, "Rsq"], digits=2)
    
    pdf(file=paste(bs,".ordination.pdf", sep=""), width=8.5, height=11)
    
    par(mfrow=c(2,1))
    par(mar=c(1, 4, 3, 1))
    dmm <- read.table(file=paste("../v35/dmm_best/", bs, ".tx.1.subsample.1.dmm.mix.design", sep=""), row.names=1)
    rownames(dmm)==rownames(axes)

    ntypes <- length(levels(factor(dmm$V2)))
    clrs <- rainbow(ntypes)
    
    plot(axes, col=clrs[dmm$V2], pch=19, cex=0.75, xaxt="n", xlab="", ylab="NMDS Axis 2", ylim=c(-0.7,0.7), xlim=c(-0.7,0.7))
    text(x=-0.75, y=0.68, "DMM", cex=1.5, font=2, pos=4)
    legend("bottomleft", legend=paste("Community Type", LETTERS[1:ntypes]),
           pch=19, col=clrs, pt.cex=1.5)
    mtext(side=3, line=1.25, text=gsub("_", " ", bs), at=0.7, adj=1, cex=1.5, font=2)
    
    
    par(mar=c(4, 4, 0, 1))
    pam <- read.table(file=paste(bs, ".tx.1.subsample.pam.sw.enterotype", sep=""), row.names=1)
    rownames(pam)==rownames(axes)
    ntypes <- length(levels(factor(pam$V2)))
    clrs <- rainbow(ntypes)
    
    plot(axes, col=clrs[pam$V2], pch=19, cex=0.75, ylab="NMDS Axis 2", xlab="NMDS Axis 1", ylim=c(-0.7,0.7), xlim=c(-0.7,0.7))
    text(x=-0.75, y=0.68, "PAM (SI)", cex=1.5, font=2, pos=4)
    legend("bottomleft", legend=paste("Community Type", LETTERS[1:ntypes]), pch=19, col=rainbow(ntypes), pt.cex=1.5)

    mtext(side=1, line=2, at=-0.9, text=paste("Stress: ", stress.score, "\nRsquared: ", r.squared, sep=""), adj=0)
    dev.off()
}


# # Conclusions
# We hope you've found this notebook useful at providing greater details into the methods we used to characterize the full HMP 16S rRNA gene sequencing dataset. Here are the take home points from the study and our methods development / application:
# 
#  * Identifying community types is a useful way of categorizing community data at a bodysite and can be used as a tool to reduce the complexity of the data in order to associate these types with clinical metadata (e.g. whether one was breastfed, gender, level of education, etc.)
#  * Your mouth is connected to your rectum. You wouldn't believe the number of people (including one reviewer!) that were shocked to hear that the community type in your mouth is predictive of those in feces. Again, it is important to note that the bacterial taxa themselves are not necessarily the same, but the type of community you have in one site is predictive of others.
#  * Community types are not fixed and each type appears to have its own level of variation. Unfortunately, we're somewhat powerless to figure out why they change. Other studies suggest the same thing. For example [Wu et al. 2012]() found two community types (using PAM) since they were interested in diet and couldn't find a short term result concluded that it was based on long term diet changes. We just don't know.
#  * We appreciate that the concept of community types has been controversial. We have demonstrated that applying the DMM-based approach is more robust than the previously-applied PAM-based approaches, which do not seem robust. This is especially the case when there is only one community type since it will not be detected by the current PAM methods.  
# 
# 
# Some things for the future:  
# 
#  * If we can bin people into community types we can begin to associate community types with cases of disease or risk of disease. This has already been done in a number of studies (see the paper for references).  
#  * We need better/more temporal metadata that we can use to explain changes in community types  
#  * We suspect that as we get more and more subjects we will see more and more community types. The types represent clusters of points. We know that each individual is more like themselves than anyone else. Therefore, if we sample more points from each person, we would expect them to form their own type.  
#  
# 
# Finally, if you find this set of methods useful, please let us know (pschloss@umich.edu)!
# 
# 
# # Postscript
# The data analysis that was performed in the original study used a phylotype-based approach to characterize the structure of the human microbiome. We have also been interested in using the data to generate OTUs and mothur-based shared files. We did this as follows:

# In[ ]:


mothur "#cluster.split(fasta=data/raw/v35.unique.good.filter.unique.precluster.unique.pick.pick.fasta, name=data/raw/v35.unique.good.filter.unique.precluster.unique.pick.pick.names, taxonomy=data/raw/v35.unique.good.filter.unique.precluster.unique.pick.three.wang.pick.taxonomy, taxlevel=5, classic=T, processors=4, cluster=T, outputdir=data/mothur/);
        make.shared(list=data/mothur/v35.unique.good.filter.unique.precluster.unique.pick.pick.an.list, group=data/raw/v35.good.pick.pick.groups, label=0.03, outputdir=mothur);
        classify.otu(list=data/mothur/v35.unique.good.filter.unique.precluster.unique.pick.pick.an.list, group=data/raw/v35.good.pick.pick.groups, name=data/raw/v35.unique.good.filter.unique.precluster.unique.pick.pick.names, taxonomy=data/raw/v35.unique.good.filter.unique.precluster.unique.pick.three.wang.pick.taxonomy, label=0.03, outputdir=data/mothur);
        get.oturep(fasta=data/raw/v35.unique.good.filter.unique.precluster.unique.pick.pick.fasta, name=data/raw/v35.unique.good.filter.unique.precluster.unique.pick.pick.names, list=data/mothur/v35.unique.good.filter.unique.precluster.unique.pick.pick.an.list, label=0.03, method=abundance);"


# We then went back and ran these output files through the various scripts that we did above starting at the block staring "Recall above we gave the samples..."