#!/usr/bin/env python
# coding: utf-8

# ###Quality trim all fastq.gz files using [Trimmomatic (v0.30)](http://www.usadellab.org/cms/?page=trimmomatic)

# ####Code explanation of the for loop below:
# 1. ```%%bash``` specifies that this Jupyter cell is run by the shell.
# 2. ```for file in /Volumes/nightingales/C_gigas/2212_lane2_[^N]*``` starts a for loop over every file whose name begins with ```2212_lane2_``` and whose next character is not the letter "N".
# 3. ```do``` marks the start of the commands that are run for each file.
# 4. ```newname=${file##*/}``` takes the value of the ```$file``` variable (the full path of the current file matched by ```/Volumes/nightingales/C_gigas/2212_lane2_[^N]*```) and removes the longest match of the pattern ```*/``` from the beginning of that value (```##``` is the bash parameter expansion operator that removes the longest matching prefix). The result, which is just the file name without the full path, is stored in the ```newname``` variable.
# 5. This line runs Trimmomatic with the following arguments, in order:
#     1. single end reads (```SE```)
#     2. number of threads (```-threads 16```)
#     3. type of quality score (```-phred33```)
#     4. input file location (```"$file"```)
#     5. output file name/location (```/Volumes/Data/Sam/scratch/20140414_trimmed_$newname```)
#     6. single end Illumina TruSeq adaptor trimming, allowing 2 seed mismatches, with a palindrome clip threshold of 30 and a simple clip threshold of 10 (```ILLUMINACLIP:/usr/local/bioinformatics/Trimmomatic-0.30/adapters/TruSeq3-SE.fa:2:30:10```)
#     7. cut bases from the beginning of the read while their quality is below 3 (```LEADING:3```)
#     8. cut bases from the end of the read while their quality is below 3 (```TRAILING:3```)
#     9. scan the read with a 4 base window and cut once the average quality within the window falls below 15 (```SLIDINGWINDOW:4:15```)
# 6. ```done``` closes the for loop.
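
# The path handling described in steps 2 and 4 can also be sketched in Python. This is only an illustration of the same logic, not part of the pipeline: the helper names ```trimmed_output_path``` and ```matches_lane2_glob``` and the example file name are hypothetical. One difference worth knowing is that Python's ```fnmatch``` writes a negated character class as ```[!N]```, whereas bash also accepts ```[^N]```.

```python
import fnmatch
import posixpath


def trimmed_output_path(input_path, out_dir, prefix):
    """Mirror newname=${file##*/}: keep only the text after the last '/'."""
    newname = posixpath.basename(input_path)
    return posixpath.join(out_dir, prefix + newname)


def matches_lane2_glob(filename):
    """Mirror the shell glob 2212_lane2_[^N]* on a bare file name."""
    # fnmatch uses [!N] for "any character except N".
    return fnmatch.fnmatchcase(filename, "2212_lane2_[!N]*")


# Hypothetical input file, used only to show the transformation.
example = "/Volumes/nightingales/C_gigas/2212_lane2_GCCAAT_L002_R1_001.fastq.gz"
print(trimmed_output_path(example, "/Volumes/Data/Sam/scratch", "20140414_trimmed_"))
```

# The printed path is the output location that the Trimmomatic command in the loop would receive for that input file.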

# In[14]:

# Note: this cell writes output files prefixed 20140414_trimmed_, whereas every later cell reads files prefixed 20150414_trimmed_; the prefix used here looks like a typo.
get_ipython().run_cell_magic('bash', '', 'for file in /Volumes/nightingales/C_gigas/2212_lane2_[^N]*\ndo\nnewname=${file##*/}\njava -jar /usr/local/bioinformatics/Trimmomatic-0.30/trimmomatic-0.30.jar SE -threads 16 -phred33 "$file" /Volumes/Data/Sam/scratch/20140414_trimmed_$newname ILLUMINACLIP:/usr/local/bioinformatics/Trimmomatic-0.30/adapters/TruSeq3-SE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15;\ndone\n')

# ###Run FastQC on all trimmed files using [FastQC (v0.11.2)](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)

# In[16]:

get_ipython().run_cell_magic('bash', '', 'for file in /Volumes/Data/Sam/scratch/20150414_trimmed_2212*; do fastqc "$file" --outdir=/Volumes/Eagle/Arabidopsis/; done\n')

# ###Unzip all FastQC .zip files

# In[17]:

get_ipython().run_line_magic('cd', '/Volumes/Eagle/Arabidopsis/')

# In[19]:

get_ipython().run_cell_magic('bash', '', 'for file in 20150414_trimmed_2212*.zip; do unzip "$file"; done\n')

# In[20]:

get_ipython().run_line_magic('ls', '20150414_*fastqc')

# ###Concatenate two groups of sequences into a single file per group

# ####400ppm (control) sequences - Index GCCAAT

# In[7]:

get_ipython().run_line_magic('cd', '/Volumes/Data/Sam/scratch/')

# In[8]:

get_ipython().run_cell_magic('bash', '', '#gunzips all matching files in folder and appends the data to a single file:\n#20150414_trimmed_2212_lane2_400ppm_GCCAAT.fastq\nfor file in 20150414_trimmed_2212_lane2_G*\ndo\ngunzip -c "$file" >> 20150414_trimmed_2212_lane2_400ppm_GCCAAT.fastq\ndone\n')

# In[9]:

get_ipython().run_cell_magic('bash', '', '#Gzip file\ngzip 20150414_trimmed_2212_lane2_400ppm_GCCAAT.fastq\n')

# ####1000ppm (acidification) sequences - Index CTTGTA

# In[10]:

get_ipython().run_cell_magic('bash', '', '#gunzips all matching files in folder and appends the data to a single file:\n#20150414_trimmed_2212_lane2_1000ppm_CTTGTA.fastq\nfor file in 20150414_trimmed_2212_lane2_C*\ndo\ngunzip -c "$file" >> 20150414_trimmed_2212_lane2_1000ppm_CTTGTA.fastq\ndone\n')

# In[11]:

get_ipython().run_cell_magic('bash', '', '#Gzip file\ngzip 20150414_trimmed_2212_lane2_1000ppm_CTTGTA.fastq\n')

# ###Copy files to Eagle for web-based access

# In[12]:

get_ipython().run_cell_magic('bash', '', 'for file in 2015*e2_[14]*; do cp "$file" /Volumes/Eagle/Arabidopsis/; done\n')
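
# The gunzip/append/gzip steps used for each index could also be written as a single pass in Python, together with a quick check that no reads were lost. This is a minimal sketch rather than what was run here: the function names ```concat_gzipped``` and ```count_fastq_records``` are hypothetical, and the record count assumes the usual four lines per FASTQ record.

```python
import gzip
import shutil


def concat_gzipped(input_paths, output_path):
    """Decompress each input in order and write the data to one gzipped file."""
    with gzip.open(output_path, "wb") as out:
        for path in input_paths:
            with gzip.open(path, "rb") as src:
                # Stream in chunks so large FASTQ files are never held in memory.
                shutil.copyfileobj(src, out)


def count_fastq_records(path):
    """Count records in a gzipped FASTQ file, assuming four lines per record."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for _ in handle) // 4
```

# After concatenating, the record count of the combined file should equal the sum of the record counts of the input files; a mismatch would suggest a truncated or doubly appended input.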