Notebook

Git/GitHub + `bedtools` +TSCC + YOU!¶

To bring it all together, you will combine everything you've learned so far about UNIX, bash shell, TSCC, git, and github to collaboratively write a submitter script to TSCC.

Identify your randomly assigned partner that was emailed out.
Pick one person to be reponsible for exercises 1-3 from 6_tf_binding_promoters.ipynb (person1) and the the other to be responsible for exercises 4-6 (person2)
Have person1 create a GitHub repo called biom262-hw1 and write a short README.md file
Have person2 add a LICENSE file copied from the "Text of the UC Copyright Notice" section here
Collaboratively work on a submitter script called tf_binding.sh to TSCC which has all the ##PBS flags shown in the TSCC Quick Start Guide below, and contribute the lines of code that you are responsible for. Use git blame tf_binding.sh to make sure person1 is to "blame" for exercises 1-3 and person2 is to "blame" for exercises 4-5. The "blame" of the remaining file doesn't matter.
Have person1 add the line: echo "Hello I am a message in standard out (stdout)" and have person2 add the line echo "Hello I am a message in standard error (stderr) >&2" (the >&2 outputs to "secondary" aka "error" output)
Have both people add, commit, and push their changes to the server. Are there merge conflicts? How do you solve them?
Submit your script to TSCC!
Wait for it to run.
Check the output. Is it correct? Were the correct files added?
Add the resulting .o##### and .e##### (or if you were fancy and redirected your stdout and stderr to something else then include those files)
Your final repository should have at least five files:

LICENSE
README.txt
tf_binding.sh
tf_binding.sh.o######
tf_binding.sh.e######

Run git blame on tf_binding.sh. You should get output that looks like this, which has commit ID in the first column, the name of the user, the timestamp, and the line number for each line. Make sure different people wrote their exercise sections, and the echo stuff too.

53317857 (Jaclyn Einstein 2016-01-19 15:26:59 -0800  1) #!/bin/bash
53317857 (Jaclyn Einstein 2016-01-19 15:26:59 -0800  2) #PBS -q hotel
34b09512 (Jaclyn Einstein 2016-01-20 18:01:55 -0800  3) #PBS -l nodes=2:ppn=2
34b09512 (Jaclyn Einstein 2016-01-20 18:01:55 -0800  4) #PBS -l walltime=:00:20:00
53317857 (Jaclyn Einstein 2016-01-19 15:26:59 -0800  5) #PBS -N tf_binding.sh
53317857 (Jaclyn Einstein 2016-01-19 15:26:59 -0800  6) 
34b09512 (Jaclyn Einstein 2016-01-20 18:01:55 -0800  7) cd ~/code/biom262-hw1/data
34b09512 (Jaclyn Einstein 2016-01-20 18:01:55 -0800  8) module load biotools
b63c5509 (Jaclyn Einstein 2016-01-18 16:37:52 -0800  9) 
b63c5509 (Jaclyn Einstein 2016-01-18 16:37:52 -0800 10) #Exercise 1
b63c5509 (Jaclyn Einstein 2016-01-18 16:37:52 -0800 11) # Filter the tf.bed file for only the NFKB\n
b63c5509 (Jaclyn Einstein 2016-01-18 16:37:52 -0800 12) awk '{if($4=="NFKB") print}' tf.bed > tf.nfkb.bed
b63c5509 (Jaclyn Einstein 2016-01-18 16:37:52 -0800 13) 
b63c5509 (Jaclyn Einstein 2016-01-18 16:37:52 -0800 14) #Exercise 2
b63c5509 (Jaclyn Einstein 2016-01-18 16:37:52 -0800 15) # Filter only the rows of the gtf file that contain the features of type "transcript"\n
b63c5509 (Jaclyn Einstein 2016-01-18 16:37:52 -0800 16) awk '{if($3=="transcript") print}' gencode.v19.annotation.chr22.gtf > gencode.v19.annotation.chr22.transcript.gtf
b63c5509 (Jaclyn Einstein 2016-01-18 16:37:52 -0800 17) 
b63c5509 (Jaclyn Einstein 2016-01-18 16:37:52 -0800 18) #Exercise 3
b63c5509 (Jaclyn Einstein 2016-01-18 16:37:52 -0800 19) # Use bedtools to find promoters (2000 bases upstream of gene)\n
b63c5509 (Jaclyn Einstein 2016-01-18 16:37:52 -0800 20) bedtools flank -i gencode.v19.annotation.chr22.transcript.gtf -g hg19.genome -l 2000 -r 0 -s > gencode.v19.annotation.chr22.transcript.promoter.gtf
0c09c96b (Jaclyn Einstein 2016-01-18 17:47:54 -0800 21) 
88ed4261 (Ben Rubin       2016-01-19 18:15:18 -0800 22) #Exercise 4
88ed4261 (Ben Rubin       2016-01-19 18:15:18 -0800 23) bedtools intersect -a gencode.v19.annotation.chr22.transcript.promoter.gtf -b tf.nfkb.bed >gencode.v19.annotation.chr22.transcript.promoter.nfkb.gtf
88ed4261 (Ben Rubin       2016-01-19 18:15:18 -0800 24) 
88ed4261 (Ben Rubin       2016-01-19 18:15:18 -0800 25) #Exercise 5
88ed4261 (Ben Rubin       2016-01-19 18:15:18 -0800 26) bedtools getfasta -fi GRCh37.p13.chr22.fa -bed gencode.v19.annotation.chr22.transcript.promoter.nfkb.gtf -s -fo gencode.v19.annotation.chr22.transcript.promoter.nfkb.fasta
88ed4261 (Ben Rubin       2016-01-19 18:15:18 -0800 27) 
34b09512 (Jaclyn Einstein 2016-01-20 18:01:55 -0800 28) echo "Hello I am a message in standard out (stout)"
34b09512 (Jaclyn Einstein 2016-01-20 18:01:55 -0800 29) echo "Hello I am a message in standard error (stderr)" >&2

*Note: You will need to add your partner as a collaborator to your repository so they can push to your repo*

Feel free to include any other hg19.genome or bed or gtf files that you want to make your life easier.

Resources for TSCC:

What is this `>&2` garbage and what does it have to do with `stdout` vs `stderr`?¶

To get the most out of this, follow along with the commands and type them out on the terminal.

So far, you have become quite familiar with stdout and stderr without even knowing it! Whenver you do something that outputs something to the terminal, that is stdout. For example, let's take ls. Change directories to your home directory, then type ls.

[ucsd-train01@tscc-login2 ~]$ cd
[ucsd-train01@tscc-login2 ~]$ ls
Anaconda3-2.4.1-Linux-x86_64.sh  bin  code  notebooks  test_script.sh  test_script.sh.e3962194  test_script.sh.o3962194

You can write this output to a file using the > which outputs the stdout to a file:

[ucsd-train01@tscc-login2 ~]$ ls > ls.txt

Notice that there was no output! Where did it go? Well, let's look at ls.txt:

[ucsd-train01@tscc-login2 ~]$ cat ls.txt 
Anaconda3-2.4.1-Linux-x86_64.sh
bin
code
ls.txt
notebooks
test_script.sh
test_script.sh.e3962194
test_script.sh.o3962194

This put the output of ls into a file! It also added a line between entries instead of a space. This is a special feature of ls where it detects that you're outputting to a file that you probably want to grep so to make your life easier, it puts everything on one file.

So why do we need `stderr` ?¶

To learn about stderr, let's first use a command that we know will fail. Let's try to ls a directory that doesn't exist.

[ucsd-train01@tscc-login2 ~]$ ls nonexistent_folder
ls: cannot access nonexistent_folder: No such file or directory

Now let's save the stdout as a file.

[ucsd-train01@tscc-login2 ~]$ ls nonexistent_folder > ls_nonexistent_folder.txt
ls: cannot access nonexistent_folder: No such file or directory

Wait, that still put stuff out on the terminal? Didn't we save this output? Let's take a look.

[ucsd-train01@tscc-login2 ~]$ cat ls_nonexistent_folder.txt

Hmm. This is empty. What's going on?

`stdout`: The other white meat¶

When you use >, the technical jargon for what you are doing is "redirecting standard output to a file." Specifically, stdout is considered the "first" output of a program and is given the number "1". So you secretly specified the "1" already! Try the ls command again, but use 1> to save the stdout. You should get the same output as if you did it without the 1 (do it to convince yourself)

[ucsd-train01@tscc-login2 ~]$ ls 1> ls1.txt
[ucsd-train01@tscc-login2 ~]$ cat ls1.txt
Anaconda3-2.4.1-Linux-x86_64.sh
bin
code
ls1.txt
ls_nonexistent_folder.txt
ls.txt
notebooks
test_script.sh
test_script.sh.e3962194
test_script.sh.o3962194

Now, stderr is officially the "second" output and is given the number "2". Let's try saving the stderr from our failing ls command from before:

[ucsd-train01@tscc-login2 ~]$ ls nonexistent_folder 2> ls_nonexistent_folder.txt

Notice that now there was no output! Let's check out the file we created.

[ucsd-train01@tscc-login2 ~]$ cat ls_nonexistent_folder.txt 
ls: cannot access nonexistent_folder: No such file or directory

Ah-ha! This is *exactly* the output we would have seen had we not "redirected the stderr" (i.e. used this command 2>).

Why are we doing #6?¶

You may be wondering what is the point of this step is:

Have person1 add the line: echo "Hello I am a message in standard out (stdout)" and have person2 add the line echo "Hello I am a message in standard error (stderr) >&2 (the >&2 outputs to "secondary" aka "error" output)

The idea is that then you'll have different outputs in the files you specify by "-e" and "-o" files from the TSCC PBS Submitter script below. That literally means for one person to add the stdout line and the other person to add the stderr line (see the example script at the very bottom)

Now that you've practiced a bit with stderr, Try doing this on the terminal:

[ucsd-train01@tscc-login2 ~]$ echo "Hello I am a message in standard error (stderr)" >&2 | cat > asdf

This should output:

Hello I am a message in standard error (stderr)

But what about this?

[ucsd-train01@tscc-login2 ~]$ echo "Hello I am a message in standard output (stdout)" | cat > asdf

Why does the above command have no output?

Exercise: What is in `asdf`?¶

What are the contents of asdf? Why?

Additional reading, if you feel like it¶

Read more about standard streams on Wikipedia, or on this awesome website which has a really great bash overview/tutorial that I still reference to this day. Plus this long article explains when to use stdout vs stderr very well.

Example scripts¶

What is PBS? PBS stands for "Public Broadcasting Service." Just kidding, it stands for Portable Batch System. It is one of many scheduling and resource management (resources = compute nodes attached to TSCC and their processors and memory) systems out there.

Simple submitter script example - one processor, one node¶

Below is an example submitter script that is in the file called basewise_conservation.sh, and after that I'll explain line-by-line what everything is doing.

You submit the script below with:

qsub basewise_conservation.sh

#!/bin/bash #PBS -N basewise_conservation #PBS -o basewise_conservation.sh.out #PBS -e basewise_conservation.sh.err #PBS -V #PBS -l walltime=24:00:00 #PBS -l nodes=1:ppn=1 #PBS -A yeo-group #PBS -q home # Go to the directory from which the script was called cd $PBS_O_WORKDIR python /home/obotvinnik/ipython_notebook/singlecell/manuscript/0._Data_collection/basewise_conseration.py

Line 1¶

#!/bin/bash

This line is specifying to use the bash aka "Bourne-Again shell" ,vs the Bourne (sh) shell or C-shell (csh). The differences between the shells are rather esoteric and can get quite religous so I won't go into them. So far, we've been using bash so stick to that.

Line 2¶

#PBS -N basewise_conservation

This "-N" flag is the name of the job you'll see when you do qstat. I recommend doing qstat -u $USER because then you'll see only YOUR jobs. In fact, I recommend adding this alias to your ~/.bashrc:

alias qme="qstat -q $USER"

The double quotes are important because they tell the computer to "evaluate" the dollar sign variables inside. Check out the difference between double quotes here:

[ucsd-train01@tscc-login2 ~]$ echo "Who am I? I am $USER"
Who am I? I am ucsd-train01

And single quotes here:

[ucsd-train01@tscc-login2 ~]$ echo 'Who am I? I am $USER' 
Who am I? I am $USER

Line 3¶

#PBS -o basewise_conservation.sh.out

This "-o" flag tells the PBS compute cluster to save the output from stdout, which is relative to the directory this script was run from. i.e. if you change directories in the script itself to biom262/weeks/week01/data, but you submitted the script from your home directory ~, then this file will be in ~/basewise_conservation.sh.out not biom262/weeks/week01/data

Line 4¶

#PBS -e basewise_conservation.sh.err

This "-e" flag tells the supercompting cluster where to save the output from stderr. The same folder location conventions are true with this as they are with the output in line 3 above.

Line 5¶

#PBS -V

This "-V" argument ensures that you have access to all the programs in your PATH that you did before you jumped into the TSCC compute node. It literally prepends your existing path, just like Anaconda did to make sure you first access the Anaconda Python and not any other Python.

Line 6¶

#PBS -l walltime=24:00:00

This is the first argument you've seen with the "-l" ("dash ell") flag. I don't know what "-l" stands for but it's properties of the job that have to do with how many resources to specify (ie time and computers). This one specifies the amount of clock time this script can use. This is specifying 24 hours. Your script will probably only need 10 minutes, which you can specify with walltime=00:10:00.

Line 7¶

#PBS -l nodes=1:ppn=1

This is the second line with the "-l" flag. This time, it's specifying the number of nodes (i.e. literal computers) and the number of processors to use on those nodes. For your code, you will only need one node and one processor.

The maximum number of processors is 16. In all of the submission scripts for this class, you'll need a maximum of one node. If you ever need more, always increase the processors, then the nodes. The example from TSCC is really bad:

nodes=10:ppn=2

Using the room/chair analogy where a node is a room and processor is a chair, that's like asking for 10 rooms, and for each of them have two empty chairs! That's a really strange and weird request so don't have these lopsided requests where the number of nodes is far greater than the number of processors. The main thing to remember is that there's a maximum of 16 chairs per room, and to increase the number of chairs (processors) before the number of rooms (nodes).

Line 8¶

#PBS -A yeo-group

This "-A" flag specifies the account that you're charging to submit to. Ignore this line. We'll use the hotel queue which is free to use.

Line 9¶

#PBS -q home-yeo

This "-q" flag specifies the "queue" that you'll submit to. TSCC describes all the different queues, but for this class we'll stick to hotel because anyone can submit to that one -- you don't have to be a "node contributor" (i.e. your lab paid for compute nodes on the cluster) to use it. That said, hotel can get pretty clogged up.

If you try to submit to a queue you don't have access to, you'll get this error:

qsub: submit error (Unauthorized Request  MSG=group ACL is not satisfied: user ucsd-train01@tscc-login2.local, queue home-yeo)

Line 10¶

This line is empty

Line 11¶

# Go to the directory from which the script was called

This is a comment line, which starts with a hash "#" and can say anything afterwards. It's mainly for the reader of the script (me) to look at later and remember what happened.

Line 12¶

cd $PBS_O_WORKDIR

This line changes directories to the folder where you ran the "qsub" command. You can see more of the PBS environment variables here.

Line 13¶

python /home/obotvinnik/ipython_notebook/singlecell/manuscript/0._Data_collection/basewise_conseration.py

This is the meat of the script! (albeit misspelled) This is the actual function that runs and reads files and does things to data to calculate basewise conservation.

Another example¶

Let's look at another example that looks more like what you would submit for your homework. Let's say it's in a file called ... oh, I don't know, how about tf_binding.sh.

#!/bin/bash #PBS -N tf_binding #PBS -o tf_binding.sh.out #PBS -e tf_binding.sh.err #PBS -V #PBS -l walltime=00:10:00 #PBS -l nodes=1:ppn=1 #PBS -q hotel grep NFKB tf.bed > tf.nkfb.bed echo "Hello I am a message in standard out (stdout)" echo "Hello I am a message in standard error (stderr)" >&2

Note that you can submit this with:

qsub tf_binding.sh

OR, if you had a file called "grep_tf.sh" that only had the line "grep NFKB tf.bed > tf.nkfb.bed" in it, you could have written out all this stuff:

qsub -N tf_binding -o tf_binding.sh.out -e tf_binding.sh.err -V -l walltime=00:10:00 -l nodes=1:ppn=1 -q hotel grep_tf.sh

Which specifies everything we did in the script in a single line. I don't like this method because it's harder to read and is not as reproducible. So the #PBS syntax is a shortcut for writing all that other stuff on the command line, and it's easier to keep track of than the different commands you ran.

PHEW! that was a lot!

Git/GitHub + bedtools +TSCC + YOU!¶

What is this >&2 garbage and what does it have to do with stdout vs stderr?¶

So why do we need stderr ?¶

stdout: The other white meat¶