bedtools
+TSCC + YOU!¶To bring it all together, you will combine everything you've learned so far about UNIX, bash shell, TSCC, git, and github to collaboratively write a submitter script to TSCC.
6_tf_binding_promoters.ipynb
(person1) and the the other to be responsible for exercises 4-6 (person2)biom262-hw1
and write a short README.md
filetf_binding.sh
to TSCC which has all the ##PBS
flags shown in the TSCC Quick Start Guide below, and contribute the lines of code that you are responsible for. Use git blame tf_binding.sh
to make sure person1 is to "blame" for exercises 1-3 and person2 is to "blame" for exercises 4-5. The "blame" of the remaining file doesn't matter.echo "Hello I am a message in standard out (stdout)"
and have person2 add the line echo "Hello I am a message in standard error (stderr) >&2"
(the >&2
outputs to "secondary" aka "error" output)add
, commit
, and push
their changes to the server. Are there merge conflicts? How do you solve them?.o#####
and .e#####
(or if you were fancy and redirected your stdout
and stderr
to something else then include those files)LICENSE
README.txt
tf_binding.sh
tf_binding.sh.o######
tf_binding.sh.e######
Run git blame
on tf_binding.sh
. You should get output that looks like this, which has commit ID in the first column, the name of the user, the timestamp, and the line number for each line. Make sure different people wrote their exercise sections, and the echo
stuff too.
53317857 (Jaclyn Einstein 2016-01-19 15:26:59 -0800 1) #!/bin/bash
53317857 (Jaclyn Einstein 2016-01-19 15:26:59 -0800 2) #PBS -q hotel
34b09512 (Jaclyn Einstein 2016-01-20 18:01:55 -0800 3) #PBS -l nodes=2:ppn=2
34b09512 (Jaclyn Einstein 2016-01-20 18:01:55 -0800 4) #PBS -l walltime=:00:20:00
53317857 (Jaclyn Einstein 2016-01-19 15:26:59 -0800 5) #PBS -N tf_binding.sh
53317857 (Jaclyn Einstein 2016-01-19 15:26:59 -0800 6)
34b09512 (Jaclyn Einstein 2016-01-20 18:01:55 -0800 7) cd ~/code/biom262-hw1/data
34b09512 (Jaclyn Einstein 2016-01-20 18:01:55 -0800 8) module load biotools
b63c5509 (Jaclyn Einstein 2016-01-18 16:37:52 -0800 9)
b63c5509 (Jaclyn Einstein 2016-01-18 16:37:52 -0800 10) #Exercise 1
b63c5509 (Jaclyn Einstein 2016-01-18 16:37:52 -0800 11) # Filter the tf.bed file for only the NFKB\n
b63c5509 (Jaclyn Einstein 2016-01-18 16:37:52 -0800 12) awk '{if($4=="NFKB") print}' tf.bed > tf.nfkb.bed
b63c5509 (Jaclyn Einstein 2016-01-18 16:37:52 -0800 13)
b63c5509 (Jaclyn Einstein 2016-01-18 16:37:52 -0800 14) #Exercise 2
b63c5509 (Jaclyn Einstein 2016-01-18 16:37:52 -0800 15) # Filter only the rows of the gtf file that contain the features of type "transcript"\n
b63c5509 (Jaclyn Einstein 2016-01-18 16:37:52 -0800 16) awk '{if($3=="transcript") print}' gencode.v19.annotation.chr22.gtf > gencode.v19.annotation.chr22.transcript.gtf
b63c5509 (Jaclyn Einstein 2016-01-18 16:37:52 -0800 17)
b63c5509 (Jaclyn Einstein 2016-01-18 16:37:52 -0800 18) #Exercise 3
b63c5509 (Jaclyn Einstein 2016-01-18 16:37:52 -0800 19) # Use bedtools to find promoters (2000 bases upstream of gene)\n
b63c5509 (Jaclyn Einstein 2016-01-18 16:37:52 -0800 20) bedtools flank -i gencode.v19.annotation.chr22.transcript.gtf -g hg19.genome -l 2000 -r 0 -s > gencode.v19.annotation.chr22.transcript.promoter.gtf
0c09c96b (Jaclyn Einstein 2016-01-18 17:47:54 -0800 21)
88ed4261 (Ben Rubin 2016-01-19 18:15:18 -0800 22) #Exercise 4
88ed4261 (Ben Rubin 2016-01-19 18:15:18 -0800 23) bedtools intersect -a gencode.v19.annotation.chr22.transcript.promoter.gtf -b tf.nfkb.bed >gencode.v19.annotation.chr22.transcript.promoter.nfkb.gtf
88ed4261 (Ben Rubin 2016-01-19 18:15:18 -0800 24)
88ed4261 (Ben Rubin 2016-01-19 18:15:18 -0800 25) #Exercise 5
88ed4261 (Ben Rubin 2016-01-19 18:15:18 -0800 26) bedtools getfasta -fi GRCh37.p13.chr22.fa -bed gencode.v19.annotation.chr22.transcript.promoter.nfkb.gtf -s -fo gencode.v19.annotation.chr22.transcript.promoter.nfkb.fasta
88ed4261 (Ben Rubin 2016-01-19 18:15:18 -0800 27)
34b09512 (Jaclyn Einstein 2016-01-20 18:01:55 -0800 28) echo "Hello I am a message in standard out (stout)"
34b09512 (Jaclyn Einstein 2016-01-20 18:01:55 -0800 29) echo "Hello I am a message in standard error (stderr)" >&2
*Note: You will need to add your partner as a collaborator to your repository so they can push to your repo*
Feel free to include any other hg19.genome
or bed
or gtf
files that you want to make your life easier.
Resources for TSCC:
>&2
garbage and what does it have to do with stdout
vs stderr
?¶To get the most out of this, follow along with the commands and type them out on the terminal.
So far, you have become quite familiar with stdout
and stderr
without even knowing it! Whenver you do something that outputs something to the terminal, that is stdout
. For example, let's take ls
. Change directories to your home directory, then type ls
.
[ucsd-train01@tscc-login2 ~]$ cd
[ucsd-train01@tscc-login2 ~]$ ls
Anaconda3-2.4.1-Linux-x86_64.sh bin code notebooks test_script.sh test_script.sh.e3962194 test_script.sh.o3962194
You can write this output to a file using the >
which outputs the stdout
to a file:
[ucsd-train01@tscc-login2 ~]$ ls > ls.txt
Notice that there was no output! Where did it go? Well, let's look at ls.txt
:
[ucsd-train01@tscc-login2 ~]$ cat ls.txt
Anaconda3-2.4.1-Linux-x86_64.sh
bin
code
ls.txt
notebooks
test_script.sh
test_script.sh.e3962194
test_script.sh.o3962194
This put the output of ls
into a file! It also added a line between entries instead of a space. This is a special feature of ls
where it detects that you're outputting to a file that you probably want to grep
so to make your life easier, it puts everything on one file.
stderr
?¶To learn about stderr
, let's first use a command that we know will fail. Let's try to ls
a directory that doesn't exist.
[ucsd-train01@tscc-login2 ~]$ ls nonexistent_folder
ls: cannot access nonexistent_folder: No such file or directory
Now let's save the stdout
as a file.
[ucsd-train01@tscc-login2 ~]$ ls nonexistent_folder > ls_nonexistent_folder.txt
ls: cannot access nonexistent_folder: No such file or directory
Wait, that still put stuff out on the terminal? Didn't we save this output? Let's take a look.
[ucsd-train01@tscc-login2 ~]$ cat ls_nonexistent_folder.txt
Hmm. This is empty. What's going on?
stdout
: The other white meat¶When you use >
, the technical jargon for what you are doing is "redirecting standard output to a file." Specifically, stdout
is considered the "first" output of a program and is given the number "1
". So you secretly specified the "1
" already! Try the ls
command again, but use 1>
to save the stdout
. You should get the same output as if you did it without the 1
(do it to convince yourself)
[ucsd-train01@tscc-login2 ~]$ ls 1> ls1.txt
[ucsd-train01@tscc-login2 ~]$ cat ls1.txt
Anaconda3-2.4.1-Linux-x86_64.sh
bin
code
ls1.txt
ls_nonexistent_folder.txt
ls.txt
notebooks
test_script.sh
test_script.sh.e3962194
test_script.sh.o3962194
Now, stderr
is officially the "second" output and is given the number "2
". Let's try saving the stderr
from our failing ls
command from before:
[ucsd-train01@tscc-login2 ~]$ ls nonexistent_folder 2> ls_nonexistent_folder.txt
Notice that now there was no output! Let's check out the file we created.
[ucsd-train01@tscc-login2 ~]$ cat ls_nonexistent_folder.txt
ls: cannot access nonexistent_folder: No such file or directory
Ah-ha! This is *exactly* the output we would have seen had we not "redirected the stderr" (i.e. used this command 2>
).
You may be wondering what is the point of this step is:
- Have person1 add the line:
echo "Hello I am a message in standard out (stdout)"
and have person2 add the lineecho "Hello I am a message in standard error (stderr) >&2
(the>&2
outputs to "secondary" aka "error" output)
The idea is that then you'll have different outputs in the files you specify by "-e
" and "-o
" files from the TSCC PBS Submitter script below. That literally means for one person to add the stdout line and the other person to add the stderr line (see the example script at the very bottom)
Now that you've practiced a bit with stderr
, Try doing this on the terminal:
[ucsd-train01@tscc-login2 ~]$ echo "Hello I am a message in standard error (stderr)" >&2 | cat > asdf
This should output:
Hello I am a message in standard error (stderr)
But what about this?
[ucsd-train01@tscc-login2 ~]$ echo "Hello I am a message in standard output (stdout)" | cat > asdf
Why does the above command have no output?
asdf
?¶What are the contents of asdf
? Why?
Read more about standard streams on Wikipedia, or on this awesome website which has a really great bash overview/tutorial that I still reference to this day. Plus this long article explains when to use stdout
vs stderr
very well.
What is PBS? PBS stands for "Public Broadcasting Service." Just kidding, it stands for Portable Batch System. It is one of many scheduling and resource management (resources = compute nodes attached to TSCC and their processors and memory) systems out there.
Below is an example submitter script that is in the file called basewise_conservation.sh
, and after that I'll explain line-by-line what everything is doing.
You submit the script below with:
qsub basewise_conservation.sh
#!/bin/bash
This line is specifying to use the bash
aka "Bourne-Again shell" ,vs the Bourne (sh
) shell or C-shell (csh
). The differences between the shells are rather esoteric and can get quite religous so I won't go into them. So far, we've been using bash
so stick to that.
#PBS -N basewise_conservation
This "-N
" flag is the name of the job you'll see when you do qstat
. I recommend doing qstat -u $USER
because then you'll see only YOUR jobs. In fact, I recommend adding this alias to your ~/.bashrc
:
alias qme="qstat -q $USER"
The double quotes are important because they tell the computer to "evaluate" the dollar sign variables inside. Check out the difference between double quotes here:
[ucsd-train01@tscc-login2 ~]$ echo "Who am I? I am $USER"
Who am I? I am ucsd-train01
And single quotes here:
[ucsd-train01@tscc-login2 ~]$ echo 'Who am I? I am $USER'
Who am I? I am $USER
#PBS -o basewise_conservation.sh.out
This "-o
" flag tells the PBS compute cluster to save the output from stdout
, which is relative to the directory this script was run from. i.e. if you change directories in the script itself to biom262/weeks/week01/data
, but you
submitted the script from your home directory ~
, then this file will be in
~/basewise_conservation.sh.out
not biom262/weeks/week01/data
#PBS -e basewise_conservation.sh.err
This "-e
" flag tells the supercompting cluster where to save the output from stderr
. The same folder location conventions are true with this as they are with the output in line 3 above.
#PBS -V
This "-V" argument ensures that you have access to all the programs in your PATH
that you did before you jumped into the TSCC compute node. It literally prepends your existing path, just like Anaconda did to make sure you first access the Anaconda Python and not any other Python.
#PBS -l walltime=24:00:00
This is the first argument you've seen with the "-l
" ("dash ell") flag. I don't know what "-l
" stands for but it's properties of the job that have to do with how many resources to specify (ie time and computers). This one specifies the amount of clock time this script can use. This is specifying 24 hours. Your script will probably only need 10 minutes, which you can specify with walltime=00:10:00
.
#PBS -l nodes=1:ppn=1
This is the second line with the "-l
" flag. This time, it's specifying the number of nodes (i.e. literal computers) and the number of processors to use on those nodes. For your code, you will only need one node and one processor.
The maximum number of processors is 16
. In all of the submission scripts for this class, you'll need a maximum of one node. If you ever need more, always increase the processors, then the nodes. The example from TSCC is really bad:
nodes=10:ppn=2
Using the room/chair analogy where a node is a room and processor is a chair, that's like asking for 10 rooms, and for each of them have two empty chairs! That's a really strange and weird request so don't have these lopsided requests where the number of nodes is far greater than the number of processors. The main thing to remember is that there's a maximum of 16 chairs per room, and to increase the number of chairs (processors) before the number of rooms (nodes).
#PBS -A yeo-group
This "-A
" flag specifies the account that you're charging to submit to. Ignore this line. We'll use the hotel
queue which is free to use.
#PBS -q home-yeo
This "-q
" flag specifies the "queue" that you'll submit to. TSCC describes all the different queues, but for this class we'll stick to hotel
because anyone can submit to that one -- you don't have to be a "node contributor" (i.e. your lab paid for compute nodes on the cluster) to use it. That said, hotel
can get pretty clogged up.
If you try to submit to a queue you don't have access to, you'll get this error:
qsub: submit error (Unauthorized Request MSG=group ACL is not satisfied: user ucsd-train01@tscc-login2.local, queue home-yeo)
This line is empty
# Go to the directory from which the script was called
This is a comment line, which starts with a hash "#
" and can say anything afterwards. It's mainly for the reader of the script (me) to look at later and remember what happened.
cd $PBS_O_WORKDIR
This line changes directories to the folder where you ran the "qsub
" command. You can see more of the PBS environment variables here.
python /home/obotvinnik/ipython_notebook/singlecell/manuscript/0._Data_collection/basewise_conseration.py
This is the meat of the script! (albeit misspelled) This is the actual function that runs and reads files and does things to data to calculate basewise conservation.
Let's look at another example that looks more like what you would submit for your homework. Let's say it's in a file called ... oh, I don't know, how about tf_binding.sh
.
Note that you can submit this with:
qsub tf_binding.sh
OR, if you had a file called "grep_tf.sh
" that only had the line "grep NFKB tf.bed > tf.nkfb.bed
" in it, you could have written out all this stuff:
qsub -N tf_binding -o tf_binding.sh.out -e tf_binding.sh.err -V -l walltime=00:10:00 -l nodes=1:ppn=1 -q hotel grep_tf.sh
Which specifies everything we did in the script in a single line. I don't like this method because it's harder to read and is not as reproducible. So the #PBS
syntax is a shortcut for writing all that other stuff on the command line, and it's easier to keep track of than the different commands you ran.
PHEW! that was a lot!