This is an example from the Galaxy project, run in an IPython notebook. https://main.g2.bx.psu.edu/u/aun1/p/galaxy101
==========================
The first thing we will do is obtain data from UCSC:
from IPython.core.display import HTML
HTML("<iframe src='http://genome.ucsc.edu/cgi-bin/hgTables?GALAXY_URL=http%3A//galaxyproject.cloudapp.net%3A8080/tool_runner&tool_id=ucsc_table_direct1&hgta_compressType=none&sendToGalaxy=1&hgta_outputType=bed' width=880 height=600>")
Make sure that your settings are exactly the same as shown on the screen (in particular, position should be set to "chr22", output format should be set to "BED - browser extensible data", and "Galaxy" should be checked for the Send output to option). Click get output and you will see the next screen:
Here make sure Create one BED record per is set to "Coding Exons" and click Send Query to Galaxy. After this you will see your first History Item in Galaxy's right pane. It will go through the gray (preparing) and yellow (running) states before turning green:
The first step downloads the exons to a local directory. The file path will be used as input to the next step of the pipeline. In this example, first_query points to the location of the exons.
first_query = "/home/azureuser/galaxy-dist/database/files/000/dataset_37.dat"
Now it is time to obtain the SNP data. This is done almost exactly the same way. The first thing we will do is again click on "Get Data -> UCSC Main":
from IPython.core.display import HTML
HTML("<iframe src='http://genome.ucsc.edu/cgi-bin/hgTables?GALAXY_URL=http%3A//galaxyproject.cloudapp.net%3A8080/tool_runner&tool_id=ucsc_table_direct1&hgta_compressType=none&sendToGalaxy=1&hgta_outputType=bed' width=880 height=600>")
where you need to make sure that Whole Gene is selected ("Whole Gene" here really means "Whole Feature") and click Send Query to Galaxy. You will get your second item in the history:
Now we will rename the two history items to "Exons" and "SNPs" by clicking on the Pencil icon adjacent to each item. Also we will rename history to "Galaxy 101" (or whatever you want) by clicking on "Unnamed history" so everything looks like this:
The second step likewise downloads the SNPs to a local directory; second_query holds the path to the data.
second_query = "/home/azureuser/galaxy-dist/database/files/000/dataset_38.dat"
================================================
Let's remind ourselves that our objective was to find which exon contains the most SNPs. The first step in answering this question will be joining exons with SNPs (a fancy way of saying: print side by side the exons and SNPs that overlap). This is done using the "Operate on Genomic Intervals -> Join" tool:
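Before running the real tool, it may help to see what an overlap join does on toy data. The sketch below is illustrative only: the intervals and names are made up, and the real join in bx-python uses indexed interval structures rather than a nested loop.

```python
# Toy illustration of an interval join: pair up records from two
# BED-like lists whenever their [start, end) ranges overlap on the
# same chromosome. All intervals here are invented for the example.
exons = [("chr22", 100, 200, "exon_A"), ("chr22", 300, 400, "exon_B")]
snps = [("chr22", 150, 151, "rs1"), ("chr22", 350, 351, "rs2"),
        ("chr22", 500, 501, "rs3")]

def join_overlaps(a_list, b_list, mincols=1):
    """Naive O(n*m) join; fine for a sketch, too slow for real data."""
    for a in a_list:
        for b in b_list:
            same_chrom = a[0] == b[0]
            overlap = min(a[2], b[2]) - max(a[1], b[1])
            if same_chrom and overlap >= mincols:
                yield a + b  # exon fields followed by SNP fields

print(list(join_overlaps(exons, snps)))
```

Each output row is simply an exon record with an overlapping SNP record appended, which is exactly the side-by-side layout the Join tool produces.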
Instead of selecting the input values on the web, Python variables are used to point to the downloaded exons and SNPs.
first_query = "/home/azureuser/galaxy-dist/database/files/000/dataset_37.dat"
second_query = "/home/azureuser/galaxy-dist/database/files/000/dataset_38.dat"
output = "/home/azureuser/galaxy-dist/database/files/000/dataset_39.dat"
Next, the Galaxy libraries are added to the Python module search path:
import os, sys, logging
galaxy_dir = "/home/azureuser/galaxy-dist/"
new_path = [ os.path.join( galaxy_dir, "lib" ) ]
new_path.extend( sys.path[1:] ) # remove scripts/ from the path
sys.path = new_path
The next part is the Python code for joining exons with SNPs. The original script is gops_join.py under galaxy-dist/tools/new_operations.
#!/usr/bin/env python
"""
Join two sets of intervals using their overlap as the key.

usage: %prog bed_file_1 bed_file_2 out_file
    -1, --cols1=N,N,N,N: Columns for start, end, strand in first file
    -2, --cols2=N,N,N,N: Columns for start, end, strand in second file
    -m, --mincols=N: Require this much overlap (default 1bp)
    -f, --fill=N: none, right, left, both
"""
from galaxy import eggs
import pkg_resources
pkg_resources.require( "bx-python" )

import sys, traceback, fileinput
from warnings import warn
from bx.intervals import *
from bx.intervals.io import *
from bx.intervals.operations.join import *
from bx.cookbook import doc_optparse
from galaxy.tools.util.galaxyops import *

assert sys.version_info[:2] >= ( 2, 4 )

mincols = 1
upstream_pad = 0
downstream_pad = 0
leftfill = False
rightfill = False

#options, args = doc_optparse.parse( __doc__ )
try:
    #chr_col_1, start_col_1, end_col_1, strand_col_1 = parse_cols_arg( options.cols1 )
    #chr_col_2, start_col_2, end_col_2, strand_col_2 = parse_cols_arg( options.cols2 )
    #if options.mincols: mincols = int( options.mincols )
    #if options.fill:
    #    if options.fill == "both":
    #        rightfill = leftfill = True
    #    else:
    #        rightfill = options.fill == "right"
    #        leftfill = options.fill == "left"
    # BED columns (chrom, start, end, strand) and the notebook's file paths,
    # hard-coded in place of the command-line options above:
    chr_col_1, start_col_1, end_col_1, strand_col_1 = 0, 1, 2, 5
    chr_col_2, start_col_2, end_col_2, strand_col_2 = 0, 1, 2, 5
    mincols = 1
    args = [ first_query, second_query, output ]
    in_fname, in2_fname, out_fname = args
except:
    doc_optparse.exception()

g1 = NiceReaderWrapper( fileinput.FileInput( in_fname ),
                        chrom_col=chr_col_1,
                        start_col=start_col_1,
                        end_col=end_col_1,
                        strand_col=strand_col_1,
                        fix_strand=True )
g2 = NiceReaderWrapper( fileinput.FileInput( in2_fname ),
                        chrom_col=chr_col_2,
                        start_col=start_col_2,
                        end_col=end_col_2,
                        strand_col=strand_col_2,
                        fix_strand=True )

out_file = open( out_fname, "w" )

try:
    for outfields in join( g1, g2, mincols=mincols, rightfill=rightfill, leftfill=leftfill ):
        if type( outfields ) is list:
            out_file.write( "%s\n" % "\t".join( outfields ) )
        else:
            out_file.write( "%s\n" % outfields )
except ParseError, exc:
    out_file.close()
    fail( "Invalid file format: %s" % str( exc ) )
except MemoryError:
    out_file.close()
    fail( "Input datasets were too large to complete the join operation." )

out_file.close()

if g1.skipped > 0:
    print skipped( g1, filedesc=" of 1st dataset" )
if g2.skipped > 0:
    print skipped( g2, filedesc=" of 2nd dataset" )
!head $output
chr22 16258185 16258303 uc002zlh.1_cds_1_0_chr22_16258186_r 0 - chr22 16258278 16258279 rs2845178 0 +
chr22 16266928 16267095 uc002zlh.1_cds_2_0_chr22_16266929_r 0 - chr22 16267031 16267032 rs7292200 0 +
chr22 16266928 16267095 uc002zlh.1_cds_2_0_chr22_16266929_r 0 - chr22 16267011 16267012 rs7290262 0 +
chr22 16266928 16267095 uc002zlh.1_cds_2_0_chr22_16266929_r 0 - chr22 16266963 16266964 rs10154680 0 +
chr22 16266928 16267095 uc002zlh.1_cds_2_0_chr22_16266929_r 0 - chr22 16267037 16267038 rs2818572 0 +
chr22 16269872 16269943 uc002zlh.1_cds_4_0_chr22_16269873_r 0 - chr22 16269933 16269934 rs2845206 0 +
chr22 16275206 16275277 uc002zlh.1_cds_5_0_chr22_16275207_r 0 - chr22 16275252 16275253 rs8142076 0 +
chr22 16275206 16275277 uc002zlh.1_cds_5_0_chr22_16275207_r 0 - chr22 16275237 16275238 rs2845214 0 +
chr22 16277747 16277885 uc002zlh.1_cds_6_0_chr22_16277748_r 0 - chr22 16277818 16277819 rs2073406 0 +
chr22 16277747 16277885 uc002zlh.1_cds_6_0_chr22_16277748_r 0 - chr22 16277756 16277757 rs79385954 0 +
Let's take a look at this dataset. The first six columns describe the exons; the last six describe the SNPs. You can see that the exon with ID uc002zlh.1_cds_2_0_chr22_16266929_r contains four SNPs, with IDs rs7292200, rs7290262, rs10154680, and rs2818572.
We have just seen that exon uc002zlh.1_cds_2_0_chr22_16266929_r is repeated four times in the dataset. Thus we can compute the number of SNPs per exon simply by counting how many times each exon's name is repeated. This can easily be done with the "Join, Subtract and Group -> Group" tool:
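The counting itself amounts to tallying repeats of the exon name (column 4 of the joined dataset). A toy sketch of the idea, with made-up, shortened rows standing in for the real joined file:

```python
# Count SNPs per exon by tallying how often each exon name (column 4)
# appears in the joined dataset. The rows here are invented stubs.
from collections import Counter

joined_rows = [
    "chr22\t100\t200\texon_A\t0\t-\tchr22\t150\t151\trs1\t0\t+",
    "chr22\t100\t200\texon_A\t0\t-\tchr22\t160\t161\trs2\t0\t+",
    "chr22\t300\t400\texon_B\t0\t-\tchr22\t350\t351\trs3\t0\t+",
]

snps_per_exon = Counter(row.split("\t")[3] for row in joined_rows)
print(snps_per_exon)  # exon_A tallied twice, exon_B once
```

The Group tool does the same thing on disk: it sorts on the grouping column first, then counts each run of identical keys.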
# python /home/azureuser/galaxy-dist/tools/stats/grouping.py /home/azureuser/galaxy-dist/database/files/000/dataset_40.dat /home/azureuser/galaxy-dist/database/files/000/dataset_39.dat 4 0 'length 4 no'
param1 = "/home/azureuser/galaxy-dist/database/files/000/dataset_40.dat"
param2 = "/home/azureuser/galaxy-dist/database/files/000/dataset_39.dat"
param3 = "4"
param4 = "0"
param5 = 'length 4 no'
If you look at the image above you will see that the result of the grouping (dataset #4) contains two columns. The first contains the exon name, while the second shows the number of times this name was repeated in dataset #3.
#!/usr/bin/env python
# Guruprasad Ananda
# Refactored 2011 to use numpy instead of rpy, Kanwei Li
"""
This tool provides the SQL "group by" functionality.
"""
import sys, commands, tempfile, random
try:
    import numpy
except:
    from galaxy import eggs
    eggs.require( "numpy" )
    import numpy
from itertools import groupby

def stop_err(msg):
    sys.stderr.write(msg)
    sys.exit()

def mode(data):
    counts = {}
    for x in data:
        counts[x] = counts.get(x, 0) + 1
    maxcount = max(counts.values())
    modelist = []
    for x in counts:
        if counts[x] == maxcount:
            modelist.append( str(x) )
    return ','.join(modelist)

def main():
    #inputfile = sys.argv[2]
    #ignorecase = int(sys.argv[4])
    global param1, param2, param3, param4, param5
    inputfile = param2
    ignorecase = int(param4)
    ops = []
    cols = []
    round_val = []
    data_ary = []

    #for var in sys.argv[5:]:
    var = param5
    op, col, do_round = var.split()
    ops.append(op)
    cols.append(col)
    round_val.append(do_round)
    """
    At this point, ops, cols and round_val will look something like this:
    ops:       ['mean', 'min', 'c']
    cols:      ['1', '3', '4']
    round_val: ['no', 'yes', 'no']
    """
    try:
        #group_col = int( sys.argv[3] )-1
        group_col = int( param3 ) - 1
    except:
        stop_err( "Group column not specified." )

    str_ops = ['c', 'length', 'unique', 'random', 'cuniq', 'Mode'] # ops that can handle string/non-numeric inputs

    tmpfile = tempfile.NamedTemporaryFile()

    try:
        """
        The -k option for the POSIX sort command is as follows:
        -k, --key=POS1[,POS2]
            start a key at POS1, end it at POS2 (origin 1)
        In other words, column positions start at 1 rather than 0, so
        we need to add 1 to group_col.
        If POS2 is not specified, newer versions of sort will consider
        the entire line for sorting. To prevent this, we set POS2=POS1.
        """
        case = ''
        if ignorecase == 1:
            case = '-f'
        command_line = "sort -t ' ' %s -k%s,%s -o %s %s" % (case, group_col+1, group_col+1, tmpfile.name, inputfile)
    except Exception, exc:
        stop_err( 'Initialization error -> %s' % str(exc) )

    error_code, stdout = commands.getstatusoutput(command_line)
    if error_code != 0:
        stop_err( "Sorting input dataset resulted in error: %s: %s" % ( error_code, stdout ) )

    #fout = open(sys.argv[1], "w")
    fout = open(param1, "w")

    def is_new_item(line):
        try:
            item = line.strip().split("\t")[group_col]
        except IndexError:
            stop_err( "The following line didn't have %s columns: %s" % (group_col+1, line) )
        if ignorecase == 1:
            return item.lower()
        return item

    for key, line_list in groupby(tmpfile, key=is_new_item):
        op_vals = [ [] for op in ops ]
        out_str = key
        multiple_modes = False
        mode_index = None

        for line in line_list:
            fields = line.strip().split("\t")
            for i, col in enumerate(cols):
                col = int(col) - 1  # cXX from galaxy is 1-based
                try:
                    val = fields[col].strip()
                    op_vals[i].append(val)
                except IndexError:
                    sys.stderr.write( 'Could not access the value for column %s on line: "%s". Make sure file is tab-delimited.\n' % (col+1, line) )
                    sys.exit( 1 )

        # Generate string for each op for this group
        for i, op in enumerate( ops ):
            data = op_vals[i]
            rval = ""
            if op == "mode":
                rval = mode( data )
            elif op == "length":
                rval = len( data )
            elif op == "random":
                rval = random.choice(data)
            elif op in ['cat', 'cat_uniq']:
                if op == 'cat_uniq':
                    data = numpy.unique(data)
                rval = ','.join(data)
            elif op == "unique":
                rval = len( numpy.unique(data) )
            else:
                # some kind of numpy fn
                try:
                    data = map(float, data)
                except ValueError:
                    sys.stderr.write( "Operation %s expected number values but got %s instead.\n" % (op, data) )
                    sys.exit( 1 )
                rval = getattr(numpy, op)( data )
                if round_val[i] == 'yes':
                    rval = round(rval)
                else:
                    rval = '%g' % rval
            out_str += "\t%s" % rval

        fout.write(out_str + "\n")

    # Generate a useful info message.
    msg = "--Group by c%d: " % (group_col+1)
    for i, op in enumerate(ops):
        if op == 'cat':
            op = 'concat'
        elif op == 'cat_uniq':
            op = 'concat_distinct'
        elif op == 'length':
            op = 'count'
        elif op == 'unique':
            op = 'count_distinct'
        elif op == 'random':
            op = 'randomly_pick'
        msg += op + "[c" + cols[i] + "] "
    print msg
    fout.close()
    tmpfile.close()

if __name__ == "__main__":
    main()
An exception has occurred, use %tb to see the full traceback.
SystemExit
Sorting input dataset resulted in error: 512: sort: multi-character tab ` '
To exit: use 'exit', 'quit', or Ctrl-D.
!head $param1
uc002zlh.1_cds_1_0_chr22_16258186_r 1
uc002zlh.1_cds_2_0_chr22_16266929_r 4
uc002zlh.1_cds_4_0_chr22_16269873_r 1
uc002zlh.1_cds_5_0_chr22_16275207_r 2
uc002zlh.1_cds_6_0_chr22_16277748_r 5
uc002zlh.1_cds_7_0_chr22_16279195_r 2
uc002zlj.1_cds_0_0_chr22_16266931_r 4
uc002zlj.1_cds_2_0_chr22_16269873_r 1
uc002zlj.1_cds_3_0_chr22_16275207_r 2
uc002zlj.1_cds_4_0_chr22_16277748_r 5
To see which exon has the highest number of SNPs we can simply sort dataset #4 on the second column in descending order. This is done with the "Filter and Sort -> Sort" tool:
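The sort key 2,2nr means: sort on column 2 only, numerically, in reverse. In plain Python the equivalent operation on toy (made-up) name/count rows would be:

```python
# Sort (name, count) rows on the second column, numeric, descending --
# the Python equivalent of sort's "-k2,2nr". Rows are invented.
rows = ["exon_B\t4", "exon_A\t1", "exon_C\t7"]
ordered = sorted(rows, key=lambda r: int(r.split("\t")[1]), reverse=True)
print(ordered)
```

Note the int() conversion: without it the counts would be compared as strings, so "10" would sort before "9".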
#python /home/azureuser/galaxy-dist/tools/filters/sorter.py --input=/home/azureuser/galaxy-dist/database/files/000/dataset_40.dat --output=/home/azureuser/galaxy-dist/database/files/000/dataset_41.dat --key=2,2nr
inputfile = "/home/azureuser/galaxy-dist/database/files/000/dataset_40.dat"
outputfile = "/home/azureuser/galaxy-dist/database/files/000/dataset_41.dat"
keyinput = "2,2nr"
"""
Sorts tabular data on one or more columns. All comments of the file are collected
and placed at the beginning of the sorted output file.
usage: sorter.py [options]
-i, --input: Tabular file to be sorted
-o, --output: Sorted output file
-k, --key: Key (see manual for bash/sort)
usage: sorter.py input output [key ...]
"""
# 03/05/2013 guerler
# imports
import os, re, string, sys
from optparse import OptionParser
# error
def stop_err( msg ):
sys.stderr.write( "%s\n" % msg )
sys.exit()
# main
def main():
# define options
parser = OptionParser()
parser.add_option("-i", "--input")
parser.add_option("-o", "--output")
parser.add_option("-k", "--key", action="append")
# parse
#options, args = parser.parse_args()
global inputfile, outputfile, keyinput
try:
# retrieve options
#input = options.input
#output = options.output
#key = [" -k" + k for k in options.key]
input = inputfile
output = outputfile
key = [" -k" + k for k in [keyinput]]
# grep comments
grep_comments = "(grep '^#' %s) > %s" % (input, output)
#print grep_comments
# grep and sort columns
sort_columns = "(grep '^[^#]' %s | sort -f -t '\t' %s) >> %s" % (input, ' '.join(key), output)
#print sort_columns
# execute
os.system(grep_comments)
os.system(sort_columns)
except Exception, ex:
stop_err('Error running sorter.py\n' + str(ex))
# exit
sys.exit(0)
if __name__ == "__main__":
main()
An exception has occurred, use %tb to see the full traceback.
SystemExit: 0
To exit: use 'exit', 'quit', or Ctrl-D.
!head $outputfile
uc010gsw.2_cds_1_0_chr22_21480537_r 67
uc021wmb.1_cds_0_0_chr22_21480537_r 67
uc002zoc.3_cds_0_0_chr22_18834445_f 58
uc021wnd.1_cds_0_0_chr22_24647973_f 50
uc021wmc.1_cds_0_0_chr22_21637809_f 47
uc003bhh.3_cds_0_0_chr22_46652458_r 46
uc002zsd.4_cds_0_0_chr22_20456382_r 41
uc002zuq.4_cds_0_0_chr22_21738148_f 41
uc002zuy.4_cds_0_0_chr22_21900346_r 41
uc003atq.1_cds_12_0_chr22_38119192_f 37
Now let's select the top five exons with the highest number of SNPs. For this we will use the "Text Manipulation -> Select First" tool (in this Galaxy version, a small Perl wrapper around head, headWrapper.pl):
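Selecting the first N lines is exactly what head -n N does; a Python sketch of the same idea, with made-up stand-in lines:

```python
# "Select first" = keep the first N lines, like `head -n N`.
# The lines here are invented stand-ins for the sorted dataset.
from itertools import islice

sorted_lines = ["e1\t67", "e2\t67", "e3\t58", "e4\t50", "e5\t47", "e6\t46"]
top_five = list(islice(sorted_lines, 5))
print(top_five)
```

islice works on any iterator, so the same one-liner applies to an open file handle without reading the whole file into memory.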
inputfilename = "/home/azureuser/galaxy-dist/database/files/000/dataset_41.dat"
linenum = 5
outputfilename = "/home/azureuser/galaxy-dist/database/files/000/dataset_42.dat"
#! /usr/bin/perl -w

use strict;
use warnings;

# a wrapper for head for use in galaxy
# headWrapper.pl [filename] [# lines to show] [output]
die "Check arguments" unless @ARGV == 3;
die "Line number must be an integer\n" unless $ARGV[1] =~ m/^\d+$/;

open (OUT, ">$ARGV[2]") or die "Cannot create $ARGV[2]:$!\n";
open (HEAD, "head -n $ARGV[1] $ARGV[0]|") or die "Cannot run head:$!\n";
while (<HEAD>) {
    print OUT;
}
close OUT;
close HEAD;
File "<ipython-input-36-1b4ab6e90b60>", line 3 use strict; ^ SyntaxError: invalid syntax
!head $outputfilename
uc010gsw.2_cds_1_0_chr22_21480537_r 67
uc021wmb.1_cds_0_0_chr22_21480537_r 67
uc002zoc.3_cds_0_0_chr22_18834445_f 58
uc021wnd.1_cds_0_0_chr22_24647973_f 50
uc021wmc.1_cds_0_0_chr22_21637809_f 47
Now we know that in this dataset the top five exons contain between 47 and 67 SNPs each. But what else can we learn about them? To learn more we need to get back the positional information (coordinates) of these exons. This information was lost at the grouping step, and now all we have is just two columns. To get the coordinates back, we will match the names of the exons in dataset #6 (column 1) against the names of the exons in the original dataset #1 (column 4). This can be done with the "Join, Subtract and Group -> Compare two Queries" tool (note the settings of the tool in the middle pane):
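Conceptually, Compare two Queries keeps every row of the first dataset whose column 4 value occurs in column 1 of the second. A toy sketch with invented rows:

```python
# Keep exon records (BED rows) whose name (column 4) appears in the
# top-SNP list (column 1). All rows are invented for the example.
exons = [
    "chr22\t100\t200\texon_A\t0\t+",
    "chr22\t300\t400\texon_B\t0\t-",
    "chr22\t500\t600\texon_C\t0\t+",
]
top = ["exon_A\t67", "exon_C\t47"]

wanted = {row.split("\t")[0] for row in top}
matched = [row for row in exons if row.split("\t")[3] in wanted]
print(matched)
```

Building a set of wanted names first makes each membership test O(1); the real tool achieves the same effect by sorting both files and running UNIX join.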
#python /home/azureuser/galaxy-dist/tools/filters/joinWrapper.py /home/azureuser/galaxy-dist/database/files/000/dataset_37.dat /home/azureuser/galaxy-dist/database/files/000/dataset_42.dat 4 1 N /home/azureuser/galaxy-dist/database/files/000/dataset_43.dat
infile1 = "/home/azureuser/galaxy-dist/database/files/000/dataset_37.dat"
infile2 = "/home/azureuser/galaxy-dist/database/files/000/dataset_42.dat"
field1 = 4
field2 = 1
mode = "N"
outfile = "/home/azureuser/galaxy-dist/database/files/000/dataset_43.dat"
#!/usr/bin/env python
# Guruprasad Ananda
"""
This tool provides the UNIX "join" functionality.
"""
import sys, os, tempfile, subprocess

def stop_err(msg):
    sys.stderr.write(msg)
    sys.exit()

def main():
    global infile1, infile2, field1, field2, mode, outfile
    #infile1 = sys.argv[1]
    #infile2 = sys.argv[2]
    #field1 = int(sys.argv[3])
    #field2 = int(sys.argv[4])
    #mode = sys.argv[5]
    #outfile = sys.argv[6]

    tmpfile1 = tempfile.NamedTemporaryFile()
    tmpfile2 = tempfile.NamedTemporaryFile()

    try:
        # Sort the two files based on the specified fields
        os.system("sort -t ' ' -k %d,%d -o %s %s" % (field1, field1, tmpfile1.name, infile1))
        os.system("sort -t ' ' -k %d,%d -o %s %s" % (field2, field2, tmpfile2.name, infile2))
    except Exception, exc:
        stop_err( 'Initialization error -> %s' % str(exc) )

    # Build the join -o output format "1.1,1.2,..." from the number of
    # columns in the first line of the first file.
    option = ""
    for line in file(tmpfile1.name):
        line = line.strip()
        if line:
            elems = line.split('\t')
            for j in range(1, len(elems)+1):
                if j == 1:
                    option = "1.1"
                else:
                    option = option + ",1." + str(j)
            break

    # Check if join has a --version option. BSD join doesn't have this option,
    # while GNU join does. The return value will be 0 in the latter case,
    # and non-zero in the former.
    ret = subprocess.call('join --version 2>/dev/null', shell=True)
    # Check if we have version 7 or later of GNU join. If so, we want to skip
    # checking the order, since join would otherwise raise an error on
    # duplicated items in the two files being joined.
    if ret == 0:
        cl = subprocess.Popen(["join", "--version"], stdout=subprocess.PIPE)
        (stdout, _) = cl.communicate()
        version_line = stdout.split("\n")[0]
        (version, _) = version_line.split()[-1].split(".")
        if int(version) >= 7:
            flags = "--nocheck-order"
        else:
            flags = ""
    else:
        flags = ""

    if mode == "V":
        cmdline = "join %s -t ' ' -v 1 -o %s -1 %d -2 %d %s %s > %s" % (flags, option, field1, field2, tmpfile1.name, tmpfile2.name, outfile)
    else:
        cmdline = "join %s -t ' ' -o %s -1 %d -2 %d %s %s > %s" % (flags, option, field1, field2, tmpfile1.name, tmpfile2.name, outfile)

    try:
        os.system(cmdline)
    except Exception, exj:
        stop_err('Error joining the two datasets -> %s' % str(exj))

if __name__ == "__main__":
    main()
join --nocheck-order -t ' ' -o -1 4 -2 1 /tmp/tmpwCn7om /tmp/tmpwZTKoj > /home/azureuser/galaxy-dist/database/files/000/dataset_43.dat
!head $outfile
chr22 18834444 18835833 uc002zoc.3_cds_0_0_chr22_18834445_f 0 +
chr22 21480536 21481925 uc010gsw.2_cds_1_0_chr22_21480537_r 0 -
chr22 21480536 21481925 uc021wmb.1_cds_0_0_chr22_21480537_r 0 -
chr22 21637808 21638558 uc021wmc.1_cds_0_0_chr22_21637809_f 0 +
chr22 24647972 24649256 uc021wnd.1_cds_0_0_chr22_24647973_f 0 +
The best way to learn about these exons is to look at their genomic surroundings, and there is really no better way to do this than with a genome browser. Because this analysis was performed on the "standard" human genome, you have two choices: the UCSC Genome Browser and Ensembl.
For example, clicking on "display at UCSC main" will show this (to see your regions look at "User Track" on top of browser image):