Click "Run All" for this Jupyter notebook; please send an email to btsui@eng.ucsd.edu if there is any error.
Change syn15659419 to syn15624400 if you want to download the entire variant dataset.
%%bash
pip install synapseclient
pip install pandas --upgrade
#### only download one file for now
mkdir tmp_data/
cd ./tmp_data/
# created a dummy account so that anyone can download without registering; please don't do anything crazy with the account
synapse -u synapse.skymap.download -p QtL-E2g-hzz-N4k get syn15659419
base_mergedBySRR_dir='./tmp_data/'  # '~/Data/merged/snp/hg38/mergedBySRR/'
query_SRR='ERR126304'
import pandas as pd
import re
# extract the numeric part of the run accession, e.g. 126304
query_Run_digits=int(re.search(r"\d+", query_SRR).group(0))
# extract the database prefix, e.g. 'ERR'
query_Run_db=re.search(r"\wRR", query_SRR).group(0)
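The two regular expressions above can be wrapped in a small helper that splits any SRA-style run accession into its database prefix and numeric part (a sketch; the function name is ours, not part of the notebook):

```python
import re

def split_run_accession(run_id):
    """Split an SRA-style run accession into (prefix, digits),
    e.g. 'ERR126304' -> ('ERR', 126304)."""
    digits = int(re.search(r"\d+", run_id).group(0))  # numeric part
    db = re.search(r"\wRR", run_id).group(0)          # 'ERR', 'SRR', 'DRR', ...
    return db, digits

print(split_run_accession('ERR126304'))  # ('ERR', 126304)
```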
# the chunk size is currently fixed; just use this value
chunkSize=int(10**5)
%time tmpDf=pd.read_pickle('{}/{}.pickle.gz'.format(base_mergedBySRR_dir,chunkSize))
%time hitDf=tmpDf.loc[[query_Run_db,query_Run_digits]]
CPU times: user 25.5 s, sys: 3.14 s, total: 28.7 s
Wall time: 28.7 s
CPU times: user 32.6 s, sys: 30 s, total: 1min 2s
Wall time: 36.8 s
Voilà: in a minute, you have the data.
hitDf.head(n=30)
The index levels (Run_db, Run_digits, Chr, Pos, base) repeat downward; blank cells carry the value from the row above, as in pandas' MultiIndex display.

| Run_db | Run_digits | Chr | Pos | base | ReadDepth | AverageBaseQuality |
|---|---|---|---|---|---|---|
| ERR | 187270 | 1 | 14727 | A | 2 | 31 |
| | | | | G | 8 | 37 |
| | | | 630825 | T | 70 | 36 |
| | | | 630833 | C | 75 | 35 |
| | | | | T | 1 | 37 |
| | | | 833068 | G | 1 | 32 |
| | | | 842133 | G | 5 | 37 |
| | | | 843942 | G | 1 | 40 |
| | | | 850609 | T | 4 | 38 |
| | | | 948136 | G | 3 | 34 |
| | | | 955964 | G | 1 | 38 |
| | | | 970788 | G | 2 | 37 |
| | | | 1013541 | C | 1 | 17 |
| | | | 1014143 | C | 2 | 24 |
| | | | 1014228 | A | 2 | 37 |
| | | | | G | 1 | 37 |
| | | | 1014316 | C | 2 | 37 |
| | | | 1014359 | G | 4 | 38 |
| | | | 1020239 | G | 1 | 36 |
| | | | 1022188 | A | 2 | 35 |
| | | | 1022225 | G | 2 | 30 |
| | | | 1022260 | C | 3 | 39 |
| | | | 1022313 | A | 1 | 38 |
| | | | 1042136 | T | 2 | 36 |
| | | | 1042190 | A | 1 | 36 |
| | | | | G | 2 | 36 |
| | | | 1043223 | C | 3 | 35 |
| | | | 1043248 | C | 3 | 34 |
| | | | 1043288 | G | 3 | 37 |
| | | | 1043382 | G | 1 | 39 |
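The fast lookup above relies on pandas' hierarchical indexing: selecting on the leading levels of a MultiIndex returns every row belonging to that run. A minimal sketch on a toy frame mimicking the layout above (the data here is made up for illustration):

```python
import pandas as pd

# Toy frame with the same (Run_db, Run_digits, Chr, Pos, base) index layout
idx = pd.MultiIndex.from_tuples(
    [('ERR', 187270, '1', 14727, 'A'),
     ('ERR', 187270, '1', 14727, 'G'),
     ('SRR', 100, '1', 500, 'T')],
    names=['Run_db', 'Run_digits', 'Chr', 'Pos', 'base'])
df = pd.DataFrame({'ReadDepth': [2, 8, 1],
                   'AverageBaseQuality': [31, 37, 40]}, index=idx)

# Selecting on the two leading index levels keeps all rows of that run
hit = df.loc[('ERR', 187270)]
print(hit.shape[0])  # 2
```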
memory_usageS=tmpDf.memory_usage()
print('{} GB of RAM was used to store the pickle'.format(memory_usageS.sum()/(10**9)))
4.791737056 GB of RAM was used to store the pickle
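For reference, `DataFrame.memory_usage` reports the bytes held by the index and each column; the figure above is just the sum divided by 10**9. A quick sketch on synthetic data:

```python
import numpy as np
import pandas as pd

# One million int64 rows in two columns: 8 bytes * 10**6 * 2 of column data
df = pd.DataFrame({'ReadDepth': np.arange(10**6, dtype=np.int64),
                   'AverageBaseQuality': np.arange(10**6, dtype=np.int64)})
memS = df.memory_usage()       # Series: bytes for the index and each column
total_gb = memS.sum() / 10**9  # ~0.016 GB
print(round(total_gb, 3))
```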