Click "Run All" for this Jupyter notebook; please send an email to btsui@eng.ucsd.edu if there is any error.
Change syn15659419 to syn15624400 if you want to download the entire variant dataset.
%%bash
pip install synapseclient
pip install pandas --upgrade
#### only download one file for now
mkdir tmp_data/
cd ./tmp_data/
# created a dummy account so that anyone can download without registering; please don't do anything crazy with the account
synapse -u synapse.skymap.download -p QtL-E2g-hzz-N4k get syn15659419
base_mergedBySRR_dir='./tmp_data/'  # '~/Data/merged/snp/hg38/mergedBySRR/'
query_SRR='ERR126304'
import pandas as pd
import re
# extract the numeric part of the run accession, e.g. 126304
query_Run_digits=int(re.search(r"\d+", query_SRR).group(0))
# extract the database prefix, e.g. 'ERR'
query_Run_db=re.search(r"\wRR", query_SRR).group(0)
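The two regular expressions above can be wrapped in a small helper that splits any SRA-style run accession into its database prefix and numeric part (a sketch; the function name is ours, not part of the notebook):

```python
import re

def split_run_accession(run_id):
    """Split an SRA-style run accession into (prefix, digits),
    e.g. 'ERR126304' -> ('ERR', 126304)."""
    digits = int(re.search(r"\d+", run_id).group(0))  # numeric part
    db = re.search(r"\wRR", run_id).group(0)          # 'ERR', 'SRR', 'DRR', ...
    return db, digits

print(split_run_accession('ERR126304'))  # ('ERR', 126304)
```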
# the chunk size is currently fixed; just use this value
chunkSize=int(10**5)
%time tmpDf=pd.read_pickle('{}/{}.pickle.gz'.format(base_mergedBySRR_dir,chunkSize))
%time hitDf=tmpDf.loc[[query_Run_db,query_Run_digits]]
CPU times: user 25.5 s, sys: 3.14 s, total: 28.7 s
Wall time: 28.7 s
CPU times: user 32.6 s, sys: 30 s, total: 1min 2s
Wall time: 36.8 s
Voilà: in a minute, you have the data.
hitDf.head(n=30)
The index levels (Run_db, Run_digits, Chr, Pos, base) repeat downward; blank cells carry the value from the row above, as in pandas' MultiIndex display.

| Run_db | Run_digits | Chr | Pos | base | ReadDepth | AverageBaseQuality |
|---|---|---|---|---|---|---|
| ERR | 187270 | 1 | 14727 | A | 2 | 31 |
| | | | | G | 8 | 37 |
| | | | 630825 | T | 70 | 36 |
| | | | 630833 | C | 75 | 35 |
| | | | | T | 1 | 37 |
| | | | 833068 | G | 1 | 32 |
| | | | 842133 | G | 5 | 37 |
| | | | 843942 | G | 1 | 40 |
| | | | 850609 | T | 4 | 38 |
| | | | 948136 | G | 3 | 34 |
| | | | 955964 | G | 1 | 38 |
| | | | 970788 | G | 2 | 37 |
| | | | 1013541 | C | 1 | 17 |
| | | | 1014143 | C | 2 | 24 |
| | | | 1014228 | A | 2 | 37 |
| | | | | G | 1 | 37 |
| | | | 1014316 | C | 2 | 37 |
| | | | 1014359 | G | 4 | 38 |
| | | | 1020239 | G | 1 | 36 |
| | | | 1022188 | A | 2 | 35 |
| | | | 1022225 | G | 2 | 30 |
| | | | 1022260 | C | 3 | 39 |
| | | | 1022313 | A | 1 | 38 |
| | | | 1042136 | T | 2 | 36 |
| | | | 1042190 | A | 1 | 36 |
| | | | | G | 2 | 36 |
| | | | 1043223 | C | 3 | 35 |
| | | | 1043248 | C | 3 | 34 |
| | | | 1043288 | G | 3 | 37 |
| | | | 1043382 | G | 1 | 39 |
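The fast lookup above relies on pandas' hierarchical indexing: selecting on the leading levels of a MultiIndex returns every row belonging to that run. A minimal sketch on a toy frame mimicking the layout above (the data here is made up for illustration):

```python
import pandas as pd

# Toy frame with the same (Run_db, Run_digits, Chr, Pos, base) index layout
idx = pd.MultiIndex.from_tuples(
    [('ERR', 187270, '1', 14727, 'A'),
     ('ERR', 187270, '1', 14727, 'G'),
     ('SRR', 100, '1', 500, 'T')],
    names=['Run_db', 'Run_digits', 'Chr', 'Pos', 'base'])
df = pd.DataFrame({'ReadDepth': [2, 8, 1],
                   'AverageBaseQuality': [31, 37, 40]}, index=idx)

# Selecting on the two leading index levels keeps all rows of that run
hit = df.loc[('ERR', 187270)]
print(hit.shape[0])  # 2
```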
memory_usageS=tmpDf.memory_usage()
print('{} GB of RAM was used to store the pickle'.format(memory_usageS.sum()/(10**9)))
4.791737056 GB of RAM was used to store the pickle
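For reference, `DataFrame.memory_usage` reports the bytes held by the index and each column; the figure above is just the sum divided by 10**9. A quick sketch on synthetic data:

```python
import numpy as np
import pandas as pd

# One million int64 rows in two columns: 8 bytes * 10**6 * 2 of column data
df = pd.DataFrame({'ReadDepth': np.arange(10**6, dtype=np.int64),
                   'AverageBaseQuality': np.arange(10**6, dtype=np.int64)})
memS = df.memory_usage()       # Series: bytes for the index and each column
total_gb = memS.sum() / 10**9  # ~0.016 GB
print(round(total_gb, 3))
```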