Check out more notebooks at our Community Notebooks Repository!
Title: How to transform a NCBI GEO data set to a BigQuery table
Author: David L Gibbs
Created: 2019-06-14
Purpose: Demonstrate how to make BigQuery tables from external data sets
URL: https://github.com/isb-cgc/Community-Notebooks/blob/master/Notebooks/How_to_make_NCBI_GEO_BigQuery_tables.ipynb
Notes:
In this example, we're going to retrieve data from NCBI GEO, and create a BigQuery table.
Our first step is to authenticate with Google -- you will need to be a member of a Google Cloud Platform (GCP) project, with authorization to run BigQuery jobs in order to run this notebook. If you don't have access to a GCP project, please contact the ISB-CGC team for help (www.isb-cgc.org).
from google.colab import auth
try:
auth.authenticate_user()
print('You have been successfully authenticated!')
except:
print('You have not been authenticated.')
You have been successfully authenticated!
Once you're authenticated, we'll begin getting set up to pull data out of BigQuery.
The first step is to initialize the BigQuery client. This requires specifying a Google Cloud Platform (GCP) project id in which you have the necessary privileges (also referred to as "roles") to execute queries and access the data used by this notebook.
from google.cloud import bigquery
try:
project_id = 'your_project_number' # Update your_project_number with your project number
bqclient = bigquery.Client(project=project_id)
print('BigQuery client successfully initialized')
except:
print('Failure to initialize BigQuery client')
BigQuery client successfully initialized
!pip install GEOparse
Collecting GEOparse Downloading https://files.pythonhosted.org/packages/67/f6/9206e1acda1858fa9a117ae91d9541e011735e672d58be58a5ee0947ef13/GEOparse-1.1.0.tar.gz (189kB) |████████████████████████████████| 194kB 2.8MB/s Requirement already satisfied: numpy>=1.7 in /usr/local/lib/python3.6/dist-packages (from GEOparse) (1.16.4) Requirement already satisfied: pandas>=0.17 in /usr/local/lib/python3.6/dist-packages (from GEOparse) (0.24.2) Collecting wgetter>=0.6 (from GEOparse) Downloading https://files.pythonhosted.org/packages/8e/ce/7f160ed9f0e16a5365bcbac1dbc6bad1631e9fc91610a444fbdebede3e8b/wgetter-0.7.tar.gz Collecting biopython>=1.71 (from GEOparse) Downloading https://files.pythonhosted.org/packages/28/15/8ac646ff24cfa2588b4d5e5ea51e8d13f3d35806bd9498fbf40ef79026fd/biopython-1.73-cp36-cp36m-manylinux1_x86_64.whl (2.2MB) |████████████████████████████████| 2.2MB 36.9MB/s Requirement already satisfied: python-dateutil>=2.5.0 in /usr/local/lib/python3.6/dist-packages (from pandas>=0.17->GEOparse) (2.5.3) Requirement already satisfied: pytz>=2011k in /usr/local/lib/python3.6/dist-packages (from pandas>=0.17->GEOparse) (2018.9) Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.6/dist-packages (from python-dateutil>=2.5.0->pandas>=0.17->GEOparse) (1.12.0) Building wheels for collected packages: GEOparse, wgetter Building wheel for GEOparse (setup.py) ... done Stored in directory: /root/.cache/pip/wheels/f3/aa/77/45a2f1517e7545aaabce83d4ad371e4f58aa818e4ee38691cd Building wheel for wgetter (setup.py) ... done Stored in directory: /root/.cache/pip/wheels/89/ce/c2/d0f63a63aec734d8af33209050a861f9a8583e11063dc27419 Successfully built GEOparse wgetter Installing collected packages: wgetter, biopython, GEOparse Successfully installed GEOparse-1.1.0 biopython-1.73 wgetter-0.7
Here's a data set I found by searching for pancreatic cancer:
import GEOparse
import pandas as pd
## docs: https://geoparse.readthedocs.io/en/latest/usage.html
gse = GEOparse.get_GEO(geo="GSE28735", destdir="./")
print()
print("GSM example:")
for gsm_name, gsm in gse.gsms.items():
print("Name: ", gsm_name)
print("Metadata:",)
for key, value in gsm.metadata.items():
print(" - %s : %s" % (key, ", ".join(value)))
print ("Table data:",)
print (gsm.table.head())
break
print()
print("GPL example:")
for gpl_name, gpl in gse.gpls.items():
print("Name: ", gpl_name)
print("Metadata:",)
for key, value in gpl.metadata.items():
print(" - %s : %s" % (key, ", ".join(value)))
print("Table data:",)
print(gpl.table.head())
break
GSM example: Name: GSM711904 Metadata: - title : human pancreatic tumor tissue, patient sample 1 - geo_accession : GSM711904 - status : Public on Jul 09 2012 - submission_date : Apr 20 2011 - last_update_date : Jul 09 2012 - type : RNA - channel_count : 1 - source_name_ch1 : tumor tissue, patient 1 - organism_ch1 : Homo sapiens - taxid_ch1 : 9606 - characteristics_ch1 : tissue: T, survival_month: 51, cancer_death: 1 - molecule_ch1 : total RNA - extract_protocol_ch1 : Tissues were flash frozen immediately after surgery.RNA from frozen tissue samples was extracted using standard TRIZOL (Invitrogen) protocol. RNA quality was confirmed with the Agilent 2100 Bioanalyzer (Agilent Technologies) before the microarray gene expression profiling - label_ch1 : biotin - label_protocol_ch1 : Samples were enzymatically fragmented and biotinylated using the WT Terminal Labeling Kit (Affymetrix) - hyb_protocol : Samples were hybridized using Affymetrix GeneChip Hybridization, Wash, and Stain Kit and according to the manufacturer's protocol. - scan_protocol : Affymetrix Gene ChIP Scanner 3000 7G - description : cancer related death - data_processing : All arrays were RMA normalized and gene expression summaries were created for each gene by averaging all probe sets for each gene using Partek Genomics Suite 6.5. All data analysis was performed on gene summarized data. - platform_id : GPL6244 - contact_name : Perwez,,Hussain - contact_email : hussainp@mail.nih.gov - contact_institute : NCI/NIH - contact_address : 37 Convent Drive - contact_city : Bethesda - contact_zip/postal_code : 20892 - contact_country : USA - supplementary_file : ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM711nnn/GSM711904/suppl/GSM711904.CEL.gz - series_id : GSE28735 - data_row_count : 28869 Table data: ID_REF VALUE 0 8157281 2.87856 1 7997332 4.81907 2 8072798 4.79498 3 7972808 3.91443 4 8157283 6.54982 GPL example: Name: GPL6244 Metadata: - title : [HuGene-1_0-st] Affymetrix Human Gene 1.0 ST Array [transcript (gene) version] - geo_accession : GPL6244 - status : Public on Dec 05 2007 - submission_date : Dec 05 2007 - last_update_date : Jul 26 2018 - technology : in situ oligonucleotide - distribution : commercial - organism : Homo sapiens - taxid : 9606 - manufacturer : Affymetrix - manufacture_protocol : See manufacturer's web site, - description : Affymetrix submissions are typically submitted to GEO using the GEOarchive method described at http://www.ncbi.nlm.nih.gov/projects/geo/info/geo_affy.html, , June 03, 2009: annotation table updated with netaffx build 28, June 18, 2012: annotation table updated with netaffx build 32, July 01, 2016: annotation table updated with netaffx build 35 - web_link : http://www.affymetrix.com/support/technical/byproduct.affx?product=hugene-1_0-st-v1, http://www.affymetrix.com/support/technical/libraryfilesmain.affx - contact_name : ,,Affymetrix, Inc. - contact_email : geo@ncbi.nlm.nih.gov, support@affymetrix.com - contact_phone : 888-362-2447 - contact_institute : Affymetrix, Inc. - contact_address : - contact_city : Santa Clara - contact_state : CA - contact_zip/postal_code : 95051 - contact_country : USA - contact_web_link : http://www.affymetrix.com/index.affx - relation : Alternative to: GPL10063 (Alternative CDF), Alternative to: GPL10666 (Alternative CDF), Alternative to: GPL11209 (Alternative CDF), Alternative to: GPL13243 (Alternative CDF [HuGene10stv1_Hs_ENTREZG_11.0.1]), Alternative to: GPL14010 (Alternative CDF), Alternative to: GPL15034 (Alternative CDF [HuGene10stv1_Hs_ENTREZG_v14]), Alternative to: GPL15424 (alternative CDF [HuGene10stv1_Hs_ENST]), Alternative to: GPL15648 (Alternative CDF), Alternative to: GPL15969 (Alternative CDF [HuGene1_0_genecentric]), Alternative to: GPL16239 (Alternative CDF), Alternative to: GPL16332 (Alternative CDF [MMBGX-Ensembl64]), Alternative to: GPL16522 (Alternative CDF [HuGene10stv1_Hs_ENTREZG_15.1.0]), Alternative to: GPL16786 (Alternative CDF [HuGene10stv1_Hs_ENSG_v13.0.0]), Alternative to: GPL16987 (Alternative CDF [GATExplorer_Ensembl v57]), Alternative to: GPL17047 ( Alternative CDF [HuGene10stv1_Hs_ENTREZG_15.0.0]), Alternative to: GPL17244 (Alternative CDF [HuGene10stv1_Hs_ENTREZG_16.0.0]), Alternative to: GPL17556 (alternative), Alternative to: GPL17737 (Alternative CDF), Alternative to: GPL18401 (AltAnalyze probeset-to-Ensembl), Alternative to: GPL18412 (Alternative CDF [hugene10stv1_74_030]), Alternative to: GPL18695 (alternative CDF[Hugene10st_Hs_ENST_16.0.0]), Alternative to: GPL19145 (Alternative CDF [hugene10st_Hs_ENSG_16.0.0]), Alternative to: GPL19433 (Alternative CDF [hugene10st_Hs_ENSGvs1500.cdf]), Alternative to: GPL20171 (Alternative CDF [hugene10st_Hs_ENTREZG_19.0.0]), Alternative to: GPL23526 (Alternative CDF [hugene10st_Hs_ENTREZG_20]) - data_row_count : 33297 Table data: ID ... category 0 7896736 ... main 1 7896738 ... main 2 7896740 ... main 3 7896742 ... main 4 7896744 ... main [5 rows x 12 columns]
### Let's check if the data is present ... ###
!ls
adc.json GSE28735_family.soft.gz sample_data
### Here's the sample names ###
gse.gsms.keys()
dict_keys(['GSM711904', 'GSM711905', 'GSM711906', 'GSM711907', 'GSM711908', 'GSM711909', 'GSM711910', 'GSM711911', 'GSM711912', 'GSM711913', 'GSM711914', 'GSM711915', 'GSM711916', 'GSM711917', 'GSM711918', 'GSM711919', 'GSM711920', 'GSM711921', 'GSM711922', 'GSM711923', 'GSM711924', 'GSM711925', 'GSM711926', 'GSM711927', 'GSM711928', 'GSM711929', 'GSM711930', 'GSM711931', 'GSM711932', 'GSM711933', 'GSM711934', 'GSM711935', 'GSM711936', 'GSM711937', 'GSM711938', 'GSM711939', 'GSM711940', 'GSM711941', 'GSM711942', 'GSM711943', 'GSM711944', 'GSM711945', 'GSM711946', 'GSM711947', 'GSM711948', 'GSM711949', 'GSM711950', 'GSM711951', 'GSM711952', 'GSM711953', 'GSM711954', 'GSM711955', 'GSM711956', 'GSM711957', 'GSM711958', 'GSM711959', 'GSM711960', 'GSM711961', 'GSM711962', 'GSM711963', 'GSM711964', 'GSM711965', 'GSM711966', 'GSM711967', 'GSM711968', 'GSM711969', 'GSM711970', 'GSM711971', 'GSM711972', 'GSM711973', 'GSM711974', 'GSM711975', 'GSM711976', 'GSM711977', 'GSM711978', 'GSM711979', 'GSM711980', 'GSM711981', 'GSM711982', 'GSM711983', 'GSM711984', 'GSM711985', 'GSM711986', 'GSM711987', 'GSM711988', 'GSM711989', 'GSM711990', 'GSM711991', 'GSM711992', 'GSM711993'])
### To make a pandas table
dataList = []
keyList = []
for gsm_name, gsm in gse.gsms.items():
dataList.append(gsm.table['VALUE'])
keyList.append(gsm_name)
dt = pd.concat(dataList, axis=1, keys=keyList)
dt.shape
(28869, 90)
dt[0:5]
GSM711904 | GSM711905 | GSM711906 | GSM711907 | GSM711908 | GSM711909 | GSM711910 | GSM711911 | GSM711912 | GSM711913 | GSM711914 | GSM711915 | GSM711916 | GSM711917 | GSM711918 | GSM711919 | GSM711920 | GSM711921 | GSM711922 | GSM711923 | GSM711924 | GSM711925 | GSM711926 | GSM711927 | GSM711928 | GSM711929 | GSM711930 | GSM711931 | GSM711932 | GSM711933 | GSM711934 | GSM711935 | GSM711936 | GSM711937 | GSM711938 | GSM711939 | GSM711940 | GSM711941 | GSM711942 | GSM711943 | ... | GSM711954 | GSM711955 | GSM711956 | GSM711957 | GSM711958 | GSM711959 | GSM711960 | GSM711961 | GSM711962 | GSM711963 | GSM711964 | GSM711965 | GSM711966 | GSM711967 | GSM711968 | GSM711969 | GSM711970 | GSM711971 | GSM711972 | GSM711973 | GSM711974 | GSM711975 | GSM711976 | GSM711977 | GSM711978 | GSM711979 | GSM711980 | GSM711981 | GSM711982 | GSM711983 | GSM711984 | GSM711985 | GSM711986 | GSM711987 | GSM711988 | GSM711989 | GSM711990 | GSM711991 | GSM711992 | GSM711993 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ID_REF | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
8157281 | 2.87856 | 3.11238 | 2.53830 | 3.06354 | 3.00862 | 2.86031 | 2.79610 | 2.70925 | 2.93254 | 2.63424 | 2.95893 | 2.87200 | 2.68794 | 2.87043 | 3.00570 | 2.88930 | 2.67934 | 2.57238 | 3.27607 | 2.64049 | 3.06408 | 2.73406 | 2.77169 | 2.82460 | 2.53917 | 2.89397 | 3.02304 | 2.54533 | 2.68656 | 2.71329 | 2.78667 | 3.05640 | 2.64101 | 2.65976 | 2.59429 | 2.48407 | 2.76008 | 2.80774 | 2.87721 | 2.65070 | ... | 2.87857 | 2.56825 | 2.83285 | 2.79112 | 2.54681 | 3.67406 | 2.50075 | 2.78080 | 2.78987 | 3.04386 | 2.71545 | 2.80313 | 2.77076 | 2.47390 | 3.23515 | 3.36002 | 2.94759 | 2.70444 | 2.78727 | 2.74598 | 2.85543 | 2.75771 | 2.59462 | 2.68376 | 2.70517 | 2.55055 | 3.08985 | 2.68454 | 2.82559 | 2.46607 | 2.95069 | 3.08348 | 2.53686 | 2.59509 | 2.81026 | 2.73692 | 2.56131 | 2.55399 | 2.33127 | 2.84898 |
7997332 | 4.81907 | 5.11957 | 3.84534 | 4.88599 | 4.21256 | 5.21558 | 4.44358 | 4.84622 | 4.39662 | 5.31573 | 4.04326 | 4.90359 | 4.66811 | 4.61300 | 3.99221 | 4.26957 | 4.76269 | 5.55366 | 4.36159 | 4.55511 | 4.68806 | 4.54581 | 4.68843 | 4.36278 | 4.00679 | 5.50511 | 3.78917 | 5.88371 | 4.68404 | 4.68971 | 4.26084 | 5.25235 | 4.81340 | 5.05444 | 5.69069 | 5.44038 | 4.59510 | 4.92991 | 4.88100 | 4.55881 | ... | 4.56882 | 4.55533 | 4.81932 | 5.96177 | 4.58744 | 5.35493 | 4.00745 | 5.07203 | 4.30874 | 5.60549 | 4.39004 | 5.03262 | 4.19159 | 4.66786 | 4.41098 | 5.50174 | 4.37206 | 4.25277 | 4.55099 | 5.58128 | 4.25922 | 5.78943 | 4.64110 | 5.39257 | 4.51210 | 4.94286 | 4.66434 | 5.09756 | 4.48413 | 4.87422 | 4.99665 | 5.13753 | 4.71479 | 5.75471 | 4.96034 | 5.30360 | 4.73077 | 4.74355 | 4.92834 | 5.39472 |
8072798 | 4.79498 | 3.33289 | 4.46755 | 4.21402 | 3.96569 | 5.93537 | 5.32265 | 5.61217 | 4.00068 | 4.75637 | 4.22405 | 4.82199 | 4.26785 | 4.43000 | 3.93890 | 4.68159 | 4.57920 | 5.63977 | 4.40216 | 4.90237 | 4.90333 | 3.87491 | 4.55239 | 5.81248 | 4.43769 | 3.87557 | 4.11975 | 4.04842 | 4.39872 | 5.36391 | 4.92992 | 3.55703 | 4.05370 | 4.16998 | 3.95932 | 4.28915 | 4.84668 | 4.96392 | 4.97775 | 5.06941 | ... | 4.98281 | 5.00488 | 4.94275 | 4.51843 | 5.72931 | 3.82826 | 5.29366 | 3.90850 | 4.85564 | 3.70039 | 4.79108 | 4.93422 | 4.48444 | 4.66008 | 4.48499 | 4.10805 | 5.17420 | 5.20986 | 4.47858 | 4.63110 | 5.60834 | 4.55652 | 4.35247 | 4.86911 | 4.66227 | 4.94809 | 4.68328 | 4.49687 | 4.93756 | 5.40975 | 3.97622 | 3.83820 | 4.40752 | 3.56449 | 4.46081 | 4.52637 | 5.26350 | 5.08635 | 4.60016 | 4.87154 |
7972808 | 3.91443 | 3.73899 | 3.56541 | 3.65571 | 3.81255 | 3.47989 | 3.97206 | 3.23493 | 3.65640 | 3.47709 | 3.24363 | 4.19606 | 3.71938 | 3.41108 | 3.60706 | 3.46097 | 3.47632 | 3.46287 | 3.42344 | 3.48008 | 3.32608 | 3.93251 | 3.24573 | 3.59813 | 3.77408 | 4.04420 | 3.71794 | 3.26882 | 3.20658 | 3.43510 | 3.45698 | 3.92237 | 3.58156 | 3.60522 | 3.45792 | 3.67040 | 3.26171 | 3.25425 | 3.47135 | 3.26036 | ... | 3.61229 | 3.46821 | 3.44855 | 3.39108 | 3.44181 | 4.25334 | 3.42747 | 4.48789 | 3.50292 | 3.65136 | 3.32469 | 3.46204 | 3.52742 | 3.38724 | 3.61432 | 3.59659 | 3.63159 | 3.28400 | 3.37072 | 3.24871 | 3.70275 | 3.46703 | 3.64926 | 3.37321 | 3.42203 | 3.09829 | 3.49502 | 3.30841 | 3.58856 | 3.25516 | 3.66669 | 3.56927 | 3.84543 | 3.58422 | 3.60664 | 3.31744 | 3.82820 | 3.71242 | 3.39796 | 3.48399 |
8157283 | 6.54982 | 6.00254 | 6.52508 | 6.14602 | 6.33795 | 6.46008 | 6.90400 | 6.79776 | 6.57530 | 6.28204 | 6.72710 | 6.32471 | 6.40431 | 6.65186 | 6.66155 | 6.99649 | 6.80546 | 6.56490 | 6.39291 | 6.92403 | 6.75612 | 6.47914 | 6.49044 | 6.62358 | 6.46638 | 6.03631 | 6.42088 | 6.41698 | 6.47206 | 6.80258 | 6.65255 | 6.36393 | 6.55296 | 6.35654 | 6.23710 | 6.08812 | 6.62913 | 6.39054 | 7.02084 | 6.89889 | ... | 6.13243 | 6.57235 | 6.55251 | 6.51254 | 6.64045 | 6.69901 | 6.63028 | 6.05628 | 6.70075 | 6.35385 | 6.73580 | 6.47082 | 6.51137 | 7.07220 | 6.60414 | 6.46276 | 6.59864 | 6.59402 | 7.10924 | 6.78148 | 6.59497 | 6.24138 | 6.27252 | 6.54214 | 6.80291 | 6.89463 | 6.79239 | 6.66943 | 6.83302 | 6.59577 | 5.84439 | 6.26302 | 6.56643 | 6.22398 | 6.71857 | 6.49037 | 6.82428 | 6.57863 | 6.79043 | 6.65992 |
5 rows × 90 columns
# Here we convert the wide matrix to a long format #
longTable = dt.unstack().reset_index()
longTable[0:5]
level_0 | ID_REF | 0 | |
---|---|---|---|
0 | GSM711904 | 8157281 | 2.87856 |
1 | GSM711904 | 7997332 | 4.81907 |
2 | GSM711904 | 8072798 | 4.79498 |
3 | GSM711904 | 7972808 | 3.91443 |
4 | GSM711904 | 8157283 | 6.54982 |
# We can rename the columns
longTable = longTable.set_axis(['Sample', 'Entrez', 'Value'], axis=1, inplace=False)
longTable[0:5]
Sample | Entrez | Value | |
---|---|---|---|
0 | GSM711904 | 8157281 | 2.87856 |
1 | GSM711904 | 7997332 | 4.81907 |
2 | GSM711904 | 8072798 | 4.79498 |
3 | GSM711904 | 7972808 | 3.91443 |
4 | GSM711904 | 8157283 | 6.54982 |
pd.
import pandas_gbq as gbq
gbq.to_gbq(longTable, 'Daves_working_area.my_table', project_id, if_exists='fail')
And that's it!