Check out more notebooks at our Community Notebooks Repository!
Title: How to transform a NCBI GEO data set to a BigQuery table
Author: David L Gibbs
Created: 2019-06-14
Purpose: Demonstrate how to make BigQuery tables from external data sets
URL: https://github.com/isb-cgc/Community-Notebooks/blob/master/Notebooks/How_to_make_NCBI_GEO_BigQuery_tables.ipynb
Notes:
In this example, we're going to retrieve data from NCBI GEO, and create a BigQuery table.
Our first step is to authenticate with Google -- you will need to be a member of a Google Cloud Platform (GCP) project, with authorization to run BigQuery jobs in order to run this notebook. If you don't have access to a GCP project, please contact the ISB-CGC team for help (www.isb-cgc.org).
from google.colab import auth
try:
auth.authenticate_user()
print('You have been successfully authenticated!')
except:
print('You have not been authenticated.')
Once you're authenticated, we'll begin getting set up to pull data out of BigQuery.
The first step is to initialize the BigQuery client. This requires specifying a Google Cloud Platform (GCP) project id in which you have the necessary privileges (also referred to as "roles") to execute queries and access the data used by this notebook.
from google.cloud import bigquery
try:
project_id = 'your_project_number' # Update your_project_number with your project number
bqclient = bigquery.Client(project=project_id)
print('BigQuery client successfully initialized')
except:
print('Failure to initialize BigQuery client')
!pip install GEOparse
Here's a data set I found by searching for pancreatic cancer:
import GEOparse
import pandas as pd
## docs: https://geoparse.readthedocs.io/en/latest/usage.html
gse = GEOparse.get_GEO(geo="GSE28735", destdir="./")
print()
print("GSM example:")
for gsm_name, gsm in gse.gsms.items():
print("Name: ", gsm_name)
print("Metadata:",)
for key, value in gsm.metadata.items():
print(" - %s : %s" % (key, ", ".join(value)))
print ("Table data:",)
print (gsm.table.head())
break
print()
print("GPL example:")
for gpl_name, gpl in gse.gpls.items():
print("Name: ", gpl_name)
print("Metadata:",)
for key, value in gpl.metadata.items():
print(" - %s : %s" % (key, ", ".join(value)))
print("Table data:",)
print(gpl.table.head())
break
### Let's check if the data is present ... ###
!ls
### Here's the sample names ###
gse.gsms.keys()
### To make a pandas table
dataList = []
keyList = []
for gsm_name, gsm in gse.gsms.items():
dataList.append(gsm.table['VALUE'])
keyList.append(gsm_name)
dt = pd.concat(dataList, axis=1, keys=keyList)
dt.shape
dt[0:5]
# Here we convert the wide matrix to a long format #
longTable = dt.unstack().reset_index()
longTable[0:5]
# We can rename the columns
longTable = longTable.set_axis(['Sample', 'Entrez', 'Value'], axis=1, inplace=False)
longTable[0:5]
pd.
import pandas_gbq as gbq
gbq.to_gbq(longTable, 'Daves_working_area.my_table', project_id, if_exists='fail')
And that's it!