ISB-CGC Community Notebooks

Check out more notebooks at our Community Notebooks Repository!

Title:   How to visualize results from BigQuery
Author:  David L Gibbs
Created: 2019-07-17
Purpose: Demonstrate how visualize the results from a query.
URL:     https://github.com/isb-cgc/Community-Notebooks/blob/master/Notebooks/How_to_plot_BigQuery_results.ipynb
Notes:

How to visualize results from BigQuery

In this example, we'll perform a few easy queries, and use the python Seaborn library to visualize the results.

Library documentation: https://seaborn.pydata.org/

There's many python tutorials on using the Seaborn library for plotting. For example, https://elitedatascience.com/python-seaborn-tutorial

Initial setup

In [ ]:
# If you start your jupyter notebook session using the 
# Community-Notebook/VM-Notebook-Launcher then you're already logged in!

# !gcloud auth application-default login
In [2]:
!pip3 install seaborn
Collecting seaborn
  Using cached https://files.pythonhosted.org/packages/a8/76/220ba4420459d9c4c9c9587c6ce607bf56c25b3d3d2de62056efe482dadc/seaborn-0.9.0-py3-none-any.whl
Collecting matplotlib>=1.4.3 (from seaborn)
  Using cached https://files.pythonhosted.org/packages/89/61/465fb3bfba684b0f53b5c4829c3c89e86e6fe9fdcdfda93e38f1788090f0/matplotlib-3.0.3-cp35-cp35m-manylinux1_x86_64.whl
Collecting numpy>=1.9.3 (from seaborn)
  Using cached https://files.pythonhosted.org/packages/bb/ef/d5a21cbc094d3f4d5b5336494dbcc9550b70c766a8345513c7c24ed18418/numpy-1.16.4-cp35-cp35m-manylinux1_x86_64.whl
Collecting scipy>=0.14.0 (from seaborn)
  Using cached https://files.pythonhosted.org/packages/14/49/8f13fa215e10a7ab0731cc95b0e9bb66cf83c6a98260b154cfbd0b55fb19/scipy-1.3.0-cp35-cp35m-manylinux1_x86_64.whl
Collecting pandas>=0.15.2 (from seaborn)
  Using cached https://files.pythonhosted.org/packages/74/24/0cdbf8907e1e3bc5a8da03345c23cbed7044330bb8f73bb12e711a640a00/pandas-0.24.2-cp35-cp35m-manylinux1_x86_64.whl
Collecting pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 (from matplotlib>=1.4.3->seaborn)
  Using cached https://files.pythonhosted.org/packages/dd/d9/3ec19e966301a6e25769976999bd7bbe552016f0d32b577dc9d63d2e0c49/pyparsing-2.4.0-py2.py3-none-any.whl
Collecting cycler>=0.10 (from matplotlib>=1.4.3->seaborn)
  Using cached https://files.pythonhosted.org/packages/f7/d2/e07d3ebb2bd7af696440ce7e754c59dd546ffe1bbe732c8ab68b9c834e61/cycler-0.10.0-py2.py3-none-any.whl
Collecting kiwisolver>=1.0.1 (from matplotlib>=1.4.3->seaborn)
  Using cached https://files.pythonhosted.org/packages/ee/18/4cd2e84c6aff0c6a50479118083d20b9e676e5175a913c0ea76d700fc244/kiwisolver-1.1.0-cp35-cp35m-manylinux1_x86_64.whl
Collecting python-dateutil>=2.1 (from matplotlib>=1.4.3->seaborn)
  Using cached https://files.pythonhosted.org/packages/41/17/c62faccbfbd163c7f57f3844689e3a78bae1f403648a6afb1d0866d87fbb/python_dateutil-2.8.0-py2.py3-none-any.whl
Collecting pytz>=2011k (from pandas>=0.15.2->seaborn)
  Using cached https://files.pythonhosted.org/packages/3d/73/fe30c2daaaa0713420d0382b16fbb761409f532c56bdcc514bf7b6262bb6/pytz-2019.1-py2.py3-none-any.whl
Collecting six (from cycler>=0.10->matplotlib>=1.4.3->seaborn)
  Using cached https://files.pythonhosted.org/packages/73/fb/00a976f728d0d1fecfe898238ce23f502a721c0ac0ecfedb80e0d88c64e9/six-1.12.0-py2.py3-none-any.whl
Collecting setuptools (from kiwisolver>=1.0.1->matplotlib>=1.4.3->seaborn)
  Using cached https://files.pythonhosted.org/packages/ec/51/f45cea425fd5cb0b0380f5b0f048ebc1da5b417e48d304838c02d6288a1e/setuptools-41.0.1-py2.py3-none-any.whl
Installing collected packages: pyparsing, six, cycler, setuptools, kiwisolver, python-dateutil, numpy, matplotlib, scipy, pytz, pandas, seaborn
Successfully installed cycler-0.10.0 kiwisolver-1.1.0 matplotlib-3.0.3 numpy-1.16.4 pandas-0.24.2 pyparsing-2.4.0 python-dateutil-2.8.0 pytz-2019.1 scipy-1.3.0 seaborn-0.9.0 setuptools-41.0.1 six-1.12.0
In [3]:
!pip3 install google-cloud-bigquery
Collecting google-cloud-bigquery
  Downloading https://files.pythonhosted.org/packages/2e/95/64e92560983db41ff1de7c08839f38ae7c5326a8aad71f5e893098cd1c85/google_cloud_bigquery-1.15.0-py2.py3-none-any.whl (133kB)
    100% |████████████████████████████████| 133kB 4.0MB/s ta 0:00:01
Collecting protobuf>=3.6.0 (from google-cloud-bigquery)
  Downloading https://files.pythonhosted.org/packages/7c/d2/581ebc3c41879aca2c4fce5c37cdb8d779c4ea79109b6da7f640735ea0a2/protobuf-3.8.0-cp35-cp35m-manylinux1_x86_64.whl (1.2MB)
    100% |████████████████████████████████| 1.2MB 1.1MB/s eta 0:00:01
Collecting google-resumable-media>=0.3.1 (from google-cloud-bigquery)
  Downloading https://files.pythonhosted.org/packages/e2/5d/4bc5c28c252a62efe69ed1a1561da92bd5af8eca0cdcdf8e60354fae9b29/google_resumable_media-0.3.2-py2.py3-none-any.whl
Collecting google-cloud-core<2.0dev,>=1.0.0 (from google-cloud-bigquery)
  Downloading https://files.pythonhosted.org/packages/98/7f/ff56aec313787577e262d5a2e306c04aef61c5c274699ff9fb40095e6691/google_cloud_core-1.0.2-py2.py3-none-any.whl
Collecting setuptools (from protobuf>=3.6.0->google-cloud-bigquery)
  Using cached https://files.pythonhosted.org/packages/ec/51/f45cea425fd5cb0b0380f5b0f048ebc1da5b417e48d304838c02d6288a1e/setuptools-41.0.1-py2.py3-none-any.whl
Collecting six>=1.9 (from protobuf>=3.6.0->google-cloud-bigquery)
  Using cached https://files.pythonhosted.org/packages/73/fb/00a976f728d0d1fecfe898238ce23f502a721c0ac0ecfedb80e0d88c64e9/six-1.12.0-py2.py3-none-any.whl
Collecting google-api-core<2.0.0dev,>=1.11.0 (from google-cloud-core<2.0dev,>=1.0.0->google-cloud-bigquery)
  Downloading https://files.pythonhosted.org/packages/10/d6/8b1e8d79a8a56649af3a094e3d90dd213278da942f36d831b57c0ca4a503/google_api_core-1.11.1-py2.py3-none-any.whl (66kB)
    100% |████████████████████████████████| 71kB 10.6MB/s ta 0:00:01
Collecting google-auth<2.0dev,>=0.4.0 (from google-api-core<2.0.0dev,>=1.11.0->google-cloud-core<2.0dev,>=1.0.0->google-cloud-bigquery)
  Downloading https://files.pythonhosted.org/packages/c5/9b/ed0516cc1f7609fb0217e3057ff4f0f9f3e3ce79a369c6af4a6c5ca25664/google_auth-1.6.3-py2.py3-none-any.whl (73kB)
    100% |████████████████████████████████| 81kB 10.7MB/s ta 0:00:01
Collecting pytz (from google-api-core<2.0.0dev,>=1.11.0->google-cloud-core<2.0dev,>=1.0.0->google-cloud-bigquery)
  Using cached https://files.pythonhosted.org/packages/3d/73/fe30c2daaaa0713420d0382b16fbb761409f532c56bdcc514bf7b6262bb6/pytz-2019.1-py2.py3-none-any.whl
Collecting requests<3.0.0dev,>=2.18.0 (from google-api-core<2.0.0dev,>=1.11.0->google-cloud-core<2.0dev,>=1.0.0->google-cloud-bigquery)
  Downloading https://files.pythonhosted.org/packages/51/bd/23c926cd341ea6b7dd0b2a00aba99ae0f828be89d72b2190f27c11d4b7fb/requests-2.22.0-py2.py3-none-any.whl (57kB)
    100% |████████████████████████████████| 61kB 10.3MB/s ta 0:00:01
Collecting googleapis-common-protos!=1.5.4,<2.0dev,>=1.5.3 (from google-api-core<2.0.0dev,>=1.11.0->google-cloud-core<2.0dev,>=1.0.0->google-cloud-bigquery)
  Downloading https://files.pythonhosted.org/packages/eb/ee/e59e74ecac678a14d6abefb9054f0bbcb318a6452a30df3776f133886d7d/googleapis-common-protos-1.6.0.tar.gz
Collecting rsa>=3.1.4 (from google-auth<2.0dev,>=0.4.0->google-api-core<2.0.0dev,>=1.11.0->google-cloud-core<2.0dev,>=1.0.0->google-cloud-bigquery)
  Downloading https://files.pythonhosted.org/packages/02/e5/38518af393f7c214357079ce67a317307936896e961e35450b70fad2a9cf/rsa-4.0-py2.py3-none-any.whl
Collecting pyasn1-modules>=0.2.1 (from google-auth<2.0dev,>=0.4.0->google-api-core<2.0.0dev,>=1.11.0->google-cloud-core<2.0dev,>=1.0.0->google-cloud-bigquery)
  Downloading https://files.pythonhosted.org/packages/91/f0/b03e00ce9fddf4827c42df1c3ce10c74eadebfb706231e8d6d1c356a4062/pyasn1_modules-0.2.5-py2.py3-none-any.whl (74kB)
    100% |████████████████████████████████| 81kB 11.0MB/s ta 0:00:01
Collecting cachetools>=2.0.0 (from google-auth<2.0dev,>=0.4.0->google-api-core<2.0.0dev,>=1.11.0->google-cloud-core<2.0dev,>=1.0.0->google-cloud-bigquery)
  Downloading https://files.pythonhosted.org/packages/2f/a6/30b0a0bef12283e83e58c1d6e7b5aabc7acfc4110df81a4471655d33e704/cachetools-3.1.1-py2.py3-none-any.whl
Collecting urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 (from requests<3.0.0dev,>=2.18.0->google-api-core<2.0.0dev,>=1.11.0->google-cloud-core<2.0dev,>=1.0.0->google-cloud-bigquery)
  Downloading https://files.pythonhosted.org/packages/e6/60/247f23a7121ae632d62811ba7f273d0e58972d75e58a94d329d51550a47d/urllib3-1.25.3-py2.py3-none-any.whl (150kB)
    100% |████████████████████████████████| 153kB 8.1MB/s eta 0:00:01
Collecting certifi>=2017.4.17 (from requests<3.0.0dev,>=2.18.0->google-api-core<2.0.0dev,>=1.11.0->google-cloud-core<2.0dev,>=1.0.0->google-cloud-bigquery)
  Downloading https://files.pythonhosted.org/packages/60/75/f692a584e85b7eaba0e03827b3d51f45f571c2e793dd731e598828d380aa/certifi-2019.3.9-py2.py3-none-any.whl (158kB)
    100% |████████████████████████████████| 163kB 7.1MB/s eta 0:00:01
Collecting chardet<3.1.0,>=3.0.2 (from requests<3.0.0dev,>=2.18.0->google-api-core<2.0.0dev,>=1.11.0->google-cloud-core<2.0dev,>=1.0.0->google-cloud-bigquery)
  Downloading https://files.pythonhosted.org/packages/bc/a9/01ffebfb562e4274b6487b4bb1ddec7ca55ec7510b22e4c51f14098443b8/chardet-3.0.4-py2.py3-none-any.whl (133kB)
    100% |████████████████████████████████| 143kB 7.5MB/s eta 0:00:01
Collecting idna<2.9,>=2.5 (from requests<3.0.0dev,>=2.18.0->google-api-core<2.0.0dev,>=1.11.0->google-cloud-core<2.0dev,>=1.0.0->google-cloud-bigquery)
  Downloading https://files.pythonhosted.org/packages/14/2c/cd551d81dbe15200be1cf41cd03869a46fe7226e7450af7a6545bfc474c9/idna-2.8-py2.py3-none-any.whl (58kB)
    100% |████████████████████████████████| 61kB 10.2MB/s ta 0:00:01
Collecting pyasn1>=0.1.3 (from rsa>=3.1.4->google-auth<2.0dev,>=0.4.0->google-api-core<2.0.0dev,>=1.11.0->google-cloud-core<2.0dev,>=1.0.0->google-cloud-bigquery)
  Downloading https://files.pythonhosted.org/packages/7b/7c/c9386b82a25115cccf1903441bba3cbadcfae7b678a20167347fa8ded34c/pyasn1-0.4.5-py2.py3-none-any.whl (73kB)
    100% |████████████████████████████████| 81kB 11.4MB/s ta 0:00:01
Building wheels for collected packages: googleapis-common-protos
  Running setup.py bdist_wheel for googleapis-common-protos ... done
  Stored in directory: /home/dgibbs/.cache/pip/wheels/9e/3d/a2/1bec8bb7db80ab3216dbc33092bb7ccd0debfb8ba42b5668d5
Successfully built googleapis-common-protos
Installing collected packages: setuptools, six, protobuf, google-resumable-media, pyasn1, rsa, pyasn1-modules, cachetools, google-auth, pytz, urllib3, certifi, chardet, idna, requests, googleapis-common-protos, google-api-core, google-cloud-core, google-cloud-bigquery
Successfully installed cachetools-3.1.1 certifi-2019.3.9 chardet-3.0.4 google-api-core-1.11.1 google-auth-1.6.3 google-cloud-bigquery-1.15.0 google-cloud-core-1.0.2 google-resumable-media-0.3.2 googleapis-common-protos-1.6.0 idna-2.8 protobuf-3.8.0 pyasn1-0.4.5 pyasn1-modules-0.2.5 pytz-2019.1 requests-2.22.0 rsa-4.0 setuptools-41.0.1 six-1.12.0 urllib3-1.25.3

Defining helper functions

In [8]:
import seaborn as se
from google.cloud import bigquery
In [17]:
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (20.0, 10.0)
plt.rcParams['font.family'] = "serif"
In [5]:
def runQuery ( client, qString, dryRun=False ):
  
  print ( "\n in runQuery ... " )
  if ( dryRun ):
    print ( "    dry-run only " )
    
  ## set up QueryJobConfig object
  job_config = bigquery.QueryJobConfig()
  job_config.dry_run = dryRun
  job_config.use_query_cache = True
  job_config.use_legacy_sql = False
  
  ## run the query
  try:
    query_job = client.query ( qString, job_config=job_config )
    ## print ( "    query job state: ", query_job.state )
  except:
    print ( "  FATAL ERROR: query execution failed " )
    return ( None )
  
  ## return results as a dataframe (or an empty dataframe for a dry-run) 
  if ( not dryRun ):
    try:
      df = query_job.to_dataframe()
      if ( query_job.total_bytes_processed==0 ):
        print ( "    the results for this query were previously cached " )
      else:
        print ( "    this query processed {} bytes ".format(query_job.total_bytes_processed) )
      if ( len(df) < 1 ):
        print ( "  WARNING: this query returned NO results ")
      return ( df )
    except:
      print ( "  FATAL ERROR: query execution failed " )
      return ( None )
    
  else:
    print ( "    if not cached, this query will process {} bytes ".format(query_job.total_bytes_processed) )
    ## return an empty dataframe
    return ( pd.DataFrame() )
 


def checkQueryResults ( qr ):
  print ( "\n in checkQueryResults ... " )
  if ( not isinstance(qr, pd.DataFrame) ):
    print ( " query execution failed! " )
    return ( False )
  else:
    if ( len(qr) > 0 ): 
      print ( " # of rows in query results: {} ".format(len(qr)) )
      print ( "\n", qr.head() )
    else:
      print ( " query returned NO results ?!? " )  
    return ( True )
In [10]:
# User your own project ID

project_id = 'your_project_number'
if project_id == 'your_project_number':
    print('Please update the project number with your Google Cloud Project')
else:
    bqclient = bigquery.Client(project=project_id)

Making barplots

In [21]:
# We define queries as strings #

sql = '''
SELECT
   icd_10,
   COUNT(*) as Count
FROM
   `isb-cgc.TCGA_bioclin_v0.Clinical`
GROUP BY
   1  -- this is the same as 'group by icd_10'
ORDER BY
   Count
'''

res0 = runQuery ( bqclient, sql, dryRun=False )

res0[0:5]
 in runQuery ... 
    this query processed 74236 bytes 
Out[21]:
icd_10 Count
0 C02.2 1
1 C51.9 1
2 C44.3 1
3 C53.1 1
4 C49.8 1
In [26]:
# first I'll subset the results,
# there's a lot of categories!
res1 = res0[res0['Count'] > 200]

# then we'll make a bar plot
p = se.barplot(data=res1, x = 'icd_10', y = 'Count')

Making scatter plots

In [27]:
sql = '''
SELECT
   avg_percent_neutrophil_infiltration,
   avg_percent_lymphocyte_infiltration
FROM
   `isb-cgc.TCGA_bioclin_v0.Biospecimen`
GROUP BY
   1,2
'''

res3 = runQuery ( bqclient, sql, dryRun=False )

res3[0:5]
 in runQuery ... 
    this query processed 138600 bytes 
Out[27]:
avg_percent_neutrophil_infiltration avg_percent_lymphocyte_infiltration
0 10.0 40.0
1 0.0 85.0
2 1.5 8.5
3 3.0 9.0
4 5.0 6.0
In [29]:
se.scatterplot(data=res3, x='avg_percent_lymphocyte_infiltration', y='avg_percent_neutrophil_infiltration')
Out[29]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f6da51a04e0>
In [ ]: