ISB-CGC Community Notebooks

Check out more notebooks at our Community Notebooks Repository!

Title:   How to work with cloud storage
Author:  David L Gibbs
Created: 2019-07-17
Purpose: Demonstrate how to move files in and out of Google Cloud Storage (GCS).

How to work with cloud storage.

In this notebook, we demonstrate how to work with files stored in GCS buckets.

Let's authenticate ourselves

In [1]:
# with gcloud, we can authenticate ourselves

!gcloud auth login
Your browser has been opened to visit:

    https://accounts.google.com/o/oauth2/auth?redirect_uri=http%3A%2F%2Flocalhost%3A8085%2F&prompt=select_account&response_type=code&client_id=32555940559.apps.googleusercontent.com&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.email+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fappengine.admin+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcompute+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Faccounts.reauth&access_type=offline


[13678:13697:0501/104705.579283:ERROR:browser_process_sub_thread.cc(217)] Waited 3 ms for network service
Opening in existing browser session.
WARNING: `gcloud auth login` no longer writes application default credentials.
If you need to use ADC, see:
  gcloud auth application-default --help

You are now logged in as [[email protected]].
Your current project is [isb-cgc-02-0001].  You can change this setting by running:
  $ gcloud config set project PROJECT_ID
In [2]:
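# and we can select our project (replace PROJECT_ID with your own project ID;
# the literal placeholder is rejected, as the error below shows)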

!gcloud config set project PROJECT_ID
ERROR: (gcloud.config.set) The project property must be set to a valid project ID, not the project name [PROJECT_ID]
To set your project, run:

  $ gcloud config set project PROJECT_ID

or to unset it, run:

  $ gcloud config unset project

Using 'bangs'.

Using a 'bang', the exclamation point (!), we can run command-line commands from a notebook cell. This includes file operations like 'ls', 'mv', and 'cp', as well as Google's tools such as 'gsutil'.
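
Bangs only work inside a notebook. As a minimal sketch of the same idea in plain Python, assuming a POSIX system where 'ls' is on the PATH, the standard 'subprocess' module can run the command. The helper name 'run_command' is ours for illustration and is not part of any library.

```python
import subprocess


def run_command(args):
    """Run a command given as a list of strings and return its standard output."""
    # check=True raises CalledProcessError if the command exits with a nonzero status
    result = subprocess.run(args, capture_output=True, text=True, check=True)
    return result.stdout


# Roughly what '!ls -lha' does in a notebook cell
print(run_command(["ls", "-lha"]))
```

Passing the command as a list avoids shell quoting problems with file names that contain spaces, such as several of the notebooks listed in this directory.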

In [3]:
# here we get a list of files stored *locally*
!ls -lha 
total 2.1M
drwxr-xr-x 3 davidgibbs davidgibbs 4.0K Apr 30 12:13  .
drwxr-xr-x 8 davidgibbs davidgibbs 4.0K Apr 30 10:07  ..
-rw-r--r-- 1 davidgibbs davidgibbs 7.9K Apr 30 10:08 'BCGSC microRNA expression.ipynb'
-rw-r--r-- 1 davidgibbs davidgibbs  19K Apr 30 11:06 'BRAF-V600 study using CCLE data.ipynb'
-rw-r--r-- 1 davidgibbs davidgibbs 174K Apr 30 10:08 'Copy Number segments.ipynb'
-rw-r--r-- 1 davidgibbs davidgibbs 111K Apr 30 10:45 'Creating TCGA cohorts -- part 1.ipynb'
-rw-r--r-- 1 davidgibbs davidgibbs  26K Apr 30 10:08 'Creating TCGA cohorts -- part 2.ipynb'
-rw-r--r-- 1 davidgibbs davidgibbs 130K Apr 30 10:08 'Creating TCGA cohorts -- part 3.ipynb'
-rw-r--r-- 1 davidgibbs davidgibbs  65K Apr 30 10:08 'DNA Methylation.ipynb'
-rw-r--r-- 1 davidgibbs davidgibbs  15K Apr 30 12:13  how_to_move_files.ipynb
drwxr-xr-x 2 davidgibbs davidgibbs 4.0K Apr 30 11:08  .ipynb_checkpoints
-rw-r--r-- 1 davidgibbs davidgibbs 362K Apr 30 10:08  isb_cgc_bam_slicing_with_pysam.ipynb
-rw-r--r-- 1 davidgibbs davidgibbs 362K Apr 30 10:08  ISB_cgc_bam_slicing_with_pysam.ipynb
-rw-r--r-- 1 davidgibbs davidgibbs 116K Apr 30 10:08  ISB_CGC_Query_of_the_Month_November_2018.ipynb
-rw-r--r-- 1 davidgibbs davidgibbs  22K Apr 30 10:08 'Protein expression.ipynb'
-rw-r--r-- 1 davidgibbs davidgibbs 2.5K Apr 30 10:08  README.md
-rw-r--r-- 1 davidgibbs davidgibbs 382K Apr 30 10:08  RegulomeExplorer_1_Gexpr_CNV.ipynb
-rw-r--r-- 1 davidgibbs davidgibbs  28K Apr 30 10:46  renamed_test.bam
-rw-r--r-- 1 davidgibbs davidgibbs 106K Apr 30 10:08 'Somatic Mutations.ipynb'
-rw-r--r-- 1 davidgibbs davidgibbs  40K Apr 30 10:08 'TCGA Annotations.ipynb'
-rw-r--r-- 1 davidgibbs davidgibbs  15K Apr 30 10:08 'The ISB-CGC open-access TCGA tables in BigQuery.ipynb'
-rw-r--r-- 1 davidgibbs davidgibbs  51K Apr 30 10:08 'UNC HiSeq mRNAseq gene expression.ipynb'
In [4]:
# Now we use gsutil to list files in our bucket.
!gsutil ls gs://bam_bucket_1/
gs://bam_bucket_1/renamed_test.bam
gs://bam_bucket_1/test.bam
gs://bam_bucket_1/test_2.bam
gs://bam_bucket_1/test_3.bam
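
The paths that gsutil prints have the form gs://<bucket>/<object name>. The Python client used later in this notebook takes the bucket name and the object name as separate arguments, so it can help to split such a path. A small sketch, where 'parse_gcs_uri' is a hypothetical helper name of our own:

```python
def parse_gcs_uri(uri):
    """Split 'gs://bucket/path/to/object' into (bucket_name, object_name)."""
    prefix = "gs://"
    if not uri.startswith(prefix):
        raise ValueError("not a gs:// URI: {}".format(uri))
    # Everything up to the first '/' is the bucket; the rest is the object name
    bucket_name, _, object_name = uri[len(prefix):].partition("/")
    if not bucket_name:
        raise ValueError("missing bucket name in: {}".format(uri))
    return bucket_name, object_name


print(parse_gcs_uri("gs://bam_bucket_1/test.bam"))
```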
In [5]:
# then we can copy a file from the bucket to our local environment
!gsutil cp gs://bam_bucket_1/test.bam test_dl.bam
Copying gs://bam_bucket_1/test.bam...
/ [1 files][ 27.5 KiB/ 27.5 KiB]                                                
Operation completed over 1 objects/27.5 KiB.                                     
In [6]:
# check that the file arrived
!ls -lha | grep test
-rw-r--r-- 1 davidgibbs davidgibbs  28K Apr 30 10:46 renamed_test.bam
-rw-r--r-- 1 davidgibbs davidgibbs  28K May  1 10:47 test_dl.bam
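
Matching file sizes are only a rough check. A stricter check compares checksums: for non-composite objects GCS records a base64-encoded MD5 digest, which 'gsutil ls -L' displays and the Python client exposes as 'blob.md5_hash'. The sketch below computes the same kind of value for a local file. The helper name 'local_md5_base64' is hypothetical, and the demo runs on a throwaway temporary file rather than on the notebook's data.

```python
import base64
import hashlib
import os
import tempfile


def local_md5_base64(path, chunk_size=1024 * 1024):
    """Return the base64-encoded MD5 digest of a local file, read in chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
    return base64.b64encode(md5.digest()).decode("ascii")


# Demo on a throwaway file; in this notebook you would pass 'test_dl.bam' instead
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example bytes")
demo_path = tmp.name
print(local_md5_base64(demo_path))
os.remove(demo_path)
```

If the printed value for a downloaded file equals the MD5 hash reported for the object in the bucket, the copy is intact.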
In [7]:
# rename the file locally, then copy it back to our bucket
!mv test_dl.bam renamed_test.bam
!gsutil cp renamed_test.bam gs://bam_bucket_1/renamed_test.bam
Copying file://renamed_test.bam [Content-Type=application/octet-stream]...
/ [1 files][ 27.5 KiB/ 27.5 KiB]                                                
Operation completed over 1 objects/27.5 KiB.                                     

Using Python.

In [8]:
# uncomment the next line to install the library (it is imported in a later cell)
#!pip3 install --upgrade --user google-cloud-storage
In [12]:
!gcloud auth application-default login
#Your browser has been opened to visit:
#
#    https://accounts.google.com/o/oauth2/auth?redirect_uri=.....
#
#Credentials saved to file: [/home/davidgibbs/.config/gcloud/application_default_credentials.json]
Your browser has been opened to visit:

    https://accounts.google.com/o/oauth2/auth?redirect_uri=http%3A%2F%2Flocalhost%3A8085%2F&prompt=select_account&response_type=code&client_id=764086051850-6qr4p6gpi6hn506pt8ejuq83di341hur.apps.googleusercontent.com&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.email+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform&access_type=offline


[14422:14441:0501/104735.895643:ERROR:browser_process_sub_thread.cc(217)] Waited 8 ms for network service
Opening in existing browser session.

Credentials saved to file: [/home/davidgibbs/.config/gcloud/application_default_credentials.json]

These credentials will be used by any library that requests
Application Default Credentials.

To generate an access token for other uses, run:
  gcloud auth application-default print-access-token
In [18]:
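# A bang runs in its own subshell, so '!export' cannot change this kernel's environment.
# Set the variable from Python instead (google.auth also checks this default path itself).
import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/home/davidgibbs/.config/gcloud/application_default_credentials.json"
import google.auth
import google.cloud.storage as storage
credentials, project = google.auth.default()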
/home/davidgibbs/.local/lib/python3.6/site-packages/google/auth/_default.py:66: UserWarning: Your application has authenticated using end user credentials from Google Cloud SDK. We recommend that most server applications use service accounts instead. If your application continues to use end user credentials from Cloud SDK, you might receive a "quota exceeded" or "API not enabled" error. For more information about service accounts, see https://cloud.google.com/docs/authentication/
  warnings.warn(_CLOUD_SDK_CREDENTIALS_WARNING)
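
A note on environment variables: a variable exported with a bang lives only in the short-lived shell that the bang starts, so the Python kernel never sees it. The sketch below shows the difference, assuming a POSIX shell. 'DEMO_VAR' is a made-up variable name used only for illustration.

```python
import os
import subprocess

# Start from a clean slate in case the variable is already set
os.environ.pop("DEMO_VAR", None)

# An export inside a child shell disappears when that shell exits
subprocess.run("export DEMO_VAR=from_subshell", shell=True, check=True)
seen_after_subshell = os.environ.get("DEMO_VAR")

# Setting os.environ changes this process and is inherited by later child processes
os.environ["DEMO_VAR"] = "from_python"
seen_by_child = subprocess.run(
    "echo $DEMO_VAR", shell=True, capture_output=True, text=True, check=True
).stdout.strip()

print(seen_after_subshell, seen_by_child)
```

In a notebook, the '%env' magic is another way to set a variable for the running kernel.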
In [19]:
# create a client connected to our project
storage_client = storage.Client()
/home/davidgibbs/.local/lib/python3.6/site-packages/google/auth/_default.py:66: UserWarning: Your application has authenticated using end user credentials from Google Cloud SDK. We recommend that most server applications use service accounts instead. If your application continues to use end user credentials from Cloud SDK, you might receive a "quota exceeded" or "API not enabled" error. For more information about service accounts, see https://cloud.google.com/docs/authentication/
  warnings.warn(_CLOUD_SDK_CREDENTIALS_WARNING)
In [20]:
for b in storage_client.list_buckets():
    print(b)
<Bucket: artifacts.isb-cgc-02-0001.appspot.com>
<Bucket: bam_bucket_1>
<Bucket: dataproc-6b064c10-086c-44db-b3b5-f14e410e0c13-us>
<Bucket: dave_scratch_cosmic_v86>
<Bucket: daves-cromwell-bucket>
<Bucket: gibbs_bucket_nov162016>
<Bucket: isb-cgc-02-0001>
<Bucket: isb-cgc-02-0001-datalab>
<Bucket: isb-cgc-02-0001-scratch>
<Bucket: isb-cgc-02-0001-workflows>
<Bucket: isb_dataproc_oct28>
<Bucket: may_2018_qotm>
<Bucket: pancan_staging>
<Bucket: public_bucket_for_data_file_lists>
<Bucket: qotm_nov>
<Bucket: qotm_oct_2018>
<Bucket: qotm_oct_20182018-10-29-23-47-53>
<Bucket: qotm_oct_20182018-10-31-23-34-51>
<Bucket: smr-workspace-mlengine>
<Bucket: test_bucket_888>
<Bucket: us.artifacts.isb-cgc-02-0001.appspot.com>
<Bucket: vm-config.isb-cgc-02-0001.appspot.com>
<Bucket: vm-containers.isb-cgc-02-0001.appspot.com>
<Bucket: wild_new_bucket>
In [22]:
# here we'll create a bucket (bucket names must be globally unique)
bucket = storage_client.create_bucket('wild_new_bucket_2000')
print('Bucket {} created'.format(bucket.name))
Bucket wild_new_bucket_2000 created
In [27]:
# and then upload a local file to it (the local copy is kept)
bucket = storage_client.get_bucket('wild_new_bucket_2000')
blob = bucket.blob('test_dl_upload.bam')
blob.upload_from_filename('renamed_test.bam')
In [29]:
# and check that it arrived
blobs = bucket.list_blobs()
for blob in blobs:
    print(blob.name)
test_dl_upload.bam
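
The cells above upload a file with the Python client. The reverse direction can be sketched the same way. This assumes a client like the 'storage_client' created earlier. 'download_blob' is a hypothetical helper name, and the local file name in the usage comment is only an example.

```python
def download_blob(client, bucket_name, blob_name, destination):
    """Download one object from a bucket to a local file and return the local path."""
    bucket = client.bucket(bucket_name)     # builds a reference, no API request yet
    blob = bucket.blob(blob_name)
    blob.download_to_filename(destination)  # this call performs the download
    return destination


# Usage in this notebook (needs credentials and network access):
# download_blob(storage_client, 'wild_new_bucket_2000', 'test_dl_upload.bam', 'from_python.bam')
```

Passing the client in as an argument keeps the helper free of global state and makes it easy to try out with a stand-in object before touching a real bucket.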
In [ ]:
# end of notebook