ISB-CGC Community Notebooks¶

Check out more notebooks at our Community Notebooks Repository!

Title:   How to convert 10X bams to fastq files using dsub
Author:  David L Gibbs
Created: 2019-08-07
Purpose: Demonstrate how to make fastq files from 10X bams
Notes:

How to use dsub to convert 10X bam files to fastqs

In this example, we'll use DataBiosphere's dsub. dsub makes it easy to run a job in the cloud without manually spinning up and shutting down a VM; that's all handled automatically.

https://github.com/DataBiosphere/dsub

Docs for the genomics pipeline run: https://cloud.google.com/sdk/gcloud/reference/alpha/genomics/pipelines/run

For this to work, the Google Genomics API must be enabled. From the main menu in the Cloud Console, select 'APIs & Services' and enable the API named genomics.googleapis.com.
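If you prefer the command line, the same API can typically be enabled with gcloud (this assumes the Cloud SDK is installed and authenticated against your project):

```shell
# Enable the Genomics API for the currently configured project.
gcloud services enable genomics.googleapis.com
```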
In [45]:
# first to install dsub,
# it's also possible to install it directly from 
# github

!pip install dsub
Requirement already satisfied: dsub in ./.local/lib/python2.7/site-packages
Requirement already satisfied: oauth2client in /usr/local/lib/python2.7/dist-packages (from dsub)
Requirement already satisfied: six in /usr/local/lib/python2.7/dist-packages (from dsub)
Requirement already satisfied: python-dateutil in /usr/local/lib/python2.7/dist-packages (from dsub)
Requirement already satisfied: pyyaml in /usr/local/lib/python2.7/dist-packages (from dsub)
Requirement already satisfied: pytz in /usr/local/lib/python2.7/dist-packages (from dsub)
Requirement already satisfied: parameterized in ./.local/lib/python2.7/site-packages (from dsub)
Requirement already satisfied: google-api-python-client in /usr/local/lib/python2.7/dist-packages (from dsub)
Requirement already satisfied: retrying in /usr/local/lib/python2.7/dist-packages (from dsub)
Requirement already satisfied: tabulate in ./.local/lib/python2.7/site-packages (from dsub)
Requirement already satisfied: rsa>=3.1.4 in /usr/local/lib/python2.7/dist-packages (from oauth2client->dsub)
Requirement already satisfied: httplib2>=0.9.1 in /usr/local/lib/python2.7/dist-packages (from oauth2client->dsub)
Requirement already satisfied: pyasn1-modules>=0.0.5 in /usr/local/lib/python2.7/dist-packages (from oauth2client->dsub)
Requirement already satisfied: pyasn1>=0.1.7 in /usr/local/lib/python2.7/dist-packages (from oauth2client->dsub)
Requirement already satisfied: google-auth>=1.4.1 in /usr/local/lib/python2.7/dist-packages (from google-api-python-client->dsub)
Requirement already satisfied: google-auth-httplib2>=0.0.3 in /usr/local/lib/python2.7/dist-packages (from google-api-python-client->dsub)
Requirement already satisfied: uritemplate<4dev,>=3.0.0 in /usr/local/lib/python2.7/dist-packages (from google-api-python-client->dsub)
Requirement already satisfied: cachetools>=2.0.0 in /usr/local/lib/python2.7/dist-packages (from google-auth>=1.4.1->google-api-python-client->dsub)
In [10]:
# let's see if it's installed OK

!pip show dsub
Name: dsub
Version: 0.3.2
Summary: A command-line tool that makes it easy to submit and run batch scripts in the cloud
Home-page: https://github.com/DataBiosphere/dsub
Author: Verily
Author-email: UNKNOWN
License: Apache
Location: /home/jupyter/.local/lib/python2.7/site-packages
Requires: oauth2client, six, python-dateutil, pyyaml, pytz, parameterized, google-api-python-client, retrying, tabulate
In [13]:
# pip installed dsub into ~/.local/bin, which is not on PATH yet,
# so we call it by its full path (running it with no arguments prints usage)

!~/.local/bin/dsub
usage: /home/jupyter/.local/bin/dsub [-h] [--provider PROVIDER]
                                     [--version VERSION] [--unique-job-id]
                                     [--name NAME]
                                     [--tasks [FILE M-N [FILE M-N ...]]]
                                     [--image IMAGE] [--dry-run]
                                     [--command COMMAND] [--script SCRIPT]
                                     [--env [KEY=VALUE [KEY=VALUE ...]]]
                                     [--label [KEY=VALUE [KEY=VALUE ...]]]
                                     [--input [KEY=REMOTE_PATH [KEY=REMOTE_PATH ...]]]
                                     [--input-recursive [KEY=REMOTE_PATH [KEY=REMOTE_PATH ...]]]
                                     [--output [KEY=REMOTE_PATH [KEY=REMOTE_PATH ...]]]
                                     [--output-recursive [KEY=REMOTE_PATH [KEY=REMOTE_PATH ...]]]
                                     [--user USER]
                                     [--user-project USER_PROJECT]
                                     [--mount [KEY=PATH_SPEC [KEY=PATH_SPEC ...]]]
                                     [--wait] [--retries RETRIES]
                                     [--poll-interval POLL_INTERVAL]
                                     [--after AFTER [AFTER ...]] [--skip]
                                     [--min-cores MIN_CORES]
                                     [--min-ram MIN_RAM]
                                     [--disk-size DISK_SIZE]
                                     [--logging LOGGING] [--project PROJECT]
                                     [--boot-disk-size BOOT_DISK_SIZE]
                                     [--preemptible]
                                     [--zones ZONES [ZONES ...]]
                                     [--scopes SCOPES [SCOPES ...]]
                                     [--accelerator-type ACCELERATOR_TYPE]
                                     [--accelerator-count ACCELERATOR_COUNT]
                                     [--keep-alive KEEP_ALIVE]
                                     [--regions REGIONS [REGIONS ...]]
                                     [--machine-type MACHINE_TYPE]
                                     [--cpu-platform CPU_PLATFORM]
                                     [--network NETWORK]
                                     [--subnetwork SUBNETWORK]
                                     [--use-private-address]
                                     [--timeout TIMEOUT]
                                     [--log-interval LOG_INTERVAL] [--ssh]
                                     [--nvidia-driver-version NVIDIA_DRIVER_VERSION]
                                     [--service-account SERVICE_ACCOUNT]
                                     [--disk-type DISK_TYPE]
                                     [--enable-stackdriver-monitoring]
/home/jupyter/.local/bin/dsub: error: argument --project is required
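Rather than typing the full path every time, you can prepend `~/.local/bin` to PATH for this notebook session; a minimal sketch:

```python
import os

# Prepend the pip user-install bin directory to PATH so that plain
# 'dsub', 'dstat', and 'ddel' invocations resolve in later cells.
local_bin = os.path.expanduser("~/.local/bin")
if local_bin not in os.environ["PATH"].split(os.pathsep):
    os.environ["PATH"] = local_bin + os.pathsep + os.environ["PATH"]
```

Note this only affects the current notebook kernel (and `!` subshells it spawns); it does not change your login shell's PATH.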
In [16]:
# hello world test

# using the local provider (--provider local)
# is a faster way to develop the task

! ~/.local/bin/dsub \
   --provider local \
   --logging /tmp/dsub-test/logging/ \
   --output OUT=/tmp/dsub-test/output/out.txt \
   --command 'echo "Hello World" > "${OUT}"' \
   --wait
Job: echo--jupyter--190808-173557-030088
Launched job-id: echo--jupyter--190808-173557-030088
To check the status, run:
  dstat --provider local --jobs 'echo--jupyter--190808-173557-030088' --users 'jupyter' --status '*'
To cancel the job, run:
  ddel --provider local --jobs 'echo--jupyter--190808-173557-030088' --users 'jupyter'
Waiting for job to complete...
Waiting for: echo--jupyter--190808-173557-030088.
  echo--jupyter--190808-173557-030088: SUCCESS
echo--jupyter--190808-173557-030088
In [17]:
# and we can check the output
!cat /tmp/dsub-test/output/out.txt
Hello World
In [43]:
# dsub can also run a full shell script.
# Note: bamtofastq requires its output directory to not already exist,
# so we write into a new 'fastq' subdirectory of the dsub output folder.

cmd = '''
apt-get update;
apt-get --yes install wget;
wget http://cf.10xgenomics.com/misc/bamtofastq;
chmod +x bamtofastq;
OUTPUT_DIR="${OUTPUT_FOLDER}/fastq";
./bamtofastq ${INPUT_FILE} ${OUTPUT_DIR};'''

with open('job.sh', 'w') as fout:
    fout.write(cmd)

!cat job.sh
apt-get update;
apt-get --yes install wget;
wget http://cf.10xgenomics.com/misc/bamtofastq;
chmod +x bamtofastq;
OUTPUT_DIR="${OUTPUT_FOLDER}/fastq";
./bamtofastq ${INPUT_FILE} ${OUTPUT_DIR};
In [18]:
# dsub defaults to an Ubuntu image,
# which is convenient because the bamtofastq binary runs on Ubuntu
In [44]:
!~/.local/bin/dsub \
    --provider google-v2 \
    --project cgc-05-0180 \
    --zones "us-west1-*" \
    --script job.sh \
    --input INPUT_FILE="gs://cgc_bam_bucket_007/pbmc_1k_protein_v3_possorted_genome_bam.bam" \
    --output-recursive OUTPUT_FOLDER="gs://cgc_output/testout/" \
    --disk-size 200 \
    --logging "gs://cgc_temp_02/testout" \
    --wait

# Note: an earlier run failed with
#   error creating output directory: "/mnt/data/output/gs/cruk_data_02". Does it already exist?
# bamtofastq refuses to write into an existing directory, which is why job.sh
# targets a new 'fastq' subdirectory rather than OUTPUT_FOLDER itself.
Job: job--jupyter--190808-184740-70
Launched job-id: job--jupyter--190808-184740-70
To check the status, run:
  dstat --provider google-v2 --project cgc-05-0180 --jobs 'job--jupyter--190808-184740-70' --users 'jupyter' --status '*'
To cancel the job, run:
  ddel --provider google-v2 --project cgc-05-0180 --jobs 'job--jupyter--190808-184740-70' --users 'jupyter'
Waiting for job to complete...
Waiting for: job--jupyter--190808-184740-70.
  job--jupyter--190808-184740-70: SUCCESS
job--jupyter--190808-184740-70

That's it! We can check the output with:

In [ ]:
!gsutil ls gs://cgc_output/testout/fastq
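To convert many BAMs at once, dsub's `--tasks` flag runs one task per row of a tab-separated file. A sketch that generates such a file (the bucket paths here are hypothetical placeholders, not real ISB-CGC buckets):

```python
import csv

# Hypothetical BAM locations and output root -- replace with your own buckets.
bams = [
    "gs://my-bucket/sample1_possorted_genome_bam.bam",
    "gs://my-bucket/sample2_possorted_genome_bam.bam",
]
out_root = "gs://my-bucket/fastq-out"

with open("tasks.tsv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    # The header row tells dsub how to interpret each column.
    writer.writerow(["--input INPUT_FILE", "--output-recursive OUTPUT_FOLDER"])
    for bam in bams:
        sample = bam.rsplit("/", 1)[-1].rsplit(".bam", 1)[0]
        writer.writerow([bam, f"{out_root}/{sample}/"])

print(open("tasks.tsv").read())
```

You would then launch every row as its own task with a single submission, e.g. `dsub ... --script job.sh --tasks tasks.tsv`, and dsub substitutes one row's values per task.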