Check out more notebooks at our Community Notebooks Repository!
Title: How to convert 10X BAMs to FASTQ files using dsub
Author: David L Gibbs
Created: 2019-08-07
Purpose: Demonstrate how to make fastq files from 10X bams
Notes:
In this example, we'll be using DataBiosphere's dsub. dsub makes it easy to run a job without having to spin up and shut down a VM. It's all done automatically.
https://github.com/DataBiosphere/dsub
Docs for the genomics pipeline run: https://cloud.google.com/sdk/gcloud/reference/alpha/genomics/pipelines/run
For this to work, the Google Genomics API must be enabled. To enable it, select 'APIs & Services' from the main menu in the Cloud Console. The API is called genomics.googleapis.com.
# First, install dsub.
# It can also be installed directly from GitHub.
!pip install dsub
# let's see if it's installed OK
!pip show dsub
# pip installs executables into ~/.local/bin, which is not on PATH yet,
# so we call dsub by its full path
!~/.local/bin/dsub
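As an alternative to typing the full path each time, the directory can be added to PATH from inside the notebook. This is a minimal sketch: `add_to_path` is a hypothetical helper, not part of dsub, and it assumes that pip placed the executable in `~/.local/bin`, as in the cell above. Shell commands started with `!` should inherit the kernel's environment, so a plain `!dsub` is expected to work afterwards.

```python
import os


def add_to_path(directory):
    """Prepend `directory` to PATH for this process, unless already listed.

    Hypothetical helper for this notebook; not part of dsub.
    """
    directory = os.path.expanduser(directory)
    entries = [e for e in os.environ.get("PATH", "").split(os.pathsep) if e]
    if directory not in entries:
        os.environ["PATH"] = os.pathsep.join([directory] + entries)
    return os.environ["PATH"]


# Assumption: pip's per-user script directory is ~/.local/bin
add_to_path("~/.local/bin")
```

The change only lasts for the current kernel session, so the cell has to be rerun after a restart.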
# hello world test
# using the local provider (--provider local)
# is a faster way to develop the task
! ~/.local/bin/dsub \
--provider local \
--logging /tmp/dsub-test/logging/ \
--output OUT=/tmp/dsub-test/output/out.txt \
--command 'echo "Hello World" > "${OUT}"' \
--wait
# and we can check the output
!cat /tmp/dsub-test/output/out.txt
# dsub can take a shell script..
cmd = '''
apt-get update;
apt-get --yes install wget;
wget http://cf.10xgenomics.com/misc/bamtofastq;
chmod +x bamtofastq;
OUTPUT_DIR="$OUTPUT_FOLDER/fastq";
./bamtofastq "${INPUT_FILE}" "${OUTPUT_DIR}";'''
# write the script to disk; the with-block closes the file for us
with open('job.sh', 'w') as fout:
    fout.write(cmd)
!cat job.sh
# dsub runs the job in an Ubuntu image by default,
# which is great, because bamtofastq is compatible
!~/.local/bin/dsub \
--provider google-v2 \
--project cgc-05-0180 \
--zones "us-west1-*" \
--script job.sh \
--input INPUT_FILE="gs://cgc_bam_bucket_007/pbmc_1k_protein_v3_possorted_genome_bam.bam" \
--output-recursive OUTPUT_FOLDER="gs://cgc_output/testout/" \
--disk-size 200 \
--logging "gs://cgc_temp_02/testout" \
--wait
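# Note: bamtofastq creates its output directory itself and fails if that directory already exists:
#   error: error creating output directory: "/mnt/data/output/gs/cruk_data_02". Does it already exist?
# Writing to a new fastq/ subdirectory of OUTPUT_FOLDER, as job.sh does, avoids this.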
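When the same job has to be submitted for several BAM files, it can help to assemble the dsub call in Python rather than retyping it. The sketch below only builds the argument list from the flags used in the cell above. `build_dsub_command` and its parameter names are hypothetical, and the project and bucket names are simply the ones from this notebook.

```python
import shlex


def build_dsub_command(project, zones, script, input_file, output_folder,
                       logging, disk_size_gb=200, dsub="dsub"):
    """Return the dsub invocation as an argument list.

    Hypothetical helper; it uses only the flags shown in the cell above.
    """
    return [
        dsub,
        "--provider", "google-v2",
        "--project", project,
        "--zones", zones,
        "--script", script,
        "--input", f"INPUT_FILE={input_file}",
        "--output-recursive", f"OUTPUT_FOLDER={output_folder}",
        "--disk-size", str(disk_size_gb),
        "--logging", logging,
        "--wait",
    ]


cmd_args = build_dsub_command(
    project="cgc-05-0180",
    zones="us-west1-*",
    script="job.sh",
    input_file="gs://cgc_bam_bucket_007/pbmc_1k_protein_v3_possorted_genome_bam.bam",
    output_folder="gs://cgc_output/testout/",
    logging="gs://cgc_temp_02/testout",
)

# Print a shell-quoted version for inspection before submitting anything
print(" ".join(shlex.quote(arg) for arg in cmd_args))
```

Passing the list to `subprocess.run(cmd_args, check=True)` would submit the job. Because the arguments are passed as a list, no shell is involved and the `*` in the zone pattern reaches dsub unexpanded.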
That's it! We can check the output with:
!gsutil ls -r gs://cgc_output/testout/