Check out more notebooks at our Community Notebooks Repository!
Title: How to convert bams to fastq files using samtools
Author: David L Gibbs
Created: 2019-07-29
Purpose: Convert a BAM file to FASTQ using the Google Genomics Pipelines API
In this example, we'll be using the Google Genomics Pipelines API. The pipelines API makes it easy to run a job without having to spin up and shut down a VM. It's all done automatically.
This work uses materials from https://cloud.google.com/genomics/docs/quickstart. An example YAML config file is found at: https://github.com/googlegenomics/pipelines-api-examples/blob/master/samtools/cloud/samtools.yaml
For this to work, we need to make sure that the Google Genomics API is enabled. To do that, from the main menu in the cloud console, select 'APIs & Services'. The API is called: genomics.googleapis.com.
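If you prefer the command line, the API can also be enabled with the gcloud CLI. This is a sketch assuming the Cloud SDK is installed and authenticated against the right project:

```shell
# Enable the Genomics API for the current project.
gcloud services enable genomics.googleapis.com

# Confirm it shows up in the enabled-services list.
gcloud services list --enabled --filter="name:genomics.googleapis.com"
```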
The software needs to be packaged in a Docker image. A lot of bioinformatics software is already containerized for you. See https://biocontainers.pro/#/
The BioContainers folks have removed all the 'latest' tags from their Docker images, so you need to specify a version tag explicitly.
# Note: this variable is the image name only (no 'docker pull' prefix);
# the pipeline YAML below names its own image.
docker_image = 'biocontainers/samtools:v1.7.0_cv4'
params='''
name: samtools
description: Run samtools on one bam
resources:
  zones:
  - us-west1-b
  disks:
  - name: datadisk
    autoDelete: True
    # Within the Docker container, specify a mount point for the disk.
    mountPoint: /mnt/data
docker:
  imageName: bhklab/samtools-1.9.0
  # The Pipelines API does not create the output directory.
  cmd: >
    mkdir /mnt/data/output &&
    find /mnt/data/input &&
    for file in $(/bin/ls /mnt/data/input/*.bam); do
      samtools index ${file};
      samtools sort -o sorted.bam -@ 3 ${file};
      samtools fastq sorted.bam > /mnt/data/output/output.fastq;
      mv sorted.bam /mnt/data/output/;
      mv ${file}.bai /mnt/data/output/;
    done
inputParameters:
- name: bamFile
  description: Cloud Storage path or pattern to input file(s)
  localCopy:
    path: input/
    disk: datadisk
outputParameters:
- name: outputPath
  description: Cloud Storage path to copy the samtools output to
  localCopy:
    path: output/*
    disk: datadisk
'''
# Write the pipeline definition to disk so gcloud can read it.
with open('samtools.yaml', 'w') as fout:
    fout.write(params)
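Before launching the job, it can be worth scanning the written file for the top-level keys the Pipelines API expects. `check_pipeline_config` is a hypothetical helper, and the plain string checks are deliberately simple so no PyYAML dependency is needed:

```python
def check_pipeline_config(text):
    """Return a list of expected pipelines-API keys missing from a config string."""
    required = ("name:", "resources:", "docker:", "imageName:",
                "cmd:", "inputParameters:", "outputParameters:")
    return [key for key in required if key not in text]

# In the notebook you would pass open('samtools.yaml').read(); a stub config
# stands in here so the snippet is self-contained.
stub = """name: samtools
resources:
docker:
  imageName: biocontainers/samtools:v1.7.0_cv4
  cmd: echo ok
inputParameters:
outputParameters:
"""
print(check_pipeline_config(stub))  # -> []
```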
See: https://cloud.google.com/sdk/gcloud/reference/alpha/genomics/pipelines/run for more parameters.
%set_env TEMP=gs://temp_bucket_123
env: TEMP=gs://temp_bucket_123
!gcloud alpha genomics pipelines run \
--pipeline-file samtools.yaml \
--inputs bamFile=gs://genomics-public-data/ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/pilot3_exon_targetted_GRCh37_bams/data/NA06986/alignment/NA06986.chromMT.ILLUMINA.bwa.CEU.exon_targetted.20100311.bam \
--outputs outputPath=gs://bam_bucket_1/ \
--logging "${TEMP}/samtools_run_logs.log" \
--disk-size datadisk:100
Running [operations/EIjDkrq_LRi6vJvY9qnH138g6YmPu4QUKg9wcm9kdWN0aW9uUXVldWU].
!gcloud alpha genomics operations wait EM2Uzre_LRijp-vcnvHNuqkBIOmJj7uEFCoPcHJvZHVjdGlvblF1ZXVl
Waiting for [operations/EM2Uzre_LRijp-vcnvHNuqkBIOmJj7uEFCoPcHJvZHVjdGlvblF1ZXVl]...done.
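Instead of blocking with `wait`, you can poll the operation yourself. A sketch, where `OPERATION_ID` is a placeholder for the ID printed by the run command:

```shell
# Describe a running operation; the 'done' field flips to true when it finishes.
# OPERATION_ID is a placeholder -- substitute your actual operation ID.
gcloud alpha genomics operations describe OPERATION_ID --format="value(done)"
```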