ISB-CGC Community Notebooks

Check out more notebooks at our Community Notebooks Repository!

Title:   How to convert bams to fastq files using samtools
Author:  David L Gibbs
Created: 2019-07-29
Purpose: Convert a bam file, to fastq, using the Google Genomics Pipelines API

Using samtools to index, sort, and convert a bam file to a fastq file.

In this example, we'll be using the Google Genomics Pipelines API. The pipelines API makes it easy to run a job without having to spin up and shut down a VM. It's all done automatically.

The work is uses materials from https://cloud.google.com/genomics/docs/quickstart. An example yaml config file is found at: https://github.com/googlegenomics/pipelines-api-examples/blob/master/samtools/cloud/samtools.yaml

For this to work, we need to make sure that the Google Genomics API is enabled. To do that, from the main menu in the cloud console, select 'APIs & Services'. The API is called: genomics.googleapis.com.

The software needs to be in a docker image. Lots of biologically related software is already bundled up for you. See https://biocontainers.pro/#/

The biocontainers folks have removed all the 'latest' tags from their docker images, so you need to specify a version

docker_image = 'docker pull biocontainers/samtools:v1.7.0_cv4'

See: http://www.htslib.org/doc/samtools-1.9.html

In [28]:
params='''
name: bedtools
description: Run samtools on one bam

resources:
  zones:
  - us-west1-b

  disks:
  - name: datadisk
    autoDelete: True

    # Within the Docker container, specify a mount point for the disk.
    mountPoint: /mnt/data

docker:
  imageName: bhklab/samtools-1.9.0

  # The Pipelines API does not create the output directory.
  cmd: >
    mkdir /mnt/data/output &&
    find /mnt/data/input &&
    for file in $(/bin/ls /mnt/data/input/*.bam); do
       samtools index ${file};
       samtools sort -o sorted.bam [email protected] 3 ${file}; 
       samtools fastq sorted.bam  > /mnt/data/output/output.fastq;
       mv sorted.bam /mnt/data/output/;       
       mv ${file}.bai /mnt/data/output/;
    done


inputParameters:
- name: bamFile
  description: Cloud Storage path or pattern to input file(s)
  localCopy:
    path: input/
    disk: datadisk


outputParameters:
- name: outputPath
  description: Cloud Storage path for where to samtools output
  localCopy:
    path: output/*
    disk: datadisk
'''

fout = open('samtools.yaml','w')
fout.write(params)
fout.close()
In [6]:
%set_env TEMP=gs://temp_bucket_123
env: TEMP=gs://temp_bucket_123
In [29]:
!gcloud alpha genomics pipelines run \
    --pipeline-file samtools.yaml \
    --inputs bamFile=gs://genomics-public-data/ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/pilot3_exon_targetted_GRCh37_bams/data/NA06986/alignment/NA06986.chromMT.ILLUMINA.bwa.CEU.exon_targetted.20100311.bam \
    --outputs outputPath=gs://bam_bucket_1/ \
    --logging "${TEMP}/samtools_run_logs.log" \
    --disk-size datadisk:100
Running [operations/EIjDkrq_LRi6vJvY9qnH138g6YmPu4QUKg9wcm9kdWN0aW9uUXVldWU].
In [23]:
!gcloud alpha genomics operations wait EM2Uzre_LRijp-vcnvHNuqkBIOmJj7uEFCoPcHJvZHVjdGlvblF1ZXVl
Waiting for [operations/EM2Uzre_LRijp-vcnvHNuqkBIOmJj7uEFCoPcHJvZHVjdGlvblF1ZXV
l]...done.
In [ ]: