Playing with the new Biopython KGML rendering module

Biopython 1.65 was released on 17th December 2014. One of the new additions to the libraries was the Bio.Graphics.KGML module, which can be used for customised renderings of KEGG metabolic pathways.

This notebook is an introduction to the Bio.KEGG and Bio.Graphics.KGML modules, providing examples of usage, including data retrieval, manipulation and output rendering.

0. Setting up

We need to import Biopython and parts of the KEGG and KGML modules, but first we'll check that a suitably recent version of Biopython is installed (1.65 or higher):

In [1]:
import Bio
Bio.__version__
Out[1]:
'1.65+'

We're going to be using some other Biopython modules, and working with images in the notebook, so we also prepare for this.

In [2]:
from Bio import SeqIO
from Bio.KEGG.REST import *
from Bio.KEGG.KGML import KGML_parser
from Bio.Graphics.KGML_vis import KGMLCanvas
from Bio.Graphics.ColorSpiral import ColorSpiral

from IPython.display import Image, HTML

import random

# A bit of code that will help us display the PDF output
def PDF(filename):
    return HTML('<iframe src=%s width=700 height=350></iframe>' % filename)

# A bit of helper code to shorten long text
def head(text, lines=10):
    """ Print the first lines lines of the passed text.
    """
    print '\n'.join(text.split('\n')[:lines] + ['[...]'])

1. Retrieving data from KEGG with Bio.KEGG.REST

The KEGG database provides a RESTful interface to its data (see http://www.kegg.jp/kegg/rest/), which we can use via Biopython's Bio.KEGG.REST module.

The KEGG interface is not as well documented as some other resources (such as NCBI or Ensembl), and KEGG does not provide any usage guidelines. To avoid risking overloading the service, Biopython restricts us to three calls per second. Be warned also that the conditions of service include:

"This service should not be used for bulk data downloads".

1a. Obtaining information about a database with kegg_info()

We can obtain basic information about a particular database (e.g. "pathway", "kegg", "hsa", etc.) using the kegg_info() function. This returns a handle which can be read as plaintext, with some useful information about the resource.

In [3]:
# Kyoto Encyclopedia of Genes and Genomes
print(kegg_info("kegg").read())
kegg             Kyoto Encyclopedia of Genes and Genomes
kegg             Release 72.0+/12-21, Dec 14
                 Kanehisa Laboratories
                 pathway     338,948 entries
                 brite       119,262 entries
                 module      275,779 entries
                 disease       1,402 entries
                 drug         10,130 entries
                 environ         849 entries
                 orthology    18,334 entries
                 genome        3,515 entries
                 genes     15,319,849 entries
                 dgenes      654,883 entries
                 compound     17,343 entries
                 glycan       10,987 entries
                 reaction      9,775 entries
                 rpair        14,849 entries
                 rclass        2,945 entries
                 enzyme        6,415 entries

In [4]:
# KEGG Pathway Database
print(kegg_info("pathway").read())
pathway          KEGG Pathway Database
path             Release 72.0+/12-21, Dec 14
                 Kanehisa Laboratories
                 338,948 entries

In [5]:
# Escherichia coli K-12 MG1655 KEGG Genes Database
print(kegg_info("eco").read())
T00007           Escherichia coli K-12 MG1655 KEGG Genes Database
eco              Release 72.0+/12-20, Dec 14
                 Kanehisa Laboratories
                 4,497 entries

1b. Obtaining a list of entries in a database with kegg_list()

The kegg_list() function retrieves a handle, which provides a plaintext tab-separated key:value list of entry identifiers and definitions in the given database or set of database entries.

The kegg_list() function allows the organism to be specified for a database, with an optional second argument.

In [6]:
# List all pathways in the pathway database
head(kegg_list('pathway').read())
path:map00010	Glycolysis / Gluconeogenesis
path:map00020	Citrate cycle (TCA cycle)
path:map00030	Pentose phosphate pathway
path:map00040	Pentose and glucuronate interconversions
path:map00051	Fructose and mannose metabolism
path:map00052	Galactose metabolism
path:map00053	Ascorbate and aldarate metabolism
path:map00061	Fatty acid biosynthesis
path:map00062	Fatty acid elongation
path:map00071	Fatty acid degradation
[...]
In [7]:
# Only list pathways present in E. coli K-12 MG1655
head(kegg_list('pathway', 'eco').read())
path:eco00010	Glycolysis / Gluconeogenesis - Escherichia coli K-12 MG1655
path:eco00020	Citrate cycle (TCA cycle) - Escherichia coli K-12 MG1655
path:eco00030	Pentose phosphate pathway - Escherichia coli K-12 MG1655
path:eco00040	Pentose and glucuronate interconversions - Escherichia coli K-12 MG1655
path:eco00051	Fructose and mannose metabolism - Escherichia coli K-12 MG1655
path:eco00052	Galactose metabolism - Escherichia coli K-12 MG1655
path:eco00053	Ascorbate and aldarate metabolism - Escherichia coli K-12 MG1655
path:eco00061	Fatty acid biosynthesis - Escherichia coli K-12 MG1655
path:eco00071	Fatty acid degradation - Escherichia coli K-12 MG1655
path:eco00130	Ubiquinone and other terpenoid-quinone biosynthesis - Escherichia coli K-12 MG1655
[...]

If only the organism is specified, then a list of gene entries is returned.

In [8]:
# E. coli K-12 MG1655 genes
head(kegg_list('eco').read())
eco:b0001	thrL; thr operon leader peptide; K08278 thr operon leader peptide
eco:b0002	thrA; Bifunctional aspartokinase/homoserine dehydrogenase 1 (EC:1.1.1.3 2.7.2.4); K12524 bifunctional aspartokinase / homoserine dehydrogenase 1 [EC:2.7.2.4 1.1.1.3]
eco:b0003	thrB; homoserine kinase (EC:2.7.1.39); K00872 homoserine kinase [EC:2.7.1.39]
eco:b0004	thrC; L-threonine synthase (EC:4.2.3.1); K01733 threonine synthase [EC:4.2.3.1]
eco:b0005	yaaX; DUF2502 family putative periplasmic protein
eco:b0006	yaaA; peroxide resistance protein, lowers intracellular iron; K09861 hypothetical protein
eco:b0007	yaaJ; putative transporter; K03310 alanine or glycine:cation symporter, AGCS family
eco:b0008	talB; transaldolase B (EC:2.2.1.2); K00616 transaldolase [EC:2.2.1.2]
eco:b0009	mog; molybdochelatase incorporating molybdenum into molybdopterin; K03831 molybdopterin adenylyltransferase [EC:2.7.7.75]
eco:b0010	satP; succinate-acetate transporter; K07034
[...]

The KEGG API also allows the + operator to be used to combine list identifiers.

In [9]:
# Compound entry C01290 and glycan entry G00092
print(kegg_list('C01290+G00092').read())
cpd:C01290	beta-D-Galactosyl-(1->4)-beta-D-glucosyl-(1<->1)-ceramide; beta-D-Galactosyl-1,4-beta-D-glucosylceramide; Lactosylceramide; Gal-beta1->4Glc-beta1->1'Cer; LacCer; Lactosyl-N-acylsphingosine; D-Galactosyl-1,4-beta-D-glucosylceramide
gl:G00092	Lactosylceramide; CDw17; LacCer; (Gal)1 (Glc)1 (Cer)1; Glycolipid; Sphingolipid

But kegg_list() and other functions will take a list of identifiers, and combine them automatically:

In [10]:
# Compound entry C01290 and glycan entry G00092
print(kegg_list(['C01290', 'G00092']).read())
cpd:C01290	beta-D-Galactosyl-(1->4)-beta-D-glucosyl-(1<->1)-ceramide; beta-D-Galactosyl-1,4-beta-D-glucosylceramide; Lactosylceramide; Gal-beta1->4Glc-beta1->1'Cer; LacCer; Lactosyl-N-acylsphingosine; D-Galactosyl-1,4-beta-D-glucosylceramide
gl:G00092	Lactosylceramide; CDw17; LacCer; (Gal)1 (Glc)1 (Cer)1; Glycolipid; Sphingolipid

1c. Finding a specific entry with kegg_find()

The kegg_find(database, query) function finds entries with matching query keywords or other query data in a given database. Data is returned as tab-separated key:value plain text.

In [11]:
# Find shiga toxin genes
head(kegg_find('genes', 'shiga+toxin').read())
ece:Z1464	stx2A; shiga-like toxin II A subunit encoded by bacteriophage BP-933W; K11006 shiga toxin subunit A
ece:Z1465	stx2B; shiga-like toxin II B subunit encoded by bacteriophage BP-933W; K11007 shiga toxin subunit B
ece:Z3343	stx1B; shiga-like toxin 1 subunit B encoded within prophage CP-933V; K11007 shiga toxin subunit B
ece:Z3344	stx1A; shiga-like toxin 1 subunit A encoded within prophage CP-933V; K11006 shiga toxin subunit A
ecs:ECs1205	Shiga toxin 2 subunit A; K11006 shiga toxin subunit A
ecs:ECs1206	Shiga toxin 2 subunit B; K11007 shiga toxin subunit B
ecs:ECs2973	Shiga toxin I subunit B; K11007 shiga toxin subunit B
ecs:ECs2974	Shiga toxin I subunit A; K11006 shiga toxin subunit A
ecf:ECH74115_2905	shigatoxin 2, subunit B; K11007 shiga toxin subunit B
ecf:ECH74115_2906	shiga toxin subunit A (EC:3.2.2.22); K11006 shiga toxin subunit A
[...]
In [12]:
# Find shiga toxin genes only in Escherichia coli O111 H-11128 (EHEC)
print(kegg_find('eoi', 'shiga+toxin').read())
eoi:ECO111_2429	Shiga toxin 2 subunit B; K11007 shiga toxin subunit B
eoi:ECO111_2430	Shiga toxin 2 subunit A; K11006 shiga toxin subunit A
eoi:ECO111_3361	Shiga toxin 1 subunit A; K11006 shiga toxin subunit A
eoi:ECO111_3362	Shiga toxin 1 subunit B; K11007 shiga toxin subunit B

The API lets us query on useful properties, such as compound molecular weight:

In [13]:
# Compounds with molecular weight between 300 and 310g/mol
head(kegg_find('compound', '300-310/mol_weight').read())
cpd:C00051	307.32348
cpd:C00200	306.33696
cpd:C00219	304.46688
cpd:C00239	307.197122
cpd:C00270	309.26986
cpd:C00357	301.187702
cpd:C00365	308.181882
cpd:C00389	302.2357
cpd:C00732	308.37276
cpd:C00777	300.43512
[...]

1d. Retrieving specific entries with kegg_get()

The kegg_get() function lets us retrieve data from KEGG in a range of different formats.

Database entries can be retrieved directly as plain text; sequences may be returned as database entries, amino acid or nucleotide sequences in FASTA format; compounds as database entries, or images (.gif); pathways as database entries, KGML, or images (.png).

Compounds can be retrieved as their database entry in plain text, or as a .gif image.

In [14]:
# Compound as database entry
head(kegg_get("cpd:C01290").read())
ENTRY       C01290                      Compound
NAME        beta-D-Galactosyl-(1->4)-beta-D-glucosyl-(1<->1)-ceramide;
            beta-D-Galactosyl-1,4-beta-D-glucosylceramide;
            Lactosylceramide;
            Gal-beta1->4Glc-beta1->1'Cer;
            LacCer;
            Lactosyl-N-acylsphingosine;
            D-Galactosyl-1,4-beta-D-glucosylceramide
FORMULA     C31H56NO13R
REMARK      Same as: G00092
[...]
In [15]:
# Compound as image
Image(kegg_get("cpd:C01290", "image").read())
Out[15]:

Gene data can be retrieved as the plain text database entry, or a FASTA nucleotide or protein sequence.

In [16]:
# Gene as database entry
head(kegg_get("ece:Z5100").read())
ENTRY       Z5100             CDS       T00044
NAME        espF
DEFINITION  hypothetical protein
ORTHOLOGY   K12786  LEE-encoded effector EspF
ORGANISM    ece  Escherichia coli O157:H7 EDL933 (EHEC)
PATHWAY     ece05130  Pathogenic Escherichia coli infection
MODULE      ece_M00542  EHEC/EPEC pathogenicity signature, T3SS and effectors
BRITE       KEGG Orthology (KO) [BR:ece00001]
             Human Diseases
              Infectious diseases
[...]
In [17]:
# Gene as amino acid sequence
print(kegg_get("ece:Z5100", "aaseq").read())
>ece:Z5100 espF; hypothetical protein; K12786 LEE-encoded effector EspF (A)
MLNGISNAASTLGRQLVGIASRVSSAGGTGFSVAPQAVRLTPVKVHSPFSPGSSNVNART
IFNVSSQVTSFTPSRPAPPPPTSGQASGASRPLPPIAQALKEHLAAYEKSKGPEALGFKP
ARQAPPPPTSGQASGASRPLPPIAQALKEHLAAYEKSKGPEALGFKPARQAPPPPTSGQA
SGASRPLPPIAQALKEHLAAYEKSKGPEALGFKPARQAPPPPTGPSGLPPLAQALKDHLA
AYEQSKKG

In [18]:
# Gene as nucleotide sequence
print(kegg_get("ece:Z5100", "ntseq").read())
>ece:Z5100 espF; hypothetical protein; K12786 LEE-encoded effector EspF (N)
atgcttaatggaattagtaacgctgcttctacactagggcggcagcttgtaggtatcgca
agtcgagtgagctctgcggggggaactggattttctgtagcccctcaggccgtgcgtctt
actccggtgaaagttcattcccctttttctccaggctcgtcgaatgttaatgcgagaacg
atttttaatgtgagcagccaggtgacttcatttactccctctcgtccggcaccgccgcca
ccgacaagtggacaggcatccggggcatcccgacctttaccgcccattgcacaggcatta
aaagagcacttggctgcctatgaaaaatcgaaaggtcctgaggctttaggttttaagccc
gcccgtcaggcaccgccgccaccgacaagtggacaggcatccggggcatcccgaccttta
ccgcccattgcacaggcattaaaagagcacttggctgcctatgaaaaatcgaaaggtcct
gaggctttaggttttaagcccgcccgtcaggcaccgccgccaccgacaagtggacaggca
tccggggcatcccgacctttaccgcccattgcacaggcattaaaagagcacttggctgcc
tatgaaaaatcgaaaggtcctgaggctttaggttttaagcccgcccgtcaggcaccaccg
ccaccgacagggcctagtggactaccgccccttgcacaggcattaaaagatcatttagct
gcctatgagcaatcgaagaaagggtaa

In [19]:
# Parsing a returned sequence with SeqIO
seq = SeqIO.read(kegg_get("ece:Z5100", "ntseq"), 'fasta')
print seq.format('stockholm')
# STOCKHOLM 1.0
#=GF SQ 1
ece:Z5100 atgcttaatggaattagtaacgctgcttctacactagggcggcagcttgtaggtatcgcaagtcgagtgagctctgcggggggaactggattttctgtagcccctcaggccgtgcgtcttactccggtgaaagttcattcccctttttctccaggctcgtcgaatgttaatgcgagaacgatttttaatgtgagcagccaggtgacttcatttactccctctcgtccggcaccgccgccaccgacaagtggacaggcatccggggcatcccgacctttaccgcccattgcacaggcattaaaagagcacttggctgcctatgaaaaatcgaaaggtcctgaggctttaggttttaagcccgcccgtcaggcaccgccgccaccgacaagtggacaggcatccggggcatcccgacctttaccgcccattgcacaggcattaaaagagcacttggctgcctatgaaaaatcgaaaggtcctgaggctttaggttttaagcccgcccgtcaggcaccgccgccaccgacaagtggacaggcatccggggcatcccgacctttaccgcccattgcacaggcattaaaagagcacttggctgcctatgaaaaatcgaaaggtcctgaggctttaggttttaagcccgcccgtcaggcaccaccgccaccgacagggcctagtggactaccgccccttgcacaggcattaaaagatcatttagctgcctatgagcaatcgaagaaagggtaa
#=GS ece:Z5100 AC ece:Z5100
#=GS ece:Z5100 DE ece:Z5100 espF; hypothetical protein; K12786 LEE-encoded effector EspF (N)
//

Pathways can be returned as database entries in plain text, or in the KGML format, or as .png images. These last two formats are especially useful for generating custom pathway renderings with Bio.Graphics.KGML.

In [20]:
# Pathway as database entry
head(kegg_get("hsa05130").read())
ENTRY       hsa05130                    Pathway
NAME        Pathogenic Escherichia coli infection - Homo sapiens (human)
DESCRIPTION Enteropathogenic E. coli (EPEC) and enterohemorrhagic E. coli (EHEC) are closely related pathogenic strains of Escherichia coli. The hallmark of EPEC/EHEC infections [DS:H00278 H00277] is induction of attaching and effacing (A/E) lesions that damage intestinal epithelial cells. The capacity to form A/E lesions is encoded mainly by the locus of enterocyte effacement (LEE) pathogenicity island. Tir, Map, EspF, EspG are known LEE-encoded effector proteins secreted via the type III secretion system, which is also LEE-encoded, into the host cell. EPEC and EHEC Tir's link the extracellular bacterium to the cell cytoskeleton. Map and EspF are involved in mitochondrion membrane permeabilization. EspG interacts with tubulins and stimulates microtubule destabilization. LEE-encoded adhesin or intimin (Eae) is exported via the general secretory pathway to the periplasm, where it is inserted into the outer membrane. In addition to Tir, two potential host cell-carried intimin receptors, beta1 integrin (ITGB1) and nucleolin (NCL), have so far been identified. The distinguishing feature of EHEC is the elaboration of Shiga-like toxin (Stx). Stx cleaves ribosomal RNA, thereby disrupting protein synthesis and killing the intoxicated epithelial or endothelial cells.
CLASS       Human Diseases; Infectious diseases
PATHWAY_MAP hsa05130  Pathogenic Escherichia coli infection
DISEASE     H00277  Enterohemorrhagic Escherichia coli (EHEC) infection
            H00278  Enteropathogenic Escherichia coli (EPEC) infection
ORGANISM    Homo sapiens (human) [GN:hsa]
GENE        7100  TLR5; toll-like receptor 5 [KO:K10168]
            929  CD14; CD14 molecule [KO:K04391]
[...]
In [21]:
# Pathway as image (png)
Image(kegg_get("hsa05130", "image").read())
Out[21]:
In [22]:
# Pathway as KGML
head(kegg_get("hsa05130", "kgml").read())
<?xml version="1.0"?>
<!DOCTYPE pathway SYSTEM "http://www.kegg.jp/kegg/xml/KGML_v0.7.1_.dtd">
<!-- Creation date: May 13, 2014 11:30:29 +0900 (GMT+09:00) -->
<pathway name="path:hsa05130" org="hsa" number="05130"
         title="Pathogenic Escherichia coli infection"
         image="http://www.kegg.jp/kegg/pathway/hsa/hsa05130.png"
         link="http://www.kegg.jp/kegg-bin/show_pathway?hsa05130">
    <entry id="1" name="path:hsa04810" type="map"
        link="http://www.kegg.jp/dbget-bin/www_bget?hsa04810">
        <graphics name="Regulation of actin cytoskeleton" fgcolor="#000000" bgcolor="#FFFFFF"
[...]

2. Rendering pathways with Bio.KEGG and Bio.Graphics.KGML

2a. Retrieving renderings from KEGG

The easiest way to render a pathway is to retrieve an image file directly from KEGG with kegg_get(), as above.

In [23]:
# Render central metabolism
Image(kegg_get("map01100", "image").read())
Out[23]: