PyCellBase usage example

In this use case we are interested in getting missense variants from all the genes that are within two genomic regions of interest: 17:43045767-43046767 and 13:32317272-32318272.

Before starting, we import all the required modules for this example:

In [1]:
from pycellbase.cbconfig import ConfigClient  # Configuration client
from pycellbase.cbclient import CellBaseClient  # CellBase client

Step 1: Setting up the configuration

PyCellBase configuration follows the next structure:

In [2]:
config = {
    'rest': {'hosts': ['http://bioinfo.hpc.cam.ac.uk/cellbase']},  # List of RESTful host URLs
    'species': 'hsapiens',  # Name of the species
    'version': 'v4'  # API version
}
INFO: The first available and valid host URL will be selected from the provided list of hosts
INFO: For a full list of potentially available species, please refer to: http://bioinfo.hpc.cam.ac.uk/cellbase/webservices/rest/v4/meta/species

This custom configuration can be stored in a YAML file, JSON file or Python dictionary. Then, one of these files or dictionary can be passed to the ConfigClient class, which is in charge of managing the configuration:

In [3]:
cc = ConfigClient(config)
WARNING: If no custom configuration is passed to ConfigClient, the default configuration is used.

If we need an example of the configuration structure, we can get the default one using the get_default_configuration method:

In [4]:
cc.get_default_configuration()
Out[4]:
{'rest': {'hosts': ['http://bioinfo.hpc.cam.ac.uk:80/cellbase']},
 'species': 'hsapiens',
 'version': 'v4'}

Step 2: Initialising CellBase client

Once we have set up our PyCellBase configuration, we can create the CellBaseClient, which is the central class of this package. We can pass a ConfigClient with a customised configuration to CellBaseClient:

In [5]:
cbc = CellBaseClient(cc)

We can check at any moment the configuration parameters used to make the calls to the database:

In [6]:
cbc.show_configuration()
Out[6]:
{'host': 'http://bioinfo.hpc.cam.ac.uk/cellbase',
 'species': 'hsapiens',
 'version': 'v4'}
WARNING: If no custom configuration is passed to CellBaseClient, the default configuration is used.

We can modify our CellBaseClient configuration at any moment by modifying the ConfigClient attributes:

In [7]:
cc.species = 'celegans'
cbc.show_configuration()
Out[7]:
{'host': 'http://bioinfo.hpc.cam.ac.uk/cellbase',
 'species': 'celegans',
 'version': 'v4'}
In [8]:
cc.species = 'hsapiens'
cbc.show_configuration()
Out[8]:
{'host': 'http://bioinfo.hpc.cam.ac.uk/cellbase',
 'species': 'hsapiens',
 'version': 'v4'}

Step 3: Querying CellBase

Once we have initialised the main CellBaseClient class, we are ready to query the database. First, we want to get all the genes that are within our regions of interest. To get information from genomic regions, we get the region-specific client:

In [9]:
rc = cbc.get_region_client()
INFO: For a list of potentially available endpoints, please refer to http://bioinfo.hpc.cam.ac.uk/cellbase/webservices
INFO: For a full list of potentially available assemblies, please refer to: http://bioinfo.hpc.cam.ac.uk/cellbase/webservices/rest/v4/meta/species

If we do not know which method is the most adequate for our task, we can get helpful information for each data-specific client. In this case we are interested in gettting all the genes within a region, so we are going to use the get_gene method:

In [10]:
rc.get_help()
RegionClient
    - get_clinical: Retrieves all the clinical variants
    - get_conservation: Retrieves all the conservation scores
    - get_gene: Retrieves all the gene objects for the regions. If query param histogram=true, frequency values per genomic interval will be returned instead.
    - get_model: Get JSON specification of Variant data model
    - get_regulatory: Retrieves all regulatory elements in a region
    - get_repeat: Retrieves all repeats for the regions
    - get_sequence: Retrieves genomic sequence
    - get_tfbs: Retrieves all transcription factor binding site objects for the regions. If query param histogram=true, frequency values per genomic interval will be returned instead.
    - get_transcript: Retrieves all transcript objects for the regions
    - get_variation: Retrieves all the variant objects for the regions. If query param histogram=true, frequency values per genomic interval will be returned instead.
INFO: We can get the accepted parameters and filters for a specific method of interest by using the get_help method: rc.get_help('get_gene', show_params=True)

Once we have our data-specific client, we can query the database with our query of interest and get the JSON response returned by CellBase.

In [11]:
regions_info = rc.get_gene('17:43045767-43046767,13:32317272-32318272', assembly='GRCh38')
INFO: Multiple queries can be passed as comma-separated values ('17:43045767-43046767,13:32317272-32318272') or as a Python list (['17:43045767-43046767', '13:32317272-32318272'])

The obtained response is a list of results for each query. In this case we have asked for information for two different regions so our response has two elements:

In [12]:
region1_result = regions_info[0]['result']  # 17:43045767-43046767
region2_result = regions_info[1]['result']  # 13:32317272-32318272

Now that we have the CellBase JSON output, it's just a question of navigating through it to retrieve the information of interest.

In [13]:
genes = []
for region in regions_info:
    for gene in region['result']:
        genes.append(gene['name'])
genes
Out[13]:
[u'BRCA1', u'BRCA2']

We have found two genes overlapping our regions of interest. Our next step is getting the variants within those genes. To get information from genes, we get the gene-specific client:

In [14]:
gc = cbc.get_gene_client()

In this example, as there are a lot of variants in these genes, we are going to limit the returned results to 5 variants per gene (limit=5), skipping the first 1100 variants (skip=1100):

In [15]:
genes_info = gc.get_snp(genes, assembly='GRCh38', limit=5, skip=1100)

As before, once we get the response, we just need to navigate through the output JSON to get the information of interest:

In [16]:
variants = []
for gene in genes_info:
    for variant in gene['result']:
        variants.append(':'.join([variant['chromosome'], str(variant['start']), variant['reference'], variant['alternate']]))
variants
Out[16]:
[u'17:43045767:G:A',
 u'17:43045768:G:A',
 u'17:43045769:G:A',
 u'17:43045770:TCACC:T',
 u'17:43045772:A:G',
 u'13:32317262:T:C',
 u'13:32317271:G:C',
 u'13:32317272:C:A',
 u'13:32317275:A:G',
 u'13:32317298:A:G']

After getting the variants, the last step is selecting those ones whose consequence is missense (SO:0001583). To get information from variants, we get the variant-specific client:

In [17]:
vc = cbc.get_variant_client()

As before, we query the database and navigate through the output JSON to select our variants of interest.

In [18]:
# Querying the database
variants_info = vc.get_annotation(variants, assembly='GRCh38')

# Navigating the output
missense_variants = []
for index, variant in enumerate(variants_info):
    # Getting variant consequences
    for consequence in variant['result'][0]['consequenceTypes']:
        # Filtering by Ensembl transcript ID
        if 'ensemblTranscriptId' in consequence and consequence['ensemblTranscriptId'] in ['ENST00000357654', 'ENST00000544455']:
            # Filtering by missense variant (SO:0001583)
            if 'SO:0001583' in [so['accession'] for so in consequence['sequenceOntologyTerms']]:
                missense_variants.append(variants[index])
missense_variants
Out[18]:
[u'17:43045769:G:A', u'17:43045772:A:G']

Finally, we have found two missense variants appearing in the genes of our region of interest.