Tutorial 1: Salmonella dataset¶

1. Background¶

This tutorial consists of an index constructed from a Salmonella enterica dataset previously published in (https://doi.org/10.1128/jcm.02200-15). To prepare the data, the genomes were downloaded from NCBI and run through the provided analysis pipeline (gdi analysis) to identify SNVs, as well as skesa, SISTR and MLST for MLST analysis. A notebook describing how the index was constructed is available at index-salmonella-data-tutorial-1.ipynb.

2. Getting data¶

Let's first download the precomputed index for this tutorial. To do this please run the below commands:

Note: In a Jupyter Python notebook, prepending a command with ! runs the command in a shell instead of the Python interpreter (e.g., !unzip runs the command unzip).

In [1]:

!wget -O salmonella-project.zip https://figshare.com/ndownloader/files/31555565?private_link=0405199820a13aedca42
!unzip -n salmonella-project.zip | head -n 3
!echo
!ls

--2022-12-15 15:34:43--  https://figshare.com/ndownloader/files/31555565?private_link=0405199820a13aedca42
Resolving figshare.com (figshare.com)... 54.78.10.27, 52.51.22.31, 2a05:d018:1f4:d003:7186:f789:624a:fe25, ...
Connecting to figshare.com (figshare.com)|54.78.10.27|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://s3-eu-west-1.amazonaws.com/pfigshare-u-files/31555565/salmonellaproject20211122.zip?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIYCQYOYV5JSSROOA/20221215/eu-west-1/s3/aws4_request&X-Amz-Date=20221215T213444Z&X-Amz-Expires=10&X-Amz-SignedHeaders=host&X-Amz-Signature=8c3656328dd2b4da2c34201e6fa383f2b949ff1f6d2c4a877defccfedb1a8456 [following]
--2022-12-15 15:34:44--  https://s3-eu-west-1.amazonaws.com/pfigshare-u-files/31555565/salmonellaproject20211122.zip?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIYCQYOYV5JSSROOA/20221215/eu-west-1/s3/aws4_request&X-Amz-Date=20221215T213444Z&X-Amz-Expires=10&X-Amz-SignedHeaders=host&X-Amz-Signature=8c3656328dd2b4da2c34201e6fa383f2b949ff1f6d2c4a877defccfedb1a8456
Resolving s3-eu-west-1.amazonaws.com (s3-eu-west-1.amazonaws.com)... 52.218.108.19, 52.218.88.203, 52.92.16.112, ...
Connecting to s3-eu-west-1.amazonaws.com (s3-eu-west-1.amazonaws.com)|52.218.108.19|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10696541 (10M) [application/zip]
Saving to: ‘salmonella-project.zip’

salmonella-project. 100%[===================>]  10.20M  2.44MB/s    in 4.4s    

2022-12-15 15:34:49 (2.31 MB/s) - ‘salmonella-project.zip’ saved [10696541/10696541]

Archive:  salmonella-project.zip
   creating: salmonella-project/
   creating: salmonella-project/.gdi-data/

create-index		tutorial-1-salmonella.ipynb
salmonella-project	tutorial-2-sars-cov-2.ipynb
salmonella-project.zip	tutorial-3-querying-overview.ipynb

Great. Now that we've got some data (in the salmonella-project/ directory), let's explore the command-line interface to access this data.

3. Command-line interface (`gdi`)¶

Let's first print out the version of gdi being used.

In [2]:

!gdi --version

gdi, version 0.9.2

3.1 List samples¶

The below command lists the samples loaded in this index. Note that instead of passing the project with --project-dir you can change to the project directory and gdi will figure out which project you're in.

In [3]:

!gdi --project-dir salmonella-project list samples | head -n 5

SH12-012
SH14-017
SH12-004
SH12-005
SH12-013

3.2 List reference genomes¶

This lists all the loaded reference genomes in this genomics index. This can be useful for commands later which require you to pass the name of the reference genome.

In [4]:

!gdi --project-dir salmonella-project list genomes

NC_011083

3.3 Query for a particular mutation or MLST allele type¶

This searches for a particular mutation (in this case a deletion of a CG at position 2865334 and an insertion of a G). If you are familiar with the VCF format, this is given in terms of CHROM:POS:REF:ALT format.

Note, prefixing the feature string with hasa: (as in hasa:NC_011083.1:4559277:C:T) is what is used to query the database for this particular feature.

In [5]:

!gdi --project-dir salmonella-project query 'hasa:NC_011083.1:4559277:C:T' | column -s$'\t' -t

Query                    Sample Name  Sample ID  Status
NC_011083.1:4559277:C:T  SH12-004     3          Present
NC_011083.1:4559277:C:T  SH12-005     4          Present
NC_011083.1:4559277:C:T  SH10-001     7          Present
NC_011083.1:4559277:C:T  SH12-002     12         Present
NC_011083.1:4559277:C:T  SH12-007     15         Present
NC_011083.1:4559277:C:T  SH12-001     17         Present
NC_011083.1:4559277:C:T  SH12-003     19         Present
NC_011083.1:4559277:C:T  SH12-010     28         Present
NC_011083.1:4559277:C:T  SH11-001     36         Present
NC_011083.1:4559277:C:T  SH12-006     47         Present
NC_011083.1:4559277:C:T  SH12-009     53         Present
NC_011083.1:4559277:C:T  SH12-011     54         Present
NC_011083.1:4559277:C:T  SH12-008     56         Present

Alternatively, you can search for samples with a particular MLST allele.

In [6]:

!gdi --project-dir salmonella-project query 'hasa:mlst:senterica:aroC:2' | column -s$'\t' -t | head -n 5

Query                  Sample Name  Sample ID  Status
mlst:senterica:aroC:2  SH12-012     1          Present
mlst:senterica:aroC:2  SH14-017     2          Present
mlst:senterica:aroC:2  SH12-004     3          Present
mlst:senterica:aroC:2  SH12-005     4          Present

3.4. Summarize features¶

If you want to print a table summarizing all features within a query, you can use --features-summary [FEATURES].

For example, to print a table summarizing mutations alleles:

In [7]:

!gdi --project-dir salmonella-project query --features-summary mutations |\
 cut -f 1,7,10,11,25 |\
 head -n 5 |\
 column -s$'\t' -t

Mutation                 Count  Total  Percent  ID_HGVS_GN.p
NC_011083.1:3190398:G:A  59     59     100      -
NC_011083.1:542308:G:A   59     59     100      hgvs_gn:NC_011083.1:SEHA_RS03145:p.T5T
NC_011083.1:2288294:A:G  59     59     100      hgvs_gn:NC_011083.1:thiD:p.P119P
NC_011083.1:4888140:G:T  59     59     100      hgvs_gn:NC_011083.1:SEHA_RS24190:p.C28F
cut: write error: Broken pipe

This shows a small selection of the top mutations out of all genomes in the dataset. Count is the number of genomes with that particular mutation, Total is the total number of genomes in the query (all genomes in the database in this case, feel free to try different query strings here to select subsets of genomes).

Note: the commands cut, head, and column are just used to clean up the output for Jupyter. There is more information included in this full table.

3.5. Query for a particular genome/sample¶

You can also use the isa: prefix to query for a particular genome. As in:

In [8]:

!gdi --project-dir salmonella-project query 'isa:SH14-002' | column -s$'\t' -t

Query                   Sample Name  Sample ID  Status
isa_sample('SH14-002')  SH14-002     30         Present

You can combine this with the previous --features-summary command to summarize features in a particular genome.

In [9]:

!gdi --project-dir salmonella-project query 'isa:SH14-002' --features-summary mutations |\
 cut -f 1,7,10,11,25 |\
 head -n 5 |\
 column -s$'\t' -t

Mutation                 Count  Total  Percent  ID_HGVS_GN.p
NC_011083.1:1947794:A:G  1      1      100      hgvs_gn:NC_011083.1:SEHA_RS10125:p.T61A
NC_011083.1:4716545:T:C  1      1      100      hgvs_gn:NC_011083.1:SEHA_RS23395:p.I219I
NC_011083.1:3945452:G:C  1      1      100      hgvs_gn:NC_011083.1:ligB:p.L134V
NC_011083.1:3012412:C:T  1      1      100      hgvs_gn:NC_011083.1:invC:p.L25L
cut: write error: Broken pipe

3.6. Query based on phylogenetic distance¶

You can also query based on the phylogenetic distance between samples using the isin_[value]_[units]: prefix to the query string. For example, use isin_10_substitutions:SH12-013 to search for genomes within 5 substitutions of genome SH12-013.

The query is based on a stored phylogenetic tree built using mutations/deletions derived from alignment to a reference genome. You will also have to pass --reference-name NC_011083 to specify the reference genome used to build the tree.

Note that 5 substitutions* here is derived from the branch lengths of the phylogenetic tree (multipled by the alignment length to convert substitutions/site to substitutions) and may not correspond to a count of the substitutions you get if you were to align genomes pairwise.*

In [10]:

!gdi --project-dir salmonella-project query --reference-name NC_011083 'isin_5_substitutions:SH14-002' | column -s$'\t' -t

Query                                    Sample Name  Sample ID  Status
within(5.0 substitutions of 'SH14-002')  SH14-017     2          Present
within(5.0 substitutions of 'SH14-002')  SH14-025     6          Present
within(5.0 substitutions of 'SH14-002')  SH14-021     8          Present
within(5.0 substitutions of 'SH14-002')  SH14-027     10         Present
within(5.0 substitutions of 'SH14-002')  SH14-005     11         Present
within(5.0 substitutions of 'SH14-002')  SH14-007     13         Present
within(5.0 substitutions of 'SH14-002')  SH14-023     18         Present
within(5.0 substitutions of 'SH14-002')  SH14-003     20         Present
within(5.0 substitutions of 'SH14-002')  SH14-028     21         Present
within(5.0 substitutions of 'SH14-002')  SH14-006     24         Present
within(5.0 substitutions of 'SH14-002')  SH14-024     26         Present
within(5.0 substitutions of 'SH14-002')  SH14-009     27         Present
within(5.0 substitutions of 'SH14-002')  SH14-011     29         Present
within(5.0 substitutions of 'SH14-002')  SH14-002     30         Present
within(5.0 substitutions of 'SH14-002')  SH14-022     31         Present
within(5.0 substitutions of 'SH14-002')  SH14-014     33         Present
within(5.0 substitutions of 'SH14-002')  SH14-020     39         Present
within(5.0 substitutions of 'SH14-002')  SH14-018     40         Present
within(5.0 substitutions of 'SH14-002')  SH14-008     43         Present
within(5.0 substitutions of 'SH14-002')  SH14-015     44         Present
within(5.0 substitutions of 'SH14-002')  SH14-019     46         Present
within(5.0 substitutions of 'SH14-002')  SH14-016     48         Present
within(5.0 substitutions of 'SH14-002')  SH14-010     50         Present
within(5.0 substitutions of 'SH14-002')  SH14-004     52         Present
within(5.0 substitutions of 'SH14-002')  SH14-001     55         Present
within(5.0 substitutions of 'SH14-002')  SH14-026     57         Present
within(5.0 substitutions of 'SH14-002')  SH14-013     58         Present

Combine this with --features-summary mutations to summarize the top mutations (and indels) in this set of selected genomes.

In [11]:

!gdi --project-dir salmonella-project query --reference-name NC_011083 'isin_5_substitutions:SH14-002' --features-summary mutations |\
 cut -f 1,7,10,11,25 |\
 head -n 5 |\
 column -s$'\t' -t

Mutation                       Count  Total  Percent  ID_HGVS_GN.p
NC_011083.1:4465400:GGCCGAA:G  27     27     100      hgvs_gn:NC_011083.1:tyrB:p.E53_A54del
NC_011083.1:3844892:G:T        27     27     100      hgvs_gn:NC_011083.1:SEHA_RS19240:p.T285K
NC_011083.1:333057:T:G         27     27     100      hgvs_gn:NC_011083.1:tssM:p.L410R
NC_011083.1:4636822:T:C        27     27     100      hgvs_gn:NC_011083.1:frdC:p.T91A
cut: write error: Broken pipe

3.6 Build an alignment¶

To build an alignment for use in further phylogenetics software you can do the following.

In [12]:

!gdi --project-dir salmonella-project build alignment --reference-name NC_011083 --align-type full --output-file out.aln

2022-12-15 15:35:26 INFO: Started building alignment for 59 samples with include_variants=['SNP', 'MNP', 'DELETION']
2022-12-15 15:35:35 INFO: Finished building alignment for 59 samples. Took 8.65 seconds
Wrote alignment to [out.aln]

In [13]:

!head -n 2 out.aln

>NC_011083 [reference genome]
AGAGATTACGTCTGGTTGCAAGAGATCATAACAGGGGGAATTAGTTGAAAATAAATATAT

This will produce an alignment with length equal to the reference genome, masking any missing data or gaps with N, and concatenating individual sequences (contigs) in the alignment together to construct a single, whole-genome alignment.

You can pass in specific samples to include with --sample.

4. Python API (Genomics Data Index API)¶

Now let's move on to the Python interface for loading, and querying data in this index. This is a much more powerful (and flexible) way to work with your data which attempts to provide seamless data flow between this index and Python pandas.

We first start out by connecting to our project through GenomicsDataIndex.

In [14]:

import genomics_data_index.api as gdi

db = gdi.GenomicsDataIndex.connect('salmonella-project')
db

Out[14]:

<GenomicsDataIndex(samples=59)>

Great. We're connected. The samples=59 tells us how many samples are in this database.

You can run db.sample_names() or db.reference_names() to list the samples and reference genomes in this database.

In [15]:

db.sample_names()[0:5]

Out[15]:

['SH12-012', 'SH14-017', 'SH12-004', 'SH12-005', 'SH12-013']

In [16]:

db.reference_names()

Out[16]:

['NC_011083']

4.1 List all features (mutations)¶

To list a summary of all indexed mutations we can do:

In [17]:

summary_df = db.mutations_summary(reference_name='NC_011083').sort_values('Count', ascending=False)
summary_df

Out[17]:

	Sequence	Position	Deletion	Insertion	Type	Count	Unknown Count	Present and Unknown Count	Total	Percent	...	Gene_Name	Gene_ID	Feature_Type	Transcript_BioType	HGVS.c	HGVS.p	ID_HGVS.c	ID_HGVS.p	ID_HGVS_GN.c	ID_HGVS_GN.p
Mutation
NC_011083.1:3190398:G:A	NC_011083.1	3190398	G	A	SNP	59	NaN	NaN	59	100.000000	...	idi-lysS	SEHA_RS16035-SEHA_RS16040	intergenic_region	<NA>	n.3190398G>A	<NA>	hgvs:NC_011083.1:n.3190398G>A	<NA>	hgvs_gn:NC_011083.1:n.3190398G>A	<NA>
NC_011083.1:542308:G:A	NC_011083.1	542308	G	A	SNP	59	NaN	NaN	59	100.000000	...	SEHA_RS03145	SEHA_RS03145	transcript	protein_coding	c.15C>T	p.T5T	hgvs:NC_011083.1:SEHA_RS03145:c.15C>T	hgvs:NC_011083.1:SEHA_RS03145:p.T5T	hgvs_gn:NC_011083.1:SEHA_RS03145:c.15C>T	hgvs_gn:NC_011083.1:SEHA_RS03145:p.T5T
NC_011083.1:2288294:A:G	NC_011083.1	2288294	A	G	SNP	59	NaN	NaN	59	100.000000	...	thiD	SEHA_RS11880	transcript	protein_coding	c.357T>C	p.P119P	hgvs:NC_011083.1:SEHA_RS11880:c.357T>C	hgvs:NC_011083.1:SEHA_RS11880:p.P119P	hgvs_gn:NC_011083.1:thiD:c.357T>C	hgvs_gn:NC_011083.1:thiD:p.P119P
NC_011083.1:4888140:G:T	NC_011083.1	4888140	G	T	SNP	59	NaN	NaN	59	100.000000	...	SEHA_RS24190	SEHA_RS24190	transcript	protein_coding	c.83G>T	p.C28F	hgvs:NC_011083.1:SEHA_RS24190:c.83G>T	hgvs:NC_011083.1:SEHA_RS24190:p.C28F	hgvs_gn:NC_011083.1:SEHA_RS24190:c.83G>T	hgvs_gn:NC_011083.1:SEHA_RS24190:p.C28F
NC_011083.1:4824790:T:C	NC_011083.1	4824790	T	C	SNP	59	NaN	NaN	59	100.000000	...	SEHA_RS23860	SEHA_RS23860	transcript	protein_coding	c.1378T>C	p.460Qext?	hgvs:NC_011083.1:SEHA_RS23860:c.1378T>C	hgvs:NC_011083.1:SEHA_RS23860:p.460Qext?	hgvs_gn:NC_011083.1:SEHA_RS23860:c.1378T>C	hgvs_gn:NC_011083.1:SEHA_RS23860:p.460Qext?
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
NC_011083.1:863025:CCGGCGG:C	NC_011083.1	863025	CCGGCGG	C	INDEL	1	NaN	NaN	59	1.694915	...	tolA	SEHA_RS04685	transcript	protein_coding	c.135_140delCGGCGG	p.G46_G47del	hgvs:NC_011083.1:SEHA_RS04685:c.135_140delCGGCGG	hgvs:NC_011083.1:SEHA_RS04685:p.G46_G47del	hgvs_gn:NC_011083.1:tolA:c.135_140delCGGCGG	hgvs_gn:NC_011083.1:tolA:p.G46_G47del
NC_011083.1:1819152:G:C	NC_011083.1	1819152	G	C	SNP	1	NaN	NaN	59	1.694915	...	SEHA_RS09510	SEHA_RS09510	transcript	protein_coding	c.369G>C	p.W123C	hgvs:NC_011083.1:SEHA_RS09510:c.369G>C	hgvs:NC_011083.1:SEHA_RS09510:p.W123C	hgvs_gn:NC_011083.1:SEHA_RS09510:c.369G>C	hgvs_gn:NC_011083.1:SEHA_RS09510:p.W123C
NC_011083.1:1427185:G:A	NC_011083.1	1427185	G	A	SNP	1	NaN	NaN	59	1.694915	...	SEHA_RS07595	SEHA_RS07595	transcript	protein_coding	c.123G>A	p.Q41Q	hgvs:NC_011083.1:SEHA_RS07595:c.123G>A	hgvs:NC_011083.1:SEHA_RS07595:p.Q41Q	hgvs_gn:NC_011083.1:SEHA_RS07595:c.123G>A	hgvs_gn:NC_011083.1:SEHA_RS07595:p.Q41Q
NC_011083.1:3494198:G:A	NC_011083.1	3494198	G	A	SNP	1	NaN	NaN	59	1.694915	...	mlaE	SEHA_RS17580	transcript	protein_coding	c.56C>T	p.T19M	hgvs:NC_011083.1:SEHA_RS17580:c.56C>T	hgvs:NC_011083.1:SEHA_RS17580:p.T19M	hgvs_gn:NC_011083.1:mlaE:c.56C>T	hgvs_gn:NC_011083.1:mlaE:p.T19M
NC_011083.1:3255259:G:A	NC_011083.1	3255259	G	A	SNP	1	NaN	NaN	59	1.694915	...	SEHA_RS16365	SEHA_RS16365	transcript	protein_coding	c.366G>A	p.L122L	hgvs:NC_011083.1:SEHA_RS16365:c.366G>A	hgvs:NC_011083.1:SEHA_RS16365:p.L122L	hgvs_gn:NC_011083.1:SEHA_RS16365:c.366G>A	hgvs_gn:NC_011083.1:SEHA_RS16365:p.L122L

421 rows × 24 columns

This gives us back a DataFrame summarizing the mutations.

The Count, Total, and Percent columns tell us how many samples have a particular mutation.

The Annotation and beyond columns give us detailed information about the mutation's impact (as derived from snpeff if this was run on the genomes (VCF files) prior to indexing). The ID_HGVS... columns give us alternative identifiers we can use for querying based on the HGVS notation (as derived from snpeff annotations).

You may notice that Unknown Count and Present and Unknown Count have values of NA. These columns record the number of samples where this particular mutation is "Unknown/Missing" (either due to N's or missing data in the particular regions of the genome). Calculating this information can slow down generating the features table for large datasets and so it is disabled by default. To enabled it you can pass include_unknown_samples=True to mutations_summary().

In [18]:

summary_df = db.mutations_summary(reference_name='NC_011083', include_unknown_samples=True).sort_values('Count', ascending=False)
summary_df

Out[18]:

	Sequence	Position	Deletion	Insertion	Type	Count	Unknown Count	Present and Unknown Count	Total	Percent	...	Gene_Name	Gene_ID	Feature_Type	Transcript_BioType	HGVS.c	HGVS.p	ID_HGVS.c	ID_HGVS.p	ID_HGVS_GN.c	ID_HGVS_GN.p
Mutation
NC_011083.1:3190398:G:A	NC_011083.1	3190398	G	A	SNP	59	0	59	59	100.000000	...	idi-lysS	SEHA_RS16035-SEHA_RS16040	intergenic_region	<NA>	n.3190398G>A	<NA>	hgvs:NC_011083.1:n.3190398G>A	<NA>	hgvs_gn:NC_011083.1:n.3190398G>A	<NA>
NC_011083.1:542308:G:A	NC_011083.1	542308	G	A	SNP	59	0	59	59	100.000000	...	SEHA_RS03145	SEHA_RS03145	transcript	protein_coding	c.15C>T	p.T5T	hgvs:NC_011083.1:SEHA_RS03145:c.15C>T	hgvs:NC_011083.1:SEHA_RS03145:p.T5T	hgvs_gn:NC_011083.1:SEHA_RS03145:c.15C>T	hgvs_gn:NC_011083.1:SEHA_RS03145:p.T5T
NC_011083.1:2288294:A:G	NC_011083.1	2288294	A	G	SNP	59	0	59	59	100.000000	...	thiD	SEHA_RS11880	transcript	protein_coding	c.357T>C	p.P119P	hgvs:NC_011083.1:SEHA_RS11880:c.357T>C	hgvs:NC_011083.1:SEHA_RS11880:p.P119P	hgvs_gn:NC_011083.1:thiD:c.357T>C	hgvs_gn:NC_011083.1:thiD:p.P119P
NC_011083.1:4888140:G:T	NC_011083.1	4888140	G	T	SNP	59	0	59	59	100.000000	...	SEHA_RS24190	SEHA_RS24190	transcript	protein_coding	c.83G>T	p.C28F	hgvs:NC_011083.1:SEHA_RS24190:c.83G>T	hgvs:NC_011083.1:SEHA_RS24190:p.C28F	hgvs_gn:NC_011083.1:SEHA_RS24190:c.83G>T	hgvs_gn:NC_011083.1:SEHA_RS24190:p.C28F
NC_011083.1:4824790:T:C	NC_011083.1	4824790	T	C	SNP	59	0	59	59	100.000000	...	SEHA_RS23860	SEHA_RS23860	transcript	protein_coding	c.1378T>C	p.460Qext?	hgvs:NC_011083.1:SEHA_RS23860:c.1378T>C	hgvs:NC_011083.1:SEHA_RS23860:p.460Qext?	hgvs_gn:NC_011083.1:SEHA_RS23860:c.1378T>C	hgvs_gn:NC_011083.1:SEHA_RS23860:p.460Qext?
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
NC_011083.1:863025:CCGGCGG:C	NC_011083.1	863025	CCGGCGG	C	INDEL	1	1	2	59	1.694915	...	tolA	SEHA_RS04685	transcript	protein_coding	c.135_140delCGGCGG	p.G46_G47del	hgvs:NC_011083.1:SEHA_RS04685:c.135_140delCGGCGG	hgvs:NC_011083.1:SEHA_RS04685:p.G46_G47del	hgvs_gn:NC_011083.1:tolA:c.135_140delCGGCGG	hgvs_gn:NC_011083.1:tolA:p.G46_G47del
NC_011083.1:1819152:G:C	NC_011083.1	1819152	G	C	SNP	1	0	1	59	1.694915	...	SEHA_RS09510	SEHA_RS09510	transcript	protein_coding	c.369G>C	p.W123C	hgvs:NC_011083.1:SEHA_RS09510:c.369G>C	hgvs:NC_011083.1:SEHA_RS09510:p.W123C	hgvs_gn:NC_011083.1:SEHA_RS09510:c.369G>C	hgvs_gn:NC_011083.1:SEHA_RS09510:p.W123C
NC_011083.1:1427185:G:A	NC_011083.1	1427185	G	A	SNP	1	0	1	59	1.694915	...	SEHA_RS07595	SEHA_RS07595	transcript	protein_coding	c.123G>A	p.Q41Q	hgvs:NC_011083.1:SEHA_RS07595:c.123G>A	hgvs:NC_011083.1:SEHA_RS07595:p.Q41Q	hgvs_gn:NC_011083.1:SEHA_RS07595:c.123G>A	hgvs_gn:NC_011083.1:SEHA_RS07595:p.Q41Q
NC_011083.1:3494198:G:A	NC_011083.1	3494198	G	A	SNP	1	0	1	59	1.694915	...	mlaE	SEHA_RS17580	transcript	protein_coding	c.56C>T	p.T19M	hgvs:NC_011083.1:SEHA_RS17580:c.56C>T	hgvs:NC_011083.1:SEHA_RS17580:p.T19M	hgvs_gn:NC_011083.1:mlaE:c.56C>T	hgvs_gn:NC_011083.1:mlaE:p.T19M
NC_011083.1:3255259:G:A	NC_011083.1	3255259	G	A	SNP	1	0	1	59	1.694915	...	SEHA_RS16365	SEHA_RS16365	transcript	protein_coding	c.366G>A	p.L122L	hgvs:NC_011083.1:SEHA_RS16365:c.366G>A	hgvs:NC_011083.1:SEHA_RS16365:p.L122L	hgvs_gn:NC_011083.1:SEHA_RS16365:c.366G>A	hgvs_gn:NC_011083.1:SEHA_RS16365:p.L122L

421 rows × 24 columns

Now we can see the Unknown Count and Present and Unknown Count valued filled in (Present and Unknown means the number of samples where the mutation is present (the Count column) plus the number of samples where it is Unknown if a mutation is present (the Unknown Count column).

Let's look for mutations where less than 1/3 of the genomic samples have a particular mutation.

In [19]:

summary_df[summary_df['Percent'] < 33].sort_values(['Count', 'Position'], ascending=False).head(5)

Out[19]:

	Sequence	Position	Deletion	Insertion	Type	Count	Unknown Count	Present and Unknown Count	Total	Percent	...	Gene_Name	Gene_ID	Feature_Type	Transcript_BioType	HGVS.c	HGVS.p	ID_HGVS.c	ID_HGVS.p	ID_HGVS_GN.c	ID_HGVS_GN.p
Mutation
NC_011083.1:4482211:C:A	NC_011083.1	4482211	C	A	SNP	17	13	30	59	28.813559	...	siiE	SEHA_RS22245	transcript	protein_coding	c.3787C>A	p.R1263S	hgvs:NC_011083.1:SEHA_RS22245:c.3787C>A	hgvs:NC_011083.1:SEHA_RS22245:p.R1263S	hgvs_gn:NC_011083.1:siiE:c.3787C>A	hgvs_gn:NC_011083.1:siiE:p.R1263S
NC_011083.1:3167187:AACCACGACCACGACCACGACCACGACCACGACCACGACCACG:A	NC_011083.1	3167187	AACCACGACCACGACCACGACCACGACCACGACCACGACCACG	A	INDEL	15	16	31	59	25.423729	...	SEHA_RS15905	SEHA_RS15905	transcript	protein_coding	c.423_464delCGACCACGACCACGACCACGACCACGACCACGAC...	p.D142_H155del	hgvs:NC_011083.1:SEHA_RS15905:c.423_464delCGAC...	hgvs:NC_011083.1:SEHA_RS15905:p.D142_H155del	hgvs_gn:NC_011083.1:SEHA_RS15905:c.423_464delC...	hgvs_gn:NC_011083.1:SEHA_RS15905:p.D142_H155del
NC_011083.1:4789652:G:A	NC_011083.1	4789652	G	A	SNP	13	0	13	59	22.033898	...	SEHA_RS23695-SEHA_RS23700	SEHA_RS23695-SEHA_RS23700	intergenic_region	<NA>	n.4789652G>A	<NA>	hgvs:NC_011083.1:n.4789652G>A	<NA>	hgvs_gn:NC_011083.1:n.4789652G>A	<NA>
NC_011083.1:4739756:ACGC:A	NC_011083.1	4739756	ACGC	A	INDEL	13	0	13	59	22.033898	...	arcC	SEHA_RS23510	transcript	protein_coding	c.566_568delGCG	p.G189del	hgvs:NC_011083.1:SEHA_RS23510:c.566_568delGCG	hgvs:NC_011083.1:SEHA_RS23510:p.G189del	hgvs_gn:NC_011083.1:arcC:c.566_568delGCG	hgvs_gn:NC_011083.1:arcC:p.G189del
NC_011083.1:4559277:C:T	NC_011083.1	4559277	C	T	SNP	13	0	13	59	22.033898	...	SEHA_RS22545	SEHA_RS22545	transcript	protein_coding	c.474C>T	p.A158A	hgvs:NC_011083.1:SEHA_RS22545:c.474C>T	hgvs:NC_011083.1:SEHA_RS22545:p.A158A	hgvs_gn:NC_011083.1:SEHA_RS22545:c.474C>T	hgvs_gn:NC_011083.1:SEHA_RS22545:p.A158A

5 rows × 24 columns

Let's use this table to select a mutation to search for. We will select NC_011083.1:4482211:C:A.

4.2 Search for a particular mutation (`hasa()`)¶

We can search for genomic samples containing particular mutation by starting a query and using the hasa() method.

In [20]:

q = db.samples_query().hasa('NC_011083.1:4482211:C:A')
q

Out[20]:

<SamplesQueryIndex[selected=29% (17/59) samples, unknown=22% (13/59) samples]>

The hasa() method can be read as "select samples that have a particular mutation". The selected samples are printed above (17/59 have this particular mutation).

4.2.1. Select by HGVS id¶

We can also select samples using the HGVS identifier.

In [21]:

q = db.samples_query().hasa('hgvs_gn:NC_011083.1:siiE:p.R1263S')
q

Out[21]:

<SamplesQueryIndex[selected=29% (17/59) samples, unknown=22% (13/59) samples]>

The HGVS identifier is given as hgvs:sequence:gene_locus_id:variant (or hgvs_gn:sequence:gene_name:variant, hgvs_gn is for specifying gene name instead of locus). You can find the corresponding HGVS identifiers from the mutations_summary() table shown above.

In [22]:

summary_df.loc['NC_011083.1:4482211:C:A'][['ID_HGVS.c', 'ID_HGVS.p', 'ID_HGVS_GN.c', 'ID_HGVS_GN.p']]

Out[22]:

ID_HGVS.c       hgvs:NC_011083.1:SEHA_RS22245:c.3787C>A
ID_HGVS.p        hgvs:NC_011083.1:SEHA_RS22245:p.R1263S
ID_HGVS_GN.c         hgvs_gn:NC_011083.1:siiE:c.3787C>A
ID_HGVS_GN.p          hgvs_gn:NC_011083.1:siiE:p.R1263S
Name: NC_011083.1:4482211:C:A, dtype: object

Note: As of right now, HGVS identifiers are derived when using SnpEff, so this option will only work if SnpEff was run on the VCF files and corresponding HGVS identifiers were stored in the index.

4.2.3. Unknown samples¶

You may notice that in the results of the query we find unknown=22% (13/59) samples.

In [23]:

db.samples_query().hasa('hgvs_gn:NC_011083.1:siiE:p.R1263S')

Out[23]:

<SamplesQueryIndex[selected=29% (17/59) samples, unknown=22% (13/59) samples]>

This shows us that there exists 1 sample where it is unknown whether or not it has a hgvs_gn:NC_011083.1:siiE:p.R1263S mutation. This could be either that this particular region of the genome had no reads align to it or there exists ambiguous characters (N), in this region, hence it cannot be determined whether or not the mutation in question exists.

Samples where the result of a query is unknown are tracked and can be printed out using the commands shown below on accessing more details about a query.

4.2.4. Details about query¶

Once we have a query/selection of samples the specific samples can be shown with the toframe() method:

In [24]:

q.toframe(include_unknown=True)

Out[24]:

	Query	Sample Name	Sample ID	Status
0	hgvs_gn:NC_011083.1:siiE:p.R1263S	SH14-021	8	Present
1	hgvs_gn:NC_011083.1:siiE:p.R1263S	SH14-005	11	Present
2	hgvs_gn:NC_011083.1:siiE:p.R1263S	SH14-007	13	Present
3	hgvs_gn:NC_011083.1:siiE:p.R1263S	SH10-30	16	Present
4	hgvs_gn:NC_011083.1:siiE:p.R1263S	SH14-023	18	Present
5	hgvs_gn:NC_011083.1:siiE:p.R1263S	SH14-003	20	Present
6	hgvs_gn:NC_011083.1:siiE:p.R1263S	SH14-028	21	Present
7	hgvs_gn:NC_011083.1:siiE:p.R1263S	SH14-006	24	Present
8	hgvs_gn:NC_011083.1:siiE:p.R1263S	SH14-012	25	Present
9	hgvs_gn:NC_011083.1:siiE:p.R1263S	SH14-009	27	Present
10	hgvs_gn:NC_011083.1:siiE:p.R1263S	SH14-011	29	Present
11	hgvs_gn:NC_011083.1:siiE:p.R1263S	SH14-002	30	Present
12	hgvs_gn:NC_011083.1:siiE:p.R1263S	SH14-014	33	Present
13	hgvs_gn:NC_011083.1:siiE:p.R1263S	SH09-29	35	Present
14	hgvs_gn:NC_011083.1:siiE:p.R1263S	SH14-008	43	Present
15	hgvs_gn:NC_011083.1:siiE:p.R1263S	SH14-019	46	Present
16	hgvs_gn:NC_011083.1:siiE:p.R1263S	SH14-004	52	Present
17	hgvs_gn:NC_011083.1:siiE:p.R1263S	SH14-017	2	Unknown
18	hgvs_gn:NC_011083.1:siiE:p.R1263S	SH14-025	6	Unknown
19	hgvs_gn:NC_011083.1:siiE:p.R1263S	SH14-027	10	Unknown
20	hgvs_gn:NC_011083.1:siiE:p.R1263S	SH14-024	26	Unknown
21	hgvs_gn:NC_011083.1:siiE:p.R1263S	SH14-022	31	Unknown
22	hgvs_gn:NC_011083.1:siiE:p.R1263S	SH14-020	39	Unknown
23	hgvs_gn:NC_011083.1:siiE:p.R1263S	SH14-018	40	Unknown
24	hgvs_gn:NC_011083.1:siiE:p.R1263S	SH14-015	44	Unknown
25	hgvs_gn:NC_011083.1:siiE:p.R1263S	SH14-016	48	Unknown
26	hgvs_gn:NC_011083.1:siiE:p.R1263S	SH14-010	50	Unknown
27	hgvs_gn:NC_011083.1:siiE:p.R1263S	SH14-001	55	Unknown
28	hgvs_gn:NC_011083.1:siiE:p.R1263S	SH14-026	57	Unknown
29	hgvs_gn:NC_011083.1:siiE:p.R1263S	SH14-013	58	Unknown

You can use include_unknown=True to include samples where the status of the query is Unknown (unknown whether it is True or False). By default unknowns are not included.

To summarize, you can use summary():

In [25]:

q.summary()

Out[25]:

	Query	Present	Absent	Unknown	Total	% Present	% Absent	% Unknown
0	hgvs_gn:NC_011083.1:siiE:p.R1263S	17	29	13	59	28.813559	49.152542	22.033898

Or you can use tolist() (by default unknowns are not included, you can use include_unknown=True to include them).

In [26]:

q.tolist()

Out[26]:

['SH14-021',
 'SH14-005',
 'SH14-007',
 'SH10-30',
 'SH14-023',
 'SH14-003',
 'SH14-028',
 'SH14-006',
 'SH14-012',
 'SH14-009',
 'SH14-011',
 'SH14-002',
 'SH14-014',
 'SH09-29',
 'SH14-008',
 'SH14-019',
 'SH14-004']

4.3 Chaining queries¶

Queries can be chained together to select samples that match every criteria given in the has() method.

In [27]:

q = db.samples_query() \
    .hasa('hgvs_gn:NC_011083.1:siiE:p.R1263S') \
    .hasa('NC_011083.1:3535635:A:G')
q

Out[27]:

<SamplesQueryIndex[selected=5% (3/59) samples, unknown=46% (27/59) samples]>

This can be read as "select all samples that have a R1263S mutation (amino acid) on gene siiE AND select all samples that have an A to G mutation on position 3535635 of sequence NC_011083.1".

4.3.1. Handling unknowns¶

You may notice that the unknown's have increased to 46%. In the data model used by this software, both ambiguous characters (e.g., N) and regions with too low of coverage are considered as unknown (or missing). So, if genomes have a large deletion in a particular region overlapping a mutation being queried, then these would all show up as having an unknown status (small deletions represented in the VCF file won't be considered as unknown). I do not know if this is the case here (further investigation is required) but this is something to keep in mind.

You can convert all those samples/genomes that have an Unknown status to be considered selected if you wish using some boolean logic on queries and select_unknown(). In particular:

In [28]:

# Select everything that is "present" in query `q` OR everything that is "unknown" in query `q`
r = q.select_present() | q.select_unknown()

# Alternatively
# r = q.select_present().or_(q.select_unknown())

r

Out[28]:

<SamplesQueryIndex[selected=51% (30/59) samples, unknown=0% (0/59) samples]>

Here, q.select_unknown() selects only the unknown samples and q.select_present() | q.select_unknown() means select everything that is present in q.select_present() OR in q.select_unknown().

4.4 Searching for a particular sample (`isa()` and `isin()`)¶

The isa() and isin() methods let us search for particular samples by name. The difference between the two is that:

isa() is meant to be read "select samples that are (is a) type matching the expression.
isin() is meant to be read "select samples that are in a set defined by the passed criteria.

The differences between these become more apparent for more advanced queries later on. For now, we can use these to select samples by name.

In [29]:

q = db.samples_query().isa('SH12-001')
q

Out[29]:

<SamplesQueryIndex[selected=2% (1/59) samples, unknown=0% (0/59) samples]>

In [30]:

q = db.samples_query().isin(['SH12-001', 'SH13-001'])
q

Out[30]:

<SamplesQueryIndex[selected=3% (2/59) samples, unknown=0% (0/59) samples]>

4.5 Searching within a tree¶

Queries are not limited to what mutations a sample has or by sample name. We can also use isin() to select samples that match criteria related to a phylogenetic tree.

To do this, we must specify that our query has a tree attached to it. The example data for this tutorial does have such a tree (though it can be built on-the-fly if needed using build_tree(), or joined to an existing tree using join_tree()).

In [31]:

t = db.samples_query(universe='mutations', reference_name='NC_011083')
t

Out[31]:

<MutationTreeSamplesQuery[selected=100% (59/59) samples, unknown=0% (0/59) samples]>

Here, the type of query is a MutationTreeSamplesQuery which means it has a tree attached to it (derived from mutations). You can access the underlying tree with the tree property (as an ete3 Tree object).

In [32]:

t.tree

# Print as newick format
#t.tree.write()

Out[32]:

Tree node '' (0x7effb6c89d9)

You can quickly visualize the tree in-line by using the tree_styler() method:

In [33]:

t.tree_styler(show_legend_type_labels=False).render(w=300)

Out[33]:

4.5.1. Select by distance in a tree¶

Now that we have a phylogenetic tree, we can search using this tree with isin(). To do this, let's search for samples within some distance from SH14-001.

In [34]:

tdist = t.isin('SH14-002', kind='distance', distance=1e-7, units='substitutions/site')
tdist

Out[34]:

<MutationTreeSamplesQuery[selected=3% (2/59) samples, unknown=0% (0/59) samples]>

This selects a subset of samples within the above distance (given in 'substitutions/site', you can also use units='substitutions').

It can be hard to see what is going on, so we can combine our query with the tree visualization using the highlight() method to highlight the selected samples in the tree.

Note: Selecting by distance to SH14-002, won't necessarily select genomes belonging to the same clade.

In [35]:

t.tree_styler(show_legend_type_labels=False).highlight(tdist).render(w=300)

Out[35]:

4.5.2. Select by most recent common ancestor¶

Another type of query instead of distance is mrca which selects samples that all share a particular most recent common ancestor.

In [36]:

tmrca = t.isin(['SH14-002', 'SH14-025'], kind='mrca')
t.tree_styler(show_legend_type_labels=False).highlight(tmrca).render(w=300)

Out[36]:

4.5.3. Select by distance and mrca¶

If you wish to select by distance, but restrict yourself to some particular clade you can chain the above two queries together.

In [37]:

tmrca_and_dist = t.isin(['SH14-002', 'SH14-025'], kind='mrca')\
    .isin('SH14-002', kind='distance', distance=1e-7, units='substitutions/site')

t.tree_styler(show_legend_type_labels=False).highlight(tmrca_and_dist).render(w=300)

Out[37]:

4.6 Attach external metadata¶

So far we've been looking at only the genomics data. But often times many details insights can be derived from the associated metadata with the genomic samples. External metadata can be attached and tracked by our queries. This also gives us annother method for selecting samples using pandas selection statements (e.g., metadata['Column'] == 'value').

To attach external metadata we first must load it up in Python as a DataFrame (note head(3) just means only print the first 3 rows, which avoids printing a very large table for this tutorial).

In [38]:

import pandas as pd

# I rename 'Sample Name' here to 'Sample Name Orig' to avoid issues later on with other columns named 'Sample Name'
metadata_df = pd.read_csv('salmonella-project/metadata.tsv.gz', sep='\t', dtype=str).rename(
    {'Sample Name': 'Sample Name Orig'}, axis='columns')
metadata_df.head(3)

Out[38]:

	Sample Name Orig	Run	Assay Type	AvgSpotLen	Bases	BioProject	BioSample	BioSampleModel	Bytes	Center Name	...	PFGE_SecondaryEnzyme_pattern	Phagetype	Platform	ReleaseDate	Serovar	SRA Study	STRAIN	sub_species	Host_disease	Host
0	SH08-001	SRR3028792	WGS	429	354123684	PRJNA305824	SAMN04334683	Pathogen.cl	197484364	MCGILL UNIVERSITY	...	SHBNI.0001	19	ILLUMINA	2015-12-19T00:00:00Z	Heidelberg	SRP067504	SH08-001	enterica	Salmonella gastroenteritis	Homo sapiens
1	SH09-29	SRR3028793	WGS	422	519366460	PRJNA305824	SAMN04334684	Pathogen.cl	288691068	MCGILL UNIVERSITY	...	SHBNI.0001	26	ILLUMINA	2015-12-19T00:00:00Z	Heidelberg	SRP067504	SH09-29	enterica	Salmonella gastroenteritis	Homo sapiens
2	SH10-001	SRR3028783	WGS	421	387145160	PRJNA305824	SAMN04334674	Pathogen.cl	233911529	MCGILL UNIVERSITY	...	SHBNI.0001	19	ILLUMINA	2015-12-19T00:00:00Z	Heidelberg	SRP067504	SH10-001	enterica	Salmonella gastroenteritis	Homo sapiens

3 rows × 40 columns

Now we can attach to our query using the join() method:

In [39]:

# Setup a new query (you don't have to do this, but this makes sure the results are all as expected in the tutorial)
q = db.samples_query().hasa('hgvs_gn:NC_011083.1:siiE:p.R1263S')

# Join our query with the given data frame
q = q.join(metadata_df, sample_names_column='Sample Name Orig')
q

Out[39]:

<DataFrameSamplesQuery[selected=29% (17/59) samples, unknown=22% (13/59) samples]>

To join we had to define a column containing the sample names. We now get back a query of type DataFrameSamplesQuery.

4.7 Query external metadata¶

We can continue using this to select samples and when we're done use the toframe() method to dump out our selected data as a DataFrame (change to toframe(include_unknown=True) if you wish to include unknown results).

In [40]:

q.hasa('hgvs:NC_011083.1:SEHA_RS17780:c.1080T>C').toframe()

Out[40]:

	Query	Sample Name	Sample ID	Status	Sample Name Orig	Run	Assay Type	AvgSpotLen	Bases	BioProject	...	PFGE_SecondaryEnzyme_pattern	Phagetype	Platform	ReleaseDate	Serovar	SRA Study	STRAIN	sub_species	Host_disease	Host
0	hgvs_gn:NC_011083.1:siiE:p.R1263S AND datafram...	SH10-30	16	Present	SH10-30	SRR3028794	WGS	408	614636422	PRJNA305824	...	SHBNI.0001	29	ILLUMINA	2015-12-19T00:00:00Z	Heidelberg	SRP067504	SH10-30	enterica	Salmonella gastroenteritis	Homo sapiens
1	hgvs_gn:NC_011083.1:siiE:p.R1263S AND datafram...	SH14-002	30	Present	SH14-002	SRR3028755	WGS	513	687874904	PRJNA305824	...	SHBNI.0001	19	ILLUMINA	2015-12-19T00:00:00Z	Heidelberg	SRP067504	SH14-002	enterica	Salmonella gastroenteritis	Homo sapiens
2	hgvs_gn:NC_011083.1:siiE:p.R1263S AND datafram...	SH14-003	20	Present	SH14-003	SRR3028756	WGS	508	697302985	PRJNA305824	...	SHBNI.0001	19	ILLUMINA	2015-12-19T00:00:00Z	Heidelberg	SRP067504	SH14-003	enterica	Salmonella gastroenteritis	Homo sapiens
3	hgvs_gn:NC_011083.1:siiE:p.R1263S AND datafram...	SH14-004	52	Present	SH14-004	SRR3028757	WGS	505	808762405	PRJNA305824	...	SHBNI.0001	19	ILLUMINA	2015-12-19T00:00:00Z	Heidelberg	SRP067504	SH14-004	enterica	Salmonella gastroenteritis	Homo sapiens
4	hgvs_gn:NC_011083.1:siiE:p.R1263S AND datafram...	SH14-005	11	Present	SH14-005	SRR3028758	WGS	510	852322915	PRJNA305824	...	SHBNI.0001	19	ILLUMINA	2015-12-19T00:00:00Z	Heidelberg	SRP067504	SH14-005	enterica	Salmonella gastroenteritis	Homo sapiens
5	hgvs_gn:NC_011083.1:siiE:p.R1263S AND datafram...	SH14-006	24	Present	SH14-006	SRR3028759	WGS	515	705385264	PRJNA305824	...	SHBNI.0001	19	ILLUMINA	2015-12-19T00:00:00Z	Heidelberg	SRP067504	SH14-006	enterica	Salmonella gastroenteritis	Homo sapiens
6	hgvs_gn:NC_011083.1:siiE:p.R1263S AND datafram...	SH14-007	13	Present	SH14-007	SRR3028760	WGS	511	770747207	PRJNA305824	...	SHBNI.0001	19	ILLUMINA	2015-12-19T00:00:00Z	Heidelberg	SRP067504	SH14-007	enterica	Salmonella gastroenteritis	Homo sapiens
7	hgvs_gn:NC_011083.1:siiE:p.R1263S AND datafram...	SH14-008	43	Present	SH14-008	SRR3028761	WGS	510	495519791	PRJNA305824	...	SHBNI.0001	19	ILLUMINA	2015-12-19T00:00:00Z	Heidelberg	SRP067504	SH14-008	enterica	Salmonella gastroenteritis	Homo sapiens
8	hgvs_gn:NC_011083.1:siiE:p.R1263S AND datafram...	SH14-011	29	Present	SH14-011	SRR3028764	WGS	516	658814410	PRJNA305824	...	SHBNI.0001	17	ILLUMINA	2015-12-19T00:00:00Z	Heidelberg	SRP067504	SH14-011	enterica	Salmonella gastroenteritis	Homo sapiens
9	hgvs_gn:NC_011083.1:siiE:p.R1263S AND datafram...	SH14-012	25	Present	SH14-012	SRR3028765	WGS	524	802527201	PRJNA305824	...	SHBNI.0001	ATHE-35	ILLUMINA	2015-12-19T00:00:00Z	Heidelberg	SRP067504	SH14-012	enterica	Salmonella gastroenteritis	Homo sapiens
10	hgvs_gn:NC_011083.1:siiE:p.R1263S AND datafram...	SH14-019	46	Present	SH14-019	SRR3028772	WGS	514	372480499	PRJNA305824	...	SHBNI.0001	19	ILLUMINA	2015-12-19T00:00:00Z	Heidelberg	SRP067504	SH14-019	enterica	NaN	NaN
11	hgvs_gn:NC_011083.1:siiE:p.R1263S AND datafram...	SH14-021	8	Present	SH14-021	SRR3028774	WGS	517	409621309	PRJNA305824	...	SHBNI.0001	19	ILLUMINA	2015-12-19T00:00:00Z	Heidelberg	SRP067504	SH14-021	enterica	NaN	NaN

12 rows × 44 columns

Running toframe() will add some additional columns to the front of the external data frame defining the genomics query information.

4.8 Selecting by column values using `isa()`¶

We can now use the isa() method to select samples by values in a particular metadata column. For example, to select all samples with an Isolation_source of food we can use:

In [41]:

q.isa('food', isa_column='Isolation_source', kind='dataframe').toframe()[
    ['Query', 'Sample Name', 'Status', 'Isolation_source']].head(3)

Out[41]:

	Query	Sample Name	Status	Isolation_source
0	hgvs_gn:NC_011083.1:siiE:p.R1263S AND datafram...	SH14-014	Present	food
1	hgvs_gn:NC_011083.1:siiE:p.R1263S AND datafram...	SH14-019	Present	food
2	hgvs_gn:NC_011083.1:siiE:p.R1263S AND datafram...	SH14-021	Present	food

To make it easier to write these query expressions, you can set a default column for isa() queries when joining to a dataframe:

In [42]:

q = db.samples_query().join(metadata_df, sample_names_column='Sample Name Orig',
                            default_isa_kind='dataframe', default_isa_column='Isolation_source')

q.isa('food').toframe()[['Query', 'Sample Name', 'Status', 'Isolation_source']].head(3)

Out[42]:

	Query	Sample Name	Status	Isolation_source
0	dataframe(names_col=[Sample Name Orig]) AND is...	SH12-009	Present	food
1	dataframe(names_col=[Sample Name Orig]) AND is...	SH12-010	Present	food
2	dataframe(names_col=[Sample Name Orig]) AND is...	SH14-013	Present	food

You can also pass regex=True to isa() to query by a regex.

4.9 Selecting by pandas selection expressions¶

You can also select samples using the more powerful pandas selection expressions. You use the isin() method for this.

For example, an alternative way to select samples where Isolation_source is food is:

In [43]:

q.isin(metadata_df['Isolation_source'] == 'food', kind='dataframe').toframe()[
    ['Query', 'Sample Name', 'Status', 'Isolation_source']].head(3)

Out[43]:

	Query	Sample Name	Status	Isolation_source
0	dataframe(names_col=[Sample Name Orig]) AND is...	SH12-009	Present	food
1	dataframe(names_col=[Sample Name Orig]) AND is...	SH12-010	Present	food
2	dataframe(names_col=[Sample Name Orig]) AND is...	SH14-013	Present	food

5. Working with MLST results¶

So far we've been working with mutations (specified either as nucleotides or as amino acids). But we can also use the same query syntax to work with MLST data.

MLST (multi-locus sequence typing) data specifies genomic features as a combination of a Scheme, Locus (gene), and Allele. For example, Scheme=senterica...

5.1. Viewing all MLST data¶

To view all MLST schemes available in the system, we can use db.mlst_schemes().

In [44]:

db.mlst_schemes()

Out[44]:

['senterica', 'sistr_330']

This says we have two MLST schemes: senterica and sistr_330.

To view all MLST features for a particular scheme, we can use db.mlst_summary().

In [45]:

db.mlst_summary(scheme_name='senterica').sort_values('Count', ascending=False)

Out[45]:

	Scheme	Locus	Allele	Count	Unknown Count	Present and Unknown Count	Total	Percent	Unknown Percent	Present and Unknown Percent
MLST Feature
mlst:senterica:aroC:2	senterica	aroC	2	59	<NA>	<NA>	59	100.0	<NA>	<NA>
mlst:senterica:dnaN:7	senterica	dnaN	7	59	<NA>	<NA>	59	100.0	<NA>	<NA>
mlst:senterica:hemD:9	senterica	hemD	9	59	<NA>	<NA>	59	100.0	<NA>	<NA>
mlst:senterica:hisD:9	senterica	hisD	9	59	<NA>	<NA>	59	100.0	<NA>	<NA>
mlst:senterica:purE:5	senterica	purE	5	59	<NA>	<NA>	59	100.0	<NA>	<NA>
mlst:senterica:sucA:9	senterica	sucA	9	59	<NA>	<NA>	59	100.0	<NA>	<NA>
mlst:senterica:thrA:12	senterica	thrA	12	59	<NA>	<NA>	59	100.0	<NA>	<NA>

In this case, every genome has all the same classic 7-gene MLST alleles (all 100% for the Percent column).

In [46]:

db.mlst_summary(scheme_name='sistr_330').sort_values('Count', ascending=False)

Out[46]:

	Scheme	Locus	Allele	Count	Unknown Count	Present and Unknown Count	Total	Percent	Unknown Percent	Present and Unknown Percent
MLST Feature
mlst:sistr_330:NC_003198.1_3005:419666160	sistr_330	NC_003198.1_3005	419666160	59	NaN	NaN	59	100.000000	NaN	NaN
mlst:sistr_330:NZ_AOXE01000068.1_67:3713645811	sistr_330	NZ_AOXE01000068.1_67	3713645811	59	NaN	NaN	59	100.000000	NaN	NaN
mlst:sistr_330:NZ_AOXE01000068.1_58:93691856	sistr_330	NZ_AOXE01000068.1_58	93691856	59	NaN	NaN	59	100.000000	NaN	NaN
mlst:sistr_330:NZ_AOXE01000068.1_52:1426386553	sistr_330	NZ_AOXE01000068.1_52	1426386553	59	NaN	NaN	59	100.000000	NaN	NaN
mlst:sistr_330:NZ_AOXE01000068.1_5:4267848688	sistr_330	NZ_AOXE01000068.1_5	4267848688	59	NaN	NaN	59	100.000000	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...
mlst:sistr_330:NZ_AOXE01000072.1_65:2777708285	sistr_330	NZ_AOXE01000072.1_65	2777708285	1	NaN	NaN	59	1.694915	NaN	NaN
mlst:sistr_330:NZ_AOXE01000059.1_430:1260977859	sistr_330	NZ_AOXE01000059.1_430	1260977859	1	NaN	NaN	59	1.694915	NaN	NaN
mlst:sistr_330:NZ_AOXE01000059.1_433:4100983823	sistr_330	NZ_AOXE01000059.1_433	4100983823	1	NaN	NaN	59	1.694915	NaN	NaN
mlst:sistr_330:NZ_AOYX01000092.1_135:466472938	sistr_330	NZ_AOYX01000092.1_135	466472938	1	NaN	NaN	59	1.694915	NaN	NaN
mlst:sistr_330:NZ_AOXE01000034.1_119:1444220767	sistr_330	NZ_AOXE01000034.1_119	1444220767	1	NaN	NaN	59	1.694915	NaN	NaN

347 rows × 10 columns

The SISTR MLST results consist of 330 loci in the scheme, and we can see that not all genomes have the same SISTR alleles (not all alleles have 100% for Percent).

5.2. Querying for a specific MLST feature (`hasa()`)¶

We can use hasa() to also query for genomic samples with a specific MLST feature. We can use the MLST Feature listed in the above tables as the identifier.

In [47]:

q = db.samples_query().hasa('mlst:sistr_330:NZ_AOXE01000034.1_103:1944731850')
q

Out[47]:

<SamplesQueryIndex[selected=19% (11/59) samples, unknown=0% (0/59) samples]>

The MLST feature identifier is specified in terms of mlst:[scheme]:[locus]:[allele] (i.e., mlst:sistr_330:NZ_AOXE01000034.1_103:1944731850).

Only 11 genomic samples have this particular MLST feature. Similar to the operations on mutations, we can use q.toframe() to list the specific genomic samples with this MLST feature.

In [48]:

q.toframe()

Out[48]:

	Query	Sample Name	Sample ID	Status
0	mlst:sistr_330:NZ_AOXE01000034.1_103:1944731850	SH13-001	9	Present
1	mlst:sistr_330:NZ_AOXE01000034.1_103:1944731850	SH10-014	14	Present
2	mlst:sistr_330:NZ_AOXE01000034.1_103:1944731850	SH11-002	23	Present
3	mlst:sistr_330:NZ_AOXE01000034.1_103:1944731850	SH10-015	34	Present
4	mlst:sistr_330:NZ_AOXE01000034.1_103:1944731850	SH13-004	37	Present
5	mlst:sistr_330:NZ_AOXE01000034.1_103:1944731850	SH13-006	38	Present
6	mlst:sistr_330:NZ_AOXE01000034.1_103:1944731850	SH13-007	41	Present
7	mlst:sistr_330:NZ_AOXE01000034.1_103:1944731850	SH13-008	42	Present
8	mlst:sistr_330:NZ_AOXE01000034.1_103:1944731850	SH13-003	45	Present
9	mlst:sistr_330:NZ_AOXE01000034.1_103:1944731850	SH13-002	49	Present
10	mlst:sistr_330:NZ_AOXE01000034.1_103:1944731850	SH13-005	59	Present

We can even combine this together with the previous phylogenetic tree visualization to see how this MLST feature clusters with respect to a tree constructed from mutations.

In [49]:

t = db.samples_query(universe='mutations', reference_name='NC_011083')
t.tree_styler(show_legend_type_labels=False, include_unknown=False)\
 .highlight(t.hasa('mlst:sistr_330:NZ_AOXE01000034.1_103:1944731850'), legend_label='mlst:1944731850')\
 .render()

Out[49]:

5.3. Features summmary¶

We can use features_summary() on the previous query to view a summary of other MLST features (possibly from other MLST schemes) in our query, or even a summary of mutations.

In [50]:

q = db.samples_query().hasa('mlst:sistr_330:NZ_AOXE01000034.1_103:1944731850')
q

Out[50]:

<SamplesQueryIndex[selected=19% (11/59) samples, unknown=0% (0/59) samples]>

In [51]:

q.features_summary(kind='mlst', scheme='sistr_330').sort_values('Count', ascending=False)

Out[51]:

	Scheme	Locus	Allele	Count	Unknown Count	Present and Unknown Count	Total	Percent	Unknown Percent	Present and Unknown Percent
MLST Feature
mlst:sistr_330:NC_003198.1_3005:419666160	sistr_330	NC_003198.1_3005	419666160	11	NaN	NaN	11	100.0	NaN	NaN
mlst:sistr_330:NZ_AOXE01000072.1_12:2530339645	sistr_330	NZ_AOXE01000072.1_12	2530339645	11	NaN	NaN	11	100.0	NaN	NaN
mlst:sistr_330:NZ_AOXE01000072.1_100:2740047193	sistr_330	NZ_AOXE01000072.1_100	2740047193	11	NaN	NaN	11	100.0	NaN	NaN
mlst:sistr_330:NZ_AOXE01000068.1_76:2883770685	sistr_330	NZ_AOXE01000068.1_76	2883770685	11	NaN	NaN	11	100.0	NaN	NaN
mlst:sistr_330:NZ_AOXE01000068.1_72:739645922	sistr_330	NZ_AOXE01000068.1_72	739645922	11	NaN	NaN	11	100.0	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...
mlst:sistr_330:NZ_AOXE01000036.1_16:3783573123	sistr_330	NZ_AOXE01000036.1_16	3783573123	11	NaN	NaN	11	100.0	NaN	NaN
mlst:sistr_330:NZ_AOXE01000036.1_157:1706422672	sistr_330	NZ_AOXE01000036.1_157	1706422672	11	NaN	NaN	11	100.0	NaN	NaN
mlst:sistr_330:NZ_AOXE01000036.1_15:519139030	sistr_330	NZ_AOXE01000036.1_15	519139030	11	NaN	NaN	11	100.0	NaN	NaN
mlst:sistr_330:NZ_AOXE01000036.1_116:2832724106	sistr_330	NZ_AOXE01000036.1_116	2832724106	11	NaN	NaN	11	100.0	NaN	NaN
mlst:sistr_330:NZ_CM001471.1_3941:799094037	sistr_330	NZ_CM001471.1_3941	799094037	11	NaN	NaN	11	100.0	NaN	NaN

330 rows × 10 columns

In [52]:

q.features_summary(kind='mutations').sort_values('Count', ascending=False)

Out[52]:

	Sequence	Position	Deletion	Insertion	Type	Count	Unknown Count	Present and Unknown Count	Total	Percent	...	Gene_Name	Gene_ID	Feature_Type	Transcript_BioType	HGVS.c	HGVS.p	ID_HGVS.c	ID_HGVS.p	ID_HGVS_GN.c	ID_HGVS_GN.p
Mutation
NC_011083.1:1947794:A:G	NC_011083.1	1947794	A	G	SNP	11	NaN	NaN	11	100.000000	...	SEHA_RS10125	SEHA_RS10125	transcript	protein_coding	c.181A>G	p.T61A	hgvs:NC_011083.1:SEHA_RS10125:c.181A>G	hgvs:NC_011083.1:SEHA_RS10125:p.T61A	hgvs_gn:NC_011083.1:SEHA_RS10125:c.181A>G	hgvs_gn:NC_011083.1:SEHA_RS10125:p.T61A
NC_011083.1:379589:C:T	NC_011083.1	379589	C	T	SNP	11	NaN	NaN	11	100.000000	...	SEHA_RS02285	SEHA_RS02285	transcript	protein_coding	c.353G>A	p.R118H	hgvs:NC_011083.1:SEHA_RS02285:c.353G>A	hgvs:NC_011083.1:SEHA_RS02285:p.R118H	hgvs_gn:NC_011083.1:SEHA_RS02285:c.353G>A	hgvs_gn:NC_011083.1:SEHA_RS02285:p.R118H
NC_011083.1:3533472:C:T	NC_011083.1	3533472	C	T	SNP	11	NaN	NaN	11	100.000000	...	SEHA_RS17775	SEHA_RS17775	transcript	protein_coding	c.975G>A	p.V325V	hgvs:NC_011083.1:SEHA_RS17775:c.975G>A	hgvs:NC_011083.1:SEHA_RS17775:p.V325V	hgvs_gn:NC_011083.1:SEHA_RS17775:c.975G>A	hgvs_gn:NC_011083.1:SEHA_RS17775:p.V325V
NC_011083.1:3691534:C:A	NC_011083.1	3691534	C	A	SNP	11	NaN	NaN	11	100.000000	...	malT	SEHA_RS18595	transcript	protein_coding	c.1074C>A	p.S358R	hgvs:NC_011083.1:SEHA_RS18595:c.1074C>A	hgvs:NC_011083.1:SEHA_RS18595:p.S358R	hgvs_gn:NC_011083.1:malT:c.1074C>A	hgvs_gn:NC_011083.1:malT:p.S358R
NC_011083.1:602044:T:G	NC_011083.1	602044	T	G	SNP	11	NaN	NaN	11	100.000000	...	SEHA_RS03410	SEHA_RS03410	transcript	protein_coding	c.1163T>G	p.V388G	hgvs:NC_011083.1:SEHA_RS03410:c.1163T>G	hgvs:NC_011083.1:SEHA_RS03410:p.V388G	hgvs_gn:NC_011083.1:SEHA_RS03410:c.1163T>G	hgvs_gn:NC_011083.1:SEHA_RS03410:p.V388G
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
NC_011083.1:2010696:G:T	NC_011083.1	2010696	G	T	SNP	1	NaN	NaN	11	9.090909	...	SEHA_RS25120-SEHA_RS10450	SEHA_RS25120-SEHA_RS10450	intergenic_region	<NA>	n.2010696G>T	<NA>	hgvs:NC_011083.1:n.2010696G>T	<NA>	hgvs_gn:NC_011083.1:n.2010696G>T	<NA>
NC_011083.1:495275:G:A	NC_011083.1	495275	G	A	SNP	1	NaN	NaN	11	9.090909	...	sbcC	SEHA_RS02910	transcript	protein_coding	c.1242C>T	p.A414A	hgvs:NC_011083.1:SEHA_RS02910:c.1242C>T	hgvs:NC_011083.1:SEHA_RS02910:p.A414A	hgvs_gn:NC_011083.1:sbcC:c.1242C>T	hgvs_gn:NC_011083.1:sbcC:p.A414A
NC_011083.1:1514675:G:A	NC_011083.1	1514675	G	A	SNP	1	NaN	NaN	11	9.090909	...	purR-SEHA_RS08035	SEHA_RS08030-SEHA_RS08035	intergenic_region	<NA>	n.1514675G>A	<NA>	hgvs:NC_011083.1:n.1514675G>A	<NA>	hgvs_gn:NC_011083.1:n.1514675G>A	<NA>
NC_011083.1:3401393:T:A	NC_011083.1	3401393	T	A	SNP	1	NaN	NaN	11	9.090909	...	SEHA_RS17120-SEHA_RS17125	SEHA_RS17120-SEHA_RS17125	intergenic_region	<NA>	n.3401393T>A	<NA>	hgvs:NC_011083.1:n.3401393T>A	<NA>	hgvs_gn:NC_011083.1:n.3401393T>A	<NA>
NC_011083.1:1310359:T:G	NC_011083.1	1310359	T	G	SNP	1	NaN	NaN	11	9.090909	...	nagK	SEHA_RS06940	transcript	protein_coding	c.175T>G	p.S59A	hgvs:NC_011083.1:SEHA_RS06940:c.175T>G	hgvs:NC_011083.1:SEHA_RS06940:p.S59A	hgvs_gn:NC_011083.1:nagK:c.175T>G	hgvs_gn:NC_011083.1:nagK:p.S59A

158 rows × 24 columns

This lets us summarize what other genomic features (MLST or mutations) are found in the set of samples that have a mlst:sistr_330:NZ_AOXE01000034.1_103:1944731850 MLST feature.

Setting q.features_summary(..., selection='unique') lets us narrow-down the scope of our summary to only features that are uniquely found in the set of genomic samples that have a mlst:sistr_330:NZ_AOXE01000034.1_103:1944731850 MLST feature.

In [53]:

q.features_summary(kind='mutations', selection='unique').sort_values('Count', ascending=False)

Out[53]:

	Sequence	Position	Deletion	Insertion	Type	Count	Unknown Count	Present and Unknown Count	Total	Percent	...	Gene_Name	Gene_ID	Feature_Type	Transcript_BioType	HGVS.c	HGVS.p	ID_HGVS.c	ID_HGVS.p	ID_HGVS_GN.c	ID_HGVS_GN.p
Mutation
NC_011083.1:3691534:C:A	NC_011083.1	3691534	C	A	SNP	11	NaN	NaN	11	100.000000	...	malT	SEHA_RS18595	transcript	protein_coding	c.1074C>A	p.S358R	hgvs:NC_011083.1:SEHA_RS18595:c.1074C>A	hgvs:NC_011083.1:SEHA_RS18595:p.S358R	hgvs_gn:NC_011083.1:malT:c.1074C>A	hgvs_gn:NC_011083.1:malT:p.S358R
NC_011083.1:2428696:AT:A	NC_011083.1	2428696	AT	A	INDEL	11	NaN	NaN	11	100.000000	...	SEHA_RS12535-nrdA	SEHA_RS12535-SEHA_RS12540	intergenic_region	<NA>	n.2428697delT	<NA>	hgvs:NC_011083.1:n.2428697delT	<NA>	hgvs_gn:NC_011083.1:n.2428697delT	<NA>
NC_011083.1:3796657:C:A	NC_011083.1	3796657	C	A	SNP	11	NaN	NaN	11	100.000000	...	SEHA_RS19050	SEHA_RS19050	transcript	protein_coding	c.489C>A	p.G163G	hgvs:NC_011083.1:SEHA_RS19050:c.489C>A	hgvs:NC_011083.1:SEHA_RS19050:p.G163G	hgvs_gn:NC_011083.1:SEHA_RS19050:c.489C>A	hgvs_gn:NC_011083.1:SEHA_RS19050:p.G163G
NC_011083.1:2841095:G:A	NC_011083.1	2841095	G	A	SNP	11	NaN	NaN	11	100.000000	...	bapA	SEHA_RS14380	transcript	protein_coding	c.5317G>A	p.D1773N	hgvs:NC_011083.1:SEHA_RS14380:c.5317G>A	hgvs:NC_011083.1:SEHA_RS14380:p.D1773N	hgvs_gn:NC_011083.1:bapA:c.5317G>A	hgvs_gn:NC_011083.1:bapA:p.D1773N
NC_011083.1:2209731:A:C	NC_011083.1	2209731	A	C	SNP	11	NaN	NaN	11	100.000000	...	SEHA_RS11560	SEHA_RS11560	transcript	protein_coding	c.250T>G	p.W84G	hgvs:NC_011083.1:SEHA_RS11560:c.250T>G	hgvs:NC_011083.1:SEHA_RS11560:p.W84G	hgvs_gn:NC_011083.1:SEHA_RS11560:c.250T>G	hgvs_gn:NC_011083.1:SEHA_RS11560:p.W84G
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
NC_011083.1:3535143:A:T	NC_011083.1	3535143	A	T	SNP	1	NaN	NaN	11	9.090909	...	oadA	SEHA_RS17780	transcript	protein_coding	c.1101T>A	p.I367I	hgvs:NC_011083.1:SEHA_RS17780:c.1101T>A	hgvs:NC_011083.1:SEHA_RS17780:p.I367I	hgvs_gn:NC_011083.1:oadA:c.1101T>A	hgvs_gn:NC_011083.1:oadA:p.I367I
NC_011083.1:1310359:T:G	NC_011083.1	1310359	T	G	SNP	1	NaN	NaN	11	9.090909	...	nagK	SEHA_RS06940	transcript	protein_coding	c.175T>G	p.S59A	hgvs:NC_011083.1:SEHA_RS06940:c.175T>G	hgvs:NC_011083.1:SEHA_RS06940:p.S59A	hgvs_gn:NC_011083.1:nagK:c.175T>G	hgvs_gn:NC_011083.1:nagK:p.S59A
NC_011083.1:936248:G:A	NC_011083.1	936248	G	A	SNP	1	NaN	NaN	11	9.090909	...	rhlE	SEHA_RS05055	transcript	protein_coding	c.1225G>A	p.G409R	hgvs:NC_011083.1:SEHA_RS05055:c.1225G>A	hgvs:NC_011083.1:SEHA_RS05055:p.G409R	hgvs_gn:NC_011083.1:rhlE:c.1225G>A	hgvs_gn:NC_011083.1:rhlE:p.G409R
NC_011083.1:4536022:C:T	NC_011083.1	4536022	C	T	SNP	1	NaN	NaN	11	9.090909	...	adiA	SEHA_RS22440	transcript	protein_coding	c.909G>A	p.P303P	hgvs:NC_011083.1:SEHA_RS22440:c.909G>A	hgvs:NC_011083.1:SEHA_RS22440:p.P303P	hgvs_gn:NC_011083.1:adiA:c.909G>A	hgvs_gn:NC_011083.1:adiA:p.P303P
NC_011083.1:4800575:C:T	NC_011083.1	4800575	C	T	SNP	1	NaN	NaN	11	9.090909	...	SEHA_RS23755-SEHA_RS23765	SEHA_RS23755-SEHA_RS23765	intergenic_region	<NA>	n.4800575C>T	<NA>	hgvs:NC_011083.1:n.4800575C>T	<NA>	hgvs_gn:NC_011083.1:n.4800575C>T	<NA>

72 rows × 24 columns

Similar to mutations_summary() to calculate values for the Unknown Count and Present and Unknown Count columns you can pass include_unknown_sampes=True. This might slow down the table generation for larger datasets.

In [54]:

q.features_summary(kind='mutations', selection='unique', include_unknown_samples=True).sort_values('Count', ascending=False)

Out[54]:

	Sequence	Position	Deletion	Insertion	Type	Count	Unknown Count	Present and Unknown Count	Total	Percent	...	Gene_Name	Gene_ID	Feature_Type	Transcript_BioType	HGVS.c	HGVS.p	ID_HGVS.c	ID_HGVS.p	ID_HGVS_GN.c	ID_HGVS_GN.p
Mutation
NC_011083.1:3691534:C:A	NC_011083.1	3691534	C	A	SNP	11	0	11	11	100.000000	...	malT	SEHA_RS18595	transcript	protein_coding	c.1074C>A	p.S358R	hgvs:NC_011083.1:SEHA_RS18595:c.1074C>A	hgvs:NC_011083.1:SEHA_RS18595:p.S358R	hgvs_gn:NC_011083.1:malT:c.1074C>A	hgvs_gn:NC_011083.1:malT:p.S358R
NC_011083.1:2428696:AT:A	NC_011083.1	2428696	AT	A	INDEL	11	0	11	11	100.000000	...	SEHA_RS12535-nrdA	SEHA_RS12535-SEHA_RS12540	intergenic_region	<NA>	n.2428697delT	<NA>	hgvs:NC_011083.1:n.2428697delT	<NA>	hgvs_gn:NC_011083.1:n.2428697delT	<NA>
NC_011083.1:3796657:C:A	NC_011083.1	3796657	C	A	SNP	11	0	11	11	100.000000	...	SEHA_RS19050	SEHA_RS19050	transcript	protein_coding	c.489C>A	p.G163G	hgvs:NC_011083.1:SEHA_RS19050:c.489C>A	hgvs:NC_011083.1:SEHA_RS19050:p.G163G	hgvs_gn:NC_011083.1:SEHA_RS19050:c.489C>A	hgvs_gn:NC_011083.1:SEHA_RS19050:p.G163G
NC_011083.1:2841095:G:A	NC_011083.1	2841095	G	A	SNP	11	0	11	11	100.000000	...	bapA	SEHA_RS14380	transcript	protein_coding	c.5317G>A	p.D1773N	hgvs:NC_011083.1:SEHA_RS14380:c.5317G>A	hgvs:NC_011083.1:SEHA_RS14380:p.D1773N	hgvs_gn:NC_011083.1:bapA:c.5317G>A	hgvs_gn:NC_011083.1:bapA:p.D1773N
NC_011083.1:2209731:A:C	NC_011083.1	2209731	A	C	SNP	11	0	11	11	100.000000	...	SEHA_RS11560	SEHA_RS11560	transcript	protein_coding	c.250T>G	p.W84G	hgvs:NC_011083.1:SEHA_RS11560:c.250T>G	hgvs:NC_011083.1:SEHA_RS11560:p.W84G	hgvs_gn:NC_011083.1:SEHA_RS11560:c.250T>G	hgvs_gn:NC_011083.1:SEHA_RS11560:p.W84G
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
NC_011083.1:3535143:A:T	NC_011083.1	3535143	A	T	SNP	1	2	3	11	9.090909	...	oadA	SEHA_RS17780	transcript	protein_coding	c.1101T>A	p.I367I	hgvs:NC_011083.1:SEHA_RS17780:c.1101T>A	hgvs:NC_011083.1:SEHA_RS17780:p.I367I	hgvs_gn:NC_011083.1:oadA:c.1101T>A	hgvs_gn:NC_011083.1:oadA:p.I367I
NC_011083.1:1310359:T:G	NC_011083.1	1310359	T	G	SNP	1	0	1	11	9.090909	...	nagK	SEHA_RS06940	transcript	protein_coding	c.175T>G	p.S59A	hgvs:NC_011083.1:SEHA_RS06940:c.175T>G	hgvs:NC_011083.1:SEHA_RS06940:p.S59A	hgvs_gn:NC_011083.1:nagK:c.175T>G	hgvs_gn:NC_011083.1:nagK:p.S59A
NC_011083.1:936248:G:A	NC_011083.1	936248	G	A	SNP	1	0	1	11	9.090909	...	rhlE	SEHA_RS05055	transcript	protein_coding	c.1225G>A	p.G409R	hgvs:NC_011083.1:SEHA_RS05055:c.1225G>A	hgvs:NC_011083.1:SEHA_RS05055:p.G409R	hgvs_gn:NC_011083.1:rhlE:c.1225G>A	hgvs_gn:NC_011083.1:rhlE:p.G409R
NC_011083.1:4536022:C:T	NC_011083.1	4536022	C	T	SNP	1	0	1	11	9.090909	...	adiA	SEHA_RS22440	transcript	protein_coding	c.909G>A	p.P303P	hgvs:NC_011083.1:SEHA_RS22440:c.909G>A	hgvs:NC_011083.1:SEHA_RS22440:p.P303P	hgvs_gn:NC_011083.1:adiA:c.909G>A	hgvs_gn:NC_011083.1:adiA:p.P303P
NC_011083.1:4800575:C:T	NC_011083.1	4800575	C	T	SNP	1	0	1	11	9.090909	...	SEHA_RS23755-SEHA_RS23765	SEHA_RS23755-SEHA_RS23765	intergenic_region	<NA>	n.4800575C>T	<NA>	hgvs:NC_011083.1:n.4800575C>T	<NA>	hgvs_gn:NC_011083.1:n.4800575C>T	<NA>

72 rows × 24 columns

6. Putting it all together¶

So far we've seen connecting to an index (project), querying by mutations and by relationships in a tree, attaching external metadata, and working with MLST features. We can put this all together and do some basic visualiations of our selections on the tree (using the ete3 toolkit).

6.1. Highlight everything in each outbreak and show a unique mutation/MLST feature¶

6.1.1 Load tree and attach data frame¶

In [55]:

q = db.samples_query(universe='mutations', reference_name='NC_011083')\
    .join(metadata_df, sample_names_column='Sample Name Orig')
q

Out[55]:

<MutationTreeSamplesQuery[selected=100% (59/59) samples, unknown=0% (0/59) samples]>

6.1.2 Select samples in outbreak 1 and show mutations¶

In [56]:

q_o1 = q.isa('1', isa_column='outbreak_number', kind='dataframe')
q_o1

Out[56]:

<MutationTreeSamplesQuery[selected=17% (10/59) samples, unknown=0% (0/59) samples]>

In [57]:

q_o1.features_summary(selection='unique').sort_values('Count', ascending=False).set_index('ID_HGVS_GN.p').head(8)

Out[57]:

	Sequence	Position	Deletion	Insertion	Type	Count	Unknown Count	Present and Unknown Count	Total	Percent	...	Annotation_Impact	Gene_Name	Gene_ID	Feature_Type	Transcript_BioType	HGVS.c	HGVS.p	ID_HGVS.c	ID_HGVS.p	ID_HGVS_GN.c
ID_HGVS_GN.p
hgvs_gn:NC_011083.1:pqiA:p.Q219Q	NC_011083.1	1159308	G	A	SNP	10	<NA>	<NA>	10	100.0	...	LOW	pqiA	SEHA_RS06145	transcript	protein_coding	c.657G>A	p.Q219Q	hgvs:NC_011083.1:SEHA_RS06145:c.657G>A	hgvs:NC_011083.1:SEHA_RS06145:p.Q219Q	hgvs_gn:NC_011083.1:pqiA:c.657G>A
hgvs_gn:NC_011083.1:ychF:p.I128V	NC_011083.1	1930007	A	G	SNP	10	<NA>	<NA>	10	100.0	...	MODERATE	ychF	SEHA_RS10040	transcript	protein_coding	c.382A>G	p.I128V	hgvs:NC_011083.1:SEHA_RS10040:c.382A>G	hgvs:NC_011083.1:SEHA_RS10040:p.I128V	hgvs_gn:NC_011083.1:ychF:c.382A>G
<NA>	NC_011083.1	2732743	G	A	SNP	10	<NA>	<NA>	10	100.0	...	MODIFIER	sseB-pepB	SEHA_RS13890-SEHA_RS13895	intergenic_region	<NA>	n.2732743G>A	<NA>	hgvs:NC_011083.1:n.2732743G>A	<NA>	hgvs_gn:NC_011083.1:n.2732743G>A
hgvs_gn:NC_011083.1:SEHA_RS00825:p.I23F	NC_011083.1	58804	T	A	SNP	10	<NA>	<NA>	10	100.0	...	MODERATE	SEHA_RS00825	SEHA_RS00825	transcript	protein_coding	c.67A>T	p.I23F	hgvs:NC_011083.1:SEHA_RS00825:c.67A>T	hgvs:NC_011083.1:SEHA_RS00825:p.I23F	hgvs_gn:NC_011083.1:SEHA_RS00825:c.67A>T
<NA>	NC_011083.1	349640	A	G	SNP	10	<NA>	<NA>	10	100.0	...	MODIFIER	istB-tcfA	SEHA_RS26020-SEHA_RS02145	intergenic_region	<NA>	n.349640A>G	<NA>	hgvs:NC_011083.1:n.349640A>G	<NA>	hgvs_gn:NC_011083.1:n.349640A>G
hgvs_gn:NC_011083.1:SEHA_RS22485:p.S224S	NC_011083.1	4548496	C	T	SNP	10	<NA>	<NA>	10	100.0	...	LOW	SEHA_RS22485	SEHA_RS22485	transcript	protein_coding	c.672C>T	p.S224S	hgvs:NC_011083.1:SEHA_RS22485:c.672C>T	hgvs:NC_011083.1:SEHA_RS22485:p.S224S	hgvs_gn:NC_011083.1:SEHA_RS22485:c.672C>T
hgvs_gn:NC_011083.1:SEHA_RS08400:p.P5P	NC_011083.1	1582551	C	A	SNP	10	<NA>	<NA>	10	100.0	...	LOW	SEHA_RS08400	SEHA_RS08400	transcript	protein_coding	c.15G>T	p.P5P	hgvs:NC_011083.1:SEHA_RS08400:c.15G>T	hgvs:NC_011083.1:SEHA_RS08400:p.P5P	hgvs_gn:NC_011083.1:SEHA_RS08400:c.15G>T
<NA>	NC_011083.1	1572779	CTTCAGCGCAACGCGGCTGTAAGCATTGCTGCTAAAAAGAGGCCGG...	C	INDEL	4	<NA>	<NA>	10	40.0	...	HIGH	gstA&pdxY	SEHA_RS08355&SEHA_RS08350	gene_variant	<NA>	n.1572780_1572862delTTCAGCGCAACGCGGCTGTAAGCATT...	<NA>	hgvs:NC_011083.1:n.1572780_1572862delTTCAGCGCA...	<NA>	hgvs_gn:NC_011083.1:n.1572780_1572862delTTCAGC...

8 rows × 23 columns

Passing selection='unique' will select only those mutations that are unique to the selected set (useful for searching for lineage-defining mutations or mutations in subsets of the selected samples).

Let's pick one of the lineage-defining mutations (i.e., a mutation uniquely found in only Outbreak 1). We can detect these by using selection='unique' and filtering to mutations where Count is equal to Total (count of samples with mutation equals total samples in selection).

In [58]:

o1_df = q_o1.features_summary(selection='unique').sort_values(['Count', 'Position'])
o1_df[o1_df['Count'] == o1_df['Total']].set_index('ID_HGVS.p').head(3)

Out[58]:

	Sequence	Position	Deletion	Insertion	Type	Count	Unknown Count	Present and Unknown Count	Total	Percent	...	Annotation_Impact	Gene_Name	Gene_ID	Feature_Type	Transcript_BioType	HGVS.c	HGVS.p	ID_HGVS.c	ID_HGVS_GN.c	ID_HGVS_GN.p
ID_HGVS.p
hgvs:NC_011083.1:SEHA_RS00825:p.I23F	NC_011083.1	58804	T	A	SNP	10	<NA>	<NA>	10	100.0	...	MODERATE	SEHA_RS00825	SEHA_RS00825	transcript	protein_coding	c.67A>T	p.I23F	hgvs:NC_011083.1:SEHA_RS00825:c.67A>T	hgvs_gn:NC_011083.1:SEHA_RS00825:c.67A>T	hgvs_gn:NC_011083.1:SEHA_RS00825:p.I23F
<NA>	NC_011083.1	349640	A	G	SNP	10	<NA>	<NA>	10	100.0	...	MODIFIER	istB-tcfA	SEHA_RS26020-SEHA_RS02145	intergenic_region	<NA>	n.349640A>G	<NA>	hgvs:NC_011083.1:n.349640A>G	hgvs_gn:NC_011083.1:n.349640A>G	<NA>
hgvs:NC_011083.1:SEHA_RS06145:p.Q219Q	NC_011083.1	1159308	G	A	SNP	10	<NA>	<NA>	10	100.0	...	LOW	pqiA	SEHA_RS06145	transcript	protein_coding	c.657G>A	p.Q219Q	hgvs:NC_011083.1:SEHA_RS06145:c.657G>A	hgvs_gn:NC_011083.1:pqiA:c.657G>A	hgvs_gn:NC_011083.1:pqiA:p.Q219Q

3 rows × 23 columns

We will highlight hgvs:NC_011083.1:SEHA_RS00825:p.I23F (NC_011083.1:58804:T:A) in the tree.

Let's also select a unique MLST feature to highlight.

6.1.3. Select unique MLST features¶

In [59]:

q_o1.features_summary(kind='mlst', scheme='sistr_330',
                      selection='unique').sort_values('Count', ascending=False)

Out[59]:

	Scheme	Locus	Allele	Count	Unknown Count	Present and Unknown Count	Total	Percent	Unknown Percent	Present and Unknown Percent
MLST Feature
mlst:sistr_330:NZ_AOXE01000047.1_56:1179310307	sistr_330	NZ_AOXE01000047.1_56	1179310307	2	<NA>	<NA>	10	20.0	<NA>	<NA>

Here, we can see there's only one MLST feature to pick mlst:sistr_330:NZ_AOXE01000047.1_56:1179310307 that's unique to Outbreak 1 (but not unique to every sample in the outbreak). We will show this feature on the tree too.

6.1.4 Highlight samples in Outbreak 1 and display those with a particular mutation/MLST feature¶

In [60]:

# Use `highlight()` to highlight Outbreak 1
# Chain with other `highlight()` methods to highlight other outbreak samples
# Use `annotate()` to add a column indicating the presence/absence of '58804:T:A' mutation
# Use another `annotate()` to add a column indicating the presence/absence of the MLST feature
# Use `highlight_style` to change the color scheme of the highlights
ts = q.tree_styler(annotate_show_box_label=True, legend_nsize=30, legend_fsize=14,
                   show_legend_type_labels=False, legend_title='Legend')\
        .highlight(q_o1, legend_label='Outbreak 1')\
        .highlight(q.isa('2', isa_column='outbreak_number', kind='dataframe'), legend_label='Outbreak 2')\
        .highlight(q.isa('3', isa_column='outbreak_number', kind='dataframe'), legend_label='Outbreak 3')\
        .annotate(q.hasa('hgvs:NC_011083.1:SEHA_RS00825:p.I23F'), box_label='SEHA_RS00825:I23F')\
        .annotate(q.hasa('mlst:sistr_330:NZ_AOXE01000047.1_56:1179310307'), box_label='mlst:1179310307')
ts.render(w=600)

Out[60]:

This shows the

6.1.5: Save the output to a file¶

In [61]:

# Save this to a PDF file
file = 'output1.pdf'
x = ts.render(file)
print(f'Saved to {file}')
# (I assign to x to surpress printing text when saving in Jupyter)

Saved to output1.pdf

6.2: Working with mutations table¶

6.2.1 Get mutations for all samples with Outbreak number `2`¶

In [62]:

q = db.samples_query().join(metadata_df, sample_names_column='Sample Name Orig',
                            default_isa_column='outbreak_number', default_isa_kind='dataframe')
q.isa('2')

Out[62]:

<DataFrameSamplesQuery[selected=14% (8/59) samples, unknown=0% (0/59) samples]>

In [63]:

q.isa('2').features_summary().sort_values('Count', ascending=False)

Out[63]:

	Sequence	Position	Deletion	Insertion	Type	Count	Unknown Count	Present and Unknown Count	Total	Percent	...	Gene_Name	Gene_ID	Feature_Type	Transcript_BioType	HGVS.c	HGVS.p	ID_HGVS.c	ID_HGVS.p	ID_HGVS_GN.c	ID_HGVS_GN.p
Mutation
NC_011083.1:1947794:A:G	NC_011083.1	1947794	A	G	SNP	8	NaN	NaN	8	100.0	...	SEHA_RS10125	SEHA_RS10125	transcript	protein_coding	c.181A>G	p.T61A	hgvs:NC_011083.1:SEHA_RS10125:c.181A>G	hgvs:NC_011083.1:SEHA_RS10125:p.T61A	hgvs_gn:NC_011083.1:SEHA_RS10125:c.181A>G	hgvs_gn:NC_011083.1:SEHA_RS10125:p.T61A
NC_011083.1:3945452:G:C	NC_011083.1	3945452	G	C	SNP	8	NaN	NaN	8	100.0	...	ligB	SEHA_RS19715	transcript	protein_coding	c.400C>G	p.L134V	hgvs:NC_011083.1:SEHA_RS19715:c.400C>G	hgvs:NC_011083.1:SEHA_RS19715:p.L134V	hgvs_gn:NC_011083.1:ligB:c.400C>G	hgvs_gn:NC_011083.1:ligB:p.L134V
NC_011083.1:602044:T:G	NC_011083.1	602044	T	G	SNP	8	NaN	NaN	8	100.0	...	SEHA_RS03410	SEHA_RS03410	transcript	protein_coding	c.1163T>G	p.V388G	hgvs:NC_011083.1:SEHA_RS03410:c.1163T>G	hgvs:NC_011083.1:SEHA_RS03410:p.V388G	hgvs_gn:NC_011083.1:SEHA_RS03410:c.1163T>G	hgvs_gn:NC_011083.1:SEHA_RS03410:p.V388G
NC_011083.1:2209731:A:C	NC_011083.1	2209731	A	C	SNP	8	NaN	NaN	8	100.0	...	SEHA_RS11560	SEHA_RS11560	transcript	protein_coding	c.250T>G	p.W84G	hgvs:NC_011083.1:SEHA_RS11560:c.250T>G	hgvs:NC_011083.1:SEHA_RS11560:p.W84G	hgvs_gn:NC_011083.1:SEHA_RS11560:c.250T>G	hgvs_gn:NC_011083.1:SEHA_RS11560:p.W84G
NC_011083.1:2225479:G:T	NC_011083.1	2225479	G	T	SNP	8	NaN	NaN	8	100.0	...	rfbB	SEHA_RS11635	transcript	protein_coding	c.721C>A	p.R241S	hgvs:NC_011083.1:SEHA_RS11635:c.721C>A	hgvs:NC_011083.1:SEHA_RS11635:p.R241S	hgvs_gn:NC_011083.1:rfbB:c.721C>A	hgvs_gn:NC_011083.1:rfbB:p.R241S
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
NC_011083.1:1514675:G:A	NC_011083.1	1514675	G	A	SNP	1	NaN	NaN	8	12.5	...	purR-SEHA_RS08035	SEHA_RS08030-SEHA_RS08035	intergenic_region	<NA>	n.1514675G>A	<NA>	hgvs:NC_011083.1:n.1514675G>A	<NA>	hgvs_gn:NC_011083.1:n.1514675G>A	<NA>
NC_011083.1:2010696:G:T	NC_011083.1	2010696	G	T	SNP	1	NaN	NaN	8	12.5	...	SEHA_RS25120-SEHA_RS10450	SEHA_RS25120-SEHA_RS10450	intergenic_region	<NA>	n.2010696G>T	<NA>	hgvs:NC_011083.1:n.2010696G>T	<NA>	hgvs_gn:NC_011083.1:n.2010696G>T	<NA>
NC_011083.1:4800575:C:T	NC_011083.1	4800575	C	T	SNP	1	NaN	NaN	8	12.5	...	SEHA_RS23755-SEHA_RS23765	SEHA_RS23755-SEHA_RS23765	intergenic_region	<NA>	n.4800575C>T	<NA>	hgvs:NC_011083.1:n.4800575C>T	<NA>	hgvs_gn:NC_011083.1:n.4800575C>T	<NA>
NC_011083.1:1310359:T:G	NC_011083.1	1310359	T	G	SNP	1	NaN	NaN	8	12.5	...	nagK	SEHA_RS06940	transcript	protein_coding	c.175T>G	p.S59A	hgvs:NC_011083.1:SEHA_RS06940:c.175T>G	hgvs:NC_011083.1:SEHA_RS06940:p.S59A	hgvs_gn:NC_011083.1:nagK:c.175T>G	hgvs_gn:NC_011083.1:nagK:p.S59A
NC_011083.1:1691486:G:C	NC_011083.1	1691486	G	C	SNP	1	NaN	NaN	8	12.5	...	SEHA_RS08920	SEHA_RS08920	transcript	protein_coding	c.781C>G	p.Q261E	hgvs:NC_011083.1:SEHA_RS08920:c.781C>G	hgvs:NC_011083.1:SEHA_RS08920:p.Q261E	hgvs_gn:NC_011083.1:SEHA_RS08920:c.781C>G	hgvs_gn:NC_011083.1:SEHA_RS08920:p.Q261E

151 rows × 24 columns

6.2.2 Plot distribution of mutations on genome for outbreak `2`¶

In [64]:

q_o2_positions = q.isa('2').features_summary()\
    .groupby('Position')\
    .agg({'Position': 'first', 'Count': 'sum'})
q_o2_positions

Out[64]:

	Position	Count
Position
70519	70519	7
140658	140658	8
203200	203200	8
231665	231665	8
272673	272673	8
...	...	...
4860641	4860641	8
4876176	4876176	8
4882099	4882099	8
4884909	4884909	8
4888140	4888140	8

150 rows × 2 columns

In [65]:

import matplotlib.pyplot as plt

reference_genome = db.reference_names()[0]

# I'm just showing a histogram of positions, I'm ignoring sample counts
q_o2_positions['Position'].hist(bins=100)

plt.title('Distribution of mutations for Outbreak 2', fontdict={'size': 16})
plt.xlabel(f'Position on {reference_genome} (bp)', fontdict={'size': 14})
plt.ylabel('Count of mutation positions', fontdict={'size': 14})

Out[65]:

Text(0, 0.5, 'Count of mutation positions')

6.2.3 Compare to distribution of mutations in outbreak 1 and 3¶

We'll compare to outbreak 1 and 3 using the same histogram.

In [66]:

q_o1_positions = q.isa('1').features_summary()\
    .groupby('Position')\
    .agg({'Position': 'first', 'Count': 'sum'})
q_o1_positions

Out[66]:

	Position	Count
Position
58804	58804	10
63393	63393	10
71448	71448	10
140658	140658	10
203200	203200	10
...	...	...
4860641	4860641	10
4876176	4876176	10
4882099	4882099	10
4884909	4884909	10
4888140	4888140	10

137 rows × 2 columns

In [67]:

q_o3_positions = q.isa('3').features_summary()\
    .groupby('Position')\
    .agg({'Position': 'first', 'Count': 'sum'})
q_o3_positions

Out[67]:

	Position	Count
Position
140658	140658	28
203200	203200	28
231665	231665	28
298472	298472	28
298943	298943	26
...	...	...
4860641	4860641	28
4876176	4876176	28
4882099	4882099	28
4884909	4884909	28
4888140	4888140	28

149 rows × 2 columns

In [68]:

o1_positions_list = q_o1_positions['Position'].tolist()
o2_positions_list = q_o2_positions['Position'].tolist()
o3_positions_list = q_o3_positions['Position'].tolist()
data = [o1_positions_list, o2_positions_list, o3_positions_list]
labels = ['Outbreak 1', 'Outbreak 2', 'Outbreak 3']
colors = ['#a6cee3', '#1f78b4', '#b2df8a']

# Create histogram
plt.figure(figsize=(15,6))
plt.hist(data,
         label=labels, color=colors, edgecolor='black',
         bins=25)

plt.legend(prop={'size': 14})
plt.title('Distribution of mutations for Outbreaks 1,2, and 3', fontdict={'size': 16})
plt.xlabel(f'Position on {reference_genome} (bp)', fontdict={'size': 14})
plt.ylabel('Count of mutation positions', fontdict={'size': 14})

Out[68]:

Text(0, 0.5, 'Count of mutation positions')

6.2.4 (Optional) View distribution of only unique mutations in each outbreak¶

Try replacing features_summary() with features_summary(selection='unique'). This will change the plot from a distribution of all mutations in each outbreak to only a distribution of the mutations uniquely found in each outbreak.

6. End¶

You've made it to the end. You are amazing 😀🥳. Way to go. I hope you enjoyed the tutorial.