This tutorial consists of an index constructed from a Salmonella enterica dataset previously published in (https://doi.org/10.1128/jcm.02200-15). To prepare the data, the genomes were downloaded from NCBI and run through the provided analysis pipeline (gdi analysis
) to identify SNVs, as well as skesa, SISTR and MLST for MLST analysis. A notebook describing how the index was constructed is available at index-salmonella-data-tutorial-1.ipynb.
Let's first download the precomputed index for this tutorial. To do this please run the below commands:
Note: In a Jupyter Python notebook, prepending a command with !
runs the command in a shell instead of the Python interpreter (e.g., !unzip
runs the command unzip
).
!wget -O salmonella-project.zip https://figshare.com/ndownloader/files/31555565?private_link=0405199820a13aedca42
!unzip -n salmonella-project.zip | head -n 3
!echo
!ls
--2022-12-15 15:34:43-- https://figshare.com/ndownloader/files/31555565?private_link=0405199820a13aedca42 Resolving figshare.com (figshare.com)... 54.78.10.27, 52.51.22.31, 2a05:d018:1f4:d003:7186:f789:624a:fe25, ... Connecting to figshare.com (figshare.com)|54.78.10.27|:443... connected. HTTP request sent, awaiting response... 302 Found Location: https://s3-eu-west-1.amazonaws.com/pfigshare-u-files/31555565/salmonellaproject20211122.zip?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIYCQYOYV5JSSROOA/20221215/eu-west-1/s3/aws4_request&X-Amz-Date=20221215T213444Z&X-Amz-Expires=10&X-Amz-SignedHeaders=host&X-Amz-Signature=8c3656328dd2b4da2c34201e6fa383f2b949ff1f6d2c4a877defccfedb1a8456 [following] --2022-12-15 15:34:44-- https://s3-eu-west-1.amazonaws.com/pfigshare-u-files/31555565/salmonellaproject20211122.zip?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIYCQYOYV5JSSROOA/20221215/eu-west-1/s3/aws4_request&X-Amz-Date=20221215T213444Z&X-Amz-Expires=10&X-Amz-SignedHeaders=host&X-Amz-Signature=8c3656328dd2b4da2c34201e6fa383f2b949ff1f6d2c4a877defccfedb1a8456 Resolving s3-eu-west-1.amazonaws.com (s3-eu-west-1.amazonaws.com)... 52.218.108.19, 52.218.88.203, 52.92.16.112, ... Connecting to s3-eu-west-1.amazonaws.com (s3-eu-west-1.amazonaws.com)|52.218.108.19|:443... connected. HTTP request sent, awaiting response... 200 OK Length: 10696541 (10M) [application/zip] Saving to: ‘salmonella-project.zip’ salmonella-project. 100%[===================>] 10.20M 2.44MB/s in 4.4s 2022-12-15 15:34:49 (2.31 MB/s) - ‘salmonella-project.zip’ saved [10696541/10696541] Archive: salmonella-project.zip creating: salmonella-project/ creating: salmonella-project/.gdi-data/ create-index tutorial-1-salmonella.ipynb salmonella-project tutorial-2-sars-cov-2.ipynb salmonella-project.zip tutorial-3-querying-overview.ipynb
Great. Now that we've got some data (in the salmonella-project/
directory), let's explore the command-line interface to access this data.
gdi
)¶Let's first print out the version of gdi
being used.
!gdi --version
gdi, version 0.9.2
The below command lists the samples loaded in this index. Note that instead of passing the project with --project-dir
you can change to the project directory and gdi
will figure out which project you're in.
!gdi --project-dir salmonella-project list samples | head -n 5
SH12-012 SH14-017 SH12-004 SH12-005 SH12-013
This lists all the loaded reference genomes in this genomics index. This can be useful for commands later which require you to pass the name of the reference genome.
!gdi --project-dir salmonella-project list genomes
NC_011083
This searches for a particular mutation (in this case a deletion of a CG
at position 2865334 and an insertion of a G
). If you are familiar with the VCF format, this is given in terms of CHROM:POS:REF:ALT
format.
Note, prefixing the feature string with hasa:
(as in hasa:NC_011083.1:4559277:C:T
) is what is used to query the database for this particular feature.
!gdi --project-dir salmonella-project query 'hasa:NC_011083.1:4559277:C:T' | column -s$'\t' -t
Query Sample Name Sample ID Status NC_011083.1:4559277:C:T SH12-004 3 Present NC_011083.1:4559277:C:T SH12-005 4 Present NC_011083.1:4559277:C:T SH10-001 7 Present NC_011083.1:4559277:C:T SH12-002 12 Present NC_011083.1:4559277:C:T SH12-007 15 Present NC_011083.1:4559277:C:T SH12-001 17 Present NC_011083.1:4559277:C:T SH12-003 19 Present NC_011083.1:4559277:C:T SH12-010 28 Present NC_011083.1:4559277:C:T SH11-001 36 Present NC_011083.1:4559277:C:T SH12-006 47 Present NC_011083.1:4559277:C:T SH12-009 53 Present NC_011083.1:4559277:C:T SH12-011 54 Present NC_011083.1:4559277:C:T SH12-008 56 Present
Alternatively, you can search for samples with a particular MLST allele.
!gdi --project-dir salmonella-project query 'hasa:mlst:senterica:aroC:2' | column -s$'\t' -t | head -n 5
Query Sample Name Sample ID Status mlst:senterica:aroC:2 SH12-012 1 Present mlst:senterica:aroC:2 SH14-017 2 Present mlst:senterica:aroC:2 SH12-004 3 Present mlst:senterica:aroC:2 SH12-005 4 Present
If you want to print a table summarizing all features within a query, you can use --features-summary [FEATURES]
.
For example, to print a table summarizing mutations alleles:
!gdi --project-dir salmonella-project query --features-summary mutations |\
cut -f 1,7,10,11,25 |\
head -n 5 |\
column -s$'\t' -t
Mutation Count Total Percent ID_HGVS_GN.p NC_011083.1:3190398:G:A 59 59 100 - NC_011083.1:542308:G:A 59 59 100 hgvs_gn:NC_011083.1:SEHA_RS03145:p.T5T NC_011083.1:2288294:A:G 59 59 100 hgvs_gn:NC_011083.1:thiD:p.P119P NC_011083.1:4888140:G:T 59 59 100 hgvs_gn:NC_011083.1:SEHA_RS24190:p.C28F cut: write error: Broken pipe
This shows a small selection of the top mutations out of all genomes in the dataset. Count is the number of genomes with that particular mutation, Total is the total number of genomes in the query (all genomes in the database in this case, feel free to try different query strings here to select subsets of genomes).
Note: the commands cut
, head
, and column
are just used to clean up the output for Jupyter. There is more information included in this full table.
You can also use the isa:
prefix to query for a particular genome. As in:
!gdi --project-dir salmonella-project query 'isa:SH14-002' | column -s$'\t' -t
Query Sample Name Sample ID Status isa_sample('SH14-002') SH14-002 30 Present
You can combine this with the previous --features-summary
command to summarize features in a particular genome.
!gdi --project-dir salmonella-project query 'isa:SH14-002' --features-summary mutations |\
cut -f 1,7,10,11,25 |\
head -n 5 |\
column -s$'\t' -t
Mutation Count Total Percent ID_HGVS_GN.p NC_011083.1:1947794:A:G 1 1 100 hgvs_gn:NC_011083.1:SEHA_RS10125:p.T61A NC_011083.1:4716545:T:C 1 1 100 hgvs_gn:NC_011083.1:SEHA_RS23395:p.I219I NC_011083.1:3945452:G:C 1 1 100 hgvs_gn:NC_011083.1:ligB:p.L134V NC_011083.1:3012412:C:T 1 1 100 hgvs_gn:NC_011083.1:invC:p.L25L cut: write error: Broken pipe
You can also query based on the phylogenetic distance between samples using the isin_[value]_[units]:
prefix to the query string. For example, use isin_10_substitutions:SH12-013
to search for genomes within 5 substitutions of genome SH12-013.
The query is based on a stored phylogenetic tree built using mutations/deletions derived from alignment to a reference genome. You will also have to pass --reference-name NC_011083
to specify the reference genome used to build the tree.
Note that 5 substitutions* here is derived from the branch lengths of the phylogenetic tree (multipled by the alignment length to convert substitutions/site to substitutions) and may not correspond to a count of the substitutions you get if you were to align genomes pairwise.*
!gdi --project-dir salmonella-project query --reference-name NC_011083 'isin_5_substitutions:SH14-002' | column -s$'\t' -t
Query Sample Name Sample ID Status within(5.0 substitutions of 'SH14-002') SH14-017 2 Present within(5.0 substitutions of 'SH14-002') SH14-025 6 Present within(5.0 substitutions of 'SH14-002') SH14-021 8 Present within(5.0 substitutions of 'SH14-002') SH14-027 10 Present within(5.0 substitutions of 'SH14-002') SH14-005 11 Present within(5.0 substitutions of 'SH14-002') SH14-007 13 Present within(5.0 substitutions of 'SH14-002') SH14-023 18 Present within(5.0 substitutions of 'SH14-002') SH14-003 20 Present within(5.0 substitutions of 'SH14-002') SH14-028 21 Present within(5.0 substitutions of 'SH14-002') SH14-006 24 Present within(5.0 substitutions of 'SH14-002') SH14-024 26 Present within(5.0 substitutions of 'SH14-002') SH14-009 27 Present within(5.0 substitutions of 'SH14-002') SH14-011 29 Present within(5.0 substitutions of 'SH14-002') SH14-002 30 Present within(5.0 substitutions of 'SH14-002') SH14-022 31 Present within(5.0 substitutions of 'SH14-002') SH14-014 33 Present within(5.0 substitutions of 'SH14-002') SH14-020 39 Present within(5.0 substitutions of 'SH14-002') SH14-018 40 Present within(5.0 substitutions of 'SH14-002') SH14-008 43 Present within(5.0 substitutions of 'SH14-002') SH14-015 44 Present within(5.0 substitutions of 'SH14-002') SH14-019 46 Present within(5.0 substitutions of 'SH14-002') SH14-016 48 Present within(5.0 substitutions of 'SH14-002') SH14-010 50 Present within(5.0 substitutions of 'SH14-002') SH14-004 52 Present within(5.0 substitutions of 'SH14-002') SH14-001 55 Present within(5.0 substitutions of 'SH14-002') SH14-026 57 Present within(5.0 substitutions of 'SH14-002') SH14-013 58 Present
Combine this with --features-summary mutations
to summarize the top mutations (and indels) in this set of selected genomes.
!gdi --project-dir salmonella-project query --reference-name NC_011083 'isin_5_substitutions:SH14-002' --features-summary mutations |\
cut -f 1,7,10,11,25 |\
head -n 5 |\
column -s$'\t' -t
Mutation Count Total Percent ID_HGVS_GN.p NC_011083.1:4465400:GGCCGAA:G 27 27 100 hgvs_gn:NC_011083.1:tyrB:p.E53_A54del NC_011083.1:3844892:G:T 27 27 100 hgvs_gn:NC_011083.1:SEHA_RS19240:p.T285K NC_011083.1:333057:T:G 27 27 100 hgvs_gn:NC_011083.1:tssM:p.L410R NC_011083.1:4636822:T:C 27 27 100 hgvs_gn:NC_011083.1:frdC:p.T91A cut: write error: Broken pipe
To build an alignment for use in further phylogenetics software you can do the following.
!gdi --project-dir salmonella-project build alignment --reference-name NC_011083 --align-type full --output-file out.aln
2022-12-15 15:35:26 INFO: Started building alignment for 59 samples with include_variants=['SNP', 'MNP', 'DELETION'] 2022-12-15 15:35:35 INFO: Finished building alignment for 59 samples. Took 8.65 seconds Wrote alignment to [out.aln]
!head -n 2 out.aln
>NC_011083 [reference genome] AGAGATTACGTCTGGTTGCAAGAGATCATAACAGGGGGAATTAGTTGAAAATAAATATAT
This will produce an alignment with length equal to the reference genome, masking any missing data or gaps with N
, and concatenating individual sequences (contigs) in the alignment together to construct a single, whole-genome alignment.
You can pass in specific samples to include with --sample
.
Now let's move on to the Python interface for loading, and querying data in this index. This is a much more powerful (and flexible) way to work with your data which attempts to provide seamless data flow between this index and Python pandas.
We first start out by connecting to our project through GenomicsDataIndex
.
import genomics_data_index.api as gdi
db = gdi.GenomicsDataIndex.connect('salmonella-project')
db
<GenomicsDataIndex(samples=59)>
Great. We're connected. The samples=59
tells us how many samples are in this database.
You can run db.sample_names()
or db.reference_names()
to list the samples and reference genomes in this database.
db.sample_names()[0:5]
['SH12-012', 'SH14-017', 'SH12-004', 'SH12-005', 'SH12-013']
db.reference_names()
['NC_011083']
To list a summary of all indexed mutations we can do:
summary_df = db.mutations_summary(reference_name='NC_011083').sort_values('Count', ascending=False)
summary_df
Sequence | Position | Deletion | Insertion | Type | Count | Unknown Count | Present and Unknown Count | Total | Percent | ... | Gene_Name | Gene_ID | Feature_Type | Transcript_BioType | HGVS.c | HGVS.p | ID_HGVS.c | ID_HGVS.p | ID_HGVS_GN.c | ID_HGVS_GN.p | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Mutation | |||||||||||||||||||||
NC_011083.1:3190398:G:A | NC_011083.1 | 3190398 | G | A | SNP | 59 | NaN | NaN | 59 | 100.000000 | ... | idi-lysS | SEHA_RS16035-SEHA_RS16040 | intergenic_region | <NA> | n.3190398G>A | <NA> | hgvs:NC_011083.1:n.3190398G>A | <NA> | hgvs_gn:NC_011083.1:n.3190398G>A | <NA> |
NC_011083.1:542308:G:A | NC_011083.1 | 542308 | G | A | SNP | 59 | NaN | NaN | 59 | 100.000000 | ... | SEHA_RS03145 | SEHA_RS03145 | transcript | protein_coding | c.15C>T | p.T5T | hgvs:NC_011083.1:SEHA_RS03145:c.15C>T | hgvs:NC_011083.1:SEHA_RS03145:p.T5T | hgvs_gn:NC_011083.1:SEHA_RS03145:c.15C>T | hgvs_gn:NC_011083.1:SEHA_RS03145:p.T5T |
NC_011083.1:2288294:A:G | NC_011083.1 | 2288294 | A | G | SNP | 59 | NaN | NaN | 59 | 100.000000 | ... | thiD | SEHA_RS11880 | transcript | protein_coding | c.357T>C | p.P119P | hgvs:NC_011083.1:SEHA_RS11880:c.357T>C | hgvs:NC_011083.1:SEHA_RS11880:p.P119P | hgvs_gn:NC_011083.1:thiD:c.357T>C | hgvs_gn:NC_011083.1:thiD:p.P119P |
NC_011083.1:4888140:G:T | NC_011083.1 | 4888140 | G | T | SNP | 59 | NaN | NaN | 59 | 100.000000 | ... | SEHA_RS24190 | SEHA_RS24190 | transcript | protein_coding | c.83G>T | p.C28F | hgvs:NC_011083.1:SEHA_RS24190:c.83G>T | hgvs:NC_011083.1:SEHA_RS24190:p.C28F | hgvs_gn:NC_011083.1:SEHA_RS24190:c.83G>T | hgvs_gn:NC_011083.1:SEHA_RS24190:p.C28F |
NC_011083.1:4824790:T:C | NC_011083.1 | 4824790 | T | C | SNP | 59 | NaN | NaN | 59 | 100.000000 | ... | SEHA_RS23860 | SEHA_RS23860 | transcript | protein_coding | c.1378T>C | p.*460Qext*? | hgvs:NC_011083.1:SEHA_RS23860:c.1378T>C | hgvs:NC_011083.1:SEHA_RS23860:p.*460Qext*? | hgvs_gn:NC_011083.1:SEHA_RS23860:c.1378T>C | hgvs_gn:NC_011083.1:SEHA_RS23860:p.*460Qext*? |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
NC_011083.1:863025:CCGGCGG:C | NC_011083.1 | 863025 | CCGGCGG | C | INDEL | 1 | NaN | NaN | 59 | 1.694915 | ... | tolA | SEHA_RS04685 | transcript | protein_coding | c.135_140delCGGCGG | p.G46_G47del | hgvs:NC_011083.1:SEHA_RS04685:c.135_140delCGGCGG | hgvs:NC_011083.1:SEHA_RS04685:p.G46_G47del | hgvs_gn:NC_011083.1:tolA:c.135_140delCGGCGG | hgvs_gn:NC_011083.1:tolA:p.G46_G47del |
NC_011083.1:1819152:G:C | NC_011083.1 | 1819152 | G | C | SNP | 1 | NaN | NaN | 59 | 1.694915 | ... | SEHA_RS09510 | SEHA_RS09510 | transcript | protein_coding | c.369G>C | p.W123C | hgvs:NC_011083.1:SEHA_RS09510:c.369G>C | hgvs:NC_011083.1:SEHA_RS09510:p.W123C | hgvs_gn:NC_011083.1:SEHA_RS09510:c.369G>C | hgvs_gn:NC_011083.1:SEHA_RS09510:p.W123C |
NC_011083.1:1427185:G:A | NC_011083.1 | 1427185 | G | A | SNP | 1 | NaN | NaN | 59 | 1.694915 | ... | SEHA_RS07595 | SEHA_RS07595 | transcript | protein_coding | c.123G>A | p.Q41Q | hgvs:NC_011083.1:SEHA_RS07595:c.123G>A | hgvs:NC_011083.1:SEHA_RS07595:p.Q41Q | hgvs_gn:NC_011083.1:SEHA_RS07595:c.123G>A | hgvs_gn:NC_011083.1:SEHA_RS07595:p.Q41Q |
NC_011083.1:3494198:G:A | NC_011083.1 | 3494198 | G | A | SNP | 1 | NaN | NaN | 59 | 1.694915 | ... | mlaE | SEHA_RS17580 | transcript | protein_coding | c.56C>T | p.T19M | hgvs:NC_011083.1:SEHA_RS17580:c.56C>T | hgvs:NC_011083.1:SEHA_RS17580:p.T19M | hgvs_gn:NC_011083.1:mlaE:c.56C>T | hgvs_gn:NC_011083.1:mlaE:p.T19M |
NC_011083.1:3255259:G:A | NC_011083.1 | 3255259 | G | A | SNP | 1 | NaN | NaN | 59 | 1.694915 | ... | SEHA_RS16365 | SEHA_RS16365 | transcript | protein_coding | c.366G>A | p.L122L | hgvs:NC_011083.1:SEHA_RS16365:c.366G>A | hgvs:NC_011083.1:SEHA_RS16365:p.L122L | hgvs_gn:NC_011083.1:SEHA_RS16365:c.366G>A | hgvs_gn:NC_011083.1:SEHA_RS16365:p.L122L |
421 rows × 24 columns
This gives us back a DataFrame summarizing the mutations.
The Count, Total, and Percent columns tell us how many samples have a particular mutation.
The Annotation and beyond columns give us detailed information about the mutation's impact (as derived from snpeff if this was run on the genomes (VCF files) prior to indexing). The ID_HGVS... columns give us alternative identifiers we can use for querying based on the HGVS notation (as derived from snpeff annotations).
You may notice that Unknown Count and Present and Unknown Count have values of NA
. These columns record the number of samples where this particular mutation is "Unknown/Missing" (either due to N's or missing data in the particular regions of the genome). Calculating this information can slow down generating the features table for large datasets and so it is disabled by default. To enabled it you can pass include_unknown_samples=True
to mutations_summary()
.
summary_df = db.mutations_summary(reference_name='NC_011083', include_unknown_samples=True).sort_values('Count', ascending=False)
summary_df
Sequence | Position | Deletion | Insertion | Type | Count | Unknown Count | Present and Unknown Count | Total | Percent | ... | Gene_Name | Gene_ID | Feature_Type | Transcript_BioType | HGVS.c | HGVS.p | ID_HGVS.c | ID_HGVS.p | ID_HGVS_GN.c | ID_HGVS_GN.p | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Mutation | |||||||||||||||||||||
NC_011083.1:3190398:G:A | NC_011083.1 | 3190398 | G | A | SNP | 59 | 0 | 59 | 59 | 100.000000 | ... | idi-lysS | SEHA_RS16035-SEHA_RS16040 | intergenic_region | <NA> | n.3190398G>A | <NA> | hgvs:NC_011083.1:n.3190398G>A | <NA> | hgvs_gn:NC_011083.1:n.3190398G>A | <NA> |
NC_011083.1:542308:G:A | NC_011083.1 | 542308 | G | A | SNP | 59 | 0 | 59 | 59 | 100.000000 | ... | SEHA_RS03145 | SEHA_RS03145 | transcript | protein_coding | c.15C>T | p.T5T | hgvs:NC_011083.1:SEHA_RS03145:c.15C>T | hgvs:NC_011083.1:SEHA_RS03145:p.T5T | hgvs_gn:NC_011083.1:SEHA_RS03145:c.15C>T | hgvs_gn:NC_011083.1:SEHA_RS03145:p.T5T |
NC_011083.1:2288294:A:G | NC_011083.1 | 2288294 | A | G | SNP | 59 | 0 | 59 | 59 | 100.000000 | ... | thiD | SEHA_RS11880 | transcript | protein_coding | c.357T>C | p.P119P | hgvs:NC_011083.1:SEHA_RS11880:c.357T>C | hgvs:NC_011083.1:SEHA_RS11880:p.P119P | hgvs_gn:NC_011083.1:thiD:c.357T>C | hgvs_gn:NC_011083.1:thiD:p.P119P |
NC_011083.1:4888140:G:T | NC_011083.1 | 4888140 | G | T | SNP | 59 | 0 | 59 | 59 | 100.000000 | ... | SEHA_RS24190 | SEHA_RS24190 | transcript | protein_coding | c.83G>T | p.C28F | hgvs:NC_011083.1:SEHA_RS24190:c.83G>T | hgvs:NC_011083.1:SEHA_RS24190:p.C28F | hgvs_gn:NC_011083.1:SEHA_RS24190:c.83G>T | hgvs_gn:NC_011083.1:SEHA_RS24190:p.C28F |
NC_011083.1:4824790:T:C | NC_011083.1 | 4824790 | T | C | SNP | 59 | 0 | 59 | 59 | 100.000000 | ... | SEHA_RS23860 | SEHA_RS23860 | transcript | protein_coding | c.1378T>C | p.*460Qext*? | hgvs:NC_011083.1:SEHA_RS23860:c.1378T>C | hgvs:NC_011083.1:SEHA_RS23860:p.*460Qext*? | hgvs_gn:NC_011083.1:SEHA_RS23860:c.1378T>C | hgvs_gn:NC_011083.1:SEHA_RS23860:p.*460Qext*? |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
NC_011083.1:863025:CCGGCGG:C | NC_011083.1 | 863025 | CCGGCGG | C | INDEL | 1 | 1 | 2 | 59 | 1.694915 | ... | tolA | SEHA_RS04685 | transcript | protein_coding | c.135_140delCGGCGG | p.G46_G47del | hgvs:NC_011083.1:SEHA_RS04685:c.135_140delCGGCGG | hgvs:NC_011083.1:SEHA_RS04685:p.G46_G47del | hgvs_gn:NC_011083.1:tolA:c.135_140delCGGCGG | hgvs_gn:NC_011083.1:tolA:p.G46_G47del |
NC_011083.1:1819152:G:C | NC_011083.1 | 1819152 | G | C | SNP | 1 | 0 | 1 | 59 | 1.694915 | ... | SEHA_RS09510 | SEHA_RS09510 | transcript | protein_coding | c.369G>C | p.W123C | hgvs:NC_011083.1:SEHA_RS09510:c.369G>C | hgvs:NC_011083.1:SEHA_RS09510:p.W123C | hgvs_gn:NC_011083.1:SEHA_RS09510:c.369G>C | hgvs_gn:NC_011083.1:SEHA_RS09510:p.W123C |
NC_011083.1:1427185:G:A | NC_011083.1 | 1427185 | G | A | SNP | 1 | 0 | 1 | 59 | 1.694915 | ... | SEHA_RS07595 | SEHA_RS07595 | transcript | protein_coding | c.123G>A | p.Q41Q | hgvs:NC_011083.1:SEHA_RS07595:c.123G>A | hgvs:NC_011083.1:SEHA_RS07595:p.Q41Q | hgvs_gn:NC_011083.1:SEHA_RS07595:c.123G>A | hgvs_gn:NC_011083.1:SEHA_RS07595:p.Q41Q |
NC_011083.1:3494198:G:A | NC_011083.1 | 3494198 | G | A | SNP | 1 | 0 | 1 | 59 | 1.694915 | ... | mlaE | SEHA_RS17580 | transcript | protein_coding | c.56C>T | p.T19M | hgvs:NC_011083.1:SEHA_RS17580:c.56C>T | hgvs:NC_011083.1:SEHA_RS17580:p.T19M | hgvs_gn:NC_011083.1:mlaE:c.56C>T | hgvs_gn:NC_011083.1:mlaE:p.T19M |
NC_011083.1:3255259:G:A | NC_011083.1 | 3255259 | G | A | SNP | 1 | 0 | 1 | 59 | 1.694915 | ... | SEHA_RS16365 | SEHA_RS16365 | transcript | protein_coding | c.366G>A | p.L122L | hgvs:NC_011083.1:SEHA_RS16365:c.366G>A | hgvs:NC_011083.1:SEHA_RS16365:p.L122L | hgvs_gn:NC_011083.1:SEHA_RS16365:c.366G>A | hgvs_gn:NC_011083.1:SEHA_RS16365:p.L122L |
421 rows × 24 columns
Now we can see the Unknown Count and Present and Unknown Count valued filled in (Present and Unknown means the number of samples where the mutation is present (the Count
column) plus the number of samples where it is Unknown if a mutation is present (the Unknown Count
column).
Let's look for mutations where less than 1/3 of the genomic samples have a particular mutation.
summary_df[summary_df['Percent'] < 33].sort_values(['Count', 'Position'], ascending=False).head(5)
Sequence | Position | Deletion | Insertion | Type | Count | Unknown Count | Present and Unknown Count | Total | Percent | ... | Gene_Name | Gene_ID | Feature_Type | Transcript_BioType | HGVS.c | HGVS.p | ID_HGVS.c | ID_HGVS.p | ID_HGVS_GN.c | ID_HGVS_GN.p | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Mutation | |||||||||||||||||||||
NC_011083.1:4482211:C:A | NC_011083.1 | 4482211 | C | A | SNP | 17 | 13 | 30 | 59 | 28.813559 | ... | siiE | SEHA_RS22245 | transcript | protein_coding | c.3787C>A | p.R1263S | hgvs:NC_011083.1:SEHA_RS22245:c.3787C>A | hgvs:NC_011083.1:SEHA_RS22245:p.R1263S | hgvs_gn:NC_011083.1:siiE:c.3787C>A | hgvs_gn:NC_011083.1:siiE:p.R1263S |
NC_011083.1:3167187:AACCACGACCACGACCACGACCACGACCACGACCACGACCACG:A | NC_011083.1 | 3167187 | AACCACGACCACGACCACGACCACGACCACGACCACGACCACG | A | INDEL | 15 | 16 | 31 | 59 | 25.423729 | ... | SEHA_RS15905 | SEHA_RS15905 | transcript | protein_coding | c.423_464delCGACCACGACCACGACCACGACCACGACCACGAC... | p.D142_H155del | hgvs:NC_011083.1:SEHA_RS15905:c.423_464delCGAC... | hgvs:NC_011083.1:SEHA_RS15905:p.D142_H155del | hgvs_gn:NC_011083.1:SEHA_RS15905:c.423_464delC... | hgvs_gn:NC_011083.1:SEHA_RS15905:p.D142_H155del |
NC_011083.1:4789652:G:A | NC_011083.1 | 4789652 | G | A | SNP | 13 | 0 | 13 | 59 | 22.033898 | ... | SEHA_RS23695-SEHA_RS23700 | SEHA_RS23695-SEHA_RS23700 | intergenic_region | <NA> | n.4789652G>A | <NA> | hgvs:NC_011083.1:n.4789652G>A | <NA> | hgvs_gn:NC_011083.1:n.4789652G>A | <NA> |
NC_011083.1:4739756:ACGC:A | NC_011083.1 | 4739756 | ACGC | A | INDEL | 13 | 0 | 13 | 59 | 22.033898 | ... | arcC | SEHA_RS23510 | transcript | protein_coding | c.566_568delGCG | p.G189del | hgvs:NC_011083.1:SEHA_RS23510:c.566_568delGCG | hgvs:NC_011083.1:SEHA_RS23510:p.G189del | hgvs_gn:NC_011083.1:arcC:c.566_568delGCG | hgvs_gn:NC_011083.1:arcC:p.G189del |
NC_011083.1:4559277:C:T | NC_011083.1 | 4559277 | C | T | SNP | 13 | 0 | 13 | 59 | 22.033898 | ... | SEHA_RS22545 | SEHA_RS22545 | transcript | protein_coding | c.474C>T | p.A158A | hgvs:NC_011083.1:SEHA_RS22545:c.474C>T | hgvs:NC_011083.1:SEHA_RS22545:p.A158A | hgvs_gn:NC_011083.1:SEHA_RS22545:c.474C>T | hgvs_gn:NC_011083.1:SEHA_RS22545:p.A158A |
5 rows × 24 columns
Let's use this table to select a mutation to search for. We will select NC_011083.1:4482211:C:A
.
hasa()
)¶We can search for genomic samples containing particular mutation by starting a query and using the hasa()
method.
q = db.samples_query().hasa('NC_011083.1:4482211:C:A')
q
<SamplesQueryIndex[selected=29% (17/59) samples, unknown=22% (13/59) samples]>
The hasa()
method can be read as "select samples that have a particular mutation". The selected samples are printed above (17/59 have this particular mutation).
q = db.samples_query().hasa('hgvs_gn:NC_011083.1:siiE:p.R1263S')
q
<SamplesQueryIndex[selected=29% (17/59) samples, unknown=22% (13/59) samples]>
The HGVS identifier is given as hgvs:sequence:gene_locus_id:variant
(or hgvs_gn:sequence:gene_name:variant
, hgvs_gn
is for specifying gene name instead of locus). You can find the corresponding HGVS identifiers from the mutations_summary()
table shown above.
summary_df.loc['NC_011083.1:4482211:C:A'][['ID_HGVS.c', 'ID_HGVS.p', 'ID_HGVS_GN.c', 'ID_HGVS_GN.p']]
ID_HGVS.c hgvs:NC_011083.1:SEHA_RS22245:c.3787C>A ID_HGVS.p hgvs:NC_011083.1:SEHA_RS22245:p.R1263S ID_HGVS_GN.c hgvs_gn:NC_011083.1:siiE:c.3787C>A ID_HGVS_GN.p hgvs_gn:NC_011083.1:siiE:p.R1263S Name: NC_011083.1:4482211:C:A, dtype: object
Note: As of right now, HGVS identifiers are derived when using SnpEff, so this option will only work if SnpEff was run on the VCF files and corresponding HGVS identifiers were stored in the index.
You may notice that in the results of the query we find unknown=22% (13/59) samples
.
db.samples_query().hasa('hgvs_gn:NC_011083.1:siiE:p.R1263S')
<SamplesQueryIndex[selected=29% (17/59) samples, unknown=22% (13/59) samples]>
This shows us that there exists 1 sample where it is unknown whether or not it has a hgvs_gn:NC_011083.1:siiE:p.R1263S
mutation. This could be either that this particular region of the genome had no reads align to it or there exists ambiguous characters (N
), in this region, hence it cannot be determined whether or not the mutation in question exists.
Samples where the result of a query is unknown
are tracked and can be printed out using the commands shown below on accessing more details about a query.
Once we have a query/selection of samples the specific samples can be shown with the toframe()
method:
q.toframe(include_unknown=True)
Query | Sample Name | Sample ID | Status | |
---|---|---|---|---|
0 | hgvs_gn:NC_011083.1:siiE:p.R1263S | SH14-021 | 8 | Present |
1 | hgvs_gn:NC_011083.1:siiE:p.R1263S | SH14-005 | 11 | Present |
2 | hgvs_gn:NC_011083.1:siiE:p.R1263S | SH14-007 | 13 | Present |
3 | hgvs_gn:NC_011083.1:siiE:p.R1263S | SH10-30 | 16 | Present |
4 | hgvs_gn:NC_011083.1:siiE:p.R1263S | SH14-023 | 18 | Present |
5 | hgvs_gn:NC_011083.1:siiE:p.R1263S | SH14-003 | 20 | Present |
6 | hgvs_gn:NC_011083.1:siiE:p.R1263S | SH14-028 | 21 | Present |
7 | hgvs_gn:NC_011083.1:siiE:p.R1263S | SH14-006 | 24 | Present |
8 | hgvs_gn:NC_011083.1:siiE:p.R1263S | SH14-012 | 25 | Present |
9 | hgvs_gn:NC_011083.1:siiE:p.R1263S | SH14-009 | 27 | Present |
10 | hgvs_gn:NC_011083.1:siiE:p.R1263S | SH14-011 | 29 | Present |
11 | hgvs_gn:NC_011083.1:siiE:p.R1263S | SH14-002 | 30 | Present |
12 | hgvs_gn:NC_011083.1:siiE:p.R1263S | SH14-014 | 33 | Present |
13 | hgvs_gn:NC_011083.1:siiE:p.R1263S | SH09-29 | 35 | Present |
14 | hgvs_gn:NC_011083.1:siiE:p.R1263S | SH14-008 | 43 | Present |
15 | hgvs_gn:NC_011083.1:siiE:p.R1263S | SH14-019 | 46 | Present |
16 | hgvs_gn:NC_011083.1:siiE:p.R1263S | SH14-004 | 52 | Present |
17 | hgvs_gn:NC_011083.1:siiE:p.R1263S | SH14-017 | 2 | Unknown |
18 | hgvs_gn:NC_011083.1:siiE:p.R1263S | SH14-025 | 6 | Unknown |
19 | hgvs_gn:NC_011083.1:siiE:p.R1263S | SH14-027 | 10 | Unknown |
20 | hgvs_gn:NC_011083.1:siiE:p.R1263S | SH14-024 | 26 | Unknown |
21 | hgvs_gn:NC_011083.1:siiE:p.R1263S | SH14-022 | 31 | Unknown |
22 | hgvs_gn:NC_011083.1:siiE:p.R1263S | SH14-020 | 39 | Unknown |
23 | hgvs_gn:NC_011083.1:siiE:p.R1263S | SH14-018 | 40 | Unknown |
24 | hgvs_gn:NC_011083.1:siiE:p.R1263S | SH14-015 | 44 | Unknown |
25 | hgvs_gn:NC_011083.1:siiE:p.R1263S | SH14-016 | 48 | Unknown |
26 | hgvs_gn:NC_011083.1:siiE:p.R1263S | SH14-010 | 50 | Unknown |
27 | hgvs_gn:NC_011083.1:siiE:p.R1263S | SH14-001 | 55 | Unknown |
28 | hgvs_gn:NC_011083.1:siiE:p.R1263S | SH14-026 | 57 | Unknown |
29 | hgvs_gn:NC_011083.1:siiE:p.R1263S | SH14-013 | 58 | Unknown |
You can use include_unknown=True
to include samples where the status of the query is Unknown (unknown whether it is True or False). By default unknowns are not included.
To summarize, you can use summary()
:
q.summary()
Query | Present | Absent | Unknown | Total | % Present | % Absent | % Unknown | |
---|---|---|---|---|---|---|---|---|
0 | hgvs_gn:NC_011083.1:siiE:p.R1263S | 17 | 29 | 13 | 59 | 28.813559 | 49.152542 | 22.033898 |
Or you can use tolist()
(by default unknowns are not included, you can use include_unknown=True
to include them).
q.tolist()
['SH14-021', 'SH14-005', 'SH14-007', 'SH10-30', 'SH14-023', 'SH14-003', 'SH14-028', 'SH14-006', 'SH14-012', 'SH14-009', 'SH14-011', 'SH14-002', 'SH14-014', 'SH09-29', 'SH14-008', 'SH14-019', 'SH14-004']
Queries can be chained together to select samples that match every criteria given in the has()
method.
q = db.samples_query() \
.hasa('hgvs_gn:NC_011083.1:siiE:p.R1263S') \
.hasa('NC_011083.1:3535635:A:G')
q
<SamplesQueryIndex[selected=5% (3/59) samples, unknown=46% (27/59) samples]>
This can be read as "select all samples that have a R1263S
mutation (amino acid) on gene siiE
AND select all samples that have an A to G
mutation on position 3535635
of sequence NC_011083.1
".
You may notice that the unknown's have increased to 46%. In the data model used by this software, both ambiguous characters (e.g., N
) and regions with too low of coverage are considered as unknown (or missing). So, if genomes have a large deletion in a particular region overlapping a mutation being queried, then these would all show up as having an unknown status (small deletions represented in the VCF file won't be considered as unknown). I do not know if this is the case here (further investigation is required) but this is something to keep in mind.
You can convert all those samples/genomes that have an Unknown status to be considered selected if you wish using some boolean logic on queries and select_unknown()
. In particular:
# Select everything that is "present" in query `q` OR everything that is "unknown" in query `q`
r = q.select_present() | q.select_unknown()
# Alternatively
# r = q.select_present().or_(q.select_unknown())
r
<SamplesQueryIndex[selected=51% (30/59) samples, unknown=0% (0/59) samples]>
Here, q.select_unknown()
selects only the unknown samples and q.select_present() | q.select_unknown()
means select everything that is present in q.select_present()
OR in q.select_unknown()
.
isa()
and isin()
)¶The isa()
and isin()
methods let us search for particular samples by name. The difference between the two is that:
isa()
is meant to be read "select samples that are (is a) type matching the expression.isin()
is meant to be read "select samples that are in a set defined by the passed criteria.The differences between these become more apparent for more advanced queries later on. For now, we can use these to select samples by name.
q = db.samples_query().isa('SH12-001')
q
<SamplesQueryIndex[selected=2% (1/59) samples, unknown=0% (0/59) samples]>
q = db.samples_query().isin(['SH12-001', 'SH13-001'])
q
<SamplesQueryIndex[selected=3% (2/59) samples, unknown=0% (0/59) samples]>
Queries are not limited to what mutations a sample has or by sample name. We can also use isin()
to select samples that match criteria related to a phylogenetic tree.
To do this, we must specify that our query has a tree attached to it. The example data for this tutorial does have such a tree (though it can be built on-the-fly if needed using build_tree()
, or joined to an existing tree using join_tree()
).
t = db.samples_query(universe='mutations', reference_name='NC_011083')
t
<MutationTreeSamplesQuery[selected=100% (59/59) samples, unknown=0% (0/59) samples]>
Here, the type of query is a MutationTreeSamplesQuery
which means it has a tree attached to it (derived from mutations). You can access the underlying tree with the tree
property (as an ete3 Tree object).
t.tree
# Print as newick format
#t.tree.write()
Tree node '' (0x7effb6c89d9)
You can quickly visualize the tree in-line by using the tree_styler()
method:
t.tree_styler(show_legend_type_labels=False).render(w=300)
Now that we have a phylogenetic tree, we can search using this tree with isin()
. To do this, let's search for samples within some distance from SH14-001
.
tdist = t.isin('SH14-002', kind='distance', distance=1e-7, units='substitutions/site')
tdist
<MutationTreeSamplesQuery[selected=3% (2/59) samples, unknown=0% (0/59) samples]>
This selects a subset of samples within the above distance (given in 'substitutions/site'
, you can also use units='substitutions'
).
It can be hard to see what is going on, so we can combine our query with the tree visualization using the highlight()
method to highlight the selected samples in the tree.
Note: Selecting by distance to SH14-002
, won't necessarily select genomes belonging to the same clade.
t.tree_styler(show_legend_type_labels=False).highlight(tdist).render(w=300)
Another type of query instead of distance is mrca
which selects samples that all share a particular most recent common ancestor.
tmrca = t.isin(['SH14-002', 'SH14-025'], kind='mrca')
t.tree_styler(show_legend_type_labels=False).highlight(tmrca).render(w=300)
If you wish to select by distance, but restrict yourself to some particular clade you can chain the above two queries together.
tmrca_and_dist = t.isin(['SH14-002', 'SH14-025'], kind='mrca')\
.isin('SH14-002', kind='distance', distance=1e-7, units='substitutions/site')
t.tree_styler(show_legend_type_labels=False).highlight(tmrca_and_dist).render(w=300)
So far we've been looking at only the genomics data. But often times many details insights can be derived from the associated metadata with the genomic samples. External metadata can be attached and tracked by our queries. This also gives us annother method for selecting samples using pandas selection statements (e.g., metadata['Column'] == 'value'
).
To attach external metadata we first must load it up in Python as a DataFrame (note head(3)
just means only print the first 3 rows, which avoids printing a very large table for this tutorial).
import pandas as pd
# I rename 'Sample Name' here to 'Sample Name Orig' to avoid issues later on with other columns named 'Sample Name'
metadata_df = pd.read_csv('salmonella-project/metadata.tsv.gz', sep='\t', dtype=str).rename(
{'Sample Name': 'Sample Name Orig'}, axis='columns')
metadata_df.head(3)
Sample Name Orig | Run | Assay Type | AvgSpotLen | Bases | BioProject | BioSample | BioSampleModel | Bytes | Center Name | ... | PFGE_SecondaryEnzyme_pattern | Phagetype | Platform | ReleaseDate | Serovar | SRA Study | STRAIN | sub_species | Host_disease | Host | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | SH08-001 | SRR3028792 | WGS | 429 | 354123684 | PRJNA305824 | SAMN04334683 | Pathogen.cl | 197484364 | MCGILL UNIVERSITY | ... | SHBNI.0001 | 19 | ILLUMINA | 2015-12-19T00:00:00Z | Heidelberg | SRP067504 | SH08-001 | enterica | Salmonella gastroenteritis | Homo sapiens |
1 | SH09-29 | SRR3028793 | WGS | 422 | 519366460 | PRJNA305824 | SAMN04334684 | Pathogen.cl | 288691068 | MCGILL UNIVERSITY | ... | SHBNI.0001 | 26 | ILLUMINA | 2015-12-19T00:00:00Z | Heidelberg | SRP067504 | SH09-29 | enterica | Salmonella gastroenteritis | Homo sapiens |
2 | SH10-001 | SRR3028783 | WGS | 421 | 387145160 | PRJNA305824 | SAMN04334674 | Pathogen.cl | 233911529 | MCGILL UNIVERSITY | ... | SHBNI.0001 | 19 | ILLUMINA | 2015-12-19T00:00:00Z | Heidelberg | SRP067504 | SH10-001 | enterica | Salmonella gastroenteritis | Homo sapiens |
3 rows × 40 columns
Now we can attach to our query using the join()
method:
# Setup a new query (you don't have to do this, but this makes sure the results are all as expected in the tutorial)
q = db.samples_query().hasa('hgvs_gn:NC_011083.1:siiE:p.R1263S')
# Join our query with the given data frame
q = q.join(metadata_df, sample_names_column='Sample Name Orig')
q
<DataFrameSamplesQuery[selected=29% (17/59) samples, unknown=22% (13/59) samples]>
To join we had to define a column containing the sample names. We now get back a query of type DataFrameSamplesQuery
.
We can continue using this to select samples and when we're done use the toframe()
method to dump out our selected data as a DataFrame (change to toframe(include_unknown=True)
if you wish to include unknown results).
q.hasa('hgvs:NC_011083.1:SEHA_RS17780:c.1080T>C').toframe()
Query | Sample Name | Sample ID | Status | Sample Name Orig | Run | Assay Type | AvgSpotLen | Bases | BioProject | ... | PFGE_SecondaryEnzyme_pattern | Phagetype | Platform | ReleaseDate | Serovar | SRA Study | STRAIN | sub_species | Host_disease | Host | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | hgvs_gn:NC_011083.1:siiE:p.R1263S AND datafram... | SH10-30 | 16 | Present | SH10-30 | SRR3028794 | WGS | 408 | 614636422 | PRJNA305824 | ... | SHBNI.0001 | 29 | ILLUMINA | 2015-12-19T00:00:00Z | Heidelberg | SRP067504 | SH10-30 | enterica | Salmonella gastroenteritis | Homo sapiens |
1 | hgvs_gn:NC_011083.1:siiE:p.R1263S AND datafram... | SH14-002 | 30 | Present | SH14-002 | SRR3028755 | WGS | 513 | 687874904 | PRJNA305824 | ... | SHBNI.0001 | 19 | ILLUMINA | 2015-12-19T00:00:00Z | Heidelberg | SRP067504 | SH14-002 | enterica | Salmonella gastroenteritis | Homo sapiens |
2 | hgvs_gn:NC_011083.1:siiE:p.R1263S AND datafram... | SH14-003 | 20 | Present | SH14-003 | SRR3028756 | WGS | 508 | 697302985 | PRJNA305824 | ... | SHBNI.0001 | 19 | ILLUMINA | 2015-12-19T00:00:00Z | Heidelberg | SRP067504 | SH14-003 | enterica | Salmonella gastroenteritis | Homo sapiens |
3 | hgvs_gn:NC_011083.1:siiE:p.R1263S AND datafram... | SH14-004 | 52 | Present | SH14-004 | SRR3028757 | WGS | 505 | 808762405 | PRJNA305824 | ... | SHBNI.0001 | 19 | ILLUMINA | 2015-12-19T00:00:00Z | Heidelberg | SRP067504 | SH14-004 | enterica | Salmonella gastroenteritis | Homo sapiens |
4 | hgvs_gn:NC_011083.1:siiE:p.R1263S AND datafram... | SH14-005 | 11 | Present | SH14-005 | SRR3028758 | WGS | 510 | 852322915 | PRJNA305824 | ... | SHBNI.0001 | 19 | ILLUMINA | 2015-12-19T00:00:00Z | Heidelberg | SRP067504 | SH14-005 | enterica | Salmonella gastroenteritis | Homo sapiens |
5 | hgvs_gn:NC_011083.1:siiE:p.R1263S AND datafram... | SH14-006 | 24 | Present | SH14-006 | SRR3028759 | WGS | 515 | 705385264 | PRJNA305824 | ... | SHBNI.0001 | 19 | ILLUMINA | 2015-12-19T00:00:00Z | Heidelberg | SRP067504 | SH14-006 | enterica | Salmonella gastroenteritis | Homo sapiens |
6 | hgvs_gn:NC_011083.1:siiE:p.R1263S AND datafram... | SH14-007 | 13 | Present | SH14-007 | SRR3028760 | WGS | 511 | 770747207 | PRJNA305824 | ... | SHBNI.0001 | 19 | ILLUMINA | 2015-12-19T00:00:00Z | Heidelberg | SRP067504 | SH14-007 | enterica | Salmonella gastroenteritis | Homo sapiens |
7 | hgvs_gn:NC_011083.1:siiE:p.R1263S AND datafram... | SH14-008 | 43 | Present | SH14-008 | SRR3028761 | WGS | 510 | 495519791 | PRJNA305824 | ... | SHBNI.0001 | 19 | ILLUMINA | 2015-12-19T00:00:00Z | Heidelberg | SRP067504 | SH14-008 | enterica | Salmonella gastroenteritis | Homo sapiens |
8 | hgvs_gn:NC_011083.1:siiE:p.R1263S AND datafram... | SH14-011 | 29 | Present | SH14-011 | SRR3028764 | WGS | 516 | 658814410 | PRJNA305824 | ... | SHBNI.0001 | 17 | ILLUMINA | 2015-12-19T00:00:00Z | Heidelberg | SRP067504 | SH14-011 | enterica | Salmonella gastroenteritis | Homo sapiens |
9 | hgvs_gn:NC_011083.1:siiE:p.R1263S AND datafram... | SH14-012 | 25 | Present | SH14-012 | SRR3028765 | WGS | 524 | 802527201 | PRJNA305824 | ... | SHBNI.0001 | ATHE-35 | ILLUMINA | 2015-12-19T00:00:00Z | Heidelberg | SRP067504 | SH14-012 | enterica | Salmonella gastroenteritis | Homo sapiens |
10 | hgvs_gn:NC_011083.1:siiE:p.R1263S AND datafram... | SH14-019 | 46 | Present | SH14-019 | SRR3028772 | WGS | 514 | 372480499 | PRJNA305824 | ... | SHBNI.0001 | 19 | ILLUMINA | 2015-12-19T00:00:00Z | Heidelberg | SRP067504 | SH14-019 | enterica | NaN | NaN |
11 | hgvs_gn:NC_011083.1:siiE:p.R1263S AND datafram... | SH14-021 | 8 | Present | SH14-021 | SRR3028774 | WGS | 517 | 409621309 | PRJNA305824 | ... | SHBNI.0001 | 19 | ILLUMINA | 2015-12-19T00:00:00Z | Heidelberg | SRP067504 | SH14-021 | enterica | NaN | NaN |
12 rows × 44 columns
Running toframe()
will add some additional columns to the front of the external data frame defining the genomics query information.
isa()
¶We can now use the isa()
method to select samples by values in a particular metadata column. For example, to select all samples with an Isolation_source of food
we can use:
q.isa('food', isa_column='Isolation_source', kind='dataframe').toframe()[
['Query', 'Sample Name', 'Status', 'Isolation_source']].head(3)
Query | Sample Name | Status | Isolation_source | |
---|---|---|---|---|
0 | hgvs_gn:NC_011083.1:siiE:p.R1263S AND datafram... | SH14-014 | Present | food |
1 | hgvs_gn:NC_011083.1:siiE:p.R1263S AND datafram... | SH14-019 | Present | food |
2 | hgvs_gn:NC_011083.1:siiE:p.R1263S AND datafram... | SH14-021 | Present | food |
To make it easier to write these query expressions, you can set a default column for isa()
queries when joining to a dataframe:
q = db.samples_query().join(metadata_df, sample_names_column='Sample Name Orig',
default_isa_kind='dataframe', default_isa_column='Isolation_source')
q.isa('food').toframe()[['Query', 'Sample Name', 'Status', 'Isolation_source']].head(3)
Query | Sample Name | Status | Isolation_source | |
---|---|---|---|---|
0 | dataframe(names_col=[Sample Name Orig]) AND is... | SH12-009 | Present | food |
1 | dataframe(names_col=[Sample Name Orig]) AND is... | SH12-010 | Present | food |
2 | dataframe(names_col=[Sample Name Orig]) AND is... | SH14-013 | Present | food |
You can also pass regex=True
to isa()
to query by a regex.
You can also select samples using the more powerful pandas selection expressions. You use the isin()
method for this.
For example, an alternative way to select samples where Isolation_source is food
is:
q.isin(metadata_df['Isolation_source'] == 'food', kind='dataframe').toframe()[
['Query', 'Sample Name', 'Status', 'Isolation_source']].head(3)
Query | Sample Name | Status | Isolation_source | |
---|---|---|---|---|
0 | dataframe(names_col=[Sample Name Orig]) AND is... | SH12-009 | Present | food |
1 | dataframe(names_col=[Sample Name Orig]) AND is... | SH12-010 | Present | food |
2 | dataframe(names_col=[Sample Name Orig]) AND is... | SH14-013 | Present | food |
So far we've been working with mutations (specified either as nucleotides or as amino acids). But we can also use the same query syntax to work with MLST data.
MLST (multi-locus sequence typing) data specifies genomic features as a combination of a Scheme, Locus (gene), and Allele. For example, Scheme=senterica
...
To view all MLST schemes available in the system, we can use db.mlst_schemes()
.
db.mlst_schemes()
['senterica', 'sistr_330']
This says we have two MLST schemes: senterica
and sistr_330
.
To view all MLST features for a particular scheme, we can use db.mlst_summary()
.
db.mlst_summary(scheme_name='senterica').sort_values('Count', ascending=False)
Scheme | Locus | Allele | Count | Unknown Count | Present and Unknown Count | Total | Percent | Unknown Percent | Present and Unknown Percent | |
---|---|---|---|---|---|---|---|---|---|---|
MLST Feature | ||||||||||
mlst:senterica:aroC:2 | senterica | aroC | 2 | 59 | <NA> | <NA> | 59 | 100.0 | <NA> | <NA> |
mlst:senterica:dnaN:7 | senterica | dnaN | 7 | 59 | <NA> | <NA> | 59 | 100.0 | <NA> | <NA> |
mlst:senterica:hemD:9 | senterica | hemD | 9 | 59 | <NA> | <NA> | 59 | 100.0 | <NA> | <NA> |
mlst:senterica:hisD:9 | senterica | hisD | 9 | 59 | <NA> | <NA> | 59 | 100.0 | <NA> | <NA> |
mlst:senterica:purE:5 | senterica | purE | 5 | 59 | <NA> | <NA> | 59 | 100.0 | <NA> | <NA> |
mlst:senterica:sucA:9 | senterica | sucA | 9 | 59 | <NA> | <NA> | 59 | 100.0 | <NA> | <NA> |
mlst:senterica:thrA:12 | senterica | thrA | 12 | 59 | <NA> | <NA> | 59 | 100.0 | <NA> | <NA> |
In this case, every genome has all the same classic 7-gene MLST alleles (all 100% for the Percent column).
db.mlst_summary(scheme_name='sistr_330').sort_values('Count', ascending=False)
Scheme | Locus | Allele | Count | Unknown Count | Present and Unknown Count | Total | Percent | Unknown Percent | Present and Unknown Percent | |
---|---|---|---|---|---|---|---|---|---|---|
MLST Feature | ||||||||||
mlst:sistr_330:NC_003198.1_3005:419666160 | sistr_330 | NC_003198.1_3005 | 419666160 | 59 | NaN | NaN | 59 | 100.000000 | NaN | NaN |
mlst:sistr_330:NZ_AOXE01000068.1_67:3713645811 | sistr_330 | NZ_AOXE01000068.1_67 | 3713645811 | 59 | NaN | NaN | 59 | 100.000000 | NaN | NaN |
mlst:sistr_330:NZ_AOXE01000068.1_58:93691856 | sistr_330 | NZ_AOXE01000068.1_58 | 93691856 | 59 | NaN | NaN | 59 | 100.000000 | NaN | NaN |
mlst:sistr_330:NZ_AOXE01000068.1_52:1426386553 | sistr_330 | NZ_AOXE01000068.1_52 | 1426386553 | 59 | NaN | NaN | 59 | 100.000000 | NaN | NaN |
mlst:sistr_330:NZ_AOXE01000068.1_5:4267848688 | sistr_330 | NZ_AOXE01000068.1_5 | 4267848688 | 59 | NaN | NaN | 59 | 100.000000 | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
mlst:sistr_330:NZ_AOXE01000072.1_65:2777708285 | sistr_330 | NZ_AOXE01000072.1_65 | 2777708285 | 1 | NaN | NaN | 59 | 1.694915 | NaN | NaN |
mlst:sistr_330:NZ_AOXE01000059.1_430:1260977859 | sistr_330 | NZ_AOXE01000059.1_430 | 1260977859 | 1 | NaN | NaN | 59 | 1.694915 | NaN | NaN |
mlst:sistr_330:NZ_AOXE01000059.1_433:4100983823 | sistr_330 | NZ_AOXE01000059.1_433 | 4100983823 | 1 | NaN | NaN | 59 | 1.694915 | NaN | NaN |
mlst:sistr_330:NZ_AOYX01000092.1_135:466472938 | sistr_330 | NZ_AOYX01000092.1_135 | 466472938 | 1 | NaN | NaN | 59 | 1.694915 | NaN | NaN |
mlst:sistr_330:NZ_AOXE01000034.1_119:1444220767 | sistr_330 | NZ_AOXE01000034.1_119 | 1444220767 | 1 | NaN | NaN | 59 | 1.694915 | NaN | NaN |
347 rows × 10 columns
The SISTR MLST results consist of 330 loci in the scheme, and we can see that not all genomes have the same SISTR alleles (not all alleles have 100% for Percent).
hasa()
)¶We can use hasa()
to also query for genomic samples with a specific MLST feature. We can use the MLST Feature listed in the above tables as the identifier.
q = db.samples_query().hasa('mlst:sistr_330:NZ_AOXE01000034.1_103:1944731850')
q
<SamplesQueryIndex[selected=19% (11/59) samples, unknown=0% (0/59) samples]>
The MLST feature identifier is specified in terms of mlst:[scheme]:[locus]:[allele]
(i.e., mlst:sistr_330:NZ_AOXE01000034.1_103:1944731850
).
Only 11 genomic samples have this particular MLST feature. Similar to the operations on mutations, we can use q.toframe()
to list the specific genomic samples with this MLST feature.
q.toframe()
Query | Sample Name | Sample ID | Status | |
---|---|---|---|---|
0 | mlst:sistr_330:NZ_AOXE01000034.1_103:1944731850 | SH13-001 | 9 | Present |
1 | mlst:sistr_330:NZ_AOXE01000034.1_103:1944731850 | SH10-014 | 14 | Present |
2 | mlst:sistr_330:NZ_AOXE01000034.1_103:1944731850 | SH11-002 | 23 | Present |
3 | mlst:sistr_330:NZ_AOXE01000034.1_103:1944731850 | SH10-015 | 34 | Present |
4 | mlst:sistr_330:NZ_AOXE01000034.1_103:1944731850 | SH13-004 | 37 | Present |
5 | mlst:sistr_330:NZ_AOXE01000034.1_103:1944731850 | SH13-006 | 38 | Present |
6 | mlst:sistr_330:NZ_AOXE01000034.1_103:1944731850 | SH13-007 | 41 | Present |
7 | mlst:sistr_330:NZ_AOXE01000034.1_103:1944731850 | SH13-008 | 42 | Present |
8 | mlst:sistr_330:NZ_AOXE01000034.1_103:1944731850 | SH13-003 | 45 | Present |
9 | mlst:sistr_330:NZ_AOXE01000034.1_103:1944731850 | SH13-002 | 49 | Present |
10 | mlst:sistr_330:NZ_AOXE01000034.1_103:1944731850 | SH13-005 | 59 | Present |
We can even combine this together with the previous phylogenetic tree visualization to see how this MLST feature clusters with respect to a tree constructed from mutations.
t = db.samples_query(universe='mutations', reference_name='NC_011083')
t.tree_styler(show_legend_type_labels=False, include_unknown=False)\
.highlight(t.hasa('mlst:sistr_330:NZ_AOXE01000034.1_103:1944731850'), legend_label='mlst:1944731850')\
.render()
We can use features_summary()
on the previous query to view a summary of other MLST features (possibly from other MLST schemes) in our query, or even a summary of mutations.
q = db.samples_query().hasa('mlst:sistr_330:NZ_AOXE01000034.1_103:1944731850')
q
<SamplesQueryIndex[selected=19% (11/59) samples, unknown=0% (0/59) samples]>
q.features_summary(kind='mlst', scheme='sistr_330').sort_values('Count', ascending=False)
Scheme | Locus | Allele | Count | Unknown Count | Present and Unknown Count | Total | Percent | Unknown Percent | Present and Unknown Percent | |
---|---|---|---|---|---|---|---|---|---|---|
MLST Feature | ||||||||||
mlst:sistr_330:NC_003198.1_3005:419666160 | sistr_330 | NC_003198.1_3005 | 419666160 | 11 | NaN | NaN | 11 | 100.0 | NaN | NaN |
mlst:sistr_330:NZ_AOXE01000072.1_12:2530339645 | sistr_330 | NZ_AOXE01000072.1_12 | 2530339645 | 11 | NaN | NaN | 11 | 100.0 | NaN | NaN |
mlst:sistr_330:NZ_AOXE01000072.1_100:2740047193 | sistr_330 | NZ_AOXE01000072.1_100 | 2740047193 | 11 | NaN | NaN | 11 | 100.0 | NaN | NaN |
mlst:sistr_330:NZ_AOXE01000068.1_76:2883770685 | sistr_330 | NZ_AOXE01000068.1_76 | 2883770685 | 11 | NaN | NaN | 11 | 100.0 | NaN | NaN |
mlst:sistr_330:NZ_AOXE01000068.1_72:739645922 | sistr_330 | NZ_AOXE01000068.1_72 | 739645922 | 11 | NaN | NaN | 11 | 100.0 | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
mlst:sistr_330:NZ_AOXE01000036.1_16:3783573123 | sistr_330 | NZ_AOXE01000036.1_16 | 3783573123 | 11 | NaN | NaN | 11 | 100.0 | NaN | NaN |
mlst:sistr_330:NZ_AOXE01000036.1_157:1706422672 | sistr_330 | NZ_AOXE01000036.1_157 | 1706422672 | 11 | NaN | NaN | 11 | 100.0 | NaN | NaN |
mlst:sistr_330:NZ_AOXE01000036.1_15:519139030 | sistr_330 | NZ_AOXE01000036.1_15 | 519139030 | 11 | NaN | NaN | 11 | 100.0 | NaN | NaN |
mlst:sistr_330:NZ_AOXE01000036.1_116:2832724106 | sistr_330 | NZ_AOXE01000036.1_116 | 2832724106 | 11 | NaN | NaN | 11 | 100.0 | NaN | NaN |
mlst:sistr_330:NZ_CM001471.1_3941:799094037 | sistr_330 | NZ_CM001471.1_3941 | 799094037 | 11 | NaN | NaN | 11 | 100.0 | NaN | NaN |
330 rows × 10 columns
q.features_summary(kind='mutations').sort_values('Count', ascending=False)
Sequence | Position | Deletion | Insertion | Type | Count | Unknown Count | Present and Unknown Count | Total | Percent | ... | Gene_Name | Gene_ID | Feature_Type | Transcript_BioType | HGVS.c | HGVS.p | ID_HGVS.c | ID_HGVS.p | ID_HGVS_GN.c | ID_HGVS_GN.p | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Mutation | |||||||||||||||||||||
NC_011083.1:1947794:A:G | NC_011083.1 | 1947794 | A | G | SNP | 11 | NaN | NaN | 11 | 100.000000 | ... | SEHA_RS10125 | SEHA_RS10125 | transcript | protein_coding | c.181A>G | p.T61A | hgvs:NC_011083.1:SEHA_RS10125:c.181A>G | hgvs:NC_011083.1:SEHA_RS10125:p.T61A | hgvs_gn:NC_011083.1:SEHA_RS10125:c.181A>G | hgvs_gn:NC_011083.1:SEHA_RS10125:p.T61A |
NC_011083.1:379589:C:T | NC_011083.1 | 379589 | C | T | SNP | 11 | NaN | NaN | 11 | 100.000000 | ... | SEHA_RS02285 | SEHA_RS02285 | transcript | protein_coding | c.353G>A | p.R118H | hgvs:NC_011083.1:SEHA_RS02285:c.353G>A | hgvs:NC_011083.1:SEHA_RS02285:p.R118H | hgvs_gn:NC_011083.1:SEHA_RS02285:c.353G>A | hgvs_gn:NC_011083.1:SEHA_RS02285:p.R118H |
NC_011083.1:3533472:C:T | NC_011083.1 | 3533472 | C | T | SNP | 11 | NaN | NaN | 11 | 100.000000 | ... | SEHA_RS17775 | SEHA_RS17775 | transcript | protein_coding | c.975G>A | p.V325V | hgvs:NC_011083.1:SEHA_RS17775:c.975G>A | hgvs:NC_011083.1:SEHA_RS17775:p.V325V | hgvs_gn:NC_011083.1:SEHA_RS17775:c.975G>A | hgvs_gn:NC_011083.1:SEHA_RS17775:p.V325V |
NC_011083.1:3691534:C:A | NC_011083.1 | 3691534 | C | A | SNP | 11 | NaN | NaN | 11 | 100.000000 | ... | malT | SEHA_RS18595 | transcript | protein_coding | c.1074C>A | p.S358R | hgvs:NC_011083.1:SEHA_RS18595:c.1074C>A | hgvs:NC_011083.1:SEHA_RS18595:p.S358R | hgvs_gn:NC_011083.1:malT:c.1074C>A | hgvs_gn:NC_011083.1:malT:p.S358R |
NC_011083.1:602044:T:G | NC_011083.1 | 602044 | T | G | SNP | 11 | NaN | NaN | 11 | 100.000000 | ... | SEHA_RS03410 | SEHA_RS03410 | transcript | protein_coding | c.1163T>G | p.V388G | hgvs:NC_011083.1:SEHA_RS03410:c.1163T>G | hgvs:NC_011083.1:SEHA_RS03410:p.V388G | hgvs_gn:NC_011083.1:SEHA_RS03410:c.1163T>G | hgvs_gn:NC_011083.1:SEHA_RS03410:p.V388G |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
NC_011083.1:2010696:G:T | NC_011083.1 | 2010696 | G | T | SNP | 1 | NaN | NaN | 11 | 9.090909 | ... | SEHA_RS25120-SEHA_RS10450 | SEHA_RS25120-SEHA_RS10450 | intergenic_region | <NA> | n.2010696G>T | <NA> | hgvs:NC_011083.1:n.2010696G>T | <NA> | hgvs_gn:NC_011083.1:n.2010696G>T | <NA> |
NC_011083.1:495275:G:A | NC_011083.1 | 495275 | G | A | SNP | 1 | NaN | NaN | 11 | 9.090909 | ... | sbcC | SEHA_RS02910 | transcript | protein_coding | c.1242C>T | p.A414A | hgvs:NC_011083.1:SEHA_RS02910:c.1242C>T | hgvs:NC_011083.1:SEHA_RS02910:p.A414A | hgvs_gn:NC_011083.1:sbcC:c.1242C>T | hgvs_gn:NC_011083.1:sbcC:p.A414A |
NC_011083.1:1514675:G:A | NC_011083.1 | 1514675 | G | A | SNP | 1 | NaN | NaN | 11 | 9.090909 | ... | purR-SEHA_RS08035 | SEHA_RS08030-SEHA_RS08035 | intergenic_region | <NA> | n.1514675G>A | <NA> | hgvs:NC_011083.1:n.1514675G>A | <NA> | hgvs_gn:NC_011083.1:n.1514675G>A | <NA> |
NC_011083.1:3401393:T:A | NC_011083.1 | 3401393 | T | A | SNP | 1 | NaN | NaN | 11 | 9.090909 | ... | SEHA_RS17120-SEHA_RS17125 | SEHA_RS17120-SEHA_RS17125 | intergenic_region | <NA> | n.3401393T>A | <NA> | hgvs:NC_011083.1:n.3401393T>A | <NA> | hgvs_gn:NC_011083.1:n.3401393T>A | <NA> |
NC_011083.1:1310359:T:G | NC_011083.1 | 1310359 | T | G | SNP | 1 | NaN | NaN | 11 | 9.090909 | ... | nagK | SEHA_RS06940 | transcript | protein_coding | c.175T>G | p.S59A | hgvs:NC_011083.1:SEHA_RS06940:c.175T>G | hgvs:NC_011083.1:SEHA_RS06940:p.S59A | hgvs_gn:NC_011083.1:nagK:c.175T>G | hgvs_gn:NC_011083.1:nagK:p.S59A |
158 rows × 24 columns
This lets us summarize what other genomic features (MLST or mutations) are found in the set of samples that have a mlst:sistr_330:NZ_AOXE01000034.1_103:1944731850
MLST feature.
Setting q.features_summary(..., selection='unique')
lets us narrow-down the scope of our summary to only features that are uniquely found in the set of genomic samples that have a mlst:sistr_330:NZ_AOXE01000034.1_103:1944731850
MLST feature.
q.features_summary(kind='mutations', selection='unique').sort_values('Count', ascending=False)
Sequence | Position | Deletion | Insertion | Type | Count | Unknown Count | Present and Unknown Count | Total | Percent | ... | Gene_Name | Gene_ID | Feature_Type | Transcript_BioType | HGVS.c | HGVS.p | ID_HGVS.c | ID_HGVS.p | ID_HGVS_GN.c | ID_HGVS_GN.p | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Mutation | |||||||||||||||||||||
NC_011083.1:3691534:C:A | NC_011083.1 | 3691534 | C | A | SNP | 11 | NaN | NaN | 11 | 100.000000 | ... | malT | SEHA_RS18595 | transcript | protein_coding | c.1074C>A | p.S358R | hgvs:NC_011083.1:SEHA_RS18595:c.1074C>A | hgvs:NC_011083.1:SEHA_RS18595:p.S358R | hgvs_gn:NC_011083.1:malT:c.1074C>A | hgvs_gn:NC_011083.1:malT:p.S358R |
NC_011083.1:2428696:AT:A | NC_011083.1 | 2428696 | AT | A | INDEL | 11 | NaN | NaN | 11 | 100.000000 | ... | SEHA_RS12535-nrdA | SEHA_RS12535-SEHA_RS12540 | intergenic_region | <NA> | n.2428697delT | <NA> | hgvs:NC_011083.1:n.2428697delT | <NA> | hgvs_gn:NC_011083.1:n.2428697delT | <NA> |
NC_011083.1:3796657:C:A | NC_011083.1 | 3796657 | C | A | SNP | 11 | NaN | NaN | 11 | 100.000000 | ... | SEHA_RS19050 | SEHA_RS19050 | transcript | protein_coding | c.489C>A | p.G163G | hgvs:NC_011083.1:SEHA_RS19050:c.489C>A | hgvs:NC_011083.1:SEHA_RS19050:p.G163G | hgvs_gn:NC_011083.1:SEHA_RS19050:c.489C>A | hgvs_gn:NC_011083.1:SEHA_RS19050:p.G163G |
NC_011083.1:2841095:G:A | NC_011083.1 | 2841095 | G | A | SNP | 11 | NaN | NaN | 11 | 100.000000 | ... | bapA | SEHA_RS14380 | transcript | protein_coding | c.5317G>A | p.D1773N | hgvs:NC_011083.1:SEHA_RS14380:c.5317G>A | hgvs:NC_011083.1:SEHA_RS14380:p.D1773N | hgvs_gn:NC_011083.1:bapA:c.5317G>A | hgvs_gn:NC_011083.1:bapA:p.D1773N |
NC_011083.1:2209731:A:C | NC_011083.1 | 2209731 | A | C | SNP | 11 | NaN | NaN | 11 | 100.000000 | ... | SEHA_RS11560 | SEHA_RS11560 | transcript | protein_coding | c.250T>G | p.W84G | hgvs:NC_011083.1:SEHA_RS11560:c.250T>G | hgvs:NC_011083.1:SEHA_RS11560:p.W84G | hgvs_gn:NC_011083.1:SEHA_RS11560:c.250T>G | hgvs_gn:NC_011083.1:SEHA_RS11560:p.W84G |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
NC_011083.1:3535143:A:T | NC_011083.1 | 3535143 | A | T | SNP | 1 | NaN | NaN | 11 | 9.090909 | ... | oadA | SEHA_RS17780 | transcript | protein_coding | c.1101T>A | p.I367I | hgvs:NC_011083.1:SEHA_RS17780:c.1101T>A | hgvs:NC_011083.1:SEHA_RS17780:p.I367I | hgvs_gn:NC_011083.1:oadA:c.1101T>A | hgvs_gn:NC_011083.1:oadA:p.I367I |
NC_011083.1:1310359:T:G | NC_011083.1 | 1310359 | T | G | SNP | 1 | NaN | NaN | 11 | 9.090909 | ... | nagK | SEHA_RS06940 | transcript | protein_coding | c.175T>G | p.S59A | hgvs:NC_011083.1:SEHA_RS06940:c.175T>G | hgvs:NC_011083.1:SEHA_RS06940:p.S59A | hgvs_gn:NC_011083.1:nagK:c.175T>G | hgvs_gn:NC_011083.1:nagK:p.S59A |
NC_011083.1:936248:G:A | NC_011083.1 | 936248 | G | A | SNP | 1 | NaN | NaN | 11 | 9.090909 | ... | rhlE | SEHA_RS05055 | transcript | protein_coding | c.1225G>A | p.G409R | hgvs:NC_011083.1:SEHA_RS05055:c.1225G>A | hgvs:NC_011083.1:SEHA_RS05055:p.G409R | hgvs_gn:NC_011083.1:rhlE:c.1225G>A | hgvs_gn:NC_011083.1:rhlE:p.G409R |
NC_011083.1:4536022:C:T | NC_011083.1 | 4536022 | C | T | SNP | 1 | NaN | NaN | 11 | 9.090909 | ... | adiA | SEHA_RS22440 | transcript | protein_coding | c.909G>A | p.P303P | hgvs:NC_011083.1:SEHA_RS22440:c.909G>A | hgvs:NC_011083.1:SEHA_RS22440:p.P303P | hgvs_gn:NC_011083.1:adiA:c.909G>A | hgvs_gn:NC_011083.1:adiA:p.P303P |
NC_011083.1:4800575:C:T | NC_011083.1 | 4800575 | C | T | SNP | 1 | NaN | NaN | 11 | 9.090909 | ... | SEHA_RS23755-SEHA_RS23765 | SEHA_RS23755-SEHA_RS23765 | intergenic_region | <NA> | n.4800575C>T | <NA> | hgvs:NC_011083.1:n.4800575C>T | <NA> | hgvs_gn:NC_011083.1:n.4800575C>T | <NA> |
72 rows × 24 columns
Similar to mutations_summary()
to calculate values for the Unknown Count and Present and Unknown Count columns you can pass include_unknown_sampes=True
. This might slow down the table generation for larger datasets.
q.features_summary(kind='mutations', selection='unique', include_unknown_samples=True).sort_values('Count', ascending=False)
Sequence | Position | Deletion | Insertion | Type | Count | Unknown Count | Present and Unknown Count | Total | Percent | ... | Gene_Name | Gene_ID | Feature_Type | Transcript_BioType | HGVS.c | HGVS.p | ID_HGVS.c | ID_HGVS.p | ID_HGVS_GN.c | ID_HGVS_GN.p | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Mutation | |||||||||||||||||||||
NC_011083.1:3691534:C:A | NC_011083.1 | 3691534 | C | A | SNP | 11 | 0 | 11 | 11 | 100.000000 | ... | malT | SEHA_RS18595 | transcript | protein_coding | c.1074C>A | p.S358R | hgvs:NC_011083.1:SEHA_RS18595:c.1074C>A | hgvs:NC_011083.1:SEHA_RS18595:p.S358R | hgvs_gn:NC_011083.1:malT:c.1074C>A | hgvs_gn:NC_011083.1:malT:p.S358R |
NC_011083.1:2428696:AT:A | NC_011083.1 | 2428696 | AT | A | INDEL | 11 | 0 | 11 | 11 | 100.000000 | ... | SEHA_RS12535-nrdA | SEHA_RS12535-SEHA_RS12540 | intergenic_region | <NA> | n.2428697delT | <NA> | hgvs:NC_011083.1:n.2428697delT | <NA> | hgvs_gn:NC_011083.1:n.2428697delT | <NA> |
NC_011083.1:3796657:C:A | NC_011083.1 | 3796657 | C | A | SNP | 11 | 0 | 11 | 11 | 100.000000 | ... | SEHA_RS19050 | SEHA_RS19050 | transcript | protein_coding | c.489C>A | p.G163G | hgvs:NC_011083.1:SEHA_RS19050:c.489C>A | hgvs:NC_011083.1:SEHA_RS19050:p.G163G | hgvs_gn:NC_011083.1:SEHA_RS19050:c.489C>A | hgvs_gn:NC_011083.1:SEHA_RS19050:p.G163G |
NC_011083.1:2841095:G:A | NC_011083.1 | 2841095 | G | A | SNP | 11 | 0 | 11 | 11 | 100.000000 | ... | bapA | SEHA_RS14380 | transcript | protein_coding | c.5317G>A | p.D1773N | hgvs:NC_011083.1:SEHA_RS14380:c.5317G>A | hgvs:NC_011083.1:SEHA_RS14380:p.D1773N | hgvs_gn:NC_011083.1:bapA:c.5317G>A | hgvs_gn:NC_011083.1:bapA:p.D1773N |
NC_011083.1:2209731:A:C | NC_011083.1 | 2209731 | A | C | SNP | 11 | 0 | 11 | 11 | 100.000000 | ... | SEHA_RS11560 | SEHA_RS11560 | transcript | protein_coding | c.250T>G | p.W84G | hgvs:NC_011083.1:SEHA_RS11560:c.250T>G | hgvs:NC_011083.1:SEHA_RS11560:p.W84G | hgvs_gn:NC_011083.1:SEHA_RS11560:c.250T>G | hgvs_gn:NC_011083.1:SEHA_RS11560:p.W84G |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
NC_011083.1:3535143:A:T | NC_011083.1 | 3535143 | A | T | SNP | 1 | 2 | 3 | 11 | 9.090909 | ... | oadA | SEHA_RS17780 | transcript | protein_coding | c.1101T>A | p.I367I | hgvs:NC_011083.1:SEHA_RS17780:c.1101T>A | hgvs:NC_011083.1:SEHA_RS17780:p.I367I | hgvs_gn:NC_011083.1:oadA:c.1101T>A | hgvs_gn:NC_011083.1:oadA:p.I367I |
NC_011083.1:1310359:T:G | NC_011083.1 | 1310359 | T | G | SNP | 1 | 0 | 1 | 11 | 9.090909 | ... | nagK | SEHA_RS06940 | transcript | protein_coding | c.175T>G | p.S59A | hgvs:NC_011083.1:SEHA_RS06940:c.175T>G | hgvs:NC_011083.1:SEHA_RS06940:p.S59A | hgvs_gn:NC_011083.1:nagK:c.175T>G | hgvs_gn:NC_011083.1:nagK:p.S59A |
NC_011083.1:936248:G:A | NC_011083.1 | 936248 | G | A | SNP | 1 | 0 | 1 | 11 | 9.090909 | ... | rhlE | SEHA_RS05055 | transcript | protein_coding | c.1225G>A | p.G409R | hgvs:NC_011083.1:SEHA_RS05055:c.1225G>A | hgvs:NC_011083.1:SEHA_RS05055:p.G409R | hgvs_gn:NC_011083.1:rhlE:c.1225G>A | hgvs_gn:NC_011083.1:rhlE:p.G409R |
NC_011083.1:4536022:C:T | NC_011083.1 | 4536022 | C | T | SNP | 1 | 0 | 1 | 11 | 9.090909 | ... | adiA | SEHA_RS22440 | transcript | protein_coding | c.909G>A | p.P303P | hgvs:NC_011083.1:SEHA_RS22440:c.909G>A | hgvs:NC_011083.1:SEHA_RS22440:p.P303P | hgvs_gn:NC_011083.1:adiA:c.909G>A | hgvs_gn:NC_011083.1:adiA:p.P303P |
NC_011083.1:4800575:C:T | NC_011083.1 | 4800575 | C | T | SNP | 1 | 0 | 1 | 11 | 9.090909 | ... | SEHA_RS23755-SEHA_RS23765 | SEHA_RS23755-SEHA_RS23765 | intergenic_region | <NA> | n.4800575C>T | <NA> | hgvs:NC_011083.1:n.4800575C>T | <NA> | hgvs_gn:NC_011083.1:n.4800575C>T | <NA> |
72 rows × 24 columns
So far we've seen connecting to an index (project), querying by mutations and by relationships in a tree, attaching external metadata, and working with MLST features. We can put this all together and do some basic visualiations of our selections on the tree (using the ete3
toolkit).
q = db.samples_query(universe='mutations', reference_name='NC_011083')\
.join(metadata_df, sample_names_column='Sample Name Orig')
q
<MutationTreeSamplesQuery[selected=100% (59/59) samples, unknown=0% (0/59) samples]>
q_o1 = q.isa('1', isa_column='outbreak_number', kind='dataframe')
q_o1
<MutationTreeSamplesQuery[selected=17% (10/59) samples, unknown=0% (0/59) samples]>
q_o1.features_summary(selection='unique').sort_values('Count', ascending=False).set_index('ID_HGVS_GN.p').head(8)
Sequence | Position | Deletion | Insertion | Type | Count | Unknown Count | Present and Unknown Count | Total | Percent | ... | Annotation_Impact | Gene_Name | Gene_ID | Feature_Type | Transcript_BioType | HGVS.c | HGVS.p | ID_HGVS.c | ID_HGVS.p | ID_HGVS_GN.c | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ID_HGVS_GN.p | |||||||||||||||||||||
hgvs_gn:NC_011083.1:pqiA:p.Q219Q | NC_011083.1 | 1159308 | G | A | SNP | 10 | <NA> | <NA> | 10 | 100.0 | ... | LOW | pqiA | SEHA_RS06145 | transcript | protein_coding | c.657G>A | p.Q219Q | hgvs:NC_011083.1:SEHA_RS06145:c.657G>A | hgvs:NC_011083.1:SEHA_RS06145:p.Q219Q | hgvs_gn:NC_011083.1:pqiA:c.657G>A |
hgvs_gn:NC_011083.1:ychF:p.I128V | NC_011083.1 | 1930007 | A | G | SNP | 10 | <NA> | <NA> | 10 | 100.0 | ... | MODERATE | ychF | SEHA_RS10040 | transcript | protein_coding | c.382A>G | p.I128V | hgvs:NC_011083.1:SEHA_RS10040:c.382A>G | hgvs:NC_011083.1:SEHA_RS10040:p.I128V | hgvs_gn:NC_011083.1:ychF:c.382A>G |
<NA> | NC_011083.1 | 2732743 | G | A | SNP | 10 | <NA> | <NA> | 10 | 100.0 | ... | MODIFIER | sseB-pepB | SEHA_RS13890-SEHA_RS13895 | intergenic_region | <NA> | n.2732743G>A | <NA> | hgvs:NC_011083.1:n.2732743G>A | <NA> | hgvs_gn:NC_011083.1:n.2732743G>A |
hgvs_gn:NC_011083.1:SEHA_RS00825:p.I23F | NC_011083.1 | 58804 | T | A | SNP | 10 | <NA> | <NA> | 10 | 100.0 | ... | MODERATE | SEHA_RS00825 | SEHA_RS00825 | transcript | protein_coding | c.67A>T | p.I23F | hgvs:NC_011083.1:SEHA_RS00825:c.67A>T | hgvs:NC_011083.1:SEHA_RS00825:p.I23F | hgvs_gn:NC_011083.1:SEHA_RS00825:c.67A>T |
<NA> | NC_011083.1 | 349640 | A | G | SNP | 10 | <NA> | <NA> | 10 | 100.0 | ... | MODIFIER | istB-tcfA | SEHA_RS26020-SEHA_RS02145 | intergenic_region | <NA> | n.349640A>G | <NA> | hgvs:NC_011083.1:n.349640A>G | <NA> | hgvs_gn:NC_011083.1:n.349640A>G |
hgvs_gn:NC_011083.1:SEHA_RS22485:p.S224S | NC_011083.1 | 4548496 | C | T | SNP | 10 | <NA> | <NA> | 10 | 100.0 | ... | LOW | SEHA_RS22485 | SEHA_RS22485 | transcript | protein_coding | c.672C>T | p.S224S | hgvs:NC_011083.1:SEHA_RS22485:c.672C>T | hgvs:NC_011083.1:SEHA_RS22485:p.S224S | hgvs_gn:NC_011083.1:SEHA_RS22485:c.672C>T |
hgvs_gn:NC_011083.1:SEHA_RS08400:p.P5P | NC_011083.1 | 1582551 | C | A | SNP | 10 | <NA> | <NA> | 10 | 100.0 | ... | LOW | SEHA_RS08400 | SEHA_RS08400 | transcript | protein_coding | c.15G>T | p.P5P | hgvs:NC_011083.1:SEHA_RS08400:c.15G>T | hgvs:NC_011083.1:SEHA_RS08400:p.P5P | hgvs_gn:NC_011083.1:SEHA_RS08400:c.15G>T |
<NA> | NC_011083.1 | 1572779 | CTTCAGCGCAACGCGGCTGTAAGCATTGCTGCTAAAAAGAGGCCGG... | C | INDEL | 4 | <NA> | <NA> | 10 | 40.0 | ... | HIGH | gstA&pdxY | SEHA_RS08355&SEHA_RS08350 | gene_variant | <NA> | n.1572780_1572862delTTCAGCGCAACGCGGCTGTAAGCATT... | <NA> | hgvs:NC_011083.1:n.1572780_1572862delTTCAGCGCA... | <NA> | hgvs_gn:NC_011083.1:n.1572780_1572862delTTCAGC... |
8 rows × 23 columns
Passing selection='unique'
will select only those mutations that are unique to the selected set (useful for searching for lineage-defining mutations or mutations in subsets of the selected samples).
Let's pick one of the lineage-defining mutations (i.e., a mutation uniquely found in only Outbreak 1). We can detect these by using selection='unique'
and filtering to mutations where Count
is equal to Total
(count of samples with mutation equals total samples in selection).
o1_df = q_o1.features_summary(selection='unique').sort_values(['Count', 'Position'])
o1_df[o1_df['Count'] == o1_df['Total']].set_index('ID_HGVS.p').head(3)
Sequence | Position | Deletion | Insertion | Type | Count | Unknown Count | Present and Unknown Count | Total | Percent | ... | Annotation_Impact | Gene_Name | Gene_ID | Feature_Type | Transcript_BioType | HGVS.c | HGVS.p | ID_HGVS.c | ID_HGVS_GN.c | ID_HGVS_GN.p | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ID_HGVS.p | |||||||||||||||||||||
hgvs:NC_011083.1:SEHA_RS00825:p.I23F | NC_011083.1 | 58804 | T | A | SNP | 10 | <NA> | <NA> | 10 | 100.0 | ... | MODERATE | SEHA_RS00825 | SEHA_RS00825 | transcript | protein_coding | c.67A>T | p.I23F | hgvs:NC_011083.1:SEHA_RS00825:c.67A>T | hgvs_gn:NC_011083.1:SEHA_RS00825:c.67A>T | hgvs_gn:NC_011083.1:SEHA_RS00825:p.I23F |
<NA> | NC_011083.1 | 349640 | A | G | SNP | 10 | <NA> | <NA> | 10 | 100.0 | ... | MODIFIER | istB-tcfA | SEHA_RS26020-SEHA_RS02145 | intergenic_region | <NA> | n.349640A>G | <NA> | hgvs:NC_011083.1:n.349640A>G | hgvs_gn:NC_011083.1:n.349640A>G | <NA> |
hgvs:NC_011083.1:SEHA_RS06145:p.Q219Q | NC_011083.1 | 1159308 | G | A | SNP | 10 | <NA> | <NA> | 10 | 100.0 | ... | LOW | pqiA | SEHA_RS06145 | transcript | protein_coding | c.657G>A | p.Q219Q | hgvs:NC_011083.1:SEHA_RS06145:c.657G>A | hgvs_gn:NC_011083.1:pqiA:c.657G>A | hgvs_gn:NC_011083.1:pqiA:p.Q219Q |
3 rows × 23 columns
We will highlight hgvs:NC_011083.1:SEHA_RS00825:p.I23F
(NC_011083.1:58804:T:A
) in the tree.
Let's also select a unique MLST feature to highlight.
q_o1.features_summary(kind='mlst', scheme='sistr_330',
selection='unique').sort_values('Count', ascending=False)
Scheme | Locus | Allele | Count | Unknown Count | Present and Unknown Count | Total | Percent | Unknown Percent | Present and Unknown Percent | |
---|---|---|---|---|---|---|---|---|---|---|
MLST Feature | ||||||||||
mlst:sistr_330:NZ_AOXE01000047.1_56:1179310307 | sistr_330 | NZ_AOXE01000047.1_56 | 1179310307 | 2 | <NA> | <NA> | 10 | 20.0 | <NA> | <NA> |
Here, we can see there's only one MLST feature to pick mlst:sistr_330:NZ_AOXE01000047.1_56:1179310307
that's unique to Outbreak 1 (but not unique to every sample in the outbreak). We will show this feature on the tree too.
# Use `highlight()` to highlight Outbreak 1
# Chain with other `highlight()` methods to highlight other outbreak samples
# Use `annotate()` to add a column indicating the presence/absence of '58804:T:A' mutation
# Use another `annotate()` to add a column indicating the presence/absence of the MLST feature
# Use `highlight_style` to change the color scheme of the highlights
ts = q.tree_styler(annotate_show_box_label=True, legend_nsize=30, legend_fsize=14,
show_legend_type_labels=False, legend_title='Legend')\
.highlight(q_o1, legend_label='Outbreak 1')\
.highlight(q.isa('2', isa_column='outbreak_number', kind='dataframe'), legend_label='Outbreak 2')\
.highlight(q.isa('3', isa_column='outbreak_number', kind='dataframe'), legend_label='Outbreak 3')\
.annotate(q.hasa('hgvs:NC_011083.1:SEHA_RS00825:p.I23F'), box_label='SEHA_RS00825:I23F')\
.annotate(q.hasa('mlst:sistr_330:NZ_AOXE01000047.1_56:1179310307'), box_label='mlst:1179310307')
ts.render(w=600)
This shows the
# Save this to a PDF file
file = 'output1.pdf'
x = ts.render(file)
print(f'Saved to {file}')
# (I assign to x to surpress printing text when saving in Jupyter)
Saved to output1.pdf
q = db.samples_query().join(metadata_df, sample_names_column='Sample Name Orig',
default_isa_column='outbreak_number', default_isa_kind='dataframe')
q.isa('2')
<DataFrameSamplesQuery[selected=14% (8/59) samples, unknown=0% (0/59) samples]>
q.isa('2').features_summary().sort_values('Count', ascending=False)
Sequence | Position | Deletion | Insertion | Type | Count | Unknown Count | Present and Unknown Count | Total | Percent | ... | Gene_Name | Gene_ID | Feature_Type | Transcript_BioType | HGVS.c | HGVS.p | ID_HGVS.c | ID_HGVS.p | ID_HGVS_GN.c | ID_HGVS_GN.p | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Mutation | |||||||||||||||||||||
NC_011083.1:1947794:A:G | NC_011083.1 | 1947794 | A | G | SNP | 8 | NaN | NaN | 8 | 100.0 | ... | SEHA_RS10125 | SEHA_RS10125 | transcript | protein_coding | c.181A>G | p.T61A | hgvs:NC_011083.1:SEHA_RS10125:c.181A>G | hgvs:NC_011083.1:SEHA_RS10125:p.T61A | hgvs_gn:NC_011083.1:SEHA_RS10125:c.181A>G | hgvs_gn:NC_011083.1:SEHA_RS10125:p.T61A |
NC_011083.1:3945452:G:C | NC_011083.1 | 3945452 | G | C | SNP | 8 | NaN | NaN | 8 | 100.0 | ... | ligB | SEHA_RS19715 | transcript | protein_coding | c.400C>G | p.L134V | hgvs:NC_011083.1:SEHA_RS19715:c.400C>G | hgvs:NC_011083.1:SEHA_RS19715:p.L134V | hgvs_gn:NC_011083.1:ligB:c.400C>G | hgvs_gn:NC_011083.1:ligB:p.L134V |
NC_011083.1:602044:T:G | NC_011083.1 | 602044 | T | G | SNP | 8 | NaN | NaN | 8 | 100.0 | ... | SEHA_RS03410 | SEHA_RS03410 | transcript | protein_coding | c.1163T>G | p.V388G | hgvs:NC_011083.1:SEHA_RS03410:c.1163T>G | hgvs:NC_011083.1:SEHA_RS03410:p.V388G | hgvs_gn:NC_011083.1:SEHA_RS03410:c.1163T>G | hgvs_gn:NC_011083.1:SEHA_RS03410:p.V388G |
NC_011083.1:2209731:A:C | NC_011083.1 | 2209731 | A | C | SNP | 8 | NaN | NaN | 8 | 100.0 | ... | SEHA_RS11560 | SEHA_RS11560 | transcript | protein_coding | c.250T>G | p.W84G | hgvs:NC_011083.1:SEHA_RS11560:c.250T>G | hgvs:NC_011083.1:SEHA_RS11560:p.W84G | hgvs_gn:NC_011083.1:SEHA_RS11560:c.250T>G | hgvs_gn:NC_011083.1:SEHA_RS11560:p.W84G |
NC_011083.1:2225479:G:T | NC_011083.1 | 2225479 | G | T | SNP | 8 | NaN | NaN | 8 | 100.0 | ... | rfbB | SEHA_RS11635 | transcript | protein_coding | c.721C>A | p.R241S | hgvs:NC_011083.1:SEHA_RS11635:c.721C>A | hgvs:NC_011083.1:SEHA_RS11635:p.R241S | hgvs_gn:NC_011083.1:rfbB:c.721C>A | hgvs_gn:NC_011083.1:rfbB:p.R241S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
NC_011083.1:1514675:G:A | NC_011083.1 | 1514675 | G | A | SNP | 1 | NaN | NaN | 8 | 12.5 | ... | purR-SEHA_RS08035 | SEHA_RS08030-SEHA_RS08035 | intergenic_region | <NA> | n.1514675G>A | <NA> | hgvs:NC_011083.1:n.1514675G>A | <NA> | hgvs_gn:NC_011083.1:n.1514675G>A | <NA> |
NC_011083.1:2010696:G:T | NC_011083.1 | 2010696 | G | T | SNP | 1 | NaN | NaN | 8 | 12.5 | ... | SEHA_RS25120-SEHA_RS10450 | SEHA_RS25120-SEHA_RS10450 | intergenic_region | <NA> | n.2010696G>T | <NA> | hgvs:NC_011083.1:n.2010696G>T | <NA> | hgvs_gn:NC_011083.1:n.2010696G>T | <NA> |
NC_011083.1:4800575:C:T | NC_011083.1 | 4800575 | C | T | SNP | 1 | NaN | NaN | 8 | 12.5 | ... | SEHA_RS23755-SEHA_RS23765 | SEHA_RS23755-SEHA_RS23765 | intergenic_region | <NA> | n.4800575C>T | <NA> | hgvs:NC_011083.1:n.4800575C>T | <NA> | hgvs_gn:NC_011083.1:n.4800575C>T | <NA> |
NC_011083.1:1310359:T:G | NC_011083.1 | 1310359 | T | G | SNP | 1 | NaN | NaN | 8 | 12.5 | ... | nagK | SEHA_RS06940 | transcript | protein_coding | c.175T>G | p.S59A | hgvs:NC_011083.1:SEHA_RS06940:c.175T>G | hgvs:NC_011083.1:SEHA_RS06940:p.S59A | hgvs_gn:NC_011083.1:nagK:c.175T>G | hgvs_gn:NC_011083.1:nagK:p.S59A |
NC_011083.1:1691486:G:C | NC_011083.1 | 1691486 | G | C | SNP | 1 | NaN | NaN | 8 | 12.5 | ... | SEHA_RS08920 | SEHA_RS08920 | transcript | protein_coding | c.781C>G | p.Q261E | hgvs:NC_011083.1:SEHA_RS08920:c.781C>G | hgvs:NC_011083.1:SEHA_RS08920:p.Q261E | hgvs_gn:NC_011083.1:SEHA_RS08920:c.781C>G | hgvs_gn:NC_011083.1:SEHA_RS08920:p.Q261E |
151 rows × 24 columns
2
¶q_o2_positions = q.isa('2').features_summary()\
.groupby('Position')\
.agg({'Position': 'first', 'Count': 'sum'})
q_o2_positions
Position | Count | |
---|---|---|
Position | ||
70519 | 70519 | 7 |
140658 | 140658 | 8 |
203200 | 203200 | 8 |
231665 | 231665 | 8 |
272673 | 272673 | 8 |
... | ... | ... |
4860641 | 4860641 | 8 |
4876176 | 4876176 | 8 |
4882099 | 4882099 | 8 |
4884909 | 4884909 | 8 |
4888140 | 4888140 | 8 |
150 rows × 2 columns
import matplotlib.pyplot as plt
reference_genome = db.reference_names()[0]
# I'm just showing a histogram of positions, I'm ignoring sample counts
q_o2_positions['Position'].hist(bins=100)
plt.title('Distribution of mutations for Outbreak 2', fontdict={'size': 16})
plt.xlabel(f'Position on {reference_genome} (bp)', fontdict={'size': 14})
plt.ylabel('Count of mutation positions', fontdict={'size': 14})
Text(0, 0.5, 'Count of mutation positions')
We'll compare to outbreak 1 and 3 using the same histogram.
q_o1_positions = q.isa('1').features_summary()\
.groupby('Position')\
.agg({'Position': 'first', 'Count': 'sum'})
q_o1_positions
Position | Count | |
---|---|---|
Position | ||
58804 | 58804 | 10 |
63393 | 63393 | 10 |
71448 | 71448 | 10 |
140658 | 140658 | 10 |
203200 | 203200 | 10 |
... | ... | ... |
4860641 | 4860641 | 10 |
4876176 | 4876176 | 10 |
4882099 | 4882099 | 10 |
4884909 | 4884909 | 10 |
4888140 | 4888140 | 10 |
137 rows × 2 columns
q_o3_positions = q.isa('3').features_summary()\
.groupby('Position')\
.agg({'Position': 'first', 'Count': 'sum'})
q_o3_positions
Position | Count | |
---|---|---|
Position | ||
140658 | 140658 | 28 |
203200 | 203200 | 28 |
231665 | 231665 | 28 |
298472 | 298472 | 28 |
298943 | 298943 | 26 |
... | ... | ... |
4860641 | 4860641 | 28 |
4876176 | 4876176 | 28 |
4882099 | 4882099 | 28 |
4884909 | 4884909 | 28 |
4888140 | 4888140 | 28 |
149 rows × 2 columns
o1_positions_list = q_o1_positions['Position'].tolist()
o2_positions_list = q_o2_positions['Position'].tolist()
o3_positions_list = q_o3_positions['Position'].tolist()
data = [o1_positions_list, o2_positions_list, o3_positions_list]
labels = ['Outbreak 1', 'Outbreak 2', 'Outbreak 3']
colors = ['#a6cee3', '#1f78b4', '#b2df8a']
# Create histogram
plt.figure(figsize=(15,6))
plt.hist(data,
label=labels, color=colors, edgecolor='black',
bins=25)
plt.legend(prop={'size': 14})
plt.title('Distribution of mutations for Outbreaks 1,2, and 3', fontdict={'size': 16})
plt.xlabel(f'Position on {reference_genome} (bp)', fontdict={'size': 14})
plt.ylabel('Count of mutation positions', fontdict={'size': 14})
Text(0, 0.5, 'Count of mutation positions')
Try replacing features_summary()
with features_summary(selection='unique')
. This will change the plot from a distribution of all mutations in each outbreak to only a distribution of the mutations uniquely found in each outbreak.
You've made it to the end. You are amazing 😀🥳. Way to go. I hope you enjoyed the tutorial.