In [1]:
import sys
sys.version_info
Out[1]:
sys.version_info(major=3, minor=4, micro=2, releaselevel='final', serial=0)

petl Case Study 1 - Comparing Tables

This case study illustrates the use of the petl package for doing some simple profiling and comparison of data from two tables.

Introduction

The files used in this case study can be downloaded from the following link:

Download and unzip the files:

In [2]:
!wget http://aliman.s3.amazonaws.com/petl/petl-case-study-1-files.zip
!unzip -o petl-case-study-1-files.zip
--2015-01-19 17:37:39--  http://aliman.s3.amazonaws.com/petl/petl-case-study-1-files.zip
Resolving aliman.s3.amazonaws.com (aliman.s3.amazonaws.com)... 54.231.9.241
Connecting to aliman.s3.amazonaws.com (aliman.s3.amazonaws.com)|54.231.9.241|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3076773 (2.9M) [application/zip]
Saving to: ‘petl-case-study-1-files.zip’

100%[======================================>] 3,076,773   2.34MB/s   in 1.3s   

2015-01-19 17:37:41 (2.34 MB/s) - ‘petl-case-study-1-files.zip’ saved [3076773/3076773]

Archive:  petl-case-study-1-files.zip
  inflating: popdata.csv             
  inflating: snpdata.csv             

The first file (snpdata.csv) contains a list of locations in the genome of the malaria parasite P. falciparum, along with some basic data about genetic variations found at those locations.

The second file (popdata.csv) is supposed to contain the same list of genome locations, along with some additional data such as allele frequencies in different populations.

The main point for this case study is that the first file (snpdata.csv) contains the canonical list of genome locations, and the second file (popdata.csv) contains some additional data about the same genome locations and therefore should be consistent with the first file. We want to check whether this second file is in fact consistent with the first file.

Preparing the data

Start by importing the petl package:

In [3]:
import petl as etl
etl.__version__
Out[3]:
'1.0.0'

To save some typing, let a be the table of data extracted from the first file (snpdata.csv), and let b be the table of data extracted from the second file (popdata.csv), using the fromcsv() function:

In [4]:
a = etl.fromtsv('snpdata.csv')
b = etl.fromtsv('popdata.csv')

Examine the header from each file:

In [5]:
a.header()
Out[5]:
('Chr',
 'Pos',
 'Ref',
 'Nref',
 'Der',
 'Mut',
 'isTypable',
 'GeneId',
 'GeneAlias',
 'GeneDescr')
In [6]:
b.header()
Out[6]:
('Chromosome',
 'Coordinates',
 'Ref. Allele',
 'Non-Ref. Allele',
 'Outgroup Allele',
 'Ancestral Allele',
 'Derived Allele',
 'Ref. Aminoacid',
 'Non-Ref. Aminoacid',
 'Private Allele',
 'Private population',
 'maf AFR',
 'maf PNG',
 'maf SEA',
 'daf AFR',
 'daf PNG',
 'daf SEA',
 'nraf AFR',
 'nraf PNG',
 'nraf SEA',
 'Mutation type',
 'Gene',
 'Gene Aliases',
 'Gene Description',
 'Gene Information')

There is a common set of 9 fields that is present in both tables, and we would like focus on comparing these common fields, however different field names have been used in the two files. To simplify comparison, use rename() to rename some fields in the second file:

In [7]:
b_renamed = b.rename({'Chromosome': 'Chr', 
                      'Coordinates': 'Pos', 
                      'Ref. Allele': 'Ref', 
                      'Non-Ref. Allele': 'Nref', 
                      'Derived Allele': 'Der', 
                      'Mutation type': 'Mut', 
                      'Gene': 'GeneId', 
                      'Gene Aliases': 'GeneAlias', 
                      'Gene Description': 'GeneDescr'})
b_renamed.header()
Out[7]:
('Chr',
 'Pos',
 'Ref',
 'Nref',
 'Outgroup Allele',
 'Ancestral Allele',
 'Der',
 'Ref. Aminoacid',
 'Non-Ref. Aminoacid',
 'Private Allele',
 'Private population',
 'maf AFR',
 'maf PNG',
 'maf SEA',
 'daf AFR',
 'daf PNG',
 'daf SEA',
 'nraf AFR',
 'nraf PNG',
 'nraf SEA',
 'Mut',
 'GeneId',
 'GeneAlias',
 'GeneDescr',
 'Gene Information')

Use cut() to extract only the fields we're interested in from both tables:

In [8]:
common_fields = ['Chr', 'Pos', 'Ref', 'Nref', 'Der', 'Mut', 'GeneId', 'GeneAlias', 'GeneDescr']
a_common = a.cut(common_fields)
b_common = b_renamed.cut(common_fields)

Inspect the data:

In [9]:
a_common
Out[9]:
Chr Pos Ref Nref Der Mut GeneId GeneAlias GeneDescr
MAL1 91099 G A - S PFA0095c MAL1P1.10 rifin
MAL1 91104 A T - N PFA0095c MAL1P1.10 rifin
MAL1 93363 T A - N PFA0100c MAL1P1.11 hypothetical protein, conserved in P. falciparum
MAL1 93382 T G - N PFA0100c MAL1P1.11 hypothetical protein, conserved in P. falciparum
MAL1 93384 G A - N PFA0100c MAL1P1.11 hypothetical protein, conserved in P. falciparum

...

In [10]:
b_common
Out[10]:
Chr Pos Ref Nref Der Mut GeneId GeneAlias GeneDescr
MAL1 91099 G A - SYN PFA0095c MAL1P1.10,RIF rifin
MAL1 91104 A T - NON PFA0095c MAL1P1.10,RIF rifin
MAL1 93363 T A - NON PFA0100c MAL1P1.11 Plasmodium exported protein (PHISTa), unknown function
MAL1 93382 T G - NON PFA0100c MAL1P1.11 Plasmodium exported protein (PHISTa), unknown function
MAL1 93384 G A - NON PFA0100c MAL1P1.11 Plasmodium exported protein (PHISTa), unknown function

...

The fromucsv() function does not attempt to parse any of the values from the underlying CSV file, so all values are reported as strings:

In [11]:
b_common.display(vrepr=repr)
Chr Pos Ref Nref Der Mut GeneId GeneAlias GeneDescr
'MAL1' '91099' 'G' 'A' '-' 'SYN' 'PFA0095c' 'MAL1P1.10,RIF' 'rifin'
'MAL1' '91104' 'A' 'T' '-' 'NON' 'PFA0095c' 'MAL1P1.10,RIF' 'rifin'
'MAL1' '93363' 'T' 'A' '-' 'NON' 'PFA0100c' 'MAL1P1.11' 'Plasmodium exported protein (PHISTa), unknown function'
'MAL1' '93382' 'T' 'G' '-' 'NON' 'PFA0100c' 'MAL1P1.11' 'Plasmodium exported protein (PHISTa), unknown function'
'MAL1' '93384' 'G' 'A' '-' 'NON' 'PFA0100c' 'MAL1P1.11' 'Plasmodium exported protein (PHISTa), unknown function'

...

However, the 'Pos' field should be interpreted as an integer.

Also, the 'Mut' field has a different representation in the two tables, which needs to be converted before the data can be compared:

In [12]:
a_common.valuecounts('Mut')
Out[12]:
Mut count frequency
N 71162 0.6865804123611875
S 31535 0.30425386166507473
- 950 0.009165725973737783
In [13]:
b_common.valuecounts('Mut')
Out[13]:
Mut count frequency
NON 70880 0.6840510336042
SYN 32738 0.31594896639579995

Use the convert() function to convert the type of the 'Pos' field in both tables and the representation of the 'Mut' field in table b:

In [14]:
a_conv = a_common.convert('Pos', int)
b_conv = (
    b_common
    .convert('Pos', int)
    .convert('Mut', {'SYN': 'S', 'NON': 'N'})
)
In [15]:
highlight = 'background-color: yellow'
a_conv.display(caption='a', vrepr=repr, td_styles={'Pos': highlight})
a
Chr Pos Ref Nref Der Mut GeneId GeneAlias GeneDescr
'MAL1' 91099 'G' 'A' '-' 'S' 'PFA0095c' 'MAL1P1.10' 'rifin'
'MAL1' 91104 'A' 'T' '-' 'N' 'PFA0095c' 'MAL1P1.10' 'rifin'
'MAL1' 93363 'T' 'A' '-' 'N' 'PFA0100c' 'MAL1P1.11' 'hypothetical protein, conserved in P. falciparum'
'MAL1' 93382 'T' 'G' '-' 'N' 'PFA0100c' 'MAL1P1.11' 'hypothetical protein, conserved in P. falciparum'
'MAL1' 93384 'G' 'A' '-' 'N' 'PFA0100c' 'MAL1P1.11' 'hypothetical protein, conserved in P. falciparum'

...

In [16]:
b_conv.display(caption='b', vrepr=repr, td_styles={'Pos': highlight, 'Mut': highlight})
b
Chr Pos Ref Nref Der Mut GeneId GeneAlias GeneDescr
'MAL1' 91099 'G' 'A' '-' 'S' 'PFA0095c' 'MAL1P1.10,RIF' 'rifin'
'MAL1' 91104 'A' 'T' '-' 'N' 'PFA0095c' 'MAL1P1.10,RIF' 'rifin'
'MAL1' 93363 'T' 'A' '-' 'N' 'PFA0100c' 'MAL1P1.11' 'Plasmodium exported protein (PHISTa), unknown function'
'MAL1' 93382 'T' 'G' '-' 'N' 'PFA0100c' 'MAL1P1.11' 'Plasmodium exported protein (PHISTa), unknown function'
'MAL1' 93384 'G' 'A' '-' 'N' 'PFA0100c' 'MAL1P1.11' 'Plasmodium exported protein (PHISTa), unknown function'

...

Now the tables are ready for comparison.

Looking for missing or unexpected rows

Because both tables should contain the same list of genome locations, they should have the same number of rows. Use nrows() to compare:

In [17]:
a_conv.nrows()
Out[17]:
103647
In [18]:
b_conv.nrows()
Out[18]:
103618

There is some discrepancy. First investigate by comparing just the genomic locations, defined by the 'Chr' and 'Pos' fields, using complement():

In [19]:
a_locs = a_conv.cut('Chr', 'Pos')
b_locs = b_conv.cut('Chr', 'Pos')
locs_only_in_a = a_locs.complement(b_locs)
locs_only_in_a.nrows()
Out[19]:
29
In [20]:
locs_only_in_a.displayall(caption='a only')
a only
Chr Pos
MAL1 216961
MAL10 538210
MAL10 548779
MAL10 1432969
MAL11 500289
MAL11 1119809
MAL11 1278859
MAL12 51827
MAL13 183727
MAL13 398404
MAL13 627342
MAL13 1216664
MAL13 2750149
MAL14 1991758
MAL14 2297918
MAL14 2372268
MAL14 2994810
MAL2 38577
MAL2 64017
MAL4 1094258
MAL5 1335335
MAL5 1338718
MAL7 670602
MAL7 690509
MAL8 489937
MAL9 416116
MAL9 868677
MAL9 1201970
MAL9 1475245
In [21]:
locs_only_in_b = b_locs.complement(a_locs)
locs_only_in_b.nrows()
Out[21]:
0

So it appears that 29 locations are missing from table b. Export these missing locations to a CSV file using toucsv():

In [22]:
locs_only_in_a.tocsv('missing_locations.csv')

An alternative method for finding rows in one table where some key value is not present in another table is to use the antijoin() function:

In [23]:
locs_only_in_a = a_conv.antijoin(b_conv, key=('Chr', 'Pos'))
locs_only_in_a.nrows()
Out[23]:
29

Finding conflicts

We'd also like to compare the values given in the other fields, to find any discrepancies between the two tables.

The simplest way to find conflicts is to merge() both tables under a given key:

In [24]:
ab_merge = etl.merge(a_conv, b_conv, key=('Chr', 'Pos'))
ab_merge.display(caption='ab_merge', 
                 td_styles=lambda v: highlight if isinstance(v, etl.Conflict) else '')
ab_merge
Chr Pos Ref Nref Der Mut GeneId GeneAlias GeneDescr
MAL1 91099 G A - S PFA0095c Conflict({'MAL1P1.10', 'MAL1P1.10,RIF'}) rifin
MAL1 91104 A T - N PFA0095c Conflict({'MAL1P1.10', 'MAL1P1.10,RIF'}) rifin
MAL1 93363 T A - N PFA0100c MAL1P1.11 Conflict({'Plasmodium exported protein (PHISTa), unknown function', 'hypothetical protein, conserved in P. falciparum'})
MAL1 93382 T G - N PFA0100c MAL1P1.11 Conflict({'Plasmodium exported protein (PHISTa), unknown function', 'hypothetical protein, conserved in P. falciparum'})
MAL1 93384 G A - N PFA0100c MAL1P1.11 Conflict({'Plasmodium exported protein (PHISTa), unknown function', 'hypothetical protein, conserved in P. falciparum'})

...

From a glance at the conflicts above, it appears there are discrepancies in the 'GeneAlias' and 'GeneDescr' fields. There may also be conflicts in other fields, so we need to investigate further.

Note that the table ab_merge will contain all rows, not only those containing conflicts. To find only conflicting rows, use cat() then conflicts(), e.g.:

In [25]:
ab = etl.cat(a_conv.addfield('source', 'a', index=0), 
             b_conv.addfield('source', 'b', index=0))
ab_conflicts = ab.conflicts(key=('Chr', 'Pos'), exclude='source')
ab_conflicts.display(10)
source Chr Pos Ref Nref Der Mut GeneId GeneAlias GeneDescr
a MAL1 91099 G A - S PFA0095c MAL1P1.10 rifin
b MAL1 91099 G A - S PFA0095c MAL1P1.10,RIF rifin
a MAL1 91104 A T - N PFA0095c MAL1P1.10 rifin
b MAL1 91104 A T - N PFA0095c MAL1P1.10,RIF rifin
a MAL1 93363 T A - N PFA0100c MAL1P1.11 hypothetical protein, conserved in P. falciparum
b MAL1 93363 T A - N PFA0100c MAL1P1.11 Plasmodium exported protein (PHISTa), unknown function
a MAL1 93382 T G - N PFA0100c MAL1P1.11 hypothetical protein, conserved in P. falciparum
b MAL1 93382 T G - N PFA0100c MAL1P1.11 Plasmodium exported protein (PHISTa), unknown function
a MAL1 93384 G A - N PFA0100c MAL1P1.11 hypothetical protein, conserved in P. falciparum
b MAL1 93384 G A - N PFA0100c MAL1P1.11 Plasmodium exported protein (PHISTa), unknown function

...

Finally, let's find conflicts in a specific field:

In [26]:
ab_conflicts_mut = ab.conflicts(key=('Chr', 'Pos'), include='Mut')
ab_conflicts_mut.display(10, caption='Mut conflicts', td_styles={'Mut': highlight})
Mut conflicts
source Chr Pos Ref Nref Der Mut GeneId GeneAlias GeneDescr
a MAL1 99099 G T - - PFA0110w MAL1P1.13,Pf155 ring-infected erythrocyte surface antigen
b MAL1 99099 G T - N PFA0110w MAL1P1.13,Pf155,RESA ring-infected erythrocyte surface antigen
a MAL1 99211 C T - - PFA0110w MAL1P1.13,Pf155 ring-infected erythrocyte surface antigen
b MAL1 99211 C T - N PFA0110w MAL1P1.13,Pf155,RESA ring-infected erythrocyte surface antigen
a MAL1 197903 C A A S PFA0220w MAL1P1.34b ubiquitin carboxyl-terminal hydrolase, putative
b MAL1 197903 C A A N PFA0220w PFA0215w,MAL1P1.34b ubiquitin carboxyl-terminal hydrolase, putative
a MAL1 384429 C T - N PFA0485w MAL1P2.26 dolichol kinase
b MAL1 384429 C T - S - - -
a MAL1 513268 A G - N PFA0650w MAL1P3.12,MAL1P3.12a,PFA0655w surface-associated interspersed gene pseudogene, (SURFIN) pseudogene
b MAL1 513268 A G - S PFA0650w MAL1P3.12,PFA0655,MAL1P3.12a,3D7surf1.2,PFA0655w,MAL1P12a surface-associated interspersed gene (SURFIN), pseudogene

...

In [27]:
ab_conflicts_mut.nrows()
Out[27]:
3592

For more information about the petl package see the petl online documentation.