Database Programming¶

Due to the sheer size of many modern datasets, statistical applications can quickly outstrip the capabilities of unstructured data stores, such as spreadsheets and text files. So, it is useful to be able to work with databases for both the retrieval of information when it is required for analysis and for the storage of processed data and analytic outputs.

Relational database management systems (RDBMS) are used to store user-defined records in large tables. Each table has columns (also known as fields) which describe the data, and rows (also known as records) which contain the data.

Unlike when we are using a spreadsheet, where we put formulas into cells to calculate values, when we are using a database, we send queries to the database management system to manipulate the database for us. The database manager can process complex query commands that combine data from multiple tables to generate reports and data summaries.

Every database manager—Oracle, IBM DB2, PostgreSQL, MySQL, Microsoft Access, and SQLite—stores data in a different way, so a database created with one cannot be used directly by another. However, every database manager can import and export data in a variety of formats, so it is possible to move information from one to another.

Writing Python code to access databases is made easier by the presence of the Python Database API (commonly referred to as the DB-API). This API fits nicely into existing Python code and allows programmers to easily store and retrieve data from databases.

The Python DB-API specifies a standard way to connect to databases and issue commands to them, and provides a common starting point for anyone writing code to interact with databases. There are several Python modules that extend the DB-API with different levels of abstraction. They range from simple result set wrappers to full blown object relational mappers. The simple wrappers typically wrap the results of database operations in more Pythonic data structures like dictionaries, while object relational mappers graft object interfaces onto the database, and model tables as Python classes. This allows Python programmers to (largely) distance themselves from writing SQL statements.

Client-server model¶

Databases are typically based on a client-server model. Just a web server provides content to multiple website visitors that request and render pages, databases return query results to one or more clients.

The model consists of one server instance and many client instances.

rdbms

Server:

accepts connections from clients
processes requests and provide results

Client:

connects to server
makes requests and receives results from server

Popular database servers include: Oracle, MySQL, Microsoft SQL, SQLite

The database interaction paradigm consists of multiple steps:

Connect to database server
Issue request
Fetch result
Repeat 2,3 as necessary
Disconnect from server

The SQL Interface¶

Databases can be accessed from Python through the interface described by the Python DB-API. Different brands of database will have slightly different implementations, and may not completely conform to the DB-API, but most are quite similar.

Irrespective of the database system you are using, there are three core concepts to database access in Python:

Connection objects

: This is a connection to a database, which provides specific implementation details and access control.

Cursor objects

: A cursor is a means for traversing the database, and governs the execution of SQL statements and the return of results.

Query results

: The output from database queries in Python is always a sequence of sequences, representing database tables of rows.

Database queries are sent via the cursor as strings of SQL. Thus, productive use of databases requires some mastery of the SQL language, and we will teach the basics of SQL here.

SQLite¶

A good entry point is the SQLite database; SQLite is a public-domain software package that provides a relational database management system.

SQLite is defined by the following features:

Serverless

: SQLite does not require a separate server process or system to operate. The SQLite library accesses its storage files directly.

Zero Configuration

: No server means no setup. Creating an SQLite database instance is as easy as opening a file.

Cross-Platform

: The entire database instance resides in a single cross-platform file, requiring no administration.

Self-Contained

: A single library contains the entire database system, which integrates directly into a host application.

Small Runtime Footprint

: The default build is less than a megabyte of code and requires only a few megabytes of memory.

Transactional

: SQLite transactions are fully ACID-compliant, allowing safe access from multiple processes or threads.

Unlike most RDBMS products, SQLite does not have a client/server architecture. Most large-scale database systems have a large server package that makes up the database engine. The database server often consists of multiple processes that work in concert to manage client connections, file I/O, caches, query optimization, and query processing. A database instance typically consists of a large number of files organized into one or more directory trees on the server filesystem. In order to access the database, all of the files must be present and correct. This can make it somewhat difficult to move or reliably back up a database instance. All of these components require resources and support from the host computer.

sqlite arch

In contrast, SQLite has no separate server. The entire database engine is integrated into whatever application needs to access a database. The only shared resource among applications is the single database file as it sits on disk. If you need to move or back up the database, you can simply copy the file. Unlike a traditional RDBMS server that requires advanced multitasking and high-performance inter-process communication, SQLite requires little more than the ability to read and write to some type of storage.

SQLite doesn't make sense for scenarios where there are a high number of transactions, a very large quantity of data, or multiple users requiring access control. Its suitability depends on the degree to which trading off simplicity and portability with security and robustness makes sense.

The Python Standard Library includes a module called sqlite3 intended for working with this database. This module is a SQL interface compliant with the DB-API 2.0 specification.

Running SQLite¶

The sqlite3 command line tool accepts SQL commands from an interactive prompt and passes those commands to the SQLite core for processing. To get started, just run the sqlite3 command; If you provide a filename (such as our sample database survey.db), sqlite3 will open (or create) that file. If no filename is given, sqlite3 will automatically open an unnamed temporary database.

terminal

The sqlite> prompt means sqlite3 is ready to accept commands, similar to >>> in Python. We can issue some basic expressions:

interaction

The first line is a simple SQL expression that returns a sum and an integer. All SQL commands in the command line tool must end with semicolons, which indicate the command is complete. We will explore the SELECT statement fully later on.

In addition to processing SQL statements, there is a series of shell-specific commands. These are sometimes referred to as dot-commands because they start with a period. Dot-commands control the shell’s output formatting, and also provide a number of utility features.

In the above example, .databases lists the names and files of attached databases, while .tables lists the tables in the attached databases. Finally, we exited with .quit.

Running SQLite from Python¶

Relational databases are more useful to us when integrated with our analytic tools. The SQLite3 can be integrated with Python using the sqlite3 module which was written by Gerhard Haring. It provides an SQL interface compliant with the DB-API 2.0 specification. You do not need to install this module separately because it is part of Python's core set of modules.

The connect function is used to establish a connection to a new or existing database. Since we're using SQLite, all we need to specify is the name of the database file. Other systems may require us to provide a username and password as well.

Let's create a new database.

In [1]:

import sqlite3

con = sqlite3.connect('microbiome.db')
con

Out[1]:

<sqlite3.Connection at 0x103a13940>

In order to communicate with the database, we need to create a cursor object. A database cursor is a control structure that enables traversal over the records in a database. Cursors facilitate subsequent processing in conjunction with the traversal, such as retrieval, addition and removal of database records. Just like the cursor in an editor, its role is to keep track of where we are in the database.

In [2]:

cur = con.cursor()   

In [3]:

cur

Out[3]:

<sqlite3.Cursor at 0x1045dd810>

The cursor's execute method provides the means for passing queries to the database. Let's create a table.

In [4]:

cur.execute('create table samples (Taxon text, Patient integer, NEC integer, Tissue integer, Stool integer)')

Out[4]:

<sqlite3.Cursor at 0x1045dd810>

It's our job to make sure that SQL is properly formatted; if it isn't, or if something goes wrong when it is being executed, the database will report an error.

In [5]:

con.commit()

The commit function completes the transaction.

Now to add some data to the empty table. We will now pass an INSERT query to the database, with some new data.

In [6]:

cur.execute('insert into samples values (\'Bacteroidetes\',14,1,102,33)')

Out[6]:

<sqlite3.Cursor at 0x1045dd810>

In [7]:

con.commit()

Of course, it would be painstaking to construct a new query for each new observation that we intend to store in SQLite. Fortunately, data can be added as a set; for example, here are some microbiome sample from high-throughput RNA 16S sequencing.

In [8]:

samples = [('Firmicutes',1,0,136,4182),
('Firmicutes',2,1,1174,703),
('Firmicutes',3,0,408,3946),
('Firmicutes',4,1,831,8605),
('Firmicutes',5,0,693,50),
('Firmicutes',6,1,718,717),
('Proteobacteria',8,1,2651,767),
('Proteobacteria',9,0,1195,76),
('Proteobacteria',10,1,6857,795),
('Proteobacteria',11,0,483,666),
('Proteobacteria',12,1,2950,3994),
('Actinobacteria',11,0,42,75),
('Actinobacteria',12,1,109,59),
('Actinobacteria',13,0,51,183),
('Actinobacteria',14,1,310,204),
('Bacteroidetes',1,0,67,0),
('Bacteroidetes',2,1,0,0),
('Bacteroidetes',3,0,85,5),
('Bacteroidetes',4,1,143,7)]

The executemany function will allow us to add them as a group, iterating over the list of tuples.

In [9]:

cur.executemany('insert into samples values (?,?,?,?,?)', samples)

Out[9]:

<sqlite3.Cursor at 0x1045dd810>

Let's have a look at what is in the table now. For this we will execute a SELECT command. This returns the cursor, which contains an iterable data structure.

In [10]:

results = cur.execute('select * from samples')
results

Out[10]:

<sqlite3.Cursor at 0x1045dd810>

In [11]:

results.fetchall()

Out[11]:

[(u'Bacteroidetes', 14, 1, 102, 33),
 (u'Firmicutes', 1, 0, 136, 4182),
 (u'Firmicutes', 2, 1, 1174, 703),
 (u'Firmicutes', 3, 0, 408, 3946),
 (u'Firmicutes', 4, 1, 831, 8605),
 (u'Firmicutes', 5, 0, 693, 50),
 (u'Firmicutes', 6, 1, 718, 717),
 (u'Proteobacteria', 8, 1, 2651, 767),
 (u'Proteobacteria', 9, 0, 1195, 76),
 (u'Proteobacteria', 10, 1, 6857, 795),
 (u'Proteobacteria', 11, 0, 483, 666),
 (u'Proteobacteria', 12, 1, 2950, 3994),
 (u'Actinobacteria', 11, 0, 42, 75),
 (u'Actinobacteria', 12, 1, 109, 59),
 (u'Actinobacteria', 13, 0, 51, 183),
 (u'Actinobacteria', 14, 1, 310, 204),
 (u'Bacteroidetes', 1, 0, 67, 0),
 (u'Bacteroidetes', 2, 1, 0, 0),
 (u'Bacteroidetes', 3, 0, 85, 5),
 (u'Bacteroidetes', 4, 1, 143, 7)]

In [12]:

for row in results:
    print(row)

Finally, we need to close our cursor and our connection, since the database can only keep a limited number of these open at one time. Since establishing a connection takes time, though, we shouldn't open a connection, do one operation, then close the connection, only to reopen it a few microseconds later to do another operation. Instead, it's normal to create one connection that stays open for the lifetime of the program.

In [13]:

cur.close()
con.close()

IPython SQL Magic¶

Catherine Devlin has created a SQL %magic function for IPython that allows for more direct querying of SQLite and other databases.

Go ahead and install the extension using pip:

In [14]:

!pip install ipython-sql

Downloading/unpacking ipython-sql
  Downloading ipython-sql-0.3.4.tar.gz
  Running setup.py (path:/private/var/folders/z_/kgsm76k94s7_jhnkfydfnd380000gn/T/pip_build_fonnescj/ipython-sql/setup.py) egg_info for package ipython-sql
    
Downloading/unpacking prettytable (from ipython-sql)
  Downloading prettytable-0.7.2.tar.bz2
  Running setup.py (path:/private/var/folders/z_/kgsm76k94s7_jhnkfydfnd380000gn/T/pip_build_fonnescj/prettytable/setup.py) egg_info for package prettytable
    
Requirement already satisfied (use --upgrade to upgrade): ipython>=1.0 in /Users/fonnescj/anaconda/lib/python2.7/site-packages (from ipython-sql)
Requirement already satisfied (use --upgrade to upgrade): sqlalchemy>=0.6.7 in /Users/fonnescj/anaconda/lib/python2.7/site-packages (from ipython-sql)
Downloading/unpacking sqlparse (from ipython-sql)
  Downloading sqlparse-0.1.13.tar.gz (54kB): 54kB downloaded
  Running setup.py (path:/private/var/folders/z_/kgsm76k94s7_jhnkfydfnd380000gn/T/pip_build_fonnescj/sqlparse/setup.py) egg_info for package sqlparse
    
Requirement already satisfied (use --upgrade to upgrade): six in /Users/fonnescj/anaconda/lib/python2.7/site-packages (from ipython-sql)
Installing collected packages: ipython-sql, prettytable, sqlparse
  Running setup.py install for ipython-sql
    
  Running setup.py install for prettytable
    
  Running setup.py install for sqlparse
    changing mode of build/scripts-2.7/sqlformat from 644 to 755
    
    changing mode of /Users/fonnescj/anaconda/bin/sqlformat to 755
Successfully installed ipython-sql prettytable sqlparse
Cleaning up...

After loading the extension in the current session, it is easy to connect to databases and execute queries, without having to use the sqlite3 module at all.

In [15]:

%load_ext sql

In [16]:

%sql sqlite:///microbiome.db

Out[16]:

u'Connected: None@microbiome.db'

In [17]:

%%sql
INSERT INTO samples VALUES ('Actinobacteria',7,0,260,58);

1 rows affected.

Out[17]:

[]

In [18]:

writer_results = %sql select * from samples where NEC==1;

Done.

In [19]:

writer_results

Out[19]:

Taxon	Patient	NEC	Tissue	Stool
Bacteroidetes	14	1	102	33

In [20]:

writer_results.DataFrame()

Out[20]:

	Taxon	Patient	NEC	Tissue	Stool
0	Bacteroidetes	14	1	102	33

This functionality is better for interactive computing than the sqlite3 mdoule, so we will be using it for most of the remainder of the section.

Selecting Data¶

In the late 1920s and early 1930s, William Dyer, Frank Pabodie, and Valentina Roerich led expeditions to the Pole of Inaccessibility in the South Pacific, and then onward to Antarctica. Two years ago, their expeditions were found in a storage locker at Miskatonic University. We have scanned and OCR'd the data they contain, and we now want to store that information in a way that will make search and analysis easy.

To load this database, we pass its URI to the SQL magic:

In [21]:

%sql sqlite:///../data/survey.db

Out[21]:

u'Connected: None@../data/survey.db'

Person: people who took readings.

ident	personal	family
dyer	William	Dyer
pb	Frank	Pabodie
lake	Anderson	Lake
roe	Valentina	Roerich
danforth	Frank	Danforth

Site: locations where readings were taken.

name	lat	long
DR-1	-49.85	-128.57
DR-3	-47.15	-126.72
MSK-4	-48.87	-123.4

Visited: when readings were taken at specific sites.

ident	site	dated
619	DR-1	1927-02-08
622	DR-1	1927-02-10
734	DR-3	1939-01-07
735	DR-3	1930-01-12
751	DR-3	1930-02-26
752	DR-3
837	MSK-4	1932-01-14
844	DR-1	1932-03-22

Survey: the actual readings.

taken	person	quant	reading
619	dyer	rad	9.82
619	dyer	sal	0.13
622	dyer	rad	7.8
622	dyer	sal	0.09
734	pb	rad	8.41
734	lake	sal	0.05
734	pb	temp	-21.5
735	pb	rad	7.22
735		sal	0.06
735		temp	-26.0
751	pb	rad	4.35
751	pb	temp	-18.5
751	lake	sal	0.1
752	lake	rad	2.19
752	lake	sal	0.09
752	lake	temp	-16.0
752	roe	sal	41.6
837	lake	rad	1.46
837	lake	sal	0.21
837	roe	sal	22.5
844	roe	rad	11.25

Notice that three entries—one in the Visited table, and two in the Survey table—are shown in red because they don't contain any actual data: we'll return to these missing values. For now, let's write an SQL query that displays scientists' names. We do this using the SQL command select, giving it the names of the columns we want and the table we want them from. Our query and its output look like this:

In [22]:

%sql select family, personal from Person;

Done.

Out[22]:

family	personal
Dyer	William
Pabodie	Frank
Lake	Anderson
Roerich	Valentina
Danforth	Frank

The semi-colon at the end of the query tells the database manager that the query is complete and ready to run. We have written our commands and column names in lower case, and the table name in Title Case, but we don't have to: as the example below shows, SQL is case insensitive.

In [23]:

%%sql
SeLeCt FaMiLy, PeRsOnAl FrOm PeRsOn;

Done.

Out[23]:

family	personal
Dyer	William
Pabodie	Frank
Lake	Anderson
Roerich	Valentina
Danforth	Frank

Many people format queries as:

SELECT family, personal FROM person;

or as:

select Family, Personal from PERSON;

Whatever casing convention you choose, please be consistent: complex queries are hard enough to read without the extra cognitive load of random capitalization.

Going back to our query, it's important to understand that the rows and columns in a database table aren't actually stored in any particular order. They will always be displayed in some order, but we can control that in various ways. For example, we could swap the columns in the output by writing our query as:

In [24]:

%%sql
select personal, family from Person;

Done.

Out[24]:

personal	family
William	Dyer
Frank	Pabodie
Anderson	Lake
Valentina	Roerich
Frank	Danforth

or even repeat columns:

In [25]:

%%sql
select ident, ident, ident from Person;

Done.

Out[25]:

ident	ident_1	ident_2
dyer	dyer	dyer
pb	pb	pb
lake	lake	lake
roe	roe	roe
danforth	danforth	danforth

As a shortcut, we can select all of the columns in a table using *:

In [26]:

%%sql
select * from Person;

Done.

Out[26]:

ident	personal	family
dyer	William	Dyer
pb	Frank	Pabodie
lake	Anderson	Lake
roe	Valentina	Roerich
danforth	Frank	Danforth

Exercise¶

Write a query that selects only site names from the Site table.

In [27]:

%%sql
select name from site;

Done.

Out[27]:

name
DR-1
DR-3
MSK-4

Sorting and Removing Duplicates¶

Data is often redundant, so queries often return redundant information. For example, if we select the quantitites that have been measured from the survey table, we get this:

In [28]:

%%sql
select quant from Survey;

Done.

Out[28]:

quant
rad
sal
rad
sal
rad
sal
temp
rad
sal
temp
rad
temp
sal
rad
sal
temp
sal
rad
sal
sal
rad

We can eliminate the redundant output to make the result more readable by adding the distinct keyword to our query:

In [29]:

%%sql
select distinct quant from Survey;

Done.

Out[29]:

quant
rad
sal
temp

If we select more than one column—for example, both the survey site ID and the quantity measured—then the distinct pairs of values are returned:

In [30]:

%%sql
select distinct taken, quant from Survey;

Done.

Out[30]:

taken	quant
619	rad
619	sal
622	rad
622	sal
734	rad
734	sal
734	temp
735	rad
735	sal
735	temp
751	rad
751	temp
751	sal
752	rad
752	sal
752	temp
837	rad
837	sal
844	rad

Notice in both cases that duplicates are removed even if they didn't appear to be adjacent in the database. Again, it's important to remember that rows aren't actually ordered: they're just displayed that way.

Exercise¶

Write a query that selects distinct dates from the Site table.

In [31]:

%%sql
select dated from visited; -- here is a comment

Done.

Out[31]:

dated
1927-02-08
1927-02-10
1939-01-07
1930-01-12
1930-02-26
None
1932-01-14
1932-03-22

As we mentioned earlier, database records are not stored in any particular order. This means that query results aren't necessarily sorted, and even if they are, we often want to sort them in a different way, e.g., by the name of the project instead of by the name of the scientist. We can do this in SQL by adding an order by clause to our query:

In [32]:

%%sql
select * from Person order by ident;

Done.

Out[32]:

ident	personal	family
danforth	Frank	Danforth
dyer	William	Dyer
lake	Anderson	Lake
pb	Frank	Pabodie
roe	Valentina	Roerich

By default, results are sorted in ascending order (i.e., from least to greatest). We can sort in the opposite order using desc (for "descending"):

In [33]:

%%sql
select * from person order by ident desc;

Done.

Out[33]:

ident	personal	family
roe	Valentina	Roerich
pb	Frank	Pabodie
lake	Anderson	Lake
dyer	William	Dyer
danforth	Frank	Danforth

(And if we want to make it clear that we're sorting in ascending order, we can use asc instead of desc.)

We can also sort on several fields at once. For example, this query sorts results first in ascending order by taken, and then in descending order by person within each group of equal taken values:

In [34]:

%%sql
select taken, person from Survey order by taken asc, person desc;

Done.

Out[34]:

taken	person
619	dyer
619	dyer
622	dyer
622	dyer
734	pb
734	pb
734	lake
735	pb
735	None
735	None
751	pb
751	pb
751	lake
752	roe
752	lake
752	lake
752	lake
837	roe
837	lake
837	lake
844	roe

This is easier to understand if we also remove duplicates:

In [35]:

%%sql
select distinct taken, person from Survey order by taken asc, person desc;

Done.

Out[35]:

taken	person
619	dyer
622	dyer
734	pb
734	lake
735	pb
735	None
751	pb
751	lake
752	roe
752	lake
837	roe
837	lake
844	roe

Exercises¶

Write a query that returns the distinct dates in the Visited table.
Write a query that displays the full names of the scientists in the Person table, ordered by family name.

In [36]:

%%sql
select personal, family from Person order by family;

Done.

Out[36]:

personal	family
Frank	Danforth
William	Dyer
Anderson	Lake
Frank	Pabodie
Valentina	Roerich

Filtering¶

One of the most powerful features of a database is the ability to filter data, i.e., to select only those records that match certain criteria. For example, suppose we want to see when a particular site was visited. We can select these records from the Visited table by using a where clause in our query:

In [37]:

%%sql
select * from Visited where site='DR-1';

Done.

Out[37]:

ident	site	dated
619	DR-1	1927-02-08
622	DR-1	1927-02-10
844	DR-1	1932-03-22

The database manager executes this query in two stages. First, it checks at each row in the Visited table to see which ones satisfy the where. It then uses the column names following the select keyword to determine what columns to display.

This processing order means that we can filter records using where based on values in columns that aren't then displayed:

In [38]:

%%sql
select ident from Visited where site='DR-1';

Done.

Out[38]:

ident
619
622
844

We can use many other Boolean operators to filter our data.¶

For example, we can ask for all information from the DR-1 site collected since 1930:

In [39]:

%%sql
select * from Visited where (site='DR-1') and (dated>='1930-00-00');

Done.

Out[39]:

ident	site	dated
844	DR-1	1932-03-22

(The parentheses around the individual tests aren't strictly required, but they help make the query easier to read.)

Most database managers have a special data type for dates. In fact, many have two: one for dates, such as "May 31, 1971", and one for durations, such as "31 days". SQLite doesn't: instead, it stores dates as either text (in the ISO-8601 standard format "YYYY-MM-DD HH:MM:SS.SSSS"), real numbers (the number of days since November 24, 4714 BCE), or integers (the number of seconds since midnight, January 1, 1970).

If we want to find out what measurements were taken by either Lake or Roerich, we can combine the tests on their names using or:

In [40]:

%%sql
select * from Survey where person='lake' or person='roe';

Done.

Out[40]:

taken	person	quant	reading
734	lake	sal	0.05
751	lake	sal	0.1
752	lake	rad	2.19
752	lake	sal	0.09
752	lake	temp	-16.0
752	roe	sal	41.6
837	lake	rad	1.46
837	lake	sal	0.21
837	roe	sal	22.5
844	roe	rad	11.25

Alternatively, we can use in to see if a value is in a specific set:

In [41]:

%%sql
select * from Survey where person in ('lake', 'roe');

Done.

Out[41]:

taken	person	quant	reading
734	lake	sal	0.05
751	lake	sal	0.1
752	lake	rad	2.19
752	lake	sal	0.09
752	lake	temp	-16.0
752	roe	sal	41.6
837	lake	rad	1.46
837	lake	sal	0.21
837	roe	sal	22.5
844	roe	rad	11.25

We can combine and with or, but we need to be careful about which operator is executed first. If we don't use parentheses, we get this:

In [42]:

%%sql
select * from Survey where quant='sal' and person='lake' or person='roe';

Done.

Out[42]:

taken	person	quant	reading
734	lake	sal	0.05
751	lake	sal	0.1
752	lake	sal	0.09
752	roe	sal	41.6
837	lake	sal	0.21
837	roe	sal	22.5
844	roe	rad	11.25

which is salinity measurements by Lake, and any measurement by Roerich. We probably want this instead:

In [43]:

%%sql
select * from Survey where quant='sal' and (person='lake' or person='roe');

Done.

Out[43]:

taken	person	quant	reading
734	lake	sal	0.05
751	lake	sal	0.1
752	lake	sal	0.09
752	roe	sal	41.6
837	lake	sal	0.21
837	roe	sal	22.5

Finally, we can use distinct with where to give a second level of filtering:

In [44]:

%%sql
select distinct person, quant from Survey where person='lake' or person='roe';

Done.

Out[44]:

person	quant
lake	sal
lake	rad
lake	temp
roe	sal
roe	rad

But remember: distinct is applied to the values displayed in the chosen columns, not to the entire rows as they are being processed.

What we have just done is how most people "grow" their SQL queries. We started with something simple that did part of what we wanted, then added more clauses one by one, testing their effects as we went. This is a good strategy—in fact, for complex queries it's often the only strategy—but it depends on quick turnaround, and on us recognizing the right answer when we get it.

The best way to achieve quick turnaround is often to put a subset of data in a temporary database and run our queries against that, or to fill a small database with synthesized records. For example, instead of trying our queries against an actual database of 20 million records, we could run it against a sample of ten thousand, or write a small program to generate ten thousand random (but plausible) records and use that.

We can also filter results using pattern-matching. The SQL test *column-name* like *pattern* is true if the value in the named column matches the pattern given; the character '%' can be used any number of times in the pattern to mean "match zero or more characters".

Expression	Value
`'a' like 'a'`	True
`'a' like '%a'`	True
`'b' like '%a'`	False
`'alpha' like 'a%'`	True
`'alpha' like 'a%p%'`	True

The expression *column-name* not like *pattern* inverts the test.

For example, we can write a query that finds all the records in Visited that aren't from sites labelled 'DR-something'.

In [45]:

%%sql
select * from Visited where site not like 'DR-%';

Done.

Out[45]:

ident	site	dated
837	MSK-4	1932-01-14

Example¶

Normalized salinity readings are supposed to be between 0.0 and 1.0. Write a query that selects all records from Survey with salinity values outside this range.

In [46]:

%%sql
select * from Survey where quant='sal' and (reading>1 or reading<0);

Done.

Out[46]:

taken	person	quant	reading
752	roe	sal	41.6
837	roe	sal	22.5

Calculating New Values¶

After carefully re-reading the expedition logs, we realize that the radiation measurements they report may need to be corrected upward by 5%. Rather than modifying the stored data, we can do this calculation on the fly as part of our query:

In [47]:

%%sql
select 1.05 * reading from Survey where quant='rad';

Done.

Out[47]:

05 * reading
10.311
8.19
8.8305
7.581
4.5675
2.2995
1.533
11.8125

When we run the query, the expression 1.05 * reading is evaluated for each row. Expressions can use any of the fields, all of usual arithmetic operators, and a variety of common functions. (Exactly which ones depends on which database manager is being used.) For example, we can convert temperature readings from Fahrenheit to Celsius and round to two decimal places:

In [48]:

%%sql
select taken, round(5*(reading-32)/9, 2) from Survey where quant='temp';

Done.

Out[48]:

taken	round(5*(reading-32)/9, 2)
734	-29.72
735	-32.22
751	-28.06
752	-26.67

We can also combine values from different fields, for example by using the string concatenation operator ||:

In [49]:

%%sql
select personal || ' ' || family from Person;

Done.

Out[49]:

personal \|\| ' ' \|\| family
William Dyer
Frank Pabodie
Anderson Lake
Valentina Roerich
Frank Danforth

The union operator combines the results of two queries:

In [50]:

%%sql
select * from Person where ident='dyer' union select * from Person where ident='roe';

Done.

Out[50]:

ident	personal	family
dyer	William	Dyer
roe	Valentina	Roerich

As a more complicated example, let's assume that Roerich mistakenly recorded all her measurements as percentages, rather than decimal values. We can use union to create a consolidated list of salinity measurements in which Roerich's (and only Roerich's) have been corrected.

In [51]:

%%sql
select taken, reading/100. from Survey where person='roe' and quant='sal' 
union select taken, reading from Survey where not person='roe' and quant='sal';

Done.

Out[51]:

taken
619	0.13
622	0.09
734	0.05
751	0.1
752	0.09
752	0.416
837	0.21
837	0.225

Example¶

The site identifiers in the Visited table have two parts separated by a '-':

In [52]:

%%sql
select distinct site from Visited;

Done.

Out[52]:

site
DR-1
DR-3
MSK-4

Some major site identifiers are two letters long and some are three. The "in string" function instr(X, Y) returns the 1-based index of the first occurrence of string Y in string X, or 0 if Y does not exist in X. The substring function substr(X, I, Z) returns the substring of X starting at index I that is Z characters long. Use these two functions to produce a list of unique major site identifiers. (For this data, the list should contain only "DR" and "MSK").

In [53]:

%%sql
select distinct substr(site, 0, instr(site, '-')) from visited;

Done.

Out[53]:

substr(site, 0, instr(site, '-'))
DR
MSK

Missing Data¶

Real-world data is rarely complete. Databases represent missing values using special value called null. null is not zero, False, or the empty string; it is a one-of-a-kind value that means "nothing here". Dealing with null requires a few special tricks and some careful thinking.

To start, let's have a look at the Visited table. There are eight records, but #752 doesn't have a date—or rather, its date is null:

In [54]:

%%sql
select * from Visited;

Done.

Out[54]:

ident	site	dated
619	DR-1	1927-02-08
622	DR-1	1927-02-10
734	DR-3	1939-01-07
735	DR-3	1930-01-12
751	DR-3	1930-02-26
752	DR-3	None
837	MSK-4	1932-01-14
844	DR-1	1932-03-22

Null doesn't behave like other values. If we select the records that come before 1930:

In [55]:

%%sql
select * from Visited where dated<'1930-00-00';

Done.

Out[55]:

ident	site	dated
619	DR-1	1927-02-08
622	DR-1	1927-02-10

we get two results, and if we select the ones that come during or after 1930:

In [56]:

%%sql
select * from Visited where dated>='1930-00-00';

Done.

Out[56]:

ident	site	dated
734	DR-3	1939-01-07
735	DR-3	1930-01-12
751	DR-3	1930-02-26
837	MSK-4	1932-01-14
844	DR-1	1932-03-22

we get five, but record #752 isn't in either set of results. The reason is that null<'1930-00-00' is neither true nor false: null means, "We don't know," and if we don't know the value on the left side of a comparison, we don't know whether the comparison is true or false. Since databases represent "don't know" as null, the value of null<'1930-00-00' is actually null. null>='1930-00-00' is also null because we can't answer to that question either. And since the only records kept by a where are those for which the test is true, record #752 isn't included in either set of results.

Comparisons aren't the only operations that behave this way with nulls. 1+null is null, 5*null is null, log(null) is null, and so on. In particular, comparing things to null with = and != produces null:

In [57]:

%%sql
select * from Visited where dated=NULL;

Done.

Out[57]:

ident	site	dated

In [58]:

%%sql
select * from Visited where dated!=NULL;

Done.

Out[58]:

ident	site	dated

To check whether a value is null or not, we must use a special test is null:

In [59]:

%%sql
select * from Visited where dated is NULL;

Done.

Out[59]:

ident	site	dated
752	DR-3	None

or its inverse is not null:

In [60]:

%%sql
select * from Visited where dated is not NULL;

Done.

Out[60]:

ident	site	dated
619	DR-1	1927-02-08
622	DR-1	1927-02-10
734	DR-3	1939-01-07
735	DR-3	1930-01-12
751	DR-3	1930-02-26
837	MSK-4	1932-01-14
844	DR-1	1932-03-22

Null values cause headaches wherever they appear. For example, suppose we want to find all the salinity measurements that weren't taken by Dyer. It's natural to write the query like this:

In [61]:

%%sql
select * from Survey where quant='sal' and person!='lake';

Done.

Out[61]:

taken	person	quant	reading
619	dyer	sal	0.13
622	dyer	sal	0.09
752	roe	sal	41.6
837	roe	sal	22.5

but this query filters omits the records where we don't know who took the measurement. Once again, the reason is that when person is null, the != comparison produces null, so the record isn't kept in our results. If we want to keep these records we need to add an explicit check:

In [62]:

%%sql
select * from Survey where quant='sal' and (person!='lake' or person is null);

Done.

Out[62]:

taken	person	quant	reading
619	dyer	sal	0.13
622	dyer	sal	0.09
735	None	sal	0.06
752	roe	sal	41.6
837	roe	sal	22.5

We still have to decide whether this is the right thing to do or not. If we want to be absolutely sure that we aren't including any measurements by Lake in our results, we need to exclude all the records for which we don't know who did the work.

Note also that the following does not work as you might expect:

In [63]:

%%sql
select * from Visited where dated in ('1927-02-08', null);

Done.

Out[63]:

ident	site	dated
619	DR-1	1927-02-08

Exercise¶

Write a query that sorts the records in Visited by date, omitting entries for which the date is not known (i.e., is null).

In [64]:

%%sql
select * from visited where dated is not null order by dated;

Done.

Out[64]:

ident	site	dated
619	DR-1	1927-02-08
622	DR-1	1927-02-10
735	DR-3	1930-01-12
751	DR-3	1930-02-26
837	MSK-4	1932-01-14
844	DR-1	1932-03-22
734	DR-3	1939-01-07

Aggregation¶

We now want to calculate ranges and averages for our data. We know how to select all of the dates from the Visited table:

In [65]:

%%sql
select dated from Visited;

Done.

Out[65]:

dated
1927-02-08
1927-02-10
1939-01-07
1930-01-12
1930-02-26
None
1932-01-14
1932-03-22

but to combine them, we must use an aggregation function such as min or max. Each of these functions takes a set of records as input, and produces a single record as output:

In [66]:

%%sql
select min(dated) from Visited;

Done.

Out[66]:

min(dated)
1927-02-08

In [67]:

%%sql
select max(dated) from Visited;

Done.

Out[67]:

max(dated)
1939-01-07

min and max are just two of the aggregation functions built into SQL. Three others are avg, count, and sum:

In [68]:

%%sql
select avg(reading) from Survey where quant='sal';

Done.

Out[68]:

avg(reading)
7.20333333333

In [69]:

%%sql
select count(reading) from Survey where quant='sal';

Done.

Out[69]:

count(reading)
9

In [70]:

%%sql
select sum(reading) from Survey where quant='sal';

Done.

Out[70]:

sum(reading)
64.83

We used count(reading) here, but we could just as easily have counted quant or any other field in the table, or even used count(*), since the function doesn't care about the values themselves, just how many values there are.

SQL lets us do several aggregations at once. We can, for example, find the range of sensible salinity measurements:

In [71]:

%%sql
select min(reading), max(reading) from Survey where quant='sal' and reading<=1.0;

Done.

Out[71]:

min(reading)	max(reading)
0.05	0.21

We can also combine aggregated results with raw results, although the output might surprise you:

In [72]:

%%sql
select person, count(*) from Survey where quant='sal' and reading<=1.0;

Done.

Out[72]:

person	count(*)
lake	7

Why does Lake's name appear rather than Roerich's or Dyer's? The answer is that when it has to aggregate a field, but isn't told how to, the database manager chooses an actual value from the input set. It might use the first one processed, the last one, or something else entirely.

Another important fact is that when there are no values to aggregate, aggregation's result is "don't know" rather than zero or some other arbitrary value:

In [73]:

%%sql
select person, max(reading), sum(reading) from Survey where quant='missing';

Done.

Out[73]:

person	max(reading)	sum(reading)
None	None	None

One final important feature of aggregation functions is that they are inconsistent with the rest of SQL in a very useful way. If we add two values, and one of them is null, the result is null. By extension, if we use sum to add all the values in a set, and any of those values are null, the result should also be null. It's much more useful, though, for aggregation functions to ignore null values and only combine those that are non-null. This behavior lets us write our queries as:

In [74]:

%%sql
select min(dated) from Visited;

Done.

Out[74]:

min(dated)
1927-02-08

instead of always having to filter explicitly:

In [75]:

%%sql
select min(dated) from Visited where dated is not null;

Done.

Out[75]:

min(dated)
1927-02-08

Aggregating all records at once doesn't always make sense. For example, suppose we suspect that there is a systematic bias in the data, and that some scientists' radiation readings are higher than others.

We know that this doesn't work:

In [76]:

%%sql
select person, count(reading), round(avg(reading), 2)
from  Survey
where quant='rad';

Done.

Out[76]:

person	count(reading)	round(avg(reading), 2)
roe	8	6.56

because the database manager selects a single arbitrary scientist's name rather than aggregating separately for each scientist. Since there are only five scientists, she could write five queries of the form:

In [77]:

%%sql
select person, count(reading), round(avg(reading), 2)
from  Survey
where quant='rad'
and   person='dyer';

Done.

Out[77]:

person	count(reading)	round(avg(reading), 2)
dyer	2	8.81

but this would be tedious, and if she ever had a data set with fifty or five hundred scientists, the chances of her getting all of those queries right is small.

What we need to do is tell the database manager to aggregate the hours for each scientist separately using a group by clause (recall a similar function in Pandas):

In [78]:

%%sql
select   person, count(reading), round(avg(reading), 2)
from     Survey
where    quant='rad'
group by person;

Done.

Out[78]:

person	count(reading)	round(avg(reading), 2)
dyer	2	8.81
lake	2	1.82
pb	3	6.66
roe	1	11.25

group by does exactly what its name implies: groups all the records with the same value for the specified field together so that aggregation can process each batch separately. Since all the records in each batch have the same value for person, it no longer matters that the database manager is picking an arbitrary one to display alongside the aggregated reading values.

Just as we can sort by multiple criteria at once, we can also group by multiple criteria. To get the average reading by scientist and quantity measured, for example, we just add another field to the group by clause:

In [79]:

%%sql
select   person, quant, count(reading), round(avg(reading), 2)
from     Survey
group by person, quant;

Done.

Out[79]:

person	quant	count(reading)	round(avg(reading), 2)
None	sal	1	0.06
None	temp	1	-26.0
dyer	rad	2	8.81
dyer	sal	2	0.11
lake	rad	2	1.82
lake	sal	4	0.11
lake	temp	1	-16.0
pb	rad	3	6.66
pb	temp	2	-20.0
roe	rad	1	11.25
roe	sal	2	32.05

Note that we have added person to the list of fields displayed, since the results wouldn't make much sense otherwise.

Let's go one step further and remove all the entries where we don't know who took the measurement:

In [80]:

%%sql
select   person, quant, count(reading), round(avg(reading), 2)
from     Survey
where    person is not null
group by person, quant
order by person, quant;

Done.

Out[80]:

person	quant	count(reading)	round(avg(reading), 2)
dyer	rad	2	8.81
dyer	sal	2	0.11
lake	rad	2	1.82
lake	sal	4	0.11
lake	temp	1	-16.0
pb	rad	3	6.66
pb	temp	2	-20.0
roe	rad	1	11.25
roe	sal	2	32.05

The order of operations for this query is as follows:

selected records from the Survey table where the person field was not null;
grouped those records into subsets so that the person and quant values in each subset were the same;
ordered those subsets first by person, and then within each sub-group by quant; and
counted the number of records in each subset, calculated the average reading in each, and chose a person and quant value from each (it doesn't matter which ones, since they're all equal).

Suppose now we wanted to center some of the readings, by subtracting the mean. One might expect the following would do the trick:

In [81]:

%%sql
select reading - avg(reading) from Survey where quant='rad';

Done.

Out[81]:

reading - avg(reading)
4.6875

However, you cannot mix aggregated and non-aggregated values in the same query. The (somewhat awkward) solution is to calculate the average in a separate query, and use that value in a query of the non-aggregated values.

In [82]:

%%sql
select reading - (select avg(reading) from Survey where quant='rad') 
from Survey where quant='rad';

Done.

Out[82]:

reading - (select avg(reading) from Survey where quant='rad')
3.2575
1.2375
1.8475
0.6575
-2.2125
-4.3725
-5.1025
4.6875

Exercise¶

The function group_concat(field, separator) concatenates all the values in a field using the specified separator character (or ',' if the separator isn't specified). Use this to produce a one-line list of scientists' names, such as:

William Dyer, Frank Pabodie, Anderson Lake, Valentina Roerich, Frank Danforth

Can you find a way to order the list by surname?

In [83]:

%%sql
select group_concat(personal || ' ' || family, ', ') from (select * from person order by family);

Done.

Out[83]:

group_concat(personal \|\| ' ' \|\| family, ', ')
Frank Danforth, William Dyer, Anderson Lake, Frank Pabodie, Valentina Roerich

Combining Data¶

Suppose we wanted to submit the data to a web site that aggregates historical meteorological data, which needs the data formated as: latitude, longitude, date, quantity, and reading. However, the latitudes and longitudes are in the Site table, while the dates of measurements are in the Visited table, and the readings themselves are in the Survey table. These fields need to be sensibly combined.

The SQL command to do this is join.

In [84]:

%%sql
select * from Site join Visited;

Done.

Out[84]:

name	lat	long	ident	site	dated
DR-1	-49.85	-128.57	619	DR-1	1927-02-08
DR-1	-49.85	-128.57	622	DR-1	1927-02-10
DR-1	-49.85	-128.57	734	DR-3	1939-01-07
DR-1	-49.85	-128.57	735	DR-3	1930-01-12
DR-1	-49.85	-128.57	751	DR-3	1930-02-26
DR-1	-49.85	-128.57	752	DR-3	None
DR-1	-49.85	-128.57	837	MSK-4	1932-01-14
DR-1	-49.85	-128.57	844	DR-1	1932-03-22
DR-3	-47.15	-126.72	619	DR-1	1927-02-08
DR-3	-47.15	-126.72	622	DR-1	1927-02-10
DR-3	-47.15	-126.72	734	DR-3	1939-01-07
DR-3	-47.15	-126.72	735	DR-3	1930-01-12
DR-3	-47.15	-126.72	751	DR-3	1930-02-26
DR-3	-47.15	-126.72	752	DR-3	None
DR-3	-47.15	-126.72	837	MSK-4	1932-01-14
DR-3	-47.15	-126.72	844	DR-1	1932-03-22
MSK-4	-48.87	-123.4	619	DR-1	1927-02-08
MSK-4	-48.87	-123.4	622	DR-1	1927-02-10
MSK-4	-48.87	-123.4	734	DR-3	1939-01-07
MSK-4	-48.87	-123.4	735	DR-3	1930-01-12
MSK-4	-48.87	-123.4	751	DR-3	1930-02-26
MSK-4	-48.87	-123.4	752	DR-3	None
MSK-4	-48.87	-123.4	837	MSK-4	1932-01-14
MSK-4	-48.87	-123.4	844	DR-1	1932-03-22

join creates the cross product of two tables, that is, it joins each record of one with each record of the other to give all possible combinations. Since there are three records in Site and eight in Visited, the join's output has 24 records. And since each table has three fields, the output has six fields.

What the join hasn't done is figure out if the records being joined have anything to do with each other. It has no way of knowing whether they do or not until we tell it how. To do that, we add a clause specifying that we're only interested in combinations that have the same site name:

In [85]:

%%sql
select * from Site join Visited on Site.name=Visited.site;

Done.

Out[85]:

name	lat	long	ident	site	dated
DR-1	-49.85	-128.57	619	DR-1	1927-02-08
DR-1	-49.85	-128.57	622	DR-1	1927-02-10
DR-1	-49.85	-128.57	844	DR-1	1932-03-22
DR-3	-47.15	-126.72	734	DR-3	1939-01-07
DR-3	-47.15	-126.72	735	DR-3	1930-01-12
DR-3	-47.15	-126.72	751	DR-3	1930-02-26
DR-3	-47.15	-126.72	752	DR-3	None
MSK-4	-48.87	-123.4	837	MSK-4	1932-01-14

on does the same job as where: it only keeps records that pass some test. (The difference between the two is that on filters records as they're being created, while where waits until the join is done and then does the filtering.) Once we add this to our query, the database manager throws away records that combined information about two different sites, leaving us with just the ones we want.

Notice that we used <table>.<field> to specify field names in the output of the join. We do this because tables can have fields with the same name, and we need to be specific which ones we're talking about. For example, if we joined the person and visited tables, the result would inherit a field called ident from each of the original tables.

We can now use the same dotted notation to select the three columns we actually want out of our join:

In [86]:

%%sql
select Site.lat, Site.long, Visited.dated
from   Site join Visited
on     Site.name=Visited.site;

Done.

Out[86]:

lat	long	dated
-49.85	-128.57	1927-02-08
-49.85	-128.57	1927-02-10
-49.85	-128.57	1932-03-22
-47.15	-126.72	None
-47.15	-126.72	1930-01-12
-47.15	-126.72	1930-02-26
-47.15	-126.72	1939-01-07
-48.87	-123.4	1932-01-14

If joining two tables is good, joining many tables must be better. In fact, we can join any number of tables simply by adding more join clauses to our query, and more on tests to filter out combinations of records that don't make sense:

In [87]:

%%sql
select Site.lat, Site.long, Visited.dated, Survey.quant, Survey.reading
from   Site join Visited join Survey
on     Site.name=Visited.site
and    Visited.ident=Survey.taken
and    Visited.dated is not null;

Done.

Out[87]:

lat	long	dated	quant	reading
-49.85	-128.57	1927-02-08	rad	9.82
-49.85	-128.57	1927-02-08	sal	0.13
-49.85	-128.57	1927-02-10	rad	7.8
-49.85	-128.57	1927-02-10	sal	0.09
-47.15	-126.72	1939-01-07	rad	8.41
-47.15	-126.72	1939-01-07	sal	0.05
-47.15	-126.72	1939-01-07	temp	-21.5
-47.15	-126.72	1930-01-12	rad	7.22
-47.15	-126.72	1930-01-12	sal	0.06
-47.15	-126.72	1930-01-12	temp	-26.0
-47.15	-126.72	1930-02-26	rad	4.35
-47.15	-126.72	1930-02-26	sal	0.1
-47.15	-126.72	1930-02-26	temp	-18.5
-48.87	-123.4	1932-01-14	rad	1.46
-48.87	-123.4	1932-01-14	sal	0.21
-48.87	-123.4	1932-01-14	sal	22.5
-49.85	-128.57	1932-03-22	rad	11.25

We can tell which records from Site, Visited, and Survey correspond with each other because those tables contain primary keys and foreign keys.

primary key

: a value, or combination of values, that uniquely identifies each record in a table.

foreign key

: a value (or combination of values) from one table that identifies a unique record in another table.

In other words, a foreign key is the primary key of one table that appears in some other table. In our database, Person.ident is the primary key in the Person table, while Survey.person is a foreign key relating the Survey table's entries to entries in Person.

Most database designers believe that every table should have a well-defined primary key. They also believe that this key should be separate from the data itself, so that if we ever need to change the data, we only need to make one change in one place. One easy way to do this is to create an arbitrary, unique ID for each record as we add it to the database. This is actually very common: those IDs have names like "student numbers" and "patient numbers", and they almost always turn out to have originally been a unique record identifier in some database system or other. As the query below demonstrates, SQLite automatically numbers records as they're added to tables, and we can use those record numbers in queries:

In [88]:

%%sql
select rowid, * from Person;

Done.

Out[88]:

rowid	ident	personal	family
1	dyer	William	Dyer
2	pb	Frank	Pabodie
3	lake	Anderson	Lake
4	roe	Valentina	Roerich
5	danforth	Frank	Danforth

Data Hygiene¶

Now that we have seen how joins work, we can see why the relational model is so useful and how best to use it.

The first rule is that every value should be atomic, i.e., not contain parts that we might want to work with separately. We store personal and family names in separate columns instead of putting the entire name in one column so that we don't have to use substring operations to get the name's components. More importantly, we store the two parts of the name separately because splitting on spaces is unreliable: just think of a name like "Eloise St. Cyr" or "Jan Mikkel Steubart".

The second rule is that every record should have a unique primary key. This can be a serial number that has no intrinsic meaning, one of the values in the record (like the ident field in the Person table), or even a combination of values: the triple (taken, person, quant) from the Survey table uniquely identifies every measurement.

The third rule is that there should be no redundant information. For example, we could get rid of the Site table and rewrite the Visited table like this:

619	-49.85	-128.57	1927-02-08
622	-49.85	-128.57	1927-02-10
734	-47.15	-126.72	1939-01-07
735	-47.15	-126.72	1930-01-12
751	-47.15	-126.72	1930-02-26
752	-47.15	-126.72	null
837	-48.87	-123.40	1932-01-14
844	-49.85	-128.57	1932-03-22

In fact, we could use a single table that recorded all the information about each reading in each row, just as a spreadsheet would. The problem is that it's very hard to keep data organized this way consistent: if we realize that the date of a particular visit to a particular site is wrong, we have to change multiple records in the database. What's worse, we may have to guess which records to change, since other sites may also have been visited on that date.

The fourth rule is that the units for every value should be stored explicitly. Our database doesn't do this, and that's a problem: Roerich's salinity measurements are several orders of magnitude larger than anyone else's, but we don't know if that means she was using parts per million instead of parts per thousand, or whether there actually was a saline anomaly at that site in 1932.

Stepping back, data and the tools used to store it have a symbiotic relationship: we use tables and joins because it's efficient, provided our data is organized a certain way, but organize our data that way because we have tools to manipulate it efficiently if it's in a certain form. As anthropologists say, the tool shapes the hand that shapes the tool.

Example¶

Write a query that lists all radiation readings from the DR-1 site.

In [88]:

Creating and Modifying Data¶

The following statements create the four tables in our survey database:

create table Person(ident text, personal text, family text);
create table Site(name text, lat real, long real);
create table Visited(ident integer, site text, dated text);
create table Survey(taken integer, person text, quant real, reading real);

We can get rid of one of our tables using:

drop table Survey;

Be very careful when doing this: most databases have some support for undoing changes, but it's better not to have to rely on it.

Different database systems support different data types for table columns, but most provide the following:

integer	a signed integer
real	a floating point number
text	a character string
blob	a "binary large object", such as an image

Most databases also support Booleans and date/time values; SQLite uses the integers 0 and 1 for the former, and there are several ways to represent dates, as discussed earlier. An increasing number of databases also support geographic data types, such as latitude and longitude. Keeping track of what particular systems do or do not offer, and what names they give different data types, is an unending portability headache.

When we create a table, we can specify several kinds of constraints on its columns. For example, a better definition for the Survey table would be:

create table Survey(
    taken   integer not null, -- where reading taken
    person  text,             -- may not know who took it
    quant   real not null,    -- the quantity measured
    reading real not null,    -- the actual reading
    primary key(taken, quant),
    foreign key(taken) references Visited(ident),
    foreign key(person) references Person(ident)
);

Once again, exactly what constraints are avialable and what they're called depends on which database manager we are using.

Adding, Removing, and Updating Data¶

Once tables have been created, we can add, change, and remove records using our other set of commands, insert, update, and delete.

The simplest form of insert statement lists values in order:

insert into Site values('DR-1', -49.85, -128.57);
insert into Site values('DR-3', -47.15, -126.72);
insert into Site values('MSK-4', -48.87, -123.40);

We can also insert values into one table directly from another:

In [89]:

%%sql
create table JustLatLong(name text, lat text, long text);
insert into JustLatLong select name, lat, long from site;

(OperationalError) table JustLatLong already exists u'create table JustLatLong(name text, lat text, long text);' ()

Modifying existing records is done using the update statement. To do this we tell the database which table we want to update, what we want to change the values to for any or all of the fields, and under what conditions we should update the values.

For example, if we made a mistake when entering the lat and long values of the last insert statement above:

In [90]:

%%sql
update JustLatLong set lat=-47.87, long=-122.40 where name='MSK-4'

1 rows affected.

Out[90]:

[]

Be care to not forget the where clause or the update statement will modify all of the records in the database!

Deleting records can be a bit trickier, because we have to ensure that the database remains internally consistent. If all we care about is a single table, we can use the delete command with a where clause that matches the records we want to discard. For example, once we realize that Frank Danforth didn't take any measurements, we can remove him from the Person table like this:

delete from Person where ident = "danforth";

But what if we removed Anderson Lake instead? Our Survey table would still contain seven records of measurements he'd taken, but that's never supposed to happen: Survey.person is a foreign key into the Person table, and all our queries assume there will be a row in the latter matching every value in the former.

This problem is called referential integrity; we need to ensure that all references between tables can always be resolved correctly. One way to do this is to delete all the records that use 'lake' as a foreign key before deleting the record that uses it as a primary key.

Many applications use a hybrid storage model instead of putting everything into a database: the actual data (such as images) are stored in files, while the database stores the files' names, their modification dates, the region of the sky they cover, their spectral characteristics, and so on. This is also how most music player software is built: the database inside the application keeps track of the MP3 files, but the files themselves live on disk.

In [91]:

%%sql
drop table JustLatLong;

Done.

Out[91]:

[]

Exercise¶

One of our colleagues has sent us a CSV file containing temperature readings by Robert Olmstead, which is formatted like this:

Taken,Temp
619,-21.5
622,-15.5

Write some Python code that reads this in and prints out the SQL insert statements needed to add these records to the survey database.

(Note: you will need to add an entry for Olmstead to the Person table. If you are testing your program repeatedly, you may want to investigate SQL's insert or replace command.)

In [91]:

Programming with Databases¶

Let's return to the sqlite3 module. Here's a short Python program that selects latitudes and longitudes from survey.db:

In [92]:

connection = sqlite3.connect("../data/survey.db")
cursor = connection.cursor()
cursor.execute("select site.lat, site.long from site;")
results = cursor.fetchall()
for r in results:
    print(r)
cursor.close()
connection.close()

(-49.85, -128.57)
(-47.15, -126.72)
(-48.87, -123.4)

Queries in real applications will often depend on values provided by users. For example, this function takes a user's ID as a parameter and returns their name:

In [93]:

def get_name(database_file, person_ident):
    query = "select personal || ' ' || family from Person where ident='" + person_ident + "';"

    connection = sqlite3.connect(database_file)
    cursor = connection.cursor()
    cursor.execute(query)
    results = cursor.fetchall()
    cursor.close()
    connection.close()

    return results[0][0]

print("full name for dyer:", get_name('../data/survey.db', 'dyer'))

('full name for dyer:', u'William Dyer')

We use string concatenation on the first line of this function to construct a query containing the user ID we have been given. This seems simple enough, but what happens if someone gives us this string as input?

dyer'; drop table Survey; select '

It looks like there's garbage after the name of the project, but it is very carefully chosen garbage. If we insert this string into our query, the result is:

select personal || ' ' || family from Person where ident='dyer'; drop tale Survey; select '';

If we execute this, it will erase one of the tables in our database.

This is called an SQL injection attack, and it has been used to attack thousands of programs over the years. In particular, many web sites that take data from users insert values directly into queries without checking them carefully first.

Since a villain might try to smuggle commands into our queries in many different ways, the safest way to deal with this threat is to replace characters like quotes with their escaped equivalents, so that we can safely put whatever the user gives us inside a string. We can do this by using a prepared statement instead of formatting our statements as strings. Here's what our example program looks like if we do this:

In [94]:

def get_name(database_file, person_ident):
    query = "select personal || ' ' || family from Person where ident=?;"

    connection = sqlite3.connect(database_file)
    cursor = connection.cursor()
    cursor.execute(query, [person_ident])
    results = cursor.fetchall()
    cursor.close()
    connection.close()

    return results[0][0]

print("full name for dyer:", get_name('../data/survey.db', 'dyer'))

('full name for dyer:', u'William Dyer')

The key changes are in the query string and the execute call. Instead of formatting the query ourselves, we put question marks in the query template where we want to insert values. When we call execute, we provide a list that contains as many values as there are question marks in the query. The library matches values to question marks in order, and translates any special characters in the values into their escaped equivalents so that they are safe to use.

SQLite in PyMC¶

PyMC uses SQLite as an optional backend for storing MCMC traces. This is a good idea when you are collecting a lot of simulation output, or you want your output to be permanent and portable. PyMC stores traces in memory by default, meaning that they disappear when the current session ends.

The SQLite support resides in a subclass of PyMC's Database class that provides a common interface for all backend types. Instantiating the SQLite Database object intitializes (or opens) a database, and provides methods for committing and closing.

class Database(base.Database):

    """SQLite database.
    """

    def __init__(self, dbname, dbmode='a'):
        """Open or create an SQL database.

        :Parameters:
        dbname : string
          The name of the database file.
        dbmode : {'a', 'w'}
          File mode.  Use `a` to append values, and `w` to overwrite
          an existing file.
        """
        self.__name__ = 'sqlite'
        self.dbname = dbname
        self.__Trace__ = Trace

        self.trace_names = []
        # A list of sequences of names of the objects to tally.
        self._traces = {}  # A dictionary of the Trace objects.

        if os.path.exists(dbname) and dbmode == 'w':
            os.remove(dbname)

        self.DB = sqlite3.connect(dbname, check_same_thread=False)
        self.cur = self.DB.cursor()

        existing_tables = get_table_list(self.cur)
        if existing_tables:
            # Get number of existing chains
            self.cur.execute(
                'SELECT MAX(trace) FROM [%s]' %
                existing_tables[0])
            self.chains = self.cur.fetchall()[0][0] + 1
            self.trace_names = self.chains * [existing_tables, ]
        else:
            self.chains = 0

    def commit(self):
        """Commit updates to database"""
        self.DB.commit()

    def close(self, *args, **kwds):
        """Close database."""
        self.cur.close()
        self.commit()
        self.DB.close()

The second class is the Trace, which represents a single MCMC trace, and interacts with the Database to execute queries.

class Trace(base.Trace):

    """SQLite Trace class."""

    def _initialize(self, chain, length):
        """Create an SQL table.
        """

        if self._getfunc is None:
            self._getfunc = self.db.model._funs_to_tally[self.name]

        # Determine size
        try:
            self._shape = np.shape(self._getfunc())
        except TypeError:
            self._shape = None

        self._vstr = ', '.join(var_str(self._shape))

        # If the table already exists, exit now.
        if chain != 0:
            return

        # Create the variable name strings.
        vstr = ', '.join(v + ' FLOAT' for v in var_str(self._shape))
        query = """CREATE TABLE IF NOT EXISTS [%s]
                     (recid INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
                      trace  int(5), %s)""" % (self.name, vstr)
        self.db.cur.execute(query)

    def tally(self, chain):
        """Adds current value to trace."""

        try:
            valstring = ', '.join(
                ['%f' %
                 x for x in np.ravel(self._getfunc())])
        except:
            valstring = str(self._getfunc())

        # Add value to database
        query = "INSERT INTO [%s] (recid, trace, %s) values (NULL, %s, %s)" % \
            (self.name, self._vstr, chain, valstring)
        self.db.cur.execute(query)

    def gettrace(self, burn=0, thin=1, chain=-1, slicing=None):
        """Return the trace (last by default).

        Input:
          - burn (int): The number of transient steps to skip.
          - thin (int): Keep one in thin.
          - chain (int): The index of the chain to fetch. If None, return all
            chains. By default, the last chain is returned.
          - slicing: A slice, overriding burn and thin assignement.
        """
        if not slicing:
            slicing = slice(burn, None, thin)

        # If chain is None, get the data from all chains.
        if chain is None:
            self.db.cur.execute('SELECT * FROM [%s]' % self.name)
            trace = self.db.cur.fetchall()
        else:
            # Deal with negative chains (starting from the end)
            if chain < 0:
                chain = range(self.db.chains)[chain]
            self.db.cur.execute(
                'SELECT * FROM [%s] WHERE trace=%s' %
                (self.name, chain))
            trace = self.db.cur.fetchall()
        trace = np.array(trace)[:, 2:]
        if len(self._shape) > 1:
            trace = trace.reshape(-1, *self._shape)
        return squeeze(trace[slicing])

Exercise¶

Write SQL code to import the tables from the course project data into an SQLite database.

In [ ]:

References¶

Kreibich JA. Using SQLite. O'Reilly Media. 2010:1–528.

Using Databases and SQL. Software Carpentry.

In [1]:

from IPython.core.display import HTML
def css_styling():
    styles = open("styles/custom.css", "r").read()
    return HTML(styles)
css_styling()

Out[1]:

In [ ]: