Due to the sheer size of many modern datasets, statistical applications can quickly outstrip the capabilities of unstructured data stores, such as spreadsheets and text files. So, it is useful to be able to work with databases for both the retrieval of information when it is required for analysis and for the storage of processed data and analytic outputs.
Relational database management systems (RDBMS) are used to store user-defined records in large tables. Each table has columns (also known as fields) which describe the data, and rows (also known as records) which contain the data.
Unlike when we are using a spreadsheet, where we put formulas into cells to calculate values, when we are using a database, we send queries to the database management system to manipulate the database for us. The database manager can process complex query commands that combine data from multiple tables to generate reports and data summaries.
Every database manager—Oracle, IBM DB2, PostgreSQL, MySQL, Microsoft Access, and SQLite—stores data in a different way, so a database created with one cannot be used directly by another. However, every database manager can import and export data in a variety of formats, so it is possible to move information from one to another.
Writing Python code to access databases is made easier by the presence of the Python Database API (commonly referred to as the DB-API). This API fits nicely into existing Python code and allows programmers to easily store and retrieve data from databases.
The Python DB-API specifies a standard way to connect to databases and issue commands to them, and provides a common starting point for anyone writing code to interact with databases. There are several Python modules that extend the DB-API with different levels of abstraction. They range from simple result set wrappers to full blown object relational mappers. The simple wrappers typically wrap the results of database operations in more Pythonic data structures like dictionaries, while object relational mappers graft object interfaces onto the database, and model tables as Python classes. This allows Python programmers to (largely) distance themselves from writing SQL statements.
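As an illustration of the "simple wrapper" style, the standard sqlite3 module (covered in detail below) can return rows as dict-like objects through its row_factory hook. A minimal sketch, using a throwaway in-memory database:

```python
import sqlite3

# A throwaway in-memory database keeps the sketch self-contained.
con = sqlite3.connect(':memory:')
con.row_factory = sqlite3.Row  # wrap each result row in a dict-like object

con.execute('create table person (ident text, family text)')
con.execute("insert into person values ('dyer', 'Dyer')")

row = con.execute('select * from person').fetchone()
family = row['family']  # access columns by name rather than position
con.close()
```

Full object-relational mappers such as SQLAlchemy take this much further, modelling whole tables as Python classes.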
Databases are typically based on a client-server model. Just as a web server provides content to multiple website visitors that request and render pages, a database returns query results to one or more clients.
The model consists of one server instance and many client instances.
Server: stores the data, processes incoming queries, and returns the results.
Client: connects to the server, sends queries, and receives the results.
Popular database servers include Oracle, MySQL, and Microsoft SQL Server; SQLite, as we will see below, takes a different, serverless approach.
The database interaction paradigm consists of multiple steps: connect to the database, create a cursor, execute queries, fetch results, commit any changes, and close the connection.
Databases can be accessed from Python through the interface described by the Python DB-API. Different brands of database will have slightly different implementations, and may not completely conform to the DB-API, but most are quite similar.
Irrespective of the database system you are using, there are three core concepts to database access in Python:
Connection: a connection to a database, which handles implementation-specific details and access control.
Cursor: a means for traversing the database, which governs the execution of SQL statements and the return of results.
Results: the output from database queries in Python is always a sequence of sequences, with each inner sequence representing one row of a table.
Database queries are sent via the cursor as strings of SQL. Thus, productive use of databases requires some mastery of the SQL language, and we will teach the basics of SQL here.
A good entry point is the SQLite database; SQLite is a public-domain software package that provides a relational database management system.
SQLite is defined by the following features:
Serverless: SQLite does not require a separate server process or system to operate. The SQLite library accesses its storage files directly.
Zero configuration: No server means no setup. Creating an SQLite database instance is as easy as opening a file.
Single database file: The entire database instance resides in a single cross-platform file, requiring no administration.
Self-contained: A single library contains the entire database system, which integrates directly into a host application.
Small footprint: The default build is less than a megabyte of code and requires only a few megabytes of memory.
Transactional: SQLite transactions are fully ACID-compliant, allowing safe access from multiple processes or threads.
Unlike most RDBMS products, SQLite does not have a client/server architecture. Most large-scale database systems have a large server package that makes up the database engine. The database server often consists of multiple processes that work in concert to manage client connections, file I/O, caches, query optimization, and query processing. A database instance typically consists of a large number of files organized into one or more directory trees on the server filesystem. In order to access the database, all of the files must be present and correct. This can make it somewhat difficult to move or reliably back up a database instance. All of these components require resources and support from the host computer.
In contrast, SQLite has no separate server. The entire database engine is integrated into whatever application needs to access a database. The only shared resource among applications is the single database file as it sits on disk. If you need to move or back up the database, you can simply copy the file. Unlike a traditional RDBMS server that requires advanced multitasking and high-performance inter-process communication, SQLite requires little more than the ability to read and write to some type of storage.
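Python exposes this copy-the-file simplicity directly: the sqlite3 module's Connection.backup method (available since Python 3.7) safely copies a live database, even while it is in use. A sketch with two in-memory databases standing in for the source and backup files:

```python
import sqlite3

src = sqlite3.connect(':memory:')
src.execute('create table t (x integer)')
src.execute('insert into t values (42)')
src.commit()

# Because an SQLite database is a single file (or memory image),
# backing it up amounts to copying it; Connection.backup does this
# safely even while the source connection is active.
dst = sqlite3.connect(':memory:')
src.backup(dst)

copied = dst.execute('select x from t').fetchone()[0]
src.close()
dst.close()
```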
SQLite doesn't make sense for scenarios with a high volume of transactions, a very large quantity of data, or multiple users requiring access control. Its suitability depends on whether trading some security and robustness for simplicity and portability makes sense for your application.
The Python Standard Library includes a module called sqlite3
intended for working with this database. The module provides an SQL interface compliant with the DB-API 2.0 specification.
The sqlite3 command line tool accepts SQL commands from an interactive prompt and passes those commands to the SQLite core for processing. To get started, just run the sqlite3
command; if you provide a filename (such as our sample database survey.db
), sqlite3 will open (or create) that file. If no filename is given, sqlite3 will automatically open an unnamed temporary database.
The sqlite>
prompt means sqlite3 is ready to accept commands, similar to >>>
in Python. We can issue some basic expressions:
The first line is a simple SQL expression that returns a sum and an integer. All SQL commands in the command line tool must end with semicolons, which indicate the command is complete. We will explore the SELECT statement fully later on.
In addition to processing SQL statements, there is a series of shell-specific commands. These are sometimes referred to as dot-commands because they start with a period. Dot-commands control the shell’s output formatting, and also provide a number of utility features.
In the above example, .databases
lists the names and files of attached databases, while .tables
lists the tables in the attached databases. Finally, we exited with .quit
.
Relational databases are more useful to us when integrated with our analytic tools. SQLite can be integrated with Python using the sqlite3
module which was written by Gerhard Haring. It provides an SQL interface compliant with the DB-API 2.0 specification. You do not need to install this module separately because it is part of Python's core set of modules.
The connect
function is used to establish a connection to a new or existing database. Since we're using SQLite, all we need to specify is the name of the database file. Other systems may require us to provide a username and password as well.
Let's create a new database.
import sqlite3
con = sqlite3.connect('microbiome.db')
con
<sqlite3.Connection at 0x103a13940>
In order to communicate with the database, we need to create a cursor object. A database cursor is a control structure that enables traversal over the records in a database. Cursors facilitate subsequent processing in conjunction with the traversal, such as retrieval, addition and removal of database records. Just like the cursor in an editor, its role is to keep track of where we are in the database.
cur = con.cursor()
cur
<sqlite3.Cursor at 0x1045dd810>
The cursor's execute
method provides the means for passing queries to the database. Let's create a table.
cur.execute('create table samples (Taxon text, Patient integer, NEC integer, Tissue integer, Stool integer)')
<sqlite3.Cursor at 0x1045dd810>
It's our job to make sure that SQL is properly formatted; if it isn't, or if something goes wrong when it is being executed, the database will report an error.
con.commit()
The commit
function completes the transaction.
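Rather than calling commit by hand, the connection object can also be used as a context manager, which commits the transaction if the block succeeds and rolls it back on an exception. A sketch, using an in-memory database and a cut-down two-column table:

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.execute('create table samples (Taxon text, Patient integer)')

# As a context manager, the connection commits automatically when the
# block completes, and rolls back if an exception is raised inside it.
with con:
    con.execute("insert into samples values ('Firmicutes', 1)")

count = con.execute('select count(*) from samples').fetchone()[0]
con.close()
```

Note that the context manager manages the transaction only; it does not close the connection.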
Now let's add some data to the empty table by passing an INSERT query to the database.
cur.execute('insert into samples values (\'Bacteroidetes\',14,1,102,33)')
<sqlite3.Cursor at 0x1045dd810>
con.commit()
Of course, it would be painstaking to construct a new query for each new observation that we intend to store in SQLite. Fortunately, data can be added as a set; for example, here are some microbiome samples from high-throughput 16S rRNA sequencing.
samples = [('Firmicutes',1,0,136,4182),
('Firmicutes',2,1,1174,703),
('Firmicutes',3,0,408,3946),
('Firmicutes',4,1,831,8605),
('Firmicutes',5,0,693,50),
('Firmicutes',6,1,718,717),
('Proteobacteria',8,1,2651,767),
('Proteobacteria',9,0,1195,76),
('Proteobacteria',10,1,6857,795),
('Proteobacteria',11,0,483,666),
('Proteobacteria',12,1,2950,3994),
('Actinobacteria',11,0,42,75),
('Actinobacteria',12,1,109,59),
('Actinobacteria',13,0,51,183),
('Actinobacteria',14,1,310,204),
('Bacteroidetes',1,0,67,0),
('Bacteroidetes',2,1,0,0),
('Bacteroidetes',3,0,85,5),
('Bacteroidetes',4,1,143,7)]
The executemany
function will allow us to add them as a group, iterating over the list of tuples.
cur.executemany('insert into samples values (?,?,?,?,?)', samples)
<sqlite3.Cursor at 0x1045dd810>
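The ? placeholders used with executemany are worth adopting for single inserts too: letting the driver substitute the values keeps awkward or malicious input (quotes, semicolons) from being interpreted as SQL. A sketch with a hypothetical hostile value:

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.execute('create table samples (Taxon text, Patient integer)')

taxon = "Bacteroidetes'; drop table samples; --"  # hypothetical hostile input
# The '?' placeholders let the driver escape the values for us; building
# the SQL via string formatting instead would invite SQL injection.
con.execute('insert into samples values (?, ?)', (taxon, 99))

stored = con.execute('select Taxon from samples').fetchone()[0]
con.close()
```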
Let's have a look at what is in the table now. For this we will execute a SELECT command. This returns the cursor, which is itself an iterable over the result rows.
results = cur.execute('select * from samples')
results
<sqlite3.Cursor at 0x1045dd810>
results.fetchall()
[(u'Bacteroidetes', 14, 1, 102, 33), (u'Firmicutes', 1, 0, 136, 4182), (u'Firmicutes', 2, 1, 1174, 703), (u'Firmicutes', 3, 0, 408, 3946), (u'Firmicutes', 4, 1, 831, 8605), (u'Firmicutes', 5, 0, 693, 50), (u'Firmicutes', 6, 1, 718, 717), (u'Proteobacteria', 8, 1, 2651, 767), (u'Proteobacteria', 9, 0, 1195, 76), (u'Proteobacteria', 10, 1, 6857, 795), (u'Proteobacteria', 11, 0, 483, 666), (u'Proteobacteria', 12, 1, 2950, 3994), (u'Actinobacteria', 11, 0, 42, 75), (u'Actinobacteria', 12, 1, 109, 59), (u'Actinobacteria', 13, 0, 51, 183), (u'Actinobacteria', 14, 1, 310, 204), (u'Bacteroidetes', 1, 0, 67, 0), (u'Bacteroidetes', 2, 1, 0, 0), (u'Bacteroidetes', 3, 0, 85, 5), (u'Bacteroidetes', 4, 1, 143, 7)]
Note that fetchall has already exhausted the cursor, so iterating over results now would yield nothing; to loop over the rows directly, re-run the query first:
for row in cur.execute('select * from samples'):
    print(row)
Finally, we need to close our cursor and our connection, since the database can only keep a limited number of these open at one time. Note that any changes that have not been committed (such as our executemany insertions above, which were never followed by a commit) are rolled back when the connection closes. Since establishing a connection takes time, though, we shouldn't open a connection, do one operation, then close the connection, only to reopen it a few microseconds later to do another operation. Instead, it's normal to create one connection that stays open for the lifetime of the program.
cur.close()
con.close()
Catherine Devlin has created an SQL magic extension for IPython that allows for more direct querying of SQLite and other databases.
Go ahead and install the extension using pip:
!pip install ipython-sql
Downloading/unpacking ipython-sql
Downloading/unpacking prettytable (from ipython-sql)
Downloading/unpacking sqlparse (from ipython-sql)
Installing collected packages: ipython-sql, prettytable, sqlparse
Successfully installed ipython-sql prettytable sqlparse
Cleaning up...
After loading the extension in the current session, it is easy to connect to databases and execute queries, without having to use the sqlite3
module at all.
%load_ext sql
%sql sqlite:///microbiome.db
u'Connected: None@microbiome.db'
%%sql
INSERT INTO samples VALUES ('Actinobacteria',7,0,260,58);
1 rows affected.
[]
writer_results = %sql select * from samples where NEC=1;
Done.
writer_results
Taxon | Patient | NEC | Tissue | Stool |
---|---|---|---|---|
Bacteroidetes | 14 | 1 | 102 | 33 |
writer_results.DataFrame()
Taxon | Patient | NEC | Tissue | Stool | |
---|---|---|---|---|---|
0 | Bacteroidetes | 14 | 1 | 102 | 33 |
This functionality is better for interactive computing than the sqlite3
module, so we will be using it for most of the remainder of the section.
In the late 1920s and early 1930s, William Dyer, Frank Pabodie, and Valentina Roerich led expeditions to the Pole of Inaccessibility in the South Pacific, and then onward to Antarctica. Two years ago, the records of their expeditions were found in a storage locker at Miskatonic University. We have scanned and OCR'd the data they contain, and we now want to store that information in a way that will make search and analysis easy.
To load this database, we pass its URI to the SQL magic:
%sql sqlite:///../data/survey.db
u'Connected: None@../data/survey.db'
Person: people who took readings.
Site: locations where readings were taken.
Visited: when readings were taken at specific sites.
Survey: the actual readings.
Notice that three entries—one in the Visited
table,
and two in the Survey
table—are shown in red
because they don't contain any actual data:
we'll return to these missing values.
For now,
let's write an SQL query that displays scientists' names.
We do this using the SQL command select
,
giving it the names of the columns we want and the table we want them from.
Our query and its output look like this:
%sql select family, personal from Person;
Done.
family | personal |
---|---|
Dyer | William |
Pabodie | Frank |
Lake | Anderson |
Roerich | Valentina |
Danforth | Frank |
The semi-colon at the end of the query tells the database manager that the query is complete and ready to run. We have written our commands and column names in lower case, and the table name in Title Case, but we don't have to: as the example below shows, SQL is case insensitive.
%%sql
SeLeCt FaMiLy, PeRsOnAl FrOm PeRsOn;
Done.
family | personal |
---|---|
Dyer | William |
Pabodie | Frank |
Lake | Anderson |
Roerich | Valentina |
Danforth | Frank |
Many people format queries as:
SELECT family, personal FROM person;
or as:
select Family, Personal from PERSON;
Whatever casing convention you choose, please be consistent: complex queries are hard enough to read without the extra cognitive load of random capitalization.
Going back to our query, it's important to understand that the rows and columns in a database table aren't actually stored in any particular order. They will always be displayed in some order, but we can control that in various ways. For example, we could swap the columns in the output by writing our query as:
%%sql
select personal, family from Person;
Done.
personal | family |
---|---|
William | Dyer |
Frank | Pabodie |
Anderson | Lake |
Valentina | Roerich |
Frank | Danforth |
or even repeat columns:
%%sql
select ident, ident, ident from Person;
Done.
ident | ident_1 | ident_2 |
---|---|---|
dyer | dyer | dyer |
pb | pb | pb |
lake | lake | lake |
roe | roe | roe |
danforth | danforth | danforth |
As a shortcut,
we can select all of the columns in a table using *
:
%%sql
select * from Person;
Done.
ident | personal | family |
---|---|---|
dyer | William | Dyer |
pb | Frank | Pabodie |
lake | Anderson | Lake |
roe | Valentina | Roerich |
danforth | Frank | Danforth |
Write a query that selects only site names from the Site
table.
%%sql
select name from site;
Done.
name |
---|
DR-1 |
DR-3 |
MSK-4 |
Data is often redundant,
so queries often return redundant information.
For example,
if we select the quantities that have been measured
from the survey
table,
we get this:
%%sql
select quant from Survey;
Done.
quant |
---|
rad |
sal |
rad |
sal |
rad |
sal |
temp |
rad |
sal |
temp |
rad |
temp |
sal |
rad |
sal |
temp |
sal |
rad |
sal |
sal |
rad |
We can eliminate the redundant output
to make the result more readable
by adding the distinct
keyword
to our query:
%%sql
select distinct quant from Survey;
Done.
quant |
---|
rad |
sal |
temp |
If we select more than one column—for example, both the survey site ID and the quantity measured—then the distinct pairs of values are returned:
%%sql
select distinct taken, quant from Survey;
Done.
taken | quant |
---|---|
619 | rad |
619 | sal |
622 | rad |
622 | sal |
734 | rad |
734 | sal |
734 | temp |
735 | rad |
735 | sal |
735 | temp |
751 | rad |
751 | temp |
751 | sal |
752 | rad |
752 | sal |
752 | temp |
837 | rad |
837 | sal |
844 | rad |
Notice in both cases that duplicates are removed even if they didn't appear to be adjacent in the database. Again, it's important to remember that rows aren't actually ordered: they're just displayed that way.
Write a query that selects distinct dates from the Visited table.
%%sql
select dated from visited; -- here is a comment
Done.
dated |
---|
1927-02-08 |
1927-02-10 |
1939-01-07 |
1930-01-12 |
1930-02-26 |
None |
1932-01-14 |
1932-03-22 |
As we mentioned earlier,
database records are not stored in any particular order.
This means that query results aren't necessarily sorted,
and even if they are,
we often want to sort them in a different way,
e.g., by the name of the project instead of by the name of the scientist.
We can do this in SQL by adding an order by
clause to our query:
%%sql
select * from Person order by ident;
Done.
ident | personal | family |
---|---|---|
danforth | Frank | Danforth |
dyer | William | Dyer |
lake | Anderson | Lake |
pb | Frank | Pabodie |
roe | Valentina | Roerich |
By default,
results are sorted in ascending order
(i.e.,
from least to greatest).
We can sort in the opposite order using desc
(for "descending"):
%%sql
select * from person order by ident desc;
Done.
ident | personal | family |
---|---|---|
roe | Valentina | Roerich |
pb | Frank | Pabodie |
lake | Anderson | Lake |
dyer | William | Dyer |
danforth | Frank | Danforth |
(And if we want to make it clear that we're sorting in ascending order,
we can use asc
instead of desc
.)
We can also sort on several fields at once.
For example,
this query sorts results first in ascending order by taken
,
and then in descending order by person
within each group of equal taken
values:
%%sql
select taken, person from Survey order by taken asc, person desc;
Done.
taken | person |
---|---|
619 | dyer |
619 | dyer |
622 | dyer |
622 | dyer |
734 | pb |
734 | pb |
734 | lake |
735 | pb |
735 | None |
735 | None |
751 | pb |
751 | pb |
751 | lake |
752 | roe |
752 | lake |
752 | lake |
752 | lake |
837 | roe |
837 | lake |
837 | lake |
844 | roe |
This is easier to understand if we also remove duplicates:
%%sql
select distinct taken, person from Survey order by taken asc, person desc;
Done.
taken | person |
---|---|
619 | dyer |
622 | dyer |
734 | pb |
734 | lake |
735 | pb |
735 | None |
751 | pb |
751 | lake |
752 | roe |
752 | lake |
837 | roe |
837 | lake |
844 | roe |
Write a query that returns the distinct dates in the Visited
table.
Write a query that displays the full names of the scientists in the Person
table, ordered by family name.
%%sql
select personal, family from Person order by family;
Done.
personal | family |
---|---|
Frank | Danforth |
William | Dyer |
Anderson | Lake |
Frank | Pabodie |
Valentina | Roerich |
One of the most powerful features of a database is
the ability to filter data,
i.e.,
to select only those records that match certain criteria.
For example,
suppose we want to see when a particular site was visited.
We can select these records from the Visited
table
by using a where
clause in our query:
%%sql
select * from Visited where site='DR-1';
Done.
ident | site | dated |
---|---|---|
619 | DR-1 | 1927-02-08 |
622 | DR-1 | 1927-02-10 |
844 | DR-1 | 1932-03-22 |
The database manager executes this query in two stages.
First,
it checks each row in the Visited
table
to see which ones satisfy the where
.
It then uses the column names following the select
keyword
to determine what columns to display.
This processing order means that
we can filter records using where
based on values in columns that aren't then displayed:
%%sql
select ident from Visited where site='DR-1';
Done.
ident |
---|
619 |
622 |
844 |
For example, we can ask for all information from the DR-1 site collected since 1930:
%%sql
select * from Visited where (site='DR-1') and (dated>='1930-00-00');
Done.
ident | site | dated |
---|---|---|
844 | DR-1 | 1932-03-22 |
(The parentheses around the individual tests aren't strictly required, but they help make the query easier to read.)
Most database managers have a special data type for dates. In fact, many have two: one for dates, such as "May 31, 1971", and one for durations, such as "31 days". SQLite doesn't: instead, it stores dates as either text (in the ISO-8601 standard format "YYYY-MM-DD HH:MM:SS.SSSS"), real numbers (the number of days since November 24, 4714 BCE), or integers (the number of seconds since midnight, January 1, 1970).
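The text representation works with the comparisons used in this lesson because ISO-8601 strings sort lexicographically in date order. A small sketch confirming this with two of the dates from the Visited table:

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.execute('create table Visited (ident integer, dated text)')
con.executemany('insert into Visited values (?, ?)',
                [(619, '1927-02-08'), (844, '1932-03-22')])

# ISO-8601 text dates sort correctly as plain strings, so ordinary
# string comparison doubles as date comparison.
rows = con.execute(
    "select ident from Visited where dated >= '1930-00-00'").fetchall()
idents = [r[0] for r in rows]
con.close()
```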
If we want to find out what measurements were taken by either Lake or Roerich,
we can combine the tests on their names using or
:
%%sql
select * from Survey where person='lake' or person='roe';
Done.
taken | person | quant | reading |
---|---|---|---|
734 | lake | sal | 0.05 |
751 | lake | sal | 0.1 |
752 | lake | rad | 2.19 |
752 | lake | sal | 0.09 |
752 | lake | temp | -16.0 |
752 | roe | sal | 41.6 |
837 | lake | rad | 1.46 |
837 | lake | sal | 0.21 |
837 | roe | sal | 22.5 |
844 | roe | rad | 11.25 |
Alternatively,
we can use in
to see if a value is in a specific set:
%%sql
select * from Survey where person in ('lake', 'roe');
Done.
taken | person | quant | reading |
---|---|---|---|
734 | lake | sal | 0.05 |
751 | lake | sal | 0.1 |
752 | lake | rad | 2.19 |
752 | lake | sal | 0.09 |
752 | lake | temp | -16.0 |
752 | roe | sal | 41.6 |
837 | lake | rad | 1.46 |
837 | lake | sal | 0.21 |
837 | roe | sal | 22.5 |
844 | roe | rad | 11.25 |
We can combine and
with or
,
but we need to be careful about which operator is executed first.
If we don't use parentheses,
we get this:
%%sql
select * from Survey where quant='sal' and person='lake' or person='roe';
Done.
taken | person | quant | reading |
---|---|---|---|
734 | lake | sal | 0.05 |
751 | lake | sal | 0.1 |
752 | lake | sal | 0.09 |
752 | roe | sal | 41.6 |
837 | lake | sal | 0.21 |
837 | roe | sal | 22.5 |
844 | roe | rad | 11.25 |
which is salinity measurements by Lake, and any measurement by Roerich. We probably want this instead:
%%sql
select * from Survey where quant='sal' and (person='lake' or person='roe');
Done.
taken | person | quant | reading |
---|---|---|---|
734 | lake | sal | 0.05 |
751 | lake | sal | 0.1 |
752 | lake | sal | 0.09 |
752 | roe | sal | 41.6 |
837 | lake | sal | 0.21 |
837 | roe | sal | 22.5 |
Finally,
we can use distinct
with where
to give a second level of filtering:
%%sql
select distinct person, quant from Survey where person='lake' or person='roe';
Done.
person | quant |
---|---|
lake | sal |
lake | rad |
lake | temp |
roe | sal |
roe | rad |
But remember:
distinct
is applied to the values displayed in the chosen columns,
not to the entire rows as they are being processed.
What we have just done is how most people "grow" their SQL queries. We started with something simple that did part of what we wanted, then added more clauses one by one, testing their effects as we went. This is a good strategy—in fact, for complex queries it's often the only strategy—but it depends on quick turnaround, and on us recognizing the right answer when we get it.
The best way to achieve quick turnaround is often to put a subset of data in a temporary database and run our queries against that, or to fill a small database with synthesized records. For example, instead of trying our queries against an actual database of 20 million records, we could run it against a sample of ten thousand, or write a small program to generate ten thousand random (but plausible) records and use that.
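A sketch of that strategy: synthesize a small batch of plausible random records in an in-memory database and develop the query against it (the readings table and its columns here are made up for illustration):

```python
import random
import sqlite3

random.seed(1)  # reproducible synthetic data

con = sqlite3.connect(':memory:')
con.execute('create table readings (site text, value real)')

# Fill the table with a small batch of random but plausible records.
rows = [(random.choice(['DR-1', 'DR-3', 'MSK-4']), random.uniform(0, 1))
        for _ in range(1000)]
con.executemany('insert into readings values (?, ?)', rows)

# Develop and test the query against the small synthetic set first.
n = con.execute(
    "select count(*) from readings where value > 0.5").fetchone()[0]
con.close()
```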
We can also filter results using pattern-matching. The SQL test *column-name* like *pattern*
is true if the value in the named column matches the pattern given; the character '%' can be used any number of times in the pattern to mean "match zero or more characters".
Expression | Value |
---|---|
'a' like 'a' | True |
'a' like '%a' | True |
'b' like '%a' | False |
'alpha' like 'a%' | True |
'alpha' like 'a%p%' | True |
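The rows of this table can be checked directly by evaluating like expressions through SQLite; a sketch:

```python
import sqlite3

con = sqlite3.connect(':memory:')

def like(value, pattern):
    # Evaluate a single LIKE test; SQLite returns 1 for true, 0 for false.
    return bool(con.execute('select ? like ?',
                            (value, pattern)).fetchone()[0])

results = [like('a', 'a'), like('a', '%a'),
           like('b', '%a'), like('alpha', 'a%'), like('alpha', 'a%p%')]
con.close()
```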
The expression *column-name* not like *pattern*
inverts the test.
For example, we can write a query that finds all the records in Visited
that aren't from sites labelled 'DR-something'.
%%sql
select * from Visited where site not like 'DR-%';
Done.
ident | site | dated |
---|---|---|
837 | MSK-4 | 1932-01-14 |
Normalized salinity readings are supposed to be between 0.0 and 1.0. Write a query that selects all records from Survey
with salinity values outside this range.
%%sql
select * from Survey where quant='sal' and (reading>1 or reading<0);
Done.
taken | person | quant | reading |
---|---|---|---|
752 | roe | sal | 41.6 |
837 | roe | sal | 22.5 |
After carefully re-reading the expedition logs, we realize that the radiation measurements they report may need to be corrected upward by 5%. Rather than modifying the stored data, we can do this calculation on the fly as part of our query:
%%sql
select 1.05 * reading from Survey where quant='rad';
Done.
1.05 * reading |
---|
10.311 |
8.19 |
8.8305 |
7.581 |
4.5675 |
2.2995 |
1.533 |
11.8125 |
When we run the query,
the expression 1.05 * reading
is evaluated for each row.
Expressions can use any of the fields,
all of the usual arithmetic operators,
and a variety of common functions.
(Exactly which ones depends on which database manager is being used.)
For example,
we can convert temperature readings from Fahrenheit to Celsius
and round to two decimal places:
%%sql
select taken, round(5*(reading-32)/9, 2) from Survey where quant='temp';
Done.
taken | round(5*(reading-32)/9, 2) |
---|---|
734 | -29.72 |
735 | -32.22 |
751 | -28.06 |
752 | -26.67 |
We can also combine values from different fields,
for example by using the string concatenation operator ||
:
%%sql
select personal || ' ' || family from Person;
Done.
personal || ' ' || family |
---|
William Dyer |
Frank Pabodie |
Anderson Lake |
Valentina Roerich |
Frank Danforth |
The union
operator combines the results of two queries, removing any duplicate rows (the union all variant keeps them):
%%sql
select * from Person where ident='dyer' union select * from Person where ident='roe';
Done.
ident | personal | family |
---|---|---|
dyer | William | Dyer |
roe | Valentina | Roerich |
As a more complicated example, let's assume that Roerich mistakenly recorded all her measurements as percentages, rather than decimal values. We can use union
to create a consolidated list of salinity measurements in which Roerich's (and only Roerich's) have been corrected.
%%sql
select taken, reading/100. from Survey where person='roe' and quant='sal'
union select taken, reading from Survey where not person='roe' and quant='sal';
Done.
taken | reading/100. |
---|---|
619 | 0.13 |
622 | 0.09 |
734 | 0.05 |
751 | 0.1 |
752 | 0.09 |
752 | 0.416 |
837 | 0.21 |
837 | 0.225 |
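An alternative worth knowing: the same correction can be written without union, using a case expression that rescales only Roerich's rows. A sketch against a two-row in-memory copy of the table:

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.execute('create table Survey (taken integer, person text, '
            'quant text, reading real)')
con.executemany('insert into Survey values (?,?,?,?)',
                [(619, 'dyer', 'sal', 0.13), (752, 'roe', 'sal', 41.6)])

# The case expression applies the /100 correction only to Roerich's
# rows, avoiding the second select entirely.
rows = con.execute(
    "select taken, case when person='roe' then reading/100. "
    "else reading end from Survey where quant='sal' order by taken").fetchall()
con.close()
```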
The site identifiers in the Visited
table have two parts separated by a '-':
%%sql
select distinct site from Visited;
Done.
site |
---|
DR-1 |
DR-3 |
MSK-4 |
Some major site identifiers are two letters long and some are three.
The "in string" function instr(X, Y)
returns the 1-based index of the first occurrence of string Y in string X,
or 0 if Y does not exist in X.
The substring function substr(X, I, Z)
returns the substring of X starting at index I that is Z characters long.
Use these two functions to produce a list of unique major site identifiers.
(For this data,
the list should contain only "DR" and "MSK").
%%sql
select distinct substr(site, 1, instr(site, '-') - 1) from visited;
Done.
substr(site, 1, instr(site, '-') - 1) |
---|
DR |
MSK |
Real-world data is rarely complete.
Databases represent missing values using a special value called null
.
null
is not zero, False
, or the empty string;
it is a one-of-a-kind value that means "nothing here".
Dealing with null
requires a few special tricks
and some careful thinking.
To start,
let's have a look at the Visited
table.
There are eight records,
but #752 doesn't have a date—or rather,
its date is null:
%%sql
select * from Visited;
Done.
ident | site | dated |
---|---|---|
619 | DR-1 | 1927-02-08 |
622 | DR-1 | 1927-02-10 |
734 | DR-3 | 1930-01-07 |
735 | DR-3 | 1930-01-12 |
751 | DR-3 | 1930-02-26 |
752 | DR-3 | None |
837 | MSK-4 | 1932-01-14 |
844 | DR-1 | 1932-03-22 |
Null doesn't behave like other values. If we select the records that come before 1930:
%%sql
select * from Visited where dated<'1930-00-00';
Done.
ident | site | dated |
---|---|---|
619 | DR-1 | 1927-02-08 |
622 | DR-1 | 1927-02-10 |
we get two results, and if we select the ones that come during or after 1930:
%%sql
select * from Visited where dated>='1930-00-00';
Done.
ident | site | dated |
---|---|---|
734 | DR-3 | 1930-01-07 |
735 | DR-3 | 1930-01-12 |
751 | DR-3 | 1930-02-26 |
837 | MSK-4 | 1932-01-14 |
844 | DR-1 | 1932-03-22 |
we get five,
but record #752 isn't in either set of results.
The reason is that
null<'1930-00-00'
is neither true nor false:
null means, "We don't know,"
and if we don't know the value on the left side of a comparison,
we don't know whether the comparison is true or false.
Since databases represent "don't know" as null,
the value of null<'1930-00-00'
is actually null
.
null>='1930-00-00'
is also null
because we can't answer that question either.
And since the only records kept by a where
are those for which the test is true,
record #752 isn't included in either set of results.
Comparisons aren't the only operations that behave this way with nulls.
1+null
is null
,
5*null
is null
,
log(null)
is null
,
and so on.
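Both behaviours are easy to confirm from Python; the standard coalesce function, which returns its first non-null argument, is the usual way to substitute a default. A sketch:

```python
import sqlite3

con = sqlite3.connect(':memory:')

# Arithmetic with null yields null (which Python sees as None)...
summed = con.execute('select 1 + null').fetchone()[0]

# ...while coalesce picks the first non-null argument, giving a default.
fixed = con.execute('select coalesce(null, 0) + 1').fetchone()[0]
con.close()
```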
In particular,
comparing things to null with = and != produces null:
%%sql
select * from Visited where dated=NULL;
Done.
ident | site | dated |
---|---|---|
%%sql
select * from Visited where dated!=NULL;
Done.
ident | site | dated |
---|---|---|
To check whether a value is null
or not,
we must use a special test is null
:
%%sql
select * from Visited where dated is NULL;
Done.
ident | site | dated |
---|---|---|
752 | DR-3 | None |
or its inverse is not null
:
%%sql
select * from Visited where dated is not NULL;
Done.
ident | site | dated |
---|---|---|
619 | DR-1 | 1927-02-08 |
622 | DR-1 | 1927-02-10 |
734 | DR-3 | 1939-01-07 |
735 | DR-3 | 1930-01-12 |
751 | DR-3 | 1930-02-26 |
837 | MSK-4 | 1932-01-14 |
844 | DR-1 | 1932-03-22 |
Null values cause headaches wherever they appear. For example, suppose we want to find all the salinity measurements that weren't taken by Lake. It's natural to write the query like this:
%%sql
select * from Survey where quant='sal' and person!='lake';
Done.
taken | person | quant | reading |
---|---|---|---|
619 | dyer | sal | 0.13 |
622 | dyer | sal | 0.09 |
752 | roe | sal | 41.6 |
837 | roe | sal | 22.5 |
but this query omits the records where we don't know who took the measurement. Once again, the reason is that when person is null, the != comparison produces null, so the record isn't kept in our results. If we want to keep these records, we need to add an explicit check:
%%sql
select * from Survey where quant='sal' and (person!='lake' or person is null);
Done.
taken | person | quant | reading |
---|---|---|---|
619 | dyer | sal | 0.13 |
622 | dyer | sal | 0.09 |
735 | None | sal | 0.06 |
752 | roe | sal | 41.6 |
837 | roe | sal | 22.5 |
We still have to decide whether this is the right thing to do or not. If we want to be absolutely sure that we aren't including any measurements by Lake in our results, we need to exclude all the records for which we don't know who did the work.
Note also that the following does not work as you might expect:
%%sql
select * from Visited where dated in ('1927-02-08', null);
Done.
ident | site | dated |
---|---|---|
619 | DR-1 | 1927-02-08 |
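The reason is that in behaves like a chain of = comparisons, and dated=null evaluates to null rather than true, so the null list member never matches anything. A small sqlite3 sketch (in-memory database, two made-up rows) shows the effect:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table Visited(ident integer, site text, dated text)")
cur.executemany("insert into Visited values (?, ?, ?)",
                [(619, "DR-1", "1927-02-08"), (752, "DR-3", None)])

# `dated in ('1927-02-08', null)` behaves like
# `dated='1927-02-08' or dated=null`; the second test is null,
# never true, so the record with a null date is not returned.
rows = cur.execute(
    "select ident from Visited where dated in ('1927-02-08', null)"
).fetchall()
print(rows)  # [(619,)]
conn.close()
```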
Write a query that sorts the records in Visited by date, omitting entries for which the date is not known (i.e., is null).
%%sql
select * from visited where dated is not null order by dated;
Done.
ident | site | dated |
---|---|---|
619 | DR-1 | 1927-02-08 |
622 | DR-1 | 1927-02-10 |
735 | DR-3 | 1930-01-12 |
751 | DR-3 | 1930-02-26 |
837 | MSK-4 | 1932-01-14 |
844 | DR-1 | 1932-03-22 |
734 | DR-3 | 1939-01-07 |
We now want to calculate ranges and averages for our data. We know how to select all of the dates from the Visited table:
%%sql
select dated from Visited;
Done.
dated |
---|
1927-02-08 |
1927-02-10 |
1939-01-07 |
1930-01-12 |
1930-02-26 |
None |
1932-01-14 |
1932-03-22 |
but to combine them, we must use an aggregation function such as min or max. Each of these functions takes a set of records as input, and produces a single record as output:
%%sql
select min(dated) from Visited;
Done.
min(dated) |
---|
1927-02-08 |
%%sql
select max(dated) from Visited;
Done.
max(dated) |
---|
1939-01-07 |
min and max are just two of the aggregation functions built into SQL. Three others are avg, count, and sum:
%%sql
select avg(reading) from Survey where quant='sal';
Done.
avg(reading) |
---|
7.20333333333 |
%%sql
select count(reading) from Survey where quant='sal';
Done.
count(reading) |
---|
9 |
%%sql
select sum(reading) from Survey where quant='sal';
Done.
sum(reading) |
---|
64.83 |
We used count(reading) here, but we could just as easily have counted quant or any other field in the table, or even used count(*), since the function doesn't care about the values themselves, just how many values there are.
SQL lets us do several aggregations at once. We can, for example, find the range of sensible salinity measurements:
%%sql
select min(reading), max(reading) from Survey where quant='sal' and reading<=1.0;
Done.
min(reading) | max(reading) |
---|---|
0.05 | 0.21 |
We can also combine aggregated results with raw results, although the output might surprise you:
%%sql
select person, count(*) from Survey where quant='sal' and reading<=1.0;
Done.
person | count(*) |
---|---|
lake | 7 |
Why does Lake's name appear rather than Roerich's or Dyer's? The answer is that when it has to aggregate a field, but isn't told how to, the database manager chooses an actual value from the input set. It might use the first one processed, the last one, or something else entirely.
Another important fact is that when there are no values to aggregate, aggregation's result is "don't know" rather than zero or some other arbitrary value:
%%sql
select person, max(reading), sum(reading) from Survey where quant='missing';
Done.
person | max(reading) | sum(reading) |
---|---|---|
None | None | None |
One final important feature of aggregation functions is that they are inconsistent with the rest of SQL in a very useful way. If we add two values and one of them is null, the result is null. By extension, if we use sum to add all the values in a set, and any of those values are null, the result should also be null. It's much more useful, though, for aggregation functions to ignore null values and only combine those that are non-null. This behavior lets us write our queries as:
%%sql
select min(dated) from Visited;
Done.
min(dated) |
---|
1927-02-08 |
instead of always having to filter explicitly:
%%sql
select min(dated) from Visited where dated is not null;
Done.
min(dated) |
---|
1927-02-08 |
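We can verify this forgiving behavior with sqlite3; the two-row table below is a made-up fragment of Visited:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table Visited(ident integer, site text, dated text)")
cur.executemany("insert into Visited values (?, ?, ?)",
                [(619, "DR-1", "1927-02-08"), (752, "DR-3", None)])

# min skips the null date instead of returning null, and
# count(dated) skips it too, while count(*) counts every row.
earliest = cur.execute("select min(dated) from Visited").fetchone()[0]
total, with_date = cur.execute(
    "select count(*), count(dated) from Visited").fetchone()
print(earliest, total, with_date)  # 1927-02-08 2 1
conn.close()
```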
Aggregating all records at once doesn't always make sense. For example, suppose we suspect that there is a systematic bias in the data, and that some scientists' radiation readings are higher than others.
We know that this doesn't work:
%%sql
select person, count(reading), round(avg(reading), 2)
from Survey
where quant='rad';
Done.
person | count(reading) | round(avg(reading), 2) |
---|---|---|
roe | 8 | 6.56 |
because the database manager selects a single arbitrary scientist's name rather than aggregating separately for each scientist. Since there are only five scientists, she could write five queries of the form:
%%sql
select person, count(reading), round(avg(reading), 2)
from Survey
where quant='rad'
and person='dyer';
Done.
person | count(reading) | round(avg(reading), 2) |
---|---|---|
dyer | 2 | 8.81 |
but this would be tedious, and if she ever had a data set with fifty or five hundred scientists, the chances of getting all of those queries right would be small.
What we need to do is tell the database manager to aggregate the readings for each scientist separately using a group by clause (recall the similar groupby operation in Pandas):
%%sql
select person, count(reading), round(avg(reading), 2)
from Survey
where quant='rad'
group by person;
Done.
person | count(reading) | round(avg(reading), 2) |
---|---|---|
dyer | 2 | 8.81 |
lake | 2 | 1.82 |
pb | 3 | 6.66 |
roe | 1 | 11.25 |
group by does exactly what its name implies: it groups all the records with the same value for the specified field together so that aggregation can process each batch separately. Since all the records in each batch have the same value for person, it no longer matters that the database manager is picking an arbitrary one to display alongside the aggregated reading values.
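The same grouping can be sketched with sqlite3; the readings below are illustrative values, not the actual survey data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table Survey(taken integer, person text, "
            "quant text, reading real)")
# Illustrative readings, not the actual survey data.
cur.executemany("insert into Survey values (?, ?, ?, ?)",
                [(619, "dyer", "rad", 10.0),
                 (622, "dyer", "rad", 8.0),
                 (734, "pb", "rad", 6.0)])

# group by collects the rows that share a person value into batches,
# then runs count and avg over each batch separately.
rows = cur.execute(
    "select person, count(reading), round(avg(reading), 2) "
    "from Survey where quant='rad' group by person order by person"
).fetchall()
print(rows)  # [('dyer', 2, 9.0), ('pb', 1, 6.0)]
conn.close()
```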
Just as we can sort by multiple criteria at once, we can also group by multiple criteria. To get the average reading by scientist and quantity measured, for example, we just add another field to the group by clause:
%%sql
select person, quant, count(reading), round(avg(reading), 2)
from Survey
group by person, quant;
Done.
person | quant | count(reading) | round(avg(reading), 2) |
---|---|---|---|
None | sal | 1 | 0.06 |
None | temp | 1 | -26.0 |
dyer | rad | 2 | 8.81 |
dyer | sal | 2 | 0.11 |
lake | rad | 2 | 1.82 |
lake | sal | 4 | 0.11 |
lake | temp | 1 | -16.0 |
pb | rad | 3 | 6.66 |
pb | temp | 2 | -20.0 |
roe | rad | 1 | 11.25 |
roe | sal | 2 | 32.05 |
Note that we have added quant to the list of fields displayed, since the results wouldn't make much sense otherwise.
Let's go one step further and remove all the entries where we don't know who took the measurement:
%%sql
select person, quant, count(reading), round(avg(reading), 2)
from Survey
where person is not null
group by person, quant
order by person, quant;
Done.
person | quant | count(reading) | round(avg(reading), 2) |
---|---|---|---|
dyer | rad | 2 | 8.81 |
dyer | sal | 2 | 0.11 |
lake | rad | 2 | 1.82 |
lake | sal | 4 | 0.11 |
lake | temp | 1 | -16.0 |
pb | rad | 3 | 6.66 |
pb | temp | 2 | -20.0 |
roe | rad | 1 | 11.25 |
roe | sal | 2 | 32.05 |
The order of operations for this query is as follows: the database manager
1. selected records from the Survey table where the person field was not null;
2. grouped those records into subsets so that the person and quant values in each subset were the same;
3. ordered those subsets first by person, and then within each sub-group by quant; and
4. counted the number of records in each subset, calculated the average reading in each, and chose a person and quant value from each (it doesn't matter which ones, since they're all equal).
Suppose now we wanted to center some of the readings, by subtracting the mean. One might expect the following would do the trick:
%%sql
select reading - avg(reading) from Survey where quant='rad';
Done.
reading - avg(reading) |
---|
4.6875 |
However, mixing aggregated and non-aggregated values in a single query like this doesn't do what we want: the aggregation collapses all the rows into one, so we get a single arbitrary difference instead of one per reading. The (somewhat awkward) solution is to calculate the average in a separate query, and use that value in a query of the non-aggregated values.
%%sql
select reading - (select avg(reading) from Survey where quant='rad')
from Survey where quant='rad';
Done.
reading - (select avg(reading) from Survey where quant='rad') |
---|
3.2575 |
1.2375 |
1.8475 |
0.6575 |
-2.2125 |
-4.3725 |
-5.1025 |
4.6875 |
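The same subquery pattern can be sketched with sqlite3; the readings below are simplified, illustrative values:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table Survey(taken integer, person text, "
            "quant text, reading real)")
cur.executemany("insert into Survey values (?, ?, ?, ?)",
                [(619, "dyer", "rad", 9.0), (622, "dyer", "rad", 7.0)])

# The scalar subquery computes the mean (8.0) once; the outer query
# then subtracts it from each non-aggregated reading.
centered = cur.execute(
    "select reading - (select avg(reading) from Survey where quant='rad') "
    "from Survey where quant='rad'"
).fetchall()
print(centered)  # [(1.0,), (-1.0,)]
conn.close()
```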
The function group_concat(field, separator)
concatenates all the values in a field using the specified separator character (or ',' if the separator isn't specified). Use this to produce a one-line list of scientists' names, such as:
William Dyer, Frank Pabodie, Anderson Lake, Valentina Roerich, Frank Danforth
Can you find a way to order the list by surname?
%%sql
select group_concat(personal || ' ' || family, ', ') from (select * from person order by family);
Done.
group_concat(personal || ' ' || family, ', ') |
---|
Frank Danforth, William Dyer, Anderson Lake, Frank Pabodie, Valentina Roerich |
Suppose we wanted to submit the data to a web site that aggregates historical meteorological data, which needs the data formatted as: latitude, longitude, date, quantity, and reading. However, the latitudes and longitudes are in the Site table, while the dates of measurements are in the Visited table, and the readings themselves are in the Survey table. These fields need to be sensibly combined. The SQL command to do this is join.
%%sql
select * from Site join Visited;
Done.
name | lat | long | ident | site | dated |
---|---|---|---|---|---|
DR-1 | -49.85 | -128.57 | 619 | DR-1 | 1927-02-08 |
DR-1 | -49.85 | -128.57 | 622 | DR-1 | 1927-02-10 |
DR-1 | -49.85 | -128.57 | 734 | DR-3 | 1939-01-07 |
DR-1 | -49.85 | -128.57 | 735 | DR-3 | 1930-01-12 |
DR-1 | -49.85 | -128.57 | 751 | DR-3 | 1930-02-26 |
DR-1 | -49.85 | -128.57 | 752 | DR-3 | None |
DR-1 | -49.85 | -128.57 | 837 | MSK-4 | 1932-01-14 |
DR-1 | -49.85 | -128.57 | 844 | DR-1 | 1932-03-22 |
DR-3 | -47.15 | -126.72 | 619 | DR-1 | 1927-02-08 |
DR-3 | -47.15 | -126.72 | 622 | DR-1 | 1927-02-10 |
DR-3 | -47.15 | -126.72 | 734 | DR-3 | 1939-01-07 |
DR-3 | -47.15 | -126.72 | 735 | DR-3 | 1930-01-12 |
DR-3 | -47.15 | -126.72 | 751 | DR-3 | 1930-02-26 |
DR-3 | -47.15 | -126.72 | 752 | DR-3 | None |
DR-3 | -47.15 | -126.72 | 837 | MSK-4 | 1932-01-14 |
DR-3 | -47.15 | -126.72 | 844 | DR-1 | 1932-03-22 |
MSK-4 | -48.87 | -123.4 | 619 | DR-1 | 1927-02-08 |
MSK-4 | -48.87 | -123.4 | 622 | DR-1 | 1927-02-10 |
MSK-4 | -48.87 | -123.4 | 734 | DR-3 | 1939-01-07 |
MSK-4 | -48.87 | -123.4 | 735 | DR-3 | 1930-01-12 |
MSK-4 | -48.87 | -123.4 | 751 | DR-3 | 1930-02-26 |
MSK-4 | -48.87 | -123.4 | 752 | DR-3 | None |
MSK-4 | -48.87 | -123.4 | 837 | MSK-4 | 1932-01-14 |
MSK-4 | -48.87 | -123.4 | 844 | DR-1 | 1932-03-22 |
join creates the cross product of two tables, that is, it joins each record of one with each record of the other to give all possible combinations. Since there are three records in Site and eight in Visited, the join's output has 24 records. And since each table has three fields, the output has six fields.
What the join hasn't done is figure out if the records being joined have anything to do with each other. It has no way of knowing whether they do or not until we tell it how. To do that, we add a clause specifying that we're only interested in combinations that have the same site name:
%%sql
select * from Site join Visited on Site.name=Visited.site;
Done.
name | lat | long | ident | site | dated |
---|---|---|---|---|---|
DR-1 | -49.85 | -128.57 | 619 | DR-1 | 1927-02-08 |
DR-1 | -49.85 | -128.57 | 622 | DR-1 | 1927-02-10 |
DR-1 | -49.85 | -128.57 | 844 | DR-1 | 1932-03-22 |
DR-3 | -47.15 | -126.72 | 734 | DR-3 | 1939-01-07 |
DR-3 | -47.15 | -126.72 | 735 | DR-3 | 1930-01-12 |
DR-3 | -47.15 | -126.72 | 751 | DR-3 | 1930-02-26 |
DR-3 | -47.15 | -126.72 | 752 | DR-3 | None |
MSK-4 | -48.87 | -123.4 | 837 | MSK-4 | 1932-01-14 |
on does the same job as where: it only keeps records that pass some test. (The difference between the two is that on filters records as they're being created, while where waits until the join is done and then does the filtering.) Once we add this to our query, the database manager throws away records that combined information about two different sites, leaving us with just the ones we want.
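A minimal sqlite3 sketch, with a two-site subset of the data, shows the difference between the raw cross product and the filtered join:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table Site(name text, lat real, long real)")
cur.execute("create table Visited(ident integer, site text, dated text)")
cur.executemany("insert into Site values (?, ?, ?)",
                [("DR-1", -49.85, -128.57), ("DR-3", -47.15, -126.72)])
cur.executemany("insert into Visited values (?, ?, ?)",
                [(619, "DR-1", "1927-02-08"), (734, "DR-3", "1939-01-07")])

# Without on, the join yields every combination: 2 x 2 = 4 rows.
cross = cur.execute("select count(*) from Site join Visited").fetchone()[0]

# With on, only combinations with matching site names survive.
matched = cur.execute(
    "select Site.name, Visited.dated from Site join Visited "
    "on Site.name = Visited.site order by Site.name"
).fetchall()
print(cross, matched)  # 4 [('DR-1', '1927-02-08'), ('DR-3', '1939-01-07')]
conn.close()
```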
Notice that we used <table>.<field> to specify field names in the output of the join. We do this because tables can have fields with the same name, and we need to be specific about which ones we're talking about. For example, if we joined the Person and Visited tables, the result would inherit a field called ident from each of the original tables.
We can now use the same dotted notation to select the three columns we actually want out of our join:
%%sql
select Site.lat, Site.long, Visited.dated
from Site join Visited
on Site.name=Visited.site;
Done.
lat | long | dated |
---|---|---|
-49.85 | -128.57 | 1927-02-08 |
-49.85 | -128.57 | 1927-02-10 |
-49.85 | -128.57 | 1932-03-22 |
-47.15 | -126.72 | None |
-47.15 | -126.72 | 1930-01-12 |
-47.15 | -126.72 | 1930-02-26 |
-47.15 | -126.72 | 1939-01-07 |
-48.87 | -123.4 | 1932-01-14 |
If joining two tables is good, joining many tables must be better. In fact, we can join any number of tables simply by adding more join clauses to our query, and more on tests to filter out combinations of records that don't make sense:
%%sql
select Site.lat, Site.long, Visited.dated, Survey.quant, Survey.reading
from Site join Visited join Survey
on Site.name=Visited.site
and Visited.ident=Survey.taken
and Visited.dated is not null;
Done.
lat | long | dated | quant | reading |
---|---|---|---|---|
-49.85 | -128.57 | 1927-02-08 | rad | 9.82 |
-49.85 | -128.57 | 1927-02-08 | sal | 0.13 |
-49.85 | -128.57 | 1927-02-10 | rad | 7.8 |
-49.85 | -128.57 | 1927-02-10 | sal | 0.09 |
-47.15 | -126.72 | 1939-01-07 | rad | 8.41 |
-47.15 | -126.72 | 1939-01-07 | sal | 0.05 |
-47.15 | -126.72 | 1939-01-07 | temp | -21.5 |
-47.15 | -126.72 | 1930-01-12 | rad | 7.22 |
-47.15 | -126.72 | 1930-01-12 | sal | 0.06 |
-47.15 | -126.72 | 1930-01-12 | temp | -26.0 |
-47.15 | -126.72 | 1930-02-26 | rad | 4.35 |
-47.15 | -126.72 | 1930-02-26 | sal | 0.1 |
-47.15 | -126.72 | 1930-02-26 | temp | -18.5 |
-48.87 | -123.4 | 1932-01-14 | rad | 1.46 |
-48.87 | -123.4 | 1932-01-14 | sal | 0.21 |
-48.87 | -123.4 | 1932-01-14 | sal | 22.5 |
-49.85 | -128.57 | 1932-03-22 | rad | 11.25 |
We can tell which records from Site, Visited, and Survey correspond with each other because those tables contain primary keys and foreign keys. A primary key is a value, or combination of values, that uniquely identifies each record in a table. A foreign key is a value (or combination of values) from one table that identifies a unique record in another table. In other words, a foreign key is the primary key of one table that appears in some other table. In our database, Person.ident is the primary key in the Person table, while Survey.person is a foreign key relating the Survey table's entries to entries in Person.
Most database designers believe that every table should have a well-defined primary key. They also believe that this key should be separate from the data itself, so that if we ever need to change the data, we only need to make one change in one place. One easy way to do this is to create an arbitrary, unique ID for each record as we add it to the database. This is actually very common: those IDs have names like "student numbers" and "patient numbers", and they almost always turn out to have originally been a unique record identifier in some database system or other. As the query below demonstrates, SQLite automatically numbers records as they're added to tables, and we can use those record numbers in queries:
%%sql
select rowid, * from Person;
Done.
rowid | ident | personal | family |
---|---|---|---|
1 | dyer | William | Dyer |
2 | pb | Frank | Pabodie |
3 | lake | Anderson | Lake |
4 | roe | Valentina | Roerich |
5 | danforth | Frank | Danforth |
Now that we have seen how joins work, we can see why the relational model is so useful and how best to use it.
The first rule is that every value should be atomic, i.e., not contain parts that we might want to work with separately. We store personal and family names in separate columns instead of putting the entire name in one column so that we don't have to use substring operations to get the name's components. More importantly, we store the two parts of the name separately because splitting on spaces is unreliable: just think of a name like "Eloise St. Cyr" or "Jan Mikkel Steubart".
The second rule is that every record should have a unique primary key. This can be a serial number that has no intrinsic meaning, one of the values in the record (like the ident field in the Person table), or even a combination of values: the triple (taken, person, quant) from the Survey table uniquely identifies every measurement.
The third rule is that there should be no redundant information. For example, we could get rid of the Site table and rewrite the Visited table like this:
619 | -49.85 | -128.57 | 1927-02-08 |
622 | -49.85 | -128.57 | 1927-02-10 |
734 | -47.15 | -126.72 | 1939-01-07 |
735 | -47.15 | -126.72 | 1930-01-12 |
751 | -47.15 | -126.72 | 1930-02-26 |
752 | -47.15 | -126.72 | null |
837 | -48.87 | -123.40 | 1932-01-14 |
844 | -49.85 | -128.57 | 1932-03-22 |
In fact, we could use a single table that recorded all the information about each reading in each row, just as a spreadsheet would. The problem is that it's very hard to keep data stored this way consistent: if we realize that the date of a particular visit to a particular site is wrong, we have to change multiple records in the database. What's worse, we may have to guess which records to change, since other sites may also have been visited on that date.
The fourth rule is that the units for every value should be stored explicitly. Our database doesn't do this, and that's a problem: Roerich's salinity measurements are several orders of magnitude larger than anyone else's, but we don't know if that means she was using parts per million instead of parts per thousand, or whether there actually was a saline anomaly at that site in 1932.
Stepping back, data and the tools used to store it have a symbiotic relationship: we use tables and joins because it's efficient, provided our data is organized a certain way, but organize our data that way because we have tools to manipulate it efficiently if it's in a certain form. As anthropologists say, the tool shapes the hand that shapes the tool.
Write a query that lists all radiation readings from the DR-1 site.
The following statements create the four tables in our survey database:
create table Person(ident text, personal text, family text);
create table Site(name text, lat real, long real);
create table Visited(ident integer, site text, dated text);
create table Survey(taken integer, person text, quant text, reading real);
We can get rid of one of our tables using:
drop table Survey;
Be very careful when doing this: most databases have some support for undoing changes, but it's better not to have to rely on it.
Different database systems support different data types for table columns, but most provide the following:
integer | a signed integer |
real | a floating point number |
text | a character string |
blob | a "binary large object", such as an image |
Most databases also support Booleans and date/time values; SQLite uses the integers 0 and 1 for the former, and there are several ways to represent dates, as discussed earlier. An increasing number of databases also support geographic data types, such as latitude and longitude. Keeping track of what particular systems do or do not offer, and what names they give different data types, is an unending portability headache.
When we create a table, we can specify several kinds of constraints on its columns. For example, a better definition for the Survey table would be:
create table Survey(
taken integer not null, -- where reading taken
person text, -- may not know who took it
quant text not null, -- the quantity measured
reading real not null, -- the actual reading
primary key(taken, quant),
foreign key(taken) references Visited(ident),
foreign key(person) references Person(ident)
);
Once again, exactly what constraints are available and what they're called depends on which database manager we are using.
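A sketch in sqlite3 of what a not null constraint buys us (the schema below is a simplified version of the one above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""create table Survey(
    taken integer not null,
    person text,
    quant text not null,
    reading real not null,
    primary key(taken, quant))""")

# A null reading now violates the not null constraint, so the
# database rejects the record instead of silently storing it.
try:
    cur.execute("insert into Survey values (619, 'dyer', 'rad', null)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)  # True
conn.close()
```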
Once tables have been created, we can add, change, and remove records using our other set of commands: insert, update, and delete.
The simplest form of insert statement lists values in order:
insert into Site values('DR-1', -49.85, -128.57);
insert into Site values('DR-3', -47.15, -126.72);
insert into Site values('MSK-4', -48.87, -123.40);
We can also insert values into one table directly from another:
%%sql
create table JustLatLong(name text, lat text, long text);
insert into JustLatLong select name, lat, long from site;
(OperationalError) table JustLatLong already exists u'create table JustLatLong(name text, lat text, long text);' ()
Modifying existing records is done using the update
statement. To do this we tell the database which table we want to update, what we want to change the values to for any or all of the fields, and under what conditions we should update the values.
For example, if we made a mistake when entering the lat and long values of the last insert
statement above:
%%sql
update JustLatLong set lat=-47.87, long=-122.40 where name='MSK-4'
1 rows affected.
[]
Be careful not to forget the where clause, or the update statement will modify all of the records in the database!
Deleting records can be a bit trickier, because we have to ensure that the database remains internally consistent. If all we care about is a single table, we can use the delete command with a where clause that matches the records we want to discard. For example, once we realize that Frank Danforth didn't take any measurements, we can remove him from the Person table like this:
delete from Person where ident = "danforth";
But what if we removed Anderson Lake instead? Our Survey table would still contain seven records of measurements he'd taken, but that's never supposed to happen: Survey.person is a foreign key into the Person table, and all our queries assume there will be a row in the latter matching every value in the former.
This problem is called referential integrity; we need to ensure that all references between tables can always be resolved correctly. One way to do this is to delete all the records that use 'lake'
as a foreign key before deleting the record that uses it as a primary key.
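SQLite can also enforce this for us if we declare the foreign key and turn on enforcement, as this sketch shows (note that SQLite only checks foreign keys when the pragma is enabled):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("pragma foreign_keys = on")  # off by default in SQLite
cur.execute("create table Person(ident text primary key)")
cur.execute("create table Survey(taken integer, "
            "person text references Person(ident))")
cur.execute("insert into Person values ('lake')")
cur.execute("insert into Survey values (752, 'lake')")

# Deleting the referenced Person row would leave a dangling
# foreign key, so the database refuses to do it.
try:
    cur.execute("delete from Person where ident = 'lake'")
    deleted = True
except sqlite3.IntegrityError:
    deleted = False
print(deleted)  # False
conn.close()
```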
Many applications use a hybrid storage model instead of putting everything into a database: the actual data (such as images) are stored in files, while the database stores the files' names, their modification dates, the region of the sky they cover, their spectral characteristics, and so on. This is also how most music player software is built: the database inside the application keeps track of the MP3 files, but the files themselves live on disk.
%%sql
drop table JustLatLong;
Done.
[]
One of our colleagues has sent us a CSV file containing temperature readings by Robert Olmstead, which is formatted like this:
Taken,Temp
619,-21.5
622,-15.5
Write some Python code that reads this in and prints out the SQL insert statements needed to add these records to the survey database. (Note: you will need to add an entry for Olmstead to the Person table. If you are testing your program repeatedly, you may want to investigate SQL's insert or replace command.)
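One possible sketch of a solution (the inline text stands in for the colleague's CSV file, and 'olm' is a made-up identifier; both are assumptions for illustration):

```python
import csv
import io

# The inline text stands in for the colleague's CSV file, and 'olm'
# is a made-up identifier for Olmstead.
data = io.StringIO("Taken,Temp\n619,-21.5\n622,-15.5\n")

statements = ["insert into Person values ('olm', 'Robert', 'Olmstead');"]
for row in csv.DictReader(data):
    statements.append("insert into Survey values (%s, 'olm', 'temp', %s);"
                      % (row["Taken"], row["Temp"]))
for s in statements:
    print(s)
```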
Let's return to the sqlite3 module. Here's a short Python program that selects latitudes and longitudes from survey.db:
import sqlite3

connection = sqlite3.connect("../data/survey.db")
cursor = connection.cursor()
cursor.execute("select site.lat, site.long from site;")
results = cursor.fetchall()
for r in results:
    print(r)
cursor.close()
connection.close()
(-49.85, -128.57)
(-47.15, -126.72)
(-48.87, -123.4)
Queries in real applications will often depend on values provided by users. For example, this function takes a user's ID as a parameter and returns their name:
def get_name(database_file, person_ident):
query = "select personal || ' ' || family from Person where ident='" + person_ident + "';"
connection = sqlite3.connect(database_file)
cursor = connection.cursor()
cursor.execute(query)
results = cursor.fetchall()
cursor.close()
connection.close()
return results[0][0]
print("full name for dyer:", get_name('../data/survey.db', 'dyer'))
full name for dyer: William Dyer
We use string concatenation on the first line of this function to construct a query containing the user ID we have been given. This seems simple enough, but what happens if someone gives us this string as input?
dyer'; drop table Survey; select '
It looks like there's garbage after the name of the person, but it is very carefully chosen garbage. If we insert this string into our query, the result is:
select personal || ' ' || family from Person where ident='dyer'; drop table Survey; select '';
If we execute this, it will erase one of the tables in our database.
This is called an SQL injection attack, and it has been used to attack thousands of programs over the years. In particular, many web sites that take data from users insert values directly into queries without checking them carefully first.
Since a villain might try to smuggle commands into our queries in many different ways, the safest way to deal with this threat is to replace characters like quotes with their escaped equivalents, so that we can safely put whatever the user gives us inside a string. We can do this by using a prepared statement instead of formatting our statements as strings. Here's what our example program looks like if we do this:
def get_name(database_file, person_ident):
query = "select personal || ' ' || family from Person where ident=?;"
connection = sqlite3.connect(database_file)
cursor = connection.cursor()
cursor.execute(query, [person_ident])
results = cursor.fetchall()
cursor.close()
connection.close()
return results[0][0]
print("full name for dyer:", get_name('../data/survey.db', 'dyer'))
full name for dyer: William Dyer
The key changes are in the query string and the execute call. Instead of formatting the query ourselves, we put question marks in the query template where we want to insert values. When we call execute, we provide a list that contains as many values as there are question marks in the query. The library matches values to question marks in order, and translates any special characters in the values into their escaped equivalents so that they are safe to use.
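We can check that the prepared statement defuses the attack string; a sketch with an in-memory database and a one-row Person table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table Person(ident text, personal text, family text)")
cur.execute("insert into Person values ('dyer', 'William', 'Dyer')")

# The hostile input is bound as a single string value, never spliced
# into the SQL text, so it simply matches no record.
hostile = "dyer'; drop table Person; select '"
rows = cur.execute("select * from Person where ident=?", [hostile]).fetchall()

# And the table is still intact.
count = cur.execute("select count(*) from Person").fetchone()[0]
print(rows, count)  # [] 1
conn.close()
```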
PyMC uses SQLite as an optional backend for storing MCMC traces. This is a good idea when you are collecting a lot of simulation output, or you want your output to be permanent and portable. PyMC stores traces in memory by default, meaning that they disappear when the current session ends.
The SQLite support resides in a subclass of PyMC's Database
class that provides a common interface for all backend types. Instantiating the SQLite Database
object initializes (or opens) a database, and provides methods for committing and closing.
class Database(base.Database):
"""SQLite database.
"""
def __init__(self, dbname, dbmode='a'):
"""Open or create an SQL database.
:Parameters:
dbname : string
The name of the database file.
dbmode : {'a', 'w'}
File mode. Use `a` to append values, and `w` to overwrite
an existing file.
"""
self.__name__ = 'sqlite'
self.dbname = dbname
self.__Trace__ = Trace
self.trace_names = []
# A list of sequences of names of the objects to tally.
self._traces = {} # A dictionary of the Trace objects.
if os.path.exists(dbname) and dbmode == 'w':
os.remove(dbname)
self.DB = sqlite3.connect(dbname, check_same_thread=False)
self.cur = self.DB.cursor()
existing_tables = get_table_list(self.cur)
if existing_tables:
# Get number of existing chains
self.cur.execute(
'SELECT MAX(trace) FROM [%s]' %
existing_tables[0])
self.chains = self.cur.fetchall()[0][0] + 1
self.trace_names = self.chains * [existing_tables, ]
else:
self.chains = 0
def commit(self):
"""Commit updates to database"""
self.DB.commit()
def close(self, *args, **kwds):
"""Close database."""
self.cur.close()
self.commit()
self.DB.close()
The second class is Trace, which represents a single MCMC trace and interacts with the Database to execute queries.
class Trace(base.Trace):
"""SQLite Trace class."""
def _initialize(self, chain, length):
"""Create an SQL table.
"""
if self._getfunc is None:
self._getfunc = self.db.model._funs_to_tally[self.name]
# Determine size
try:
self._shape = np.shape(self._getfunc())
except TypeError:
self._shape = None
self._vstr = ', '.join(var_str(self._shape))
# If the table already exists, exit now.
if chain != 0:
return
# Create the variable name strings.
vstr = ', '.join(v + ' FLOAT' for v in var_str(self._shape))
query = """CREATE TABLE IF NOT EXISTS [%s]
(recid INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
trace int(5), %s)""" % (self.name, vstr)
self.db.cur.execute(query)
def tally(self, chain):
"""Adds current value to trace."""
try:
valstring = ', '.join(
['%f' %
x for x in np.ravel(self._getfunc())])
except:
valstring = str(self._getfunc())
# Add value to database
query = "INSERT INTO [%s] (recid, trace, %s) values (NULL, %s, %s)" % \
(self.name, self._vstr, chain, valstring)
self.db.cur.execute(query)
def gettrace(self, burn=0, thin=1, chain=-1, slicing=None):
"""Return the trace (last by default).
Input:
- burn (int): The number of transient steps to skip.
- thin (int): Keep one in thin.
- chain (int): The index of the chain to fetch. If None, return all
chains. By default, the last chain is returned.
- slicing: A slice, overriding burn and thin assignment.
"""
if not slicing:
slicing = slice(burn, None, thin)
# If chain is None, get the data from all chains.
if chain is None:
self.db.cur.execute('SELECT * FROM [%s]' % self.name)
trace = self.db.cur.fetchall()
else:
# Deal with negative chains (starting from the end)
if chain < 0:
chain = range(self.db.chains)[chain]
self.db.cur.execute(
'SELECT * FROM [%s] WHERE trace=%s' %
(self.name, chain))
trace = self.db.cur.fetchall()
trace = np.array(trace)[:, 2:]
if len(self._shape) > 1:
trace = trace.reshape(-1, *self._shape)
return squeeze(trace[slicing])
Write SQL code to import the tables from the course project data into an SQLite database.
Kreibich JA. Using SQLite. O'Reilly Media. 2010:1–528.
Using Databases and SQL. Software Carpentry.