A gentle introduction to SQL

This notebook introduces some of the basic commands for querying and modifying a database using the Structured Query Language, SQL.

Part One retrieves data in the form of some csv files, from elsewhere in the internet. Then we're going to use a particular handy tool to turn this data into a database.

In Part two, we'll walk through some basic SQL commands for exploring and transforming this data. The amphitheatre data is courtesy Sebastian Heath, https://doi.org/10.5281/zenodo.596149. The aqueduct data is from Pelagios Commons.

Part One

We've already obtained the amphitheatre data from Sebastian Heath using the curl command, like so:

!curl https://raw.githubusercontent.com/sfsheath/roman-amphitheaters/master/roman-amphitheaters.csv > amphi.csv

We also downloaded information about roman aqueducts from the peripleo api as json, and converted to csv using this tool.

Finally, we want to turn these two csv files into a single database containing two tables. We will use the command line tool 'sqlitebiter' by Tsuyoshi Hombashi to do this, which we have already installed.

In [1]:
!sqlitebiter -o roman.db file "amphi.csv" "aqua.csv"
[INFO] sqlitebiter file: convert 'amphi.csv' to 'amphi' table
[INFO] sqlitebiter file: convert 'aqua.csv' to 'aqua' table
[INFO] sqlitebiter file: number of created tables: 2
[INFO] sqlitebiter file: database path: /home/jovyan/roman.db
In [80]:
!sqlite3 roman.db .schema .exit
CREATE TABLE IF NOT EXISTS '_source_info_' (source_id INTEGER PRIMARY KEY AUTOINCREMENT, dir_name TEXT, base_name TEXT NOT NULL, format TEXT NOT NULL, size INTEGER, mtime INTEGER);
CREATE TABLE sqlite_sequence(name,seq);
CREATE TABLE IF NOT EXISTS 'amphi' (id TEXT, title TEXT, label TEXT, latintoponym TEXT, pleiades TEXT, welchid INTEGER, golvinid INTEGER, buildingtype TEXT, chronogroup TEXT, secondcentury INTEGER, capacity INTEGER, modcountry TEXT, romanregion TEXT, arenamajor REAL, arenaminor REAL, extmajor REAL, extminor REAL, exteriorheight REAL, longitude REAL, latitude REAL, elevation INTEGER);
CREATE TABLE IF NOT EXISTS 'aqua' (identifier TEXT, title TEXT, "object_type" TEXT, "dataset_path" TEXT, "geo_bounds" TEXT, names TEXT, "temporal_bounds" TEXT, depictions TEXT, matches TEXT);
In [81]:
!sqlite3 roman.db .tables .exit
_source_info_  amphi          aqua         

Part Two

Now that we have a database, we'll bring it into python so that we can query it. Once a database is in python, we can do a wide variety of data science type visualizations or explorations, although these are beyond the remit of the current notebook.

The first thing we're going to do is create a function that opens a connection to the database, and allows us to build queries. After we create the function, we can create query objects, and then run_query. Students might also want to consult this tutorial.

In [82]:
# create a function for querying the database
import sqlite3
import pandas as pd

db = sqlite3.connect('roman.db')

def run_query(query):
    return pd.read_sql_query(query,db)

Let's give it a try. We're going to build a query that asks, 'show us every column in the amphi table, but only for the first five rows.'

In [18]:
query = 'SELECT * FROM amphi LIMIT 5;'
run_query(query)
Out[18]:
id title label latintoponym pleiades welchid golvinid buildingtype chronogroup secondcentury ... modcountry romanregion arenamajor arenaminor extmajor extminor exteriorheight longitude latitude elevation
0 duraEuroposAmphitheater Amphitheater at Dura Europos Dura Dura Europus https://pleiades.stoa.org/places/893989 129 amphitheater severan 0 ... Syria syria 31 25 50 44 40.728926 34.749855 223
1 arlesAmphitheater Amphitheater at Arles Arles Arelate https://pleiades.stoa.org/places/148217 154 amphitheater flavian 1 ... France narbonensis 47 32 136 107 4.631111 43.677778 21
2 lyonAmphitheater Amphitheater at Lyon Lyon Lugdunum https://pleiades.stoa.org/places/167717 amphitheater second-century 1 ... France lugdunensis 67.6 42 105 4.830556 45.770556 206
3 ludusMagnusArena Ludus Magnus Arena Ludus Magnus Ludus Magnus https://pleiades.stoa.org/places/423025 practice-arena imperial 0 ... Italy regio-i 12.494913 41.889950 22
4 romeFlavianAmphitheater Flavian Amphitheater at Rome Colosseum https://pleiades.stoa.org/places/423025 152 amphitheater flavian 1 ... Italy regio-i 83 48 189 156 52 12.492269 41.890169 22

5 rows × 21 columns

In [19]:
# just check that the aquaduct table is in there too
query = 'SELECT * FROM aqua LIMIT 5;'
run_query(query)
Out[19]:
identifier title object_type dataset_path geo_bounds names temporal_bounds depictions matches
0 http://dare.ht.lu.se/places/43530 Roman aqueduct, Mitiline, Lesbos Place [{"title":"dare-20160328","id":"dare-20160328"}] {"min_lon":26.514763,"max_lon":26.514763,"min_... ["Roman aqueduct, Mitiline, Lesbos"]
1 http://pleiades.stoa.org/places/738817254 Ansignan aqueduct Place [{"title":"pleiades","id":"pleiades"}] {"min_lon":2.5140266,"max_lon":2.5140266,"min_...
2 http://pleiades.stoa.org/places/403927 Caldaccoli aqueduct Place [{"title":"pleiades","id":"pleiades"}] {"min_lon":10.4367702,"max_lon":10.4367702,"mi...
3 http://topostext.org/place/352254BTei Lyttos aqueduct (Lasithi) Place [{"title":"ToposText Places","id":"ToposText P... {"min_lon":25.3818,"max_lon":25.3818,"min_lat"... {"start":-750,"end":640}
4 http://pleiades.stoa.org/places/246891 Albarracín-Cella Roman aqueduct Place [{"title":"pleiades","id":"pleiades"}] {"min_lon":-1.5,"max_lon":-1.5,"min_lat":40.5,...

Basic Query Commands

SELECT, LIMIT, ORDER BY : using these, we can ask, 'Which amphitheatre is at the highest elevation?'

Use SELECT to retrieve the id and elevation columns FROM the amphi table

Use ORDER BY to sort the elevation column and use the DESC keyword to specify that you want to sort in descending order

Use LIMIT to restrict the output to 1 row

In [30]:
query = '''
SELECT id, elevation 
FROM amphi
ORDER BY elevation DESC
LIMIT 1;
'''

run_query(query)
Out[30]:
id elevation
0 lambaesisAmphitheater 1170

Let's get the top 10 now

In [31]:
query = '''
SELECT id, elevation 
FROM amphi
ORDER BY elevation DESC
LIMIT 10;
'''

run_query(query)
Out[31]:
id elevation
0 lambaesisAmphitheater 1170
1 albaFucensAmphitheater 977
2 mactarisAmphitheater 911
3 tebessaAmphitheater 877
4 leonAmphitheater 846
5 bostraAmphitheater 845
6 segobrigaAmphitheater 817
7 siccaVeneriaAmphitheater 748
8 sanBenedettoDeiMarsiAmphitheater 683
9 sanVittorinoAmphitheater 672

Following this pattern, can you create a query that also provides the geographic coordinates? In the block below see if you can construct and run that query.

In [ ]:
 

Querying with Conditions

Now let's create a query that creates a subset of data using a logical operator. We need the 'WHERE' command.

In [32]:
query = '''
SELECT * 
FROM amphi
WHERE elevation > 500;
'''

run_query(query)
Out[32]:
id title label latintoponym pleiades welchid golvinid buildingtype chronogroup secondcentury ... modcountry romanregion arenamajor arenaminor extmajor extminor exteriorheight longitude latitude elevation
0 segobrigaAmphitheater Amphitheater at Segobriga Segobriga Segobriga https://pleiades.stoa.org/places/266036 88 amphitheater flavian 1 ... Spain tarraconensis 47 31 75 64 -2.813944 39.886018 817
1 cyreneAmphitheater Amphitheater at Cyrene Cyrene https://pleiades.stoa.org/places/373778 amphitheater imperial 1 ... Libya crete-et-cyrenaica 32.7 28.8 21.850808 32.824526 550
2 grumentoAmphitheater Amphitheater at Grumento Grumento Grumentum https://pleiades.stoa.org/places/442603 92 amphitheater imperial 1 ... Italy regio-iii 60 40 90 70 15.911579 40.287640 579
3 albaFucensAmphitheater Amphitheater at Alba Fucens Alba Fucens Alba Fucens https://pleiades.stoa.org/places/413005 amphitheater julio-claudian 1 ... Italy regio-iv 64 37 103 76 13.412289 42.077200 977
4 lambaesisAmphitheater Amphitheater at Lambaesis Lambaesis https://pleiades.stoa.org/places/334570 amphitheater second-century 1 ... Algeria mauretania 68 55 104.6 94 6.259935 35.489247 1170
5 mactarisAmphitheater Amphitheater at Mactaris Mactaris https://pleiades.stoa.org/places/324774 119 amphitheater second-century 1 ... Tunisia proconsularis 38.4 24.8 63.2 49.6 9.206673 35.855628 911
6 sbeitlaAmphitheater Amphitheater at Sbeitla Sbeitla Sufetula https://pleiades.stoa.org/places/324816 113 amphitheater second-century 1 ... Tunisia proconsularis 47 37 72 60 9.114582 35.243617 556
7 tebessaAmphitheater Amphitheater at Theveste Tebessa Theveste https://pleiades.stoa.org/places/324831 amphitheater flavian 1 ... Algeria proconsularis 52.8 39.5 83 70 8.123809 35.401171 877
8 susaAmphitheater Amphitheater at Susa Susa Segusio https://pleiades.stoa.org/places/167919 15 amphitheater imperial 1 ... Italy alpes-cottiae 44 36 60 52 7.044970 45.133185 546
9 sophiaAmphitheater Amphitheater at Sophia Sophia Ulpia Serdica https://pleiades.stoa.org/places/207439 amphitheater post-severan 0 ... Bulgaria thracia 60.5 43 23.328650 42.697178 554
10 rodezAmphitheater Amphitheater at Rodez Rodez Segodunum https://pleiades.stoa.org/places/138579 22 amphitheater imperial 1 ... France aquitania 2.571162 44.353698 611
11 sanBenedettoDeiMarsiAmphitheater Amphitheater at San Benedetto dei Marsi San Benedetto Marruvium https://pleiades.stoa.org/places/432927 amphitheater imperial 1 ... Italy regio-iv 101 13.627505 42.006504 683
12 sanVittorinoAmphitheater Amphitheater at San Vittorino San Vittorino Amiternum https://pleiades.stoa.org/places/413013 amphitheater first-century 1 ... Italy regio-iv 64 42 90 68 13.306027 42.400531 672
13 sisapoAmphitheater Amphitheater at Sisapo Sisapo https://pleiades.stoa.org/places/266043 amphitheater imperial 1 ... Spain baetica -4.516667 38.645833 659
14 leonAmphitheater Amphitheater at León León Legio https://pleiades.stoa.org/places/236512 amphitheater imperial 1 ... Spain tarraconensis -5.566944 42.598889 846
15 bernAmphitheater Amphitheater at Bern Bern Bremodorum https://pleiades.stoa.org/places/177471 42 amphitheater imperial 1 ... Switzerland germania-superior 27.6 23.5 42.6 38.5 7.451179 46.975715 547
16 bostraAmphitheater Amphitheater at Bosra Bosra Nova Trajana Bostra https://pleiades.stoa.org/places/678073 amphitheater imperial 1 ... Syria syria 36.479844 32.517923 845
17 siccaVeneriaAmphitheater Amphitheater at Sicca Veneria Sicca Veneria Sicca Veneria https://pleiades.stoa.org/places/315152 amphitheater imperial 1 ... Tunisia proconsularis 70 50 100 80 8.715302 36.185361 748
18 aostaAmphitheater Amphitheater at Aosta Aosta Augusta Praetoria https://pleiades.stoa.org/places/383579 amphitheater julio-claudian 1 ... Italy regio-xi 150 115 7.323481 45.740914 595
19 aphrodisiasAmphitheater Late Roman Amphitheater at Aphrodisas Aphrodisias Aphrodisias https://pleiades.stoa.org/places/638753 arena-in-stadium fourth-century 0 ... Turkey asia 59 30 28.723061 37.712553 526
20 tusculumAmphitheater Tusculum Amphitheater Tusculum https://pleiades.stoa.org/places/423108 amphitheater second-century 1 ... Italy regio-i 49 30 73 54 12.702103 41.798372 582
21 compsaAmphitheater Amphitheater at Compsa Compsa Compsa https://pleiades.stoa.org/places/442550 amphitheater republican 1 ... Italy regio-ii 15.330556 40.870278 572

22 rows × 21 columns

Our condition can be string data too; in which case we put the string in quotation marks:

In [33]:
query = '''
SELECT * 
FROM amphi
WHERE chronogroup = "flavian";
'''

run_query(query)
Out[33]:
id title label latintoponym pleiades welchid golvinid buildingtype chronogroup secondcentury ... modcountry romanregion arenamajor arenaminor extmajor extminor exteriorheight longitude latitude elevation
0 arlesAmphitheater Amphitheater at Arles Arles Arelate https://pleiades.stoa.org/places/148217 154 amphitheater flavian 1 ... France narbonensis 47 32 136 107 4.631111 43.677778 21
1 romeFlavianAmphitheater Flavian Amphitheater at Rome Colosseum https://pleiades.stoa.org/places/423025 152 amphitheater flavian 1 ... Italy regio-i 83 48 189 156 52 12.492269 41.890169 22
2 newsteadAmphitheater Amphitheater at Newstead Newstead Trimontium https://pleiades.stoa.org/places/89304 amphitheater flavian 1 ... United Kingdom britannia 37 23 70 60 -2.691900 55.602600 83
3 pozzuoliFlavianAmphitheater Flavian Amphitheater at Pozzuoli Pozzuoli (Flavian) Puteoli https://pleiades.stoa.org/places/432815 153 amphitheater flavian 1 ... Italy regio-i 74.8 42 149 116 14.125315 40.825929 41
4 segobrigaAmphitheater Amphitheater at Segobriga Segobriga Segobriga https://pleiades.stoa.org/places/266036 88 amphitheater flavian 1 ... Spain tarraconensis 47 31 75 64 -2.813944 39.886018 817
5 parisAmphitheater Amphitheater at Paris Paris Lutetia Parisiorum https://pleiades.stoa.org/places/109126 gallo-roman-amphitheater flavian 1 ... France lugdunensis 2.352855 48.845097 46
6 nimesAmphitheater Amphitheater at Nimes Nimes Nemausus https://pleiades.stoa.org/places/148142 amphitheater flavian 1 ... France narbonensis 69.1 38.4 133.4 101.4 4.359599 43.834876 50
7 chesterAmphitheater Amphitheater at Chester Chester Deva Victrix https://pleiades.stoa.org/places/79420 amphitheater flavian 1 ... United Kingdom britannia 56.7 48.3 95.5 86.6 -2.886928 53.189079 23
8 arezzoAmphitheater Amphitheater at Arezzo Arezzo Arretium https://pleiades.stoa.org/places/413032 amphitheater flavian 1 ... Italy regio-vii 71.9 42.7 109.4 80 11.880370 43.460491 263
9 londonAmphitheater Amphitheater at London London Londinium https://pleiades.stoa.org/places/79574 amphitheater flavian 1 ... United Kingdom britannia 60 40 100 85 -0.092136 51.515522 29
10 urbisagliaAmphitheater Amphitheater at Urbs Salvia Urbs Salvia Urbs Saglia https://pleiades.stoa.org/places/413364 amphitheater flavian 1 ... Italy regio-v 59 35 98 71 13.387224 43.200877 219
11 vindonissaAmphitheater Amphitheater at Vindonissa Vindonissa https://pleiades.stoa.org/places/177661 17 amphitheater flavian 1 ... Switzerland germania-superior 64 51 112 98 8.213605 47.476290 361
12 frejusAmphitheater Amphitheater Fréjus Fréjus Forum Julii https://pleiades.stoa.org/places/157836 amphitheater flavian 1 ... France narbonensis 67.7 39.7 113.7 85.7 6.728710 43.434460 12
13 tebessaAmphitheater Amphitheater at Theveste Tebessa Theveste https://pleiades.stoa.org/places/324831 amphitheater flavian 1 ... Algeria proconsularis 52.8 39.5 83 70 8.123809 35.401171 877
14 aquileiaAmphitheater Amphitheater at Aquileia Aquileia Aquileia https://pleiades.stoa.org/places/187290 amphitheater flavian 1 ... Italy regio-x 72 46 142 118 13.370713 45.768239 2
15 terminiImereseAmphitheater Amphitheater at Termini Imerese Termini Thermae Himerae https://pleiades.stoa.org/places/462513 164 amphitheater flavian 1 ... Italy sicilia 53 30 98 75 13.695186 37.983696 76
16 bolsenaAmphitheater Amphitheater at Bolsena Bolsena Volsinii Novi https://pleiades.stoa.org/places/413389 amphitheater flavian 1 ... Italy regio-vii 11.990158 42.650243 417
17 narbonneAmphitheater Amphitheater at Narbonne Narbonne Narbo https://pleiades.stoa.org/places/246347 amphitheater flavian 1 ... France narbonensis 75 46.6 121.6 93.2 3.010181 43.184873 14
18 capuaImperialAmphitheater Imperial Amphitheater at Capua Capua II Capua https://pleiades.stoa.org/places/432754 amphitheater flavian 1 ... Italy regio-i 76.12 45.83 165 135 14.250089 41.085935 34
19 dorchesterAmphitheater Amphitheater at Dorchester / Maumbury Henge Maumbury Durnovaria https://pleiades.stoa.org/places/79431 amphitheater flavian 1 ... United Kingdom britannia 58 47 80 77 -2.440246 50.708044 78
20 autunAmphitheater Amphitheater at Autun Autun Augustodunum https://pleiades.stoa.org/places/177460 157 amphitheater flavian 1 ... France lugdunensis 74 49 154 130 4.310856 46.953003 317
21 bobadelaAmphitheater Amphitheater at Bobadela Bobadela Elbocoris https://pleiades.stoa.org/places/236458 amphitheater flavian 1 ... Portugal lusitania 50 40 -7.893572 40.361088 383
22 chichesterAmphitheater Amphitheater at Chichester Chichester Noviomagus Reginorum https://pleiades.stoa.org/places/79622 amphitheater flavian 1 ... United Kingdom britannia 70 60 -0.771186 50.835394 14

23 rows × 21 columns

Can you write a query that pulls only the Flavian amphitheatres in France? Hint: you'll need the AND command.

In [ ]:
 

Adding some maths

How many such amphitheatres are there? This is where you'd use the COUNT command. Let's count up the number of amphitheatres from the second century.

In [35]:
query = '''
SELECT COUNT(*)
FROM amphi
WHERE chronogroup = "second-century";
'''

run_query(query)
Out[35]:
COUNT(*)
0 50

We can rename that result like so:

In [36]:
query = '''
SELECT COUNT(*) AS "Total Count of Second Century Amphitheatres in the DB"
FROM amphi
WHERE chronogroup = "second-century";
'''

run_query(query)
Out[36]:
Total Count of Second Century Amphitheatres in the DB
0 50

SUM, AVG, MIN and MAX

What was the average capacity?

In [37]:
query = '''
SELECT AVG(capacity) AS "Average Capacity"
FROM amphi;
'''

run_query(query)
Out[37]:
Average Capacity
0 5791.476923
In [85]:
query = '''
SELECT AVG(arenamajor) AS "Average Length"
FROM amphi;
'''

run_query(query)
Out[85]:
Average Length
0 32.982759

We can group rows by one value versus another to see how they compare. Is there a difference in average length of the long axis in Julio-Claudian versus Flavian amphitheatres?

In [45]:
query = '''
SELECT chronogroup, AVG(arenamajor) AS "Average Length"
FROM amphi
GROUP BY chronogroup
ORDER BY "Average Length" DESC;
'''

run_query(query)
Out[45]:
chronogroup Average Length
0 post-severan 62.750000
1 fourth-century 59.000000
2 neronian 57.000000
3 augustan 56.125000
4 flavian 54.266087
5 late-second-century 46.333333
6 hadrianic 45.626667
7 third-century 43.500000
8 julio-claudian 43.273571
9 second-century 42.856800
10 severan 42.450000
11 first-century 42.227778
12 imperial 22.010092
13 caesarean 19.333333
14 republican 10.250000
15 late-1st-early-2nd 0.000000

JOIN

Now let's tell the database how the two tables are joined together. The label field in the amphi table contains the modern day description of the location of amphitheatres, and the title field in the aqua table contains a description of the modern day location of the aqueducts. Normally, when we join two tables, we want to perform the join on columns that are keyed together. In a sales database for instance there might a table of orders and another for shipping address, and each one contains a customer_id column. In such a case, we use = to say

FROM orders 
INNER JOIN shipping_address 
ON orders.customer_id = shipping_addres.customer_id

But archaeological data is rarely so straightforward. In our two tables here, we have to pattern match in order to make the two fields join up - there is no 'primary key' to help us know that a row in one table is talking about the same thing in another table. Instead of = we're going to use the LIKE command. LIKE uses two different kinds of wildcards, % and _.

  • % matches any sequence of zero or more characters
  • _ matches any single character.

If we said, LIKE 'Arl%' we would find matches on Arles, Arlate and so on. Placing the % on either side would find strings that contain Arl. In our case, we want to find instances in the aqua table's title column that contain strings from the amphi table's label column.

To join to our first table all matching rows from our second, we do an 'inner join'. The syntax generally is:

SELECT relevant-columns            # these will be the columns displayed in your result
FROM tableA                        # the table to join
INNER JOIN tableB                  # with this table
ON tableA.title = tableB.label     # by these criteria

The query below displays the result of joining the aqua table to the amphi table using the labels column data as the middle piece in a wildcard: %string%, but uses the || characters to indicate we want the string values, not the literal characters amphi.label.

In [84]:
query = '''
SELECT amphi.id, amphi.label, aqua.identifier
FROM aqua 
INNER JOIN amphi
ON aqua.title LIKE '%' || amphi.label || '%';
'''


run_query(query)
Out[84]:
id label identifier
0 mitilineAmphitheater Mitiline http://dare.ht.lu.se/places/43530
  • true confession: There is not an amphitheatre at Mitilene to our knowledge. We added one row to the table manually so that this join example would work properly. (When we retrieved the data from the Pelagios api, we only downloaded the first page of results, in order to keep the notebook light).

We've also created a small notebook that shows how to import a database into R, and to build queries for it. Once you've done that, you can pass the results as a dataframe and use the full power of R to analyze. The notebook is here.

This is also possible in python, of course, and we have an example notebook here

In [ ]: