SQL Murder Mystery

The SQL Murder Mystery is a self-directed lesson for learning SQL concepts and commands, and a fun game for experienced SQL users to solve an intriguing crime. It was published by the Northwestern University Knight Lab and is based on Noah Veltman's Command Line Murder Mystery.

To solve the mystery, you need to download a small database containing all the clues; the challenge is working out which SQL commands will reveal them.
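If you want a quick peek inside a SQLite file before setting anything else up, Python's built-in sqlite3 module can list its tables from the sqlite_master catalogue. A minimal sketch, run here against an in-memory database with one illustrative table so it works anywhere (substitute the path to sql-murder-mystery.db to inspect the real file):

```python
import sqlite3

# ":memory:" keeps this sketch self-contained; replace it with
# "sql-murder-mystery.db" to inspect the downloaded file instead.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE crime_scene_report "
    "(date INTEGER, type TEXT, description TEXT, city TEXT)")

# Every SQLite database records its tables in the sqlite_master catalogue.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)
conn.close()
```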

Using a notebook makes sense for this sort of investigation because it helps you keep track of, and make notes on, the current state of your investigation.

The original database is distributed as a SQLite database, but this notebook shows how to load it into, and access it from, the PostgreSQL database server distributed as part of the TM351 VM.

Instructions for downloading a Virtualbox image of the TM351 VM, or building the VM on a remote host, can be found here.

Download and Install the Database

We can download the SQLite3 database directly from the original repository:

In [1]:
!wget https://github.com/NUKnightLab/sql-mysteries/raw/master/sql-murder-mystery.db

!ls *.db
--2019-10-08 11:33:31--  https://github.com/NUKnightLab/sql-mysteries/raw/master/sql-murder-mystery.db
Resolving github.com (github.com)... 140.82.118.4
Connecting to github.com (github.com)|140.82.118.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/NUKnightLab/sql-mysteries/master/sql-murder-mystery.db [following]
--2019-10-08 11:33:31--  https://raw.githubusercontent.com/NUKnightLab/sql-mysteries/master/sql-murder-mystery.db
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3002368 (2.9M) [application/octet-stream]
Saving to: ‘sql-murder-mystery.db’

sql-murder-mystery. 100%[===================>]   2.86M  3.91MB/s    in 0.7s    

2019-10-08 11:33:33 (3.91 MB/s) - ‘sql-murder-mystery.db’ saved [3002368/3002368]

sql-murder-mystery.db

To export the data tables from the SQLite database into Postgres, we need to do a couple of things:

  • install the pgloader utility which provides a wide range of tools for loading variously formatted datasets into Postgres;
  • create a Postgres user and database to store the data.

Install the pgloader utility

The pgloader utility is a Linux package that we can install using apt-get.
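If you are not sure whether pgloader is already installed, you can check from Python before running the install step. A small stdlib sketch using shutil.which, which looks an executable up on the PATH:

```python
import shutil

# shutil.which returns the full path of an executable if it is on
# the PATH, or None if it cannot be found.
pgloader_path = shutil.which('pgloader')
print('pgloader found at', pgloader_path) if pgloader_path else print('pgloader not installed')
```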

In [2]:
%%capture
!sudo apt-get update && sudo apt-get install -y pgloader && sudo apt-get clean

Set up the Postgres Database

To keep things tidy, we can create a specific database to import the data into, along with a specific Postgres user with access permissions on that database.

We'll also import some tools and extensions to make working with the database easier.

In [3]:
from tm351_utils.db import *
%load_ext sql
%load_ext schemadisplay_magic
The sql extension is already loaded. To reload it, use:
  %reload_ext sql

Let's make sure we don't contaminate any of the evidence by starting with a clean slate:

In [4]:
#clearConnections("mysterydb")
!echo "DROP DATABASE IF EXISTS mysterydb" | sudo -u postgres psql
NOTICE:  database "mysterydb" does not exist, skipping
DROP DATABASE

Create an empty database to import the data into:

In [5]:
! createdb mysterydb

Let's also create a user account for our investigating officer, with permissions over the database (note that on Postgres 15 and later, you would also need to grant the user privileges on the public schema before they could create tables in it):

In [6]:
# Create a new user...
!echo "CREATE USER detective WITH PASSWORD 'clouseau';" | sudo -u postgres psql

# ...with permissions over the new database
!echo "GRANT ALL PRIVILEGES ON DATABASE mysterydb TO detective;" | sudo -u postgres psql
CREATE ROLE
GRANT

Create a Database Connection String and Import the Tables

Create a connection string for connecting to the database, set from identifiable parameter values.

In [7]:
DB_ENGINE='postgresql' # tells ipython-sql that we will use postgreSQL as our database engine
DB_USER='detective'  # id of the user who is logging in
DB_PWD='clouseau' # the user's password
DB_ADDR='localhost:5432' # the host and port on which the database engine is listening
DB_NAME='mysterydb' # the name of the database

DB_CONNECTION = '{engine}://{user}:{pwd}@{addr}/{name}'.format(engine=DB_ENGINE,
                                                               user=DB_USER,
                                                               pwd=DB_PWD,
                                                               addr=DB_ADDR,
                                                               name=DB_NAME)
DB_CONNECTION
Out[7]:
'postgresql://detective:clouseau@localhost:5432/mysterydb'
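One caveat with assembling the URL by string formatting: if a password contains URL-reserved characters such as '@' or ':', it must be percent-encoded before being embedded in the connection string, or the URL will parse incorrectly. A minimal sketch using the standard library's quote_plus (the password shown is illustrative, not the one used above):

```python
from urllib.parse import quote_plus

# An illustrative password containing URL-reserved characters.
password = 'p@ss:word'

# Percent-encode it so it can safely appear between ':' and '@' in the URL.
encoded = quote_plus(password)
print(encoded)  # p%40ss%3Aword

url = 'postgresql://detective:{pwd}@localhost:5432/mysterydb'.format(pwd=encoded)
print(url)
```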

We can now use this connection string as the target location for importing the data from the SQLite database into the Postgres database:

In [8]:
!pgloader ./sql-murder-mystery.db $DB_CONNECTION
2019-10-08T11:33:45.030000Z LOG Main logs in '/tmp/pgloader/pgloader.log'
2019-10-08T11:33:45.035000Z LOG Data errors in '/tmp/pgloader/'
2019-10-08T11:33:45.237000Z WARNING PostgreSQL warning: table "crime_scene_report" does not exist, skipping
2019-10-08T11:33:45.237000Z WARNING PostgreSQL warning: table "drivers_license" does not exist, skipping
2019-10-08T11:33:45.237000Z WARNING PostgreSQL warning: table "person" does not exist, skipping
2019-10-08T11:33:45.237000Z WARNING PostgreSQL warning: table "facebook_event_checkin" does not exist, skipping
2019-10-08T11:33:45.237000Z WARNING PostgreSQL warning: table "interview" does not exist, skipping
2019-10-08T11:33:45.237000Z WARNING PostgreSQL warning: table "get_fit_now_member" does not exist, skipping
2019-10-08T11:33:45.237000Z WARNING PostgreSQL warning: table "get_fit_now_check_in" does not exist, skipping
2019-10-08T11:33:45.237000Z WARNING PostgreSQL warning: table "income" does not exist, skipping
2019-10-08T11:33:45.237000Z WARNING PostgreSQL warning: table "solution" does not exist, skipping
            table name       read   imported     errors            time
----------------------  ---------  ---------  ---------  --------------
                 fetch          0          0          0          0.000s
       fetch meta data          9          9          0          0.026s
      create, truncate          0          0          0          0.033s
----------------------  ---------  ---------  ---------  --------------
    crime_scene_report       1228       1228          0          0.080s
       drivers_license      10007      10007          0          0.293s
                person      10011      10011          0          0.243s
facebook_event_checkin      20011      20011          0          0.440s
             interview       4991       4991          0          0.107s
    get_fit_now_member        184        184          0          0.053s
  get_fit_now_check_in       2703       2703          0          0.067s
                income       7514       7514          0          0.074s
              solution          0          0          0          0.007s
index build completion          0          0          0          0.000s
----------------------  ---------  ---------  ---------  --------------
        Create Indexes          0          0          0          0.000s
       Reset Sequences          0          0          0          0.010s
----------------------  ---------  ---------  ---------  --------------
  Total streaming time      56649      56649          0          1.433s

Connect to the Database

We can use the ipython-sql magic, which is preinstalled in the TM351 VM, to connect to the database, which will simplify running queries over it:

In [9]:
%sql $DB_CONNECTION
Out[9]:
'Connected: detective@mysterydb'

The magic returns results as pandas dataframes. The following setting ensures we can read the full content of any result.

In [10]:
import pandas as pd
# Note: on pandas 1.0 and later, pass None rather than -1 here.
pd.set_option('display.max_colwidth', -1)

To orientate ourselves to the contents of the database, we can display the table schema:

In [11]:
%schema --connection_string $DB_CONNECTION
Out[11]:
[Schema diagram showing the following tables and columns:]

  • income: ssn BIGINT, annual_income BIGINT
  • get_fit_now_member: id TEXT, person_id BIGINT, name TEXT, membership_start_date BIGINT, membership_status TEXT
  • interview: person_id BIGINT, transcript TEXT
  • crime_scene_report: date BIGINT, type TEXT, description TEXT, city TEXT
  • person: id BIGINT, name TEXT, license_id BIGINT, address_number BIGINT, address_street_name TEXT, ssn BIGINT
  • solution: user BIGINT, value TEXT
  • get_fit_now_check_in: membership_id TEXT, check_in_date BIGINT, check_in_time BIGINT, check_out_time BIGINT
  • drivers_license: id BIGINT, age BIGINT, height BIGINT, eye_color TEXT, hair_color TEXT, gender TEXT, plate_number TEXT, car_make TEXT, car_model TEXT
  • facebook_event_checkin: person_id BIGINT, event_id BIGINT, event_name TEXT, date BIGINT

A helper function also lets us view the tables in the database:

In [12]:
showTables(DB_NAME)
Out[12]:
tablename tableowner
0 crime_scene_report detective
1 drivers_license detective
2 person detective
3 facebook_event_checkin detective
4 interview detective
5 get_fit_now_member detective
6 get_fit_now_check_in detective
7 income detective
8 solution detective

We can also review the columns contained within a table (omit the table name to see columns listed for all tables in the specified database).

In [13]:
showColumns(DB_NAME, 'crime_scene_report',  user=DB_USER, password=DB_PWD)
Out[13]:
column_name table_name
0 city crime_scene_report
1 date crime_scene_report
2 description crime_scene_report
3 type crime_scene_report

Let the Investigation Begin

Check out the original guidance for the mystery here; additional hints are also available within the original repository.

In brief, you're going to be investigating a murder that took place on January 15th, 2018 in SQL City.

To get you started, here's the original crime report:

In [14]:
%%sql
SELECT * FROM crime_scene_report WHERE date=20180115 AND city='SQL City' AND type='murder';
 * postgresql://detective:***@localhost:5432/mysterydb
1 rows affected.
Out[14]:
date type description city
0 20180115 murder Security footage shows that there were 2 witnesses. The first witness lives at the last house on "Northwestern Dr". The second witness, named Annabel, lives somewhere on "Franklin Ave". SQL City
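Notice that the date column is a plain BIGINT, not a SQL DATE: dates in this database are stored as integers of the form YYYYMMDD, which is why the query above compares date=20180115 rather than using a date literal. A small sketch of converting a Python date into that integer form:

```python
from datetime import date

# Dates in this database are encoded as YYYYMMDD integers,
# e.g. January 15th, 2018 becomes 20180115.
murder_date = date(2018, 1, 15)
as_int = int(murder_date.strftime('%Y%m%d'))
print(as_int)  # 20180115
```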

Now get to it...