In this guided project, we will put everything together to build a database for storing data related to crimes that occurred in Boston. The cleaned dataset provided by Dataquest is available in the file boston.csv.
The data we will work with is known as Crime Incident Reports. It is published under the PDDL license by Analyze Boston.
Crime incident reports are provided by the Boston Police Department (BPD) to document the initial details surrounding an incident to which BPD officers respond. The dataset contains records from the new crime incident report system, which includes a reduced set of fields focused on capturing the type of incident as well as when and where it occurred. Records in the new system begin in June 2015.
More information and the up-to-date Crime Incident Reports dataset can be found here.
import pandas as pd
import matplotlib.pyplot as plt
%pylab inline
import psycopg2
import csv
Populating the interactive namespace from numpy and matplotlib
#header and first five rows
pd.read_csv('data/boston.csv')[0:5]
| | incident_number | offense_code | description | date | day_of_the_week | lat | long |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 619 | LARCENY ALL OTHERS | 2018-09-02 | Sunday | 42.357791 | -71.139371 |
| 1 | 2 | 1402 | VANDALISM | 2018-08-21 | Tuesday | 42.306821 | -71.060300 |
| 2 | 3 | 3410 | TOWED MOTOR VEHICLE | 2018-09-03 | Monday | 42.346589 | -71.072429 |
| 3 | 4 | 3114 | INVESTIGATE PROPERTY | 2018-09-03 | Monday | 42.334182 | -71.078664 |
| 4 | 5 | 3114 | INVESTIGATE PROPERTY | 2018-09-03 | Monday | 42.275365 | -71.090361 |
The first column contains the identifier of the crime record. The second contains a numeric code for the committed offense. The third is a description of the crime. The next two columns contain the date on which the crime happened and the corresponding day of the week. Finally, the last two columns give the location of the crime as latitude and longitude coordinates.
We will create a database named crime_db with a table, boston_crimes, with appropriate data types for storing the data. We will create the table inside a schema named crimes. We will also create the readonly and readwrite groups with the appropriate privileges. Finally, we will create one user for each of these groups.
#change the connection parameters here
user = "dq"
dbname = "dq"
#password = #####

#connection to the Dataquest database
conn = psycopg2.connect(f"dbname={dbname} user={user}")
conn.autocommit = True #CREATE DATABASE cannot run inside a transaction block
cur = conn.cursor()
cur.execute("""CREATE DATABASE crime_db;""")
#no need to reset autocommit since we close the connection right away
conn.close()
#connection to the new database and schema creation
conn = psycopg2.connect("dbname=crime_db user=postgres")
conn.autocommit = True
cur = conn.cursor()
cur.execute("""CREATE SCHEMA crimes;""")
conn.autocommit = False
Before we start creating tables, let's gather some data about our crime dataset so that we can more easily select the right data types to use in our tables.
#reading the CSV file using the csv module
rows = []
with open('data/boston.csv', 'r') as file:
    reader = csv.reader(file)
    for row in reader:
        rows.append(row)
col_headers = rows[0]
first_row = rows[1]
print(col_headers)
['incident_number', 'offense_code', 'description', 'date', 'day_of_the_week', 'lat', 'long']
print(first_row)
['1', '619', 'LARCENY ALL OTHERS', '2018-09-02', 'Sunday', '42.35779134', '-71.13937053']
We now have the column names as well as the first row's values at hand, so we can easily refer to them at any moment throughout this guided project.
Before starting, you may find it useful to review the Postgres documentation on data types.
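For instance, pg_column_size() can be used to compare the storage footprint of candidate types. A quick sketch, using the open connection from above:

#comparing the storage size (in bytes) of some candidate integer types
cur.execute("SELECT pg_column_size(1::smallint), pg_column_size(1::integer), pg_column_size(1::bigint);")
print(cur.fetchone()) # (2, 4, 8)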
We need to identify the proper data types for the columns. To help us with that, we will create a function, get_col_value_set(), that, given the name of a CSV file and a column index (starting at zero), computes a Python set with all distinct values contained in that column.
This function will be useful for two reasons:
- Checking whether an enumerated type might be a good choice for representing a column.
- Computing the maximum length of any text-like column in order to select appropriate sizes for varchar columns.

def get_col_value_set(csv_filename, col_index):
    with open(csv_filename, 'r') as f:
        next(f) # skip the row containing column headers
        reader = csv.reader(f)
        # create a set to contain all distinct values in that column
        unique_values_in_column = set()
        for row in reader:
            # add this row's value for the column to the set
            column_values = row[col_index]
            unique_values_in_column.add(column_values)
    return unique_values_in_column
#computing the number of unique values each column contains
for i in range(7):
    print(col_headers[i], len(get_col_value_set('data/boston.csv', i)))
incident_number 298329
offense_code 219
description 239
date 1177
day_of_the_week 7
lat 18177
long 18177
Columns with a low number of distinct values tend to be good candidates for enumerated types. Another important aspect is to know the length of the longest value in any column containing textual data. We can reuse the previous function to compute this.
There are two textual columns in the dataset, namely the description and day_of_the_week columns. However, day_of_the_week contains only $7$ different values, one for each day, and we can tell that the longest of them is Wednesday without needing any computation.
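We can verify this with the function defined above (day_of_the_week is at column index $4$):

weekday_values = get_col_value_set('data/boston.csv', 4)
print(max(weekday_values, key=len)) # Wednesday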
Let's compute the maximum length for the values contained in the description column.
#description column index = 2
description_values = get_col_value_set('data/boston.csv', 2)
max_length = 0
for value in description_values:
    if len(value) > max_length:
        max_length = len(value)
print(max_length)
58
However, according to Analyze Boston (see the Crime Incident Field Explanation), the appropriate varchar length for this text field is $80$, so we will use this number when creating the table: we may want to update our database in the future, and new offense descriptions could arise.
Before examining the numerical columns, let's prepare the enumerated type query string for the weekday column. We will execute the query later.
create_type_string = """CREATE TYPE day_of_the_week_enum AS ENUM
('Sunday',
'Monday',
'Tuesday',
'Wednesday',
'Thursday',
'Friday',
'Saturday');"""
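Once this statement has been executed (we will run it later, just before creating the table), the enum's labels can be listed from the pg_enum system catalog. A quick sketch:

#listing the labels of an enumerated type from the system catalogs
check_enum_string = """SELECT enumlabel
FROM pg_enum
JOIN pg_type ON pg_enum.enumtypid = pg_type.oid
WHERE pg_type.typname = 'day_of_the_week_enum'
ORDER BY enumsortorder;"""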
The incident_number column is an auto-incrementing integer. Since its maximum value ($298329$) fits comfortably in a $4$-byte integer, we will use the serial type.
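As a quick check, serial is backed by a $4$-byte signed integer, whose maximum value is $2^{31}-1 = 2147483647$:

#the maximum incident_number fits comfortably in a 4-byte integer
print(298329 <= 2**31 - 1) # True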
Let's retrieve the maximum value and the maximum length of the values contained in the second column, offense_code.
offense_code_values = get_col_value_set('data/boston.csv', 1)
max_length = 0
max_value = 0
for value in offense_code_values:
    if len(value) > max_length:
        max_length = len(value)
    if int(value) > max_value:
        max_value = int(value)
print(max_length, max_value)
4 3831
Sometimes, the right data type for a given column may be a character type even if the values look like numbers (postal codes, for example). One simple reason is that you cannot meaningfully perform arithmetic on them; in addition, leading zeros are significant. In a SQL database, this would typically be a varchar or char column of the appropriate length.
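A minimal illustration in Python, using a hypothetical postal code:

#casting a postal code to a number silently drops the leading zero
postal_code = "02134"
print(int(postal_code)) # 2134 -- the leading zero is lost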
The fourth column contains the date on which the event occurred, so we will use the date type, which represents a specific day of the year. The storage size for this data type is $4$ bytes.
Now let's examine the latitude and longitude columns: they contain decimal numbers. With the Postgres decimal type, we can set our own precision and scale, which means we can choose how much storage our entries will require. Since we already know the valid ranges in degrees for latitude ($-90°$ to $+90°$) and longitude ($-180°$ to $+180°$), we just need to calculate the maximum scale contained in the lat and long columns, and then add $2$ integer digits for the latitude precision and $3$ for the longitude precision. This way, our table will still be correctly sized if it is reused for locations other than Boston in the future.
So we will create a function that computes the maximum scale for a given set of column values.
def get_scale(col_value_set):
    max_scale = 0
    for value in col_value_set:
        scale = 0
        for idx, char in enumerate(value):
            if char == ".":
                # number of digits after the decimal point
                scale = len(value) - idx - 1
        if scale > max_scale:
            max_scale = scale
    return max_scale
lat_values = get_col_value_set('data/boston.csv', 5)
long_values = get_col_value_set('data/boston.csv', 6)
lat_max_scale = get_scale(lat_values)
long_max_scale = get_scale(long_values)
print(lat_max_scale, long_max_scale)
8 8
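Following the reasoning above, we can derive the precision for each column from its maximum scale:

#precision = integer digits + scale
lat_precision = lat_max_scale + 2 # at most 2 digits before the point for latitude
long_precision = long_max_scale + 3 # at most 3 digits before the point for longitude
print(lat_precision, long_precision) # 10 11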
We can now create our boston_crimes table inside the crimes schema with the appropriate data type for each column.
conn = psycopg2.connect("dbname=crime_db user=postgres")
cur = conn.cursor()
#executing the enumerated type query string
cur.execute(create_type_string)
#table creation query string
create_string = """CREATE TABLE crimes.boston_crimes (
incident_number serial PRIMARY KEY,
offense_code varchar(4),
description varchar(80),
date date,
day_of_the_week day_of_the_week_enum,
lat decimal(10,8),
long decimal(11,8)
);
"""
#executing the table creation query string
cur.execute(create_string)
#commit the transaction
conn.commit()
#checking the result
cur.execute("SELECT * FROM crimes.boston_crimes LIMIT 0;")
cur.description
(Column(name='incident_number', type_code=23),
 Column(name='offense_code', type_code=1043),
 Column(name='description', type_code=1043),
 Column(name='date', type_code=1082),
 Column(name='day_of_the_week', type_code=40979901),
 Column(name='lat', type_code=1700),
 Column(name='long', type_code=1700))
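The numeric type codes are Postgres type OIDs and can be resolved through the pg_type catalog (the day_of_the_week_enum code is specific to this database):

#resolving the standard type codes to type names
cur.execute("SELECT oid, typname FROM pg_type WHERE oid IN (23, 1043, 1082, 1700);")
cur.fetchall() # e.g. [(23, 'int4'), (1043, 'varchar'), (1082, 'date'), (1700, 'numeric')]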
Now that we have created the table, we can load the data into it. We will use the cursor.copy_expert() method.
with open("data/boston.csv") as f:
cur.copy_expert("COPY crimes.boston_crimes FROM STDIN WITH CSV HEADER;", f)
conn.commit()
#checking the result printing the first five rows
cur.execute("SELECT * FROM crimes.boston_crimes LIMIT 5;")
cur.fetchall()
[(1, '619', 'LARCENY ALL OTHERS', datetime.date(2018, 9, 2), 'Sunday', Decimal('42.35779134'), Decimal('-71.13937053')),
 (2, '1402', 'VANDALISM', datetime.date(2018, 8, 21), 'Tuesday', Decimal('42.30682138'), Decimal('-71.06030035')),
 (3, '3410', 'TOWED MOTOR VEHICLE', datetime.date(2018, 9, 3), 'Monday', Decimal('42.34658879'), Decimal('-71.07242943')),
 (4, '3114', 'INVESTIGATE PROPERTY', datetime.date(2018, 9, 3), 'Monday', Decimal('42.33418175'), Decimal('-71.07866441')),
 (5, '3114', 'INVESTIGATE PROPERTY', datetime.date(2018, 9, 3), 'Monday', Decimal('42.27536542'), Decimal('-71.09036101'))]
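We can also confirm that every row was loaded; since incident_number has $298329$ distinct values, the table should contain that many rows:

cur.execute("SELECT COUNT(*) FROM crimes.boston_crimes;")
print(cur.fetchone()) # (298329,)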
Our next task is to handle users. Following the least-privilege principle, the first step is to make sure that no privileges are inherited from the public group or remain on the public schema.
cur.execute("""REVOKE ALL ON SCHEMA public FROM public;""")
cur.execute("""REVOKE ALL ON DATABASE crime_db FROM public;""")
conn.commit()
A readonly group is a group of users whose only privilege is reading data from the database. This is likely the profile we would set up for data analysts since, in general, such users only need to access the data for analysis. So we will grant only SELECT privileges to the readonly group.
Another common group is the readwrite group, which has privileges to read, insert, update, and remove data from tables, but cannot drop tables. In a similar way, you can think of readwrite as a suitable group for data scientists: those users need to be able to analyze the data, but they are also responsible for collecting and cleaning the data and loading it into the database. Following the least-privilege principle, a suitable set of privileges for these kinds of users consists of the SELECT, INSERT, UPDATE and DELETE commands.
The DROP command, which is used to remove a table, is not usually granted, as it is typically a data engineer's role to set up the platform and deal with table creation and data type settings.
These roles are of course not mutually exclusive and can share common tasks. However, these user groups are quite common as you can read in this blog post.
It's also a good practice to always make sure that groups cannot be used for login. We can do that using the NOLOGIN
option when creating the group.
#create two groups named readonly and readwrite with the NOLOGIN option
cur.execute("""CREATE GROUP readonly NOLOGIN;""")
cur.execute("""CREATE GROUP readwrite NOLOGIN;""")
#the two groups need connection privileges, otherwise they won't be able to do anything
cur.execute("""GRANT CONNECT ON DATABASE crime_db TO readonly;""")
cur.execute("""GRANT CONNECT ON DATABASE crime_db TO readwrite;""")
#they also need usage privileges for the crimes schema
cur.execute("""GRANT USAGE ON SCHEMA crimes TO readonly;""")
cur.execute("""GRANT USAGE ON SCHEMA crimes TO readwrite;""")
#setting specific privileges to each group for all tables in crimes schema
cur.execute("""GRANT SELECT ON ALL TABLES IN SCHEMA crimes TO readonly;""")
cur.execute("""GRANT SELECT, INSERT, DELETE, UPDATE ON ALL TABLES IN SCHEMA crimes TO readwrite;""")
conn.commit()
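Note that GRANT ... ON ALL TABLES IN SCHEMA only affects tables that already exist. If new tables may be added to the crimes schema later, we could also set default privileges so that future tables are covered automatically; a sketch, assuming future tables will be created by the current role:

cur.execute("""ALTER DEFAULT PRIVILEGES IN SCHEMA crimes
GRANT SELECT ON TABLES TO readonly;""")
cur.execute("""ALTER DEFAULT PRIVILEGES IN SCHEMA crimes
GRANT SELECT, INSERT, DELETE, UPDATE ON TABLES TO readwrite;""")
conn.commit()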
As a last step, we will create one user with a password for each group.
cur.execute("""CREATE USER data_analyst WITH PASSWORD 'secret1';""")
cur.execute("""GRANT data_analyst TO readonly;""")
cur.execute("""CREATE USER data_scientist WITH PASSWORD 'secret2';""")
cur.execute("""GRANT data_scientist TO readwrite;""")
conn.commit()
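As a quick functional check (a sketch; it assumes the server accepts password authentication for these users), we can connect as data_analyst and run a read-only query:

test_conn = psycopg2.connect("dbname=crime_db user=data_analyst password=secret1")
test_cur = test_conn.cursor()
test_cur.execute("SELECT COUNT(*) FROM crimes.boston_crimes;")
print(test_cur.fetchone())
test_conn.close()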
It is a good practice to test that everything is configured as expected when you finish setting up the database. We can use SQL queries to check whether the objects have been created and that users and groups have the right privileges. This requires you to know the Postgres internal tables. We can query the pg_roles table to inspect privileges related to the database and the information_schema.table_privileges table to inspect table privileges.
test_string = """SELECT grantee, privilege_type
FROM information_schema.table_privileges
WHERE grantee = 'readwrite';"""
cur.execute(test_string)
cur.fetchall()
[('readwrite', 'INSERT'), ('readwrite', 'SELECT'), ('readwrite', 'UPDATE'), ('readwrite', 'DELETE')]
test_string = """SELECT grantee, privilege_type
FROM information_schema.table_privileges
WHERE grantee = 'readonly';"""
cur.execute(test_string)
cur.fetchall()
[('readonly', 'SELECT')]
test_string = """SELECT rolsuper, rolcanlogin, rolcreaterole, rolcreatedb
FROM pg_roles
WHERE rolname = 'readonly';"""
cur.execute(test_string)
cur.fetchall()
[(False, False, False, False)]
test_string = """SELECT rolsuper, rolcanlogin, rolcreaterole, rolcreatedb
FROM pg_roles
WHERE rolname = 'readwrite';"""
cur.execute(test_string)
cur.fetchall()
[(False, False, False, False)]
test_string = """SELECT rolsuper, rolcanlogin, rolcreaterole, rolcreatedb, rolpassword
FROM pg_roles
WHERE rolname = 'data_analyst';"""
cur.execute(test_string)
cur.fetchall()
[(False, True, False, False, '********')]
test_string = """SELECT rolsuper, rolcanlogin, rolcreaterole, rolcreatedb, rolpassword
FROM pg_roles
WHERE rolname = 'data_scientist';"""
cur.execute(test_string)
cur.fetchall()
[(False, True, False, False, '********')]
test_string = """SELECT rolsuper, rolcanlogin, rolcreaterole, rolcreatedb, rolpassword
FROM pg_roles
WHERE rolname = 'postgres';"""
cur.execute(test_string)
cur.fetchall()
[(True, True, True, True, '********')]
After checking the pg_roles and information_schema.table_privileges tables, it appears that:
- the readonly group has only the SELECT privilege, while the readwrite group has the SELECT, INSERT, UPDATE and DELETE privileges, as expected;
- neither group is a superuser, and neither can log in, create roles, or create databases;
- the data_analyst and data_scientist users can log in, but have none of the special role attributes;
- only the postgres superuser retains full privileges.
Everything is set up as expected, so we can close the connection.
conn.close()