how data can be structured, queried and updated.
It is based on set theory.
It can be implemented in many different ways.
When it is implemented by storing relations on disk files, we have a relational database.
Functional programming languages such as Python naturally express many aspects of the relational model.
This is one of the reasons they are very useful for data science.
The formal definition of the relational model
Representing relations in Python using collections of tuples
Querying relational data using Python set comprehensions
The following slides use relations to describe:
import pandas as pd
df = pd.DataFrame({'name': ['Smith', 'Brown'], 'student_number': [17, 18], 'course_id': [1, 2], 'department': ['CS', 'CS']})
df
name | student_number | course_id | department | |
---|---|---|---|---|
0 | Smith | 17 | 1 | CS |
1 | Brown | 18 | 2 | CS |
In a relational model, the data is a collection of relations.
Informally, a relation resembles a table of values.
When relations are stored on disk, they are called tables.
Each row represents a fact about a real-world entity or relationship.
The table name and column names are used to help interpret the meaning of the values.
A relational model is defined formally in terms of:
Image courtesy of AutumnSnow
A tuple is a mathematical abstraction which:
contains several other values
has a well-defined ordering over the values
can contain duplicate values
can contain values of different types
can contain the special value None
or Null
is immutable; the values contained in the tuple cannot change over time
We often restrict attention to tuples of a particular size or degree.
An $n-$tuple contains $n$ values.
Atomic values are values which are not stored in collections.
Atomic values cannot be further decomposed into other values.
A tuple is therefore not atomic.
A tuple that contains only atomic values is called a flat tuple.
A domain $D_i$ is a set of atomic values.
Each attribute within a relation has the same domain.
Intuitively, a domain specifies the allowable values in a column $i$.
Examples:
$D_1 = \mathbb{Z}$
$D_2 = \{ 15, 16, \ldots, 80 \}$
$D_3 = \{ "CS", \; "ECON", \; "PHYS" \}$
A relation schema is denoted by $R(A_1, A_2, \ldots, A_n)$.
Each attribute $A_i$ is the name of a role played by some domain $D_i$ in $R$.
$D_i$ is the domain of $A_i$ and is denoted by $\operatorname{dom}(A_i)$.
The degree or arity of a relation is the number of attributes $n$.
STUDENT(Name, Ssn, Home_phone, Address, Office_phone, Age, Gpa)
Answer: 7
What is the name of the relation?
Answer: STUDENT
$\operatorname{dom}(Gpa) = [0, 4]$
The schema represents the structure of a relation.
A relation contains the actual data.
It is sometimes called the relation state, relation intension or relation extension.
Let $r(R)$ denote the relation $r$ of a relation schema $R$.
The relation $r$ consists of a set of $n$-tuples $r = \{t_1, t_2, \ldots, t_m\}$.
The $i^{th}$ value in tuple $t$ corresponds to the attribute $A_i$
and is denoted $t.A_i$ or $t[i]$.
Domain constraints
Key constraints
NULL values
So far we have discussed single relations.
A typical data-set will comprise many relations.
A relational dataset schema $(S, IC)$ comprises:
a set of relation schemas $S = \{ R_1, R_2, \ldots, R_k \}$
a set of integrity constraints $IC$
A relational dataset state $DB$ is a set of relation states $DB = \{ r_1, r_2, \ldots r_m \}$
The data definition language (DDL) provides a concrete syntax and semantics for describing a relational schema.
Most commonly we use SQL - Structured Query Language.
The data query language provides a concrete syntax and semantics for querying the relational dataset.
Formally a query is a function mapping from existing relation states to new relations.
That is, we map from one set of tuples to another set of tuples.
Typically the mapping involves some combination of set-theoretic functions, e.g.
The most common data query language for relational databases is again SQL.
. . .
commas and surrounded by round brackets. For example:
tuple1 = (50, 6.5)
tuple2 = (1, 2, 'hello')
professor = ('Steve', 'Phelps', 'S6.18')
student = ('John', 'Doe', None)
their position (counting from zero). To find the office number of the professor:
professor[2]
'S6.18'
Tuples are a very flexible way to represent single pieces of data.
We only allow flat tuples. The following is not allowed in a relational model:
this_is_not_allowed = (1, 3, (50, 6.5))
How can we use tuples to represent data-sets and relations?
We can use collections of tuples, e.g. a set of tuples.
So now we can represent one or more students:
# Student tuples
smith = ('Smith', 17, 1, 'CS')
brown = ('Brown', 8, 2, 'CS')
# The student relation
students = {smith, brown}
Attributes are names for particular positions within a tuple.
We can use Python functions to represent relational attributes:
# The attributes of a student
def student_name(s):
return s[0]
def student_student_number(s):
return s[1]
Note that different relations can have the same attribute.
Therefore we need some convention to distinguish attributes from different relations.
In the above code, student_student_number
refers to the student_number
attribute of the student
relation.
We need some way to extract data from our data-set; i.e. to query the data.
A query will e.g.:
Take a subset of the tuples of a relation that satisfy a predicate.
Join two or more relations using a Cartesian product.
Take the intersection of tuples from two or more relations.
Take the union of tuples from two or more relations.
Python list comprehensions or set comprehensions provide all of this functionality.
{s for s in students if student_name(s) == 'Smith'}
{('Smith', 17, 1, 'CS')}
This is equivalent to the SQL query:
SELECT * FROM students WHERE students.name = "SMITH";
grades
which has tuples of the form (ssn, course-name, mark)
:grades = { (17, 'python', 'A'), (17, 'algebra', 'B'), (17, 'algebra', 'A')}
and a function to return the mark for a given grade tuple:
def grade_mark(g):
return g[2]
Now we can join the two relations using a Cartesian product:
{(student_name(s), grade_mark(g)) for s in students for g in grades}
{('Brown', 'A'), ('Brown', 'B'), ('Smith', 'A'), ('Smith', 'B')}
this is equivalent to the following SQL:
SELECT students.name, grades.mark FROM students, grades;
{(student_name(s), grade_mark(g)) for s in students for g in grades if student_name(s) == 'Smith'}
{('Smith', 'A'), ('Smith', 'B')}
SELECT students.name, grades.mark FROM students, grades WHERE students.name = "Smith";
Functional programming languages such as Python are well suited to data science because functional programming constructs such as comprehensions map directly onto the relational model.
This is called functional-relational mapping.
There are tools which allow relational database to be queried directly from functional expressions.
This is in contrast to object-relational mapping (ORM).
There is no isomorphic mapping between an object model and a relational model, which leads to difficulties called the object-relational impedance mismatch.
Functional programming allows us to express a computation as functions on sets of tuples, and thus comprehensions are completely isomorphic with relational queries.