The Relational Model¶

The relational model is a mathematical abstraction used to describe

how data can be structured, queried and updated.

It is based on set theory.
It can be implemented in many different ways.
When it is implemented by storing relations on disk files, we have a relational database.
Functional programming languages such as Python naturally express many aspects of the relational model.
This is one of the reasons they are very useful for data science.

Overview¶

The formal definition of the relational model
Representing relations in Python using collections of tuples
Querying relational data using Python set comprehensions

An Example Relational Dataset¶

The following slides use relations to describe:

students,
the courses they are taking
the prerequisites of courses
their grades
which department they are in

In [26]:

import pandas as pd
df = pd.DataFrame({'name': ['Smith', 'Brown'], 'student_number': [17, 18], 'course_id': [1, 2], 'department': ['CS', 'CS']})
df

Out[26]:

	name	student_number	course_id	department
0	Smith	17	1	CS
1	Brown	18	2	CS

Concepts¶

In a relational model, the data is a collection of relations.
Informally, a relation resembles a table of values.
When relations are stored on disk, they are called tables.
Each row represents a fact about a real-world entity or relationship.
The table name and column names are used to help interpret the meaning of the values.
A relational model is defined formally in terms of:
- tuples
- attributes
- relations
- domains

Illustration of a relation¶

relation

Image courtesy of AutumnSnow

Tuples¶

A tuple is a mathematical abstraction which:

contains several other values
has a well-defined ordering over the values
can contain duplicate values
can contain values of different types
can contain the special value None or Null
is immutable; the values contained in the tuple cannot change over time

The size of a tuple¶

We often restrict attention to tuples of a particular size or degree.
An $n-$tuple contains $n$ values.

Attributes¶

An attribute refers to the value in a particular index of a tuple.

Atomic values¶

Atomic values are values which are not stored in collections.
Atomic values cannot be further decomposed into other values.
A tuple is therefore not atomic.
A tuple that contains only atomic values is called a flat tuple.

Domain¶

A domain $D_i$ is a set of atomic values.
Each attribute within a relation has the same domain.
Intuitively, a domain specifies the allowable values in a column $i$.
Examples:

$D_1 = \mathbb{Z}$

$D_2 = \{ 15, 16, \ldots, 80 \}$

$D_3 = \{ "CS", \; "ECON", \; "PHYS" \}$

Relation schema¶

A relation schema is denoted by $R(A_1, A_2, \ldots, A_n)$.
Each attribute $A_i$ is the name of a role played by some domain $D_i$ in $R$.
$D_i$ is the domain of $A_i$ and is denoted by $\operatorname{dom}(A_i)$.
The degree or arity of a relation is the number of attributes $n$.

Example¶

What is the arity of the following relation schema?

STUDENT(Name, Ssn, Home_phone, Address, Office_phone, Age, Gpa)

Answer: 7
What is the name of the relation?
Answer: STUDENT

Example of a Domain¶

$\operatorname{dom}(Gpa) = [0, 4]$

Relations¶

The schema represents the structure of a relation.
A relation contains the actual data.
It is sometimes called the relation state, relation intension or relation extension.
Let $r(R)$ denote the relation $r$ of a relation schema $R$.
The relation $r$ consists of a set of $n$-tuples $r = \{t_1, t_2, \ldots, t_m\}$.
The $i^{th}$ value in tuple $t$ corresponds to the attribute $A_i$

and is denoted $t.A_i$ or $t[i]$.

Constraints¶

Domain constraints
Key constraints
NULL values

Relational Datasets¶

So far we have discussed single relations.
A typical data-set will comprise many relations.
A relational dataset schema $(S, IC)$ comprises:
- a set of relation schemas $S = \{ R_1, R_2, \ldots, R_k \}$
- a set of integrity constraints $IC$
A relational dataset state $DB$ is a set of relation states $DB = \{ r_1, r_2, \ldots r_m \}$
- such that every $r_i$ satisfies every constraint in $IC$.

Data definition language¶

The data definition language (DDL) provides a concrete syntax and semantics for describing a relational schema.
Most commonly we use SQL - Structured Query Language.

Data query language¶

The data query language provides a concrete syntax and semantics for querying the relational dataset.
Formally a query is a function mapping from existing relation states to new relations.
That is, we map from one set of tuples to another set of tuples.
Typically the mapping involves some combination of set-theoretic functions, e.g.
- subset of tuples that satisfy a predicate $p$
  - $\{ x: x \in X \wedge p(x) \}$
- set union $X \cup Y$, difference $X - Y$, intersection $X \cap Y$
- Cartesian product $X \times Y$
The most common data query language for relational databases is again SQL.

. . .

Mathematically, there is nothing stopping us from using e.g. Python as a query language.

Tuples in Python¶

Tuples in Python can be written by writing a sequence of values separated by

commas and surrounded by round brackets. For example:

In [1]:

tuple1 = (50, 6.5)
tuple2 = (1, 2, 'hello')
professor = ('Steve', 'Phelps', 'S6.18')
student = ('John', 'Doe', None)

The individual values contain within a tuple can be obtained by indexing

their position (counting from zero). To find the office number of the professor:

In [2]:

professor[2]

Out[2]:

'S6.18'

Tuples are a very flexible way to represent single pieces of data.
We only allow flat tuples. The following is not allowed in a relational model:

In [3]:

this_is_not_allowed = (1, 3, (50, 6.5))

Sets of Tuples¶

How can we use tuples to represent data-sets and relations?
We can use collections of tuples, e.g. a set of tuples.
So now we can represent one or more students:

In [4]:

# Student tuples
smith = ('Smith', 17, 1, 'CS')
brown = ('Brown', 8, 2, 'CS')

# The student relation
students = {smith, brown}

Relational attributes in Python¶

Attributes are names for particular positions within a tuple.
We can use Python functions to represent relational attributes:

In [5]:

# The attributes of a student

def student_name(s):
    return s[0]
    
def student_student_number(s):
    return s[1]

Note that different relations can have the same attribute.
Therefore we need some convention to distinguish attributes from different relations.
In the above code, student_student_number refers to the student_number attribute of the student relation.

Queries in Python¶

We need some way to extract data from our data-set; i.e. to query the data.
A query will e.g.:
- Take a subset of the tuples of a relation that satisfy a predicate.
- Join two or more relations using a Cartesian product.
- Take the intersection of tuples from two or more relations.
- Take the union of tuples from two or more relations.
Python list comprehensions or set comprehensions provide all of this functionality.

Relational queries in Python¶

The set of students whose name is "Smith":

In [6]:

{s for s in students if student_name(s) == 'Smith'}

Out[6]:

{('Smith', 17, 1, 'CS')}

This is equivalent to the SQL query:

SELECT * FROM students WHERE students.name = "SMITH";

Joining relations¶

Now let's create another relation called grades which has tuples of the form (ssn, course-name, mark):

In [7]:

grades = { (17, 'python', 'A'), (17, 'algebra', 'B'), (17, 'algebra', 'A')}

and a function to return the mark for a given grade tuple:

In [8]:

def grade_mark(g):
    return g[2]

Now we can join the two relations using a Cartesian product:

In [9]:

{(student_name(s), grade_mark(g)) for s in students for g in grades}

Out[9]:

{('Brown', 'A'), ('Brown', 'B'), ('Smith', 'A'), ('Smith', 'B')}

this is equivalent to the following SQL:

SELECT students.name, grades.mark FROM students, grades;

We can also combine this with a predicate:

In [10]:

{(student_name(s), grade_mark(g)) for s in students for g in grades if student_name(s) == 'Smith'}

Out[10]:

{('Smith', 'A'), ('Smith', 'B')}

SELECT students.name, grades.mark FROM students, grades WHERE students.name = "Smith";

Functional-relational mapping¶

Functional programming languages such as Python are well suited to data science because functional programming constructs such as comprehensions map directly onto the relational model.
This is called functional-relational mapping.
There are tools which allow relational database to be queried directly from functional expressions.
This is in contrast to object-relational mapping (ORM).
There is no isomorphic mapping between an object model and a relational model, which leads to difficulties called the object-relational impedance mismatch.
Functional programming allows us to express a computation as functions on sets of tuples, and thus comprehensions are completely isomorphic with relational queries.