Data modeling is the practice of giving structure to data so that it can be stored efficiently, accessed easily, and understood better. One of the fundamental concepts of data modeling is the entity-relationship model.
The Entity-Relationship model, or ER model, is a diagrammatic representation of data entities and the relationships among them. In object-oriented programming, a category of entities is defined as a 'class' and each unique entity belonging to a class is called an 'object'. Similarly, in database parlance, a table is a collection of records that share the same attributes: records are stored as rows and attributes as columns. In an ER diagram, a table or a view (the result of a query, which is not stored on disk but is presented in the form of a table) constitutes an entity, and associations between two or more entities are called relationships. Relationships can be characterized in two different ways:
Based on the number of participants
Based on the number of possible associations for a single object
Let us look at an example of an Entity-Relationship model. Consider an application with a shopping cart feature and the information flow involved in performing an online purchase.
<img src="https://s3.amazonaws.com/refactored/images/ML/images/ShoppingER.png" style="width:70%;">
The diagram above marks a key symbol beside specific attributes. A key is a column or set of columns (the minimum number of columns required) through which each individual object or record can be uniquely identified and accessed. A primary key is native to its own table, while a foreign key is the primary key of another table. By defining primary and foreign keys, we establish the association between two tables and thus connect them.
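As a minimal sketch of how these keys are declared (the table and column names here are illustrative, not taken from the diagram above), a primary key and a foreign key referencing it might look like this in sqlite3:

```python
import sqlite3

con = sqlite3.connect(':memory:')
cur = con.cursor()
# sqlite leaves foreign key enforcement off by default; turn it on
cur.execute("PRAGMA foreign_keys=ON")

# 'dept_id' is the primary key, native to the dept table
cur.execute("CREATE TABLE dept(dept_id INT PRIMARY KEY, dept_name CHAR(50))")
# In emp, 'dept_id' is a foreign key referencing dept's primary key
cur.execute("""CREATE TABLE emp(emp_id INT PRIMARY KEY, emp_name CHAR(50),
               dept_id INT,
               FOREIGN KEY (dept_id) REFERENCES dept(dept_id))""")

cur.execute("INSERT INTO dept VALUES (1, 'Sales')")
cur.execute("INSERT INTO emp VALUES (100, 'Ada', 1)")   # valid: dept 1 exists
con.commit()
```

With the relationship in place, any attempt to insert an emp row whose dept_id has no matching dept row is rejected by the database.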
Connect to an in-memory database using the sqlite3 module. Create the tables from the example ER diagram above. After creating each table, load it with data from the respective data file: customer.csv, product.csv, order.csv. Use the pandas module and apply the data loading operations learned in the previous exercise. Print the first row of each table after loading the data.
Datatypes of columns: cust_id, prod_id, prod_price, order_id and order_amount can all be assumed to be integers (INT). cust_name, cust_payment_mode and prod_name can be assigned a CHAR(50) string data type.
Assume that cust_id, order_id and prod_id are the primary keys of the customer, purchaseorder and product tables respectively. Also assume that cust_id and prod_id are foreign keys in the purchaseorder table. Establish these relationships during table creation, within the CREATE query.
Print out all the contents of the 'purchaseorder' table using a 'SELECT * FROM' query. Store the query output in a 'po_out' variable.
Points to remember:
# Connecting to an in-memory database
import sqlite3
import pandas as pd
shopcon = sqlite3.connect(':memory:')
shopcur = shopcon.cursor()
# Modify the create statements below
shopcur.execute('''CREATE TABLE customer()''')
shopcustdf = pd.read_csv('https://raw.githubusercontent.com/colaberry/data/master/SQL/customer.csv',sep=",")
shopcustdf.to_sql(name='customer',con=shopcon,if_exists='append',index=False)
shopcur.execute('''CREATE TABLE product()''')
shopproddf = pd.read_csv('https://raw.githubusercontent.com/colaberry/data/master/SQL/product.csv',sep=",")
shopproddf.to_sql(name='product',con=shopcon,if_exists='append',index=False)
shopcur.execute('''CREATE TABLE purchaseorder()''')
shoporderdf = pd.read_csv('https://raw.githubusercontent.com/colaberry/data/master/SQL/order.csv',sep=",")
shoporderdf.to_sql(name='purchaseorder',con=shopcon,if_exists='append',index=False)
shopcur.execute("SELECT * FROM purchaseorder")
po_out = shopcur.fetchall()
# Refer to any online SQL tutorial to learn how to frame CREATE queries that declare primary and foreign keys
shopcur.execute('''CREATE TABLE customer(cust_id INT PRIMARY KEY NOT NULL, cust_name CHAR(50), cust_payment_mode CHAR(50))''')
shopcustdf = pd.read_csv('https://raw.githubusercontent.com/colaberry/data/master/SQL/customer.csv',sep=",")
shopcustdf.to_sql(name='customer',con=shopcon,if_exists='append',index=False)
shopcur.execute('''CREATE TABLE product(prod_id INT PRIMARY KEY NOT NULL, prod_name CHAR(50), prod_price INT)''')
shopproddf = pd.read_csv('https://raw.githubusercontent.com/colaberry/data/master/SQL/product.csv',sep=",")
shopproddf.to_sql(name='product',con=shopcon,if_exists='append',index=False)
shopcur.execute('''CREATE TABLE purchaseorder(order_id INT PRIMARY KEY NOT NULL, cust_id INT NOT NULL, prod_id INT NOT NULL,
order_amount INT, FOREIGN KEY (cust_id) REFERENCES customer(cust_id), FOREIGN KEY (prod_id) REFERENCES product(prod_id))''')
shoporderdf = pd.read_csv('https://raw.githubusercontent.com/colaberry/data/master/SQL/order.csv',sep=",")
shoporderdf.to_sql(name='purchaseorder',con=shopcon,if_exists='append',index=False)
shopcur.execute("SELECT * FROM purchaseorder")
po_out = shopcur.fetchall()
ref_tmp_var = False
try:
    test = [(5005, 1000, '3002, 3005', 18),
            (5006, 1003, '3008, 3009, 3005', 38),
            (5007, 1002, '3003, 3001', 35),
            (5008, 1003, '3004, 3009', 18),
            (5009, 1002, '3007, 3003, 3001', 53),
            (5010, 1002, '3008, 3010', 26)]
    if test == po_out:
        ref_assert_var = True
        ref_tmp_var = True
    else:
        ref_assert_var = False
        print('Please follow the instructions given and use the same variables provided in the instructions.')
except Exception:
    print('Please follow the instructions given and use the same variables provided in the instructions.')

assert ref_tmp_var
When a foreign key relationship is created between two tables, referential integrity can be enforced. Referential integrity ensures that no row in the table containing the foreign key holds a value that does not exist as a valid primary key value in the referenced table. In the current example, the purchaseorder table will not accept the insertion of a row with a cust_id (or prod_id) value that does not exist in the customer (or product) table. A similar rule applies to deletion: no row in the customer (or product) table can be deleted if its cust_id (or prod_id) appears as a foreign key value in the purchaseorder table.
To verify that we have created the relationships, we can test the referential integrity of the tables. Note that by default sqlite does not enforce foreign key constraints (and thereby referential integrity); this feature must be explicitly turned on for each sqlite session using the command 'PRAGMA foreign_keys=ON'.
Let us test the referential integrity of the data model created above. Insert the following record, which has a non-existent cust_id and prod_id. If the foreign key relationships were properly created in the previous steps, the error message 'FOREIGN KEY constraint failed' should be displayed. Otherwise, the relationships were not created properly and that exercise needs to be revisited.
shopcur.execute("PRAGMA foreign_keys=ON")
try:
    shopcur.execute("INSERT INTO purchaseorder (order_id, cust_id, prod_id, order_amount) VALUES (5011,1030,3040,50)")
    raise Exception("Insert Successful. Creation of Relationship failed")
except Exception as x:
    storemsg = x
    print(x)
ref_tmp_var = False
try:
    if str(storemsg) == 'FOREIGN KEY constraint failed':
        ref_assert_var = True
        ref_tmp_var = True
    else:
        ref_assert_var = False
        print('Please follow the instructions given and use the same variables provided in the instructions.')
except Exception:
    print('Please follow the instructions given and use the same variables provided in the instructions.')

assert ref_tmp_var
There are three types of data models: the conceptual model, the logical model and the physical model.
A data warehouse is a type of database that stores data in a slightly different way, in order to enable business analysis. A database answers transactional queries and is therefore said to enable OLTP (Online Transaction Processing). A data warehouse is created over an existing database to enable and support OLAP (Online Analytical Processing) queries. OLTP queries provide data to business users in support of operational decision making, while OLAP queries support their strategic decision making.
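To make the OLTP/OLAP distinction concrete, here is a small sketch in sqlite3 (the sales table and its data are invented for illustration): an OLTP query looks up an individual transaction, while an OLAP query aggregates across many transactions.

```python
import sqlite3

con = sqlite3.connect(':memory:')
cur = con.cursor()
cur.execute("CREATE TABLE sales(sale_id INT PRIMARY KEY, region CHAR(50), amount INT)")
cur.executemany("INSERT INTO sales VALUES (?,?,?)",
                [(1, 'East', 100), (2, 'West', 250), (3, 'East', 150)])

# OLTP-style query: operational lookup of a single transaction
cur.execute("SELECT amount FROM sales WHERE sale_id=2")
print(cur.fetchone())    # (250,)

# OLAP-style query: aggregate sales across a dimension (region)
cur.execute("SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region")
print(cur.fetchall())    # [('East', 250), ('West', 250)]
```

In practice OLAP workloads run against a warehouse schema designed for aggregation, but the query shapes differ in exactly this way: point lookups versus grouped aggregates.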
Dimensional modeling is the way data is perceived and modelled in a data warehouse. Quantitative variables are generally called 'measures'. Variables that govern or define a specific value (record) of a measure are called 'dimensions'. For example, the sale amount of a transaction is a measure, whereas the time of occurrence of the sale transaction, the type of sale (say cash sale or credit sale) and the location where the sale occurred are all dimensions of the sale. A data warehouse usually consists of multiple dimension tables connected to a central fact table that holds the measures. The connection of these dimension tables to their fact table results in a star or snowflake schema.
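The sale example above can be sketched as a star schema in sqlite3 (all table and column names here are illustrative): one dimension table each for time, sale type and location, and a central fact table whose foreign keys point at them.

```python
import sqlite3

con = sqlite3.connect(':memory:')
cur = con.cursor()
cur.execute("PRAGMA foreign_keys=ON")

# Dimension tables: each describes one aspect of a sale
cur.execute("CREATE TABLE dim_time(time_id INT PRIMARY KEY, sale_date CHAR(50))")
cur.execute("CREATE TABLE dim_type(type_id INT PRIMARY KEY, sale_type CHAR(50))")
cur.execute("CREATE TABLE dim_location(loc_id INT PRIMARY KEY, city CHAR(50))")

# Fact table: holds the measure (sale_amount) plus a foreign key per dimension
cur.execute("""CREATE TABLE fact_sales(
    sale_id INT PRIMARY KEY,
    time_id INT, type_id INT, loc_id INT,
    sale_amount INT,
    FOREIGN KEY (time_id) REFERENCES dim_time(time_id),
    FOREIGN KEY (type_id) REFERENCES dim_type(type_id),
    FOREIGN KEY (loc_id) REFERENCES dim_location(loc_id))""")
```

Because every dimension connects directly to the single fact table, the diagram of this schema radiates outward like a star; normalizing the dimension tables further would turn it into a snowflake schema.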
If you would like to learn more about data warehouses (concepts and implementation), refer to the certified course from Colaberry: http://www.colaberry.com/bootcamp/sqlbi
ref_tmp_var = False
ref_tmp_var = True
assert ref_tmp_var