Data modeling is the practice of giving structure to data so that it can be stored efficiently, accessed easily, and understood better. One of the fundamental concepts of data modeling is the entity-relationship model.
The Entity-Relationship model, or ER model, is a diagrammatic representation of data entities and the relationships among them. In object-oriented programming, a category of entities is defined as a 'class' and each unique entity belonging to a class is called an 'object'. Similarly, in database parlance, a table is a collection of records that share the same attributes: records are stored as rows and attributes as columns. In an ER diagram, a table or a view (the result of a query, which is not stored on disk but is presented in the form of a table) constitutes an entity, and associations between two or more entities are called relationships. Relationships can be characterized in two different ways:
Based on the number of participants
Based on the number of possible associations for a single object
Let us look at an example of an Entity-Relationship model. Consider an application with a shopping cart feature and the information flow involved in performing an online purchase.
<img src="https://s3.amazonaws.com/refactored/images/ML/images/ShoppingER.png" style="width:70%;">
The diagram above marks a key symbol beside specific attributes. A key is a column or set of columns (the minimum number of columns required) through which each individual object or record can be uniquely identified and accessed. A primary key is native to its own table, while a foreign key is the primary key of another table. By defining primary and foreign keys, we establish the association between two tables and thus connect them.
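As a minimal sketch of how these keys are declared (the table and column names here are illustrative, not taken from the diagram above), a primary key and a foreign key referencing it might look like this in sqlite3:

```python
import sqlite3

con = sqlite3.connect(':memory:')
cur = con.cursor()
# sqlite leaves foreign key enforcement off by default; turn it on
cur.execute("PRAGMA foreign_keys=ON")

# 'dept_id' is the primary key, native to the dept table
cur.execute("CREATE TABLE dept(dept_id INT PRIMARY KEY, dept_name CHAR(50))")
# In emp, 'dept_id' is a foreign key referencing dept's primary key
cur.execute("""CREATE TABLE emp(emp_id INT PRIMARY KEY, emp_name CHAR(50),
               dept_id INT,
               FOREIGN KEY (dept_id) REFERENCES dept(dept_id))""")

cur.execute("INSERT INTO dept VALUES (1, 'Sales')")
cur.execute("INSERT INTO emp VALUES (100, 'Ada', 1)")   # valid: dept 1 exists
con.commit()
```

With the relationship in place, any attempt to insert an emp row whose dept_id has no matching dept row is rejected by the database.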
Connect to an in-memory database using the sqlite3 module. Create the tables from the example ER diagram above. After creating each table, load it with data from the respective data file: customer.csv, product.csv, order.csv. Use the pandas module and apply the data loading operations learned in the previous exercise. Print the first row of each table after loading the data.
Datatypes of columns: cust_id, prod_id, prod_price, order_id and order_amount can all be assumed to be integers (INT). cust_name, cust_payment_mode and prod_name can be assigned a CHAR(50) string data type.
Assume that cust_id, order_id and prod_id are the primary keys of the customer, purchaseorder and product tables respectively. Also assume that cust_id and prod_id are foreign keys in the purchaseorder table. Establish these relationships during table creation, within the CREATE query.
Print out all the contents of the 'purchaseorder' table using a 'SELECT * FROM' query. Store the query output in a 'po_out' variable.
Points to remember:
# Connecting to an in-memory database
import sqlite3
import pandas as pd
shopcon = sqlite3.connect(':memory:')
shopcur = shopcon.cursor()
# Modify the create statements below
shopcur.execute('''CREATE TABLE customer()''')
shopcustdf = pd.read_csv('https://raw.githubusercontent.com/colaberry/data/master/SQL/customer.csv',sep=",")
shopcustdf.to_sql(name='customer',con=shopcon,if_exists='append',index=False)
shopcur.execute('''CREATE TABLE product()''')
shopproddf = pd.read_csv('https://raw.githubusercontent.com/colaberry/data/master/SQL/product.csv',sep=",")
shopproddf.to_sql(name='product',con=shopcon,if_exists='append',index=False)
shopcur.execute('''CREATE TABLE purchaseorder()''')
shoporderdf = pd.read_csv('https://raw.githubusercontent.com/colaberry/data/master/SQL/order.csv',sep=",")
shoporderdf.to_sql(name='purchaseorder',con=shopcon,if_exists='append',index=False)
shopcur.execute("SELECT * FROM purchaseorder")
po_out = shopcur.fetchall()
# Refer to any online SQL tutorial to learn how to frame CREATE queries that declare primary and foreign keys
shopcur.execute('''CREATE TABLE customer(cust_id INT PRIMARY KEY NOT NULL, cust_name CHAR(50), cust_payment_mode CHAR(50))''')
shopcustdf = pd.read_csv('https://raw.githubusercontent.com/colaberry/data/master/SQL/customer.csv',sep=",")
shopcustdf.to_sql(name='customer',con=shopcon,if_exists='append',index=False)
shopcur.execute('''CREATE TABLE product(prod_id INT PRIMARY KEY NOT NULL, prod_name CHAR(50), prod_price INT)''')
shopproddf = pd.read_csv('https://raw.githubusercontent.com/colaberry/data/master/SQL/product.csv',sep=",")
shopproddf.to_sql(name='product',con=shopcon,if_exists='append',index=False)
shopcur.execute('''CREATE TABLE purchaseorder(order_id INT PRIMARY KEY NOT NULL, cust_id INT NOT NULL, prod_id INT NOT NULL,
order_amount INT, FOREIGN KEY (cust_id) REFERENCES customer(cust_id), FOREIGN KEY (prod_id) REFERENCES product(prod_id))''')
shoporderdf = pd.read_csv('https://raw.githubusercontent.com/colaberry/data/master/SQL/order.csv',sep=",")
shoporderdf.to_sql(name='purchaseorder',con=shopcon,if_exists='append',index=False)
shopcur.execute("SELECT * FROM purchaseorder")
po_out = shopcur.fetchall()
ref_tmp_var = False
try:
    test = [(5005, 1000, '3002, 3005', 18),
            (5006, 1003, '3008, 3009, 3005', 38),
            (5007, 1002, '3003, 3001', 35),
            (5008, 1003, '3004, 3009', 18),
            (5009, 1002, '3007, 3003, 3001', 53),
            (5010, 1002, '3008, 3010', 26)]
    if test == po_out:
        ref_assert_var = True
        ref_tmp_var = True
    else:
        ref_assert_var = False
        print('Please follow the instructions given and use the same variables provided in the instructions.')
except Exception:
    print('Please follow the instructions given and use the same variables provided in the instructions.')

assert ref_tmp_var
When a foreign key relationship is created between two tables, referential integrity can be enforced. Referential integrity ensures that no row in the table containing the foreign key holds a value that does not exist as a valid primary key value in the referenced table. In the current example, the purchaseorder table will not accept the insertion of a row with a cust_id (or prod_id) value that does not exist in the customer (or product) table. A similar rule applies to deletion: no row in the customer (or product) table can be deleted if its cust_id (or prod_id) appears as a foreign key value in the purchaseorder table.
To verify that we have created the relationships, we can test the referential integrity of the tables. Note that by default sqlite does not enforce foreign key constraints (and thereby referential integrity); this feature must be explicitly turned on for each sqlite session using the command 'PRAGMA foreign_keys=ON'.
Let us test the referential integrity of the data model created above. Insert the following record, which has a non-existent cust_id and prod_id. If the foreign key relationships were properly created in the previous steps, the error message 'FOREIGN KEY constraint failed' should be displayed. Otherwise, the relationships were not created properly and that exercise needs to be revisited.
shopcur.execute("PRAGMA foreign_keys=ON")
try:
    shopcur.execute("INSERT INTO purchaseorder (order_id, cust_id, prod_id, order_amount) VALUES (5011,1030,3040,50)")
    raise Exception("Insert Successful. Creation of Relationship failed")
except Exception as x:
    storemsg = x
    print(x)
ref_tmp_var = False
try:
    if str(storemsg) == 'FOREIGN KEY constraint failed':
        ref_assert_var = True
        ref_tmp_var = True
    else:
        ref_assert_var = False
        print('Please follow the instructions given and use the same variables provided in the instructions.')
except Exception:
    print('Please follow the instructions given and use the same variables provided in the instructions.')

assert ref_tmp_var
There are three types of data models: the conceptual model, the logical model and the physical model.
A data warehouse is a type of database that stores data in a slightly different way, in order to enable business analysis. A database answers transactional queries and is therefore said to enable OLTP (Online Transaction Processing). A data warehouse is created over an existing database to enable and support OLAP (Online Analytical Processing) queries. OLTP queries provide data to business users in support of operational decision making, while OLAP queries support their strategic decision making.
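To make the OLTP/OLAP distinction concrete, here is a small sketch in sqlite3 (the sales table and its data are invented for illustration): an OLTP query looks up an individual transaction, while an OLAP query aggregates across many transactions.

```python
import sqlite3

con = sqlite3.connect(':memory:')
cur = con.cursor()
cur.execute("CREATE TABLE sales(sale_id INT PRIMARY KEY, region CHAR(50), amount INT)")
cur.executemany("INSERT INTO sales VALUES (?,?,?)",
                [(1, 'East', 100), (2, 'West', 250), (3, 'East', 150)])

# OLTP-style query: operational lookup of a single transaction
cur.execute("SELECT amount FROM sales WHERE sale_id=2")
print(cur.fetchone())    # (250,)

# OLAP-style query: aggregate sales across a dimension (region)
cur.execute("SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region")
print(cur.fetchall())    # [('East', 250), ('West', 250)]
```

In practice OLAP workloads run against a warehouse schema designed for aggregation, but the query shapes differ in exactly this way: point lookups versus grouped aggregates.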
Dimensional modeling is the way data is perceived and modelled in a data warehouse. Quantitative variables are generally called 'measures'. Variables that govern or define a specific value (record) of a measure are called 'dimensions'. For example, the sale amount of a transaction is a measure, whereas the time of occurrence of the sale transaction, the type of sale (say cash sale or credit sale) and the location where the sale occurred are all dimensions of the sale. A data warehouse usually consists of multiple dimension tables connected to a central fact table that holds the measures. The connection of these dimension tables to their fact table results in a star or snowflake schema.
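The sale example above can be sketched as a star schema in sqlite3 (all table and column names here are illustrative): one dimension table each for time, sale type and location, and a central fact table whose foreign keys point at them.

```python
import sqlite3

con = sqlite3.connect(':memory:')
cur = con.cursor()
cur.execute("PRAGMA foreign_keys=ON")

# Dimension tables: each describes one aspect of a sale
cur.execute("CREATE TABLE dim_time(time_id INT PRIMARY KEY, sale_date CHAR(50))")
cur.execute("CREATE TABLE dim_type(type_id INT PRIMARY KEY, sale_type CHAR(50))")
cur.execute("CREATE TABLE dim_location(loc_id INT PRIMARY KEY, city CHAR(50))")

# Fact table: holds the measure (sale_amount) plus a foreign key per dimension
cur.execute("""CREATE TABLE fact_sales(
    sale_id INT PRIMARY KEY,
    time_id INT, type_id INT, loc_id INT,
    sale_amount INT,
    FOREIGN KEY (time_id) REFERENCES dim_time(time_id),
    FOREIGN KEY (type_id) REFERENCES dim_type(type_id),
    FOREIGN KEY (loc_id) REFERENCES dim_location(loc_id))""")
```

Because every dimension connects directly to the single fact table, the diagram of this schema radiates outward like a star; normalizing the dimension tables further would turn it into a snowflake schema.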
If you would like to learn more about data warehouses (concepts and implementation), refer to the certified course from Colaberry: http://www.colaberry.com/bootcamp/sqlbi
ref_tmp_var = False
ref_tmp_var = True
assert ref_tmp_var