The data
The data in the example cases have these columns:
time
engine heat
in celcius degrees. An increase of 1 in the time
column equals 1 hour. Each case has 3 days of data.
The value in the engine heat
column increases by time when the engine is running. Once the temperature raises to a certain level, the system automatically switces to secondary engine.
The case
This info should be extracted from the data in this imaginary case:
For one reason or another the engine temperature sensor is the only available information. Because of the system limitations SQL is the only possible analytics tool.
This notebook is loosely inspired by an actual business need, but the data and examples are generalization of the problem.
Load libraries and create variables.
#Import custom python files
import module as m
import config as c
#Reload custom libraries in case of changes
import importlib
importlib.reload(m)
importlib.reload(c)
#Generate a temporary in memory database
import sqlite3
con = sqlite3.connect(":memory:")
#Data analysis libraries
import pandas as pd
import numpy as np
#Initialize plotly for notebooks
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)
This is the base data for all of the cases.
#Display the data frame
df_init = m.generate_initial_data()
print("Rows in the data: {}".format(df_init.shape[0]))
display(df_init[7:17])
In this case we make an expectation that the engine heat always increases until the engine is switched.
The clusters can thus be detected only by observing the previous data point.
#Generate data for case1 from the initial data
df_case1 = df_init.copy()
df_case1 = m.generate_engine_heat_1(df_case1)
#Plot the data
m.plot_heat_2d(df_case1, plot_title="Case 1: Known clusters")
#Write to sqlite database having only the time and heat columns
df_case1[[c.col_time, 'engine_heat']].to_sql(c.tbl_case1, con, if_exists="replace", index=False)
Start a new cluster if previous value is greater than the previous.
#The query to get all rows labled with cluster id
sql_case1 = """
--Cumulative sum to generate cluster id for each row
SELECT *, SUM(is_new_cluster) OVER (ORDER BY {0} RANGE UNBOUNDED PRECEDING) AS cluster_id
FROM(
--Creates a column that indicates whether the row starts a new cluster
SELECT *, CASE WHEN ((engine_heat - engine_heat_prev) < 0) OR (engine_heat_prev IS NULL) THEN TRUE ELSE 0 END AS is_new_cluster
FROM(
--Creates a column for previous engine heat value
SELECT {0}, {1},
LAG({1}, 1, null) OVER (ORDER BY {0}) AS engine_heat_prev
FROM {2}
)
)
""".format(c.col_time, "engine_heat", c.tbl_case1)
print(sql_case1)
#Use the query to read the data from database
df_case1_clustered = pd.read_sql(sql_case1, con)
#Show the first rows of the data
df_case1_clustered.head(15)
#A query to get aggregated results for each cluster
sql_case1_agg = """
--Count the number of observations in each cluster
SELECT {0}, COUNT({0}) AS rows_n
FROM (
{1}
)
GROUP BY {0}
""".format(c.col_cluster_id, sql_case1)
print(sql_case1_agg)
#Print the report showing the observation count in each cluster
df_case1_clustered_agg = pd.read_sql(sql_case1_agg, con)
display(df_case1_clustered_agg)
This visualization shows the dected clusters by our simple SQL algorithm.
If the chart looks the same than original visualization, the clustering was successful.
#Show the visualization
m.plot_heat_2d(df_case1_clustered, plot_title="Case 1: Clusters created by SQL")
Even the machine learning backed clustering algorithms struggle with unclear boundaries. Let's add some noise to the data and see how the SQL clustering manages this challenge. This is done by making the engine heat
go randomly up or down.
In this example there will be two features for the clustering: engine heat 1
and engine heat 2
. It's impossible to do the clustering by just a single variable. Instead, we know that if the combined heat increases, the observation belongs to the same cluster than the previous one.
#Generate data for case2 from the initial data
df_case2 = df_init.copy()
df_case2 = m.generate_engine_heat_2(df_case2, col_1="engine_heat_1", col_2="engine_heat_2")
#Plot engine heat 1 and 2 by the time
m.plot_heat_2d(df_case2, col_y="engine_heat_1", plot_title="Case 2: Known clusters for engine 1")
m.plot_heat_2d(df_case2, col_y="engine_heat_2", plot_title="Case 2: Known clusters for engine 2")
#Plot both engine heats by time in 3D plot
importlib.reload(m)
importlib.reload(c)
m.plot_heat_3d(df_case2, plot_title="Case 2: Both engine heats by time")
#Write to sqlite database having only the time and heat columns
df_case2[[c.col_time, 'engine_heat_1', 'engine_heat_2']].to_sql(c.tbl_case2, con, if_exists="replace", index=False)
The most of the logic is similar compared to the simple case 1.
The biggest difference in case 2 is that a new cluster starts if the total heat increases.
#The query to get all rows labled with cluster id
sql_case2 = """
--Cumulative sum to generate cluster id for each row
SELECT *, SUM(is_new_cluster) OVER (ORDER BY {0} RANGE UNBOUNDED PRECEDING) AS cluster_id
FROM(
--Creates a column that indicates whether the row starts a new cluster
SELECT *, CASE WHEN (tot_diff_prev < 0) OR (tot_diff_prev IS NULL) THEN TRUE ELSE 0 END AS is_new_cluster
FROM(
--Creates a column for difference between the current and the previous sum of engines heats
SELECT *, (engine_heat_1 - eh_1_prev) + (engine_heat_2 - eh_2_prev) AS tot_diff_prev
FROM(
SELECT {0}, {1}, {2},
LAG({1}, 1, null) OVER (ORDER BY {0}) AS eh_1_prev,
LAG({2}, 1, null) OVER (ORDER BY {0}) AS eh_2_prev
FROM {3}
)
)
)
""".format(c.col_time, "engine_heat_1", "engine_heat_2", c.tbl_case2)
print(sql_case2)
#Use the SQL query to get the clusters for the case 2
df_case2_clustered = pd.read_sql(sql_case2, con)
df_case2_clustered.head(15)
#A query to get aggregated results for each cluster
sql_case2_agg = """
--Count the number of observations in each cluster
SELECT {0}, COUNT({0}) AS rows_n
FROM (
{1}
)
GROUP BY {0}
""".format(c.col_cluster_id, sql_case2)
print(sql_case2_agg)
#Print the report showing the observation count in each cluster
df_case2_clustered_agg = pd.read_sql(sql_case2_agg, con)
display(df_case2_clustered_agg)
m.plot_heat_2d(df_case2_clustered, col_y="engine_heat_1", plot_title="Case 2: Clusters created by SQL - Plot for engine 1")
m.plot_heat_2d(df_case2_clustered, col_y="engine_heat_2", plot_title="Case 2: Clusters created by SQL - Plot for engine 2")