PySpark Demo Plotly Notebook

Demo

  1. Setup Spark
  2. Load Kaggle Data
  3. Transform Data with Spark SQL
  4. Graph Data with Plotly

Setup Spark

Setup Spark SparkSession

In [1]:
! pip install plotly --quiet
In [2]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
In [3]:
working_directory = '/home/garystafford/work/'

spark = SparkSession \
    .builder \
    .appName('pyspark_demo_app') \
    .config('spark.driver.extraClassPath',
            working_directory + 'postgresql-42.2.5.jar') \
    .master("local[*]") \
    .getOrCreate()

Load Kaggle Data

Load the Kaggle dataset from the CSV file, containing ~21K records, into a DataFrame

In [4]:
bakery_schema = StructType([
    StructField('date', StringType(), True),
    StructField('time', StringType(), True),
    StructField('transaction', IntegerType(), True),
    StructField('item', StringType(), True)
])

df_bakery1 = spark.read \
    .format("csv") \
    .option("header", "true") \
    .load("BreadBasket_DMS.csv", schema=bakery_schema)

df_bakery1.show(10)
df_bakery1.count()
+----------+--------+-----------+-------------+
|      date|    time|transaction|         item|
+----------+--------+-----------+-------------+
|2016-10-30|09:58:11|          1|        Bread|
|2016-10-30|10:05:34|          2| Scandinavian|
|2016-10-30|10:05:34|          2| Scandinavian|
|2016-10-30|10:07:57|          3|Hot chocolate|
|2016-10-30|10:07:57|          3|          Jam|
|2016-10-30|10:07:57|          3|      Cookies|
|2016-10-30|10:08:41|          4|       Muffin|
|2016-10-30|10:13:03|          5|       Coffee|
|2016-10-30|10:13:03|          5|       Pastry|
|2016-10-30|10:13:03|          5|        Bread|
+----------+--------+-----------+-------------+
only showing top 10 rows

Out[4]:
21293

Transform Data with Spark SQL

Transform the DataFrame's bakery data using Spark SQL

In [5]:
df_bakery1.createOrReplaceTempView("bakery_table_tmp1")

df_bakery2 = spark.sql("SELECT date, transaction, item " +
                       "FROM bakery_table_tmp1 " +
                       "WHERE item NOT LIKE 'NONE'" +
                       "ORDER BY transaction")
df_bakery2.show(5)
df_bakery2.count()
+----------+-----------+-------------+
|      date|transaction|         item|
+----------+-----------+-------------+
|2016-10-30|          1|        Bread|
|2016-10-30|          2| Scandinavian|
|2016-10-30|          2| Scandinavian|
|2016-10-30|          3|Hot chocolate|
|2016-10-30|          3|          Jam|
+----------+-----------+-------------+
only showing top 5 rows

Out[5]:
20507
In [6]:
df_bakery2.createOrReplaceTempView("bakery_table_tmp2")

df_bakery3 = spark.sql("SELECT date, count(*) as count " +
                       "FROM bakery_table_tmp2 " +
                       "WHERE date >= '2017-01-01' " +
                       "GROUP BY date " +
                       "ORDER BY date")
df_bakery3.show(5)
df_bakery3.count()
+----------+-----+
|      date|count|
+----------+-----+
|2017-01-01|    1|
|2017-01-03|   87|
|2017-01-04|   76|
|2017-01-05|   95|
|2017-01-06|   84|
+----------+-----+
only showing top 5 rows

Out[6]:
98

Use Plotly to Visualize Data

Use Plotly to create a chart showing bakery items sold over time
Demostrates linear fit and data smoothing:

In [8]:
import plotly.plotly as py
import plotly.graph_objs as go
from numpy import arange, array, ones
from scipy import stats, signal

# print(plotly.__version__)

df_bakery4 = df_bakery3.toPandas()

# Generated linear fit
xi = arange(0, len(df_bakery4.index))
slope, intercept, r_value, p_value, std_err = stats.linregress(xi, df_bakery4['count'])
line = slope * xi + intercept

layout = dict(title='2017 Bakery Sales',
              xaxis=dict(
                  title='Month',
                  showgrid=True,
                  zeroline=True,
                  showline=True,
                  ticks='outside',
                  tickangle=45,
                  showticklabels=True),
              yaxis=dict(
                  title='Items Sold/Day',
                  showgrid=True,
                  zeroline=True,
                  showline=True,
                  ticks='outside',
                  showticklabels=True))

trace1 = go.Bar(x=df_bakery4['date'], y=df_bakery4['count'], name='Items Sold')
trace2 = go.Scatter(x=df_bakery4['date'], y=line, mode='lines', name='Linear Fit')
trace3 = go.Scatter(x=df_bakery4['date'], y=signal.savgol_filter(df_bakery4['count'], 53, 3),
                    mode='lines', name='Savitzky-Golay')
data = [trace1, trace2, trace3]
fig = dict(data=data, layout=layout)
py.iplot(fig, filename='jupyter-basic_bar.html')
Out[8]: