Spark Magic

BeakerX has a Spark magic that provides deeper integration with Spark. It provides a GUI dialog for connecting to a cluster and a progress meter that shows how your job is progressing and links to the regular Spark UI. It forwards kernel interrupt messages to the cluster so you can stop a job without leaving the notebook, and it automatically displays Datasets using an interactive widget. Finally, it automatically closes the Spark session when the notebook is closed.

The Spark magic is alpha quality.

In [ ]:

%%classpath add mvn
org.apache.spark spark-sql_2.11 2.2.1

The spark cell magic can be run all by itself in a cell. It produces a GUI dialog you fill out to connect to your cluster.

In [ ]:

%%spark

Optionally, the cell can contain a SparkSession builder whose settings fill in the default values for the GUI. Only one spark magic can be active at a time.

In [ ]:

%%spark
SparkSession.builder()
      .appName("BeakerX Demo")
      .master("local[4]")

You can also provide the --connect (or -c) option to connect to the cluster automatically.

In [ ]:

%%spark --connect
SparkSession.builder().master("local[100]")
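
Once the magic has connected, the session is available as `spark`, as the cells that follow rely on. A minimal sketch of a sanity check, assuming the `%%spark --connect` cell above has already run in this notebook; it uses only standard `SparkSession` and `SparkContext` accessors and is not part of the original notebook:

```scala
// Assumes the %%spark cell above has connected and bound the `spark` variable.
// Print a few properties of the active session to confirm what it is connected to.
println("Spark version: " + spark.version)
println("Master:        " + spark.sparkContext.master)
println("App name:      " + spark.sparkContext.appName)
```

Checking the master URL this way can help confirm that the builder settings in the cell, rather than some other configuration, were actually applied.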

In [ ]:

val NUM_SAMPLES = 10000000

val count2 = spark.sparkContext.parallelize(1 to NUM_SAMPLES).map{i =>
  val x = Math.random()
  val y = Math.random()
  if (x*x + y*y < 1) 1 else 0
}.reduce(_ + _)

println("Pi is roughly " + 4.0 * count2 / NUM_SAMPLES)
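
The estimate works because a uniformly random point in the unit square falls inside the quarter circle of radius 1 with probability π/4, so four times the observed fraction approximates π. As a rough illustration, the same computation can be run locally without Spark. The `estimatePi` helper and the fixed seed below are hypothetical names introduced for this sketch, not part of BeakerX or the notebook:

```scala
import scala.util.Random

// Hypothetical helper: estimates Pi by sampling points in the unit square
// and counting how many fall inside the quarter circle of radius 1.
def estimatePi(numSamples: Int, seed: Long): Double = {
  val rng = new Random(seed) // fixed seed so repeated runs give the same estimate
  val inside = (1 to numSamples).count { _ =>
    val x = rng.nextDouble()
    val y = rng.nextDouble()
    x * x + y * y < 1
  }
  4.0 * inside / numSamples
}

val localEstimate = estimatePi(1000000, seed = 42L)
println("Local estimate of Pi: " + localEstimate)
```

Comparing the local result with the Spark job's output is a quick way to confirm that the distributed version is behaving as expected. The Spark version uses `Math.random()`, which is unseeded, so its result varies from run to run.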

By default, the first 1000 rows are materialized to preview a Dataset.

In [ ]:

val tornadoesPath = java.nio.file.Paths.get("../resources/data/tornadoes_2014.csv").toAbsolutePath()

val ds = spark.read.format("csv").option("header", "true").load("file://" + tornadoesPath)
ds

Alternatively, use the display method to specify how many rows to show.

In [ ]:

ds.display(1)