Spark Magic

BeakerX has a Spark magic that provides deeper integration with Spark. It provides a GUI dialog for connecting to a cluster, a progress meter that shows how your job is progressing along with links to the regular Spark UI, and it forwards kernel interrupt messages to the cluster so you can stop a job without leaving the notebook. It also automatically displays Datasets with an interactive widget, and it automatically closes the Spark session when the notebook is closed.

To use it, first add the Spark libraries to the kernel classpath:

In [ ]:
%%classpath add mvn
org.apache.spark spark-sql_2.11 2.2.1

The %%spark cell magic can be run by itself in a cell. It produces a GUI dialog that you fill out to connect to your cluster.

In [ ]:
%%spark

Optionally, the contents of the cell can provide a SparkSession builder that supplies default values for the GUI. Only one Spark magic can be connected at a time.

In [ ]:
%%spark
SparkSession.builder()
      .appName("BeakerX Demo")
      .master("local[4]")

You can also pass the --connect (or -c) option to connect to the cluster automatically.

In [ ]:
%%spark --connect
SparkSession.builder().master("local[100]")
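
Arbitrary Spark properties can also be set on the builder with the standard .config call; the property and value in this sketch are just an illustration, not part of the original demo:

In [ ]:
%%spark --connect
SparkSession.builder()
      .appName("BeakerX Demo")
      .master("local[4]")
      .config("spark.sql.shuffle.partitions", "8")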

If you have added JARs to the classpath of the Spark driver with the %classpath magic, they can be copied to the executors as follows. We are looking into making this automatic, and into supporting spark.jars.packages as well; see #7498.

In [ ]:
%%spark
val jars = ClasspathManager.getJars().toArray.mkString(",")
SparkSession.builder().config("spark.jars", jars)
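
These pieces compose: one cell can name the application, choose the master, ship the classpath JARs, and connect in a single step. This particular combination is a sketch, not from the original notebook:

In [ ]:
%%spark --connect
// Ship the JARs added with the classpath magic to the executors as well.
val jars = ClasspathManager.getJars().toArray.mkString(",")
SparkSession.builder()
      .appName("BeakerX Demo")
      .master("local[4]")
      .config("spark.jars", jars)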

After you start a session with a Spark cluster using one of the above configurations, code like the following runs in parallel without any additional annotation. A three-way progress widget appears automatically, showing how many tasks are waiting, running, and completed.

In [ ]:
val NUM_SAMPLES = 10000000

// Monte Carlo estimate of Pi: sample random points in the unit square and count those inside the unit circle.
val count2 = spark.sparkContext.parallelize(1 to NUM_SAMPLES).map{i =>
  val x = Math.random()
  val y = Math.random()
  if (x*x + y*y < 1) 1 else 0
}.reduce(_ + _)

println("Pi is roughly " + 4.0 * count2 / NUM_SAMPLES)

By default, the Dataset preview shows just the columns and their types. You can click a button to materialize ten rows.

In [ ]:
// Resolve the CSV path to an absolute file:// URL for Spark.
val tornadoesPath = java.nio.file.Paths.get("../resources/data/tornadoes_2014.csv").toAbsolutePath()

val ds = spark.read.format("csv").option("header", "true").load("file://" + tornadoesPath)
ds

Or you can use the display method to specify any number of rows.

In [ ]:
ds.display(1000)
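
The session is closed automatically when the notebook is closed, but you can also end it yourself at any time with the standard SparkSession call (not specific to the magic):

In [ ]:
spark.stop()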