Spark Magic

BeakerX has a Spark magic that provides deeper integration with Spark. It provides a GUI dialog for connecting to a cluster and a progress meter that shows how your job is progressing and links to the regular Spark UI. It forwards kernel interrupt messages to the cluster so you can stop a job without leaving the notebook, and it automatically displays Datasets using an interactive widget. Finally, it automatically closes the Spark session when the notebook is closed.

It is compatible with Spark version 2.x.

In [ ]:
%%classpath add mvn
org.apache.spark spark-sql_2.11 2.2.1

The spark cell magic can be run all by itself in a cell. It produces a GUI dialog you fill out to connect to your cluster.

In [ ]:
%%spark

Optionally, the contents of the cell can produce a SparkSession builder whose settings fill in the default values for the GUI. Only one Spark magic can be connected at a time.

In [ ]:
%%spark
SparkSession.builder()
      .appName("BeakerX Demo")
      .master("local[4]")

You can also provide a --start (or -s) option to automatically start a session with a cluster (or a local instance).

In [ ]:
%%spark --start
SparkSession.builder().master("local[100]")
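
In a master URL of the form local[N], Spark runs inside the driver process with N worker threads, and local[*] uses one thread per logical core. As a minimal sketch in plain Scala, the string can be derived from the machine's core count instead of being hard-coded. The helper name localMaster is hypothetical and is not part of BeakerX or Spark.

```scala
// Hypothetical helper (not part of BeakerX or Spark): builds a local master URL.
def localMaster(threads: Int): String = {
  require(threads > 0, "threads must be positive")
  s"local[$threads]"
}

// Size the local instance to the number of logical cores on this machine.
val cores = Runtime.getRuntime.availableProcessors()
val master = localMaster(cores)
println(master)
```

The resulting string could then be passed to SparkSession.builder().master(master) in a %%spark cell.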

If you have added JARs to the classpath of the Spark driver with the %classpath magic, they can be copied to the executors as follows. We are looking into making this automatic and into supporting spark.jars.packages; see #7498.

In [ ]:
%%spark
val jars = ClasspathManager.getJars().toArray.mkString(",")
SparkSession.builder().config("spark.jars", jars)
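
The spark.jars setting takes a comma-separated list of JAR paths, which is what the mkString(",") call above builds. The following plain-Scala sketch shows the resulting format. The two paths are hypothetical stand-ins for whatever ClasspathManager.getJars() returns in your notebook.

```scala
// Hypothetical JAR paths; in the notebook these come from ClasspathManager.getJars().
val jarPaths = Seq("/tmp/libs/first.jar", "/tmp/libs/second.jar")

// Join the paths with commas, the format the spark.jars setting expects.
val jars = jarPaths.mkString(",")
println(jars)
```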

After starting a session with a Spark cluster using one of the above configurations, code like the following runs in parallel without any additional annotation. A three-way progress widget automatically appears, showing how many tasks are waiting, running, and completed.

In [ ]:
val NUM_SAMPLES = 10000000
val random = new scala.util.Random()
val count = spark.sparkContext.parallelize(1 to NUM_SAMPLES).map{i =>
  val x = random.nextDouble()
  val y = random.nextDouble()
  if (x*x + y*y < 1) 1 else 0
}.reduce(_ + _)

println("Pi is roughly " + 4.0 * count / NUM_SAMPLES)
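
For comparison, here is a sequential sketch of the same Monte Carlo estimate in plain Scala, with no Spark session required. It may help to see what each task computes: the fraction of random points in the unit square that fall inside the quarter circle, multiplied by 4. The function name estimatePi and the fixed seed are assumptions made for this illustration, and the sample count is reduced so it runs quickly on one thread.

```scala
import scala.util.Random

// Sequential version of the estimate above; the seed is an arbitrary choice for repeatability.
def estimatePi(samples: Int, seed: Long): Double = {
  val random = new Random(seed)
  // Count the sample points that land inside the quarter circle of radius 1.
  val inside = (1 to samples).count { _ =>
    val x = random.nextDouble()
    val y = random.nextDouble()
    x * x + y * y < 1
  }
  4.0 * inside / samples
}

val estimate = estimatePi(1000000, 42L)
println("Pi is roughly " + estimate)
```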

By default the Dataset preview shows just the columns and their types. You can click a button to materialize ten rows.

In [ ]:
val tornadoesPath = java.nio.file.Paths.get("../resources/data/tornadoes_2014.csv").toAbsolutePath()

val ds = spark.read.format("csv").option("header", "true").load(tornadoesPath.toString())
ds

Or you can use the display method to specify any number of rows.

In [ ]:
ds.display(1000)