BeakerX has a Spark magic that provides deeper integration with Spark. It provides a GUI dialog for connecting to a cluster and a progress meter that shows how your job is progressing and links to the regular Spark UI. It forwards kernel interrupt messages to the cluster so you can stop a job without leaving the notebook, and it automatically displays Datasets using an interactive widget. Finally, it automatically closes the Spark session when the notebook is closed.
%%classpath add mvn
org.apache.spark spark-sql_2.11 2.2.1
The spark cell magic can be run all by itself in a cell. It produces a GUI dialog you fill out to connect to your cluster.
%%spark
Optionally, the contents of the cell can produce a Spark session to fill out default values for the GUI. Only one spark magic can be connected at a time.
%%spark
SparkSession.builder()
  .appName("BeakerX Demo")
  .master("local[4]")
You can also provide a --connect (or -c) option to automatically connect to the cluster.
%%spark --connect
SparkSession.builder().master("local[100]")
If you have added JARs to the classpath of the Spark driver with the %classpath magic, they can be copied to the executors as follows. We are looking into making this automatic, and also supporting spark.jars.packages; see #7498.
%%spark
val jars = ClasspathManager.getJars().toArray.mkString(",")
SparkSession.builder().config("spark.jars", jars)
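To make the shape of the spark.jars value concrete, here is a minimal sketch in plain Scala that runs without Spark or BeakerX. The object name, the helper toSparkJars, and the JAR paths are hypothetical stand-ins for what ClasspathManager.getJars() would return; the only point illustrated is that spark.jars takes a comma-separated list of paths.

```scala
object SparkJarsSketch {
  // Join JAR paths into the comma-separated form that spark.jars expects.
  def toSparkJars(paths: Seq[String]): String = paths.mkString(",")

  def main(args: Array[String]): Unit = {
    // Hypothetical paths standing in for ClasspathManager.getJars()
    val jars = Seq("/tmp/libs/first.jar", "/tmp/libs/second.jar")
    println(toSparkJars(jars))
  }
}
```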
After starting a session with a Spark cluster using one of the above configurations, code like the following runs in parallel without any additional annotation. A three-way progress widget automatically appears, showing how many tasks are waiting, running, and completed.
val NUM_SAMPLES = 10000000
val count2 = spark.sparkContext.parallelize(1 to NUM_SAMPLES).map { i =>
  val x = Math.random()
  val y = Math.random()
  if (x*x + y*y < 1) 1 else 0
}.reduce(_ + _)
println("Pi is roughly " + 4.0 * count2 / NUM_SAMPLES)
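For readers who want to see what the job computes before running it on a cluster, the same Monte Carlo estimate can be sketched with plain Scala collections. This is only an illustration, not part of BeakerX: the object name PiSketch, the helper estimatePi, the smaller sample count, and the seeded random generator are all assumptions made so the sketch runs quickly and repeatably on one machine.

```scala
import scala.util.Random

object PiSketch {
  // Sample points uniformly in the unit square and count those that land
  // inside the quarter circle of radius 1. That fraction approximates Pi / 4.
  def estimatePi(numSamples: Int, rng: Random): Double = {
    val inside = (1 to numSamples).count { _ =>
      val x = rng.nextDouble()
      val y = rng.nextDouble()
      x * x + y * y < 1
    }
    4.0 * inside / numSamples
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical sample count and seed, chosen for a quick local run
    val estimate = estimatePi(100000, new Random(42))
    println("Pi is roughly " + estimate)
  }
}
```

The Spark version distributes the same per-sample test across executors with map and sums the results with reduce, which is why it needs no extra annotation to run in parallel.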
By default the Dataset preview shows just the columns and their types. You can click a button to materialize ten rows.
val tornadoesPath = java.nio.file.Paths.get("../resources/data/tornadoes_2014.csv").toAbsolutePath()
val ds = spark.read.format("csv").option("header", "true").load("file://" + tornadoesPath)
ds
Or you can use the display method to specify any number of rows.
ds.display(1000)