The current setup allows you to execute PySpark operations on a local standalone Spark instance, which can be used for testing with small datasets.
In the future, SWAN users will be able to attach external Spark clusters to their notebooks so they can target bigger datasets. Moreover, a Scala Jupyter kernel will be added so that Spark can also be used from Scala.
The pyspark module is available to perform the necessary imports.
from pyspark import SparkContext
A SparkContext needs to be created before running any Spark operation. This context is linked to the local Spark instance.
sc = SparkContext()
Let's use our SparkContext to parallelize a list.
rdd = sc.parallelize([1, 2, 4, 8])
We can count the number of elements in the list.
We can map a function over our RDD to increment all its elements.
rdd.map(lambda x: x + 1).collect()
[2, 3, 5, 9]
We can also calculate the sum of all the elements with
rdd.reduce(lambda x, y: x + y)
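On this RDD the reduction yields 15. Since reduce applies an associative, commutative binary function pairwise across the elements, on a single machine its result matches Python's functools.reduce; a local sketch of what the action computes, using the same data:

```python
from functools import reduce

data = [1, 2, 4, 8]
# Spark's rdd.reduce(lambda x, y: x + y) combines the elements
# pairwise; on local data this is equivalent to functools.reduce.
total = reduce(lambda x, y: x + y, data)
print(total)  # 15
```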