#!/usr/bin/env python
# coding: utf-8

# ![Spark Logo](http://spark-mooc.github.io/web-assets/images/ta_Spark-logo-small.png)
# # **Simple example with Spark**
#
# This notebook illustrates the use of [Spark](https://spark.apache.org) in [SWAN](http://swan.web.cern.ch).
#
# The current setup allows you to execute [PySpark](http://spark.apache.org/docs/latest/api/python/) operations on a local standalone Spark instance, which is useful for testing with small datasets.
#
# In the future, SWAN users will be able to attach external Spark clusters to their notebooks so they can target bigger datasets. Moreover, a Scala Jupyter kernel will be added so that Spark can also be used from Scala.

# ## Import the necessary modules
#
# The `pyspark` module is available in SWAN, so we can import what we need from it directly.

# In[1]:

from pyspark import SparkContext


# ## Create a `SparkContext`
#
# A `SparkContext` needs to be created before running any Spark operation. In this setup, the context is linked to the local Spark instance.

# In[2]:

sc = SparkContext()


# ## Run Spark actions and transformations
#
# Let's use our `SparkContext` to parallelize a list into an RDD.

# In[13]:

rdd = sc.parallelize([1, 2, 4, 8])


# We can count the number of elements in the RDD.

# In[14]:

rdd.count()


# Let's now `map` a function over our RDD to increment all its elements. `map` is a lazy transformation, so nothing is computed until we call an action such as `collect`.

# In[15]:

rdd.map(lambda x: x + 1).collect()


# We can also calculate the sum of all the elements with `reduce`.

# In[16]:

rdd.reduce(lambda x, y: x + y)
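

# As a closing step, here is a minimal sketch (not part of the original example, and assuming the cells above have been run in the same session) that chains a lazy transformation with an action in a single expression and then stops the `SparkContext`. Stopping the context is optional here, but it releases the resources of the local Spark instance.

# In[ ]:

# Keep only the even elements (lazy `filter` transformation) and sum them;
# `reduce` is the action that actually triggers the computation.
even_sum = rdd.filter(lambda x: x % 2 == 0).reduce(lambda x, y: x + y)
print(even_sum)  # 2 + 4 + 8 = 14

# Stop the SparkContext once we are done with it.
sc.stop()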