I am currently using Spark 1.1.1 and IPython 2.3.1. You will also need to install py4j.
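py4j is what pyspark uses to talk to the JVM; one common way to install it, assuming pip is available, is:

pip install py4j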
Technically, we need to add spark/python to the search path so that we can load SparkContext and SparkConf. After that, pyspark will look up the $SPARK_HOME environment variable to find Spark so that a SparkContext instance can be created.
from IPython import display

## configure the Spark path
import os, sys
from os import path

SPARK_HOME = path.abspath("/home/dola/opt/spark-1.1.1/")
os.environ["SPARK_HOME"] = SPARK_HOME
sys.path.append(path.join(SPARK_HOME, "python/"))  ## find pyspark, needs $SPARK_HOME

## main entry point to pyspark
from pyspark import SparkContext
from pyspark import SparkConf
import pyspark

## optionally configure Spark settings
conf = SparkConf()
conf.set("spark.executor.memory", "32g")
conf.set("spark.cores.max", "28")
conf.setAppName("spark ipython-notebook")

## initialize the SparkContext against the local master
sc = SparkContext("local", conf=conf)

## embed the Spark web UI (the control panel) in the notebook
display.IFrame("http://localhost:4040", 1000, 300)
I put the code in the module startspark.py.
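As a minimal sketch (assuming startspark.py sits on the notebook's Python path and exposes the sc variable created above), a notebook cell can then bring up Spark with a single import and run a quick sanity check:

from startspark import sc
rdd = sc.parallelize(range(100))
rdd.sum()   ## should return 4950 if the context is working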
Please note: In practice, when running on a cluster, you will not want to hardcode master in the program, but rather launch the application with spark-submit and receive it there. However, for local testing and unit tests, you can pass “local” to run Spark in-process.
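For illustration only (the master URL and script name are placeholders), submitting a script to a cluster instead of hardcoding the master looks roughly like:

./bin/spark-submit --master spark://master-host:7077 my_app.py

Inside my_app.py you would then build the context as sc = SparkContext(conf=conf), letting spark-submit supply the master.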
An alternative is to use Spark's own bin/pyspark directly, as it has an option to start an IPython notebook; just run:
IPYTHON=1 ./bin/pyspark # to start ipython
OR
IPYTHON_OPTS="notebook" ./bin/pyspark # to start ipython notebook