import startspark
import pyspark
sc = startspark.create_spark_instance() # run once
The key abstraction of Spark is the *Resilient Distributed Dataset (RDD)*, which represents a distributed collection of items. An RDD can be acted on to calculate values, or transformed into new RDDs.
textfile = sc.textFile("startspark.py") ## construction
print textfile.count() ## action - number of items (lines)
print textfile.first() ## action - first item
22
import os, sys
## filter to transform the RDD to a new RDD
lineswithspark = textfile.filter(lambda line: "spark" in line)
print lineswithspark.first()
print lineswithspark.count()
SPARK_HOME = path.abspath("/home/dola/opt/spark-1.1.1/")
9
## combining transformations and actions can do a lot of things
## e.g., find the line with most words in a file - use reduce to find max
## map is a transform, reduce is an action
textfile.map(lambda line: len(line.split())).reduce(lambda a, b: a if a>=b else b)
7
## word count - "hello world" map reduce example
## unlike reduce, reduceByKey generates another RDD
word_counts = sc.textFile("README.md").flatMap(lambda line: line.split()) \
.map(lambda word: (word, 1)) \
.reduceByKey(lambda a, b: a+b)
print word_counts.collect()
[(u'and', 2), (u'useful', 1), (u'is', 1), (u'am', 2), (u'not', 1), (u'But', 1), (u'learning', 2), (u'go', 1), (u'plan', 1), (u'spark', 1), (u'are', 1), (u'for', 1), (u'how', 1), (u'with', 1), (u'least', 1), (u'machine', 1), (u'to', 2), (u'collections', 1), (u'tutorials', 1), (u'tasks.', 1), (u'include', 1), (u'(impyla).', 1), (u'sure', 1), (u'that', 1), (u'I', 3), (u'some', 2), (u'here', 1), (u'framework', 1), (u'preparing', 1), (u'across.', 1), (u'The', 1), (u'Impala', 1), (u'a', 1), (u'articles', 1), (u'about', 1), (u'this', 1), (u'of', 1), (u'yet.', 1), (u'at', 1), (u'came', 1), (u'Blaze', 1)]
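The collected pairs come back in no particular order; as a small follow-up sketch on the same word_counts RDD, takeOrdered can pull out the most frequent words without bringing everything to the driver:
## the 5 most frequent words, sorted by descending count
print word_counts.takeOrdered(5, key=lambda pair: -pair[1])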
Cluster-wide in-memory caching is one of Spark's biggest selling points. It is generally needed for hot data that will be accessed repeatedly in an iterative algorithm.
## it is easy to cache an RDD
textfile.cache() ## side effect instead of returning a new RDD
textfile.count()
22
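A minimal sketch of the kind of repeated access that benefits from caching (the keywords and loop below are illustrative, not part of the example above): after the first action materializes the cache, later passes read the RDD from memory instead of re-reading the file.
## each pass over the cached RDD is served from memory after the first action
for keyword in ["import", "spark", "def"]:
    print keyword, textfile.filter(lambda line, kw=keyword: kw in line).count()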
When working with Spark, we can pass Python functions to Spark, which are automatically serialized along with any variables that they reference. For applications that use custom classes or third-party libraries, we can also add code dependencies to spark-submit through its --py-files argument by packaging them into a .zip file (see spark-submit --help for details). For example,
"""SimpleApp.py"""
from pyspark import SparkContext
logFile = "YOUR_SPARK_HOME/README.md" # Should be some file on your system
sc = SparkContext("local", "Simple App")
logData = sc.textFile(logFile).cache()
numAs = logData.filter(lambda s: 'a' in s).count()
numBs = logData.filter(lambda s: 'b' in s).count()
print "Lines with a: %i, lines with b: %i" % (numAs, numBs)
Submit the standalone Python application to Spark:
# Use spark-submit to run your application
$ YOUR_SPARK_HOME/bin/spark-submit \
--master local[4] \
SimpleApp.py
...
Lines with a: 46, lines with b: 23
As the example above shows, in practice you will not want to hardcode the master in the program when running on a cluster; instead, launch the application with spark-submit and receive it there. However, for local testing and unit tests, you can pass "local" to run Spark in-process.
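As an aside (not part of the quickstart above), dependencies can also be shipped programmatically: SparkContext.addPyFile distributes a .py or .zip file to the workers, the in-code counterpart of the --py-files flag. A minimal sketch, where helpers.py and helpers.transform are made up for illustration:
## hypothetical: ship helpers.py (or a .zip of modules) to every worker
sc.addPyFile("helpers.py")
def apply_helper(x):
    import helpers                  ## imported on the worker, where the shipped file is available
    return helpers.transform(x)     ## helpers.transform is a made-up function
print sc.parallelize(range(4)).map(apply_helper).collect()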
## parallelize an object in driver program to form an RDD
import numpy as np
data = np.arange(10) ## parallelize accepts a NumPy array, not just a Python list
para_data = sc.parallelize(data)
para_data.count()
!rm -fR data/temp-data/
para_data.saveAsPickleFile("data/temp-data")
print sc.pickleFile("data/temp-data/").collect()
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
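For comparison (a minimal sketch, not in the original run), the same RDD can be written and read back as plain text; textFile returns strings, so the values need converting back to int:
!rm -fR data/temp-text/
para_data.saveAsTextFile("data/temp-text")
print sc.textFile("data/temp-text/").map(int).collect()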
## several ways to load files
!cat data/a.txt
!cat data/b.txt
allInOne = sc.textFile("data/*.txt")
print allInOne.count()
nameToFiles = sc.wholeTextFiles("data/*.txt")
print nameToFiles.count()
print nameToFiles.collect()
hello
this is a
hello
this is b
4
2
[(u'/home/dola/workspace/dola/tutorials/learn-spark/data/a.txt', u'hello\nthis is a\n'), (u'/home/dola/workspace/dola/tutorials/learn-spark/data/b.txt', u'hello\nthis is b\n')]
Shared Variables
Normally, when a function passed to a Spark operation (such as map or reduce) is executed on a remote cluster node, it works on separate copies of all the variables used in the function. These variables are copied to each machine, and no updates to the variables on the remote machine are propagated back to the driver program. Supporting general, read-write shared variables across tasks would be inefficient. However, Spark does provide two limited types of shared variables for two common usage patterns: broadcast variables and accumulators.
Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. Broadcast variables are created from a variable v by calling SparkContext.broadcast(v). The broadcast variable is a wrapper around v, and its value can be accessed by calling the value method.
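A minimal sketch of both kinds of shared variables (the lookup table, the file, and the counting logic are illustrative, not from the text above):
## broadcast: ship a read-only lookup table to every worker once
lookup = sc.broadcast({"a": 1, "b": 2, "c": 3})
print sc.parallelize(["a", "b", "c", "a"]).map(lambda k: lookup.value[k]).sum()
## accumulator: workers can only add to it; the driver reads the final value
bad_records = sc.accumulator(0)
def check(line):
    if "spark" not in line:
        bad_records.add(1)
sc.textFile("startspark.py").foreach(check)
print bad_records.value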
The following example of estimating pi with Monte Carlo sampling is based on a discussion on Stack Overflow.
import random
import time
N = 12500000
def sample(p):
    x, y = random.random(), random.random()
    return 1 if x*x + y*y < 1 else 0
## The following code will probably run on a single core, because Spark implicitly partitions
## it into one partition
tic = time.time()
count = sc.parallelize(xrange(0, N)).map(sample).reduce(lambda a, b: a + b)
print "Pi is roughly %f" % (4.0 * count / N)
print 'time: %g s' % (time.time() - tic)
Pi is roughly 3.141940
time: 9.78296 s
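To check how the range was actually split, the RDD can report its partition count (assuming getNumPartitions is available in your PySpark version):
## how many partitions did parallelize create by default?
print sc.parallelize(xrange(0, N)).getNumPartitions()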
## It can be made better by explicitly partitioning the data into multiple parts.
## But the driver thread will still have to generate the huge range on a single core
ncores = 16
tic = time.time()
count = sc.parallelize(xrange(0, N), ncores).map(sample).reduce(lambda a, b: a + b)
print "Pi is roughly %f" % (4.0 * count / N)
print 'time: %g s' % (time.time() - tic)
Pi is roughly 3.141770
time: 9.82372 s
## so a better solution is to generate the data inside each partition instead of in the driver
N = 12500000
part = 16
tic = time.time()
count = ( sc.parallelize([None] * part, part)
          .flatMap(lambda _: [sample(p) for p in xrange(N / part)])
          .reduce(lambda a, b: a + b)
        )
print "Pi is roughly %f" % (4.0 * count / N)
print 'time: %g s' % (time.time() - tic)
Pi is roughly 3.141796
time: 6.60225 s
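An equivalent way to express this last version is mapPartitions, which hands each partition a generator instead of building an intermediate list (a minimal sketch, not timed here):
## same idea with mapPartitions: each of the `part` partitions draws N/part samples
count = ( sc.parallelize([None] * part, part)
          .mapPartitions(lambda _: (sample(None) for _ in xrange(N / part)))
          .reduce(lambda a, b: a + b)
        )
print "Pi is roughly %f" % (4.0 * count / N)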