pip3 install pyspark
to get Spark and the Python bindings. Put this in your ~/.zshrc file:

PYSPARK_PYTHON="python3"
export PYSPARK_PYTHON

Start a new terminal and run pyspark.
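If you prefer not to edit your shell profile, you can also set the variable from within the notebook itself, before any SparkContext is created. A minimal sketch, assuming the notebook already runs on the Python you want Spark to use:

import os
import sys

# point Spark's workers at the notebook's own interpreter
# (alternative to the ~/.zshrc approach above)
os.environ["PYSPARK_PYTHON"] = sys.executable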
Now it is time to start reading the Spark quick start guide.
In this notebook we try to get something meaningful done with the words of the BHSA.
We explode the g_word_utf8 feature:
%%time
from tf.convert.tf import explode
explode('~/github/etcbc/bhsa/tf/c/g_word_utf8.tf', 'explode/out')
CPU times: user 1.27 s, sys: 42.6 ms, total: 1.32 s Wall time: 1.32 s
True
!head -n 15 explode/out/g_word_utf8.tf
1	בְּ
2	רֵאשִׁ֖ית
3	בָּרָ֣א
4	אֱלֹהִ֑ים
5	אֵ֥ת
6	הַ
7	שָּׁמַ֖יִם
8	וְ
9	אֵ֥ת
10	הָ
11	אָֽרֶץ
12	וְ
13	הָ
14	אָ֗רֶץ
15	הָיְתָ֥ה
!tail -n 15 explode/out/g_word_utf8.tf
426570	בִּ
426571	ירוּשָׁלִַ֖ם
426572	אֲשֶׁ֣ר
426573	בִּֽ
426574	יהוּדָ֑ה
426575	מִֽי
426576	בָכֶ֣ם
426577	מִ
426578	כָּל
426579	עַמֹּ֗ו
426580	יְהוָ֧ה
426581	אֱלֹהָ֛יו
426582	עִמֹּ֖ו
426583	וְ
426584	יָֽעַל
Brilliant.
Now let's see whether Spark just works in the notebook.
from pyspark import SparkConf, SparkContext
%%time
conf = SparkConf().setAppName("bhsa").setMaster("local")
sc = SparkContext(conf=conf)
CPU times: user 15.3 ms, sys: 12 ms, total: 27.3 ms Wall time: 3.18 s
lines = sc.textFile("explode/out/g_word_utf8.tf")
lines.count()
426584
lines.first()
'1\tבְּ'
pairs = lines.map(lambda s: tuple(reversed(s.split("\t"))))
pairs.first()
('בְּ', '1')
Due to the right-to-left rendering of the Hebrew you do not see that '1' is the second element:
pairs.first()[0]
'בְּ'
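If you want to convince yourself of the order despite the right-to-left rendering, you could print each element with its index and repr (a small extra check, not part of the original run):

for (i, elem) in enumerate(pairs.first()):
    print(i, repr(elem))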
We want the 1 as an integer, or rather, as a tuple of one integer (why will become clear later).
def makePair(s):
    (node, value) = s.split("\t")
    return (value, (int(node),))
%%time
pairs = lines.map(makePair)
CPU times: user 604 µs, sys: 506 µs, total: 1.11 ms Wall time: 784 µs
%%time
firstPair = pairs.first()
print(firstPair[0])
print(firstPair[1])
בְּ
(1,)
CPU times: user 5.41 ms, sys: 2.03 ms, total: 7.43 ms Wall time: 51.6 ms
pairs.take(10)
[('בְּ', (1,)), ('רֵאשִׁ֖ית', (2,)), ('בָּרָ֣א', (3,)), ('אֱלֹהִ֑ים', (4,)), ('אֵ֥ת', (5,)), ('הַ', (6,)), ('שָּׁמַ֖יִם', (7,)), ('וְ', (8,)), ('אֵ֥ת', (9,)), ('הָ', (10,))]
Now we get the occurrences of each distinct word form:
def add(occs, occ):
    return occs + occ
%%time
occs = pairs.reduceByKey(add)
CPU times: user 7.24 ms, sys: 2.01 ms, total: 9.24 ms Wall time: 25.2 ms
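Here it becomes clear why we packed each node in a tuple of one integer: reduceByKey combines the values of equal keys pairwise with add, and for tuples + means concatenation, so all nodes of a word form accumulate into one tuple. A tiny illustration in plain Python (not part of the Spark run):

# add((1,), (84,)) gives (1, 84); applied repeatedly by reduceByKey
# it collects all node numbers of a word form into a single tuple
add(add((1,), (84,)), (500,))   # (1, 84, 500)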
The first entry of occs should be בְּ together with all the nodes that have it as their g_word_utf8 value.
%%time
bOccs = occs.first()
nodes = bOccs[1]
word = bOccs[0]
word
CPU times: user 5.2 ms, sys: 871 µs, total: 6.07 ms Wall time: 7.23 s
'בְּ'
len(nodes)
6423
nodes[0:10]
(1, 84, 500, 540, 542, 735, 737, 804, 820, 852)
nodes[-10:]
(426282, 426354, 426370, 426385, 426403, 426419, 426495, 426525, 426538, 426543)
This is the very, very beginning, but here you can see how you can get at the feature data in a completely different way.
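To give an impression of where you could go from here: with one more transformation you can rank the word forms by frequency. A sketch, not executed in the session above (takeOrdered is a standard RDD action):

# the ten most frequent word forms with their occurrence counts
freqs = occs.map(lambda kv: (kv[0], len(kv[1])))
freqs.takeOrdered(10, key=lambda kv: -kv[1])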
As a check, we do it in Text-Fabric:
%%time
from tf.app import use
A = use('bhsa', hoist=globals(), silent='deep')
CPU times: user 5.12 s, sys: 652 ms, total: 5.77 s Wall time: 6.43 s
word = F.g_word_utf8.v(1)
word
'בְּ'
%%time
nodes = [n for n in F.otype.s('word') if F.g_word_utf8.v(n) == word]
CPU times: user 140 ms, sys: 1.75 ms, total: 142 ms Wall time: 141 ms
len(nodes)
6423
nodes[0:10]
[1, 84, 500, 540, 542, 735, 737, 804, 820, 852]
nodes[-10:]
[426282, 426354, 426370, 426385, 426403, 426419, 426495, 426525, 426538, 426543]
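We can also let the computer do the comparison. A small sketch, assuming bOccs from the Spark run is still around (the Spark nodes may arrive in a different order, hence the sorting):

# compare the Spark result with the Text-Fabric result just computed
sorted(bOccs[1]) == nodes   # should print True if both methods agree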
In Spark, loading the system takes more than 3 seconds, although the processor is not very busy during that time.
Later, the cell that does occs.first() takes 7 seconds: Spark evaluates its transformations lazily, so only at that first action does the whole pipeline of reading, mapping and reducing actually run (which is also why the map cells above seemed to finish in under a millisecond).
But after that, the occs are cached.
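If you want to make sure that Spark keeps the reduced data in memory across later actions, you can ask for it explicitly; this is a sketch, not something done in the session above:

occs.cache()    # mark the RDD for in-memory caching
occs.first()    # the first action after this materializes and caches it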
In Text-Fabric, loading the features takes slightly less than 7 seconds, although we load many more features than just g_word_utf8!
After that, all features are cached.
But, although in this case Text-Fabric wins, it might very well be that, once you really start crunching numbers, Spark outperforms Text-Fabric in a devastating way.
We'll see.