pip3 install pyspark
to get Spark and the Python bindings. Put this in your ~/.zshrc file:

PYSPARK_PYTHON="python3"
export PYSPARK_PYTHON

Start a new terminal and run pyspark.
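If you prefer not to edit your shell profile, you can also set the variable from within the notebook itself, before any SparkContext is created. A minimal sketch, assuming the notebook already runs on the Python you want Spark to use:

import os
import sys

# point Spark's workers at the notebook's own interpreter
# (alternative to the ~/.zshrc approach above)
os.environ["PYSPARK_PYTHON"] = sys.executable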
Now it is time to start reading the Spark quick start guide.
In this notebook we try to get something meaningful done with the words of the BHSA.
We explode the g_word_utf8 feature:
%%time
from tf.convert.tf import explode
explode('~/github/etcbc/bhsa/tf/c/g_word_utf8.tf', 'explode/out')
CPU times: user 1.27 s, sys: 42.6 ms, total: 1.32 s Wall time: 1.32 s
True
!head -n 15 explode/out/g_word_utf8.tf
1	בְּ
2	רֵאשִׁ֖ית
3	בָּרָ֣א
4	אֱלֹהִ֑ים
5	אֵ֥ת
6	הַ
7	שָּׁמַ֖יִם
8	וְ
9	אֵ֥ת
10	הָ
11	אָֽרֶץ
12	וְ
13	הָ
14	אָ֗רֶץ
15	הָיְתָ֥ה
!tail -n 15 explode/out/g_word_utf8.tf
426570	בִּ
426571	ירוּשָׁלִַ֖ם
426572	אֲשֶׁ֣ר
426573	בִּֽ
426574	יהוּדָ֑ה
426575	מִֽי
426576	בָכֶ֣ם
426577	מִ
426578	כָּל
426579	עַמֹּ֗ו
426580	יְהוָ֧ה
426581	אֱלֹהָ֛יו
426582	עִמֹּ֖ו
426583	וְ
426584	יָֽעַל
Brilliant.
Now let's see whether Spark just works in the notebook.
from pyspark import SparkConf, SparkContext
%%time
conf = SparkConf().setAppName("bhsa").setMaster("local")
sc = SparkContext(conf=conf)
CPU times: user 15.3 ms, sys: 12 ms, total: 27.3 ms Wall time: 3.18 s
lines = sc.textFile("explode/out/g_word_utf8.tf")
lines.count()
426584
lines.first()
'1\tבְּ'
pairs = lines.map(lambda s: tuple(reversed(s.split("\t"))))
pairs.first()
('בְּ', '1')
Due to the right-to-left rendering of the Hebrew you do not see that '1' is the second element:
pairs.first()[0]
'בְּ'
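If you want to convince yourself of the order despite the right-to-left rendering, you could print each element with its index and repr (a small extra check, not part of the original run):

for (i, elem) in enumerate(pairs.first()):
    print(i, repr(elem))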
We want the 1 as an integer, or rather, as a tuple of one integer (why will become clear later).
def makePair(s):
    (node, value) = s.split("\t")
    return (value, (int(node),))
%%time
pairs = lines.map(makePair)
CPU times: user 604 µs, sys: 506 µs, total: 1.11 ms Wall time: 784 µs
%%time
firstPair = pairs.first()
print(firstPair[0])
print(firstPair[1])
בְּ
(1,)
CPU times: user 5.41 ms, sys: 2.03 ms, total: 7.43 ms Wall time: 51.6 ms
pairs.take(10)
[('בְּ', (1,)), ('רֵאשִׁ֖ית', (2,)), ('בָּרָ֣א', (3,)), ('אֱלֹהִ֑ים', (4,)), ('אֵ֥ת', (5,)), ('הַ', (6,)), ('שָּׁמַ֖יִם', (7,)), ('וְ', (8,)), ('אֵ֥ת', (9,)), ('הָ', (10,))]
Now we get the occurrences of each distinct word form:
def add(occs, occ):
    return occs + occ
%%time
occs = pairs.reduceByKey(add)
CPU times: user 7.24 ms, sys: 2.01 ms, total: 9.24 ms Wall time: 25.2 ms
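Here it becomes clear why we packed each node in a tuple of one integer: reduceByKey combines the values of equal keys pairwise with add, and for tuples + means concatenation, so all nodes of a word form accumulate into one tuple. A tiny illustration in plain Python (not part of the Spark run):

# add((1,), (84,)) gives (1, 84); applied repeatedly by reduceByKey
# it collects all node numbers of a word form into a single tuple
add(add((1,), (84,)), (500,))   # (1, 84, 500)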
The first entry of occs should be בְּ together with all the nodes that have it as their g_word_utf8 value.
%%time
bOccs = occs.first()
nodes = bOccs[1]
word = bOccs[0]
word
CPU times: user 5.2 ms, sys: 871 µs, total: 6.07 ms Wall time: 7.23 s
'בְּ'
len(nodes)
6423
nodes[0:10]
(1, 84, 500, 540, 542, 735, 737, 804, 820, 852)
nodes[-10:]
(426282, 426354, 426370, 426385, 426403, 426419, 426495, 426525, 426538, 426543)
This is the very, very beginning, but here you can see how you can get at the feature data in a completely different way.
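To give an impression of where you could go from here: with one more transformation you can rank the word forms by frequency. A sketch, not executed in the session above (takeOrdered is a standard RDD action):

# the ten most frequent word forms with their occurrence counts
freqs = occs.map(lambda kv: (kv[0], len(kv[1])))
freqs.takeOrdered(10, key=lambda kv: -kv[1])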
As a check, we do it in Text-Fabric:
%%time
from tf.app import use
A = use('bhsa', hoist=globals(), silent='deep')
CPU times: user 5.12 s, sys: 652 ms, total: 5.77 s Wall time: 6.43 s
word = F.g_word_utf8.v(1)
word
'בְּ'
%%time
nodes = [n for n in F.otype.s('word') if F.g_word_utf8.v(n) == word]
CPU times: user 140 ms, sys: 1.75 ms, total: 142 ms Wall time: 141 ms
len(nodes)
6423
nodes[0:10]
[1, 84, 500, 540, 542, 735, 737, 804, 820, 852]
nodes[-10:]
[426282, 426354, 426370, 426385, 426403, 426419, 426495, 426525, 426538, 426543]
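We can also let the computer do the comparison. A small sketch, assuming bOccs from the Spark run is still around (the Spark nodes may arrive in a different order, hence the sorting):

# compare the Spark result with the Text-Fabric result just computed
sorted(bOccs[1]) == nodes   # should print True if both methods agree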
In Spark, loading the system takes more than 3 seconds, although the processor is not very busy during that time.
Later, the cell that does occs.first() takes 7 seconds: Spark evaluates its transformations lazily, so only at that first action does the whole pipeline of reading, mapping and reducing actually run (which is also why the map cells above seemed to finish in under a millisecond).
But after that, the occs are cached.
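If you want to make sure that Spark keeps the reduced data in memory across later actions, you can ask for it explicitly; this is a sketch, not something done in the session above:

occs.cache()    # mark the RDD for in-memory caching
occs.first()    # the first action after this materializes and caches it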
In Text-Fabric, loading the features takes slightly less than 7 seconds, although we load many more features than just g_word_utf8!
After that, all features are cached.
But, although in this case Text-Fabric wins, it might very well be that, once you really start crunching numbers, Spark outperforms Text-Fabric in a devastating way.
We'll see.