sparklyr
This notebook stems from this one, where we realized there is no method to unnest columns in sparklyr. Fortunately, PySpark comes to the rescue.
The following commands are 'forked' from this great tutorial: Sentiment analysis with Spark ML. Material for Machine Learning Workshop Galicia 2016.
We import our data as a Spark dataframe:
type(sqlContext)
pyspark.sql.context.HiveContext
bin_reviews = sqlContext.read.json('amazon/bin_reviews.json')
bin_reviews.printSchema()
root
 |-- asin: string (nullable = true)
 |-- helpful: array (nullable = true)
 |    |-- element: long (containsNull = true)
 |-- label: double (nullable = true)
 |-- overall: double (nullable = true)
 |-- reviewText: string (nullable = true)
 |-- reviewTime: string (nullable = true)
 |-- reviewerID: string (nullable = true)
 |-- reviewerName: string (nullable = true)
 |-- summary: string (nullable = true)
 |-- unixReviewTime: long (nullable = true)
select_reviews = bin_reviews.select('reviewText', 'overall', 'label')
select_reviews.show(2)
+--------------------+-------+-----+
|          reviewText|overall|label|
+--------------------+-------+-----+
|Spiritually and m...|    5.0|  1.0|
|This is one my mu...|    5.0|  1.0|
+--------------------+-------+-----+
only showing top 2 rows
from pyspark.ml.feature import Tokenizer
tokenizer = Tokenizer(inputCol="reviewText", outputCol="words")
tokenized_reviews = tokenizer.transform(select_reviews)
tokenized_reviews.show(2)
+--------------------+-------+-----+--------------------+
|          reviewText|overall|label|               words|
+--------------------+-------+-----+--------------------+
|Spiritually and m...|    5.0|  1.0|[spiritually, and...|
|This is one my mu...|    5.0|  1.0|[this, is, one, m...|
+--------------------+-------+-----+--------------------+
only showing top 2 rows
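Note that `Tokenizer` simply lowercases the text and splits on whitespace, so punctuation stays attached to tokens ("inspiring!" above). If that matters for your pipeline, `RegexTokenizer` can split on non-word characters instead. A plain-Python sketch of the difference (illustrative only, not Spark's actual implementation):

```python
import re

def simple_tokenize(text):
    # Mimics pyspark.ml.feature.Tokenizer: lowercase, split on whitespace;
    # punctuation stays attached to the words.
    return text.lower().split()

def regex_tokenize(text, pattern=r"\W+"):
    # Mimics RegexTokenizer with pattern="\\W+": lowercase, split on runs
    # of non-word characters, dropping empty tokens (and the punctuation).
    return [t for t in re.split(pattern, text.lower()) if t]

sample = "Spiritually and mentally inspiring!"
print(simple_tokenize(sample))  # ['spiritually', 'and', 'mentally', 'inspiring!']
print(regex_tokenize(sample))   # ['spiritually', 'and', 'mentally', 'inspiring']
```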
from pyspark.ml.feature import StopWordsRemover
remover = StopWordsRemover(inputCol=tokenizer.getOutputCol(), outputCol="filtered")
removed_reviews = remover.transform(tokenized_reviews)
removed_reviews.show(2)
sample_review = removed_reviews.first()
print sample_review['words'][:10]
print sample_review['filtered'][:10]
+--------------------+-------+-----+--------------------+--------------------+
|          reviewText|overall|label|               words|            filtered|
+--------------------+-------+-----+--------------------+--------------------+
|Spiritually and m...|    5.0|  1.0|[spiritually, and...|[spiritually, men...|
|This is one my mu...|    5.0|  1.0|[this, is, one, m...|[books., masterpi...|
+--------------------+-------+-----+--------------------+--------------------+
only showing top 2 rows

[u'spiritually', u'and', u'mentally', u'inspiring!', u'a', u'book', u'that', u'allows', u'you', u'to']
[u'spiritually', u'mentally', u'inspiring!', u'book', u'allows', u'question', u'morals', u'help', u'discover', u'really']
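As a sanity check on what `StopWordsRemover` does: it drops every token that appears in its stop-word list (a default English list, unless you pass your own). A rough plain-Python sketch of the same filtering, using an illustrative subset of stop words rather than Spark's full default list:

```python
def remove_stop_words(tokens, stop_words):
    # Mimics pyspark.ml.feature.StopWordsRemover: drop any token found in
    # the stop-word list (the comparison here is case-insensitive).
    stop_set = set(w.lower() for w in stop_words)
    return [t for t in tokens if t.lower() not in stop_set]

# The first ten tokens of our sample review, and a toy stop-word list.
words = ['spiritually', 'and', 'mentally', 'inspiring!', 'a',
         'book', 'that', 'allows', 'you', 'to']
my_stop_words = ['and', 'a', 'that', 'you', 'to']
print(remove_stop_words(words, my_stop_words))
# ['spiritually', 'mentally', 'inspiring!', 'book', 'allows']
```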
from pyspark.ml.feature import StopWordsRemover
from pyspark.sql.functions import explode
unnested_reviews = removed_reviews.select('overall', 'label', explode("filtered").alias("word"))
unnested_reviews.show(5)
+-------+-----+-----------+
|overall|label|       word|
+-------+-----+-----------+
|    5.0|  1.0|spiritually|
|    5.0|  1.0|   mentally|
|    5.0|  1.0| inspiring!|
|    5.0|  1.0|       book|
|    5.0|  1.0|     allows|
+-------+-----+-----------+
only showing top 5 rows
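This is exactly the unnest we were missing in sparklyr: `explode` emits one output row per element of the array column, repeating the other columns alongside it. A plain-Python sketch of that semantics (the dict-based rows are purely illustrative, not Spark's internals):

```python
def explode_rows(rows, list_col):
    # Rough analogue of pyspark.sql.functions.explode: one output row per
    # element of the array column, with the other columns repeated.
    for row in rows:
        for item in row[list_col]:
            out = dict(row)       # copy the scalar columns
            out[list_col] = item  # replace the array with a single element
            yield out

rows = [{'overall': 5.0, 'label': 1.0,
         'filtered': ['spiritually', 'mentally', 'inspiring!']}]
exploded = list(explode_rows(rows, 'filtered'))
print(len(exploded))            # 3 (one row per word)
print(exploded[0]['filtered'])  # spiritually
```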
We save our dataframe for further use in our small sparklyr pipeline. Saving will take a good while, so be patient!
# unnested_reviews.write.json('unnested_reviews_json')
unnested_reviews.write.save('amazon/unnested_reviews_json', format='json', mode='overwrite')
Return to the sparklyr notebook to follow the pipeline!