Connecting to remote Spark through DSX-HI

In [1]:
%load_ext sparkmagic.magics
from dsx_core_utils import proxy_util,dsxhi_util
success configuring sparkmagic livy.
In [2]:
['', '', '', '', '', '', '', '', '', '', '', '', '', '', '']

Pushing the python virtual environment to cluster using DSX-HI

In [3]:
!cat /user-home/_global_/.remote-images/dsx-hi/dsx-scripted-ml-python2.json
{ "imageId": "968c2101554e0d1e0d4fdd3720aaa565a2910cb46f4d7ed61188b6ceeec22930",
  "scriptCommand": "anaconda2/bin/python2.7",
  "libPaths": ["/usr/local/spark-2.0.2-bin-hadoop2.7/python", "/user-home/.scripts/common-helpers/batch/pmml", "/user-home/.scripts/common-helpers/saas", "/user-home/_global_/python-2.7"] }

Create Session Properties

Using values from dsx-scripted-ml-python2.json, we'll need to:

  • Pull the archive from HDFS into the YARN distributed cache using the Spark conf --archives
  • Override the default PYSPARK_PYTHON with the relative path from scriptCommand

Example DSX-HI properties for using the dsx-scripted-ml-python2.tar.gz virtual environment:

{
  "proxyUser": "user1",
  "archives": ["/user/dsxhi/environments/26611bf7fe595f786139d6d2132de070fc813f6a0ef7a4e25857b79c8cd4b565/dsx-scripted-ml-python2.tar.gz"],
  "conf": {
    "spark.yarn.appMasterEnv.PYSPARK_PYTHON": "dsx-scripted-ml-python2.tar.gz/anaconda2/bin/python"
  }
}
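The mapping from the image descriptor to these session properties can be sketched in Python. This is a minimal sketch: the archive path and proxyUser are the example values from above, and it assumes the interpreter from scriptCommand (anaconda2/bin/python2.7, which the properties above reference via the equivalent anaconda2/bin/python) is what PYSPARK_PYTHON should point at:

```python
import json
import posixpath

# Example values copied from dsx-scripted-ml-python2.json and the sample
# properties above; the archive location is deployment-specific.
image = {"scriptCommand": "anaconda2/bin/python2.7"}
archive = ("/user/dsxhi/environments/"
           "26611bf7fe595f786139d6d2132de070fc813f6a0ef7a4e25857b79c8cd4b565/"
           "dsx-scripted-ml-python2.tar.gz")

props = {
    "proxyUser": "user1",
    "archives": [archive],
    "conf": {
        # YARN unpacks the archive under its own file name, so the
        # interpreter path is relative to the archive root
        "spark.yarn.appMasterEnv.PYSPARK_PYTHON":
            posixpath.basename(archive) + "/" + image["scriptCommand"],
    },
}
print(json.dumps(props, indent=2))
```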

Files currently on HDFS:

In [4]:
Added endpoint
Starting Spark application
ID | YARN Application ID | Kind | State | Spark UI | Driver log | Current session?
SparkSession available as 'spark'.

Reading the dataset from HDFS

In [5]:
import pandas as pd
import numpy as np

# Reading the data from HDFS
data = spark.read.option("delimiter", ",").option("header", "false").csv("hdfs:///user/user1/SMSSpamCollection.csv")
dataset = data.toPandas()
dataset = dataset.iloc[:,:2]

message = dataset['_c1']
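For experimenting with the same two-column frame without a cluster, an equivalent load can be done with pandas alone. The inline CSV below is a tiny hypothetical stand-in for SMSSpamCollection.csv:

```python
import io

import pandas as pd

# Tiny stand-in for SMSSpamCollection.csv: label,message with no header row,
# mirroring the _c0/_c1 column names Spark assigns when header is "false"
raw = io.StringIO("0,ok see you later\n1,win a free prize now\n")
dataset = pd.read_csv(raw, header=None, names=["_c0", "_c1"])
message = dataset["_c1"]
print(message.tolist())
```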

Extracting the Bag of Words features (Text to Vector)

In [6]:
# Creating the Bag of Words model
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 1500)
X = cv.fit_transform(message).toarray()
y = dataset.iloc[:, 0].values
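CountVectorizer builds a vocabulary over the whole corpus and turns each message into a vector of token counts. A minimal pure-Python sketch of the same idea, on toy messages:

```python
from collections import Counter

docs = ["free prize now", "call me now", "free call"]
# Vocabulary: every distinct token across the corpus, in sorted order
vocab = sorted({tok for d in docs for tok in d.split()})
# One count vector per document, aligned with the vocabulary
vectors = [[Counter(d.split())[tok] for tok in vocab] for d in docs]
print(vocab)       # ['call', 'free', 'me', 'now', 'prize']
print(vectors[0])  # [0, 1, 0, 1, 1]
```

CountVectorizer additionally caps the vocabulary at the max_features most frequent tokens (1500 above), which keeps the feature matrix manageable for a large corpus.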

Performing machine learning

In [7]:
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)

# Fitting Logistic Regression to the training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)

#Predicting the Test Results
y_pred = classifier.predict(X_test)

#Printing the model accuracy
from sklearn.metrics import accuracy_score
print('Accuracy: %.2f%%' % (accuracy_score(y_test, y_pred) * 100))

#Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test,y_pred)

# Printing the confusion matrix in a table
actual, predicted, counts = [], [], []
for i, a in enumerate(classifier.classes_):
    for j, p in enumerate(classifier.classes_):
        actual.append(a)
        predicted.append(p)
        counts.append(cm[i][j])
cm_df = pd.DataFrame({'y_test': actual, 'y_pred': predicted, 'count': counts})
cm_df = cm_df[['y_test', 'y_pred', 'count']]
print(cm_df.to_string(index=False))
Accuracy: 93.00%
y_test  y_pred  count
     1       1     94
     0       1      4
     1       0     10
     0       0     92
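As a sanity check, the reported accuracy (plus precision and recall, which accuracy_score does not show) can be recomputed from the four counts in the table above, treating class 1 (spam) as the positive class:

```python
# Counts taken from the confusion-matrix table above (class 1 = positive)
tp, fp, fn, tn = 94, 4, 10, 92

accuracy = (tp + tn) / float(tp + fp + fn + tn)   # matches the 93.00% above
precision = tp / float(tp + fp)
recall = tp / float(tp + fn)
print("accuracy=%.2f precision=%.2f recall=%.2f" % (accuracy, precision, recall))
```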