PyMADlib (https://github.com/gopivotal/pymadlib) is a Python wrapper for MADlib, the popular library of parallel in-database machine learning algorithms. It currently provides wrappers for five of MADlib's algorithms, including linear regression, logistic regression, and k-means clustering, which are demonstrated below.
Refer to the MADlib User Docs for MADlib's full user documentation.
We can use it to push the heavy number crunching into the database via MADlib, while enjoying the expressiveness of Python on the front end.
from pymadlib.pymadlib import *    # DBConnect, LinearRegression, LogisticRegression, ...
from pymadlib import example       # bundled demos such as kmeansDemo
conn_str = '''host='{hostname}' port='{port}' dbname='{database}' user='{username}' password='{password}' '''
#conn_str = conn_str.format(hostname='localhost',database='pivotal_test',port='5433',username='gpadmin',password='gpadmin')
conn_str = conn_str.format(hostname='192.168.241.161',database='pivotal_test',port='5433',username='gpadmin',password='gpadmin')
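If a password or other value contains a single quote, the raw template above produces an invalid libpq connection string. As a sketch, values can be escaped per libpq's keyword/value rules (the helper name `libpq_quote` is our own, not part of pymadlib):

```python
def libpq_quote(value):
    # Per libpq's connection-string rules, backslashes and single
    # quotes inside a quoted value must be backslash-escaped.
    escaped = str(value).replace('\\', '\\\\').replace("'", "\\'")
    return "'%s'" % escaped

# Hypothetical credentials, including a password with a quote in it.
params = {'host': 'localhost', 'port': 5433, 'dbname': 'pivotal_test',
          'user': 'gpadmin', 'password': "p'ass"}
conn_str = ' '.join('%s=%s' % (k, libpq_quote(v)) for k, v in params.items())
print(conn_str)
```

The resulting string can then be handed to `DBConnect` exactly like the hand-formatted template above.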
#PyMADlib is compatible with only MADlib v0.5, so we need to explicitly specify the MADlib schema.
#We have installed both MADlib 1.3 and MADlib 0.5 in this VM.
conn = DBConnect(conn_str=conn_str,madlib_schema='madlib_v05')
#View Documentation
mdl = LinearRegression(conn)
print(mdl.train.__doc__)
Train a linear regression model on the specified table for the given set of independent and dependent variables.

Inputs:
========
table_name : (String) input table name
indep      : (list of strings) the independent variables to build the model on
dep        : (string) the class label

Output:
========
The model coefficients, r2, p_values and t_stats.
The function also returns the model object.
#Train Model and Score
lreg = LinearRegression(conn)
mdl_dict, mdl_params = lreg.train('public.wine_training_set',['1','alcohol','proline','hue','color_intensity','flavanoids'],'quality')
#Show model params
mdl_params
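To build intuition for the quantities MADlib returns (coefficients and r2), here is a plain-NumPy sketch on toy data; it does not call pymadlib or touch the database:

```python
import numpy as np

# Toy design matrix with a leading column of ones, mirroring the
# literal '1' passed as the first independent variable above.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([3.1, 5.0, 6.9, 9.1])

# Ordinary least squares: coef = argmin ||X b - y||^2
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# r^2 = 1 - SS_res / SS_tot
residuals = y - X @ coef
r2 = 1.0 - (residuals @ residuals) / ((y - y.mean()) @ (y - y.mean()))
print(coef, r2)
```

MADlib computes the same fit in-database and additionally reports p-values and t-statistics for each coefficient.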
#Now do prediction
predictions = lreg.predict('public.wine_test_set','quality')
#Show prediction results
predictions.head()
| | id | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280 | proline | quality | prediction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 32 | 2 | 12.08 | 1.13 | 2.51 | 24.0 | 78 | 2.00 | 1.58 | 0.40 | 1.40 | 2.20 | 1.31 | 2.72 | 630 | 550.955896 |
| 1 | 46 | 2 | 13.03 | 0.90 | 1.71 | 16.0 | 86 | 1.95 | 2.03 | 0.24 | 1.46 | 4.60 | 1.19 | 2.48 | 392 | 711.149598 |
| 2 | 40 | 3 | 13.62 | 4.95 | 2.35 | 20.0 | 92 | 2.00 | 0.80 | 0.47 | 1.02 | 4.40 | 0.91 | 2.05 | 550 | 386.660786 |
| 3 | 2 | 2 | 12.17 | 1.45 | 2.53 | 19.0 | 104 | 1.89 | 1.75 | 0.45 | 1.03 | 2.95 | 1.45 | 2.23 | 355 | 646.727206 |
| 4 | 34 | 3 | 13.73 | 4.36 | 2.26 | 22.5 | 88 | 1.28 | 0.47 | 0.52 | 1.15 | 6.62 | 0.78 | 1.75 | 520 | 539.247126 |
5 rows × 16 columns
#Show Scatter Matrix of Actual Vs Predicted
from pandas.plotting import scatter_matrix   # pandas.tools.plotting in pandas < 0.20
import matplotlib.pyplot as plt
smat = scatter_matrix(predictions.get(['quality','prediction']), diagonal='kde')
PyMADlib includes a utility function that pivots categorical variables into 0/1 indicator columns, for use with algorithms that require numeric inputs.
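The effect of such pivoting can be illustrated with pandas' `get_dummies` (an equivalent sketch, not pymadlib's own implementation):

```python
import pandas as pd

# A categorical column like 'fuel_type' is expanded into one 0/1
# indicator column per level, which is what a linear model needs.
df = pd.DataFrame({'fuel_type': ['gas', 'diesel', 'gas']})
pivoted = pd.get_dummies(df, columns=['fuel_type'])
print(pivoted.columns.tolist())
```

This is exactly the shape of the `make_val_0` ... `make_val_13` columns in the prediction output below.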
#Train Linear Regression Model on a mixture of Numeric and Categorical Variables
mdl_dict, mdl_params = lreg.train('public.auto_mpg_train',['1','height','width','length','highway_mpg','engine_size','make','fuel_type','fuel_system'],'price')
predictions = lreg.predict('public.auto_mpg_test','price')
#Show sample predictions
predictions.head()
| | id | height | width | length | highway_mpg | engine_size | make_val_0 | make_val_1 | make_val_2 | make_val_3 | make_val_4 | make_val_5 | make_val_6 | make_val_7 | make_val_8 | make_val_9 | make_val_10 | make_val_11 | make_val_12 | make_val_13 | ... |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 31 | 54.3 | 64.8 | 176.8 | 28 | 164 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... |
| 1 | 25 | 50.8 | 63.9 | 144.6 | 54 | 92 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... |
| 2 | 19 | 53.5 | 63.8 | 170.2 | 37 | 97 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... |
| 3 | 13 | 50.2 | 68.3 | 168.9 | 27 | 151 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... |
| 4 | 37 | 54.3 | 64.8 | 176.8 | 28 | 164 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... |
5 rows × 28 columns
#Display Scatter Plot of Actual Vs Predicted Values
smat = scatter_matrix(predictions.get(['price','prediction']), diagonal='kde')
#1) Logistic Regression with Numeric Variables Alone
log_reg = LogisticRegression(conn)
#Train Model
mdl_dict, mdl_params = log_reg.train('public.wine_bool_training_set','indep','quality_label')
#Show Model Parameters
mdl_params.head()
| | coef | log_likelihood | std_err | z_stats | p_values | odds_ratios | condition_no | num_iterations |
|---|---|---|---|---|---|---|---|---|
| 0 | [3.67131782851, -2.3556530177, -0.36640132682,... | -59.664051 | [2.62370220266, 0.645203905536, 0.583811938681... | [1.3992890751, -3.65102101443, -0.627601634265... | [0.161726314396, 0.000261199864726, 0.53026493... | [39.3036672396, 0.0948315596232, 0.69322453577... | 6198.907232 | 6 |
1 rows × 8 columns
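Note that the `odds_ratios` column is simply exp(coef), which can be checked against the first few values printed above:

```python
import math

# First three coefficients and odds ratios from the model output above.
coef = [3.67131782851, -2.3556530177, -0.36640132682]
odds_ratios = [39.3036672396, 0.0948315596232, 0.69322453577]

# For logistic regression, the odds ratio of each feature is exp(coefficient).
for c, o in zip(coef, odds_ratios):
    assert abs(math.exp(c) - o) < 1e-5
```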
#2) Logistic Regression Prediction
predictions = log_reg.predict('wine_bool_test_set','',None)
predictions.head()
#Display ROC Curve
actual = predictions.get('quality_label')
predicted = predictions.get('prediction')
ROCPlot('ROC curve Logistic Reg. on Continuous Features ',['Logistic Regression'],actual,predicted)
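ROCPlot renders the curve; for reference, the underlying (FPR, TPR) points can be computed by sorting on the predicted score and sweeping a threshold. A toy sketch (not the wine data):

```python
def roc_points(actual, scores):
    """Return (fpr, tpr) points as the threshold sweeps the sorted scores."""
    pairs = sorted(zip(scores, actual), reverse=True)
    pos = sum(actual)
    neg = len(actual) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for score, label in pairs:
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

# A perfect classifier climbs to (0, 1) before moving right.
pts = roc_points([True, True, False, False], [0.9, 0.8, 0.3, 0.1])
print(pts)
```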
#Demonstrate K-Means
example.kmeansDemo(conn)
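The demo runs MADlib's k-means in-database; the algorithm itself is ordinary Lloyd's iteration, sketched here in NumPy on toy 1-D data (not the demo's dataset):

```python
import numpy as np

def kmeans(points, k, iters=10, seed=0):
    rng = np.random.RandomState(seed)
    # Initialize centroids by sampling k distinct data points.
    centroids = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        dists = np.abs(points[:, None] - centroids[None, :])
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points.
        centroids = np.array([points[labels == j].mean() for j in range(k)])
    return labels, centroids

data = np.array([1.0, 1.1, 0.9, 10.0, 10.2, 9.8])
labels, centroids = kmeans(data, k=2)
print(sorted(centroids))
```

MADlib performs the same assign/update loop as a pair of parallel SQL operations over the table, which is what makes it scale to large data.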