Simple Examples of using PL/Python in a SQL Database

Written by Ian Huston.

These notes are a short introduction to the procedural language PL/Python, and show how it can help speed up calculations by leveraging the power of a massively parallel processing (MPP) database such as the Pivotal Greenplum Database.

Requirements: PostgreSQL or Greenplum DB, with PL/Python and NumPy installed on all nodes.

For this notebook you will also need the ipython-sql extension by Catherine Devlin.

In [5]:
#Load the ipython-sql magic command
%load_ext sql
The sql extension is already loaded. To reload it, use:
  %reload_ext sql

First we need to connect to the Greenplum database.

In [6]:
# Normally use the following:
# %sql postgresql://user:password@host/db
In [7]:
# Use a workaround to not display password!
with open("./.config") as conn_config:
    conn_info = conn_config.readline().strip()
ip = get_ipython()
ip.magic("%sql " + conn_info)
Out[7]:
u'Connected: [email protected]'

We will do our work in a separate schema (container inside a database) to keep things tidy.

In [8]:
%%sql 
-- Some quick tests of the capability of PL/Python on 
-- PostgreSQL and Greenplum DB.
-- Create a schema to do some work in

DROP SCHEMA IF EXISTS plp CASCADE;
CREATE SCHEMA plp;
Done.
Done.
Out[8]:
[]

First we demonstrate what a User Defined Function (UDF) looks like in plain SQL.

In [9]:
%%sql 
-- Simple SQL User Defined Function to get started

CREATE FUNCTION plp.times2(INT)
RETURNS INT
AS $$
SELECT 2 * $1;
$$ LANGUAGE sql;
Done.
Out[9]:
[]
In [10]:
%%sql 
--Try it out

SELECT plp.times2(10);
1 rows affected.
Out[10]:
times2
20
In [11]:
%%sql
-- Test using Python code

CREATE FUNCTION plp.pymax (a integer, b integer)
RETURNS integer
AS $$
if a > b:
    return a
return b
$$ LANGUAGE plpythonu;
Done.
Out[11]:
[]
In [12]:
%%sql
--Test Python code

SELECT plp.pymax(10, 5);
1 rows affected.
Out[12]:
pymax
10
In [13]:
%%sql
-- Create a composite return type

CREATE TYPE plp.named_value AS (
  name  text,
  value  integer
);
Done.
Out[13]:
[]
In [14]:
%%sql
--Simple function which returns a composite object

CREATE OR REPLACE FUNCTION plp.make_pair (name text, value integer)
RETURNS plp.named_value
AS $$
return [ name, value ]
  # or alternatively, as tuple: return ( name, value )
  # or as dict: return { "name": name, "value": value }
  # or as an object with attributes .name and .value
$$ LANGUAGE plpythonu;
Done.
Out[14]:
[]
In [15]:
%%sql
--Try out the function
-- (See https://en.wikipedia.org/wiki/Zozimus)

SELECT plp.make_pair('Zozimus', 1);
1 rows affected.
Out[15]:
make_pair
(Zozimus,1)

Note that UDFs and PL/X functions can be overloaded: several functions may share the same name as long as their argument signatures differ.

In [16]:
%%sql
--Using NumPy inside a PL/Python function

CREATE OR REPLACE FUNCTION plp.make_pair (name text)
RETURNS plp.named_value
AS $$
import numpy as np
a = np.arange(100)
return [name, a[2]]
$$ LANGUAGE plpythonu;
Done.
Out[16]:
[]
In [17]:
%%sql
--Try it out

SELECT plp.make_pair('Horatio');
1 rows affected.
Out[17]:
make_pair
(Horatio,2)

Note that the SQL calling syntax affects how the results are returned.

In SQL this is the difference between returning the composite type as a single column and splitting the result into its component fields (name and value here).

In [18]:
%%sql

SELECT * FROM plp.make_pair('Horatio');
1 rows affected.
Out[18]:
name value
Horatio 2
In [19]:
%%sql
--Returning a set of results using SETOF

CREATE OR REPLACE FUNCTION make_pair_sets (name text)
RETURNS SETOF plp.named_value
AS $$
import numpy as np
return ((name, i) for i in np.arange(3))
$$ LANGUAGE plpythonu;
Done.
Out[19]:
[]
In [20]:
%%sql
--Try it out

SELECT * FROM make_pair_sets('Gerald');
3 rows affected.
Out[20]:
name value
Gerald 0
Gerald 1
Gerald 2
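
Under the covers the function body hands the database a Python generator, and every yielded tuple becomes one row of the result set. A minimal plain-Python sketch of the same behaviour, outside the database:

```python
import numpy as np

def make_pair_sets(name):
    # Return a generator of (name, value) tuples, exactly as the
    # PL/Python body does; the database emits one row per tuple.
    return ((name, int(i)) for i in np.arange(3))

rows = list(make_pair_sets('Gerald'))
# rows == [('Gerald', 0), ('Gerald', 1), ('Gerald', 2)]
```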

Parallelisation

Now we will try to use parallelisation with PL/Python.

In [21]:
%%sql
--Set up some data to show parallelisation

DROP TABLE IF EXISTS plp.test_data;

CREATE TABLE plp.test_data AS
SELECT 
    'a'::text AS name
    , generate_series(0,1000000)::float AS x
    , generate_series(0,1000000)/100.0 AS y
DISTRIBUTED BY (name);

INSERT INTO plp.test_data 
SELECT 
    'b'::text AS name
    , generate_series(0,1000000)::float AS x
    , sin(generate_series(0,1000000)/100.0) AS y;

INSERT INTO plp.test_data 
SELECT 
    'c'::text AS name
    , generate_series(0,1000000)::float AS x
    , 100.0 + sin(generate_series(0,1000000)/100.0) AS y;
Done.
1000001 rows affected.
1000001 rows affected.
1000001 rows affected.
Out[21]:
[]
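
For reference, the three series just created can be mirrored locally with NumPy; this is a sketch for sanity-checking only, not part of the database workflow:

```python
import numpy as np

# Mirror plp.test_data locally: x runs over 0..1000000 for each name.
x = np.arange(1_000_001, dtype=float)
y_a = x / 100.0                  # name 'a': a straight line
y_b = np.sin(x / 100.0)          # name 'b': a sine wave
y_c = 100.0 + np.sin(x / 100.0)  # name 'c': the same wave shifted up by 100
```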
In [22]:
%%sql
-- Create a function to find the mean of some numbers

DROP FUNCTION IF EXISTS plp.np_mean(double precision[]);

CREATE OR REPLACE FUNCTION plp.np_mean(value_array double precision[])
RETURNS float
AS $$
import numpy as np
return np.mean(value_array)
$$ LANGUAGE plpythonu;
Done.
Done.
Out[22]:
[]
In [23]:
%%sql
-- Need to pass the numbers as an array using array_agg

SELECT plp.np_mean(array_agg(y)) FROM plp.test_data;
1 rows affected.
Out[23]:
np_mean
1700.00013004
In [24]:
%%sql
-- Now try to do this for each type of data in parallel by grouping

SELECT 
    name, 
    plp.np_mean(array_agg(y)) 
FROM plp.test_data 
GROUP BY name ORDER BY name;
3 rows affected.
Out[24]:
name np_mean
a 5000.0
b 0.000195060907772
c 100.000195061
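
The per-group means reported above can be checked with plain NumPy; the following is a local sketch that rebuilds each series and averages it:

```python
import numpy as np

x = np.arange(1_000_001, dtype=float)
# Recompute each group's mean and compare with the SQL output above.
mean_a = np.mean(x / 100.0)                  # 'a': about 5000.0
mean_b = np.mean(np.sin(x / 100.0))          # 'b': about 0.000195
mean_c = np.mean(100.0 + np.sin(x / 100.0))  # 'c': about 100.000195
```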
In [25]:
%%sql
-- Now try to do something even more interesting

DROP FUNCTION IF EXISTS plp.linregr(double precision[], double precision[]);

CREATE OR REPLACE FUNCTION 
    plp.linregr(x double precision[], y double precision[])
RETURNS float[]
AS $$
from scipy import stats
return stats.linregress(x, y)
$$ LANGUAGE plpythonu;
Done.
Done.
Out[25]:
[]
In [26]:
%%sql
-- Do linear regression for all data

SELECT plp.linregr(array_agg(x), array_agg(y)) 
FROM plp.test_data;
1 rows affected.
Out[26]:
linregr
[0.00333333331357, 33.3334732575, 0.335532910587, 0.0, 5.4031491216e-06]
In [27]:
%%sql
-- Now do it separately for each 'name'

SELECT name, plp.linregr(array_agg(x), array_agg(y)) 
FROM plp.test_data 
GROUP BY name ORDER BY name;
3 rows affected.
Out[27]:
name linregr
a [0.0100000000001, -6.82430254528e-08, 1.0, 0.0, 0.0]
b [-2.96599680404e-11, 0.000209890891792, -1.210882548e-05, 0.990338798439, 2.44945167615e-09]
c [-2.96599680394e-11, 100.000209891, -1.21088254796e-05, 0.99033879844, 2.44945167615e-09]
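
The five numbers returned by `scipy.stats.linregress` are the slope, intercept, correlation coefficient, p-value and standard error. For the exactly linear group 'a' the fit can be verified locally (a sketch assuming SciPy is installed):

```python
import numpy as np
from scipy import stats

# Group 'a' is exactly linear: y = x / 100, so the fit should be perfect.
x = np.arange(1_000_001, dtype=float)
slope, intercept, r_value, p_value, std_err = stats.linregress(x, x / 100.0)
# slope ≈ 0.01, intercept ≈ 0, r_value ≈ 1
```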

In this example we have shown how to run models separately for different subsets of the data using the GROUP BY clause. It is important to have distributed your data appropriately (here, by name) to utilise the parallel architecture.

For further information see these notes on using PL/R in addition to PL/Python.