PySAL/R Comparison

NOTE: work in progress

This document compares the Python library PySAL with several packages from the R statistical environment in terms of spatial econometric functionality.

The following functionality is tested:

  • Weights creation
  • OLS
  • Kelejian & Prucha (K&P) heteroskedasticity-consistent spatial error model
  • K&P spatial lag model
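
As a point of reference for the first item, the rook-contiguity lattice weights that both `ps.lat2W` (PySAL) and `cell2nb` (spdep) build can be sketched in pure numpy/scipy, independent of either library (the helper name `lattice_w` is ours):

```python
import numpy as np
from scipy import sparse

def lattice_w(k):
    """Row-standardized rook-contiguity weights for a k x k regular lattice."""
    n = k * k
    rows, cols = [], []
    for i in range(n):
        r, c = divmod(i, k)
        # rook neighbours: up, down, left, right (where they exist)
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < k and 0 <= cc < k:
                rows.append(i)
                cols.append(rr * k + cc)
    W = sparse.csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n))
    # row-standardize: divide each row by its number of neighbours
    return sparse.diags(1.0 / np.asarray(W.sum(axis=1)).ravel()).dot(W)

W = lattice_w(3)
print(W.shape)  # (9, 9)
print(W.sum())  # 9.0 -- each of the 9 rows sums to 1 after standardization
```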

Distribution

This is an IPython notebook hosted on GitHub as a gist. A rendered version can be viewed online, and the notebook itself can be downloaded. Feel free to fork it and have a go at playing with it. Any change proposal can be submitted as a pull request (PR) following standard GitHub procedures.

Dependencies

In [1]:
%pylab inline
Populating the interactive namespace from numpy and matplotlib
In [2]:
import time, seaborn
import pandas as pd
import numpy as np
import pysal as ps
import matplotlib.pyplot as plt
from scipy.sparse import eye
from scipy.sparse.linalg import inv as spinv
from scipy.linalg import inv
from pysal.spreg.utils import spdot
In [3]:
%load_ext rpy2.ipython
In [4]:
%%R
library(spdep)
library(sphet)
print(sessionInfo())
Loading required package: sp
Loading required package: Matrix
R version 3.0.2 (2013-09-25)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] tools     stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] sphet_1.5    spdep_0.5-77 Matrix_1.1-2 sp_1.0-13   

loaded via a namespace (and not attached):
 [1] boot_1.3-9      coda_0.16-1     deldir_0.0-22   grid_3.0.2     
 [5] lattice_0.20-24 LearnBayes_2.12 MASS_7.3-29     nlme_3.1-113   
 [9] parallel_3.0.2  splines_3.0.2  
In [5]:
%%R
# Utilities to run tests on the R side
buildW <- function(n){
    t0 = Sys.time()
    w = cell2nb(sqrt(n), sqrt(n))
    we = nb2listw(w)
    t1 = Sys.time()
    out = list('we'=we, 'tim'=t1-t0)
    return(out)
}

runOLS <- function(y, x){
    db <- cbind(data.frame(y=y), data.frame(x))
    t0 = Sys.time()
    m <- lm('y ~ V1 + V2 + V3', db)
    t1 = Sys.time()
    return(t1-t0)
}

runGMERR <- function(y, x, w){
    db <- cbind(data.frame(y=y), data.frame(x))
    t0 = Sys.time()
    m <- spreg(y ~ V1 + V2 + V3, db, listw=w)
    t1 = Sys.time()
    return(t1-t0)
}

runMLERR <- function(y, x, w){
    db <- cbind(data.frame(y=y), data.frame(x))
    t0 = Sys.time()
    m <- errorsarlm(y ~ V1 + V2 + V3, db, listw=w)
    t1 = Sys.time()
    return(t1-t0)
}

runGMLAG <- function(y, x, w){
    db <- cbind(data.frame(y=y), data.frame(x))
    t0 = Sys.time()
    m <- spreg(y ~ V1 + V2 + V3, db, listw=w, model = "lag")
    t1 = Sys.time()
    return(t1-t0)
}

runMLLAG <- function(y, x, w){
    db <- cbind(data.frame(y=y), data.frame(x))
    t0 = Sys.time()
    m <- lagsarlm(y ~ V1 + V2 + V3, db, listw=w)
    t1 = Sys.time()
    return(t1-t0)
}
In [6]:
np.random.seed(1234)
In [7]:
def sim(sizes, rho, lamb, ml=False):
    timings = pd.DataFrame()
    %R timingsR = data.frame()
    for n in sizes:
        ### Data
        print "Working on Size %i" % n
        timing = {}
        # Python
        t0 = time.time()
        w = ps.lat2W(int(np.sqrt(n)), int(np.sqrt(n)))
        t1 = time.time()
        w.transform = 'R'  # row-standardize (not included in the timing)
        timing['W'] = t1 - t0
        x = np.random.random((n, 3))
        X = np.hstack((np.ones((n, 1)), x))
        eps = np.random.normal(size=(n, 1))
        xb = np.dot(X, betas)
        spxb_r = ps.spreg.utils.power_expansion(w, xb, rho)
        speps_r = ps.spreg.utils.power_expansion(w, eps, rho)
        speps_l = ps.spreg.utils.power_expansion(w, eps, lamb)
        y = xb + eps
        y_lag = spxb_r + speps_r
        y_err = xb + speps_l
        # R
        %Rpush n x y y_lag y_err
        %R x = as.data.frame(x)
        %R wtw = buildW(n)
        %R res = data.frame(W=wtw$tim)
        %R w = wtw$we
        ### OLS
        t0 = time.time()
        m = ps.spreg.OLS(y, x)
        t1 = time.time()
        timing['OLS'] = t1 - t0
        %R t = runOLS(y, x)
        %R res = cbind(res, data.frame(OLS=t))
        ### Spatial Error
        t0 = time.time()
        m = ps.spreg.GM_Error_Het(y_err, x, w=w)
        t1 = time.time()
        timing['Err_gm'] = t1 - t0
        %R t = runGMERR(y_err, x, w)
        %R res = cbind(res, data.frame(Err_gm=t))
        if ml:
            t0 = time.time()    
            m = ps.spreg.ML_Error(y_err, x, w=w)
            t1 = time.time()
            timing['Err_ml'] = t1 - t0
            %R t = runMLERR(y_err, x, w)
            %R res = cbind(res, data.frame(Err_ml=t))
        # Spatial lag
        t0 = time.time()
        m = ps.spreg.GM_Lag(y_lag, x, w=w)
        t1 = time.time()
        timing['Lag_gm'] = t1 - t0
        %R t = runGMLAG(y_lag, x, w)
        %R res = cbind(res, data.frame(Lag_gm=t))
        if ml:
            t0 = time.time()
            m = ps.spreg.ML_Lag(y_lag, x, w=w)
            t1 = time.time()
            timing['Lag_ml'] = t1 - t0
            %R t = runMLLAG(y_lag, x, w)
            %R res = cbind(res, data.frame(Lag_ml=t))
        timings['n%i'%n] = pd.Series(timing)
        %R row.names(res) = paste('n', n, sep='')
        %R timingsR = rbind(timingsR, res)
    timingsPY = timings.T
    timingsPY.index = pd.MultiIndex.from_arrays([['Py']*timingsPY.shape[0], \
                                               timingsPY.index])
    %Rpull timingsR
    ri = %R rownames(timingsR)
    timingsR.index = pd.MultiIndex.from_arrays([['R']*ri.shape[0], ri])
    timings = timingsPY.T.join(timingsR.T).T
    return timings
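
The `t0`/`t1` wall-clock pattern repeated throughout `sim` could be factored into a small helper; `timed` below is a hypothetical convenience sketch, not part of PySAL:

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    t0 = time.time()
    res = fn(*args, **kwargs)
    t1 = time.time()
    return res, t1 - t0

# usage, e.g.: m, secs = timed(ps.spreg.OLS, y, x)
res, secs = timed(sum, range(1000))
print(res)  # 499500
```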

Data

The basic model we are going to test against is a linear regression with three explanatory variables to which, depending on the specification, we will append spatial terms:

$$ y = (\rho W y) + \alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + (\lambda W u) + \epsilon $$
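
For $|\rho| < 1$, the spatially lagged terms can be generated either through the power-series expansion the notebook uses (`power_expansion`) or, equivalently, by solving the reduced form $y = (I - \rho W)^{-1}(X\beta + \epsilon)$ directly. A minimal numpy sketch with a toy ring-shaped row-standardized W (a stand-in for the lattice weights, not the one used in the benchmark):

```python
import numpy as np

np.random.seed(1234)
n, rho = 4, 0.5
# toy row-standardized W: a ring where each unit's two neighbours weigh 0.5
W = 0.5 * (np.eye(n, k=1) + np.eye(n, k=-1) +
           np.eye(n, k=n - 1) + np.eye(n, k=-(n - 1)))
X = np.hstack((np.ones((n, 1)), np.random.random((n, 2))))
betas = np.ones((X.shape[1], 1))
eps = np.random.normal(size=(n, 1))
# reduced form of y = rho*W*y + X*betas + eps:
#   y = (I - rho*W)^{-1} (X*betas + eps)
y = np.linalg.solve(np.eye(n) - rho * W, X.dot(betas) + eps)
# verify the structural equation holds
print(np.abs(y - (rho * W.dot(y) + X.dot(betas) + eps)).max())  # ~0
```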

We will record the running time of these models for the following number of observations:

In [8]:
sizes = [100, 625, 900, 3025, 4900, 10000, 25600, 50625, 99225]
betas = np.array([[1, 1, 1, 1]]).T
rho = 0.5
lamb = 0.5
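
All sizes are perfect squares, so each n maps onto a k × k regular lattice, which is what both `ps.lat2W(sqrt(n), sqrt(n))` and `cell2nb(sqrt(n), sqrt(n))` expect; a quick sanity check:

```python
import math

sizes = [100, 625, 900, 3025, 4900, 10000, 25600, 50625, 99225]
ks = [int(round(math.sqrt(n))) for n in sizes]
assert all(k * k == n for k, n in zip(ks, sizes))  # every n is a perfect square
print(ks)  # lattice side lengths: [10, 25, 30, 55, 70, 100, 160, 225, 315]
```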

Running with maximum likelihood included

In [9]:
%%time
ssizes = sizes[:5]
stimings = sim(ssizes, rho, lamb, ml=True)
/home/dani/code/pysal_darribas/pysal/spreg/user_output.py:349: FutureWarning: comparison to `None` will result in an elementwise object comparison in the future.
  if i == None:
/home/dani/anaconda/envs/pydata/lib/python2.7/site-packages/scipy/optimize/_minimize.py:570: RuntimeWarning: Method 'bounded' does not support relative tolerance in x; defaulting to absolute tolerance.
  "defaulting to absolute tolerance.", RuntimeWarning)
need more than 0 values to unpack
Working on Size 100
Working on Size 625
Working on Size 900
Working on Size 3025
Working on Size 4900
CPU times: user 45min 17s, sys: 1min 46s, total: 47min 3s
Wall time: 32min 38s
/home/dani/code/pysal_darribas/pysal/spreg/twosls.py:139: FutureWarning: comparison to `None` will result in an elementwise object comparison in the future.
  if q == None and h == None:
In [10]:
stimings
Out[10]:
            Err_gm       Err_ml    Lag_gm      Lag_ml       OLS         W
Py n100   0.022200     0.015826  0.003856    0.018613  0.002736  0.000991
   n625   0.025810     0.594324  0.003643    0.635988  0.002734  0.063488
   n900   0.027929     1.651768  0.003526    1.440586  0.002547  0.003898
   n3025  0.040959    54.081272  0.004570   43.543330  0.003509  0.014238
   n4900  0.054737  1309.297623  0.007065  198.539012  0.004447  0.026224
R  n100   0.516554     0.312385  0.006613    0.228411  0.008700  0.099601
   n625   1.223979     1.473421  0.008109    0.891190  0.003769  0.755120
   n900   1.265054     1.841072  0.008744    1.767377  0.003879  0.812916
   n3025  3.664639    32.921955  0.014100   32.429866  0.006280  2.649967
   n4900  5.406586     2.101724  0.020745    2.145562  0.008604  4.350219
In [11]:
for col in stimings:
    f, ax = plt.subplots(1)
    stimings[col].groupby(level=0).plot(ax=ax)
    plt.xlabel('N. Observations')
    plt.ylabel('Seconds')
    plt.legend()
    plt.title(col)

Running without maximum likelihood

In [12]:
%%time
timings = sim(sizes, rho, lamb)
need more than 0 values to unpack
Working on Size 100
Working on Size 625
Working on Size 900
Working on Size 3025
Working on Size 4900
Working on Size 10000
Working on Size 25600
Error: cannot allocate vector of size 4.9 Gb
In addition: There were 50 or more warnings (use warnings() to see the first 50)
Error: cannot allocate vector of size 4.9 Gb
Working on Size 50625
Error: cannot allocate vector of size 19.1 Gb
In addition: Warning messages:
1: In if (!(model %in% c("ols", "ols.end"))) { :
  the condition has length > 1 and only the first element will be used
2: In if (model %in% c("sarar", "lag", "ivhac")) { :
  the condition has length > 1 and only the first element will be used
3: In if (model == "error") { :
  the condition has length > 1 and only the first element will be used
4: In if (model == "ols.end") { :
  the condition has length > 1 and only the first element will be used
5: In if (model %in% c("sarar", "error")) { :
  the condition has length > 1 and only the first element will be used
Error: cannot allocate vector of size 19.1 Gb
Working on Size 99225
Error: cannot allocate vector of size 73.4 Gb
In addition: Warning messages:
1: In if (!(model %in% c("ols", "ols.end"))) { :
  the condition has length > 1 and only the first element will be used
2: In if (model %in% c("sarar", "lag", "ivhac")) { :
  the condition has length > 1 and only the first element will be used
3: In if (model == "error") { :
  the condition has length > 1 and only the first element will be used
4: In if (model == "ols.end") { :
  the condition has length > 1 and only the first element will be used
5: In if (model %in% c("sarar", "error")) { :
  the condition has length > 1 and only the first element will be used
Error: cannot allocate vector of size 73.4 Gb
CPU times: user 3min 45s, sys: 30 s, total: 4min 15s
Wall time: 22min 13s
In [13]:
timings
Out[13]:
              Err_gm    Lag_gm       OLS          W
Py n100     0.020799  0.003664  0.002098   0.000670
   n625     0.026594  0.002983  0.002443   0.005336
   n900     0.027249  0.003719  0.002479   0.006735
   n3025    0.041921  0.005710  0.003510   0.029185
   n4900    0.054417  0.006932  0.004522   0.045991
   n10000   0.088830  0.205651  0.006430   0.084250
   n25600   0.194143  6.968898  0.136538   0.785413
   n50625   0.491060  0.061421  0.348023  49.088749
   n99225   0.687504  0.123573  0.056800   0.789296
R  n100     0.372393  0.006577  0.002660   0.087507
   n625     0.356954  0.007866  0.003204   0.546080
   n900     0.413155  0.008901  0.003610   0.772779
   n3025    1.610812  0.014108  0.006154   2.663630
   n4900    3.627312  0.019419  0.008348   4.076419
   n10000  20.597504  0.076775  0.015285   8.635177
   n25600   0.056522  2.104784  0.056522  22.532605
   n50625   0.161837  0.203868  0.161837  47.339219
   n99225   0.393457  0.421947  0.393457   1.564577
In [14]:
for col in timings:
    f, ax = plt.subplots(1)
    timings[col].groupby(level=0).plot(ax=ax)
    plt.xlabel('N. Observations')
    plt.ylabel('Seconds')
    plt.legend()
    plt.title(col)