Panel fixed effects: dummy and demeaning estimation

This notebook is a mental note about issues coming up when trying to estimate a panel with one and two-ways fixed effects. Any comments, suggestions or ideas are more than welcome.

You can access the notebook in two flavours:

  • Online (read-only): link
  • Raw file (as a gist): link

Last, the notebook relies on two functions I provide in a separate Python file in order not to clutter and make this text more readable. You get the additional file when you pull the gist, but you can also check it online here. All the code examples are written in Python and rely on the following dependencies that you will need to reproduce the code in your machine:

In [1]:
import panel_ols
import pandas as pd
print "Pandas: ", pd.__version__
import pysal as ps
print "PySAL: ", ps.version
import numpy as np
print "Numpy: ", np.__version__

seed = np.random.seed(1234)
Pandas:  0.12.0
PySAL:  1.7.0dev
Numpy:  1.7.1

Theoretical preliminars

One-way

The idea is that we have a panel with two dimensions (you can think of individuals $i$ over time $t$, but there can be other dimensions too). In formal notation, we begin with the following baseline model:

$$ y_{it} = \alpha + X_{it} \beta + \epsilon_{it} $$

Assuming the usual stuff about the good behaviour of $\epsilon_{it}$$, this can be estimated by OLS. If we want to control for time-invariant characteristics of individuals, we can include what is called a fixed-effect:

$$ y_{it} = X_{it} \beta + \mu_i + \epsilon_{it} $$

where $\mu_i$ absorbs every individual effect that does not vary over time. If we want to estimate this model, we can simply include a dummy of each $i$ and run OLS. The problem is that if the number of individuals is very large, one quickly runs out of degrees of freedom and increases multicollinearity. An equivalent way to work around this is suggested by Baltagi on p. 34 (of Third edition at least). We can demean $y_{it} and $X_{it}$ as follows:

$$ y_{it}* = y_{it} - \bar{y}_{i.} $$

where $\bar{y}_{i.}$ is the average of values for each individual. Running a straight OLS (without intercept) on the transformed variables results in the same $\beta$ estimates as we would by using dummy variables.

Two-way

If we also want to account for the unboserved individual-invariant heterogeneity, we can also include a time effect, converting the model in the so called two-way fixed-effects modes:

$$ y_{it} = X_{it} \beta + \mu_i + \nu_t + \epsilon_{it} $$

In theory, you could also estimate this using dummy variables, but the problems that arise in the one-way model are even more acute. Fortunately, Baltagi proposes another transformation that gets around this:

$$ y_{it}* = y_{it} - \bar{y}_{i.} - \bar{y}_{.t} + \bar{y}_{..} $$

If you apply this transformation to both $y$ and $X$, you can run a straight OLS (without intercept again) and obtain the $\beta$ estimates of interest.

Ilustration in Python

Let's now get our hands dirty and check this alternative ways of especifying the models are indeed equivalent. First we need some synthetic data to run the experiments:

In [2]:
n = 1000
d1 = 100
d2 = 10
seed = np.random.seed(1234)
x = np.random.random((n, 3))
fe1 = np.random.random_integers(0, d1, n)
fe1_d = pd.get_dummies(fe1, 'fe1')
fe2 = np.random.random_integers(0, d2, n)
fe2_d = pd.get_dummies(fe2, 'fe2')

e = np.random.normal(0, 1, (n, 1))
y = np.dot(x, np.array([[1., 1., 1.]]).T) \
    + fe1_d.ix[:, :-1].values.sum(axis=1)[:, None] \
    + fe2_d.values.sum(axis=1)[:, None] \
    + e

One-way

We can estimate the model including the dummies for the first effect:

In [3]:
xone = np.hstack((x, fe1_d.values))
m1_dummy = ps.spreg.BaseOLS(y, xone)

Or use the transformation, built into the method one_way_fe:

In [4]:
m1_demean = panel_ols.one_way_fe(y, x, fe1)

We can now compare the estimates of $\beta$:

In [5]:
np.hstack((m1_dummy.betas[:3], m1_demean.betas))
Out[5]:
array([[ 1.05322727,  1.05322727],
       [ 0.94063773,  0.94063773],
       [ 1.08095942,  1.08095942]])

If you check, the difference between the two is negligible:

In [6]:
m1_dummy.betas[:3] - m1_demean.betas
Out[6]:
array([[ -4.15223411e-14],
       [  1.57651669e-14],
       [ -1.82076576e-14]])

Also, you can see how the multicollinearity issue starts to creep up although, in this case, it is not unacceptable:

In [7]:
ci = ps.spreg.diagnostics.condition_index

print "Dummy: %.4f | Demean: %.4f"%(ci(m1_dummy), ci(m1_demean))
Dummy: 7.0236 | Demean: 1.0713
/home/dani/code/pysal_darribas/pysal/spreg/diagnostics.py:619: ComplexWarning: Casting complex values to real discards the imaginary part
  ci_result = sqrt(max_eigval/min_eigval)

Two-way

Here there are three alternatives and, in theory, they all give the same estimates although, as we are going to see, differences are a bit larger.

We can estimate with all the dummies directly:

In [8]:
xtwo = np.hstack((xone, fe2_d.values[:, 1:]))
m2_dummies = ps.spreg.BaseOLS(y, xtwo)

We can estimate the fully transformed model:

In [9]:
m2_demean = panel_ols.two_way_fe(y, x, fe1, fe2)

Or you can have a middle way using dummies for the effect with less categories (fe2 in this case) and the demeaner for the other one:

In [10]:
m2_mix = panel_ols.one_way_fe(y, xone[:, :-1], fe2)

We can now check the coefficients of the three methods:

In [11]:
pd.DataFrame({'Dummies': m2_dummies.betas[:3].flatten(), \
              'Mix': m2_mix.betas[:3].flatten(), \
              'Demean': m2_demean.betas.flatten()})
Out[11]:
Demean Dummies Mix
0 1.051114 1.051047 1.051047
1 0.963858 0.964616 0.964616
2 1.072238 1.070982 1.070982

And their multicollinearity:

In [12]:
pd.Series({'Dummies': ci(m2_dummies), \
              'Mix': ci(m2_mix), \
              'Demean': ci(m2_demean)})
Out[12]:
Demean      1.073158
Dummies    11.582533
Mix        11.704122
dtype: float64

As you can see, estimates for Dummies and Mix are exactly the same (at least at the sixth decimal), while those for the Demean are slightly different. This is probably related to precission issues, but if you have any suggestion as to why this happens, please drop me a line, I'll be happy to learn!

Update

It turns out that the two-way fixed effects differ in the previous cells because the way I have randomly set up the indices for both dimensions. Demeaning a-la Baltagi only works in a true panel setting in which you only have one occurrence of a pair of indices $it$. When you indeed have this case, the numbers match.

In [15]:
n = 1000
d1 = 10
d2 = 100
seed = np.random.seed(1234)
x = np.random.random((n, 3))
# Balanced panel
d2 = int(n * 1. / d1)
fe1 = np.arange(d1)
fe2 = np.arange(d2)
fe = np.array([(i, j) for i in fe1 for j in fe2])
fe1 = fe[:, 0]
fe2 = fe[:, 1]
#
fe1_d = pd.get_dummies(fe1, 'fe1')
fe2_d = pd.get_dummies(fe2, 'fe2')
                                                         
e = np.random.normal(0, 1, (n, 1))
y = np.dot(x, np.array([[1., 1., 1.]]).T) \
        + fe1_d.ix[:, :-1].values.sum(axis=1)[:, None] \
        + e
        #+ fe2_d.values.sum(axis=1)[:, None] \
                                                         
        # One-way
m = panel_ols.one_way_fe(y, x, fe1)
                                                         
xone = np.hstack((x, fe1_d.values))
md = ps.spreg.BaseOLS(y, xone)
                                                         
print "De-meaned | Dummy"
for i in range(3):
    print np.round(m.betas[i][0], 9), " | ", \
            np.round(md.betas[i][0], 9)
print 'Condition index'
print ci(m), " | ", ci(md)
                                                         
        # Two-way
m = panel_ols.two_way_fe(y, x, fe1, fe2)
                                                         
malt = panel_ols.one_way_fe(y, xone[:, :-1], fe2)
                                                         
xtwo = np.hstack((xone, fe2_d.values[:, 1:]))
md = ps.spreg.BaseOLS(y, xtwo)
                                                         
print "\nDe-meaned | 1 demeaned-1 dummy | Dummy"
for i in range(3):
    print np.round(m.betas[i][0], 9), " | ", \
            np.round(malt.betas[i][0], 9), " | ", \
            np.round(md.betas[i][0], 9)
print 'Condition index'
print ci(m), " | ",ci(malt), " | ", ci(md)
De-meaned | Dummy
1.001466538  |  1.001466538
0.823691091  |  0.823691091
0.988729449  |  0.988729449
Condition index
1.09144305033  |  6.66571288935

De-meaned | 1 demeaned-1 dummy | Dummy
0.971197304  |  0.971197304  |  0.971197304
0.867112279  |  0.867112279  |  0.867112279
0.88178915  |  0.88178915  |  0.88178915
Condition index
1.06881037371  |  3.32261011076  |  29.8077107922