The RR term in autocorrelations

Author: Lehman Garrison (http://lgarrison.github.io)

$\newcommand{\RR}{\mathrm{RR}}$ $\newcommand{\DD}{\mathrm{DD}}$ When computing a two-point correlation function estimator like $\xi(r) = \frac{\DD}{\RR} - 1$, the $\RR$ term can be computed analytically if the domain is a periodic box. Often, this is done as $$\begin{align} \RR_i &= N A_i \bar\rho \\ &= N A_i \frac{N}{L^2} \end{align}$$ where $\RR_i$ is the expected number of random-random pairs in bin $i$, $N$ is the total number of points, $A_i$ is the area (or volume if 3D) of bin $i$, $L$ is the box size, and $\bar\rho$ is the average density in the box.

However, using $\bar\rho = \frac{N}{L^2}$ is not quite right. When sitting on a particle, only $N-1$ other particles are available to fall in a bin some distance away; the excluded particle is the one you're sitting on, which is always at distance $0$. Thus, the correct expression is $$\RR_i = N A_i \frac{N-1}{L^2}.$$

The following notebook empirically demonstrates that using $N-1$ is correct, and that using $N$ introduces bias of order $\frac{1}{N}$ into the estimator. This is a tiny correction for large $N$ problems, but important for small $N$.
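The size of this bias follows directly from the ratio of the two expressions: using $N$ inflates $\RR$ by a factor $\frac{N}{N-1}$, which deflates $\frac{\DD}{\RR}$ by $\frac{N-1}{N} = 1 - \frac{1}{N}$. A quick sketch of how the fractional bias shrinks with $N$:

```python
import numpy as np

# Multiplicative bias in DD/RR from using N instead of N-1 in the density
N = np.array([10, 100, 1000, 10**6])
bias_factor = (N - 1) / N   # DD/RR is suppressed by this factor
fractional_bias = 1 / N     # i.e., a 1/N fractional underestimate

for n, b in zip(N, fractional_bias):
    print('N = {:>7d}: DD/RR biased low by {:.4%}'.format(n, b))
```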

Cross-correlations of two different particle sets don't suffer from this problem; the particle you're sitting on is never part of the set of particles under consideration for pair-making.
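For reference, in the cross-correlation case the analytic random-random count for two sets of sizes $N_1$ and $N_2$ is simply $\RR_i = N_1 N_2 \frac{A_i}{L^2}$, with no $-1$ correction. A minimal Monte Carlo sketch of this (using the same 2D periodic setup as the notebook below, with arbitrary example values of $N_1$ and $N_2$):

```python
import numpy as np

rng = np.random.RandomState(42)
N1, N2, L = 10, 15, 1.
bins = np.linspace(0., 0.49, 5)
dA = (bins[1:]**2 - bins[:-1]**2) * np.pi

# Analytic cross-pair RR: no (N-1) correction, since a particle
# is never a candidate pair partner for itself in the other set
RR_cross = N1 * N2 * dA / L**2

# Monte Carlo check: average cross pair counts over many realizations
n_iter = 20000
DD_mean = np.zeros(len(bins) - 1)
for _ in range(n_iter):
    x1 = rng.random_sample((2, N1)) * L
    x2 = rng.random_sample((2, N2)) * L
    dist = x1[:, :, None] - x2[:, None, :]   # (2, N1, N2) separations
    dist -= np.round(dist / L) * L           # periodic wrap
    dist = np.sqrt((dist**2).sum(axis=0))
    DD_mean += np.histogram(dist, bins=bins)[0]
DD_mean /= n_iter

print(DD_mean / RR_cross)  # each entry should be close to 1
```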

This $N-1$ correction is implemented in the Corrfunc code.

In [1]:
import numpy as np
In [2]:
# Generates a set of N uniform random points in boxsize L
# and returns binned pair counts
# Pair counts are doubled, and self-pairs are counted
def rand_DD(N, L, bins):
    # Make a 2D random set of N particles in [0,L)
    x = np.random.random((2,N))*L
    
    # Form the periodic distance array
    dist = x[:,None] - x[...,None]
    dist -= np.round(dist/L)*L
    dist = (dist**2).sum(axis=0)**.5
    
    # Count pairs in the given bins
    DD = np.histogram(dist, bins=bins)[0]
    
    return DD
In [3]:
# DD parameters
N = 10
L = 1.
bins = np.linspace(0., 0.49, 5)
n_iter = 100000  # number of times to repeat the experiment
seed = 42
np.random.seed(seed)

# First, calculate RR analytically
dA = (bins[1:]**2 - bins[:-1]**2)*np.pi  # The area of each 2D annular bin

RR = np.empty((2, len(dA)))
RR[:] = N/L**2*dA

# Calculate RR using N-1 and N
RR[0] *= N - 1
RR[1] *= N

# If the first bin includes 0, the random term must include the self-pairs
if bins[0] == 0.:
    RR[:, 0] += N

xi_mean = np.zeros((2, len(bins)-1))
for _ in range(n_iter):
    DD = rand_DD(N, L, bins)
    xi_mean += DD/RR
    
xi_mean /= n_iter
print('DD/RR using density (N-1) = {}'.format(xi_mean[0]))
print('DD/RR using density (N)   = {}'.format(xi_mean[1]))
DD/RR using density (N-1) = [ 1.00022627  0.99934544  0.99956887  1.00006867]
DD/RR using density (N)   = [ 0.96817988  0.8994109   0.89961199  0.9000618 ]

As expected, using $N$ in the density leads to a fractional bias of $\frac{1}{N} = 10\%$ in $\frac{\DD}{\RR}$ when averaging over many realizations of $\DD$.

The first bin shows a smaller bias because we chose a binning that starts at $0$ separation, so it includes self-pairs, which are independent of the $N$ vs. $N-1$ density term.
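In fact, the first-bin value can be predicted exactly: with self-pairs included, the expected count in the first bin is $N + N(N-1)\frac{A_1}{L^2}$, while the naive $\RR$ is $N + N^2\frac{A_1}{L^2}$, and their ratio reproduces the $\approx 0.968$ seen above. A quick check of this arithmetic:

```python
import numpy as np

N, L = 10, 1.
bins = np.linspace(0., 0.49, 5)
A1 = np.pi * (bins[1]**2 - bins[0]**2)  # area of the first annular bin

expected_DD = N + N * (N - 1) * A1 / L**2  # true expectation, with self-pairs
naive_RR    = N + N**2 * A1 / L**2         # RR using N instead of N-1

print(expected_DD / naive_RR)  # ~0.968, matching the first bin above
```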