This Python notebook is about computing and plotting the Goldbach function. It requires some basic mathematical knowledge, but no fancy prime number theory! The main point is to perform some computations with Numba and some efficient plotting with Datashader.
Here is the definition of the Goldbach function from Wikipedia:
The function $g(E)$ is defined for all even integers $E > 2$ to be the number of different ways in which $E$ can be expressed as the sum of two primes. For example, $g(22) = 3$ since 22 can be expressed as the sum of two primes in three different ways ($22 = 11 + 11 = 5 + 17 = 3 + 19$).
The different prime pairs $(p_1, p_2)$ that sum to an even integer $E=p_1+p_2$ are called Goldbach partitions. So $g(E)$ is the count of distinct Goldbach partitions of $E$, without regard to the order of the primes in the pairs.
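Before any optimization, the definition can be checked with a naive sketch (plain Python with a trial-division primality test, for illustration only):

```python
# Naive sketch of g(E): count the pairs (p, E - p) with p <= E / 2, both prime.
def is_prime(k: int) -> bool:
    # trial-division primality test, fine for small k
    return k > 1 and all(k % d != 0 for d in range(2, int(k**0.5) + 1))

def g_naive(E: int) -> int:
    # restricting p to p <= E / 2 counts each unordered pair exactly once
    return sum(1 for p in range(2, E // 2 + 1) if is_prime(p) and is_prime(E - p))

print(g_naive(22))  # 3, from the partitions 3 + 19, 5 + 17 and 11 + 11
```

This brute-force version is far too slow for large ranges of $E$, which is why the rest of the notebook uses a sieve and Numba.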
Note that for Goldbach's conjecture to be false, there must be some $E > 2$ with $g(E) = 0$. This is very unlikely to occur, but has not been disproved yet. Anyway, here are the steps used in this notebook to compute the Goldbach function. Given a positive integer $n$:
- generate all the primes smaller than or equal to $n$, along with a boolean primality lookup vector,
- for each even number $E \leq n$, count the Goldbach partitions $(p, E-p)$,
- load the results into a Pandas dataframe and plot them with Matplotlib and Datashader.
```python
from typing import Tuple

from colorcet import palette
import datashader as ds
from datashader.mpl_ext import dsshow
import matplotlib.pyplot as plt
from numba import jit, njit, prange
import numpy as np
import pandas as pd
from sympy import sieve

plt.style.use("seaborn")
FS = (20, 10)  # figure size
```
The `generate_primes` function creates an array of primes smaller than or equal to $n$ and a boolean vector `is_prime_vec` of size $n+1$: `is_prime_vec[k]` is `True` if `k` is a prime.
```python
def generate_primes(n: int) -> Tuple[np.ndarray, np.ndarray]:
    # primerange(b) yields the primes in [2, b), so use n + 1 to include n
    primes = np.array([i for i in sieve.primerange(n + 1)])
    is_prime_vec = np.zeros(n + 1, dtype=np.bool_)
    is_prime_vec[primes] = True
    return primes, is_prime_vec
```
```python
n = 11
primes, is_prime_vec = generate_primes(n)
print(f"primes = {primes}")
print(f"is_prime_vec = {is_prime_vec}")
```
Now we show how $g(E)$ can be computed for a given value of $E$ with Numba. In the `compute_g` function, we loop over all primes $p \leq E/2$ with a `for` loop: if $E-p$ is also a prime, then $(p, E-p)$ is a partition of $E$. By looping over all primes $p \leq E/2$, we count each possible partition of $E$ exactly once. The upper bound of the `for` loop is computed using `np.searchsorted`, since the `primes` array is sorted: it returns the number of primes smaller than or equal to $E/2$, i.e. one past the index of the largest such prime.
```python
@jit(nopython=True)
def compute_g(is_prime_vec: np.ndarray, primes: np.ndarray, E: int) -> int:
    assert E < len(is_prime_vec)
    assert E % 2 == 0
    E_half = int(0.5 * E)

    # initialization
    count = 0  # number of prime pairs

    # we loop over all the prime numbers smaller than or equal to half of E
    i_max = np.searchsorted(primes, E_half, side="right")
    for i in range(i_max):
        if is_prime_vec[E - primes[i]]:
            count += 1

    return count
```
We check the function for a few values of $E$:
```python
E = 22
primes, is_prime_vec = generate_primes(E)
g = compute_g(is_prime_vec, primes, E)
assert g == 3

E = 1890  # https://www.ias.ac.in/article/fulltext/reso/019/11/1028-1037
primes, is_prime_vec = generate_primes(E)
g = compute_g(is_prime_vec, primes, E)
assert g == 91

E = 1_000_000  # https://resources.wolframcloud.com/FunctionRepository/resources/Goldbach/
primes, is_prime_vec = generate_primes(E)
g = compute_g(is_prime_vec, primes, E)
assert g == 5402
```
Now we are going to loop over all even values $E \leq n$ to compute all the corresponding values of $g$. Note that in the following `compute_g_vector` function, the outer loop has a constant step size of 1, in order to use Numba's `prange`, which only supports this unit step size. This means that we loop over contiguous $E/2$ integer values instead of even $E$ values. Also, we compute `is_prime_vec` and `primes` only once and use them for all the evaluations of $g$. `n` is calculated from the length of `is_prime_vec`, and we assume that `primes` holds the primes corresponding to `is_prime_vec`.
```python
@njit(parallel=True)
def compute_g_vector(is_prime_vec: np.ndarray, primes: np.ndarray) -> np.ndarray:
    n_max = len(is_prime_vec) - 1
    n_max_half = int(0.5 * n_max) + 1
    g_vec = np.empty(n_max_half, dtype=np.uint)
    for E_half in prange(n_max_half):
        count = 0
        E = 2 * E_half
        i_max = np.searchsorted(primes, E_half, side="right")
        for i in range(i_max):
            if is_prime_vec[E - primes[i]]:
                count += 1
        g_vec[E_half] = np.uint(count)
    return g_vec
```
```python
n = 10
primes, is_prime_vec = generate_primes(n)
g_vec = compute_g_vector(is_prime_vec, primes)
g_vec
```
The $i$-th value of `g_vec` corresponds to $g(2 \, i)$ with $i \geq 0$:
| i | E | g_vec |
|---|---|-------|
| 0 | 0 | 0 |
| 1 | 2 | 0 |
| 2 | 4 | 1 |
| 3 | 6 | 1 |
| 4 | 8 | 1 |
| 5 | 10 | 2 |
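In other words, $g(E)$ for an even $E$ is stored at index $E/2$. A small sketch of the lookup, using the values from the table above:

```python
import numpy as np

g_vec = np.array([0, 0, 1, 1, 1, 2])  # output of the n = 10 run, as tabulated above

def g_from_vec(E: int) -> int:
    # g(E) for even E is stored at index E // 2
    return int(g_vec[E // 2])

print(g_from_vec(10))  # 2, since 10 = 3 + 7 = 5 + 5
```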
We can check $g$ at least for some small values of $E$:
```python
n = 56
primes, is_prime_vec = generate_primes(n)
g_vec = compute_g_vector(is_prime_vec, primes)
g_vec_ref = [0, 0, 1, 1, 1, 2, 1, 2, 2, 2, 2, 3, 3, 3, 2, 3, 2, 4, 4, 2, 3, 4,
             3, 4, 5, 4, 3, 5, 3]
np.testing.assert_array_equal(g_vec, g_vec_ref)
```
Finally we wrap everything into a function that also loads the values of `g_vec` into a Pandas dataframe for convenience:
```python
def compute_g_df(n):
    primes, is_prime_vec = generate_primes(n)
    g_vec = compute_g_vector(is_prime_vec, primes)
    g_df = pd.DataFrame(data={"E": 2 * np.arange(len(g_vec)), "g": g_vec})
    g_df = g_df[g_df.E > 2]  # the function g(E) is defined for all even integers E > 2
    return g_df

compute_g_df(11)
```
We start by computing the Goldbach function up to $E = 20000$:
```python
%%time
n = 20_000
g_df_small = compute_g_df(n)
len(g_df_small)
```
And plot it with Matplotlib:
```python
ax = g_df_small.plot(x="E", y="g", style=".", ms=5, alpha=0.5, legend=False, figsize=FS)
_ = ax.set(
    title="Goldbach's comet",
    xlabel="E",
    ylabel="g(E)",
)
ax.autoscale(enable=True, axis="x", tight=True)
ax.grid(True)
```
Now let's compute the Goldbach function with a larger value of $n$ (this takes about 25 - 30s on my laptop, Intel(R) i7-7700HQ CPU @ 2.80GHz with 8 cores):
```python
%%time
n = 2_000_000
g_df = compute_g_df(n)
len(g_df)
```
The plot is made using Datashader's interface for Matplotlib.
```python
cmap = palette["dimgray"][::-1]
bg_col = "white"
```
```python
%%time
fig, ax = plt.subplots(figsize=FS)
_ = dsshow(
    g_df,
    ds.Point("E", "g"),
    norm="eq_hist",
    cmap=cmap,
    aspect="auto",
    ax=ax,
)
ax.grid(False)
ax.set_facecolor(bg_col)
_ = ax.set(title="Goldbach's comet", xlabel="E", ylabel="g(E)")
```
We can clearly observe some dense lines in this "comet tail". In order to visualize this vertical distribution of prime pair counts, we are going to normalize $g$. As explained on Wikipedia:
An illuminating way of presenting the comet data is as a histogram. The function $g(E)$ can be normalized by dividing by the locally averaged value of $g$, $g_{av}$, taken over perhaps 1000 neighboring values of the even number $E$. The histogram can then be accumulated over a range of up to about 10% either side of a central $E$.
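Note that `pandas.Series.rolling` with `center=True` centers the window on each point and yields `NaN` wherever a full window does not fit, so roughly the first and last 500 values of the local average will be `NaN`. A minimal sketch of this behavior on a toy series:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
# window=3, centered: the mean is only defined where a full window fits
print(s.rolling(window=3, center=True).mean().tolist())
# [nan, 2.0, 3.0, 4.0, nan]
```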
```python
%%time
g_df["g_av"] = g_df["g"].rolling(window=1000, center=True).mean()
g_df["g_norm"] = g_df["g"] / g_df["g_av"]
```
```python
%%time
fig, ax = plt.subplots(figsize=FS)
_ = dsshow(
    g_df,
    ds.Point("E", "g_norm"),
    norm="eq_hist",
    cmap=cmap,
    aspect="auto",
    ax=ax,
)
ax.grid(False)
ax.set_facecolor(bg_col)
_ = ax.set(title="Normalized Goldbach function", xlabel="E", ylabel="g_norm(E)")
```
We can also plot the histogram of the comet data, which gives a kind of cross-section of the above plot:
```python
%%time
ax = g_df["g_norm"].hist(bins=1000, alpha=0.5, figsize=FS)
ax.grid(True)
_ = ax.set(
    title="Histogram of the normalized Goldbach function",
    xlabel="Normalized number of prime pairs",
    ylabel="Number of occurrences",
)
_ = plt.xticks(np.arange(0.5, 3.0, 0.1))
```
As described on the Wikipedia page for Goldbach's comet, the number of Goldbach partitions can be estimated using the following formula from Hardy and Littlewood (1922):

$$\frac{g(E)}{g_{av}} \approx c \prod_{p \in F(E/2)} \frac{p-1}{p-2}$$

where the product is taken over $F(E/2)$, the set of all primes $p$ that are factors of $E/2$. The constant $c$ is the twin primes constant:

$$c = \prod_{p \geq 3} \left( 1 - \frac{1}{(1-p)^2} \right)$$

where the product is taken over all primes larger than or equal to 3.
```python
%%time
primes = np.array([i for i in sieve.primerange(n)])
c = np.prod(1.0 - 1.0 / np.power(1.0 - primes[1:], 2))
c
```
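As a sanity check, the result should be close to the known value of the twin primes constant, $c \approx 0.6601618$. A small self-contained version with a basic Eratosthenes sieve (the $10^4$ cutoff is an assumption, but enough here for roughly four correct digits):

```python
import numpy as np

# basic Eratosthenes sieve up to an assumed cutoff of 10_000
N = 10_000
is_p = np.ones(N + 1, dtype=bool)
is_p[:2] = False
for k in range(2, int(N**0.5) + 1):
    if is_p[k]:
        is_p[k * k :: k] = False
odd_primes = np.flatnonzero(is_p)[1:]  # drop 2, the product starts at p = 3

c_approx = np.prod(1.0 - 1.0 / (odd_primes - 1.0) ** 2)
print(c_approx)  # close to the twin primes constant 0.6601618...
```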
So let's compute this estimate of the normalized Goldbach function with Numba's `njit`:
```python
@njit(parallel=True)
def compute_g_hl_vector(primes: np.ndarray, n_max: int, c: float) -> np.ndarray:
    n_max_half = int(0.5 * n_max) + 1
    g_hl_vec = np.empty(n_max_half, dtype=np.float64)
    for E_half in prange(n_max_half):
        i_max = np.searchsorted(primes, E_half, side="right")
        prod = 1.0
        for i in range(1, i_max):  # start at 1 to skip primes[0] = 2
            p = primes[i]
            if E_half % p == 0:  # if p is a factor of E/2
                prod *= (p - 1.0) / (p - 2.0)
        g_hl_vec[E_half] = np.float64(prod)
    g_hl_vec *= c
    return g_hl_vec
```
```python
%%time
g_hl_vec = compute_g_hl_vector(primes, n, c)
g_hl_df = pd.DataFrame(data={"E": 2 * np.arange(len(g_hl_vec)), "g_norm": g_hl_vec})
```
```python
%%time
fig, ax = plt.subplots(figsize=FS)
_ = dsshow(
    g_hl_df,
    ds.Point("E", "g_norm"),
    norm="eq_hist",
    cmap=cmap,
    aspect="auto",
    ax=ax,
)
ax.grid(False)
ax.set_facecolor(bg_col)
_ = ax.set(title="Hardy-Littlewood estimate of the normalized Goldbach function", xlabel="E", ylabel="g_norm(E) estimate")
```
The vertical distribution of the dense lines seems to be similar to the one from the normalized Goldbach function. This can be checked by computing and plotting both kernel density estimates.
```python
%%time
ax = g_df["g_norm"].plot.density(alpha=0.5, figsize=FS, label="Goldbach function")
ax = g_hl_df.g_norm.plot.density(alpha=0.5, ax=ax, label="Hardy-Littlewood estimate")
ax.grid(True)
_ = ax.set(
    title="KDEs of the normalized Goldbach function and its Hardy-Littlewood estimate",
    xlabel="Normalized number of prime pairs",
    ylabel="Density",
)
_ = ax.legend()
_ = plt.xticks(np.arange(0.5, 3.0, 0.1))
_ = ax.set_xlim(0.5, 2.5)
```
Finally, we are going to isolate a part of the densest line from the comet tail (for a normalized number of prime pairs around 0.66). As explained on Wikipedia:
Of particular interest is the peak formed by selecting only values of $E/2$ that are prime. [...] The peak is very close to a Gaussian form.
```python
g_df["E_half"] = (0.5 * g_df["E"]).astype(int)
g_df_primes = g_df[g_df["E_half"].isin(primes)]
```
```python
%%time
fig, ax = plt.subplots(figsize=FS)
_ = dsshow(
    g_df_primes,
    ds.Point("E", "g_norm"),
    norm="eq_hist",
    cmap=cmap,
    aspect="auto",
    ax=ax,
)
ax.grid(False)
ax.set_facecolor(bg_col)
_ = ax.set(title="Normalized Goldbach function for prime E/2 values only", xlabel="E", ylabel="g_norm(E)")
```
```python
ax = g_df_primes["g_norm"].hist(bins=500, alpha=0.5, figsize=FS)
ax.grid(True)
_ = ax.set(
    title="Histogram of the normalized Goldbach function for prime E/2 values only",
    xlabel="Normalized number of prime pairs",
    ylabel="Number of occurrences",
)
```
As is often the case in number theory, computing the Goldbach function is quite expensive. But we saw that speeding up these CPU-bound computations with Numba is straightforward, as is plotting a dataset with a large number of points with Datashader. I am not familiar with this branch of mathematics, but with a new subject like this one, I am amazed at how easy it is to launch a notebook, play with the formulas, and visualize the data in Python.