Abstract: Physical landscapes are shaped by elevation, valleys, and peaks. We might expect that information landscapes are molded by entropy, precision, and capacity constraints. To explore how these ideas might manifest, we introduce Jaynes’ world, an entropy game that maximises instantaneous entropy production.
In this talk we’ll argue that this landscape has a precision/capacity trade-off that suggests the underlying configuration requires a density matrix representation.
This game explores how structure, time, causality, and locality might emerge within a system governed solely by internal information-theoretic constraints. The hope is that it can serve as a minimal model for studying how such structure emerges.
Let $Z = \{Z_1, Z_2, \dots, Z_n\}$ be the full set of system variables. At game turn $t$, define a partition in which $X(t) \subseteq Z$ are the active variables (currently contributing to entropy) and $M(t) = Z \setminus X(t)$ are the latent or frozen variables, stored in the form of an information reservoir (Barato and Seifert (2014), Parrondo et al. (2015)).
We’ll argue that the configuration space must be represented by a density matrix,
$$
\rho(\theta) = \frac{1}{Z(\theta)} \exp\left(\sum_i \theta_i H_i\right),
$$
where the $H_i$ are Hermitian generators and the $\theta_i$ are natural parameters.
From this we can see that the log-partition function, which has an interpretation as the cumulant generating function, is $A(\theta) = \log Z(\theta)$.
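Under this interpretation, the derivatives of $A(\theta)$ generate the cumulants of the generators; in particular (a standard exponential-family identity, stated here for reference),
$$
\nabla_\theta A(\theta) = \mathbb{E}_{\rho(\theta)}[H], \qquad \nabla^2_\theta A(\theta) = \mathrm{Cov}_{\rho(\theta)}(H) = G(\theta),
$$
where $G(\theta)$ is the Fisher information matrix that appears throughout what follows.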
We define our system to have a maximum entropy of $N$ bits. If the dimension $d$ of the parameter space is fixed, this implies a minimum detectable resolution in natural parameter space, $\varepsilon \sim \frac{1}{2^N}$.
Note that if the dimension $d$ scales with $N$ (e.g., $d = \alpha N$ for some constant $\alpha$), then the resolution constraint becomes more complex. In this case, the volume of distinguishable states $\varepsilon^d$ must equal $2^N$, which leads to $\varepsilon = 2^{1/\alpha}$, a constant independent of $N$. This suggests that as the system’s entropy capacity grows, it maintains a constant resolution while exponentially increasing the number of distinguishable states.
Each variable $Z_i$ is associated with a generator $H_i$ and a natural parameter $\theta_i$. When we say a parameter $\theta_i \in X(t)$, we mean that the component of the system associated with $H_i$ is active at time $t$ and its parameter is evolving with $|\dot{\theta}_i| \geq \varepsilon$. This reflects the duality between variables, observables, and natural parameters that we find in exponential family representations and that we also see in a density matrix representation.
Our core axiom is that the system evolves by steepest ascent in entropy. The gradient of the entropy with respect to the natural parameters is given by
$$
\nabla_\theta S[\rho] = -G(\theta)\theta,
$$
where $G(\theta)$ is the Fisher information matrix.
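As a quick sanity check of this identity, here is a minimal sketch (not part of the game code itself) that compares $-G(\theta)\theta$ against a finite-difference gradient of the entropy for an illustrative four-state categorical exponential family; the variable names and $\theta$ values are arbitrary.

```python
import numpy as np

def probs(theta):
    """Categorical exponential family: p_i proportional to exp(theta_i)."""
    w = np.exp(theta - theta.max())
    return w / w.sum()

def entropy(theta):
    p = probs(theta)
    return -np.sum(p * np.log(p))

def fisher(theta):
    """Fisher information G(theta) = diag(p) - p p^T (covariance of the generators)."""
    p = probs(theta)
    return np.diag(p) - np.outer(p, p)

theta = np.array([1.2, -0.3, 0.4, -1.3])

# Entropy gradient from the identity dS/dtheta = -G(theta) theta
analytic = -fisher(theta) @ theta

# Finite-difference gradient for comparison
eps = 1e-6
numeric = np.array([
    (entropy(theta + eps * e) - entropy(theta - eps * e)) / (2 * eps)
    for e in np.eye(len(theta))
])

print("analytic:", analytic)
print("numeric: ", numeric)   # the two should agree to high precision
```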
import numpy as np
First we write some helper code to plot the histogram and compute its entropy.
import matplotlib.pyplot as plt
import mlai.plot as plot
def plot_histogram(ax, p, max_height=None):
heights = p
if max_height is None:
max_height = 1.25*heights.max()
# Safe entropy calculation that handles zeros
nonzero_p = p[p > 0] # Filter out zeros
S = - (nonzero_p*np.log2(nonzero_p)).sum()
# Define bin edges
bins = [1, 2, 3, 4, 5] # Bin edges
# Create the histogram
if ax is None:
fig, ax = plt.subplots(figsize=(6, 4)) # Adjust figure size
ax.hist(bins[:-1], bins=bins, weights=heights, align='left', rwidth=0.8, edgecolor='black') # Use weights for probabilities
# Customize the plot for better slide presentation
ax.set_xlabel("Bin")
ax.set_ylabel("Probability")
ax.set_title(f"Four Bin Histogram (Entropy {S:.3f})")
ax.set_xticks(bins[:-1]) # Show correct x ticks
ax.set_ylim(0,max_height) # Set y limit for visual appeal
We can compute the entropy of any given histogram.
# Define probabilities
p = np.zeros(4)
p[0] = 4/13
p[1] = 3/13
p[2] = 3.7/13
p[3] = 1 - p.sum()
# Safe entropy calculation
nonzero_p = p[p > 0] # Filter out zeros
entropy = - (nonzero_p*np.log2(nonzero_p)).sum()
print(f"The entropy of the histogram is {entropy:.3f}.")
import matplotlib.pyplot as plt
import mlai.plot as plot
import mlai
fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
fig.tight_layout()
plot_histogram(ax, p)
ax.set_title(f"Four Bin Histogram (Entropy {entropy:.3f})")
mlai.write_figure(filename='four-bin-histogram.svg',
directory = './information-game')
Figure: The entropy of a four bin histogram.
We can play the entropy game by starting with a histogram with all the probability mass in the first bin and then ascending the gradient of the entropy function.
The simplest possible example of Jaynes’ World is a two-bin histogram with probabilities p and 1−p. This minimal system allows us to visualize the entire entropy landscape.
The natural parameter is the log odds, $\theta = \log\frac{p}{1-p}$, and the update given by the entropy gradient is
$$
\Delta\theta_{\text{steepest}} = \eta \frac{\text{d}S}{\text{d}\theta} = \eta\, p(1-p)\big(\log(1-p) - \log p\big).
$$
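The gradient follows from the chain rule. Working in nats for brevity,
$$
S(p) = -p\log p - (1-p)\log(1-p), \qquad p = \sigma(\theta) = \frac{1}{1+e^{-\theta}},
$$
and since $\frac{\text{d}S}{\text{d}p} = \log\frac{1-p}{p}$ and $\frac{\text{d}p}{\text{d}\theta} = p(1-p)$, we have
$$
\frac{\text{d}S}{\text{d}\theta} = p(1-p)\big(\log(1-p) - \log p\big).
$$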
import numpy as np
# Python code for gradients
p_values = np.linspace(0.000001, 0.999999, 10000)
theta_values = np.log(p_values/(1-p_values))
entropy = -p_values * np.log(p_values) - (1-p_values) * np.log(1-p_values)
fisher_info = p_values * (1-p_values)
gradient = fisher_info * (np.log(1-p_values) - np.log(p_values))
import matplotlib.pyplot as plt
import mlai.plot as plot
import mlai
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=plot.big_wide_figsize)
ax1.plot(theta_values, entropy)
ax1.set_xlabel('$\\theta$')
ax1.set_ylabel('Entropy $S(p)$')
ax1.set_title('Entropy Landscape')
ax2.plot(theta_values, gradient)
ax2.set_xlabel('$\\theta$')
ax2.set_ylabel('$\\nabla_\\theta S(p)$')
ax2.set_title('Entropy Gradient vs. Position')
mlai.write_figure(filename='two-bin-histogram-entropy-gradients.svg',
directory = './information-game')
Figure: Entropy gradients of the two bin histogram against position.
This example reveals the entropy extrema at $p=0$, $p=0.5$, and $p=1$. At minimal entropy ($p\approx 0$ or $p\approx 1$), the gradient approaches zero, creating natural information reservoirs. The dynamics slow dramatically near these points: this critical slowing is what creates the information reservoirs.
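A small standalone sketch (with an illustrative set of $\theta$ values) makes the slow-down concrete: the gradient magnitude collapses as $\theta$ moves away from zero towards the low-entropy corners.

```python
import numpy as np

def entropy_gradient(theta):
    """dS/dtheta (in bits) for the two-bin histogram with p = sigmoid(theta)."""
    p = 1.0 / (1.0 + np.exp(-theta))
    return p * (1 - p) * (np.log2(1 - p) - np.log2(p))

# The gradient shrinks rapidly at large |theta| (p near 0 or 1),
# which is the critical slowing that forms information reservoirs.
for theta in [-1.0, -3.0, -6.0, -9.0, -12.0]:
    print(f"theta = {theta:6.1f}, |dS/dtheta| = {abs(entropy_gradient(theta)):.2e}")
```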
We can visualize the entropy maximization process by performing gradient ascent in the natural parameter space θ. Starting from a low-entropy state, we follow the gradient of entropy with respect to θ to reach the maximum entropy state.
import numpy as np
# Helper functions for two-bin histogram
def theta_to_p(theta):
"""Convert natural parameter theta to probability p"""
return 1.0 / (1.0 + np.exp(-theta))
def p_to_theta(p):
"""Convert probability p to natural parameter theta"""
# Add small epsilon to avoid numerical issues
p = np.clip(p, 1e-10, 1-1e-10)
return np.log(p/(1-p))
def entropy(theta):
"""Compute entropy for given theta"""
p = theta_to_p(theta)
# Safe entropy calculation
return -p * np.log2(p) - (1-p) * np.log2(1-p)
def entropy_gradient(theta):
"""Compute gradient of entropy with respect to theta"""
p = theta_to_p(theta)
return p * (1-p) * (np.log2(1-p) - np.log2(p))
def plot_histogram(ax, theta, max_height=None):
"""Plot two-bin histogram for given theta"""
p = theta_to_p(theta)
heights = np.array([p, 1-p])
if max_height is None:
max_height = 1.25
# Compute entropy
S = entropy(theta)
# Create the histogram
bins = [1, 2, 3] # Bin edges
if ax is None:
fig, ax = plt.subplots(figsize=(6, 4))
ax.hist(bins[:-1], bins=bins, weights=heights, align='left', rwidth=0.8, edgecolor='black')
# Customize the plot
ax.set_xlabel("Bin")
ax.set_ylabel("Probability")
ax.set_title(f"Two-Bin Histogram (Entropy {S:.3f})")
ax.set_xticks(bins[:-1])
ax.set_ylim(0, max_height)
# Parameters for gradient ascent
theta_initial = -9.0 # Start with low entropy
learning_rate = 1
num_steps = 1500
# Initialize
theta_current = theta_initial
theta_history = [theta_current]
p_history = [theta_to_p(theta_current)]
entropy_history = [entropy(theta_current)]
# Perform gradient ascent in theta space
for step in range(num_steps):
# Compute gradient
grad = entropy_gradient(theta_current)
# Update theta
theta_current = theta_current + learning_rate * grad
# Store history
theta_history.append(theta_current)
p_history.append(theta_to_p(theta_current))
entropy_history.append(entropy(theta_current))
if step % 100 == 0:
print(f"Step {step+1}: θ = {theta_current:.4f}, p = {p_history[-1]:.4f}, Entropy = {entropy_history[-1]:.4f}")
import matplotlib.pyplot as plt
import mlai.plot as plot
import mlai
# Create a figure showing the evolution
fig, axes = plt.subplots(2, 3, figsize=(15, 8))
fig.tight_layout(pad=3.0)
# Select steps to display
steps_to_show = [0, 300, 600, 900, 1200, 1500]
# Plot histograms for selected steps
for i, step in enumerate(steps_to_show):
row, col = i // 3, i % 3
plot_histogram(axes[row, col], theta_history[step])
axes[row, col].set_title(f"Step {step}: θ = {theta_history[step]:.2f}, p = {p_history[step]:.3f}")
mlai.write_figure(filename='two-bin-histogram-evolution.svg',
directory = './information-game')
# Plot entropy evolution
plt.figure(figsize=(10, 6))
plt.plot(range(num_steps+1), entropy_history, 'o-')
plt.xlabel('Gradient Ascent Step')
plt.ylabel('Entropy')
plt.title('Entropy Evolution During Gradient Ascent')
plt.grid(True)
mlai.write_figure(filename='two-bin-entropy-evolution.svg',
directory = './information-game')
# Plot trajectory in theta space
plt.figure(figsize=(10, 6))
theta_range = np.linspace(-5, 5, 1000)
entropy_curve = [entropy(t) for t in theta_range]
plt.plot(theta_range, entropy_curve, 'b-', label='Entropy Landscape')
plt.plot(theta_history, entropy_history, 'ro-', label='Gradient Ascent Path')
plt.xlabel('Natural Parameter θ')
plt.ylabel('Entropy')
plt.title('Gradient Ascent Trajectory in Natural Parameter Space')
plt.axvline(x=0, color='k', linestyle='--', alpha=0.3)
plt.legend()
plt.grid(True)
mlai.write_figure(filename='two-bin-trajectory.svg',
directory = './information-game')
Figure: Evolution of the two-bin histogram during gradient ascent in natural parameter space.
Figure: Entropy evolution during gradient ascent for the two-bin histogram.
Figure: Gradient ascent trajectory in the natural parameter space for the two-bin histogram.
The gradient ascent visualization shows how the system evolves in the natural parameter space $\theta$. Starting from a negative $\theta$ (corresponding to a low-entropy state with $p \ll 0.5$), the system follows the gradient of entropy with respect to $\theta$ until it reaches $\theta = 0$ (corresponding to $p = 0.5$), which is the maximum entropy state.
Note that the maximum entropy occurs at $\theta = 0$, which corresponds to $p = 0.5$. The gradient of entropy with respect to $\theta$ is zero at this point, making it a stable equilibrium for the gradient ascent process.
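A short check of this claim (in nats, differentiating the gradient expression above once more with respect to $\theta$):
$$
\frac{\text{d}^2 S}{\text{d}\theta^2} = p(1-p)\left[(1-2p)\log\frac{1-p}{p} - 1\right],
$$
which evaluates to $-\tfrac{1}{4}$ at $\theta = 0$ (i.e. $p = 0.5$), confirming that this fixed point is a maximum of the entropy rather than a saddle.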
import numpy as np
# Define the entropy function
def entropy(lambdas):
p = lambdas**2/(lambdas**2).sum()
# Safe entropy calculation
nonzero_p = p[p > 0]
nonzero_lambdas = lambdas[p > 0]
return np.log2(np.sum(lambdas**2))-np.sum(nonzero_p * np.log2(nonzero_lambdas**2))
# Define the gradient of the entropy function
def entropy_gradient(lambdas):
denominator = np.sum(lambdas**2)
p = lambdas**2/denominator
# Safe log calculation
log_terms = np.zeros_like(lambdas)
nonzero_idx = lambdas != 0
log_terms[nonzero_idx] = np.log2(np.abs(lambdas[nonzero_idx]))
p_times_lambda_entropy = -2*log_terms/denominator
const = (p*p_times_lambda_entropy).sum()
gradient = 2*lambdas*(p_times_lambda_entropy - const)
return gradient
# Numerical gradient check
def numerical_gradient(func, lambdas, h=1e-5):
numerical_grad = np.zeros_like(lambdas)
for i in range(len(lambdas)):
temp_lambda_plus = lambdas.copy()
temp_lambda_plus[i] += h
temp_lambda_minus = lambdas.copy()
temp_lambda_minus[i] -= h
numerical_grad[i] = (func(temp_lambda_plus) - func(temp_lambda_minus)) / (2 * h)
return numerical_grad
We can then ascend the gradient of the entropy function. Starting at a parameter setting where the mass is placed in the first bin, we take $\lambda_1 = 100$ and $\lambda_2 = \lambda_3 = \lambda_4 = 0.01$.
First to check our code we compare our numerical and analytic gradients.
import numpy as np
# Initial parameters (lambda)
initial_lambdas = np.array([100, 0.01, 0.01, 0.01])
# Gradient check
numerical_grad = numerical_gradient(entropy, initial_lambdas)
analytical_grad = entropy_gradient(initial_lambdas)
print("Numerical Gradient:", numerical_grad)
print("Analytical Gradient:", analytical_grad)
print("Gradient Difference:", np.linalg.norm(numerical_grad - analytical_grad)) # Check if close to zero
Now we can run the steepest ascent algorithm.
import numpy as np
# Steepest ascent algorithm
lambdas = initial_lambdas.copy()
learning_rate = 1
turns = 15000
entropy_values = []
lambdas_history = []
for _ in range(turns):
grad = entropy_gradient(lambdas)
lambdas += learning_rate * grad # update lambda for steepest ascent
entropy_values.append(entropy(lambdas))
lambdas_history.append(lambdas.copy())
We can plot the histogram at a set of chosen turn numbers to see the progress of the algorithm.
import matplotlib.pyplot as plt
import mlai.plot as plot
import mlai
fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
plot_at = [0, 100, 1000, 2500, 5000, 7500, 10000, 12500, turns-1]
for i, turn in enumerate(plot_at):
    ax.clear()  # start from a fresh axis so each saved figure shows a single turn
    plot_histogram(ax, lambdas_history[turn]**2/(lambdas_history[turn]**2).sum(), 1)
# write the figure,
mlai.write_figure(filename=f'four-bin-histogram-turn-{i:02d}.svg',
directory = './information-game')
import notutils as nu
from ipywidgets import IntSlider
nu.display_plots('four-bin-histogram-turn-{sample:0>2}.svg',
                 './information-game',
                 sample=IntSlider(0, 0, len(plot_at)-1, 1))
Figure: Intermediate stages of the four-bin histogram entropy game, showing the histogram after 0, 100, 1000, 2500, 5000, 7500, 10000 and 12500 turns and at the final turn.
And we can also plot the changing entropy as a function of the number of game turns.
fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
ax.plot(range(turns), entropy_values)
ax.set_xlabel("turns")
ax.set_ylabel("entropy")
ax.set_title("Entropy vs. turns (Steepest Ascent)")
mlai.write_figure(filename='four-bin-histogram-entropy-vs-turns.svg',
directory = './information-game')
Figure: Four bin histogram entropy game. The plot shows the increasing entropy against the number of turns across 15000 iterations of gradient ascent.
Note that the entropy starts at a saddle point, increases rapidly, and then levels off towards the maximum entropy, with the gradient decreasing slowly in the manner of Zeno’s paradox.
We partition the Fisher information matrix $G(\theta)$ according to the active variables $X(t)$ and the latent information reservoir $M(t)$:
$$
G(\theta) = \begin{bmatrix} G_{XX} & G_{XM} \\ G_{MX} & G_{MM} \end{bmatrix}.
$$
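As a small sketch of what this partition looks like in code (the split into active and latent index sets here is an arbitrary illustration, not something the game has determined), we can slice the Fisher information of the four-bin system into its four blocks:

```python
import numpy as np

def fisher_information(theta):
    """Fisher information G(theta) = diag(p) - p p^T for p_i proportional to exp(theta_i)."""
    w = np.exp(theta - theta.max())
    p = w / w.sum()
    return np.diag(p) - np.outer(p, p)

# Same illustrative theta as the four-bin initialisation used later in this section
theta = np.array([1.0, -0.5, -0.2, -0.3])
G = fisher_information(theta)

# Illustrative choice of active (X) and latent/reservoir (M) index sets
active, latent = np.array([0, 1]), np.array([2, 3])

G_XX = G[np.ix_(active, active)]
G_XM = G[np.ix_(active, latent)]
G_MX = G[np.ix_(latent, active)]
G_MM = G[np.ix_(latent, latent)]
print(G_XX, G_MM, sep="\n")
```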
The minimal-entropy state compatible with the system’s resolution constraint and regularity condition is represented by a density matrix of the exponential form,
$$
\rho(\theta_o) = \frac{1}{Z(\theta_o)} \exp\left(\sum_i \theta_{o,i} H_i\right).
$$
To illustrate saddle points and information reservoirs, we need at least a 4-bin system. This creates a 3-dimensional parameter space where we can observe genuine saddle points.
Consider a 4-bin system parameterized by natural parameters $\theta_1$, $\theta_2$, and $\theta_3$ (with one constraint). A saddle point occurs where the gradient $\nabla_\theta S = 0$ but the Hessian has mixed eigenvalues: some positive, some negative.
At these points, the eigendecomposition of the Fisher information matrix $G(\theta)$ separates the parameter space into fast directions (large eigenvalues) and slow directions (small eigenvalues).
The eigenvectors of G(θ) at the saddle point determine which parameter combinations form information reservoirs.
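To make this concrete, here is a small sketch that evaluates $G(\theta)$ for the four-bin exponential family at an assumed low-entropy configuration and inspects its spectrum; the particular $\theta$ is illustrative only.

```python
import numpy as np

# Fisher information of the 4-bin exponential family at an assumed
# low-entropy configuration (theta chosen for illustration; it satisfies
# the sum-to-zero constraint used in this section)
theta = np.array([3.0, -1.0, -1.0, -1.0])
p = np.exp(theta - theta.max())
p = p / p.sum()
G = np.diag(p) - np.outer(p, p)

# Eigendecomposition: large eigenvalues mark fast directions, small
# eigenvalues mark slow, near-frozen combinations (candidate reservoirs)
eigvals, eigvecs = np.linalg.eigh(G)
for lam, v in zip(eigvals, eigvecs.T):
    print(f"eigenvalue {lam:.4f}, eigenvector {np.round(v, 3)}")
```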
import numpy as np
# Exponential family entropy functions for 4-bin system
def exponential_family_entropy(theta):
"""
Compute entropy of a 4-bin exponential family distribution
parameterized by natural parameters theta
"""
# Compute the log-partition function (normalization constant)
log_Z = np.log(np.sum(np.exp(theta)))
# Compute probabilities
p = np.exp(theta - log_Z)
# Compute entropy: -sum(p_i * log(p_i))
entropy = -np.sum(p * np.log(p), where=p>0)
return entropy
def entropy_gradient(theta):
"""
Compute the gradient of the entropy with respect to theta
"""
# Compute the log-partition function (normalization constant)
log_Z = np.log(np.sum(np.exp(theta)))
# Compute probabilities
p = np.exp(theta - log_Z)
    # Entropy gradient dS/dtheta = -G(theta) theta, where G = diag(p) - p p^T
    # is the Fisher information (second derivative of the log-partition function)
return -p*theta + p*(np.dot(p, theta))
# Add a gradient check function
def check_gradient(theta, epsilon=1e-6):
"""
Check the analytical gradient against numerical gradient
"""
# Compute analytical gradient
analytical_grad = entropy_gradient(theta)
# Compute numerical gradient
numerical_grad = np.zeros_like(theta)
for i in range(len(theta)):
theta_plus = theta.copy()
theta_plus[i] += epsilon
entropy_plus = exponential_family_entropy(theta_plus)
theta_minus = theta.copy()
theta_minus[i] -= epsilon
entropy_minus = exponential_family_entropy(theta_minus)
numerical_grad[i] = (entropy_plus - entropy_minus) / (2 * epsilon)
# Compare
print("Analytical gradient:", analytical_grad)
print("Numerical gradient:", numerical_grad)
print("Difference:", np.abs(analytical_grad - numerical_grad))
return analytical_grad, numerical_grad
# Project gradient to respect constraints (sum of theta is constant)
def project_gradient(theta, grad):
"""
Project gradient to ensure sum constraint is respected
"""
# Project to space where sum of components is zero
return grad - np.mean(grad)
# Perform gradient ascent on entropy
def gradient_ascent_four_bin(theta_init, steps=100, learning_rate=1):
"""
Perform gradient ascent on entropy for 4-bin system
"""
theta = theta_init.copy()
theta_history = [theta.copy()]
entropy_history = [exponential_family_entropy(theta)]
for _ in range(steps):
# Compute gradient
grad = entropy_gradient(theta)
proj_grad = project_gradient(theta, grad)
# Update parameters
theta += learning_rate * proj_grad
# Store history
theta_history.append(theta.copy())
entropy_history.append(exponential_family_entropy(theta))
return np.array(theta_history), np.array(entropy_history)
# Test the gradient calculation
test_theta = np.array([0.5, -0.3, 0.1, -0.3])
test_theta = test_theta - np.mean(test_theta) # Ensure constraint is satisfied
print("Testing gradient calculation:")
analytical_grad, numerical_grad = check_gradient(test_theta)
# Verify if we're ascending or descending
entropy_before = exponential_family_entropy(test_theta)
step_size = 0.01
test_theta_after = test_theta + step_size * analytical_grad
entropy_after = exponential_family_entropy(test_theta_after)
print(f"Entropy before step: {entropy_before}")
print(f"Entropy after step: {entropy_after}")
print(f"Change in entropy: {entropy_after - entropy_before}")
if entropy_after > entropy_before:
print("We are ascending the entropy gradient")
else:
print("We are descending the entropy gradient")
# Initialize with asymmetric distribution (away from saddle point)
theta_init = np.array([1.0, -0.5, -0.2, -0.3])
theta_init = theta_init - np.mean(theta_init) # Ensure constraint is satisfied
# Run gradient ascent
theta_history, entropy_history = gradient_ascent_four_bin(theta_init, steps=100, learning_rate=1.0)
# Create a grid for visualization
x = np.linspace(-2, 2, 100)
y = np.linspace(-2, 2, 100)
X, Y = np.meshgrid(x, y)
# Compute entropy at each grid point (with constraint on theta3 and theta4)
Z = np.zeros_like(X)
for i in range(X.shape[0]):
for j in range(X.shape[1]):
# Create full theta vector with constraint that sum is zero
theta1, theta2 = X[i,j], Y[i,j]
theta3 = -0.5 * (theta1 + theta2)
theta4 = -0.5 * (theta1 + theta2)
theta = np.array([theta1, theta2, theta3, theta4])
Z[i,j] = exponential_family_entropy(theta)
# Compute gradient field
dX = np.zeros_like(X)
dY = np.zeros_like(Y)
for i in range(X.shape[0]):
for j in range(X.shape[1]):
# Create full theta vector with constraint
theta1, theta2 = X[i,j], Y[i,j]
theta3 = -0.5 * (theta1 + theta2)
theta4 = -0.5 * (theta1 + theta2)
theta = np.array([theta1, theta2, theta3, theta4])
# Get full gradient and project
grad = entropy_gradient(theta)
proj_grad = project_gradient(theta, grad)
# Store first two components
dX[i,j] = proj_grad[0]
dY[i,j] = proj_grad[1]
# Normalize gradient vectors for better visualization
norm = np.sqrt(dX**2 + dY**2)
# Avoid division by zero
norm = np.where(norm < 1e-10, 1e-10, norm)
dX_norm = dX / norm
dY_norm = dY / norm
# A few gradient vectors for visualization
stride = 10
import matplotlib.pyplot as plt
import mlai.plot as plot
import mlai
fig = plt.figure(figsize=plot.big_wide_figsize)
# Create contour lines only (no filled contours)
contours = plt.contour(X, Y, Z, levels=15, colors='black', linewidths=0.8)
plt.clabel(contours, inline=True, fontsize=8, fmt='%.2f')
# Add gradient vectors (normalized for direction, but scaled by magnitude for visibility)
plt.quiver(X[::stride, ::stride], Y[::stride, ::stride],
dX_norm[::stride, ::stride], dY_norm[::stride, ::stride],
color='r', scale=30, width=0.003, scale_units='width')
# Plot the gradient ascent trajectory
plt.plot(theta_history[:, 0], theta_history[:, 1], 'b-', linewidth=2,
label='Gradient Ascent Path')
plt.scatter(theta_history[0, 0], theta_history[0, 1], color='green', s=100,
marker='o', label='Start')
plt.scatter(theta_history[-1, 0], theta_history[-1, 1], color='purple', s=100,
marker='*', label='End')
# Add labels and title
plt.xlabel('$\\theta_1$')
plt.ylabel('$\\theta_2$')
plt.title('Entropy Contours with Gradient Field')
# Mark the saddle point (approximately at origin for this system)
plt.scatter([0], [0], color='yellow', s=100, marker='*',
edgecolor='black', zorder=10, label='Saddle Point')
plt.legend()
mlai.write_figure(filename='simplified-saddle-point-example.svg',
directory = './information-game')
# Plot entropy evolution during gradient ascent
plt.figure(figsize=plot.big_figsize)
plt.plot(entropy_history)
plt.xlabel('Gradient Ascent Step')
plt.ylabel('Entropy')
plt.title('Entropy Evolution During Gradient Ascent')
plt.grid(True)
mlai.write_figure(filename='four-bin-entropy-evolution.svg',
directory = './information-game')
Figure: Visualisation of a saddle point projected down to two dimensions.
Figure: Entropy evolution during gradient ascent on the four-bin system.
An animation of the system’s evolution would show initial rapid movement along high-eigenvalue directions, progressive slowing in directions with low eigenvalues, and the formation of information reservoirs in the critically slowed directions. Parameter-capacity uncertainty emerges naturally at the saddle point.
$\tau(t)$ increases monotonically, preventing time reversal globally.
At points where the latent-to-active flow functional is locally extremal, the system may exhibit critical slowing, where information reservoir variables are slow relative to active variables. It may then be possible to separate the system entropy into an active-variable part, $I = S[\rho_X]$, and an “intrinsic information”, $J = S[\rho_{X|M}]$, allowing us to create an information analogue of B. Roy Frieden’s extreme physical information (Frieden (1998)), which allows derivation of locally valid differential equations that depend on the information topography.
We will determine constraints on the Fisher Information Matrix G(θ) that are consistent with the system’s unfolding rules and internal information geometry. We follow Jaynes (Jaynes, 1957) in solving a variational problem that captures the allowed structure of the system’s origin (minimal entropy) state.
Hirschman Jr (1957) established a connection between entropy and the Fourier transform, showing that the entropy of a function and its Fourier transform cannot both be arbitrarily small. This result, known as the Hirschman uncertainty principle, was later strengthened by Beckner (Beckner, 1975) who derived the optimal constant in the inequality. Białynicki-Birula and Mycielski (1975) extended these ideas to derive uncertainty relations for information entropy in wave mechanics.
From these results we know that there are fundamental limits to how we express the entropy of position and its conjugate space simultaneously. These limits inspire us to focus on the von Neumann entropy so that our system respects the Hirschman uncertainty principle.
A density matrix has the form
$$
\rho(\theta) = \frac{1}{Z(\theta)} \exp\left(\sum_i \theta_i H_i\right).
$$
The von Neumann entropy is given by
$$
S[\rho] = -\mathrm{tr}(\rho \log \rho).
$$
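As a concrete numerical illustration (a minimal sketch; the single-qubit Pauli generators and $\theta$ values are arbitrary choices for demonstration only), we can build $\rho(\theta)$ from Hermitian generators and compute its von Neumann entropy from the eigenvalues:

```python
import numpy as np

# Pauli matrices as example Hermitian generators H_i (illustrative choice)
H = [np.array([[0, 1], [1, 0]], dtype=complex),   # sigma_x
     np.array([[0, -1j], [1j, 0]]),               # sigma_y
     np.array([[1, 0], [0, -1]], dtype=complex)]  # sigma_z

def density_matrix(theta):
    """rho(theta) = exp(sum_i theta_i H_i) / Z(theta), via eigendecomposition."""
    A = sum(t * h for t, h in zip(theta, H))
    evals, evecs = np.linalg.eigh(A)
    unnorm = (evecs * np.exp(evals)) @ evecs.conj().T
    return unnorm / np.trace(unnorm).real

def von_neumann_entropy(rho):
    """S[rho] = -tr(rho log rho), computed from the eigenvalues of rho."""
    evals = np.linalg.eigvalsh(rho)
    evals = evals[evals > 1e-12]
    return float(-np.sum(evals * np.log(evals)))

theta = np.array([0.3, 0.0, -0.8])   # illustrative natural parameters
rho = density_matrix(theta)
print(von_neumann_entropy(rho))      # between 0 (pure) and log 2 (maximally mixed)
```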
We now derive the minimal entropy configuration inspired by Jaynes’s free-form variational approach. This enables us to derive the form of the density matrix directly from information-theoretic constraints (Jaynes, 1963).
[edit]
Jaynes suggested that statistical mechanics problems should be treated as problems of inference. Assign the probability distribution (or density matrix) that is maximally noncommittal with respect to missing information, subject to known constraints.
While Jaynes applied this idea to derive the maximum entropy configuration given constraints, here we adapt it to derive the minimum entropy configuration, under an assumption of zero initial entropy bounded by a maximum entropy of N bits.
Let $\rho$ be a density matrix describing the state of a system. The von Neumann entropy is
$$
S[\rho] = -\mathrm{tr}(\rho \log \rho).
$$
In the game we assume that the system begins in a state of minimal entropy, that the state cannot be a delta function (no singularities, so it must obey a resolution constraint $\varepsilon$), and that the entropy is bounded above by $N$ bits: $S[\rho] \leq N$.
We apply a variational principle in which we minimise
$$
S[\rho] = -\mathrm{tr}(\rho \log \rho)
$$
subject to the following constraints. The first constraint is normalization, $\mathrm{tr}(\rho) = 1$. The resolution constraint is motivated by the entropy bound, $S[\rho] \leq N$: the state cannot be arbitrarily sharp, which we impose as lower bounds on the variances of the conjugate observables $\hat{Z}$ and $\hat{P}$.
We introduce Lagrange multipliers $\lambda_0$, $\lambda_z$, $\lambda_p$ for these constraints and define the Lagrangian
$$
\mathcal{L}[\rho] = -\mathrm{tr}(\rho \log \rho) + \lambda_0\big(\mathrm{tr}(\rho) - 1\big) - \lambda_z\,\mathrm{tr}(\rho \hat{Z}^2) - \lambda_p\,\mathrm{tr}(\rho \hat{P}^2).
$$
Taking the functional derivative with respect to $\rho$ and setting it to zero,
$$
\frac{\delta \mathcal{L}}{\delta \rho} = -\log \rho - 1 - \lambda_z \hat{Z}^2 - \lambda_p \hat{P}^2 + \lambda_0 = 0,
$$
and solving for $\rho$ gives
$$
\rho = \frac{1}{Z}\exp\left(-\lambda_z \hat{Z}^2 - \lambda_p \hat{P}^2\right),
$$
where the normalization $Z$ absorbs the constant $\exp(\lambda_0 - 1)$.
The Lagrange multipliers $\lambda_z, \lambda_p$ enforce lower bounds on variance. These define the natural parameters as $\theta_z = -\lambda_z$ and $\theta_p = -\lambda_p$ in the exponential family form $\rho(\theta) \propto \exp(\theta \cdot H)$. The form of $\rho$ is a density matrix. The curvature (second derivative) of $\log Z(\theta)$ gives the Fisher information matrix $G(\theta)$. Steepest ascent trajectories in $\theta$ space will trace the system’s entropy dynamics.
Next we compute $G(\theta)$ from $\log Z(\theta)$ to explore the information geometry. From this we should verify that the following conditions hold,
$$
\left|\,[G(\theta)\theta]_i\,\right| < \varepsilon \quad \text{for all } i.
$$
The Hermitian generators must include at least one non-commuting observable pair, $[H_i, H_j] \neq 0$.
We can then use ε and N to define initial thresholds and maximum resolution and examine how variables decouple and how saddle-like regions emerge as the landscape unfolds through gradient ascent.
This constrained minimization problem yields the structure of the initial density matrix $\rho(\theta_o)$, the permissible curvature geometry $G(\theta_o)$, and a constraint-consistent basis of observables $\{H_i\}$ that have a quadratic form. This ensures the system begins in a regular, latent, low-entropy state.
This is the configuration from which entropy ascent and symmetry-breaking transitions emerge.
Barato, A.C., Seifert, U., 2014. Stochastic thermodynamics with information reservoirs. Physical Review E 90, 042150. https://doi.org/10.1103/PhysRevE.90.042150
Beckner, W., 1975. Inequalities in Fourier analysis. Annals of Mathematics 102, 159–182. https://doi.org/10.2307/1970980
Białynicki-Birula, I., Mycielski, J., 1975. Uncertainty relations for information entropy in wave mechanics. Communications in Mathematical Physics 44, 129–132. https://doi.org/10.1007/BF01608825
Frieden, B.R., 1998. Physics from Fisher information: A unification. Cambridge University Press, Cambridge, UK. https://doi.org/10.1017/CBO9780511622670
Hirschman Jr, I.I., 1957. A note on entropy. American Journal of Mathematics 79, 152–156. https://doi.org/10.2307/2372390
Jaynes, E.T., 1963. Information theory and statistical mechanics, in: Ford, K.W. (Ed.), Brandeis University Summer Institute Lectures in Theoretical Physics, Vol. 3: Statistical Physics. W. A. Benjamin, Inc., New York, pp. 181–218.
Jaynes, E.T., 1957. Information theory and statistical mechanics. Physical Review 106, 620–630. https://doi.org/10.1103/PhysRev.106.620
Parrondo, J.M.R., Horowitz, J.M., Sagawa, T., 2015. Thermodynamics of information. Nature Physics 11, 131–139. https://doi.org/10.1038/nphys3230