Memory: Ergodic Theory and Agent Based Modeling

Memory is simultaneously one of the most empowering and crippling qualities that humanity possesses. Our memories enable us to learn, adapt, and celebrate old traditions. Without a strong memory, we would quickly forget the lessons we have learned and never be able to advance. However, too strong a memory can act as an inertial force, holding us back from change. We get stuck in bad habits and fall victim to past trauma. How many world religions preach forgiveness in some way or another? How many times have we heard that the secret to success and happiness is staying present? There is a fine line between a healthy respect for the past and becoming overly traditional.

I aim to answer two questions: how much memory is optimal for growth, and how far back into a system's history must we go before the past no longer has significant influence on the present? In neuroscience and machine learning, these questions manifest in the stability-plasticity dilemma. I approach them with the machinery of chaotic dynamical systems, specifically ergodic theory. We must first be able to quantify a system's memory.

Ergodic Theory

Ergodic theory provides a way to describe the average behavior of a dynamical system over time, without worrying about specific details of its evolution. Before getting into the math, we should consider conceptually what "memory" even means for a system. If a person (or an entire population) remembers something from their past, then that thing has the ability to influence their present; it still matters to some degree. On the other hand, if something has been completely forgotten, that is, if no trace of it survives, then it no longer has any influence on the present state of reality.

Appendix: Ergodic Theory

Mixing

Ergodic theory formalizes this idea. A measure-preserving dynamical system $(X, \mathcal{B}, μ, T)$ is said to be strongly mixing if for all $A, B \in \mathcal{B}$ we have

$$\lim_{n \to \infty} μ(T^{-n} A \cap B) = μ(A) \, μ(B).$$

The set $T^{-n} A \cap B$ consists of the points that are in $B$ now and will be in $A$ after $n$ iterations of $T$. This equality therefore states that, as time progresses, the event of starting in $B$ and the event of landing in $A$ $n$ steps later become statistically independent: any dependence between $A$ and $B$ which may have existed in the system's history is lost. Thus, information about the initial conditions of the system is forgotten. Since this holds for all sets $A$ and $B$, we can think of $T$ as "mixing" the sets around the phase space $X$, resulting in a loss of the initial structure.
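To make the mixing condition concrete, here is a minimal numerical sketch. It assumes the doubling map $T(x) = 2x \bmod 1$ on $[0, 1)$ with Lebesgue measure, a standard example of a strongly mixing system. The map, the intervals $A = [0, 0.3)$ and $B = [0.2, 0.7)$, the sample size, and the seed are illustrative choices of mine, not part of the discussion above. The code estimates $μ(T^{-n} A \cap B)$ by counting random points that start in $B$ and land in $A$ after $n$ steps.

```python
import random


def doubling_map(x):
    """One step of the doubling map T(x) = 2x mod 1 on [0, 1)."""
    return (2.0 * x) % 1.0


def in_a(x):
    """Indicator of the illustrative set A = [0, 0.3)."""
    return 0.0 <= x < 0.3


def in_b(x):
    """Indicator of the illustrative set B = [0.2, 0.7)."""
    return 0.2 <= x < 0.7


def correlation_measure(n, samples=100_000, seed=0):
    """Monte Carlo estimate of mu(T^{-n} A intersect B) for Lebesgue measure mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        x = rng.random()
        if not in_b(x):
            continue  # the point must start in B
        y = x
        for _ in range(n):
            y = doubling_map(y)
        if in_a(y):  # and must be in A after n steps
            hits += 1
    return hits / samples


product = 0.3 * 0.5  # mu(A) * mu(B)
estimates = {n: correlation_measure(n) for n in (0, 1, 2, 4, 8, 12)}
for n, estimate in estimates.items():
    print(f"n={n:2d}  estimate={estimate:.4f}  mu(A)mu(B)={product:.4f}")
```

At $n = 0$ the estimate is close to $μ(A \cap B) = 0.1$, and as $n$ grows it should settle near $μ(A) μ(B) = 0.15$, up to sampling noise. I keep $n$ small on purpose: each doubling step discards one bit of a floating-point number, so very long orbits of this map are not reliable in double precision.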

Ergodicity and the Birkhoff Ergodic Theorem

Ergodic theory relies on a more general definition than mixing: ergodicity. A measure-preserving dynamical system $(X, \mathcal{B}, μ, T)$ is ergodic if:

$$\text{If } A \in \mathcal{B} \text{ satisfies } T^{-1} A = A, \text{ then } μ(A) = 0 \text{ or } μ(A) = 1.$$

In other words, the phase space of the system cannot be decomposed, up to sets of measure zero, into smaller subsets which are invariant under the system; that is, there is no region of positive but not full measure which is only ever mapped into itself. This property is similar to mixing, but weaker: every strongly mixing system is ergodic, while the converse fails.

The Birkhoff Ergodic Theorem is an important result in ergodic theory. It also underpins a key assumption which statistical mechanics relies on: that time averages can be replaced by ensemble averages. We consider some observable $f$ of the system which is a $μ$-integrable function. Firstly, the Birkhoff Ergodic Theorem guarantees that the time average of $f$ over a measure-preserving dynamical system exists. That is,

$$\lim_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n-1} f(T^{k}(x))$$

exists for $μ$-almost every $x \in X$.

Secondly, if the system is ergodic, then the ensemble average of $f$ over the system is equal to its time average over the system. That is,

$$\lim_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n-1} f(T^{k}(x)) = \int_{X} f \, dμ,$$

for $μ$-almost every $x \in X$.
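The equality of time and ensemble averages can be checked numerically. The sketch below assumes an irrational rotation $T(x) = x + α \bmod 1$ on $[0, 1)$ with Lebesgue measure, which is ergodic but not mixing. The angle $α$, the observable $f(x) = \cos^2(2πx)$, and the starting points are illustrative choices of mine. The ensemble average of this observable is $\int_0^1 \cos^2(2πx) \, dx = 1/2$.

```python
import math

# Rotation angle: the golden ratio conjugate, an irrational number (illustrative choice).
ALPHA = (math.sqrt(5) - 1) / 2


def observable(x):
    """Observable f(x) = cos^2(2 pi x); its integral over [0, 1) is 1/2."""
    return math.cos(2 * math.pi * x) ** 2


def time_average(f, x0, n):
    """Average of f along the first n points of the orbit of x0 under the rotation."""
    total = 0.0
    for k in range(n):
        # T^k(x0) = x0 + k * ALPHA mod 1, computed directly to limit rounding drift.
        total += f((x0 + k * ALPHA) % 1.0)
    return total / n


ensemble_average = 0.5
for x0 in (0.0, 0.123, 0.9):
    average = time_average(observable, x0, 100_000)
    print(f"x0={x0:.3f}  time average={average:.6f}  ensemble average={ensemble_average}")
```

Every starting point gives a time average close to $1/2$, as the theorem predicts. The same system would fail the mixing test: a rotation moves sets rigidly and never spreads them out, which is one way to see that ergodicity is the weaker property.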

Kolmogorov-Sinai Entropy

We can now begin to quantify a dynamical system's memory. The KS (Kolmogorov-Sinai) Entropy of a measure-preserving dynamical system $(X, \mathcal{B}, μ, T)$ is defined as follows:

For a finite measurable partition $P = \{P_{1}, P_{2}, \dots, P_{k}\} \subset \mathcal{B}$, define the entropy of $P$ with respect to $μ$ by

$$H_{μ}(P) = -\sum_{i=1}^{k} μ(P_{i}) \log μ(P_{i}).$$

This is the Shannon-information entropy of the partition $P$ with respect to the measure $μ$ .

The entropy of the partition under $T$ is defined as:

$$h_{μ}(T, P) = \lim_{n \to \infty} \frac{1}{n} H_{μ}\left(\bigvee_{i=0}^{n-1} T^{-i} P\right),$$

where $\bigvee_{i=0}^{n-1} T^{-i} P$ is the common refinement of the partitions $P, T^{-1} P, \dots, T^{-(n-1)} P$: two points lie in the same cell exactly when their orbits visit the same cells of $P$ during the first $n$ steps. Conceptually, the cells of this refinement capture a similar idea to the concept of microstates in the thermodynamic definition of entropy. Following this rough analogy, the original partition splits the phase space into equivalence classes, which are analogous to macrostates.

The Kolmogorov-Sinai entropy of the system is defined

$$h_{μ}(T) = \sup_{P} h_{μ}(T, P),$$

where the supremum is taken over all finite measurable partitions $P$ of $X$ .
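The definition can be turned into a rough numerical estimate. The sketch below again assumes the doubling map $T(x) = 2x \bmod 1$ with Lebesgue measure and the two-cell partition $P = \{[0, 1/2), [1/2, 1)\}$; these choices, along with the sample size and seed, are mine and serve only as an illustration. Each cell of the refined partition corresponds to one itinerary, meaning the sequence of cells of $P$ visited during the first $n$ steps, so the entropy of the refinement can be estimated from itinerary frequencies.

```python
import math
import random
from collections import Counter


def itinerary(x, n):
    """Labels (0 or 1) of the cells of P visited by x, T(x), ..., T^{n-1}(x)."""
    word = []
    for _ in range(n):
        word.append(0 if x < 0.5 else 1)
        x = (2.0 * x) % 1.0  # doubling map
    return tuple(word)


def refined_partition_entropy(n, samples=50_000, seed=0):
    """Plug-in estimate, in nats, of the entropy of the n-step refinement of P."""
    rng = random.Random(seed)
    counts = Counter(itinerary(rng.random(), n) for _ in range(samples))
    return -sum((c / samples) * math.log(c / samples) for c in counts.values())


# Entropy per step, H(refinement) / n, for a few values of n.
rates = {n: refined_partition_entropy(n) / n for n in (1, 2, 4, 8)}
for n, rate in rates.items():
    print(f"n={n}  H/n={rate:.4f}  log 2={math.log(2):.4f}")
```

For this map the per-step entropy stays near $\log 2 \approx 0.693$ nats for every $n$, because each step reveals one new, unbiased binary digit of the initial condition. This is only an estimate of $h_{μ}(T, P)$ for one partition. In general the supremum over partitions cannot be computed by brute force, and the plug-in estimator becomes biased once $2^n$ is comparable to the number of samples.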

The KS entropy of a dynamical system describes the rate at which the system generates Shannon information over time. If a dynamical system has positive KS entropy, it is chaotic and thus has a short memory. One way to understand this is that the present state of the system depends on newly generated information, that is, information which could not be read off from a finite-precision description of the system's past states. Thus the present depends more weakly on the past than it would in a zero-entropy system, such as a periodic system or a rotation, which generates no new information over time.

Lyapunov Exponents and Pesin's Entropy Formula

While KS entropy contains information about the global complexity of a dynamical system, we can study the system locally with Lyapunov Exponents. These describe how quickly nearby trajectories of the dynamical system diverge or converge.

Here, we assume $(X, \mathcal{B}, μ, T)$ to be a smooth, measure-preserving dynamical system. Starting with some $x \in X$ as an initial condition, we consider a small perturbation, and let $δ_{0}$ be the vector of initial separation between two trajectories. The separation between these trajectories often grows like

$$\lVert δ(t) \rVert \approx e^{λ t} \lVert δ_{0} \rVert.$$

We call $λ$ a Lyapunov Exponent.

More precisely, a Lyapunov Exponent measures the average exponential rate of divergence or convergence of nearby trajectories along a given direction. For a point $x$ and a direction $v$, it is defined as

$$λ(x, v) = \lim_{n \to \infty} \frac{1}{n} \log \lVert D_{x} T^{n}(v) \rVert,$$

where $D_{x} T^{n}$ is the Jacobian (derivative) of the dynamical system after $n$ iterations, and $v$ is a vector in the tangent space at $x \in X$. Notice that if $v$ is an eigenvector of $D_{x} T^{n}$, then $D_{x} T^{n}$ simply scales $v$ by the corresponding eigenvalue, so the norm $\lVert D_{x} T^{n}(v) \rVert$ is $\lVert v \rVert$ multiplied by the absolute value of that eigenvalue. Thus, the eigenvalues of the Jacobian can give us insight into the Lyapunov Exponents.
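In one dimension the definition becomes easy to compute: by the chain rule, the derivative of $T^{n}$ at $x$ is the product of the derivatives of $T$ along the orbit, so the exponent is the orbit average of $\log |T'|$. The sketch below assumes the logistic map $T(x) = 4x(1 - x)$ on $[0, 1]$, whose Lyapunov Exponent is known to be $\log 2$ for almost every starting point. The map, the starting point, and the orbit length are illustrative choices of mine.

```python
import math


def logistic(x):
    """Logistic map T(x) = 4x(1 - x) on [0, 1]."""
    return 4.0 * x * (1.0 - x)


def logistic_derivative(x):
    """Derivative T'(x) = 4 - 8x."""
    return 4.0 - 8.0 * x


def lyapunov_exponent(x0, n, burn_in=100):
    """Orbit average of log|T'(x_k)|, which equals (1/n) log|D_x T^n| by the chain rule."""
    x = x0
    for _ in range(burn_in):  # discard a short transient
        x = logistic(x)
    total = 0.0
    for _ in range(n):
        # Guard against log(0) in the unlikely event that the orbit hits x = 1/2 exactly.
        total += math.log(max(abs(logistic_derivative(x)), 1e-300))
        x = logistic(x)
    return total / n


estimate = lyapunov_exponent(0.2, 100_000)
print(f"estimated exponent={estimate:.4f}  log 2={math.log(2):.4f}")
```

The positive value means that nearby initial conditions separate by roughly a factor of two per step. In higher dimensions the same idea needs more care, since products of Jacobian matrices must be re-orthogonalized along the way, and this sketch does not attempt that.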

Pesin's Entropy Formula bridges our study of the system's local chaotic behavior with its global complexity. Under suitable smoothness assumptions on the system and its invariant measure, this formula expresses KS entropy in terms of the spectrum of positive Lyapunov Exponents, stating that

$$h_{μ}(T) = \int_{X} \sum_{λ_{i}(x) > 0} λ_{i}(x) \, dμ(x).$$

Here, the directional dependence of Lyapunov exponents is implicitly assumed to be given by the eigendirections of the Jacobian matrix $D_{x} T^{n}$ .

Taking a step back, Lyapunov Exponents can help us answer one of our previous questions: How far back into a system's history must we go until the past no longer has significant influence on the present?
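One common heuristic for this question follows from the growth law $\lVert δ(t) \rVert \approx e^{λ t} \lVert δ_{0} \rVert$ given earlier. If the state was known only up to an error $\lVert δ_{0} \rVert$, that error reaches a tolerance $Δ$ after roughly $t \approx \frac{1}{λ} \log(Δ / \lVert δ_{0} \rVert)$ steps, and beyond that horizon the earlier state tells us little about the present one. The sketch below is a rough estimate under the assumption of pure exponential growth at a constant rate. The function name `memory_horizon` and the example numbers are hypothetical.

```python
import math


def memory_horizon(lyapunov, initial_error, tolerance):
    """Steps until an error of size initial_error grows to tolerance.

    Assumes pure exponential growth exp(lyapunov * t). With a non-positive
    exponent the error never reaches the tolerance, so the horizon is infinite.
    """
    if lyapunov <= 0:
        return math.inf
    return math.log(tolerance / initial_error) / lyapunov


# Example: exponent log 2 (errors double every step), initial uncertainty 1e-6.
horizon = memory_horizon(math.log(2), 1e-6, 1.0)
print(f"memory horizon is about {horizon:.1f} steps")
```

The horizon grows only logarithmically with the precision of the initial measurement, so measuring a thousand times more precisely buys about ten extra steps when the exponent is $\log 2$. In this sense a larger Lyapunov Exponent corresponds to a shorter memory.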

Measure-Theoretic Conjugacy of Dynamical Systems

It is possible that apparently different systems follow the same underlying dynamics. If this is the case, we say that these systems are conjugate. Formally, two measure-preserving dynamical systems $(X, \mathcal{B}_{X}, μ, T)$ and $(Y, \mathcal{B}_{Y}, ν, S)$ are conjugate if there exists a measurable bijection $ϕ : X \to Y$ with measurable inverse such that

$ϕ$ preserves the measure: $ν = ϕ_{*} μ$, where $ϕ_{*} μ(A) := μ(ϕ^{-1}(A))$ for all $A \in \mathcal{B}_{Y}$, and
$ϕ$ preserves the dynamics: $ϕ(T(x)) = S(ϕ(x))$ for $μ$-almost every $x \in X$.

Measure-theoretically conjugate systems are equivalent to each other from a measure-theoretic perspective, meaning that their long-term statistical behavior is identical. This notion plays a role similar to that of conjugacy and, more generally, of isomorphism in algebra; we are trying to ignore the way that we "label" the system and only focus on its fundamental structure. Thus, it is a natural result that a system's KS entropy is invariant under conjugacy.
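A classical pair of conjugate systems makes this concrete. The sketch below assumes the tent map $T(y) = 1 - |2y - 1|$ and the logistic map $S(x) = 4x(1 - x)$ on $[0, 1]$, related by $ϕ(y) = \sin^2(πy / 2)$. This example is my own choice of illustration. The code checks the dynamics condition $ϕ \circ T = S \circ ϕ$ on a grid of points.

```python
import math


def tent(y):
    """Tent map T(y) = 2y for y < 1/2, and 2 - 2y otherwise."""
    return 2.0 * y if y < 0.5 else 2.0 - 2.0 * y


def logistic(x):
    """Logistic map S(x) = 4x(1 - x)."""
    return 4.0 * x * (1.0 - x)


def phi(y):
    """Candidate conjugacy phi(y) = sin^2(pi y / 2), a bijection of [0, 1]."""
    return math.sin(math.pi * y / 2.0) ** 2


# Largest violation of phi(T(y)) = S(phi(y)) over a grid of test points.
grid = [k / 1000 for k in range(1001)]
max_gap = max(abs(phi(tent(y)) - logistic(phi(y))) for y in grid)
print(f"largest gap between phi(T(y)) and S(phi(y)): {max_gap:.2e}")
```

The gap is at the level of floating-point rounding, which is what the identity $4 \sin^2(θ) \cos^2(θ) = \sin^2(2θ)$ predicts. The measure condition holds as well: $ϕ$ carries Lebesgue measure, which the tent map preserves, to the measure with density $1 / (π \sqrt{x(1 - x)})$, which the logistic map preserves. The two systems therefore share the same KS entropy even though their formulas look different.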

Bernoulli Shifts

A Bernoulli shift is a stochastic process that can be used to model the chaotic part of a dynamical system. This is used in ergodic theory as well as in a related field called symbolic dynamics, which models dynamical systems as symbolic sequences.

Given an alphabet of finite size $A = \{1, \dots, n\}$, we call bi-infinite sequences $x : \mathbb{Z} \to A$, denoted $(x_{i})_{i \in \mathbb{Z}}$, words. We let $Ω_{n} := A^{\mathbb{Z}}$ be the set of all words. The Bernoulli product measure, $μ^{\mathbb{Z}}$, is defined

$$μ^{\mathbb{Z}}(x_{i_{1}} = a_{i_{1}}, \dots, x_{i_{k}} = a_{i_{k}}) = \prod_{j=1}^{k} μ(a_{i_{j}}),$$

with $μ(a)$ being the probability assigned to symbol $a \in A$. The random variables $\{π_{i}\}_{i \in \mathbb{Z}}$, where $π_{i}(x) = x_{i}$, are independent and identically distributed (i.i.d.) with distribution $μ$.

We then define the shift map $σ : Ω_{n} \to Ω_{n}$ by $(σ x)_{i} = x_{i+1}$. This simply shifts every symbol in the sequence one index to the left.

The Bernoulli shift $(Ω_{n}, μ^{\mathbb{Z}}, σ)$ is a measure-preserving dynamical system.

We can consider the KS entropy of a Bernoulli shift. Since the KS entropy of a system describes how much Shannon information is generated by the system over time, and because the symbol at each position of the sequence is independent of all the others, the KS entropy of the shift is simply the Shannon entropy of the distribution $μ$. That is, writing $p_{i} = μ(i)$,

$$h_{μ^{\mathbb{Z}}}(σ) = -\sum_{i=1}^{n} p_{i} \log p_{i}.$$
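The entropy formula is easy to evaluate, and it can be compared against a simulated sequence. The sketch below assumes a three-symbol alphabet with probabilities $(1/2, 1/4, 1/4)$ and works with a finite window of the bi-infinite sequence. The distribution, the window length, the seed, and the helper names are illustrative assumptions of mine.

```python
import math
import random
from collections import Counter


def shannon_entropy(p):
    """Shannon entropy, in nats, of a probability vector p."""
    return -sum(q * math.log(q) for q in p if q > 0)


def sample_word(p, length, seed=0):
    """Finite window of an i.i.d. sequence over the symbols 1..n with distribution p."""
    rng = random.Random(seed)
    return rng.choices(range(1, len(p) + 1), weights=p, k=length)


def shift(word):
    """Left shift on a finite window: (sigma x)_i = x_{i+1}; the first symbol drops out."""
    return word[1:]


p = [0.5, 0.25, 0.25]
word = sample_word(p, 200_000)
counts = Counter(word)
empirical = [counts[a] / len(word) for a in range(1, len(p) + 1)]

print(f"entropy from the formula:     {shannon_entropy(p):.4f}")
print(f"entropy from the sample word: {shannon_entropy(empirical):.4f}")
print(f"first symbols before and after one shift: {word[:5]} -> {shift(word)[:5]}")
```

Both numbers are close to $\frac{3}{2} \log 2 \approx 1.04$ nats. Single-symbol frequencies are enough here only because the symbols are independent; for a process with correlations, one would have to estimate the entropy of longer and longer blocks instead.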

Bernoulli shifts satisfy the Markov property, which says that the future state depends directly only on the present state, not on the path which led to that state. This is why Markov processes are called memoryless. In a general Markov process, this does not mean that the initial conditions are completely forgotten, because the probability of reaching the present state in the first place was determined by the past; only the next step is free of direct dependence on that past. In a Bernoulli shift the symbols are fully independent, so not even this indirect dependence remains. Therefore, these processes forget their history quickly.

The Sinai Factor Theorem and the Ornstein Isomorphism Theorem

These next two theorems tell us that Bernoulli shifts play a role in describing any dynamical system with positive KS entropy, and allow us to characterize these systems.

The Sinai Factor Theorem states that if a measure-preserving ergodic dynamical system has KS entropy $h > 0$, then every Bernoulli shift with KS entropy less than or equal to $h$ is a factor of the system. Informally, the system can be viewed as a Bernoulli shift of equal entropy together with a trivial component which generates no new information. Formally, if $(X, \mathcal{B}, μ, T)$ is a measure-preserving ergodic dynamical system with KS entropy $h > 0$, and $(Ω_{n}, μ^{\mathbb{Z}}, σ)$ is a Bernoulli shift with KS entropy at most $h$, then there exists a measurable, measure-preserving factor map

$$π : X \to Ω_{n},$$

such that $π \circ T = σ \circ π$, which projects the chaotic part of the system onto a Bernoulli shift. The trivial part of the system is whatever $π$ discards, that is, the structure within the fibers $π^{-1}(\{ω\})$ for $ω \in Ω_{n}$.

In the context of a physical system, any symmetries of the system are contained in the trivial part. For example, we know that time-translational symmetry leads to conservation of energy. Since energy is constant, no new information is generated here as the system evolves over time. Thus, this part of this system would not be represented in the Bernoulli shift factor.

This means that in order to study the memory of some dynamical system, we can simply study the memory of a Bernoulli shift of equal KS entropy. This may allow us to employ tools of symbolic dynamics.

Lastly, the Ornstein Isomorphism Theorem states that any two Bernoulli shifts with the same KS entropy are isomorphic.
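Since KS entropy is also invariant under isomorphism, comparing two Bernoulli shifts reduces to comparing two numbers. The sketch below uses a classical pair of distributions on alphabets of different sizes, $(1/4, 1/4, 1/4, 1/4)$ and $(1/2, 1/8, 1/8, 1/8, 1/8)$, which have the same entropy $\log 4$. The function and variable names are illustrative.

```python
import math


def bernoulli_entropy(p):
    """KS entropy, in nats, of the Bernoulli shift with symbol probabilities p."""
    return -sum(q * math.log(q) for q in p if q > 0)


def same_entropy(p, q):
    """True when two Bernoulli shifts have equal KS entropy, up to rounding."""
    return math.isclose(bernoulli_entropy(p), bernoulli_entropy(q), rel_tol=1e-12)


four_symbols = [1 / 4, 1 / 4, 1 / 4, 1 / 4]
five_symbols = [1 / 2, 1 / 8, 1 / 8, 1 / 8, 1 / 8]
fair_coin = [1 / 2, 1 / 2]

print(same_entropy(four_symbols, five_symbols))  # equal entropy: isomorphic by the theorem
print(same_entropy(four_symbols, fair_coin))     # different entropy: not isomorphic
```

The first pair is isomorphic even though the alphabets have different sizes, while the fair coin shift, with entropy $\log 2$, is not isomorphic to either of them. The code only compares entropies; it does not construct the isomorphism, which is the difficult part of the theorem.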

Agent Based Modeling

We define agent based models as dynamical systems and compute the KS entropy of these simulated systems in order to draw conclusions about the effect of strong and weak memory on system behavior.