Markov chains are one of the most useful classes of stochastic processes, being simple, flexible, and supported by many elegant theoretical results
You will find them in many of the workhorse models of economics and finance
In this lecture we review some of the theory of Markov chains
We will also introduce some of the high quality routines for working with Markov chains available in QuantEcon.jl
Prerequisite knowledge is basic probability and linear algebra
using InstantiateFromURL
activate_github("QuantEcon/QuantEconLecturePackages", tag = "v0.9.0") # activate the QuantEcon environment
using LinearAlgebra, Statistics, Compat # load common packages
using Distributions, Plots, Printf, QuantEcon, Random
gr(fmt = :png)
Plots.GRBackend()
A stochastic matrix (or Markov matrix) is an n×n square matrix P such that each element of P is nonnegative and each row of P sums to one
Each row of P can be regarded as a probability mass function over n possible outcomes
It is not too difficult to check [1] that if P is a stochastic matrix, then so is the k-th power P^k for all k∈N
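As a quick numerical check (not a proof — see the footnote for a proof sketch), we can verify for an arbitrary example matrix that the rows of P^k still sum to one:

P_check = [0.4 0.6; 0.2 0.8]        # an arbitrary 2×2 stochastic matrix, chosen for illustration
for k in 1:5
    @show sum(P_check^k, dims = 2)  # each row of P^k sums to (numerically) one
end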
There is a close connection between stochastic matrices and Markov chains
To begin, let S be a finite set with n elements {x1,…,xn}
The set S is called the state space and x1,…,xn are the state values
A Markov chain {Xt} on S is a sequence of random variables on S that have the Markov property
This means that, for any date t and any state y∈S,
$$P\{X_{t+1} = y \,|\, X_t\} = P\{X_{t+1} = y \,|\, X_t, X_{t-1}, \ldots\} \tag{1}$$
In other words, knowing the current state is enough to know probabilities for future states
In particular, the dynamics of a Markov chain are fully determined by the set of values
$$P(x, y) := P\{X_{t+1} = y \,|\, X_t = x\} \qquad (x, y \in S) \tag{2}$$
By construction,
We can view P as a stochastic matrix where
$$P_{ij} = P(x_i, x_j), \qquad 1 \le i, j \le n$$

Going the other way, if we take a stochastic matrix P, we can generate a Markov chain {Xt} as follows: draw X0 from some specified initial distribution, and, for each t = 0, 1, …, draw Xt+1 from P(Xt, ⋅)
By construction, the resulting process satisfies (2)
Consider a worker who, at any given time t, is either unemployed (state 1) or employed (state 2)
Suppose that, over a one month period, an unemployed worker finds a job with probability α∈(0,1), while an employed worker loses her job with probability β∈(0,1)
In terms of a Markov model, we have the state space S = {1, 2} with transition probabilities P(1,2) = α and P(2,1) = β
We can write out the transition probabilities in matrix form as
$$P = \begin{pmatrix} 1-\alpha & \alpha \\ \beta & 1-\beta \end{pmatrix}$$

Once we have the values α and β, we can address a range of questions, such as the long-run fraction of time a worker spends unemployed and the average duration of unemployment
We’ll cover such applications below
Using US unemployment data, Hamilton [Ham05] estimated the stochastic matrix
$$P = \begin{pmatrix} 0.971 & 0.029 & 0 \\ 0.145 & 0.778 & 0.077 \\ 0 & 0.508 & 0.492 \end{pmatrix}$$

where the frequency is monthly and the three states represent normal growth, mild recession, and severe recession, respectively
For example, the matrix tells us that when the state is normal growth, the state will again be normal growth next month with probability 0.97
In general, large values on the main diagonal indicate persistence in the process {Xt}
This Markov process can also be represented as a directed graph, with edges labeled by transition probabilities
Here “ng” is normal growth, “mr” is mild recession, etc.
One natural way to answer questions about Markov chains is to simulate them
(To approximate the probability of event E, we can simulate many times and count the fraction of times that E occurs)
Nice functionality for simulating Markov chains exists in QuantEcon.jl
However, it’s also a good exercise to roll our own routines — let’s do that first and then come back to the methods in QuantEcon.jl
In these exercises we’ll take the state space to be S = {1, …, n}
To simulate a Markov chain, we need its stochastic matrix P and either an initial state or a probability distribution ψ from which the initial state is drawn
The Markov chain is then constructed as discussed above. To repeat: at time t = 0, X0 is set to the given initial state or drawn from ψ; at each subsequent time t, the new state Xt+1 is drawn from P(Xt, ⋅)
In order to implement this simulation procedure, we need a method for generating draws from a discrete distribution
For this task we’ll use a Categorical random variable (i.e. a discrete random variable with assigned probabilities)
d = Categorical([0.5, 0.3, 0.2]) # 3 discrete states
@show rand(d, 5)
@show supertype(typeof(d))
@show pdf(d, 1) # the probability to be in state 1
@show support(d)
@show pdf.(d, support(d)); # broadcast the pdf over the whole support
rand(d, 5) = [3, 3, 2, 3, 1] supertype(typeof(d)) = Distribution{Univariate,Discrete} pdf(d, 1) = 0.5 support(d) = 1:3 pdf.(d, support(d)) = [0.5, 0.3, 0.2]
We’ll write our code as a function that takes the following three arguments: a stochastic matrix P, an initial state init, and a sample size sample_size (the length of the simulated path)
function mc_sample_path(P; init = 1, sample_size = 1000)
@assert size(P)[1] == size(P)[2] # square required
N = size(P)[1] # number of states
# create vector of discrete RVs for each row
dists = [Categorical(P[i, :]) for i in 1:N]
# setup the simulation
X = fill(0, sample_size) # allocate memory, or zeros(Int64, sample_size)
X[1] = init # set the initial state
for t in 2:sample_size
dist = dists[X[t-1]] # get discrete RV from previous state's transition distribution
X[t] = rand(dist) # draw new value
end
return X
end
mc_sample_path (generic function with 1 method)
P = [0.4 0.6; 0.2 0.8]
X = mc_sample_path(P, sample_size = 100_000); # note 100_000 = 100000
μ_1 = count(X .== 1)/length(X) # .== broadcasts test for equality. Could use mean(X .== 1)
0.25203
As discussed above, QuantEcon.jl has routines for handling Markov chains, including simulation
Here’s an illustration using the same P as the preceding example
P = [0.4 0.6; 0.2 0.8];
mc = MarkovChain(P)
X = simulate(mc, 100_000);
μ_2 = count(X .== 1)/length(X) # or mean(x -> x == 1, X)
0.25285
If we wish to, we can provide a specification of state values to MarkovChain
These state values can be integers, floats, or even strings
The following code illustrates
mc = MarkovChain(P, ["unemployed", "employed"])
simulate(mc, 4, init = 1) # start at state 1
4-element Array{String,1}: "unemployed" "employed" "employed" "employed"
simulate(mc, 4, init = 2) # start at state 2
4-element Array{String,1}: "employed" "employed" "unemployed" "employed"
simulate(mc, 4) # start with randomly chosen initial condition
4-element Array{String,1}: "employed" "employed" "employed" "employed"
simulate_indices(mc, 4)
4-element Array{Int64,1}: 2 2 2 1
What then is the distribution of Xt+1, or, more generally, of Xt+m?
Let ψt be the distribution of Xt for t=0,1,2,…
Our first aim is to find ψt+1 given ψt and P
To begin, pick any y∈S
Using the law of total probability, we can decompose the probability that Xt+1=y as follows:
$$P\{X_{t+1} = y\} = \sum_{x \in S} P\{X_{t+1} = y \,|\, X_t = x\} \cdot P\{X_t = x\}$$

In words, to get the probability of being at y tomorrow, we account for all ways this can happen and sum their probabilities
Rewriting this statement in terms of marginal and conditional probabilities gives
$$\psi_{t+1}(y) = \sum_{x \in S} P(x, y) \psi_t(x) \tag{3}$$

There are n such equations, one for each y∈S
If we think of ψt+1 and ψt as row vectors (as is traditional in this literature), these n equations are summarized by the matrix expression
$$\psi_{t+1} = \psi_t P \tag{4}$$
In other words, to move the distribution forward one unit of time, we postmultiply by P
By repeating this m times we move forward m steps into the future
Hence, iterating on (4), the expression ψt+m=ψtPm is also valid — here Pm is the m-th power of P
As a special case, we see that if ψ0 is the initial distribution from which X0 is drawn, then ψ0Pm is the distribution of Xm
This is very important, so let’s repeat it
$$X_0 \sim \psi_0 \implies X_m \sim \psi_0 P^m \tag{5}$$
and, more generally,
$$X_t \sim \psi_t \implies X_{t+m} \sim \psi_t P^m \tag{6}$$
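As a small illustration of (6), the following sketch pushes a distribution forward for the worker model above, with illustrative values α = 0.1 and β = 0.05 chosen purely for the example:

α, β = 0.1, 0.05                   # illustrative parameter values
P_worker = [1 - α  α; β  1 - β]    # worker transition matrix from the example above
ψ_0 = [0.5 0.5]                    # an initial distribution, written as a row vector
@show ψ_0 * P_worker^10            # the distribution of X_10, i.e. ψ_0 P^10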
We know that the probability of transitioning from x to y in one step is P(x,y)
It turns out that the probability of transitioning from x to y in m steps is Pm(x,y), the (x,y)-th element of the m-th power of P
To see why, consider again (6), but now with ψt putting all probability on state x
Inserting this into (6), we see that, conditional on Xt=x, the distribution of Xt+m is the x-th row of Pm
In particular
$$P\{X_{t+m} = y \,|\, X_t = x\} = P^m(x, y) = (x, y)\text{-th element of } P^m$$

Recall the stochastic matrix P for recession and growth considered above
Suppose that the current state is unknown — perhaps statistics are available only at the end of the current month
We estimate the probability that the economy is in state x to be ψ(x)
The probability of being in recession (either mild or severe) in 6 months time is given by the inner product
$$\psi P^6 \cdot \begin{pmatrix} 0 \\ 1 \\ 1 \end{pmatrix}$$

The marginal distributions we have been studying can be viewed either as probabilities or as cross-sectional frequencies in large samples
To illustrate, recall our model of employment / unemployment dynamics for a given worker discussed above
Consider a large (i.e., tending to infinite) population of workers, each of whose lifetime experiences are described by the specified dynamics, independently of one another
Let ψ be the current cross-sectional distribution over {1,2}
The cross-sectional distribution records the fractions of workers employed and unemployed at a given moment
The same distribution also describes the fractions of a particular worker’s career spent being employed and unemployed, respectively
Irreducibility and aperiodicity are central concepts of modern Markov chain theory
Let’s see what they’re about
Let P be a fixed stochastic matrix
Two states x and y are said to communicate with each other if there exist positive integers j and k such that
$$P^j(x, y) > 0 \quad \text{and} \quad P^k(y, x) > 0$$

In view of our discussion above, this means precisely that state y can eventually be reached from state x, and state x can eventually be reached from state y
The stochastic matrix P is called irreducible if all states communicate; that is, if x and y communicate for all (x,y) in S×S
For example, consider the following transition probabilities for wealth of a fictitious set of households
We can translate this into a stochastic matrix, putting zeros where there’s no edge between nodes
$$P := \begin{pmatrix} 0.9 & 0.1 & 0 \\ 0.4 & 0.4 & 0.2 \\ 0.1 & 0.1 & 0.8 \end{pmatrix}$$

It’s clear from the graph that this stochastic matrix is irreducible: we can reach any state from any other state eventually
We can also test this using QuantEcon.jl’s MarkovChain class
P = [0.9 0.1 0.0; 0.4 0.4 0.2; 0.1 0.1 0.8];
mc = MarkovChain(P)
is_irreducible(mc)
true
Here’s a more pessimistic scenario, where the poor are poor forever
This stochastic matrix is not irreducible, since, for example, rich is not accessible from poor
Let’s confirm this
P = [1.0 0.0 0.0; 0.1 0.8 0.1; 0.0 0.2 0.8];
mc = MarkovChain(P);
is_irreducible(mc)
false
We can also determine the “communication classes,” or the sets of communicating states (where communication refers to a nonzero probability of moving in each direction).
communication_classes(mc)
2-element Array{Array{Int64,1},1}: [1] [2, 3]
It might be clear to you already that irreducibility is going to be important in terms of long run outcomes
For example, poverty is a life sentence in the second graph but not the first
We’ll come back to this a bit later
Loosely speaking, a Markov chain is called periodic if it cycles in a predictable way, and aperiodic otherwise
Here’s a trivial example with three states
The chain cycles with period 3:
P = [0 1 0; 0 0 1; 1 0 0];
mc = MarkovChain(P);
period(mc)
3
More formally, the period of a state x is the greatest common divisor of the set of integers
$$D(x) := \{j \ge 1 : P^j(x, x) > 0\}$$

In the last example, D(x) = {3, 6, 9, …} for every state x, so the period is 3
A stochastic matrix is called aperiodic if the period of every state is 1, and periodic otherwise
For example, the stochastic matrix associated with the transition probabilities below is periodic because, for example, state a has period 2
We can confirm that the stochastic matrix is periodic as follows
P = zeros(4, 4);
P[1, 2] = 1;
P[2, 1] = P[2, 3] = 0.5;
P[3, 2] = P[3, 4] = 0.5;
P[4, 3] = 1;
mc = MarkovChain(P);
period(mc)
2
is_aperiodic(mc)
false
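As a brute-force cross-check of this definition, we can compute the greatest common divisor of the return times to state 1 for the matrix P just defined, truncating the set D(x) at an arbitrary horizon of 20:

D = [j for j in 1:20 if (P^j)[1, 1] > 0]   # return times to state 1, up to horizon 20
@show reduce(gcd, D)                       # their gcd is the period of state 1 (here 2)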
As seen in (4), we can shift probabilities forward one unit of time by postmultiplying by P. Some distributions are invariant under this updating process — for example,
P = [.4 .6; .2 .8];
ψ = [0.25, 0.75];
ψ' * P
1×2 Adjoint{Float64,Array{Float64,1}}: 0.25 0.75
Such distributions are called stationary, or invariant
Formally, a distribution ψ∗ on S is called stationary for P if ψ∗=ψ∗P
From this equality we immediately get ψ∗=ψ∗Pt for all t
This tells us an important fact: If the distribution of X0 is a stationary distribution, then Xt will have this same distribution for all t
Hence stationary distributions have a natural interpretation as stochastic steady states — we’ll discuss this more in just a moment
Mathematically, a stationary distribution is a fixed point of P when P is thought of as the map ψ↦ψP from (row) vectors to (row) vectors
Theorem. Every stochastic matrix P has at least one stationary distribution
(We are assuming here that the state space S is finite; if not, more assumptions are required)
For a proof of this result you can apply Brouwer’s fixed point theorem, or see EDTC, theorem 4.3.5
There may in fact be many stationary distributions corresponding to a given stochastic matrix P
Since stationary distributions are long run equilibria, to get uniqueness we require that initial conditions are not infinitely persistent
Infinite persistence of initial conditions occurs if certain regions of the state space cannot be accessed from other regions, which is the opposite of irreducibility
This gives some intuition for the following fundamental theorem
Theorem. If P is both aperiodic and irreducible, then
1. P has exactly one stationary distribution ψ∗
2. for any initial distribution ψ0, the distribution ψ0Pt of Xt converges to ψ∗ as t→∞
For a proof, see, for example, theorem 5.2 of [Haggstrom02]
(Note that part 1 of the theorem requires only irreducibility, whereas part 2 requires both irreducibility and aperiodicity)
A stochastic matrix satisfying the conditions of the theorem is sometimes called uniformly ergodic
One easy sufficient condition for aperiodicity and irreducibility is that every element of P is strictly positive
Recall our model of employment / unemployment dynamics for a given worker discussed above
Assuming α∈(0,1) and β∈(0,1), the uniform ergodicity condition is satisfied
Let ψ∗=(p,1−p) be the stationary distribution, so that p corresponds to unemployment (state 1)
Using ψ∗=ψ∗P and a bit of algebra yields
$$p = \frac{\beta}{\alpha + \beta}$$

This is, in some sense, a steady state probability of unemployment — more on interpretation below
Not surprisingly it tends to zero as β→0, and to one as α→0
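As a quick numerical check of this formula, again with illustrative values α = 0.1 and β = 0.05, we can compare β/(α+β) with the stationary distribution computed by QuantEcon.jl:

α, β = 0.1, 0.05                           # illustrative values
P_w = [1 - α  α; β  1 - β]
mc_w = MarkovChain(P_w)
@show stationary_distributions(mc_w)[1]    # should be ≈ [β/(α+β), α/(α+β)]
@show β / (α + β)                          # steady state unemployment probability p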
As discussed above, a given Markov matrix P can have many stationary distributions
That is, there can be many row vectors ψ such that ψ=ψP
In fact if P has two distinct stationary distributions ψ1,ψ2 then it has infinitely many, since in this case, as you can verify,
$$\psi_3 := \lambda \psi_1 + (1 - \lambda) \psi_2$$

is a stationary distribution for P for any λ∈[0,1]
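Here is a small check of this claim, using a reducible matrix with two closed classes (the numbers are arbitrary); stationary_distributions should return one stationary distribution per recurrent class:

P_red = [0.8 0.2 0.0 0.0;
         0.3 0.7 0.0 0.0;
         0.0 0.0 0.6 0.4;
         0.0 0.0 0.5 0.5]
ψ_1, ψ_2 = stationary_distributions(MarkovChain(P_red))  # one per recurrent class
λ = 0.3
ψ_3 = λ * ψ_1 + (1 - λ) * ψ_2
@show ψ_3' * P_red ≈ ψ_3'   # the convex combination is also stationary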
If we restrict attention to the case where only one stationary distribution exists, one option for finding it is to try to solve the linear system ψ(In−P)=0 for ψ, where In is the n×n identity
But the zero vector solves this equation
Hence we need to impose the restriction that the solution must be a probability distribution
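One direct way to do this by hand — a sketch, not the algorithm used by QuantEcon.jl — is to stack the equations ψ(In−P)=0 together with the normalization that the entries of ψ sum to one, and solve the resulting overdetermined system:

P_s = [0.4 0.6; 0.2 0.8]
n_s = size(P_s, 1)
A = [I - P_s'; ones(1, n_s)]   # (I - P')ψ' = 0 stacked with the normalization sum(ψ) = 1
b = [zeros(n_s); 1.0]
ψ_star = A \ b                 # least-squares solution of the stacked system
@show ψ_star                   # should be ≈ [0.25, 0.75]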
A suitable algorithm is implemented in QuantEcon.jl — the next code block illustrates
P = [.4 .6; .2 .8];
mc = MarkovChain(P);
stationary_distributions(mc)
1-element Array{Array{Float64,1},1}: [0.25, 0.75]
The stationary distribution is unique
Part 2 of the Markov chain convergence theorem stated above tells us that the distribution of Xt converges to the stationary distribution regardless of where we start off
This adds considerable weight to our interpretation of ψ∗ as a stochastic steady state
The convergence in the theorem is illustrated in the next figure
P = [0.971 0.029 0.000
0.145 0.778 0.077
0.000 0.508 0.492] # stochastic matrix
ψ = [0.0 0.2 0.8] # initial distribution
t = 20 # path length
x_vals = zeros(t)
y_vals = similar(x_vals)
z_vals = similar(x_vals)
colors = [repeat([:red], 20); :black] # for plotting
for i in 1:t
x_vals[i] = ψ[1]
y_vals[i] = ψ[2]
z_vals[i] = ψ[3]
global ψ = ψ * P # update distribution (global needed if run outside a notebook)
end
mc = MarkovChain(P)
ψ_star = stationary_distributions(mc)[1]
x_star, y_star, z_star = ψ_star # unpack the stationary dist
plt = scatter([x_vals; x_star], [y_vals; y_star], [z_vals; z_star], color = colors, gridalpha = 0.5, legend = :none)
plot!(plt, camera = (45,45))
Here the red dots are the sequence of distributions ψPt generated from the initial condition ψ, and the black dot is the stationary distribution ψ∗ computed below
The code for the figure can be found here — you might like to try experimenting with different initial conditions
Under irreducibility, yet another important result obtains: For all x∈S,
$$\frac{1}{m} \sum_{t=1}^{m} \mathbf{1}\{X_t = x\} \to \psi^*(x) \quad \text{as } m \to \infty \tag{7}$$
Here 1{Xt = x} equals 1 if Xt = x and zero otherwise, and the convergence holds with probability one, regardless of the initial distribution
The result tells us that the fraction of time the chain spends at state x converges to ψ∗(x) as time goes to infinity
This gives us another way to interpret the stationary distribution — provided that the convergence result in (7) is valid
The convergence in (7) is a special case of a law of large numbers result for Markov chains — see EDTC, section 4.3.4 for some additional information
Recall our cross-sectional interpretation of the employment / unemployment model discussed above
Assume that α∈(0,1) and β∈(0,1), so that irreducibility and aperiodicity both hold
We saw that the stationary distribution is (p,1−p), where
$$p = \frac{\beta}{\alpha + \beta}$$

In the cross-sectional interpretation, this is the fraction of people unemployed
In view of our latest (ergodicity) result, it is also the fraction of time that a worker can expect to spend unemployed
Thus, in the long-run, cross-sectional averages for a population and time-series averages for a given person coincide
This is one interpretation of the notion of ergodicity
We are interested in computing expectations of the form
$$E[h(X_t)] \tag{8}$$
and conditional expectations such as
$$E[h(X_{t+k}) \mid X_t = x] \tag{9}$$
where {Xt} is a Markov chain generated by the stochastic matrix P and h is a given function, which, in expressions involving matrix algebra, we think of as the column vector (h(x1), …, h(xn))′
The unconditional expectation (8) is easy: We just sum over the distribution of Xt to get
$$E[h(X_t)] = \sum_{x \in S} (\psi P^t)(x) h(x)$$

Here ψ is the distribution of X0
Since ψ and hence ψPt are row vectors, we can also write this as
$$E[h(X_t)] = \psi P^t h$$

For the conditional expectation (9), we need to sum over the conditional distribution of Xt+k given Xt=x
We already know that this is Pk(x,⋅), so
$$E[h(X_{t+k}) \mid X_t = x] = (P^k h)(x) \tag{10}$$
The vector Pkh stores the conditional expectation E[h(Xt+k)∣Xt=x] over all x
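Here is a small sketch using Hamilton’s matrix from above, with an arbitrary payoff vector h (the numbers in h and ψ_ex are purely illustrative):

P_h = [0.971 0.029 0.000;
       0.145 0.778 0.077;
       0.000 0.508 0.492]
h = [1.0, 0.5, 0.0]              # illustrative payoff in each state: ng, mr, sr
ψ_ex = [0.2 0.2 0.6]             # an illustrative distribution of X_0, as a row vector
k = 6
@show P_h^k * h                  # conditional expectations E[h(X_{t+k}) | X_t = x], one entry per x
@show (ψ_ex * P_h^k * h)[1]      # unconditional expectation E[h(X_k)] when X_0 ∼ ψ_ex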
Sometimes we also want to compute expectations of a geometric sum, such as ∑_t β^t h(X_t)
In view of the preceding discussion, this is
$$E\left[\sum_{j=0}^{\infty} \beta^j h(X_{t+j}) \,\Big|\, X_t = x\right] = \left[(I - \beta P)^{-1} h\right](x)$$

where
$$(I - \beta P)^{-1} = I + \beta P + \beta^2 P^2 + \cdots$$

Premultiplication by (I−βP)⁻¹ amounts to “applying the resolvent operator”
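Continuing the sketch above, the discounted sum can be computed by solving one linear system (the discount factor 0.96 is arbitrary and is named β_d to avoid clashing with the β of the worker model):

P_h = [0.971 0.029 0.000; 0.145 0.778 0.077; 0.000 0.508 0.492]
h = [1.0, 0.5, 0.0]              # illustrative payoff vector, as above
β_d = 0.96                       # illustrative discount factor
v = (I - β_d * P_h) \ h          # solves (I - β_d P) v = h, i.e. v = (I - β_d P)^{-1} h
@show v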
According to the discussion above, if a worker’s employment dynamics obey the stochastic matrix
$$P = \begin{pmatrix} 1-\alpha & \alpha \\ \beta & 1-\beta \end{pmatrix}$$

with α∈(0,1) and β∈(0,1), then, in the long-run, the fraction of time spent unemployed will be
$$p := \frac{\beta}{\alpha + \beta}$$

In other words, if {Xt} represents the Markov chain for employment, then X̄m → p as m → ∞, where
$$\bar{X}_m := \frac{1}{m} \sum_{t=1}^{m} \mathbf{1}\{X_t = 1\}$$

Your exercise is to illustrate this convergence
First, generate one simulated time series {Xt} of length 10,000, starting at X0 = 1, and plot X̄m − p against m
Second, repeat the first step, but this time taking X0=2
In both cases, set α=β=0.1
The result should look something like the following — modulo randomness, of course
(You don’t need to add the fancy touches to the graph—see the solution if you’re interested)
A topic of interest for economics and many other disciplines is ranking
Let’s now consider one of the most practical and important ranking problems — the rank assigned to web pages by search engines
(Although the problem is motivated from outside of economics, there is in fact a deep connection between search ranking systems and prices in certain competitive equilibria — see [DLP13])
To understand the issue, consider the set of results returned by a query to a web search engine
For the user, it is desirable to receive a large set of accurate matches and to have those matches ordered according to some measure of importance
Ranking according to a measure of importance is the problem we now consider
The methodology developed to solve this problem by Google founders Larry Page and Sergey Brin is known as PageRank
To illustrate the idea, consider the following diagram
Imagine that this is a miniature version of the WWW, with each node representing a web page and each arrow representing the existence of a link from one page to another
Now let’s think about which pages are likely to be important, in the sense of being valuable to a search engine user
One possible criterion for importance of a page is the number of inbound links — an indication of popularity
By this measure, m and j are the most important pages, with 5 inbound links each
However, what if the pages linking to m, say, are not themselves important?
Thinking this way, it seems appropriate to weight the inbound nodes by relative importance
The PageRank algorithm does precisely this
A slightly simplified presentation that captures the basic idea is as follows
Letting j be (the integer index of) a typical page and rj be its ranking, we set
$$r_j = \sum_{i \in L_j} \frac{r_i}{\ell_i}$$

where ℓi is the total number of outbound links from page i, and Lj is the set of all pages i that link to page j
This is a measure of the number of inbound links, weighted by their own ranking (and normalized by 1/ℓi)
There is, however, another interpretation, and it brings us back to Markov chains
Let P be the matrix given by P(i,j)=1{i→j}/ℓi where 1{i→j}=1 if i has a link to j and zero otherwise
The matrix P is a stochastic matrix provided that each page has at least one link
With this definition of P we have
$$r_j = \sum_{i \in L_j} \frac{r_i}{\ell_i} = \sum_{\text{all } i} \mathbf{1}\{i \to j\} \frac{r_i}{\ell_i} = \sum_{\text{all } i} P(i, j) r_i$$

Writing r for the row vector of rankings, this becomes r = rP
Hence r is the stationary distribution of the stochastic matrix P
Let’s think of P(i,j) as the probability of “moving” from page i to page j
The value P(i,j) has the interpretation that P(i,j) = 1/k if page i has k outbound links and one of them points to j, and P(i,j) = 0 if i does not link to j
Thus, motion from page to page is that of a web surfer who moves from one page to another by randomly clicking on one of the links on that page
Here “random” means that each link is selected with equal probability
Since r is the stationary distribution of P, assuming that the uniform ergodicity condition is valid, we can interpret rj as the fraction of time that a (very persistent) random surfer spends at page j
Your exercise is to apply this ranking algorithm to the graph pictured above, and return the list of pages ordered by rank
When you solve for the ranking, you will find that the highest ranked node is in fact g, while the lowest is a
In numerical work it is sometimes convenient to replace a continuous model with a discrete one
In particular, Markov chains are routinely generated as discrete approximations to AR(1) processes of the form
$$y_{t+1} = \rho y_t + u_{t+1}$$

Here u_t is assumed to be iid and N(0, σ_u²)
The variance of the stationary probability distribution of {yt} is
$$\sigma_y^2 := \frac{\sigma_u^2}{1 - \rho^2}$$

Tauchen’s method [Tau86] is the most common method for approximating this continuous state process with a finite state Markov chain
A routine for this already exists in QuantEcon.jl but let’s write our own version as an exercise
As a first step we choose n, the number of states for the discrete approximation, and m, an integer that parameterizes the width of the state space
Next we create a state space {x0,…,xn−1}⊂R and a stochastic n×n matrix P such that x0 = −m·σ_y, x_{n−1} = m·σ_y, and the intermediate points are evenly spaced with step s = (x_{n−1} − x0)/(n − 1)
Let F be the cumulative distribution function of the normal distribution N(0,σ2u)
The values P(xi, xj) are computed to approximate the AR(1) process — omitting the derivation, the rules are as follows:
1. If j = 0, then set P(xi, xj) = F(x0 − ρxi + s/2)
2. If j = n−1, then set P(xi, xj) = 1 − F(x_{n−1} − ρxi − s/2)
3. Otherwise, set P(xi, xj) = F(xj − ρxi + s/2) − F(xj − ρxi − s/2)
The exercise is to write a function approx_markov(rho, sigma_u, m = 3, n = 7) that returns {x0,…,xn−1}⊂R and n×n matrix P as described above
Turning to the solution of the first exercise: compute the fraction of time that the worker spends unemployed, and compare it to the stationary probability.
α = 0.1 # probability of getting hired
β = 0.1 # probability of getting fired
N = 10_000
p_bar = β / (α + β) # steady-state probabilities
P = [1 - α α
β 1 - β] # stochastic matrix
mc = MarkovChain(P)
labels = ["start unemployed", "start employed"]
y_vals = Array{Vector}(undef, 2) # sample paths holder
for x0 in 1:2
X = simulate_indices(mc, N; init = x0) # generate the sample path
X_bar = cumsum(X .== 1) ./ (1:N) # compute state fraction. ./ required for precedence
y_vals[x0] = X_bar .- p_bar # plot divergence from steady state
end
plot(y_vals, color = [:blue :green], fillrange = 0, fillalpha = 0.1,
ylims = (-0.25, 0.25), label = reshape(labels, 1, length(labels)))
web_graph_data = sort(Dict('a' => ['d', 'f'],
'b' => ['j', 'k', 'm'],
'c' => ['c', 'g', 'j', 'm'],
'd' => ['f', 'h', 'k'],
'e' => ['d', 'h', 'l'],
'f' => ['a', 'b', 'j', 'l'],
'g' => ['b', 'j'],
'h' => ['d', 'g', 'l', 'm'],
'i' => ['g', 'h', 'n'],
'j' => ['e', 'i', 'k'],
'k' => ['n'],
'l' => ['m'],
'm' => ['g'],
'n' => ['c', 'j', 'm']))
OrderedCollections.OrderedDict{Char,Array{Char,1}} with 14 entries: 'a' => ['d', 'f'] 'b' => ['j', 'k', 'm'] 'c' => ['c', 'g', 'j', 'm'] 'd' => ['f', 'h', 'k'] 'e' => ['d', 'h', 'l'] 'f' => ['a', 'b', 'j', 'l'] 'g' => ['b', 'j'] 'h' => ['d', 'g', 'l', 'm'] 'i' => ['g', 'h', 'n'] 'j' => ['e', 'i', 'k'] 'k' => ['n'] 'l' => ['m'] 'm' => ['g'] 'n' => ['c', 'j', 'm']
nodes = keys(web_graph_data)
n = length(nodes)
# create adjacency matrix of links (Q[i, j] = true for link, false otherwise)
Q = fill(false, n, n)
for (node, edges) in enumerate(values(web_graph_data))
Q[node, nodes .∈ Ref(edges)] .= true
end
# create the corresponding stochastic matrix
P = Q ./ sum(Q, dims = 2)
mc = MarkovChain(P)
r = stationary_distributions(mc)[1] # stationary distribution
ranked_pages = Dict(zip(keys(web_graph_data), r)) # results holder
# print solution
println("Rankings\n ***")
sort(collect(ranked_pages), by = x -> x[2], rev = true) # print sorted
Rankings ***
14-element Array{Pair{Char,Float64},1}: 'g' => 0.160708 'j' => 0.159362 'm' => 0.119515 'n' => 0.10877 'k' => 0.0910629 'b' => 0.0832646 'e' => 0.0531205 'i' => 0.0531205 'c' => 0.0483421 'h' => 0.0456012 'l' => 0.0320179 'd' => 0.0305625 'f' => 0.0116429 'a' => 0.00291071
A solution from QuantEcon.jl can be found here
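For reference, here is a minimal sketch of approx_markov implementing the Tauchen rules stated above (a simplified version for illustration, not the routine shipped with QuantEcon.jl):

function approx_markov(rho, sigma_u, m = 3, n = 7)
    σ_y = sigma_u / sqrt(1 - rho^2)                   # std dev of the stationary distribution
    x = range(-m * σ_y, stop = m * σ_y, length = n)   # evenly spaced state values
    s = step(x)                                       # distance between grid points
    F(z) = cdf(Normal(0, sigma_u), z)                 # cdf of the shock distribution
    P = zeros(n, n)
    for i in 1:n
        P[i, 1] = F(x[1] - rho * x[i] + s/2)          # all mass below the first midpoint
        P[i, n] = 1 - F(x[n] - rho * x[i] - s/2)      # all mass above the last midpoint
        for j in 2:(n-1)
            P[i, j] = F(x[j] - rho * x[i] + s/2) - F(x[j] - rho * x[i] - s/2)
        end
    end
    return collect(x), P
end
x_grid, P_t = approx_markov(0.9, 1.0)
@show sum(P_t, dims = 2);   # each row should sum to (approximately) one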
Footnotes
[1] Hint: First show that if P and Q are stochastic matrices then so is their product — to check the row sums, try postmultiplying by a column vector of ones. Finally, argue that P^n is a stochastic matrix using induction.