February 2017
In this notebook we discuss some key concepts of statistical decision theory in order to provide a general framework for the comparison of alternative estimators based on their finite sample performance.
The primitive object is a statistical decision problem consisting of a loss function, an action space, and a set of assumed statistical models. We present estimation problems familiar from econometrics as special cases of statistical decision problems. The common framework helps highlight similarities and differences.
We compare estimators based on their (finite sample) risk, where risk is derived from an unknown true data generating mechanism.
We present some straightforward examples to illustrate the main ideas.
Let $\mathbb{R}$ and $\mathbb{Z}$ denote the space of reals and integers, respectively. $\mathbb{R}_+$ is the space of nonnegative real numbers. We use the notation $X^Y$ for a function space that consists of functions mapping from the space $Y$ into the space $X$. As a result, $\mathbb{R}^{d\mathbb{Z}}$ denotes the space of sequences of $d$-dimensional real vectors, while $\mathbb R^{Z}_+$ denotes the set of functions mapping from the range of the (random) variable $Z$ to the space of nonnegative reals.
Let $Z_t$ be a $d$-dimensional random vector representing the value that the $d$ observables take at period $t$. The stochastic process $\{Z_t\}_{t\in\mathbb Z}$ is denoted by $Z^{\infty}$, the partial history including $n$ consecutive elements of $Z^{\infty}$ is $Z^{n}:=\{Z_1, Z_2, \dots, Z_n\}$. Small letters stand for realizations of random variables, hence $z^{\infty}$, $z^n$ and $z$ represent the realization of the stochastic process, the sample and a single observation, respectively.
We use capital letters for distributions, and their small letter counterparts denote the associated densities. For example, we use the generic $Q$ notation for ergodic distributions, $q$ for the corresponding density and $q(\cdot|\cdot)$ for the conditional density. Calligraphic letters are used to denote sets.
We model data as a partial realization of a stochastic process $Z^{\infty}$ taking values in $\mathbb{R}^{d}$. Denote a particular realization as $z^{\infty} \in \mathbb{R}^{d\mathbb{Z}}$ and let the partial history $z^{n}$ containing $n$ consecutive elements of the realization be the sample of size $n$. We assume that there exists a core mechanism underlying this process that describes the relationship among the elements of the vector $Z$. Our aim is to draw inference about this mechanism after observing a single partial realization $z^{n}$.
How is this possible without being able to draw different samples under the exact same conditions? Following the exposition of Breiman (1969) a fruitful approach is to assume that the underlying mechanism is time invariant with the stochastic process being strictly stationary and study its statistical properties by taking long-run time averages of the realization $z^{\infty}$ (or functions thereof), e.g.
$$\lim_{n\to \infty}\frac{1}{n}\sum_{t = 1}^{n} z_t\quad\quad \lim_{n\to \infty}\frac{1}{n} \sum_{t = 1}^{n} z^2_t\quad\quad \lim_{n\to \infty}\frac{1}{n}\sum_{t = k}^{n+k} z_{t}z_{t-k}$$Since the mechanism is assumed to be stable over time, it does not matter when we start observing the process.
Notice, however, that strictly speaking these time averages are properties of the particular realization; the extent to which they can be generalized to the mechanism itself is not obvious. To address this question, it is illuminating to bundle together realizations that share certain statistical properties in order to construct a universe of (counterfactual) alternative $z^{\infty}$-s, the so-called ensemble. Statistical properties of the data generating mechanism can be summarized by assigning probabilities to (sets of) these $z^{\infty}$-s in an internally consistent manner. These considerations lead to the idea of statistical models.
Statistical models are probability distributions over sequences $z^{\infty}$ that assign probabilities so that the unconditional moments are consistent with the associated long-run time averages. In other words, with statistical models the time series and ensemble averages coincide, which is the property known as ergodicity. Roughly speaking, ergodicity allows us to learn about the ensemble dimension by using a single realization $z^{\infty}$.
In reality, being endowed only with a partial history of $z^{\infty}$, we cannot calculate the exact long-run time averages. By imposing more structure on the problem and having a sufficiently large sample, however, we can obtain reasonable approximations. To this end, we need to assume some form of weak independence ("mixing"), or more precisely, the property that on average, the dependence between the elements of $\{Z_t\}_{t\in\mathbb{Z}}$ dies out as we increase the gap between them.
Consequently, if we observe a long segment of $z^{\infty}$ and cut it up into shorter consecutive pieces, say of length $l$, then, we might consider these pieces (provided that $l$ is "large enough") as nearly independent records from the distribution of the $l$-block, $Z^l$. To clarify this point, consider a statistical model $Q_{Z^{\infty}}$ (joint distribution over sequences $z^{\infty}$) with density function $q_{z^{\infty}}$ and denote the implied density of the sample as $q_{n}$. Note that because of strict stationarity, it is enough to use the number of consecutive elements as indices. Under general regularity conditions we can decompose this density as
$$q_{n}\left(z^n\right) = q_{n-1}\left(z_n | z^{n-1}\right)q_{n-1}\left(z^{n-1}\right) = q_{n-1}\left(z_n | z^{n-1}\right)q_{n-2}\left(z_{n-1}|z^{n-2}\right)\dots q_{1}\left(z_{2}|z_1\right)q_{1}\left(z_1\right)$$For simplicity, we assume that the stochastic process is Markov of order $l$, so that the partial histories $z^{i}$ for $i=1,\dots, n-1$ in the conditioning sets can be replaced by the last $l$ lags, $z^{n-1}_{n-l}$, and we can drop the subindex from the conditional densities
$$q_{n}(z^n) = q(z_n | z^{n-1}_{n-l})q(z_{n-1}|z^{n-2}_{n-1-l})\dots q(z_{l+1}|z_{1}^{l})q_{l}(z^l) \quad\quad\quad (1)$$This assumption is much stronger than what we really need. First, it suffices to require the existence of a history-dependent latent state variable, as in the Kalman filter. Moreover, we could also relax the Markov assumption and allow for dependence that dies out only asymptotically. In practice, however, we often have a strong view about the dependency structure, or at least we are willing to use economic theory to guide our choice of $l$. In these cases we almost always assume a Markovian structure. For simplicity, in these lectures, unless otherwise stated, we will restrict ourselves to the family of Markov processes.
This assumption allows us to learn about the underlying mechanism $Q_{Z^{\infty}}$ via its $(l+1)$-period building blocks. Once we determine the (ensemble) distribution of the block, $Q_{Z^{[l+1]}}$, we can "build up" $Q_{Z^{\infty}}$ from these blocks by using a formula similar to (1). In this sense, the block distribution $Q_{Z^{[l+1]}}$ carries the same information as $Q_{Z^{\infty}}$. Therefore, from now on, we define $Z$ as the minimal block we need to know and treat it as an observation. Statistical models can then be represented by their predictions about the ensemble distribution $P$ of this observable.
We assume that the mechanism underlying $Z^{\infty}$ can be represented by a statistical model $P$, which we call the true data generating process (DGP). We seek to learn about the features of this model from the observed data.
Following Wald (1950) every statistical decision problem that we will consider can be represented with a triple $(\mathcal{H}, \mathcal{A}, L)$, where
Assumed statistical models, $\mathcal{H}\subseteq \mathcal{Q} \subset \mathcal{P}$
$\mathcal{H}$ is a collection of ergodic probability measures over the observed data, which captures our maintained assumptions about the mechanism underlying $Z^{\infty}$. The set of all ergodic distributions $\mathcal{Q}$ is a strict subset of $\mathcal{P}$--the space of strictly stationary probability distributions over the observed data. In fact, the set of ergodic distributions, $\mathcal{Q}$, constitutes the set of extreme points of $\mathcal{P}$. Ergodicity implies that with infinite data we could single out one element from $\mathcal{H}$.
Action space, $\mathcal{A}\subseteq \mathcal{F}$
The set of allowable actions. It is an abstract set embodying our proposed specification by which we aim to capture features of the true data generating mechanism. It is a subset of $\mathcal{F}$--the largest possible set of functions for which the loss function (see below) is well defined.
Loss function $L: \mathcal{P}\times \mathcal{F} \mapsto \mathbb{R}_+$
The loss function measures the performance of alternative actions $a\in \mathcal{F}$ under a given distribution $P\in \mathcal{P}$. In principle, $L$ measures the distance between distributions in $\mathcal{P}$ along particular dimensions determined by features of the data generating mechanism that we are interested in. By assigning zero distance to models that share a particular set of features (e.g. conditional expectation, set of moments, etc.), the loss function can 'determine' the relevant features of the problem.
Given the assumed statistical models, we can restrict the domain of the loss function without loss of generality such that $L: \mathcal{H}\times\mathcal{A}\mapsto\mathbb{R}_+$.
Quadratic loss:
The most commonly used loss function is the quadratic
$$L(P, a) = \int \lVert z - a \rVert^2\mathrm{d}P(z)$$where the admissible space is $\mathcal{F}\subseteq \mathbb{R}^{k}$. Another important case is when we can write $Z = (Y, X)$, where $Y$ is univariate and the loss function is
$$L(P, a) = \int (y - a(x))^2\mathrm{d}P(y, x)$$and the admissible space $\mathcal{F}$ contains all square integrable real functions of $X$.
Relative entropy loss:
When we specify a whole distribution and are willing to approximate $P$, one useful measure for the comparison of distributions is the Kullback-Leibler divergence, or relative entropy
$$L(P, a) = \int \log \frac{p(z)}{a(z)} \mathrm{d}P(z)$$in which case the admissible space is the set of distributions which have a density (w.r.t. the Lebesgue measure), $\mathcal{F} = \{a: Z \mapsto \mathbb{R}_+ : \int a(z)\mathrm{d}z=1\}$.
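As a quick illustration, the sketch below approximates this loss by Monte Carlo under toy Gaussian choices for $p$ and $a$ (both of our choosing, not part of the discussion above): the sample average of $\log p(z) - \log a(z)$ over draws from $P$ estimates the relative entropy.

```python
import numpy as np
from scipy.stats import norm

# Monte Carlo sketch of the relative entropy loss E_P[log p(Z) - log a(Z)]
# with hypothetical densities: true P = N(0, 1), candidate a = N(0.5, 1.2).
rng = np.random.default_rng(0)
z = rng.normal(0.0, 1.0, size=200_000)      # draws from the true P

log_p = norm.logpdf(z, loc=0.0, scale=1.0)  # true density p
log_a = norm.logpdf(z, loc=0.5, scale=1.2)  # candidate action a
print(np.mean(log_p - log_a))               # >= 0, zero iff a = p
```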
Generalized Method of Moments:
Following the exposition of Manski (1988), many econometric problems can be cast as solving the equation $T(P, \theta) = \mathbf{0}$ in the parameter $\theta$, for a given function $T: \mathcal{P}\times\Theta \mapsto \mathbb{R}^m$ with $\Theta$ being the parameter space. By expressing estimation problems in terms of unconditional moment restrictions, for example, we can write $T(P, \theta) = \int g(z; \theta)\mathrm{d}P(z) = \mathbf{0}$ for some function $g$. Taking an origin-preserving continuous transformation $r:\mathbb{R}^m \mapsto \mathbb{R}_+$ so that
$$T(P, \theta) = \mathbf{0} \iff r\left(T(P, \theta)\right)=0$$we can present the problem in terms of minimizing a particular loss function. Define the admissible space as $\mathcal{F} = \Theta$; then the method of moments estimator minimizes the loss $L(P, \theta) = r\circ T(P, \theta)$. The most common form of $L$ is
$$L(P, \theta) = \left[\int g(z; \theta)\mathrm{d}P(z)\right]' W \left[\int g(z; \theta)\mathrm{d}P(z)\right]$$where $W$ is an $m\times m$ positive-definite weighting matrix.
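To make the quadratic-form loss concrete, here is a minimal sketch of its sample analogue. The moment function $g$ is a hypothetical choice stacking the first two moment restrictions of a location-scale model; the weighting matrix defaults to the identity.

```python
import numpy as np

# Hypothetical moment function g(z; theta) for theta = (mu, sigma2):
# stacks the restrictions E[Z - mu] = 0 and E[(Z - mu)^2 - sigma2] = 0.
def g(z, theta):
    mu, sigma2 = theta
    return np.column_stack([z - mu, (z - mu)**2 - sigma2])

def gmm_loss(z, theta, W=None):
    """Sample analogue of the loss: g_bar' W g_bar."""
    g_bar = g(z, theta).mean(axis=0)          # (1/n) sum_i g(z_i; theta)
    W = np.eye(g_bar.size) if W is None else W
    return g_bar @ W @ g_bar

rng = np.random.default_rng(1)
z = rng.normal(1.0, 2.0, size=1_000)
print(gmm_loss(z, (1.0, 4.0)))   # close to zero at the true moments
print(gmm_loss(z, (0.0, 1.0)))   # larger away from them
```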
By using a loss function, we acknowledge that learning about the true mechanism in its entirety might be too ambitious, so we had better focus our attention on certain features of it and try to approximate those with our specification. The loss function expresses our assessment of the importance of different features and of the penalty used to punish deviations from the true features. We define the feature functional $\gamma: \mathcal{P}\mapsto \mathcal{F}$ by the following optimization over the admissible space $\mathcal{F}$
$$\gamma(P) := \arg\min_{a \in \mathcal{F}} \ L(P,a)$$and say that $\gamma(P)$ captures the features of $P$ that we wish to learn about. It follows that by changing $L$ we are effectively changing the features of interest.
If one knew the data generating process, there would be no need for statistical inference. What makes the problem statistical is that the distribution $P$ describing the environment is unknown. The statistician must base her action on the available data, which is a partial realization of the underlying data generating mechanism. As we will see, this lack of information implies that for statistical inference the whole admissible space $\mathcal F$ is almost always "too large". As a result, one typically looks for an approximation in a restricted action space $\mathcal{A}\subsetneq \mathcal{F}$, for which we define the best-in-class action as follows
$$a^*_{L,\ P,\ \mathcal{A}} := \arg\min_{a \in \mathcal{A}} \ L(P,a).$$With a restricted action space, this best-in-class action might differ from the true feature $\gamma(P)$. We can summarize this scenario compactly by writing $\gamma(P)\notin \mathcal{A}$ and saying that our specification embodied by $\mathcal{A}$ is misspecified. Naturally, in such cases the properties of the loss function become crucial, as it specifies the nature of the punishments used to weight deviations from $\gamma(P)$. We will talk more about misspecification in the following sections. A couple of examples should help clarify the introduced concepts.
Consider the quadratic loss function over the domain of all square integrable functions $L^2(X, \mathbb{R})$ and let $Z = (Y, X)$, where $Y$ is a scalar. The corresponding feature is
$$\gamma(P) = \mathbb{E}[Y|X] = \arg\min_{a \in L^2(X)} \int\limits_{(Y,X)} (y - a(x))^2\mathrm{d}P(y, x)$$If the action space $\mathcal{A}$ does not include all square integrable functions, but only the set of affine functions, the best-in-class action, i.e., the linear projection of $Y$ onto the space spanned by $X$, will in general differ from $\gamma(P)$. In other words, the linear specification for the conditional expectation $Y|X$ is misspecified.
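A small simulation may help fix ideas. Under an assumed toy DGP with $\mathbb{E}[Y|X] = X^2$ (our choice, purely for illustration), the linear projection coefficients can be computed directly and turn out to miss the curvature entirely.

```python
import numpy as np

# Toy DGP (an assumption for illustration): X ~ N(0, 1), Y = X^2 + eps,
# so the true feature is gamma(P) = E[Y|X] = X^2, which is not affine.
rng = np.random.default_rng(2)
x = rng.normal(size=100_000)
y = x**2 + rng.normal(size=x.size)

# Best affine action (linear projection of Y onto the span of X):
beta1 = np.cov(x, y, ddof=0)[0, 1] / x.var()  # Cov(X, Y) / Var(X)
beta0 = y.mean() - beta1 * x.mean()           # E[Y] - beta1 * E[X]
print(beta0, beta1)   # approximately (1, 0): the curvature is lost
```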
Consider the Kullback-Leibler distance over the set of distributions with existing density functions. Denote this set by $D_Z$. Given that the true $P\in D_Z$, the corresponding feature is
$$\gamma(P) = \arg\min_{a \in D_Z} \int\limits_{Z}\log\left(\frac{p(z)}{a(z)}\right) \mathrm{d}P(z)$$which provides the density $p\in\mathbb{R}_+^Z$ such that $\int p(z)\mathrm{d}z =1$ and, for any measurable set $B\subseteq \mathbb{R}^k$, $\int_B p(z)\mathrm{d}z = P(B)$. If the action space $\mathcal{A}$ is only a parametric subset of $D_Z$, the best-in-class action will be the best approximation in terms of the KLIC. For an extensive treatment see White (1994).
An important aspect of the statistical decision problem is the relationship between $\mathcal{H}$ and $\mathcal{A}$. Our maintained assumptions about the mechanism are embodied in $\mathcal{H}$, so a natural attitude is to be as agnostic as possible about $\mathcal{H}$ in order to avoid incredible assumptions. Once we have determined $\mathcal{H}$, the next step is to choose the specification, that is, the action space $\mathcal{A}$.
One approach is to tie $\mathcal{H}$ and $\mathcal{A}$ together. For example, the assumptions of the standard linear regression model outline the distributions contained in $\mathcal{H}$ (normally distributed errors with zero mean and constant variance), for which the natural action space is the space of affine functions.
On the other hand, many approaches explicitly disentangle $\mathcal{A}$ from $\mathcal{H}$: they remain agnostic about the maintained assumptions $\mathcal{H}$ and instead impose restrictions on the action space $\mathcal{A}$. At the cost of giving up some potentially undominated actions, this approach can substantially improve the success of the inference problem in finite samples.
By choosing an action space not being tied to the set of assumed statistical models, the statistician inherently introduces a possibility of misspecification -- for some statistical models there could be an action outside of the action space which would fare better than any other action within $\mathcal{A}$. However, coarsening the action space in this manner has the benefit of restricting the variability of estimated actions arising from the randomness of the sample.
In this case, the best-in-class action has a special role: it minimizes the "distance" between $\mathcal{A}$ and the true feature $\gamma(P)$, thus measuring the benchmark bias stemming from restricting the action space to $\mathcal{A}$.
The observable is a binary variable $Z\in\{0, 1\}$ generated by some statistical model. One might approach this problem by using the following triple
Assumed statistical models, $\mathcal{H}$: the family of i.i.d. Bernoulli distributions indexed by the parameter $\theta\in[0,1]$, with density
$$p(z; \theta) = \theta^z(1-\theta)^{1-z}.$$
Action space, $\mathcal{A}$: the set of admissible parameter values, $\mathcal{A} = [0, 1]$.
Loss function, $L$: We entertain two alternative loss functions
* Quadratic loss
$$L_{MSE}(P, a) = \sum_{z\in\{0,1\}} p(z; \theta)(\theta - a)^2 = E_{\theta}[(\theta - a)^2]$$
* Relative entropy loss
$$L_{RE}(P, a) = \sum_{z\in\{0,1\}} p(z; \theta)\log \frac{p(z; \theta)}{p(z; a)} = E_{\theta}\left[\log \frac{p(z; \theta)}{p(z; a)}\right]$$where $E_{\theta}$ denotes the expectation operator with respect to the distribution $P(z; \theta)\in\mathcal{H}$.
In the basic setup of regression function estimation we write $Z=(Y,X)\in\mathbb{R}^2$ and the objective is to predict the value of $Y$ as a function of $X$, penalizing deviations through the quadratic loss function. Let $\mathcal{F}:= \{f:X \mapsto Y\}$ be the family of square integrable functions mapping from $X$ to $Y$. The following is an example of such a triple
Assumed statistical models, $\mathcal{H}$: a set of ergodic distributions over $Z = (Y, X)$ with finite second moments
Action space, $\mathcal{A}$: the set of affine functions of $X$, $\mathcal{A} = \{a : a(x) = \beta_0 + \beta_1 x\}\subset\mathcal{F}$
Loss function, $L$: the quadratic loss, $L(P, a) = \int (y - a(x))^2\mathrm{d}P(y, x)$
A statistical decision function (or statistical decision rule) is a function mapping samples (of different sizes) to actions from $\mathcal{A}$. In order to flexibly talk about the behavior of decision rules as the sample size grows to infinity, we define the domain of the decision rule to be the set of samples of all potential sample sizes, $\mathcal{S}:= \bigcup_{n\geq1}Z^n$. The decision rule is then defined as a sequence of functions
$$ d:\mathcal{S} \mapsto \mathcal{A} \quad \quad \text{that is} \quad \quad \{d(z^n)\}_{n\geq 1}\subseteq \mathcal{A},\quad \forall z^{n}, \forall n\geq 1. $$One common way to find a decision rule is to plug the empirical distribution $P_{n}$ into the loss function $L(P, a)$ to obtain
$$L_{RE}\left(P_{n}; a\right) = \frac{1}{n}\sum_{i = 1}^{n} \log \frac{p(z_i; \theta)}{p(z_i; a)}\quad\quad\text{and}\quad\quad L_{MSE}\left(P_{n}; a\right) = \frac{1}{n}\sum_{i = 1}^{n} (z_i -a)^2$$and to look for an action that minimizes this sample analog. In case of relative entropy loss, it is
$$d(z^n) := \arg \min_{a\in[0,1]} L_{RE}(P_{n}, a) = \arg\max_{a\in[0,1]} \frac{1}{n}\sum_{i=1}^{n} \log p(z_i; a) = \arg\max_{a\in[0,1]} \frac{1}{n}\underbrace{\left(\sum_{i=1}^{n} z_i\right)}_{:= y}\log a + \left(\frac{n-y}{n}\right)\log(1-a) $$where we define the random variable $Y_n := \sum_{i = 1}^{n} Z_i$ as the number of $1$s in the sample of size $n$, with $y$ denoting a particular realization. The solution of the above problem is the maximum likelihood estimator, which takes the following form
$$\hat{a}(z^n) = \frac{1}{n}\sum_{i=1}^{n} z_i = \frac{y}{n}$$and hence the maximum likelihood decision rule is
$$d_{mle}(z^n) = P(z, \hat{a}(z^n)).$$It is straightforward to see that if we used the quadratic loss instead of relative entropy, the decision rule would be identical to $d_{mle}(z^n)$. Nonetheless, the two loss functions can lead to very different assessments of the decision rule, as will be shown below.
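As a numerical sanity check (a sketch of ours, not part of the original derivation), one can minimize both sample analogues directly and verify that they return the sample mean $y/n$.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Verify that the relative entropy and quadratic sample analogues share
# the same minimizer y/n in the coin-tossing example.
rng = np.random.default_rng(3)
z = rng.binomial(1, 0.7, size=25)

neg_loglik = lambda a: -np.mean(z * np.log(a) + (1 - z) * np.log(1 - a))
mse = lambda a: np.mean((z - a)**2)

for obj in (neg_loglik, mse):
    res = minimize_scalar(obj, bounds=(1e-6, 1 - 1e-6), method='bounded')
    print(res.x)        # both match the sample mean below
print(z.mean())         # y / n
```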
For comparison, we consider another decision rule, a particular Bayes estimator (posterior mean), which takes the following form
$$d_{bayes}(z^n) = P(z, \hat{a}_B(z^n))\quad\quad\text{where}\quad\quad \hat{a}_B(z^n) = \frac{\sum^{n}_{i=1} z_i + \alpha}{n + \alpha + \beta} = \frac{y + \alpha}{n + \alpha + \beta}$$where $\alpha, \beta > 0$ are given parameters of the Beta prior. Later, we will see how one can derive such estimators. What is important for us now is that this is an alternative decision rule arising from the same triple $(\mathcal{H}, \mathcal{A}, L_{MSE})$ as the maximum likelihood estimator, with possibly different statistical properties.
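The two decision rules are simple enough to state in a few lines of code; the prior parameters $\alpha = 5$, $\beta = 2$ below match the values used in the figures that follow.

```python
import numpy as np

# Sketch of the two decision rules in the coin-tossing example; both map
# a sample of 0/1 draws into an estimate of theta.
def a_mle(z):
    return z.mean()                                # y / n

def a_bayes(z, alpha=5.0, beta=2.0):
    y, n = z.sum(), z.size
    return (y + alpha) / (n + alpha + beta)        # posterior mean

rng = np.random.default_rng(4)
z = rng.binomial(1, 0.7, size=25)
print(a_mle(z), a_bayes(z))
```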
In this case, the approach that we used to derive the maximum likelihood estimator in the coin tossing example leads to the following sample analog objective function
$$ d_{OLS}(z^n):= \arg\min_{a \in \mathcal{A}}L(P_{n},a) = \arg\min_{\beta_0, \ \beta_1} \sum_{t=1}^n (y_t - \beta_0 - \beta_1 x_t)^2. $$With a slight abuse of notation, redefine $X$ to include the constant for the intercept, i.e. $\mathbf{X} = (\mathbf{\iota}, x^n)$. Then the solution for the vector of coefficients, $\mathbf{\beta}=(\beta_0, \beta_1)$, in the ordinary least squares regression is given by
$$\hat{\mathbf{\beta}}_{OLS} := (\mathbf{X}^T \mathbf{X})^{-1}\mathbf{X}^T \mathbf{Y}. $$Hence, after sample $z^n$, the decision rule predicts $y$ as an affine function given by $d_{OLS}(z^n) = \hat{a}_{OLS}$ such that
$$ \hat{a}_{OLS}(x) := \langle \mathbf{\hat{\beta}}_{OLS}, (1, x) \rangle $$where $\langle \cdot, \cdot \rangle$ denotes the inner product on $\mathbb R^{2}$.
Again, for comparison we consider a Bayesian decision rule where the conditional prior distribution of $\beta$ is distributed as $\beta|\sigma \sim \mathcal{N}(\mu_b, \sigma^2\mathbf{\Lambda_b}^{-1})$. Then the decision rule is given by
$$ \hat{\mathbf{\beta}}_{bayes} := (\mathbf{X}^T \mathbf{X} + \mathbf{\Lambda_b})^{-1}(\mathbf{\Lambda_b} \mu_b + \mathbf{X}^T \mathbf{Y}). $$Hence, the decision rule after sample $z^n$ is an affine function given by $d_{bayes}(z^n) = \hat{a}_{bayes}$ such that
$$ \hat{a}_{bayes}(x) := \langle \mathbf{\hat{\beta}}_{bayes}, (1, x) \rangle. $$Again, our only purpose here is to show that we can define alternative decision rules for the same triple $(\mathcal{H}, \mathcal{A}, L_{MSE})$ which might exhibit different statistical properties.
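The following sketch implements both rules; the data generating process used here is a hypothetical affine model chosen purely for illustration, not the one analyzed later in the text.

```python
import numpy as np

# Sketch of the two regression rules. X is the design matrix including a
# column of ones; mu_b and Lambda_b are the prior mean and precision.
def beta_ols(X, y):
    return np.linalg.solve(X.T @ X, X.T @ y)

def beta_bayes(X, y, mu_b, Lambda_b):
    return np.linalg.solve(X.T @ X + Lambda_b, Lambda_b @ mu_b + X.T @ y)

# Hypothetical affine DGP for illustration only.
rng = np.random.default_rng(5)
x = rng.normal(3.0, np.sqrt(8.0), size=50)
y = 0.625 + 0.125 * x + rng.normal(0.0, 2.0, size=50)
X = np.column_stack([np.ones_like(x), x])

mu_b = np.array([2.0, 2.0])
Lambda_b = np.array([[6.0, -3.0], [-3.0, 6.0]])
print(beta_ols(X, y), beta_bayes(X, y, mu_b, Lambda_b))
```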
For a given sample $z^n$, the decision rule assigns an action $d(z^n)\in\mathcal{A}$, which is then evaluated with the loss function $L(P, d(z^n))$ using a particular distribution $P\in\mathcal{H}$. Evaluating the decision rule and the loss function with a single sample, however, does not capture the uncertainty arising from the randomness of the sample. To get that we need to assess the decision rule in counterfactual worlds with different realizations for $Z^n$.
For each possible data generating mechanism, we can characterize the properties of a given decision rule by considering the distribution that it induces over losses. It is instructive to note that the decision rule $d$ in fact gives rise to two induced distributions: the distribution over actions in $\mathcal{A}$, and the distribution over losses in $\mathbb{R}_+$ that these actions imply.
This approach proves to be useful as the action space can be an abstract space with no immediate notion of metric while the range of the loss function is always the real line (or a subset of it). In other words, a possible way to compare different decision rules is to compare the distributions they induce over losses under different data generating mechanisms for a fixed sample size.
Comparing distributions, however, is often an ambiguous task. A special case where one can safely claim that one decision rule is better than another is when the probability that the loss falls below any given level $x$ is always at least as large for one decision rule as for the other. For instance, we could say that $d_1$ is a better decision rule than $d_2$ relative to $\mathcal{H}$ if for all $P\in\mathcal{H}$
$$ P\{z^n: L(P, d_1(z^n)) \leq x\} \geq P\{z^n: L(P, d_2(z^n)) \leq x\} \quad \forall \ x\in\mathbb{R} $$which is equivalent to stating that the induced loss distribution of $d_2$ first-order stochastically dominates the induced loss distribution of $d_1$ for every $P\in\mathcal{H}$. This, of course, implies that
$$ \mathbb{E}[L(P, d_1(z^n))] \leq \mathbb{E}[L(P, d_2(z^n))]$$where the expectation is taken with respect to the sample distributed according to $P$.
In fact, the expected value of the induced loss is the most common measure to evaluate decision rules. Since the loss is defined over the real line, this measure always gives a single real number which serves as a basis of comparison for a given data generating process. The expected value of the loss induced by a decision rule is called the risk of the decision rule and is denoted by
$$R_n(P, d) = \mathbb{E}[L(P, d(z^n))].$$This functional provides a clear and straightforward ordering of decision rules: $d_1$ is preferred to $d_2$ for a given sample size $n$ if $R_n(P, d_1) < R_n\left(P, d_2\right)$. Following this logic, it might be tempting to look for the decision rule that is optimal in terms of finite sample risk. This problem, however, is immensely complicated because its criterion function hinges on an object, $P$, that we cannot observe.
Nonetheless, statistical decision theory provides a very useful common framework in which different approaches to constructing decision rules can be analyzed, highlighting their relative strengths and weaknesses. In notebook3 and notebook4 {REF to notebooks} we will consider three approaches, each offering an alternative way to handle our ignorance about the true risk.
Consider the case when the true data generating process is indeed i.i.d. Bernoulli with parameter $\theta_0$. This implies that we have a correctly specified model. The sample that we are endowed with for inference has size $n=25$.
The left and right panels of the following figure show the induced action distributions of the MLE and Bayes decision rules (with $\alpha=5$, $\beta=2$), respectively, for two alternative values of $\theta_0$. Solid colors denote the scenario corresponding to the sample distribution of the last figure, while faded colors show the distributions induced by an alternative $\theta_0$, with the prior parameters of the Bayes decision rule kept fixed.
Finally, the figure below compares the performance of the two decision rules according to their finite sample risk. The two panels of the first row show the induced loss distributions of the MLE estimator for the relative entropy and quadratic loss functions. The two panels of the second row show the same distributions for the Bayes decision rule. The vertical dashed lines indicate the values of the respective risk functionals.
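A Monte Carlo sketch along the following lines can reproduce such a risk comparison under the quadratic loss; the printed values (not reported in the text) depend on $\theta_0$ and the prior parameters.

```python
import numpy as np

# Monte Carlo approximation of the finite sample risk
# R_n(P, d) = E[(theta_0 - d(Z^n))^2] in the coin-tossing example.
def risk(theta0, rule, n=25, n_sim=10_000, seed=6):
    rng = np.random.default_rng(seed)
    samples = rng.binomial(1, theta0, size=(n_sim, n))
    actions = np.array([rule(z) for z in samples])
    return np.mean((theta0 - actions)**2)

a_mle = lambda z: z.mean()
a_bayes = lambda z: (z.sum() + 5.0) / (z.size + 5.0 + 2.0)  # alpha=5, beta=2
print(risk(0.7, a_mle), risk(0.7, a_bayes))
```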
Suppose that our model is correctly specified. In particular, let the data generating mechanism be i.i.d. with
$$ (Y,X) \sim \mathcal{N}(\mu, \Sigma) \quad\quad \text{where}\quad\quad \mu = (1, 3)\quad \text{and}\quad \Sigma = \begin{bmatrix} 4 & 1 \\ 1 & 8 \end{bmatrix}.$$Under this data generating mechanism, the optimal regression function is affine with coefficients
$$ \begin{align} \beta_0 &= \mu_Y - \rho\frac{\sigma_Y}{\sigma_X}\mu_X = 1 - \frac{1}{8}\, 3 = 0.625, \\ \beta_1 &= \rho\frac{\sigma_Y}{\sigma_X} = \frac{1}{8} = 0.125. \end{align} $$Due to correct specification, these coefficients in fact determine the feature, i.e. the true regression function.
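These coefficients are easy to verify numerically, ordering the coordinates of $\mu$ and $\Sigma$ as $(Y, X)$.

```python
import numpy as np

# Best-in-class affine coefficients implied by the bivariate normal above.
mu = np.array([1.0, 3.0])                   # (mu_Y, mu_X)
Sigma = np.array([[4.0, 1.0],
                  [1.0, 8.0]])              # rows/cols ordered (Y, X)

beta1 = Sigma[0, 1] / Sigma[1, 1]           # Cov(Y, X) / Var(X) = 1/8
beta0 = mu[0] - beta1 * mu[1]               # mu_Y - beta1 * mu_X
print(beta0, beta1)                         # 0.625, 0.125
```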
For the Bayes estimator consider the prior
$$\mu \sim \mathcal{N}\left(\mu_b, \Lambda_b^{-1}\right) \quad\quad \text{where}\quad\quad \mu_b = (2, 2)\quad \text{and}\quad \Lambda_b = \begin{bmatrix} 6 & -3 \\ -3 & 6 \end{bmatrix}$$and suppose that $\Sigma$ is known. Let the sample size be $n=50$. With the given specification we can simulate the induced action and loss distributions.
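The simulation itself can be sketched as follows. We read the Bayes rule as the conjugate coefficient-prior estimator defined earlier, which is an assumption on our part about the notebook's exact implementation.

```python
import numpy as np

# Simulate the action distributions induced by the OLS and Bayes rules
# under the bivariate normal DGP above, with n = 50 per sample.
rng = np.random.default_rng(7)
mu = np.array([1.0, 3.0])
Sigma = np.array([[4.0, 1.0], [1.0, 8.0]])
mu_b = np.array([2.0, 2.0])
Lambda_b = np.array([[6.0, -3.0], [-3.0, 6.0]])

def draw_actions(n=50, n_sim=1_000):
    ols, bayes = [], []
    for _ in range(n_sim):
        y, x = rng.multivariate_normal(mu, Sigma, size=n).T
        X = np.column_stack([np.ones(n), x])
        ols.append(np.linalg.solve(X.T @ X, X.T @ y))
        bayes.append(np.linalg.solve(X.T @ X + Lambda_b,
                                     Lambda_b @ mu_b + X.T @ y))
    return np.array(ols), np.array(bayes)

ols, bayes = draw_actions()
print(ols.mean(axis=0), bayes.mean(axis=0))  # centers of the two clouds
```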
The following figure shows contour plots of the induced action distributions associated with the OLS and Bayes estimators. The red dot depicts the best-in-class action.
Using quadrature methods one can calculate the loss of each action which gives rise to the induced loss distribution. As an approximation to these induced loss distributions, the following figure shows the histograms emerging from these calculations.
In the above examples we maintained the assumption of correctly specified models, i.e., the true feature of the data generating process lay within the action set $\mathcal{A}$. In applications using nonexperimental data, however, it is more reasonable to assume that the action set contains only approximations of the true feature.
Nothing in the analysis above prevents us from entertaining the possibility of misspecification. In these instances one can look at $a^{*}_{L, P, \mathcal{A}}$ as the best approximation of $\gamma(P)$ achievable by the model specification $\mathcal{A}$. For example, even though the true regression function (conditional expectation) might not be linear, the exercise of estimating the best linear approximation of the regression function is well defined.
In theory, one can investigate the approximation error emerging from a misspecified $\mathcal{A}$ via the loss function without mentioning the inference (finite sample) problem at all. In particular, the misspecification error can be defined as
$$\min_{a\in\mathcal{A}} \ L(P,a) - L(P, \gamma(P))$$This naturally leads to a dilemma regarding the "size" of the action space: with a richer $\mathcal{A}$, in principle, we can get closer to the true feature by making the misspecification error small. Notice, however, that in practice, not knowing $P$ implies that we cannot solve the above optimization problem and obtain the best-in-class action. As we show in notebook2 {REF}, a possible way to proceed is to require the so-called consistency property from our decision rule, by which we can guarantee to get very close to $a^{*}_{L, P, \mathcal{A}}$ with sufficiently large samples. What "sufficiently large" means, however, is determined by the size of $\mathcal{A}$: larger action spaces require larger samples to produce sensible estimates of the best-in-class action. In fact, by using a "too large" $\mathcal{A}$ accompanied by a "too small" sample, our estimator's performance can be so bad that misspecification concerns become secondary.
In other words, the finiteness of the sample gives rise to a trade-off between the severity of misspecification and the credibility of our estimates. To see this, decompose the deviation of the finite sample risk from the value of the loss at the truth (the excess risk) for a given decision rule $d$ and sample size $n$:
$$R_n(P, d) - L\left(P, \gamma(P) \right) = \underbrace{R_n(P, d) - L\left(P, a^{*}_{L,P, \mathcal{A}}\right)}_{\text{estimation error}} + \underbrace{L\left(P, a^{*}_{L, P, \mathcal{A}}\right)- L\left(P, \gamma(P)\right)}_{\text{misspecification error}}$$While the estimation error stems from the fact that we do not know $P$, so we have to use a finite sample to approximate the best-in-class action, misspecification error, not influenced by any random object, arises from the necessity of $\mathcal{A}\subsetneq\mathcal{F}$.
This trade-off resembles the bias-variance dilemma well known from classical statistics. Statisticians often associate the estimation error with the decision rule's variance, whereas the misspecification error is considered the bias term. We will see in notebook3 {REF} that this interpretation is slightly misleading. Nonetheless, it is true that, similar to the bias-variance trade-off, manipulation of (the size of) $\mathcal{A}$ is the key device for addressing the estimation-misspecification error trade-off. The minimal excess risk is attained by the action space where the following two forces are balanced {REF to figure in notebook3}:
* enlarging $\mathcal{A}$ reduces the misspecification error, but
* enlarging $\mathcal{A}$ increases the estimation error for a given sample size.
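The following sketch illustrates this balance under an assumed toy DGP of our choosing: polynomial action spaces of increasing degree shrink the misspecification error while inflating the estimation error, so the Monte Carlo excess risk typically traces out a U-shape in the degree.

```python
import numpy as np

# Toy DGP (an assumption for illustration): X ~ U[-1, 1], Y = sin(2X) + eps.
# The action space is the set of polynomials of a given degree; the excess
# risk is approximated by Monte Carlo over repeated samples of size n.
rng = np.random.default_rng(8)
x_test = rng.uniform(-1, 1, size=5_000)

def excess_risk(degree, n=30, n_sim=500):
    losses = []
    for _ in range(n_sim):
        x = rng.uniform(-1, 1, size=n)
        y = np.sin(2 * x) + rng.normal(0.0, 0.5, size=n)
        coef = np.polyfit(x, y, degree)               # estimated action
        pred = np.polyval(coef, x_test)
        losses.append(np.mean((np.sin(2 * x_test) - pred)**2))
    return np.mean(losses)

for degree in (1, 3, 5, 9):
    print(degree, excess_risk(degree))   # risk falls, then rises again
```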
In the next lecture {REF: notebook2}, we will give a more elaborate definition of what we mean by the "size" of $\mathcal{A}$.
A warning
The introduced notion of misspecification is a statistical one. From a modeller's point of view, a natural question to ask is to what extent misspecification affects the economic interpretation of the parameters of a fitted statistical model. Intuitively, a necessary condition for the sensibility of economic interpretation is to have a correctly specified statistical model. Because different economic models can give rise to the same statistical model, this condition is by no means sufficient. From this angle, a misspecified statistical model can easily invalidate any kind of economic interpretation of estimated parameters. This issue is more subtle and it would require an extensive treatment that we cannot deliver here, but it is worth keeping in mind the list of very strong assumptions that we are (implicitly) using when we give well-defined meaning to our parameter estimates. An interesting discussion can be found in Chapter 4 of White (1994).
Breiman, Leo (1969). Probability and Stochastic Processes: With a View Towards Applications. Houghton Mifflin.
Wald, Abraham (1950). Statistical Decision Functions. John Wiley and Sons, New York.
Manski, Charles (1988). Analog Estimation Methods in Econometrics. Chapman and Hall, London.
White, Halbert (1994). Estimation, Inference and Specification Analysis (Econometric Society Monographs). Cambridge University Press.