Indirect inference is, I think, a really important methodological advance, one which opens the door to doing a lot of useful statistics on models of complex systems. However, Gouriéroux and Monfort write for a reader who is very familiar with theoretical statistics, in particular with concepts such as the likelihood and maximum likelihood estimation, Fisher information, the score, consistency and efficiency, and so forth, though no measure theory. (Say, Wasserman's All of Statistics.) No special knowledge of econometrics is really needed, though the last three chapters may seem under-motivated to those not committed to standard econometric models. All this being the case, in the rest of my review I will presume the reader has at least some recollection of the basic ideas of probability, expectation, etc.
Let me start by giving some concrete examples of what I mean by "what the
model predicts for different parameters". Typically, predictions will depend
not just on the parameters, \(\theta\), but also on some external or
"exogenous" variables, which the model doesn't attempt to predict, z.
Different methods of estimation can then be based on different predictions
about the "endogenous" variables y.
In the "generalized method of moments", one picks a number of functions of
the data y and the exogenous variables, say \(K_i(y,z)\),
with i here just being an index for these "generalized moments". One
would then calculate both the expected or predicted value of each moment (a
function of the parameter) and its observed sample average:
![\[
\mathbf{E}_{\theta}[K_i(y,z)] \equiv k_i(\theta,z)
\]](index_3.gif)
![\[
\frac{1}{T}\sum_{t}{K_i(y_t,z)} \equiv \hat{k}_i(z)
\]](index_4.gif)
The GMM estimate, \(\hat{\theta}_{GMM}\), is the value of \(\theta\)
which makes the expectations as close to the realization as possible. Provided
some law of large numbers or ergodic
theorem holds,
![\[
\hat{k}_i(z) \rightarrow k_i(\theta_0,z)
\]](index_7.gif)
where \(\theta_0\) is the true parameter value, so the estimator is
"consistent", i.e.,
![\[
\hat{\theta}_{GMM} \rightarrow \theta_0
\]](index_9.gif)
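A minimal sketch of the generalized method of moments may make this concrete. Everything specific here is an assumption for illustration, not something from the book: the model is an exponential distribution with rate \(\theta\) and no exogenous variables, the two moments are y and y squared (whose expectations are \(1/\theta\) and \(2/\theta^2\)), the distance is an unweighted sum of squares, and the minimization is a crude grid search.

```python
import random

def sample_moments(ys):
    # Empirical values of the two chosen moments, K_1 = y and K_2 = y^2
    T = len(ys)
    return (sum(ys) / T, sum(y * y for y in ys) / T)

def model_moments(theta):
    # Exponential(rate=theta): E[y] = 1/theta, E[y^2] = 2/theta^2
    return (1.0 / theta, 2.0 / theta ** 2)

def gmm_distance(theta, k_hat):
    # Unweighted squared distance between predicted and observed moments
    return sum((k - kh) ** 2 for k, kh in zip(model_moments(theta), k_hat))

def gmm_estimate(ys, grid):
    k_hat = sample_moments(ys)
    return min(grid, key=lambda th: gmm_distance(th, k_hat))

rng = random.Random(1)
theta0 = 2.0  # "true" parameter used to generate the toy data
data = [rng.expovariate(theta0) for _ in range(5000)]
grid = [0.5 + 0.01 * i for i in range(451)]  # candidate values 0.5 .. 5.0
theta_gmm = gmm_estimate(data, grid)
print(theta_gmm)
```

With more data the sample moments settle down on their expectations at the true parameter, so the minimizer should drift towards 2.0; a serious implementation would use a numerical optimizer and a weighting matrix rather than a grid and equal weights.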
The method of least squares works similarly. We assume that
![\[
\mathbf{E}_{\theta}[Y_t|y_1^{t-1},z] = f(y_1^{t-1},z;\theta)
\]](index_10.gif)
(where \(y_1^{t-1}\) means "all the observations from time 1 to time t-1"). The mean squared
prediction error at a given \(\theta\)
is then
![\[
\frac{1}{T}\sum_{t}{{\left(y_t - f(y_1^{t-1},z;\theta)\right)}^2}
\]](index_13.gif)
and the least squares estimate is the value of \(\theta\) which minimizes it. (This can be seen as a version of
the method of moments, with a different "moment" for each observation.)
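As a rough illustration of least squares in this setting, the sketch below assumes a particular predictor that is not taken from the book: a first-order autoregression, where the conditional mean is \(\theta\) times the previous observation and there are no exogenous variables. The function names and the grid search are likewise just choices made for the example.

```python
import random

def predict(prev, theta):
    # Assumed predictor f: the conditional mean depends only on the last value
    return theta * prev

def mean_squared_error(theta, ys):
    # Average squared one-step prediction error (the first point has no
    # predecessor, so the average runs over the remaining observations)
    errs = [(ys[t] - predict(ys[t - 1], theta)) ** 2 for t in range(1, len(ys))]
    return sum(errs) / len(errs)

rng = random.Random(5)
theta0 = 0.6  # "true" parameter used to generate the toy series
ys = [0.0]
for _ in range(2000):
    ys.append(theta0 * ys[-1] + rng.gauss(0, 1))

grid = [-0.95 + 0.01 * i for i in range(191)]  # candidate values -0.95 .. 0.95
theta_ls = min(grid, key=lambda th: mean_squared_error(th, ys))
print(theta_ls)
```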
Finally, the method of maximum likelihood asks "how often should we expect to see data like this, under this model?", and tries to maximize that probability:
![\[
L(\theta,z) = \sum_{t}{\log{p_{\theta}(y_t|y_1^{t-1},z)}}
\]](index_15.gif)
where \(p_{\theta}\) is the probability density. (Bayesian estimation is a likelihood-based method, in which the impact of facts
and experience is blunted and smoothed by prejudice.)
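One worked special case shows how the likelihood relates to least squares. Suppose, purely as an assumption for illustration, that the conditional density is Gaussian with mean \(f(y_1^{t-1},z;\theta)\) and a fixed variance \(\sigma^2\) that does not depend on \(\theta\). Then the log-likelihood is

```latex
\[
L(\theta,z) = -\frac{T}{2}\log(2\pi\sigma^2)
  - \frac{1}{2\sigma^2}\sum_{t}{{\left(y_t - f(y_1^{t-1},z;\theta)\right)}^2}
\]
```

The first term is constant in \(\theta\), so in this special case maximizing the likelihood is the same as minimizing the sum of squared prediction errors.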
Originally, all of these methods of estimation were practical only if one could derive a simple formula for the best-fitting parameter values as a function of the data. Later, with the rise of numerical optimization on cheap, fast computers, one could get away from needing an exact formula, provided it was possible to say precisely what the model predicted --- most often, what the likelihood function was.
This sounds like it ought to be easy, but there are many models which are very natural from a scientific view-point (because they nicely represent mechanisms we guess are at work) for which exact expressions for the likelihood, or indeed for other predictions, just are not available. In modeling dynamics, for example, if what we observe is not the full state of the system, but rather only part of it (and generally a part distorted by noise and nonlinearity at that), it becomes exceedingly difficult to calculate the probability of seeing a given sequence of observations. Or, again, if one's model is specified in terms of the behavior of large numbers of interacting entities (like molecules or economic agents), each possibly with an unobserved internal state, finding an exact likelihood function is pretty much hopeless. If we nonetheless want to connect our models to reality, and estimate parameters, what then should we do?
Gouriéroux and Monfort's answer turns on the fact that many interesting models can be simulated even when they can't be solved. That is, one can fairly quickly and cheaply "run them forward" to generate examples of the kind of behavior they say should happen, if necessary making many simulation runs to get many samples of the behavior they predict. One can then use those samples for estimation, and this in two ways, "direct" and "indirect".
The "direct" method of simulation-based inference is older and more
straightforward; just use the sample of simulation runs as an approximation to
the probability distribution generated by the model. In the formulas where one
would want to use the theoretical probabilities to calculate expectations,
likelihoods, etc., substitute the appropriate average over simulations. The
easiest way to see how this works is with the method of moments. The actual
expectations \(k_i(\theta,z)\) can be very hard to calculate
analytically. In the "method of simulated moments" (chapter 2), one doesn't
even try, but rather fixes \(\theta\) and runs the simulator S
times, each run being the same size as the data, giving simulated
values \(y^{(s,\theta)}_t\). One then treats the simulated mean,
![\[
\hat{k}^{S}_i(\theta,z) = \frac{1}{S}\sum_{s}{\frac{1}{T}\sum_{t}{K_i(y^{(s,\theta)}_t,z)}}
\]](index_20.gif)
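as though it were the exact expectation \(k_i(\theta,z)\). This adds some
simulation error to the estimate of \(\theta\), of course, but this error will shrink as the
number of simulation runs (S) grows. (Gouriéroux and Monfort
consider some clever tricks for re-using the same set of random number draws
for multiple values of \(\theta\), which reduces the computational load.) Some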
care is needed to preserve convergence to the truth as the data size
(T) grows, but this can still be arranged.
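The method of simulated moments can be sketched in a few lines. The simulator here is a stand-in invented for the example (a censored-normal outcome, y = max(0, \(\theta\) + noise), with no exogenous variables), as are the two moments, the equal weighting and the grid search. The one feature taken from the discussion above is the re-use of the same random draws for every candidate parameter value, which keeps the objective smooth in \(\theta\).

```python
import random

def simulate(theta, shocks):
    # Hypothetical simulator: censored-normal outcome y = max(0, theta + eps)
    return [max(0.0, theta + e) for e in shocks]

def moments(ys):
    # Time averages of the two chosen moments, K_1 = y and K_2 = y^2
    T = len(ys)
    return (sum(ys) / T, sum(y * y for y in ys) / T)

def simulated_moments(theta, shock_runs):
    # Average the moments over S simulation runs, each as long as the data
    per_run = [moments(simulate(theta, shocks)) for shocks in shock_runs]
    S = len(per_run)
    return tuple(sum(m[i] for m in per_run) / S for i in range(2))

def msm_estimate(data, shock_runs, grid):
    k_hat = moments(data)
    def distance(theta):
        k_sim = simulated_moments(theta, shock_runs)
        return sum((a - b) ** 2 for a, b in zip(k_sim, k_hat))
    return min(grid, key=distance)

rng = random.Random(7)
theta0, T, S = 1.0, 500, 5
data = simulate(theta0, [rng.gauss(0, 1) for _ in range(T)])
# Draw the simulation shocks once and re-use them for every theta
shock_runs = [[rng.gauss(0, 1) for _ in range(T)] for _ in range(S)]
grid = [-1.0 + 0.02 * i for i in range(201)]  # candidate values -1.0 .. 3.0
theta_msm = msm_estimate(data, shock_runs, grid)
print(theta_msm)
```

Note that the estimation code never evaluates the expected moments analytically; it only needs to be able to run `simulate`.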
The other classical estimation methods work similarly (chapter 3). If one can
draw from the predictive density, \(p_{\theta}(y_t|y_1^{t-1},z)\), then
the average of several such draws is an estimate of the conditional expectation,
\(f(y_1^{t-1},z;\theta)\), and can be used in the method
of simulated least squares. Only slightly more exotic,
if \(p_{\theta}\) can't itself be drawn from, but one
can generate a random variable whose expectation is equal to the
conditional density, one can then employ the method of simulated maximum
likelihood. Remarkably, this retains (approximately) many of the nice
properties of actual maximum likelihood estimation, at least if the number of
simulation runs is large enough compared to the data size.
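Here is one hedged sketch of simulated maximum likelihood. The model is made up for the purpose: each observation is \(\theta\) plus an unobserved exponential variable u plus standard Gaussian noise, with independent observations and no exogenous variables. For a given draw of u, the Gaussian density of y minus \(\theta\) minus u is a random variable whose expectation over u is the density of y, so averaging it over draws gives a simulated density to plug into the log-likelihood.

```python
import math
import random

def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def simulated_loglik(theta, data, latent_draws):
    total = 0.0
    for y, us in zip(data, latent_draws):
        # Averaging phi(y - theta - u) over draws of u is unbiased for p_theta(y)
        p_hat = sum(normal_pdf(y - theta - u) for u in us) / len(us)
        total += math.log(p_hat)
    return total

rng = random.Random(11)
theta0, T, S = 0.5, 300, 50
data = [theta0 + rng.expovariate(1.0) + rng.gauss(0, 1) for _ in range(T)]
# S latent draws per observation, fixed in advance and re-used for every theta
latent_draws = [[rng.expovariate(1.0) for _ in range(S)] for _ in range(T)]
grid = [-0.5 + 0.05 * i for i in range(41)]  # candidate values -0.5 .. 1.5
theta_sml = max(grid, key=lambda th: simulated_loglik(th, data, latent_draws))
print(theta_sml)
```

Because the logarithm of an average is not the average of logarithms, the simulated log-likelihood is biased for any finite S, which is why the number of simulation runs has to be large relative to the data size.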
The "principle of indirect inference" (ch. 4) is more subtle, and to me much
more exciting. In this approach, one introduces an "auxiliary" or
"instrumental" model, which is not in general expected to be correct, but is
supposed to be something which is easy to fit to the data. One then fits the
auxiliary model both to the data, getting auxiliary parameter
values \(\hat{\beta}\), and to simulations from the primary
model for various values of the latter's parameters, getting auxiliary
parameter values \(\hat{\beta}^{S}(\theta)\). The indirect
estimate of \(\theta\) is then the parameter setting where
\(\hat{\beta}^{S}(\theta)\) comes closest
to \(\hat{\beta}\). In effect, one is still comparing
the model's predictions to the data, but the prediction is now "what will the
auxiliary model look like?", rather than some more direct feature of the data.
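A toy version of indirect inference, under assumptions chosen only for the example: the primary model is a first-order moving average, y_t = e_t + \(\theta\) e_{t-1}, and the auxiliary model is a first-order autoregression fitted by least squares. The autoregression is the wrong model for such data, but it is trivial to fit. The function names, the single auxiliary parameter and the grid search are all illustrative choices.

```python
import random

def simulate_ma1(theta, shocks):
    # Primary model (hypothetical example): y_t = e_t + theta * e_{t-1}
    return [shocks[t] + theta * shocks[t - 1] for t in range(1, len(shocks))]

def fit_ar1(ys):
    # Auxiliary model: y_t = beta * y_{t-1} + noise, fitted by least squares
    num = sum(ys[t] * ys[t - 1] for t in range(1, len(ys)))
    den = sum(ys[t - 1] ** 2 for t in range(1, len(ys)))
    return num / den

def indirect_estimate(data, shock_runs, grid):
    beta_data = fit_ar1(data)  # auxiliary fit to the real data
    def distance(theta):
        # Auxiliary fits to simulations from the primary model at this theta
        betas = [fit_ar1(simulate_ma1(theta, sh)) for sh in shock_runs]
        beta_sim = sum(betas) / len(betas)
        return (beta_sim - beta_data) ** 2
    return min(grid, key=distance)

rng = random.Random(3)
theta0, T, S = 0.5, 2000, 5
data = simulate_ma1(theta0, [rng.gauss(0, 1) for _ in range(T + 1)])
# Shocks are drawn once and re-used across all candidate thetas
shock_runs = [[rng.gauss(0, 1) for _ in range(T + 1)] for _ in range(S)]
grid = [-0.9 + 0.02 * i for i in range(91)]  # candidate values -0.9 .. 0.9
theta_ii = indirect_estimate(data, shock_runs, grid)
print(theta_ii)
```

The primary model's likelihood is never written down; the only operations used are simulating from it and fitting the auxiliary regression.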
For this to work, there are essentially two requirements. The first
requirement is that, if we feed in larger and larger samples from the primary
model, with its parameters held at \(\theta\), then the estimates of the
auxiliary parameters will converge to some limit,
\(\hat{\beta}^{S}(\theta) \rightarrow b(\theta)\). The second requirement is
that \(b(\theta)\) be invertible. If these assumptions hold, the
indirect estimate will be consistent, that is, it will converge on the true
value of \(\theta\). (Gouriéroux and Monfort actually [p. 85]
prove consistency under a stronger set of assumptions, which entail these, but
these are the ones which actually do the work.) Under somewhat stronger
assumptions, they are also able to say something about the limiting
distribution of indirect estimates around the truth, and even to derive a
version of the Cramér-Rao
inequality.
The first assumption, convergence of auxiliary parameter estimates, is very weak, though not altogether trivial. The second assumption basically demands that the auxiliary model be rich enough to distinguish between different versions of the primary model. Typically, but not necessarily always, this will entail there being at least as many auxiliary parameters as there are primary ones, though these needn't correspond in any useful or comprehensible way. The distributional and Cramér-Rao-style results are of the kind one would expect: the indirect estimates will be more precise when the auxiliary parameters can be precisely estimated from the data, and when small differences in the auxiliary parameters correspond to large differences in the primary parameters.
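Both requirements can be checked by hand in a small case. The sketch below again assumes, for illustration only, a moving-average primary model y_t = e_t + \(\theta\) e_{t-1} with an autoregressive auxiliary model. There the limit of the auxiliary estimate is the lag-one autocorrelation, \(\theta/(1+\theta^2)\); the function name `binding` is a label of convenience for that limit.

```python
import math
import random

def binding(theta):
    # Lag-1 autocorrelation of an MA(1) process y_t = e_t + theta * e_{t-1}
    return theta / (1.0 + theta ** 2)

def fit_ar1(ys):
    # Least-squares slope of y_t on y_{t-1}
    num = sum(ys[t] * ys[t - 1] for t in range(1, len(ys)))
    den = sum(ys[t - 1] ** 2 for t in range(1, len(ys)))
    return num / den

def simulate_ma1(theta, T, rng):
    e = [rng.gauss(0, 1) for _ in range(T + 1)]
    return [e[t] + theta * e[t - 1] for t in range(1, T + 1)]

# Requirement 1: auxiliary estimates settle down as the sample grows
rng = random.Random(2)
for T in (200, 2000, 20000):
    print(T, round(fit_ar1(simulate_ma1(0.5, T, rng)), 3), binding(0.5))

# Requirement 2 fails on (0, infinity): theta and 1/theta share the same limit
same_limit = math.isclose(binding(0.5), binding(2.0))

# On [-1, 1] the limit is strictly increasing in theta, hence invertible there
grid = [-1.0 + 0.01 * i for i in range(201)]
increasing = all(binding(a) < binding(b) for a, b in zip(grid, grid[1:]))
print(same_limit, increasing)
```

So with this auxiliary model the primary parameter is only recoverable if it is restricted to a range such as [-1, 1]; without that restriction, no amount of data would tell 0.5 from 2.0 using this auxiliary fit alone.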
Chapters 5, 6 and 7 apply direct and indirect simulation inference to a range of popular models from econometrics, comparing the results to those of other estimation methods on both simulated and real-world data. Some of these are extremely impressive — in particular some of the results on complicated time-series models are simply astonishing — but these chapters will frankly be very hard going for anyone who has not seen these econometric models before. (Chapter 5, in particular, includes an awful lot on how to simulate discrete choice models.) Other applications will readily suggest themselves to any reader who has worked with simulation models.
x+174 pp., bibliography, line figures, index (spotty)
Economics / Probability and Statistics
In print as a hardback, ISBN 0-19-877475-3, US$85