<?xml version="1.0"?>
<!-- name="generator" content="blosxom/2.0" -->
<!DOCTYPE rss PUBLIC "-//Netscape Communications//DTD RSS 0.91//EN" "http://my.netscape.com/publish/formats/rss-0.91.dtd">

<rss version="0.91">
  <channel>
    <title>Notebooks</title>
    <link>http://bactra.org/notebooks</link>
    <description>Cosma's Notebooks</description>
    <language>en</language>

  <item>
    <title>Frequentist Consistency of Bayesian Procedures</title>
    <link>http://bactra.org/notebooks/2009/11/13#bayesian-consistency</link>
    <description>
&lt;P&gt;&quot;Bayesian consistency&quot; is usually taken to mean showing that, under Bayesian
updating, the posterior probability concentrates on the true model.  That is,
for every (measurable) set of hypotheses containing the truth, the posterior
probability goes to 1.  (In practice one shows that the posterior probability
of any set not containing the truth goes to zero.)  There is a basic result
here, due to Doob, which essentially says that the Bayesian learner is
consistent, except on a set of data of &lt;em&gt;prior&lt;/em&gt; probability zero.  That
is, the Bayesian is &lt;em&gt;subjectively&lt;/em&gt; certain they will converge on the
truth.  This is not as reassuring as one might wish, and showing Bayesian
consistency &lt;em&gt;under the true distribution&lt;/em&gt; is harder.  In fact, it
usually involves assumptions under which &lt;em&gt;non-Bayes&lt;/em&gt; procedures will
also converge.  These are things like the existence of very powerful consistent
hypothesis tests (an approach favored by Ghosal, van der Vaart, et al.,
supposedly going back to Le Cam), or, inspired
by &lt;a href=&quot;learning-theory.html&quot;&gt;learning theory&lt;/a&gt;, constraints on the
effective size of the hypothesis space which are gradually relaxed as the
sample size grows (as in Barron et al.).  If these assumptions do not hold, one
can construct situations in which Bayesian procedures are inconsistent.
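
&lt;P&gt;To make the notion of concentration concrete, here is a minimal sketch,
not taken from any of the papers below.  It assumes IID coin flips, a
hypothetical five-point grid of biases which contains the truth, and a uniform
prior; the names &lt;code&gt;posterior_over_grid&lt;/code&gt; and
&lt;code&gt;mass_on_truth&lt;/code&gt; are made up for the illustration.  With a
finite, well-specified hypothesis space like this, the posterior mass on the
true bias should head toward 1 as the sample grows, and none of the subtleties
of large hypothesis spaces arise.

&lt;pre&gt;
```python
import math
import random


def posterior_over_grid(data, grid, prior):
    """Posterior over a finite grid of Bernoulli biases, computed in log space."""
    heads = sum(data)
    tails = len(data) - heads
    log_post = [
        math.log(p0) + heads * math.log(theta) + tails * math.log(1.0 - theta)
        for theta, p0 in zip(grid, prior)
    ]
    top = max(log_post)  # subtract the max before exponentiating, for stability
    weights = [math.exp(lp - top) for lp in log_post]
    total = sum(weights)
    return [w / total for w in weights]


def mass_on_truth(n, true_theta=0.7, seed=0):
    """Posterior probability of the true bias after n simulated IID flips."""
    rng = random.Random(seed)
    grid = [0.1, 0.3, 0.5, 0.7, 0.9]  # assumed hypothesis space; must contain true_theta
    prior = [1.0 / len(grid)] * len(grid)  # assumed uniform prior
    data = rng.choices([1, 0], weights=[true_theta, 1.0 - true_theta], k=n)
    posterior = posterior_over_grid(data, grid, prior)
    return posterior[grid.index(true_theta)]


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, mass_on_truth(n))
```
&lt;/pre&gt;

&lt;P&gt;Dropping 0.7 from the grid turns this into a toy mis-specified problem, in
which the posterior can at best pile up on whichever grid point is closest to
the truth in relative entropy.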

&lt;P&gt;Concentration of the posterior around the truth is only a preliminary.  One
would also want to know that, say, the posterior mean converges, or even better
that the predictive distribution converges.  For many finite-dimensional
problems, what's called the &quot;Bernstein-von Mises theorem&quot; basically says that
the posterior mean and the maximum likelihood estimate converge on each other,
so if one is consistent the other will be too.  This breaks down for
infinite-dimensional problems.
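
&lt;P&gt;A conjugate toy case shows the finite-dimensional behavior.  The sketch
below assumes Bernoulli data with a Beta(a, b) prior, where everything is
available in closed form; the prior parameters, the true bias and the function
name &lt;code&gt;beta_bernoulli_summary&lt;/code&gt; are arbitrary choices for the
illustration.  This is only an instance of the phenomenon, not the theorem
itself.  The gap between the posterior mean and the MLE is at most
max(a, b)/(a + b + n), while the posterior standard deviation shrinks only like
one over the square root of n, so the gap becomes negligible on the scale of
the posterior's own width.

&lt;pre&gt;
```python
import math
import random


def beta_bernoulli_summary(n, true_theta=0.3, a=2.0, b=5.0, seed=1):
    """Return (MLE, posterior mean, posterior sd) for n simulated coin flips.

    The Beta(a, b) prior and true_theta are assumptions made for illustration.
    """
    rng = random.Random(seed)
    heads = sum(rng.choices([1, 0], weights=[true_theta, 1.0 - true_theta], k=n))
    mle = heads / n
    # Conjugate update: Beta(a, b) prior becomes Beta(a + heads, b + tails).
    post_a = a + heads
    post_b = b + (n - heads)
    total = post_a + post_b
    post_mean = post_a / total
    post_sd = math.sqrt(post_a * post_b / (total ** 2 * (total + 1.0)))
    return mle, post_mean, post_sd


if __name__ == "__main__":
    for n in (10, 100, 1000, 10000):
        mle, post_mean, post_sd = beta_bernoulli_summary(n)
        # Last column: gap between the two estimates, in posterior sd units.
        print(n, mle, post_mean, abs(mle - post_mean) / post_sd)
```
&lt;/pre&gt;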

&lt;P&gt;(PAC-Bayesian results don't fit into this picture particularly neatly.
Essentially, they say that if you find a set of classifiers which all classify
correctly in-sample, and ask about the average out-of-sample performance, the
bounds on the latter are tighter for big sets than for small ones.  This is for
the unmysterious reason that it takes a bigger coincidence for &lt;em&gt;many&lt;/em&gt;
bad classification rules to happen to &lt;em&gt;all&lt;/em&gt; work on the training data
than for a &lt;em&gt;few&lt;/em&gt; bad rules to get lucky.  The actual Bayesian machinery
of posterior updating doesn't really come into play.)
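
&lt;P&gt;The dependence on the size of the set shows up in a back-of-the-envelope
calculation.  The sketch below uses a simplified Occam-style bound for rules
which fit the training data perfectly, (ln(1/P(U)) + ln(1/delta))/m, where P(U)
is the prior mass of the set U, m is the sample size and delta is the allowed
failure probability.  This form is a simplification assumed for illustration;
the theorems in the PAC-Bayesian papers listed below carry additional
lower-order terms, and the function name &lt;code&gt;occam_style_bound&lt;/code&gt; is
hypothetical.

&lt;pre&gt;
```python
import math


def occam_style_bound(prior_mass, m, delta=0.05):
    """Simplified bound (ln(1/P(U)) + ln(1/delta)) / m on average error.

    Illustrative shape only: published PAC-Bayesian theorems have extra terms.
    """
    return (math.log(1.0 / prior_mass) + math.log(1.0 / delta)) / m


if __name__ == "__main__":
    # Larger prior mass (a bigger set of consistent rules) gives a tighter bound.
    for prior_mass in (1e-6, 1e-3, 0.1):
        print(prior_mass, occam_style_bound(prior_mass, m=1000))
```
&lt;/pre&gt;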

&lt;P&gt;I believe I have contributed a Result to this area, on what happens when the
data are dependent and all the models are mis-specified, but some are more
mis-specified than others.

&lt;P&gt;&lt;em&gt;Query&lt;/em&gt;: are there any situations where Bayesian methods are
consistent but no non-Bayesian method is?  (My recollection is that John
Earman, in &lt;cite&gt;Bayes or Bust&lt;/cite&gt;, provides a negative answer, but I forget
how.)

&lt;ul&gt;Recommended:
	&lt;li&gt;Andrew Barron, Mark J. Schervish and Larry Wasserman, &quot;The
Consistency of Posterior Distributions in Nonparametric Problems&quot;, &lt;cite&gt;Annals
of Statistics&lt;/cite&gt; &lt;strong&gt;27&lt;/strong&gt; (1999): 536--561 [While I am biased
&amp;mdash; Mark and Larry are senior faculty here &amp;mdash; I think this is
definitely one of the best-written papers on the topic.]
	&lt;li&gt;Robert H. Berk [Old but quite nice papers on the
effect of mis-specification, though with IID data assumed, and stronger
assumptions about the models than modern writers are comfortable with.]
		&lt;ul&gt;
		&lt;li&gt;&quot;Limiting Behavior of Posterior Distributions when the Model is Incorrect&quot;, &lt;a href=&quot;http://projecteuclid.org/euclid.aoms/1177699597&quot;&gt;&lt;cite&gt;Annals of Mathematical Statistics&lt;/cite&gt; &lt;strong&gt;37&lt;/strong&gt; (1966):
51--58&lt;/a&gt; [see also the &lt;a href=&quot;http://projecteuclid.org/euclid.aoms/1177699477&quot;&gt;correction&lt;/a&gt;]
		&lt;li&gt;&quot;Consistency a Posteriori&quot;, &lt;a href=&quot;http://projecteuclid.org/euclid.aoms/1177696967&quot;&gt;&lt;cite&gt;Annals of Mathematical Statistics&lt;/cite&gt; &lt;strong&gt;41&lt;/strong&gt; (1970): 894--906&lt;/a&gt;
		&lt;/ul&gt;
	&lt;li&gt;Taeryon Choi, R. V. Ramamoorthi, &quot;Remarks on consistency of posterior distributions&quot;, &lt;a href=&quot;http://arxiv.org/abs/0805.3248&quot;&gt;arxiv:0805.3248&lt;/a&gt;
	&lt;li&gt;Ronald Christensen, &quot;Inconsistent Bayesian Estimation&quot;,
&lt;a href=&quot;http://ba.stat.cmu.edu/abstracts/Christensen.php&quot;&gt;&lt;cite&gt;Bayesian Analysis&lt;/cite&gt; &lt;strong&gt;4&lt;/strong&gt; (2009): 413--416&lt;/a&gt; [An extremely simple example of how inconsistency can be generated]
	&lt;li&gt;Persi Diaconis and David
Freedman, &lt;a href=&quot;http://dx.doi.org/10.1214/aos/1176349830&quot;&gt;&quot;On the
Consistency of Bayes Estimates&quot;, &lt;cite&gt;The Annals of
Statistics&lt;/cite&gt; &lt;strong&gt;14&lt;/strong&gt; (1986): 1--26&lt;/a&gt; [With accompanying
discussion; the latter is worth reading if only to fully savor the academic
snark in Diaconis and Freedman's reply.]
	&lt;li&gt;David Freedman, &quot;On the Bernstein-von Mises Theorem with
Infinite-Dimensional Parameters&quot;, &lt;cite&gt;Annals of
Statistics&lt;/cite&gt; &lt;strong&gt;27&lt;/strong&gt; (1999): 1119--1140 [As you know, Bob, the
Bernstein-von Mises theorem asserts that, &quot;under the usual conditions&quot;, in the
large sample limit the distribution of the maximum likelihood estimate is
basically the same as the Bayesian posterior distribution, so you can take
credible intervals as approximate confidence intervals and vice versa.  It
turns out that the usual conditions can fail drastically even for very simple
infinite-dimensional problems.]
	&lt;li&gt;Subhashis Ghosal, &quot;A review of consistency and convergence rates of
posterior distribution&quot;
[&lt;a href=&quot;http://www4.stat.ncsu.edu/~sghosal/papers/bhurev.pdf&quot;&gt;PDF&lt;/a&gt;]
	&lt;li&gt;Subhashis Ghosal, Jayanta K. Ghosh and R. V. Ramamoorthi,
&quot;Consistency Issues in Bayesian Nonparametrics&quot; [Review of the IID case, on
Ghosal's website someplace]
	&lt;li&gt;Subhashis Ghosal, Jayanta K. Ghosh and Aad W. van der Vaart,
&quot;Convergence Rates of Posterior
Distributions&quot;, &lt;a href=&quot;http://dx.doi.org/10.1214/aos/1016218228&quot;&gt;&lt;cite&gt;Annals
of Statistics&lt;/cite&gt; &lt;strong&gt;28&lt;/strong&gt; (2000): 500--531&lt;/a&gt;
	&lt;li&gt;Subhashis Ghosal and Yongqiang Tang, &quot;Bayesian Consistency for
Markov Processes&quot;, &lt;cite&gt;Sankhya&lt;/cite&gt; &lt;strong&gt;68&lt;/strong&gt; (2006): 227--239
[This is slick, but I think the cuteness of the proof of the main theorem is
achieved at the cost of the ugliness of verifying the main conditions, as in
their example.  (That may just be jealousy
speaking.)  &lt;a
href=&quot;http://www4.stat.ncsu.edu/~sghosal/papers/Markov_consistency.pdf&quot;&gt;PDF&lt;/a&gt;]
	&lt;li&gt;Subhashis Ghosal and Aad van der Vaart, &quot;Convergence Rates of
Posterior Distributions for Non-IID
Observations&quot;, &lt;a
href=&quot;http://dx.doi.org/10.1214/009053606000001172&quot;&gt;&lt;cite&gt;Annals of
Statistics&lt;/cite&gt; &lt;strong&gt;35&lt;/strong&gt; (2007): 192--223&lt;/a&gt;
	&lt;li&gt;J. K. Ghosh and R. V. Ramamoorthi, &lt;cite&gt;Bayesian
Nonparametrics&lt;/cite&gt; [&lt;a href=&quot;../weblog/algae-2009-10.html#ghosh-ramamoorthi&quot;&gt;Mini-review&lt;/a&gt;]
	&lt;li&gt;B. J. K. Kleijn and A. W. van der Vaart, &quot;Misspecification in
infinite-dimensional Bayesian
statistics&quot;, &lt;a
href=&quot;http://dx.doi.org/10.1214/009053606000000029&quot;&gt;&lt;cite&gt;Annals of
Statistics&lt;/cite&gt; &lt;strong&gt;34&lt;/strong&gt; (2006): 837--877&lt;/a&gt;
	&lt;li&gt;Antonio Lijoi, Igor Pr&amp;uuml;nster and Stephen G. Walker, &quot;Bayesian
Consistency for Stationary
Models&quot;, &lt;a
href=&quot;http://dx.doi.org/10.1017/S0266466607070314&quot;&gt;&lt;cite&gt;Econometric
Theory&lt;/cite&gt; &lt;strong&gt;23&lt;/strong&gt; (2007): 749--759&lt;/a&gt; [Gives a Doob-style
result, that the &lt;em&gt;prior&lt;/em&gt; probability of failing to converge is zero.]
	&lt;li&gt;David A. McAllester, &quot;Some PAC-Bayesian
Theorems&quot;, &lt;a href=&quot;http://dx.doi.org/10.1023/A:1007618624809&quot;&gt;&lt;cite&gt;Machine
Learning&lt;/cite&gt; &lt;strong&gt;37&lt;/strong&gt; (1999): 355--363&lt;/a&gt;
	&lt;li&gt;Lorraine Schwartz, &quot;On Bayes
Procedures&quot;, &lt;a
href=&quot;http://dx.doi.org/10.1007/BF00535479&quot;&gt;&lt;cite&gt;Z. Wahrsch. Verw. Gebiete&lt;/cite&gt; &lt;strong&gt;4&lt;/strong&gt;
(1965): 10--26&lt;/a&gt; [The journal now known as &lt;cite&gt;Probability Theory and Related
Fields&lt;/cite&gt;]
	&lt;li&gt;X. Shen and Larry Wasserman, &quot;Rates of convergence of posterior
distributions&quot;, &lt;a href=&quot;http://dx.doi.org/10.1214/aos/1009210686&quot;&gt;&lt;cite&gt;Annals of Statistics&lt;/cite&gt; &lt;strong&gt;29&lt;/strong&gt; (2001): 687--714&lt;/a&gt;
	&lt;li&gt;Stephen Walker, &quot;New Approaches to Bayesian Consistency&quot;, 
&lt;a href=&quot;http://dx.doi.org/10.1214/009053604000000409&quot;&gt;&lt;cite&gt;Annals of Statistics&lt;/cite&gt; &lt;strong&gt;32&lt;/strong&gt; (2004): 2028--2043&lt;/a&gt; = &lt;a
href=&quot;http://arxiv.org/abs/math.ST/0503672&quot;&gt;math.ST/0503672&lt;/a&gt; [Clever
martingale tricks.]
	&lt;li&gt;Yang Xing, &quot;Convergence rates of posterior distributions for
observations without the iid
structure&quot;, &lt;a href=&quot;http://arxiv.org/abs/0811.4677&quot;&gt;arxiv:0811.4677&lt;/a&gt;
	&lt;li&gt;Yang Xing and Bo Ranneby, &quot;Both necessary and sufficient conditions
for Bayesian exponential
consistency&quot;, &lt;a href=&quot;http://arxiv.org/abs/0812.1084&quot;&gt;arxiv:0812.1084&lt;/a&gt;
[Essentially, a unifying presentation of several existing conditions for IID
samples.]
	&lt;li&gt;&lt;a href=&quot;http://stat.rutgers.edu/~tzhang/&quot;&gt;Tong Zhang&lt;/a&gt;,
&quot;From $\epsilon$-entropy to KL-entropy: Analysis of minimum information complexity density estimation&quot;, &lt;cite&gt;Annals of Statistics&lt;/cite&gt; &lt;strong&gt;34&lt;/strong&gt; (2006): 2180--2210 = &lt;a href=&quot;http://arxiv.org/abs/math.ST/0702653&quot;&gt;arxiv:math.ST/0702653&lt;/a&gt;
	&lt;/ul&gt;

&lt;ul&gt;Modesty forbids me to recommend:
	&lt;li&gt;CRS, &quot;Dynamics of Bayesian Updating with Dependent Data and
Mis-specified
Models&quot;, &lt;a href=&quot;http://arxiv.org/abs/0901.1342&quot;&gt;arxiv:0901.1342&lt;/a&gt;
= &lt;a href=&quot;http://dx.doi.org/10.1214/09-EJS485&quot;&gt;&lt;cite&gt;Electronic Journal of
Statistics&lt;/cite&gt; &lt;strong&gt;3&lt;/strong&gt; (2009): 1039--1074&lt;/a&gt;
	&lt;/ul&gt;

&lt;ul&gt;To read:
	&lt;li&gt;Dennis D. Cox, &quot;An Analysis of Bayesian Inference for Nonparametric
Regression&quot;, &lt;cite&gt;Annals of Statistics&lt;/cite&gt; &lt;strong&gt;21&lt;/strong&gt; (1993):
903--923
	&lt;li&gt;J. L. Doob, &quot;Application of the theory of martingales&quot;, pp. 23--27
in &lt;cite&gt;Colloques Internationaux du Centre National de la Recherche
Scientifique&lt;/cite&gt;, no. 13, Centre National de la Recherche Scientifique,
Paris, 1949 [Summary
in &lt;a href=&quot;http://www.ams.org/mathscinet-getitem?mr=33460&quot;&gt;&lt;cite&gt;Mathematical
Reviews&lt;/cite&gt;&lt;/a&gt; by William Feller]
	&lt;li&gt;Subhashis Ghosal, J&amp;uuml;ri Lember and Aad van der Vaart, &quot;Nonparametric Bayesian model selection and averaging&quot;, &lt;a href=&quot;http://dx.doi.org/10.1214/07-EJS090&quot;&gt;&lt;cite&gt;Electronic Journal of Statistics&lt;/cite&gt; &lt;strong&gt;2&lt;/strong&gt; (2008): 63--89&lt;/a&gt;
	&lt;li&gt;Marcus Hutter, &quot;Exact Non-Parametric Bayesian Inference on Infinite Trees&quot;, &lt;a href=&quot;http://arxiv.org/abs/0903.5342&quot;&gt;arxiv:0903.5342&lt;/a&gt;
	&lt;li&gt;John Langford, &quot;Tutorial on Practical Prediction Theory for
Classification&quot;, &lt;a href=&quot;http://jmlr.csail.mit.edu/papers/v6/langford05a.html&quot;&gt;&lt;cite&gt;Journal of Machine Learning Research&lt;/cite&gt;
&lt;strong&gt;6&lt;/strong&gt; (2005): 273--306&lt;/a&gt; [For the PAC-Bayesian result]
	&lt;li&gt;Lucien Le Cam, &quot;On the Speed of Convergence of Posterior
Distributions&quot;
[&lt;a
href=&quot;http://stat-www.berkeley.edu/users/rice/LeCam/papers/posterior.pdf&quot;&gt;PDF&lt;/a&gt;]
	&lt;li&gt;David A. McAllester, &quot;PAC-Bayesian Stochastic Model
Selection&quot;, &lt;a href=&quot;http://dx.doi.org/10.1023/A:1021840411064&quot;&gt;&lt;cite&gt;Machine
Learning&lt;/cite&gt; &lt;strong&gt;51&lt;/strong&gt; (2003): 5--21&lt;/a&gt;
	&lt;/ul&gt;

&lt;ul&gt;To write:
	&lt;li&gt;CRS, &quot;Bayesian Learning, Information Theory, and Evolutionary
Search&quot;
	&lt;/ul&gt;
</description>
  </item>
  </channel>
</rss>