<?xml version="1.0"?>
<!-- name="generator" content="blosxom/2.0" -->
<!DOCTYPE rss PUBLIC "-//Netscape Communications//DTD RSS 0.91//EN" "http://my.netscape.com/publish/formats/rss-0.91.dtd">

<rss version="0.91">
  <channel>
    <title>Notebooks</title>
    <link>http://bactra.org/notebooks</link>
    <description>Cosma's Notebooks</description>
    <language>en</language>

  <item>
    <title>Deviation Inequalities in Probability Theory</title>
    <link>http://bactra.org/notebooks/2011/07/06#deviation-inequalities</link>
    <description>
&lt;P&gt;The laws of large numbers say that averages taken over large samples
converge on expectation values.  But these are asymptotic statements which say
nothing about what happens for samples of any particular size.  A deviation
inequality, by contrast, is a result which says that, for realizations of
such-and-such a stochastic process, the sample value of this functional
deviates by so much from its typical value with no more than a certain
probability: Pr(|&lt;i&gt;f&lt;/i&gt;(&lt;i&gt;X&lt;/i&gt;&lt;sub&gt;1&lt;/sub&gt;, &lt;i&gt;X&lt;/i&gt;&lt;sub&gt;2&lt;/sub&gt;,
... &lt;i&gt;X&lt;/i&gt;&lt;sub&gt;&lt;i&gt;n&lt;/i&gt;&lt;/sub&gt;) - E[&lt;i&gt;f&lt;/i&gt;]| &gt; &lt;i&gt;h&lt;/i&gt;)
&lt; &lt;i&gt;r&lt;/i&gt;(&lt;i&gt;n&lt;/i&gt;,&lt;i&gt;h&lt;/i&gt;,&lt;i&gt;f&lt;/i&gt;), where the rate function &lt;i&gt;r&lt;/i&gt; has to
be given explicitly, and may depend on the true joint distribution of
the &lt;i&gt;X&lt;/i&gt;&lt;sub&gt;&lt;i&gt;i&lt;/i&gt;&lt;/sub&gt; (though it's more useful if it doesn't depend
on that very much).  (And of course one could compare to the median rather than
the mean, or just look at fluctuations above the typical value rather than to
either side, or get within a certain factor of the typical value rather than a
certain distance, etc.)  The rate should be decreasing in &lt;i&gt;h&lt;/i&gt; and
in &lt;i&gt;n&lt;/i&gt;.

&lt;P&gt;An elementary example is &quot;Markov's inequality&quot;: If &lt;i&gt;X&lt;/i&gt; is a
non-negative random variable with a finite mean, then
&lt;br&gt;
Pr(&lt;i&gt;X&lt;/i&gt; &gt; &lt;i&gt;h&lt;/i&gt;) &lt; E[&lt;i&gt;X&lt;/i&gt;]/&lt;i&gt;h&lt;/i&gt;.
&lt;br&gt;One can derive many other deviation inequalities from
Markov's inequality by taking &lt;i&gt;X&lt;/i&gt; = &lt;i&gt;g&lt;/i&gt;(&lt;i&gt;Y&lt;/i&gt;), where &lt;i&gt;Y&lt;/i&gt; is
another random variable and &lt;i&gt;g&lt;/i&gt; is some suitable non-negative-valued
function.
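
&lt;P&gt;A quick numerical sanity check of Markov's inequality, as a minimal
sketch.  The exponential distribution with mean 1, the sample size, the seed
and the function names are all assumptions made for illustration.  Because the
bound below is computed from the sample mean, it is guaranteed to hold (weakly)
for the empirical frequencies; the point is only to see how loose it typically
is.
&lt;pre&gt;
```python
import random

def markov_bound(mean, h):
    """Markov's bound E[X]/h on Pr(X > h), for a non-negative X."""
    return mean / h

def empirical_tail(samples, h):
    """Fraction of the samples that strictly exceed h."""
    return sum(1 for x in samples if x > h) / len(samples)

rng = random.Random(0)  # fixed seed, so the run is reproducible
# Assumption for illustration: X is exponential with mean 1.
samples = [rng.expovariate(1.0) for _ in range(100_000)]
sample_mean = sum(samples) / len(samples)

# Print the deviation h, the observed tail frequency, and the bound.
for h in (1.0, 2.0, 5.0):
    print(h, empirical_tail(samples, h), markov_bound(sample_mean, h))
```
&lt;/pre&gt;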

&lt;P&gt;For instance if &lt;i&gt;Y&lt;/i&gt; has a finite variance &lt;i&gt;v&lt;/i&gt;, then
&lt;br&gt;
Pr(|&lt;i&gt;Y&lt;/i&gt;-E[&lt;i&gt;Y&lt;/i&gt;]| &gt; &lt;i&gt;h&lt;/i&gt;) &lt; &lt;i&gt;v&lt;/i&gt;/&lt;i&gt;h&lt;/i&gt;&lt;sup&gt;2&lt;/sup&gt;.
&lt;br&gt;This is
known as &quot;Chebyshev's inequality&quot;.  (Exercise: derive Chebyshev's inequality
from Markov's inequality.  Since Markov was in fact Chebyshev's student, it
would seem that the logical order here reverses the historical one, though
guessing at priority from eponyms is always hazardous.)  Suppose
that &lt;i&gt;X&lt;/i&gt;&lt;sub&gt;1&lt;/sub&gt;, &lt;i&gt;X&lt;/i&gt;&lt;sub&gt;2&lt;/sub&gt;, ... are random variables with
a common mean &lt;i&gt;m&lt;/i&gt; and variance &lt;i&gt;v&lt;/i&gt;, and &lt;i&gt;Y&lt;/i&gt; is the average of
the first &lt;i&gt;n&lt;/i&gt; of these.  Then Chebyshev's inequality tells us that
&lt;br&gt;
Pr(|&lt;i&gt;Y&lt;/i&gt; - &lt;i&gt;m&lt;/i&gt;| &gt; &lt;i&gt;h&lt;/i&gt;) &lt; Var[&lt;i&gt;Y&lt;/i&gt;]/&lt;i&gt;h&lt;/i&gt;&lt;sup&gt;2&lt;/sup&gt;.
&lt;br&gt;
If
the &lt;i&gt;X&lt;/i&gt;&lt;sub&gt;&lt;i&gt;i&lt;/i&gt;&lt;/sub&gt; are uncorrelated (e.g., independent), then
Var[&lt;i&gt;Y&lt;/i&gt;] = &lt;i&gt;v&lt;/i&gt;/&lt;i&gt;n&lt;/i&gt;, so the probability that the sample average
differs from the expectation by &lt;i&gt;h&lt;/i&gt; or more goes to zero, no matter how
small we make &lt;i&gt;h&lt;/i&gt;.  This is precisely the weak law of large numbers.  If
the &lt;i&gt;X&lt;/i&gt;&lt;sub&gt;&lt;i&gt;i&lt;/i&gt;&lt;/sub&gt; are correlated, but nonetheless Var[&lt;i&gt;Y&lt;/i&gt;]
goes to zero as &lt;i&gt;n&lt;/i&gt; grows (generally because correlations decay), then we
get an &lt;a href=&quot;ergodic-theorem.html&quot;&gt;ergodic theorem&lt;/a&gt;.  The rate of
convergence here however is not very good, just O(&lt;i&gt;n&lt;/i&gt;&lt;sup&gt;-1&lt;/sup&gt;).
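
&lt;P&gt;The O(&lt;i&gt;n&lt;/i&gt;&lt;sup&gt;-1&lt;/sup&gt;) rate can be seen in a small simulation.
This is only a sketch under stated assumptions: the &lt;i&gt;X&lt;/i&gt;&lt;sub&gt;&lt;i&gt;i&lt;/i&gt;&lt;/sub&gt;
are taken to be independent and uniform on [0,1] (so &lt;i&gt;m&lt;/i&gt; = 1/2
and &lt;i&gt;v&lt;/i&gt; = 1/12), and the values of &lt;i&gt;h&lt;/i&gt;, the number of trials, the
seed and the function names are chosen purely for illustration.
&lt;pre&gt;
```python
import random

def chebyshev_bound(v, n, h):
    """Chebyshev's bound v/(n h^2) on Pr(|Y - m| > h), Y the mean of n uncorrelated terms."""
    return v / (n * h * h)

def deviation_frequency(n, h, trials, rng):
    """Fraction of trials where the mean of n Uniform(0,1) draws misses 1/2 by more than h."""
    misses = 0
    for _ in range(trials):
        y = sum(rng.random() for _ in range(n)) / n
        if abs(y - 0.5) > h:
            misses += 1
    return misses / trials

rng = random.Random(1)  # fixed seed, so the run is reproducible
v = 1.0 / 12.0          # variance of a Uniform(0,1) variable
h = 0.05

# Print n, the observed deviation frequency, and the Chebyshev bound.
for n in (10, 100, 1000):
    print(n, deviation_frequency(n, h, 2000, rng), chebyshev_bound(v, n, h))
```
&lt;/pre&gt;
For small &lt;i&gt;n&lt;/i&gt; the bound can exceed 1, in which case it says nothing at all.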

&lt;P&gt;Since &lt;i&gt;e&lt;/i&gt;&lt;sup&gt;&lt;i&gt;u&lt;/i&gt;&lt;/sup&gt; is a monotonically increasing function of
&lt;i&gt;u&lt;/i&gt;, for any positive &lt;i&gt;t&lt;/i&gt;, &lt;i&gt;X&lt;/i&gt; &gt; &lt;i&gt;h&lt;/i&gt; if and only
if &lt;i&gt;e&lt;/i&gt;&lt;sup&gt;&lt;i&gt;tX&lt;/i&gt;&lt;/sup&gt; &gt; &lt;i&gt;e&lt;/i&gt;&lt;sup&gt;&lt;i&gt;th&lt;/i&gt;&lt;/sup&gt;, so we get an
exponential inequality,
&lt;br&gt;
Pr(&lt;i&gt;X&lt;/i&gt; &gt; &lt;i&gt;h&lt;/i&gt;) &lt; &lt;i&gt;e&lt;/i&gt;&lt;sup&gt;-&lt;i&gt;th&lt;/i&gt;&lt;/sup&gt;
E[&lt;i&gt;e&lt;/i&gt;&lt;sup&gt;&lt;i&gt;tX&lt;/i&gt;&lt;/sup&gt;].
&lt;br&gt;Notice that the first factor in the bound does
not depend on the distribution of &lt;i&gt;X&lt;/i&gt;, while the second factor does, but
doesn't depend on the scale of the deviation &lt;i&gt;h&lt;/i&gt;.  We are in fact free to
pick whichever &lt;i&gt;t&lt;/i&gt; gives us the tightest bound.  The quantity
E[&lt;i&gt;e&lt;/i&gt;&lt;sup&gt;&lt;i&gt;tX&lt;/i&gt;&lt;/sup&gt;] is called the &quot;moment generating function&quot;
of &lt;i&gt;X&lt;/i&gt;; let's abbreviate it &lt;i&gt;M&lt;/i&gt;&lt;sub&gt;&lt;i&gt;X&lt;/i&gt;&lt;/sub&gt;(&lt;i&gt;t&lt;/i&gt;).  It can fail to
exist if some moments are infinite.  (Write the power series
for &lt;i&gt;e&lt;/i&gt;&lt;sup&gt;&lt;i&gt;u&lt;/i&gt;&lt;/sup&gt; and take expectations term by term to see
this.)  It has however the very nice property that when &lt;i&gt;X&lt;/i&gt;&lt;sub&gt;1&lt;/sub&gt;
and &lt;i&gt;X&lt;/i&gt;&lt;sub&gt;2&lt;/sub&gt; are independent, &lt;i&gt;M&lt;/i&gt;&lt;sub&gt;&lt;i&gt;X&lt;/i&gt;&lt;sub&gt;1&lt;/sub&gt;
+ &lt;i&gt;X&lt;/i&gt;&lt;sub&gt;2&lt;/sub&gt;&lt;/sub&gt;(&lt;i&gt;t&lt;/i&gt;)
= &lt;i&gt;M&lt;/i&gt;&lt;sub&gt;&lt;i&gt;X&lt;/i&gt;&lt;sub&gt;1&lt;/sub&gt;&lt;/sub&gt;(&lt;i&gt;t&lt;/i&gt;) &lt;i&gt;M&lt;/i&gt;&lt;sub&gt;&lt;i&gt;X&lt;/i&gt;&lt;sub&gt;2&lt;/sub&gt;&lt;/sub&gt;(&lt;i&gt;t&lt;/i&gt;).  From this it follows that if &lt;i&gt;Y&lt;/i&gt; is the sum
of &lt;i&gt;n&lt;/i&gt; independent and identically distributed copies of &lt;i&gt;X&lt;/i&gt;,
&lt;i&gt;M&lt;/i&gt;&lt;sub&gt;&lt;i&gt;Y&lt;/i&gt;&lt;/sub&gt;(&lt;i&gt;t&lt;/i&gt;) =
(&lt;i&gt;M&lt;/i&gt;&lt;sub&gt;&lt;i&gt;X&lt;/i&gt;&lt;/sub&gt;(&lt;i&gt;t&lt;/i&gt;))&lt;sup&gt;&lt;i&gt;n&lt;/i&gt;&lt;/sup&gt;.  Thus
&lt;br&gt;
Pr(&lt;i&gt;Y&lt;/i&gt;
&gt; &lt;i&gt;h&lt;/i&gt;)
&lt; &lt;i&gt;e&lt;/i&gt;&lt;sup&gt;-&lt;i&gt;th&lt;/i&gt;&lt;/sup&gt; (&lt;i&gt;M&lt;/i&gt;&lt;sub&gt;&lt;i&gt;X&lt;/i&gt;&lt;/sub&gt;(&lt;i&gt;t&lt;/i&gt;))&lt;sup&gt;&lt;i&gt;n&lt;/i&gt;&lt;/sup&gt;.
&lt;br&gt;
If &lt;i&gt;Z&lt;/i&gt; = &lt;i&gt;Y&lt;/i&gt;/&lt;i&gt;n&lt;/i&gt;, the sample mean, this in turn gives
&lt;br&gt;
Pr(&lt;i&gt;Z&lt;/i&gt; &gt; &lt;i&gt;h&lt;/i&gt;) = Pr(&lt;i&gt;Y&lt;/i&gt; &gt; &lt;i&gt;hn&lt;/i&gt;)
&lt; &lt;i&gt;e&lt;/i&gt;&lt;sup&gt;-&lt;i&gt;thn&lt;/i&gt;&lt;/sup&gt; (&lt;i&gt;M&lt;/i&gt;&lt;sub&gt;&lt;i&gt;X&lt;/i&gt;&lt;/sub&gt;(&lt;i&gt;t&lt;/i&gt;))&lt;sup&gt;&lt;i&gt;n&lt;/i&gt;&lt;/sup&gt;.
&lt;br&gt;
So we can get exponential rates of convergence for the law of large numbers
from this.  [Students who took the CMU statistics department's probability
qualifying exam in 2010 now know who wrote problem 9.]  Again, the restriction
to IID random variables is not really essential; allowing dependence just means
that the moment generating functions don't factor exactly, but if they almost
factor then we can get results of the same form.  (Often, we end up
with &lt;i&gt;n&lt;/i&gt; being replaced by &lt;i&gt;n&lt;/i&gt;/&lt;i&gt;n&lt;/i&gt;&lt;sub&gt;0&lt;/sub&gt;,
where &lt;i&gt;n&lt;/i&gt;&lt;sub&gt;0&lt;/sub&gt; is something like how long it takes dependence to
decay to trivial levels.)

&lt;P&gt;I don't feel like going into the reasoning behind the other common deviation
bounds &amp;mdash; Bernstein, Chernoff, Hoeffding, Azuma, McDiarmid, etc. &amp;mdash;
because I feel like I've given enough of the flavor already.  I am using this
notebook as, actually, a notebook, more specifically a place to collect
references on deviation inequalities, especially ones that apply to collections
of dependent random variables.  Results here typically appeal to various
notions of mixing or decay of correlations, as found
in &lt;a href=&quot;ergodic-theory.html&quot;&gt;ergodic theory&lt;/a&gt;.

&lt;P&gt;See also:
	&lt;a href=&quot;concentration-of-measure.html&quot;&gt;Concentration of Measure&lt;/a&gt;;
	&lt;a href=&quot;ergodic-theory.html&quot;&gt;Ergodic Theory&lt;/a&gt;;
	&lt;a href=&quot;large-deviations.html&quot;&gt;Large Deviations&lt;/a&gt;;
	&lt;a href=&quot;learning-theory.html&quot;&gt;Learning Theory&lt;/a&gt;;
	&lt;a href=&quot;probability.html&quot;&gt;Probability&lt;/a&gt;

&lt;ul&gt;Recommended:
	&lt;li&gt;G. G. Bosco, F. P. Machado and Thomas Logan Ritchie, &quot;Exponential
Rates of Convergence in the Ergodic Theorem: A Constructive
Approach&quot;, &lt;a href=&quot;http://dx.doi.org/10.1007/s10955-010-9945-4&quot;&gt;&lt;cite&gt;Journal
of Statistical Physics&lt;/cite&gt;
&lt;strong&gt;139&lt;/strong&gt; (2010): 367--374&lt;/a&gt;
	&lt;li&gt;&lt;a href=&quot;http://www.kyb.mpg.de/~bousquet/&quot;&gt;Olivier Bousquet&lt;/a&gt;,
&lt;a href=&quot;http://www.lri.fr/~bouchero/&quot;&gt;St&amp;eacute;phane Boucheron&lt;/a&gt;
and &lt;a href=&quot;http://www.econ.upf.es/~lugosi/&quot;&gt;G&amp;aacute;bor Lugosi&lt;/a&gt;,
&quot;Introduction to Statistical Learning Theory&quot;
[Gives a very nice review of many deviation inequalities, with references.]
	&lt;li&gt;Nicol&amp;ograve; Cesa-Bianchi and G&amp;aacute;bor Lugosi, &lt;cite&gt;Prediction, Learning,
and Games&lt;/cite&gt; [Provides exemplary proofs in the appendix.  &lt;a href=&quot;../weblog/algae-2008-07.html#prediction&quot;&gt;Mini-review&lt;/a&gt;]
	&lt;li&gt;Iosif Pinelis, &quot;Between Chebyshev and Cantelli&quot;, &lt;a href=&quot;http://arxiv.org/abs/1011.6065&quot;&gt;arxiv:1011.6065&lt;/a&gt; [Cute, with an appealing proof]
	&lt;li&gt;Yongqiang Tang, &quot;A Hoeffding-Type Inequality for Ergodic Time
Series&quot;, &lt;a href=&quot;http://dx.doi.org/10.1007/s10959-007-0057-2&quot;&gt;&lt;cite&gt;Journal of
Theoretical Probability&lt;/cite&gt; &lt;strong&gt;20&lt;/strong&gt; (2007): 167--176&lt;/a&gt;
[&lt;a href=&quot;http://www4.stat.ncsu.edu/~sghosal/papers/Tang.pdf&quot;&gt;PDF preprint&lt;/a&gt;]
	&lt;/ul&gt;

&lt;ul&gt;To read:
	&lt;li&gt;Olivier Catoni
		&lt;ul&gt;
		&lt;li&gt;&quot;High confidence estimates of the mean of heavy-tailed real random variables&quot;, &lt;a href=&quot;http://arxiv.org/abs/0909.5366&quot;&gt;arxiv:0909.5366&lt;/a&gt;
		&lt;li&gt;&quot;Challenging the empirical mean and empirical variance: a deviation study&quot;, &lt;a href=&quot;http://arxiv.org/abs/1009.2048&quot;&gt;arxiv:1009.2048&lt;/a&gt;
		&lt;/ul&gt;
	&lt;li&gt;Patrick Cattiaux and Arnaud Guillin, &quot;Deviation bounds for additive
functionals of Markov processes&quot;, &lt;a
href=&quot;http://arxiv.org/abs/math.PR/0603021&quot;&gt;math.PR/0603021&lt;/a&gt;
[Non-asymptotic bounds for the probability that time averages deviate from
expectations with respect to the invariant measure, when the process is
stationary and ergodic and the invariant measure is reasonably regular.]
	&lt;li&gt;Marian Grendar Jr. and Marian Grendar, &quot;Chernoff's bound forms,&quot; &lt;a
href=&quot;http://arxiv.org/abs/math.PR/0306326&quot;&gt;math.PR/0306326&lt;/a&gt;
	&lt;li&gt;Vladislav Kargin, &quot;A large deviation inequality for vector functions on finite reversible Markov Chains&quot;, &lt;cite&gt;Annals of Applied Probability&lt;/cite&gt;
&lt;strong&gt;17&lt;/strong&gt; (2007): 1202--1221, &lt;a href=&quot;http://arxiv.org/abs/0508538&quot;&gt;arxiv:0508538&lt;/a&gt;
	&lt;li&gt;Carlos A. Leon and Francois Perron, &quot;Optimal Hoeffding bounds for
discrete reversible Markov
chains&quot;, &lt;a href=&quot;http://arxiv.org/abs/math.PR/0405296&quot;&gt;math.PR/0405296&lt;/a&gt;
	&lt;li&gt;Dasha Loukianova, Oleg Loukianov, Eva Loecherbach, &quot;Polynomial
bounds in the Ergodic Theorem for positive recurrent one-dimensional diffusions
and integrability of hitting
times&quot;, &lt;a href=&quot;http://arxiv.org/abs/0903.2405&quot;&gt;arxiv:0903.2405&lt;/a&gt;
[&lt;em&gt;non-asymptotic&lt;/em&gt; deviation bounds from bounds on moments of
recurrence times]
	&lt;li&gt;P. Major
		&lt;ul&gt;
		&lt;li&gt;&quot;On a multivariate version of Bernstein's inequality&quot;,
&lt;a href=&quot;http://arxiv.org/abs/math.PR/0411287&quot;&gt;math.PR/0411287&lt;/a&gt;
		&lt;li&gt;&quot;A multivariate generalization of Hoeffding's
inequality&quot;, &lt;a href=&quot;http://arxiv.org/abs/math.PR/0411288&quot;&gt;math.PR/0411288&lt;/a&gt;
		&lt;/ul&gt;
	&lt;li&gt;Ted Theodosopoulos, &quot;A Reversion of the Chernoff Bound&quot;, &lt;a
href=&quot;http://arxiv.org/abs/math.PR/0501360&quot;&gt;math.PR/0501360&lt;/a&gt;
	&lt;/ul&gt;

&lt;P&gt;(Thanks to Michael Kalgalenko for typo-spotting)
</description>
  </item>
  </channel>
</rss>
