Deviation Inequalities in Probability Theory
06 Jul 2011 17:46
The laws of large numbers say that averages taken over large samples converge on expectation values. But these are asymptotic statements which say nothing about what happens for samples of any particular size. A deviation inequality, by contrast, is a result which says that, for realizations of such-and-such a stochastic process, the sample value of this functional deviates by so much from its typical value with no more than a certain probability: Pr(|f(X1, X2, ... Xn) - E[f]| > h) < r(n,h,f), where the rate function r has to be given explicitly, and may depend on the true joint distribution of the Xi (though it's more useful if it doesn't depend on that very much). (And of course one could compare to the median rather than the mean, or just look at fluctuations above the typical value rather than to other side, or get within a certain factor of the typical value rather than a certain distance, etc.) The rate should be decreasing in h and in n.
An elementary example is "Markov's inequality": If X is a
non-negative random variable with a finite mean, then
Pr(X > h) < E[X]/h.
One can derive many other deviation inequalities from
Markov's inequality by taking X = g(Y), where Y is
another random variable and g is some suitable non-negative-valued
function.
For instance if Y has a finite variance v, then
Pr(|Y-E[Y]| > h) < v/h2.
This is
known as "Chebyshev's inequality". (Exercise: derive Chebyshev's inequality
from Markov's inequality. Since Markov was in fact Chebyshev's student, it
would seem that the logical order here reverses the historical one, though
guessing at priority from eponyms is always hazardous.) Suppose
that X1, X2, ... are random variables with
a common mean m and variance v, and Y is the average of
the first n of these. Then Chebyshev's inequality tells us that
Pr(|Y - m| > h) < Var[Y]/h2.
If
the Xi are uncorrelated (e.g., independent), then
Var[Y] = v/n, so the probability that the sample average
differs from the expectation by h or more goes to zero, no matter how
small we make h. This is precisely the weak law of large numbers. If
the Xi are correlated, but nonetheless Var[Y]
goes to zero as n grows (generally because correlations decay), then we
get an ergodic theorem. The rate of
convergence here however is not very good, just O(n-1).
Since eu is a monotonically increasing function of
u, for any positive t, X > h if and only
if etX > eth, so we get an
exponential inequality,
Pr(X > h) < e-th
E[etX].
Notice that the first term in the bound does
not depend on the distribution of X, unlike the second term, which
doesn't depend on the scale of the deviation h. We are in fact free to
pick whichever t gives us the tightest bound. The quantity
E[etX] is called the "moment generating function"
of X, let's abbreviate it MX(t), and can fail to
exist if some moments are infinite. (Write the power series
for eu and take expectations term by term to see
this.) It has however the very nice property that when X1
and X2 are independent, MX1
+ X2(t)
= MX1(t) MX2(t). From this it follows that if Y is the sum
of n independent and identically distributed copies of X,
MY(t) =
(MX(t))n. Thus
Pr(Y
> h)
< e-th MX(t))n.
If Z = Y/n, the sample mean, this in turn gives
Pr(Z > h)) = Pr(Y > hn)
< e-thn MX(t))n.
So we can get exponential rates of convergence for the law of large numbers
from this. [Students who took the CMU statistics department's probability
qualifying exam in 2010 now know who wrote problem 9.] Again, the restriction
to IID random variables is not really essential, allowing dependence just means
that the moment generating functions don't factor exactly, but if they almost
factor than we can get results of the same form. (Often, we end up
with n being replaced by n/n0,
where n0 is something like how long it takes dependence to
decay to trivial levels.)
I don't feel like going into the reasoning behind the other common deviation bounds — Bernstein, Chernoff, Hoeffding, Azuma, McDiarmid, etc. — because I feel like I've given enough of the flavor already. I am using this notebook as, actually, a notebook, more specifically a place to collect references on deviation inequalities, especially ones that apply to collections of dependent random variables. Results here typically appeal to various notions of mixing or decay of correlations, as found in ergodic theory.
See also: Concentration of Measure; Ergodic Theory; Large Deviations; Learning Theory; Probability
- Recommended:
- G. G. Bosco, F. P. Machado and Thomas Logan Ritchie, "Exponential Rates of Convergence in the Ergodic Theorem: A Constructive Approach", Journal of Statistical Physics 139 (2010): 367--374
- Olivier Bousquet, Stéphane Boucheron and Gábor Lugosi, "Introduction to Statistical Learning Theory" [Gives a very nice review of many deviation inequalities, with references.]
- Nicolo Cesa-Bianchi and Gabor Lugosi, Prediction, Learning, and Games [Provides exemplary proofs in the appendix. Mini-review]
- Iosif Pinelis, "Between Chebyshev and Cantelli", arxiv:1011.6065 [Cute, with an appealing proof]
- Yongqiang Tang, "A Hoeffding-Type Inequality for Ergodic Time Series", Journal of Theoretical Probability 20 (2007): 167--176 [PDF preprint]
- To read:
- Olivier Catoni
- "High confidence estimates of the mean of heavy-tailed real random variables", arxiv:0909.5366
- "Challenging the empirical mean and empirical variance: a deviation study", arxiv:1009.2048
- Patrick Cattiaux and Arnaud Guillin, "Deviation bounds for additive functionals of Markov process", math.PR/0603021 [Non-asymptotic bounds for the probability that time averages deviate from expectations with respect to the invariant measure, when the process is stationary and ergodic and the invariant measure is reasonably regular.]
- Marian Grendar Jr. and Marian Grendar, "Chernoff's bound forms," math.PR/0306326
- Vladislav Kargin, "A large deviation inequality for vector functions on finite reversible Markov Chains", Annals of Applied Probability 17 (2007): 1202--1221, arxiv:0508538
- Carlos A. Leon and Francois Perron, "Optimal Hoeffding bounds for discrete reversible Markov chains", math.PR/0405296
- Dasha Loukianova, Oleg Loukianov, Eva Loecherbach, "Polynomial bounds in the Ergodic Theorem for positive recurrent one-dimensional diffusions and integrability of hitting times", arxiv:0903.2405 [non-asymptotic deviation bounds from bounds on moments of recurrence times]
- P. Major
- "On a multivariate version of Bernstein's inequality", math.PR/0411287
- "A multivariate generalization of Hoeffding's ineqality", math.PR/0411288
- Ted Theodosopoulos, "A Reversion of the Chernoff Bound", math.PR/0501360
(Thanks to Michael Kalgalenko for typo-spotting)
