Estimating Entropies and Informations
02 May 2008 10:39
The central mathematical objects in information theory are the entropies of random variables. These ("Shannon") entropies are properties of the probability distributions of the variables, rather than of particular realizations. (This is unlike the Boltzmann entropy of statistical mechanics, which is an objective property of the microscopic state, at least once we fix our partition of the former into macroscopic states. Confusing the two entropies is common but bad.) The question which concerns me here is how to estimate the entropy of the distribution, given a sample of realizations. The most obvious approach, if one knows the form of the distribution but not the parameters, is to estimate the parameters and then plug in. But it feels like one should be able to do this more non-parametrically. The obvious non-parametric estimator is of course just the entropy of the empirical distribution, what one might call the empirical entropy. However, the empirical distribution isn't always the best estimate of the true distribution (one might perfer, e.g., some kind of kernel density estimate). For that matter, we often don't really care about the distribution, just its entropy, so some more direct estimator would be nice.
What would be really nice would be to not just have point estimates but also confidence intervals. Non-parametrically, my guess is that the only feasible way to do this is bootstrapping.
For finite alphabets, one approach would be to use something like variable length Markov chains, or causal state reconstruction, to reconstruct a get machine capable of generating the sequence. From the machine, it is easy to calculate the entropy of words or blocks of any finite length, and even the entropy rate. My experience with using CSSR is that the entropy rate estimates can get very good even when the over-all reconstruction of the structure is very poor, but I don't have any real theory on that. I suspect CSSR converges on the true entropy rate faster than do variable length Markov chains, because the former has greater expressive power, but again I don't know that for sure.
Using gzip is a bad idea (for this purpose; it works fine for data compression).
See also: Bootstrapping Entropy Estimates
- Recommended:
- Jose M. Amigo, Janusz Szczepanski, Elek Wajnryb and Maria V. Sanchez-Vives, "Estimating the Entropy Rate of Spike Trains via Lempel-Ziv Complexity", Neural Computation 16 (2004): 717--736 [Normally, I have strong views on using Lempel-Ziv to measure entropy rates, but here they are using the 1976 Lempel-Ziv definitions, not the 1978 ones. The difference is subtle, but important; 1978 leads to gzip and practical compression algorithms, but very bad entropy estimates; 1976 leads, as they show numerically, to quite good entropy rate estimates, at least for some processes. Thanks to Dr. Szczepanski for correspondence about this paper.]
- J.-R. Chazottes and D. Gabrielli, "Large deviations for empirical entropies of Gibbsian sources", math.PR/0406083 = Nonlinearity 18 (2005): 2545--2563 [This is a very cool result which shows that block entropies, and entropy rates estimated from those blocks, obey the large deviation principle even as one lets the length of the blocks grow with the amount of data, provided the block-length doesn't grow too quickly (only logarithmically). I wish I could write papers like this.]
- John W. Fisher III, Alexander T. Ihler and Paula A. Viola, "Learning Informative Statistics: A Nonparametric Approach", pp. 900--906 in NIPS 12 (1999) [PDF reprint. I'd call this more of a semi-parametric approach than a fully non-parametric one; they assume a parametric form for the dependence structure, but are agnostic about the distributions of innovations, and so try to maximize non-parametrically estimated mutual informations.]
- Alexander Kraskov, Harald Stögbauer and Peter Grassberger, "Estimating Mutual Information", cond-mat/0305641 = Physical Review E 69 (2004): 066138
- Liam Paninski, "Estimation of Entropy and Mutual Information", Neural Computation 15 (2003): 1191--1253 [Preprint; code]
- Thomas Schuermann and Peter Grassberger, "Entropy estimation of symbol sequences," Chaos 6 (1996): 414--427 = cond-mat/0203436
- Jonathan D. Victor, "Asymptotic Bias in Information Estimates and the Exponential (Bell) Polynomials", Neural Computation 12 (2000): 2797--2804 [Calculates the bias in the empirical entropy, as an estimator of the true entropy, under IID sampling of a discrete space. Interestingly, the first-order (1/n) term in the bias does not depend on the actual distribution, though the higher-order terms do.]
- Benjamin Weiss, Single Orbit Dynamics [Discusses procedures for non-parametrically estimating entropy of suitably ergodic sources, using just one realization of the process.]
- To read:
- Juan A. Bonachela, Haye Hinrichsen, Miguel A. Munoz, "Entropy estimates of small data sets", arxiv:0804.4561
- H. Cai, S. R. Kulkarni and S. Verdu, "Universal Entropy Estimation via Block Sorting", IEEE Transactions on Information Theory 50 (2004): 1551--1561
- C. J. Cellucci, A. M. Albano and P. E. Rapp, "Statistical validation of mutual information calculations: Comparison of alternative numerical algorithms", Physical Review E 71 (2005): 066208 [From the abstract: "[A] minimum description length argument is used to determine the optimal number of elements to use when characterizing the distributions of X and Y. However, even when using partitions of the X and Y axis indicated by minimum description length, mutual information calculations performed with a uniform partition of the XY plane can give misleading results. This motivated the construction of an algorithm for calculating mutual information that uses an adaptive partition. This algorithm also incorporates an explicit test of the statistical independence of X and Y in a calculation that returns an assessment of the corresponding null hypothesis. The previously published Fraser-Swinney algorithm for calculating mutual information includes a sophisticated procedure for local adaptive control of the partitioning process. When the Fraser and Swinney algorithm and the algorithm constructed here are compared, they give very similar numerical results (less than 4% difference in a typical application). Detailed comparisons are possible when X and Y are correlated jointly Gaussian distributed because an analytic expression for I(X,Y) can be derived for that case. Based on these tests, three conclusions can be drawn. First, the algorithm constructed here has an advantage over the Fraser-Swinney algorithm in providing an explicit calculation of the probability of the null hypothesis that X and Y are independent. Second, the Fraser-Swinney algorithm is marginally the more accurate of the two algorithms when large data sets are used. With smaller data sets, however, the Fraser-Swinney algorithm reports structures that disappear when more data are available. Third, the algorithm constructed here requires about 0.5% of the computation time required by the Fraser-Swinney algorithm."]
- J.-R. Chazottes and E. Uglade, "Entropy estimation and fluctuations of Hitting and Recurrence Times for Gibbsian sources", math.DS/0401093
- Tommy W. S. Chow and D. Huang, "Estimating Optimal Feature Subsets Using Efficient Estimation of High-Dimensional Mutual Information", IEEE Transactions on Neural Networks 16 (2005): 213--224
- J. M. Finn, J. D. Goettee, Z. Toroczkai, M. Anghel and B. P. Wood, "Estimation of entropies and dimensions by nonlinear symbolic time series analysis", Chaos 13 (2003): 444--456
- Yun Gao, Ioannis Kontoyiannis, Elie Bienenstock, "From the entropy to the statistical structure of spike trains", arxiv:0710.4117
- Peter Grassberger, "Data Compression and Entropy Estimates by Non-sequential Recursive Pair Substitution," physics/0207023 [On Jimenez-Montano, Ebeling and Poeschel]
- Yongmiao Hong and Halbert White, "Asymptotic Distribution Theory for Nonparametric Entropy Measures of Serial Dependence", Econometrica 73 (2005): 837--901
- Miguel Angel Jimenez-Montano, Werner Ebeling, and Thorsten Poeschel, "SYNTAX: A computer program to compress a sequence and to estimate its information content," cond-mat/0204134
- Alexei Kaltchenko, "Algorithms for Estimating Information Distance with Applications to Bioinformatics and Linguistics", cs.CC/0404039
- Matthew B. Kennel, Jonathon Shlens, Henry D. I. Abarbanel and E. J. Chichilnisky, "Estimating Entropy Rates with Bayesian Confidence Intervals", Neural Computation 17 (2005): 1531--1576
- Shiraj Khan, Sharba Bandyopadhyay, Auroop R. Ganguly, Sunil Saigal, David J. Erickson, III, Vladimir Protopopescu, and George Ostrouchov, "Relative performance of mutual information estimation methods for quantifying the dependence among short and noisy data", Physical Review E 76 (2007): 026209
- Ioannis Kontoyiannis, P. H. Algoet, Yu. M. Suhov and A. J. Wyner, "Nonparametric Entropy Estimation for Stationary Processes and Random Fields, with Applications to English Text"
- Christophe Letellier, "Estimating the Shannon Entropy: Recurrence Plots versus Symbolic Dynamics", Physical Review Letters 96 (2006): 254102
- Johan Lim, "Estimation of the Entropy Functional from Dependent Samples", Communications in Statistics: Theory and Methods 36 (2007): 1577--1589
- Ilya Nemenman, "Inference of entropies of discrete random variables with unknown cardinalities," physics/0207009
- Ilya Nemenman, William Bialek and Rob de Ruyter van Steveninck, "Entropy and information in neural spike trains: Progress on the sampling problem", 0306063
- Liam Paninski, "Estimating Entropy on m Bins Given Fewer Than m Samples", IEEE Transactions on Information Theory 50 (2004): 2200--2203
- G. Pola, R. S. Petersen, A. Thiele, M. P. Young and S. Panzeri, "Data-Robust Tight Lower Bounds to the Information Carried by Spike Times of a Neuronal Population", Neural Computation 17 (2005): 1962--2005
- Thomas Schürmann
- "Bias Analysis in Entropy Estimates", cond-mat/0403192
- "Scaling behaviour of entropy estimates," cond-mat/0203409
