Estimating Entropies and Informations
10 Oct 2011 10:34
The central mathematical objects in information theory are the entropies of random variables. These ("Shannon") entropies are properties of the probability distributions of the variables, rather than of particular realizations. (This is unlike the Boltzmann entropy of statistical mechanics, which is an objective property of the microscopic state, at least once we fix our partition of the former into macroscopic states. Confusing the two entropies is common but bad.) The question which concerns me here is how to estimate the entropy of the distribution, given a sample of realizations. The most obvious approach, if one knows the form of the distribution but not the parameters, is to estimate the parameters and then plug in. But it feels like one should be able to do this more non-parametrically. The obvious non-parametric estimator is of course just the entropy of the empirical distribution, what one might call the empirical entropy. However, the empirical distribution isn't always the best estimate of the true distribution (one might perfer, e.g., some kind of kernel density estimate). For that matter, we often don't really care about the distribution, just its entropy, so some more direct estimator would be nice.
What would be really nice would be to not just have point estimates but also confidence intervals. Non-parametrically, my guess is that the only feasible way to do this is bootstrapping.
For finite alphabets, one approach would be to use something like variable length Markov chains, or causal state reconstruction, to reconstruct a get machine capable of generating the sequence. From the machine, it is easy to calculate the entropy of words or blocks of any finite length, and even the entropy rate. My experience with using CSSR is that the entropy rate estimates can get very good even when the over-all reconstruction of the structure is very poor, but I don't have any real theory on that. I suspect CSSR converges on the true entropy rate faster than do variable length Markov chains, because the former has greater expressive power, but again I don't know that for sure.
Using gzip is a bad idea (for this purpose; it works fine for data compression).
See also: Bootstrapping Entropy Estimates
- Recommended:
- Jose M. Amigo, Janusz Szczepanski, Elek Wajnryb and Maria V. Sanchez-Vives, "Estimating the Entropy Rate of Spike Trains via Lempel-Ziv Complexity", Neural Computation 16 (2004): 717--736 [Normally, I have strong views on using Lempel-Ziv to measure entropy rates, but here they are using the 1976 Lempel-Ziv definitions, not the 1978 ones. The difference is subtle, but important; 1978 leads to gzip and practical compression algorithms, but very bad entropy estimates; 1976 leads, as they show numerically, to quite good entropy rate estimates, at least for some processes. Thanks to Dr. Szczepanski for correspondence about this paper.]
- J.-R. Chazottes and D. Gabrielli, "Large deviations for empirical entropies of Gibbsian sources", math.PR/0406083 = Nonlinearity 18 (2005): 2545--2563 [This is a very cool result which shows that block entropies, and entropy rates estimated from those blocks, obey the large deviation principle even as one lets the length of the blocks grow with the amount of data, provided the block-length doesn't grow too quickly (only logarithmically). I wish I could write papers like this.]
- John W. Fisher III, Alexander T. Ihler and Paula A. Viola, "Learning Informative Statistics: A Nonparametric Approach", pp. 900--906 in NIPS 12 (1999) [PDF reprint. I'd call this more of a semi-parametric approach than a fully non-parametric one; they assume a parametric form for the dependence structure, but are agnostic about the distributions of innovations, and so try to maximize non-parametrically estimated mutual informations.]
- Yongmiao Hong and Halbert White, "Asymptotic Distribution Theory for Nonparametric Entropy Measures of Serial Dependence", Econometrica 73 (2005): 837--901 [JSTOR; PDF reprint via Prof. White]
- Matthew B. Kennel, Jonathon Shlens, Henry D. I. Abarbanel and E. J. Chichilnisky, "Estimating Entropy Rates with Bayesian Confidence Intervals", Neural Computation 17 (2005): 1531--1576
- Alexander Kraskov, Harald Stögbauer and Peter Grassberger, "Estimating Mutual Information", cond-mat/0305641 = Physical Review E 69 (2004): 066138
- Liam Paninski, "Estimation of Entropy and Mutual Information", Neural Computation 15 (2003): 1191--1253 [Preprint; code]
- Thomas Schuermann and Peter Grassberger, "Entropy estimation of symbol sequences," Chaos 6 (1996): 414--427 = cond-mat/0203436
- Jonathon Shlens, Matthew B. Kennel, Henry D. I. Abarbanel, E. J. Chichilnisky, "Estimating Information Rates with Confidence Intervals in Neural Spike Trains", Neural Computation 19 (2007): 1683--1719
- Jonathan D. Victor, "Asymptotic Bias in Information Estimates and the Exponential (Bell) Polynomials", Neural Computation 12 (2000): 2797--2804 [Calculates the bias in the empirical entropy, as an estimator of the true entropy, under IID sampling of a discrete space. Interestingly, the first-order (1/n) term in the bias does not depend on the actual distribution, though the higher-order terms do.]
- Benjamin Weiss, Single Orbit Dynamics [Discusses procedures for non-parametrically estimating entropy of suitably ergodic sources, using just one realization of the process.]
- To read:
- M. S. Baptista, E. J. Ngamga, Paulo R. F. Pinto, Margarida Brito, J. Kurths, "Kolmogorov-Sinai entropy from recurrence times", arxiv:0908.3401
- D. Benedetto, E. Caglioti, G. Cristadoro, M. Degli Esposti, "Relative entropy via non-sequential recursive pair substitutions", arxiv:1007.3384 [The first two authors were also the lead authors of the epic-fail "Language Trees and Zipping" paper, arxiv:cond-mat/0108530, but perhaps they've improved.]
- Juan A. Bonachela, Haye Hinrichsen, Miguel A. Munoz, "Entropy estimates of small data sets", arxiv:0804.4561
- Salim Bouzebda and Issam Elhattab, "Uniform-in-bandwidth consistency for kernel-type estimators of Shannon's entropy", Electronic Journal of Statistics 5 (2011): 440--459
- H. Cai, S. R. Kulkarni and S. Verdu, "Universal Entropy Estimation via Block Sorting", IEEE Transactions on Information Theory 50 (2004): 1551--1561
- C. J. Cellucci, A. M. Albano and P. E. Rapp, "Statistical validation of mutual information calculations: Comparison of alternative numerical algorithms", Physical Review E 71 (2005): 066208 [From the abstract: "[A] minimum description length argument is used to determine the optimal number of elements to use when characterizing the distributions of X and Y. However, even when using partitions of the X and Y axis indicated by minimum description length, mutual information calculations performed with a uniform partition of the XY plane can give misleading results. This motivated the construction of an algorithm for calculating mutual information that uses an adaptive partition. This algorithm also incorporates an explicit test of the statistical independence of X and Y in a calculation that returns an assessment of the corresponding null hypothesis. The previously published Fraser-Swinney algorithm for calculating mutual information includes a sophisticated procedure for local adaptive control of the partitioning process. When the Fraser and Swinney algorithm and the algorithm constructed here are compared, they give very similar numerical results (less than 4% difference in a typical application). Detailed comparisons are possible when X and Y are correlated jointly Gaussian distributed because an analytic expression for I(X,Y) can be derived for that case. Based on these tests, three conclusions can be drawn. First, the algorithm constructed here has an advantage over the Fraser-Swinney algorithm in providing an explicit calculation of the probability of the null hypothesis that X and Y are independent. Second, the Fraser-Swinney algorithm is marginally the more accurate of the two algorithms when large data sets are used. With smaller data sets, however, the Fraser-Swinney algorithm reports structures that disappear when more data are available. Third, the algorithm constructed here requires about 0.5% of the computation time required by the Fraser-Swinney algorithm."]
- J.-R. Chazottes and E. Uglade, "Entropy estimation and fluctuations of Hitting and Recurrence Times for Gibbsian sources", math.DS/0401093
- Gabriela Ciuperca and Valerie Girardin, "Estimation of the Entropy Rate of a Countable Markov Chain", Communications in Statistics: Theory and Methods 36 (2007): 2543--2557
- G. Ciuperca, V. Girardin and L. Lhote, "Computation and Estimation of Generalized Entropy Rates for Denumerable Markov Chains", IEEE Transactions on Information Theory 57 (2011): 4026--4034 [The estimation is just plugging in the MLE of the parameters, for finitely-parametrized chains, but they claim to show that works well]
- J.-R. Chazottes, C. Maldonado, "Concentration bounds for entropy estimation of one-dimensional Gibbs measures", arxiv:1102.1816
- Tommy W. S. Chow and D. Huang, "Estimating Optimal Feature Subsets Using Efficient Estimation of High-Dimensional Mutual Information", IEEE Transactions on Neural Networks 16 (2005): 213--224
- Peter Clifford and Ioana Ada Cosma, "A simple sketching algorithm for entropy estimation", arxiv:0908.3961
- Marshall Crumiller, Bruce Knight, Yunguo Yu and Ehud Kaplan, "Estimating the amount of information conveyed by a population of neurons" [PDF preprint via Dr. Kaplan]
- J. M. Finn, J. D. Goettee, Z. Toroczkai, M. Anghel and B. P. Wood, "Estimation of entropies and dimensions by nonlinear symbolic time series analysis", Chaos 13 (2003): 444--456
- Yun Gao, Ioannis Kontoyiannis, Elie Bienenstock, "From the entropy to the statistical structure of spike trains", arxiv:0710.4117
- M. N. Goria, N. N. Leonenko, V. V. Mergel and Pl L. Novi Inverardi, "A new class of random vector entropy estimators and its applications in testing statistical hypotheses", Journal of Nonparametric Statistics 17 (2005): 277--297
- Peter Grassberger, "Data Compression and Entropy Estimates by Non-sequential Recursive Pair Substitution," physics/0207023 [On Jimenez-Montano, Ebeling and Poeschel]
- Jean Hausser and Korbinian Strimmer, "Entropy Inference and the James-Stein Estimator, with Application to Nonlinear Gene Association Networks", Journal of Machine Learning Research 10 (2009): 1469--1484
- Miguel Angel Jimenez-Montano, Werner Ebeling, and Thorsten Poeschel, "SYNTAX: A computer program to compress a sequence and to estimate its information content," cond-mat/0204134
- David Källberg, Nikolaj Leonenko, Oleg Seleznjev, "Statistical Inference for Rényi Entropy Functionals", arxiv:1103.4977
- Alexei Kaltchenko, "Algorithms for Estimating Information Distance with Applications to Bioinformatics and Linguistics", cs.CC/0404039
- Shiraj Khan, Sharba Bandyopadhyay, Auroop R. Ganguly, Sunil Saigal, David J. Erickson, III, Vladimir Protopopescu, and George Ostrouchov, "Relative performance of mutual information estimation methods for quantifying the dependence among short and noisy data", Physical Review E 76 (2007): 026209
- Ioannis Kontoyiannis, P. H. Algoet, Yu. M. Suhov and A. J. Wyner, "Nonparametric Entropy Estimation for Stationary Processes and Random Fields, with Applications to English Text"
- Nikolai Leonenko, Luc Pronzato and Vippal Savani, "A class of Rényi information estimators for multidimensional densities", Annals of Statistics 36 (2008): 2153--2182, arxiv:0810.5302
- Annick Lesne, Jean-Luc Blanc and Laurent Pezard, "Entropy estimation of very short symbolic sequences", Physical Review E 79 (2009): 046208
- Christophe Letellier, "Estimating the Shannon Entropy: Recurrence Plots versus Symbolic Dynamics", Physical Review Letters 96 (2006): 254102
- Johan Lim, "Estimation of the Entropy Functional from Dependent Samples", Communications in Statistics: Theory and Methods 36 (2007): 1577--1589
- Tiger W. Lin and George N. Reeke, "A Continuous Entropy Rate Estimator for Spike Trains Using a K-Means-Based Context Tree", Neural Computation 22 (2010): 998--1024
- Ilya Nemenman, "Inference of entropies of discrete random variables with unknown cardinalities," physics/0207009
- Ilya Nemenman, William Bialek and Rob de Ruyter van Steveninck, "Entropy and information in neural spike trains: Progress on the sampling problem", 0306063
- XuanLong Nguyen, Martin J. Wainwright, Michael I. Jordan, "Estimating divergence functionals and the likelihood ratio by convex risk minimization", arxiv:0809.0853
- Leandro Pardo, Statistical Inference Based on Divergence Measures
- Dávid Pál, Barnabás Póczos, Csaba Szepesvári, "Estimation of Rényi Entropy and Mutual Information Based on Generalized Nearest-Neighbor Graphs", arxiv:1003.1954
- Liam Paninski, "Estimating Entropy on m Bins Given Fewer Than m Samples", IEEE Transactions on Information Theory 50 (2004): 2200--2203
- Angeliki Papana and Dimitris Kugiumtzis, "Evaluation of Mutual Information Estimators for Time Series", arxiv:0904.4753
- G. Pola, R. S. Petersen, A. Thiele, M. P. Young and S. Panzeri, "Data-Robust Tight Lower Bounds to the Information Carried by Spike Times of a Neuronal Population", Neural Computation 17 (2005): 1962--2005
- Thomas Schürmann
- "Bias Analysis in Entropy Estimates", cond-mat/0403192
- "Scaling behaviour of entropy estimates," cond-mat/0203409
- Kumar Sricharan, Raviv Raich, Alfred O. Hero III, "Empirical estimation of entropy functionals with confidence", arxiv:1012.4188
- Taiji Suzuki, Masashi Sugiyama and Toshiyuki Tanaka, "Mutual information approximation via maximum likelihood estimation of desnity ratio", ISIT 2009 [PDF preprint via Prof. Sugiyama]
