Ensemble Methods in Machine Learning
15 Apr 2012 17:34
Boosting, bagging, binning, stacking, mixtures of experts, ...
I have an idea about how to use model averaging to cope with non-stationary time series forecasting, but I need to find time to work on it.
Value of diversity.
See also: Collective Cognition; Learning Theory; Model Selection
- Recommended (totally inadequate; just what happened to come to mind while cleaning up my files):
- Pierre Alquier and Olivier Wintenberger, "Model selection and randomization for weakly dependent time series forecasting", arxiv:0902.2924
- Sanjeev Arora, Elad Hazan and Satyen Kale, "The Multiplicative Weights Update Method: a Meta Algorithm and Applications" [PDF preprint. This is an interesting kind of result, which promises performance that comes close to that achieved by any strategy within a fixed class, no matter what sequence of data is observed. But it's performance on that sequence, which, as the saying goes, "is no guarantee of future results". Cesa-Bianchi and Lugosi's book has a lot more along these lines.]
- Peter Bühlmann and Sara van de Geer, Statistics for High-Dimensional Data: Methods, Theory and Applications [For the extensive treatment of boosting. Mini-review]
- Nicolo Cesa-Bianchi and Gabor Lugosi, Prediction, Learning, and Games [Mini-review]
- Gerda Claeskens and Nils Lid Hjort, Model Selection and Model Averaging [Review: How Can You Pick Just One?]
- Pedro Domingos, "The Role of Occam's Razor in Knowledge Discovery," Data Mining and Knowledge Discovery, 3 (1999) [Online. Ensemble methods as an apparent violation of Occam's Razor.]
- Bruce E. Hansen
- "Least Squares Model Averaging", Econometrica 75 (2007): 1175--1189 [Reprint via Prof. Hansen]
- "Least Squares Forecast Averaging", Journal of Econometrics 146 (2008): 342--350 [Reprint via Prof. Hansen]
- Elad Hazan and Satyen Kale, "Extracting certainty from uncertainty: regret bounded by variation in costs", Machine Learning 80 (2010): 165--188
- Robert Kleinberg, Alexandru Niculescu-Mizil, Yogeshwer Sharma, "Regret Bounds for Sleeping Experts and Bandits", Machine Learning 80 (2010): 245--272
- J. Zico Kolter and Marcus A. Maloof, "Dynamic Weighted Majority: An Ensemble Method for Drifting Concepts", Journal of Machine Learning Research 8 (2007): 2755--2790
- A. Juditsky, P. Rigollet, A. B. Tsybakov, "Learning by mirror averaging", arxiv:math/0511468 = Annals of Statistics 36 (2008): 2183--2206
- G. Langer and U. Parlitz, "Modeling parameter dependence from time series", Physical Review E 70 (2004): 056217 [Interesting use of ensemble methods in state space modeling]
- Laurence K. Saul and Michael I. Jordan, "Mixed Memory Markov Models: Decomposing Complex Stochastic Processes as Mixtures of Simpler Ones", Machine Learning 37 (1999): 75--87
- Robert E. Schapire, Yoav Freund, Peter Bartlett and Wee Sun Lee, "Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods", Annals of Statistics 26 (1998): 1651--1686
- Kyupil Yeon, Moon Sup Song, Yongdai Kim, Hosik Choi, Cheolwoo Park, "Model averaging via penalized regression for tracking concept drift", Journal of Computational and Graphical Statistics online before print (2010)
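To make the kind of guarantee mentioned under Arora, Hazan and Kale concrete, here is a minimal sketch of the exponential-weights ("Hedge") forecaster, one standard instance of the multiplicative-weights idea. It is not taken from any of the papers above; the function name `hedge`, the learning rate `eta`, and the synthetic losses are my own illustrative assumptions. With losses in [0, 1] and `eta = sqrt(8 ln(n) / T)`, the textbook bound says the forecaster's cumulative expected loss exceeds that of the best single expert by at most `sqrt(T ln(n) / 2)`, whatever the sequence.

```python
import math
import random


def hedge(losses, eta):
    """Run exponential weights over a sequence of per-expert losses.

    losses: list of rounds; each round is a list of losses in [0, 1],
            one per expert (hypothetical input format).
    eta:    learning rate (> 0).
    Returns (forecaster's cumulative expected loss,
             list of each expert's cumulative loss).
    """
    n = len(losses[0])
    weights = [1.0] * n          # start with uniform weights
    alg_loss = 0.0
    expert_loss = [0.0] * n
    for round_losses in losses:
        total = sum(weights)
        probs = [w / total for w in weights]
        # Expected loss of playing expert i with probability probs[i].
        alg_loss += sum(p * l for p, l in zip(probs, round_losses))
        for i, l in enumerate(round_losses):
            expert_loss[i] += l
            # Multiplicative update: penalize experts that did badly.
            weights[i] *= math.exp(-eta * l)
    return alg_loss, expert_loss


# Illustrative run on synthetic (assumed) data: 3 experts, 1000 rounds.
T, n = 1000, 3
rng = random.Random(0)
losses = [[rng.random() for _ in range(n)] for _ in range(T)]
eta = math.sqrt(8 * math.log(n) / T)
alg, experts = hedge(losses, eta)
regret = alg - min(experts)
bound = math.sqrt(T * math.log(n) / 2)
print(f"regret = {regret:.3f}, bound = {bound:.3f}")
```

Note that the bound is purely retrospective: it compares the forecaster to the best expert on the sequence actually seen, and by itself says nothing about what happens on future data.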
- To read:
- Ran Avnimelech and Nathan Intrator, "Boosted Mixture of Experts: An Ensemble Learning Scheme", Neural Computation 11 (1999): 483--497
- Larry M. Bartels, "Specification Uncertainty and Model Averaging", American Journal of Political Science 41 (1997): 641--674
- Gérard Biau, Luc Devroye and Gábor Lugosi, "Consistency of Random Forests and Other Averaging Classifiers", Journal of Machine Learning Research 9 (2008): 2015--2033 ["In the last years of his life, Leo Breiman promoted random forests for use in classification. He suggested using averaging as a means of obtaining good discrimination rules. The base classifiers used for averaging are simple and randomized, often based on random samples from the data. He left a few questions unanswered regarding the consistency of such rules. In this paper, we give a number of theorems that establish the universal consistency of averaging rules. We also show that some popular classifiers, including one suggested by Breiman, are not universally consistent."]
- Gavin Brown, Jeremy L. Wyatt and Peter Tino, "Managing Diversity in Regression Ensembles", Journal of Machine Learning Research 6 (2005): 1621--1650
- Peter Bühlmann and Torsten Hothorn, "Boosting Algorithms: Regularization, Prediction and Model Fitting", Statistical Science 22 (2007): 477--505, arxiv:0804.2752 [with commentary following]
- Bruno Caprile, Cesare Furlanello and Stefano Merler, "The Dynamics of AdaBoost Weights Tells You What's Hard to Classify," cs.LG/0201014
- Kamalika Chaudhuri, Yoav Freund, Daniel Hsu, "A parameter-free hedging algorithm", arxiv:0903.2851 [Doing about as well as a given fraction of the ensemble]
- Zhuo Chen and Yuhong Yang, "Time Series Models for Forecasting: Testing or Combining?", Studies in Nonlinear Dynamics and Econometrics 11:1 (2007): 3
- Matthieu Cornec, "Estimating Subbagging by cross-validation", arxiv:1011.5142
- M. Di Marzio and C. C. Taylor, "Kernel density classification and boosting: an L2 analysis", Statistics and Computing 15 (2005): 113--123
- Yoav Freund, "A more robust boosting algorithm", arxiv:0905.2138
- Yoav Freund, Yishay Mansour and Robert E. Schapire, "Generalization bounds for averaged classifiers", Annals of Statistics 32 (2004): 1698--1722 = math.ST/0410092
- Yoav Freund, Robert E. Schapire, Yoram Singer and Manfred K. Warmuth, "Using and combining predictors that specialize" [PDF preprint]
- Jerome H. Friedman, Bogdan E. Popescu, "Predictive learning via rule ensembles", arxiv:0811.1679
- G. Fumera and F. Roli, "A Theoretical and Experimental Analysis of Linear Combiners for Multiple Classifier Systems", IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (2005): 942--956
- Stéphane Gaïffas and Guillaume Lecué, "Hyper-Sparse Optimal Aggregation", Journal of Machine Learning Research 12 (2011): 1813--1833
- Nicolas Garcia-Pedrajas, Cesar Garcia-Osorio and Colin Fyfe, "Nonlinear Boosting Projections for Ensemble Construction", Journal of Machine Learning Research 8 (2007): 1--33
- Alexander Goldenshluger, "A universal procedure for aggregating estimators", arxiv:0704.2500 = Annals of Statistics 37 (2009): 542--568
- Etienne Grossmann, "A Theory of Probabilistic Boosting, Decision Trees and Matryoshki", cs.LG/0607110
- S. Gualdi, A. De Martino, "How does informational heterogeneity affect the quality of forecasts?", arxiv:0906.0552
- Jakob Vogdrup Hansen, Combining Predictors: Meta Machine Learning Methods and Bias/Variance & Ambiguity Decompositions [Ph.D. thesis, University of Aarhus, 2000; on-line]
- Geoffrey E. Hinton, "Training Products of Experts by Minimizing Contrastive Divergence," Neural Computation 14 (2002): 1771--1800
- Benjamin Hofner, Torsten Hothorn, Thomas Kneib, and Matthias Schmid, "A Framework for Unbiased Model Selection Based on Boosting", Journal of Computational and Graphical Statistics forthcoming (2011)
- Marcus Hutter and Jan Poland, "Adaptive Online Prediction by Following the Perturbed Leader", cs.AI/0504078 = Journal of Machine Learning Research 6 (2005): 639--660
- Robert A. Jacobs, "Bias/Variance Analyses of Mixtures-of-Experts Architectures", Neural Computation 9 (1997): 369--383 ["This article investigates the bias and variance of mixtures-of-experts (ME) architectures. The variance of an ME architecture can be expressed as the sum of two terms: the first term is related to the variances of the expert networks that comprise the architecture and the second term is related to the expert networks' covariances. One goal of this article is to study and quantify a number of properties of ME architectures via the metrics of bias and variance. A second goal is to clarify the relationships between this class of systems and other systems that have recently been proposed. It is shown that in contrast to systems that produce unbiased experts whose estimation errors are uncorrelated, ME architectures produce biased experts whose estimates are negatively correlated."]
- Wenxin Jiang, "Boosting with Noisy Data: Some Views from Statistical Theory", Neural Computation 16 (2004): 789--810
- Jeremy Z. Kolter and Marcus A. Maloof, "Using Additive Expert Ensembles to Cope with Concept Drift", ICML 2005 [PDF reprint via Kolter]
- Nicole Kraemer, "Boosting for Functional Data", math.ST/0605751
- Ludmila I. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms
- Guillaume Lecué, "Lower Bounds and Aggregation in Density Estimation", Journal of Machine Learning Research 7 (2006): 971--981
- David Mease, Abraham J. Wyner and Andreas Buja, "Boosted Classification Trees and Class Probability/Quantile Estimation", Journal of Machine Learning Research 8 (2007): 409--439
- Nicolai Meinshausen, "Forest Garrote", arxiv:0906.3590
- David J. Miller and Siddharth Pal, "Transductive Methods for the Distributed Ensemble Classification Problem", Neural Computation 19 (2007): 856--884
- Andriy Norets, "Approximation of conditional densities by smooth mixtures of regressions", Annals of Statistics 38 (2010): 1733--1766, arxiv:1010.0581
- L. Nunes and E. Oliveira, "On Learning by Exchanging Advice," cs.LG/0203010
- Fernando C. Pereira and Yoram Singer, "An Efficient Extension to Mixture Techniques for Prediction and Decision Trees", Machine Learning 36 (1999): 183--199
- Evgueni Petrov, "Constraint-based analysis of composite solvers," cs.AI/0302036
- Benedikt M. Pötscher, "The distribution of model averaging estimators and an impossibility result regarding its estimation", arxiv:math/0702781
- Philippe Rigollet, "Maximum likelihood aggregation and misspecified generalized linear models", arxiv:0911.2919
- Robert E. Schapire and Yoav Freund, Boosting: Foundations and Algorithms [Blurb]
- Yoram Singer, "Adaptive Mixtures of Probabilistic Transducers", Neural Computation 9 (1997): 1711--1733 [PS.gz preprint]
- David S. Siroky, "Navigating Random Forests and related advances in algorithmic modeling", Statistics Surveys 3 (2009): 147--163
- Eiji Takimoto and Akira Maruoka, "Top-down decision tree learning as information based boosting," Theoretical Computer Science 292 (2002): 447--464
- Peter Welinder, Steve Branson, Serge Belongie and Pietro Perona, "The Multidimensional Wisdom of Crowds", NIPS 2011 (NIPS 23) [PDF reprint]
- Héla Zouari, Laurent Heutte and Yves Lecourtier, "Controlling the diversity in classifier ensembles through a measure of agreement", Pattern Recognition 38 (2005): 2195--2199
- To write:
- CRS, "Adapting to non-stationarity with growing predictor ensembles"
