<?xml version="1.0"?>
<!-- name="generator" content="blosxom/2.0" -->
<!DOCTYPE rss PUBLIC "-//Netscape Communications//DTD RSS 0.91//EN" "http://my.netscape.com/publish/formats/rss-0.91.dtd">

<rss version="0.91">
  <channel>
    <title>Notebooks</title>
    <link>http://bactra.org/notebooks</link>
    <description>Cosma's Notebooks</description>
    <language>en</language>

  <item>
    <title>Ensemble Methods in Machine Learning</title>
    <link>http://bactra.org/notebooks/2009/12/27#ensemble-ml</link>
    <description>
&lt;P&gt;Boosting, bagging, binning, stacking, mixtures of experts, ...

&lt;P&gt;Value of &lt;a href=&quot;diversity-in-ml.html&quot;&gt;diversity&lt;/a&gt;.

&lt;P&gt;See also:
	&lt;a href=&quot;collective-cognition.html&quot;&gt;Collective Cognition&lt;/a&gt;;
	&lt;a href=&quot;learning-theory.html&quot;&gt;Learning Theory&lt;/a&gt;;
	&lt;a href=&quot;model-selection.html&quot;&gt;Model Selection&lt;/a&gt;

&lt;ul&gt;Recommended (totally inadequate, just what happened to come to mind while
cleaning up my files):
	&lt;li&gt;Sanjeev Arora, Elad Hazan and Satyen Kale, &quot;The Multiplicative
Weights Update Method: a Meta Algorithm and Applications&quot;
[&lt;a href=&quot;http://www.cs.princeton.edu/~arora/pubs/MWsurvey.pdf&quot;&gt;PDF
preprint&lt;/a&gt;.  This is an interesting kind of result, which promises
performance which comes close to that achieved by any strategy within a fixed
class, no matter what sequence of data is observed --- but it's
performance &lt;em&gt;on that sequence&lt;/em&gt;, which, as the saying goes, &quot;is no
guarantee of future results&quot;.  Cesa-Bianchi and Lugosi's book has a lot more
along these lines.]
	&lt;li&gt;Nicolo Cesa-Bianchi and Gabor Lugosi, &lt;cite&gt;Prediction, Learning,
and Games&lt;/cite&gt; [&lt;a href=&quot;../weblog/algae-2008-07.html#prediction&quot;&gt;Mini-review&lt;/a&gt;]
	&lt;li&gt;Gerda Claeskens and Nils Lid Hjort, &lt;cite&gt;Model Selection
and Model Averaging&lt;/cite&gt;
	&lt;li&gt;&lt;a href=&quot;http://www.cs.washington.edu/homes/pedrod/&quot;&gt;Pedro
Domingos&lt;/a&gt;, &quot;The Role of Occam's Razor in Knowledge Discovery,&quot; &lt;cite&gt;Data
Mining and Knowledge Discovery,&lt;/cite&gt; &lt;strong&gt;3&lt;/strong&gt; (1999) [&lt;a
href=&quot;http://www.cs.washington.edu/homes/pedrod/dmkd99.ps.gz&quot;&gt;Online&lt;/a&gt;.
Ensemble methods as an apparent violation of Occam's Razor.]
	&lt;li&gt;A. Juditsky, P. Rigollet, A. B. Tsybakov, &quot;Learning by mirror averaging&quot;, &lt;a href=&quot;http://arxiv.org/abs/math/0511468&quot;&gt;arxiv:math/0511468&lt;/a&gt; =
&lt;cite&gt;Annals of Statistics&lt;/cite&gt; &lt;strong&gt;36&lt;/strong&gt; (2008): 2183--2206
	&lt;li&gt;G. Langer and U. Parlitz, &quot;Modeling parameter dependence from time
series&quot;, &lt;a href=&quot;http://dx.doi.org/10.1103/PhysRevE.70.056217&quot;&gt;&lt;cite&gt;Physical
Review E&lt;/cite&gt; &lt;strong&gt;70&lt;/strong&gt; (2004): 056217&lt;/a&gt; [Interesting use of
ensemble methods in state space modeling]
	&lt;li&gt;Laurence K. Saul and Michael I. Jordan, &quot;Mixed Memory Markov
Models: Decomposing Complex Stochastic Processes as Mixtures of Simpler
Ones&quot;, &lt;cite&gt;Machine Learning&lt;/cite&gt; &lt;strong&gt;37&lt;/strong&gt; (1999): 75--87
	&lt;li&gt;Robert E. Schapire, Yoav Freund, Peter Bartlett and Wee Sun Lee,
&quot;Boosting the Margin: A New Explanation for the Effectiveness of Voting
Methods&quot;, &lt;cite&gt;Annals of Statistics&lt;/cite&gt; &lt;strong&gt;26&lt;/strong&gt; (1998):
1651--1686
	&lt;/ul&gt;

&lt;ul&gt;To read:
	&lt;li&gt;Ran Avnimelech and Nathan Intrator, &quot;Boosted Mixture of Experts: An
Ensemble Learning Scheme&quot;, &lt;a
href=&quot;http://dx.doi.org/10.1162/089976699300016737&quot;&gt;&lt;cite&gt;Neural Computation&lt;/cite&gt; &lt;strong&gt;11&lt;/strong&gt;
(1999): 483--497&lt;/a&gt;
	&lt;li&gt;Larry M. Bartels, &quot;Specification Uncertainty and Model
Averaging&quot;, &lt;cite&gt;American Journal of Political
Science&lt;/cite&gt; &lt;strong&gt;41&lt;/strong&gt; (1997): 641--674
	&lt;li&gt;G&amp;eacute;rard Biau, Luc Devroye and G&amp;aacute;bor Lugosi,
&quot;Consistency of Random Forests and Other Averaging Classifiers&quot;,
&lt;a href=&quot;&quot;&gt;&lt;cite&gt;Journal of Machine Learning Research&lt;/cite&gt; &lt;strong&gt;9&lt;/strong&gt;
(2008): 2015--2033&lt;/a&gt; [&quot;In the last years of his life, Leo Breiman promoted
random forests for use in classification. He suggested using averaging as a
means of obtaining good discrimination rules. The base classifiers used for
averaging are simple and randomized, often based on random samples from the
data. He left a few questions unanswered regarding the consistency of such
rules. In this paper, we give a number of theorems that establish the universal
consistency of averaging rules. We also show that some popular classifiers,
including one suggested by Breiman, are not universally consistent.&quot;]
	&lt;li&gt;Gavin Brown, Jeremy L. Wyatt and Peter Tino, &quot;Managing Diversity
in Regression Ensembles&quot;, &lt;cite&gt;Journal of Machine Learning Research&lt;/cite&gt;
&lt;strong&gt;6&lt;/strong&gt; (2005): 1621--1650
	&lt;li&gt;Bruno Caprile, Cesare Furlanello and Stefano Merler, &quot;The Dynamics
of AdaBoost Weights Tells You What's Hard to Classify,&quot; &lt;a
href=&quot;http://arxiv.org/abs/cs.LG/0201014&quot;&gt;cs.LG/0201014&lt;/a&gt;
	&lt;li&gt;Zhuo Chen and Yuhong Yang, &quot;Time Series Models for Forecasting:
Testing or Combining?&quot;, &lt;cite&gt;Studies in Nonlinear Dynamics and
Econometrics&lt;/cite&gt; &lt;strong&gt;11:1&lt;/strong&gt; (2007): 3
	&lt;li&gt;M. Di Marzio and C. C. Taylor, &quot;Kernel density classification and
boosting: an L2 analysis&quot;, &lt;a
href=&quot;http://dx.doi.org/10.1007/s11222-005-6203-8&quot;&gt;&lt;cite&gt;Statistics and
Computing&lt;/cite&gt; &lt;strong&gt;15&lt;/strong&gt; (2005): 113--123&lt;/a&gt;
	&lt;li&gt;Yoav Freund, Yishay Mansour and Robert E. Schapire, &quot;Generalization
bounds for averaged classifiers&quot;, &lt;cite&gt;Annals of
Statistics&lt;/cite&gt; &lt;strong&gt;32&lt;/strong&gt; (2004): 1698--1722 = &lt;a
href=&quot;http://arxiv.org/abs/math.ST/0410092&quot;&gt;math.ST/0410092&lt;/a&gt;
	&lt;li&gt;Yoav Freund, Robert E. Schapire, Yoram Singer and Manfred
K. Warmuth, &quot;Using and combining predictors that specialize&quot; [&lt;a
href=&quot;http://www1.cs.columbia.edu/~freund/papers/SpecializedExperts.pdf&quot;&gt;PDF
preprint&lt;/a&gt;]
	&lt;li&gt;Jerome H. Friedman, Bogdan E. Popescu, &quot;Predictive learning via rule ensembles&quot;, &lt;a href=&quot;http://arxiv.org/abs/0811.1679&quot;&gt;arxiv:0811.1679&lt;/a&gt;
	&lt;li&gt;G. Fumera and F. Roli, &quot;A Theoretical and Experimental Analysis of
Linear Combiners for Multiple Classifier Systems&quot;, &lt;a
href=&quot;http://dx.doi.org/10.1109/TPAMI.2005.109&quot;&gt;&lt;cite&gt;IEEE Transactions on
Pattern Analysis and Machine Intelligence&lt;/cite&gt; &lt;strong&gt;27&lt;/strong&gt; (2005):
942--956&lt;/a&gt;
	&lt;li&gt;Nicolas Garcia-Pedrajas, Cesar Garcia-Osorio and Colin Fyfe,
&quot;Nonlinear Boosting Projections for Ensemble Construction&quot;,
&lt;a
href=&quot;http://jmlr.csail.mit.edu/papers/volume8/garcia-pedrajas07a/garcia-pedrajas07a.pdf&quot;&gt;&lt;cite&gt;Journal
of Machine Learning Research&lt;/cite&gt; &lt;strong&gt;8&lt;/strong&gt; (2007): 1--33&lt;/a&gt;
	&lt;li&gt;Alexander Goldenshluger, &quot;A universal procedure for
aggregating estimators&quot;, &lt;a href=&quot;http://arxiv.org/abs/0704.2500&quot;&gt;arxiv:0704.2500&lt;/a&gt; = &lt;cite&gt;Annals of Statistics&lt;/cite&gt; &lt;strong&gt;37&lt;/strong&gt; (2009): 542--568
	&lt;li&gt;Etienne Grossmann, &quot;A Theory of Probabilistic Boosting, Decision
Trees and Matryoshki&quot;, &lt;a
href=&quot;http://arxiv.org/abs/cs.LG/0607110&quot;&gt;cs.LG/0607110&lt;/a&gt;
	&lt;li&gt;Jakob Vogdrup Hansen, &lt;cite&gt;Combining Predictors: Meta Machine
Learning Methods and Bias/Variance &amp;amp; Ambiguity Decompositions&lt;/cite&gt; [Ph.D.
thesis, University of Aarhus, 2000;
&lt;a href=&quot;http://www.daimi.au.dk/~vogdrup/diss.ps&quot;&gt;on-line&lt;/a&gt;]
	&lt;li&gt;Geoffrey E. Hinton, &quot;Training Products of Experts by Minimizing
Contrastive Divergence,&quot; &lt;cite&gt;Neural Computation&lt;/cite&gt; &lt;strong&gt;14&lt;/strong&gt;
(2002): 1771--1800
	&lt;li&gt;Marcus Hutter and Jan Poland, &quot;Adaptive Online Prediction by
Following the Perturbed Leader&quot;, &lt;a
href=&quot;http://arxiv.org/abs/cs.AI/0504078&quot;&gt;cs.AI/0504078&lt;/a&gt; = &lt;cite&gt;Journal of
Machine Learning Research&lt;/cite&gt; &lt;strong&gt;6&lt;/strong&gt; (2005): 639--660
	&lt;li&gt;Robert A. Jacobs, &quot;Bias/Variance Analyses of Mixtures-of-Experts
Architectures&quot;, &lt;cite&gt;Neural Computation&lt;/cite&gt; &lt;strong&gt;9&lt;/strong&gt; (1997):
369--383 [&quot;This article investigates the bias and variance of
mixtures-of-experts (ME) architectures. The variance of an ME architecture can
be expressed as the sum of two terms: the first term is related to the
variances of the expert networks that comprise the architecture and the second
term is related to the expert networks' covariances. One goal of this article
is to study and quantify a number of properties of ME architectures via the
metrics of bias and variance. A second goal is to clarify the relationships
between this class of systems and other systems that have recently been
proposed. It is shown that in contrast to systems that produce unbiased experts
whose estimation errors are uncorrelated, ME architectures produce biased
experts whose estimates are negatively correlated.&quot;]
	&lt;li&gt;Wenxin Jiang, &quot;Boosting with Noisy Data: Some Views from
Statistical Theory&quot;, &lt;a
href=&quot;http://neco.mitpress.org/cgi/content/abstract/16/4/789&quot;&gt;&lt;cite&gt;Neural
Computation&lt;/cite&gt; &lt;strong&gt;16&lt;/strong&gt; (2004): 789--810&lt;/a&gt;
	&lt;li&gt;Ludmila I. Kuncheva, &lt;cite&gt;Combining Pattern Classifiers: Methods
and Algorithms&lt;/cite&gt;
	&lt;li&gt;Nicole Kraemer, &quot;Boosting for Functional
Data&quot;, &lt;a href=&quot;http://arxiv.org/abs/math.ST/0605751&quot;&gt;math.ST/0605751&lt;/a&gt;
	&lt;li&gt;Guillaume Lecu&amp;eacute;, &quot;Lower Bounds and Aggregation in Density
Estimation&quot;, &lt;a
href=&quot;http://jmlr.csail.mit.edu/papers/v7/lecue06a.html&quot;&gt;&lt;cite&gt;Journal of
Machine Learning Research&lt;/cite&gt;
&lt;strong&gt;7&lt;/strong&gt; (2006): 971--981&lt;/a&gt;
	&lt;li&gt;David Mease, Abraham J. Wyner and Andreas Buja, &quot;Boosted
Classification Trees and Class Probability/Quantile
Estimation&quot;, &lt;a href=&quot;http://jmlr.csail.mit.edu/papers/volume8/mease07a.html&quot;&gt;&lt;cite&gt;Journal
of Machine Learning Research&lt;/cite&gt; &lt;strong&gt;8&lt;/strong&gt; (2007): 409--439&lt;/a&gt;
	&lt;li&gt;Nicolai Meinshausen, &quot;Forest
Garrote&quot;, &lt;a href=&quot;http://arxiv.org/abs/0906.3590&quot;&gt;arxiv:0906.3590&lt;/a&gt;
	&lt;li&gt;David J. Miller and Siddharth Pal, &quot;Transductive Methods for the
Distributed Ensemble Classification Problem&quot;, &lt;a
href=&quot;http://neco.mitpress.org/cgi/content/abstract/19/3/856&quot;&gt;&lt;cite&gt;Neural
Computation&lt;/cite&gt; &lt;strong&gt;19&lt;/strong&gt; (2007): 856--884&lt;/a&gt;
	&lt;li&gt;Seiji Miyoshi, Kazuyuki Hara, and Masato Okada, &quot;Analysis of
ensemble learning using simple perceptrons based on online learning theory&quot;,
&lt;a href=&quot;http://dx.doi.org/10.1103/PhysRevE.71.036116&quot;&gt;&lt;cite&gt;Physical Review
E&lt;/cite&gt; &lt;strong&gt;71&lt;/strong&gt; (2005): 036116&lt;/a&gt;
	&lt;li&gt;L. Nunes and E. Oliveira, &quot;On Learning by Exchanging
Advice,&quot; &lt;a href=&quot;http://arxiv.org/abs/cs.LG/0203010&quot;&gt;cs.LG/0203010&lt;/a&gt;
	&lt;li&gt;Fernando C. Pereira and Yoram Singer, &quot;An Efficient Extension to
Mixture Techniques for Prediction and Decision Trees&quot;, &lt;a
href=&quot;http://dx.doi.org/10.1023/A:1007670818503&quot;&gt;&lt;cite&gt;Machine
Learning&lt;/cite&gt; &lt;strong&gt;36&lt;/strong&gt; (1999): 183--199&lt;/a&gt;
	&lt;li&gt;Evgueni Petrov, &quot;Constraint-based analysis of composite
solvers,&quot; &lt;a href=&quot;http://arxiv.org/abs/cs.AI/0302036&quot;&gt;cs.AI/0302036&lt;/a&gt;
	&lt;li&gt;Philippe Rigollet, &quot;Maximum likelihood aggregation and
misspecified generalized linear models&quot;, &lt;a href=&quot;http://arxiv.org/abs/0911.2919&quot;&gt;arxiv:0911.2919&lt;/a&gt;
	&lt;li&gt;Yoram Singer, &quot;Adaptive Mixtures of Probabilistic Transducers&quot;, &lt;a
href=&quot;http://neco.mitpress.org/cgi/content/abstract/9/8/1711&quot;&gt;&lt;cite&gt;Neural
Computation&lt;/cite&gt; &lt;strong&gt;9&lt;/strong&gt; (1997): 1711--1733&lt;/a&gt; [&lt;a
href=&quot;http://www.cs.huji.ac.il/~singer/papers/mmap.ps.gz&quot;&gt;PS.gz preprint&lt;/a&gt;]
	&lt;li&gt;Eiji Takimoto and Akira Maruoka, &quot;Top-down decision tree learning
as information based boosting,&quot; &lt;a
href=&quot;http://dx.doi.org/10.1016/S0304-3975(02)00181-0&quot;&gt;&lt;cite&gt;Theoretical
Computer Science&lt;/cite&gt; &lt;strong&gt;292&lt;/strong&gt; (2002): 447--464&lt;/a&gt;
	&lt;li&gt;H&amp;eacute;la Zouari, Laurent Heutte and Yves Lecourtier,
&quot;Controlling the diversity in classifier ensembles through a measure of
agreement&quot;, &lt;a
href=&quot;http://dx.doi.org/10.1016/j.patcog.2005.02.012&quot;&gt;&lt;cite&gt;Pattern
Recognition&lt;/cite&gt; &lt;strong&gt;38&lt;/strong&gt; (2005): 2195--2199&lt;/a&gt;
	&lt;/ul&gt;
</description>
  </item>
  </channel>
</rss>