<?xml version="1.0"?>
<!-- name="generator" content="blosxom/2.0" -->
<!DOCTYPE rss PUBLIC "-//Netscape Communications//DTD RSS 0.91//EN" "http://my.netscape.com/publish/formats/rss-0.91.dtd">

<rss version="0.91">
  <channel>
    <title>Notebooks</title>
    <link>http://bactra.org/notebooks</link>
    <description>Cosma's Notebooks</description>
    <language>en</language>

  <item>
    <title>Learning Theory (Formal, Computational or Statistical)</title>
    <link>http://bactra.org/notebooks/2009/12/28#learning-theory</link>
    <description>
I qualify it to distinguish this area from the broader field
of &lt;a href=&quot;learning-inference-induction.html&quot;&gt;machine learning&lt;/a&gt;, which
includes much more with lower standards of proof, and from the theory of
learning in organisms, which might be quite different.

&lt;P&gt;The basic set-up is as follows.  We have a bunch of inputs and outputs, and
an unknown relationship between the two.  We do have a class of hypotheses
describing this relationship, and suppose one of them is correct.  (The
hypothesis class is always circumscribed, but may be infinite.)  A learning
algorithm takes in a set of inputs and outputs, its data, and produces a
hypothesis.  Generally we assume the data are generated by some random
process, and the hypothesis changes as the data change.  The key notion is that
of a &lt;em&gt;probably approximately correct&lt;/em&gt; learning algorithm --- one where,
if we supply enough data, we can get a hypothesis with an arbitrarily small
error, with a probability arbitrarily close to one.
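
&lt;P&gt;In symbols, a minimal sketch of the definition (the notation is assumed
here for illustration, not taken from any particular reference below): write
$\hat{h}_n$ for the hypothesis the algorithm returns from $n$ data points, and
$\mathrm{err}(h)$ for the probability that $h$ gets a fresh input-output pair
wrong.  The algorithm is PAC if, for every accuracy $\epsilon \in (0,1)$ and
confidence $\delta \in (0,1)$, there is a sample size $n(\epsilon,\delta)$ such
that, for all $n \geq n(\epsilon,\delta)$,
$$ \Pr\left( \mathrm{err}(\hat{h}_n) \leq \epsilon \right) \geq 1 - \delta . $$
The &quot;approximately&quot; is the $\epsilon$, the &quot;probably&quot; is the $\delta$.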

&lt;P&gt;Generally, PAC-results concern either (1) the existence of a PAC algorithm,
(2) quantifying how much data we need, in terms of either accuracy or
reliability, or (3) devising new PAC algorithms with other desirable
properties.  What frustrates me about this literature, and the reason I don't
devote more of my research to it (aside, of course, from my sheer incompetence)
is that almost all of it assumes the data are statistically independent and
identically distributed.  Then PAC-like results follow essentially from
extensions of the ordinary Law of Large Numbers.  What's really needed,
however, is something more like an &lt;a href=&quot;ergodic-theory.html&quot;&gt;ergodic
theorem&lt;/a&gt;, for suitably-dependent data.  That, however,
gets &lt;a href=&quot;dependent-learning.html&quot;&gt;its own notebook&lt;/a&gt;.
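
&lt;P&gt;To see how a PAC-like result falls out of a law-of-large-numbers argument,
here is the textbook finite-class case, as a sketch under assumptions not
stated above: the hypothesis class $H$ is finite, the data are IID, and the
loss is bounded in $[0,1]$.  Write $R(h)$ for the expected loss of $h$ and
$\hat{R}_n(h)$ for its average loss on $n$ data points.  Hoeffding's inequality
gives, for each fixed $h$,
$$ \Pr\left( |\hat{R}_n(h) - R(h)| \geq \epsilon \right) \leq 2 e^{-2 n \epsilon^2} , $$
and a union bound over the class gives
$$ \Pr\left( \max_{h \in H} |\hat{R}_n(h) - R(h)| \geq \epsilon \right) \leq 2 |H| e^{-2 n \epsilon^2} . $$
Setting the right-hand side to $\delta$ and solving for $n$ shows that
$$ n \geq \frac{1}{2 \epsilon^2} \ln \frac{2 |H|}{\delta} $$
data points suffice for every hypothesis's in-sample error to be within
$\epsilon$ of its true error, with probability at least $1 - \delta$.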

&lt;P&gt;An interesting question (which I learned of from Vidyasagar's book) has to
do with the difference between distribution-free and distribution-dependent
bounds.  Generally, the latter are sharper, sometimes much sharper, but this
comes at the price of making more or less strong parametric assumptions about
the distribution.  (One might indeed think of the theory of parametric
statistical inference as learning theory with &lt;em&gt;very&lt;/em&gt; strong
distributional assumptions.)  However, even in the distribution-free set-up,
we &lt;em&gt;have&lt;/em&gt; a whole bunch of samples from the distribution, and
non-parametric density estimation is certainly possible --- could one, e.g.,
improve the bounds by using half the sample to estimate the distribution, and
then applying a distribution-dependent bound?  Or will the uncertainty in the
distributional estimate necessarily kill any advantage we might get from
learning about the distribution?  It feels like the latter would say something



&lt;P&gt;See also:
	&lt;a href=&quot;empirical-process-theory.html&quot;&gt;Empirical Process Theory&lt;/a&gt;;
	&lt;a href=&quot;ensemble-ml.html&quot;&gt;Ensemble Methods in Machine Learning&lt;/a&gt;;
	&lt;a href=&quot;bayesian-consistency.html&quot;&gt;Frequentist Consistency of Bayesian Procedures&lt;/a&gt;;
	&lt;a href=&quot;statistics.html&quot;&gt;Statistics&lt;/a&gt;;
	&lt;a href=&quot;structured-data.html&quot;&gt;Statistics of Structured Data&lt;/a&gt;;
	&lt;a href=&quot;universal-prediction.html&quot;&gt;Universal Prediction Algorithms&lt;/a&gt;


&lt;ul&gt;Recommended, bigger picture:
	&lt;li&gt;&lt;a href=&quot;http://www.kyb.mpg.de/~bousquet/&quot;&gt;Olivier Bousquet&lt;/a&gt;,
&lt;a href=&quot;http://www.lri.fr/~bouchero/&quot;&gt;St&amp;eacute;phane Boucheron&lt;/a&gt;
and &lt;a href=&quot;http://www.econ.upf.es/~lugosi/&quot;&gt;G&amp;aacute;bor Lugosi&lt;/a&gt;,
&quot;Introduction to Statistical Learning Theory&quot;
[&lt;a href=&quot;http://www.stat.cmu.edu/~larry/=sml2008/BBL.pdf&quot;&gt;PDF&lt;/a&gt;.  39 pp.
review on how to bound the error of your learning algorithms.]
	&lt;li&gt;Nicolo Cesa-Bianchi and Gabor Lugosi, &lt;cite&gt;Prediction, Learning,
and Games&lt;/cite&gt; [&lt;a href=&quot;../weblog/algae-2008-07.html#prediction&quot;&gt;Mini-review&lt;/a&gt;]
	&lt;li&gt;Nello Cristianini and John Shawe-Taylor, &lt;cite&gt;An Introduction to
Support Vector Machines&lt;/cite&gt; [While SVMs are one particular technology
among others, this book does an excellent job of crisply introducing the
general theory of learning, and showing its practicality.]
	&lt;li&gt;Michael J. Kearns and Umesh V. Vazirani, &lt;cite&gt;An Introduction to
Computational Learning Theory&lt;/cite&gt; [&lt;a
href=&quot;../reviews/kearns-vazirani/&quot;&gt;Review: How to Build a Better Guesser&lt;/a&gt;]
	&lt;li&gt;John Lafferty and Larry Wasserman, &lt;cite&gt;Statistical
Machine Learning&lt;/cite&gt; [Unpublished lecture notes]
	&lt;li&gt;V. N. Vapnik, &lt;cite&gt;The Nature of
Statistical Learning Theory&lt;/cite&gt; [&lt;a href=&quot;../reviews/vapnik-nature/&quot;&gt;Review:
A Useful Biased Estimator&lt;/a&gt;]
	&lt;li&gt;Mathukumalli Vidyasagar, &lt;cite&gt;A Theory of Learning and
Generalization: With Applications to Neural Networks and Control Systems&lt;/cite&gt;
[Vapnik-nik learning theory meets empirical process
theory.  &lt;a href=&quot;../weblog/algae-209-01.html#vidyasagar&quot;&gt;Mini-review&lt;/a&gt;.]
	&lt;/ul&gt;

&lt;ul&gt;Recommended, close-ups (very misc.):
	&lt;li&gt;Moulinath Banerjee, &quot;Covering Numbers and VC dimension&quot; [review
notes, UM statistics department, 2004; &lt;a
href=&quot;http://www.stat.lsa.umich.edu/~jizhu/LearningTheory/MouliCNVC.ps&quot;&gt;PS&lt;/a&gt;]
	&lt;li&gt;Jonathan Baxter, &quot;A Model of Inductive Bias Learning,&quot;
&lt;cite&gt;Journal of Artificial Intelligence Research&lt;/cite&gt; &lt;strong&gt;12&lt;/strong&gt;
(2000): 149--198 [How to learn what class of hypotheses you should be trying to
use, i.e., your inductive bias.  Assumes independence, again.]
	&lt;li&gt;Wenxin Jiang, &quot;On Uniform Deviations of General Empirical Risks
with Unboundedness, Dependence, and High
Dimensionality&quot;, &lt;a
href=&quot;http://jmlr.csail.mit.edu/papers/v10/jiang09a.html&quot;&gt;&lt;cite&gt;Journal of Machine Learning Research&lt;/cite&gt; &lt;strong&gt;10&lt;/strong&gt;
(2009): 977--996&lt;/a&gt; [This is a &lt;em&gt;very&lt;/em&gt; sweet result, despite the
appearance of constants like 288 in the main theorem.  The term which measures
the effect of dependence is one of the weak dependence coefficients of Dedecker
et al. (see under &lt;a href=&quot;ergodic-theory.html&quot;&gt;ergodic theory&lt;/a&gt;) --- I want
to say it's their first-order gamma coefficient, but I don't have the book to
hand and may be mis-remembering --- applied here to the loss
function.]
	&lt;li&gt;Mehryar Mohri and Afshin Rostamizadeh, &quot;Stability Bound for
Stationary Phi-mixing and Beta-mixing
Processes&quot;, &lt;a href=&quot;http://arxiv.org/abs/0811.1629&quot;&gt;arxiv:0811.1629&lt;/a&gt;
[&quot;Stability&quot; is the property of a learning algorithm that changing a single
observation in the training set leads to only small changes in predictions on
the test set.  This paper shows that stable learning algorithms continue to
perform well with dependent data, provided the dependence decays sufficiently
quickly --- the &quot;mixing&quot; properties of &lt;a href=&quot;ergodic-theory.html&quot;&gt;ergodic
theory&lt;/a&gt;.]
	&lt;li&gt;David Pollard
		&lt;ul&gt;
		&lt;li&gt;&quot;Asymptotics via Empirical Processes&quot;,
&lt;cite&gt;Statistical Science&lt;/cite&gt; &lt;strong&gt;4&lt;/strong&gt; (1989): 341--354
		&lt;li&gt;&lt;cite&gt;Empirical Processes: Theory and Applications&lt;/cite&gt;
[&lt;a href=&quot;../weblog/algae-2008-07.html#pollard&quot;&gt;Review&lt;/a&gt;]
		&lt;/ul&gt;
	&lt;li&gt;C. Scott and R. Nowak, &quot;A Neyman-Pearson Approach to Statistical
Learning&quot;, &lt;a href=&quot;http://dx.doi.org/10.1109/TIT.2005.856955&quot;&gt;&lt;cite&gt;IEEE
Transactions on Information Theory&lt;/cite&gt; &lt;strong&gt;51&lt;/strong&gt; (2005):
3806--3819&lt;/a&gt;
	&lt;/ul&gt;


&lt;ul&gt;To read:
	&lt;li&gt;Andris Ambainis, &quot;Probabilistic and Team PFIN-type Learning:
General Properties&quot;, &lt;a
href=&quot;http://arxiv.org/abs/cs.LG/0504001&quot;&gt;cs.LG/0504001&lt;/a&gt;
	&lt;li&gt;Jean-Yves Audibert, &quot;Fast learning rates in statistical inference
through aggregation&quot;, &lt;a
href=&quot;http://arxiv.org/abs/math.ST/0703854&quot;&gt;math.ST/0703854&lt;/a&gt;
	&lt;li&gt;G. Biau and L. Devroye, &quot;A Note on Density Model Size Testing&quot;,
&lt;cite&gt;IEEE Transactions on Information Theory&lt;/cite&gt; &lt;strong&gt;50&lt;/strong&gt;
(2004): 576--581 [Testing which member of a nested family of model classes you
need to use]
	&lt;li&gt;Gilles Blanchard and Francois Fleuret, &quot;Occam's hammer: a link
between randomized learning and multiple testing FDR
control&quot;, &lt;a href=&quot;http://arxiv.org/abs/math.ST/0608713&quot;&gt;math.ST/0608713&lt;/a&gt;
	&lt;li&gt;Olivier Bousquet and Andr&amp;eacute; Elisseeff, &quot;Algorithmic Stability
and Generalization Performance&quot; [&quot;We present a novel way of obtaining
PAC-style bounds on the generalization of learning algorithms, explicitly
using their stability properties.  A &lt;em&gt;stable&lt;/em&gt; learner is one for
which the learned solution does not change much with small changes in
the training set.  The bounds we obtain do not depend on any measure of
the complexity of the hypothesis space (e.g. VC dimension) but rather depend
on how the learning algorithm searches this space, and can thus be applied
even when the VC dimension is infinite.&quot;  Sounds like a formalization of
(something close to) Pedro Domingos's &quot;process-oriented evaluation&quot;.  &lt;a href=&quot;http://www.kyb.mpg.de/publications/pdfs/pdf1437.pdf&quot;&gt;PDF&lt;/a&gt;]
	&lt;li&gt;Arnaud Buhot and Mirta B. Gordon, &quot;Storage Capacity of the
Tilinglike Learning Algorithm,&quot; &lt;a
href=&quot;http://arxiv.org/abs/cond-mat/0008162&quot;&gt;cond-mat/0008162&lt;/a&gt;
	&lt;li&gt;Arnaud Buhot, Mirta B. Gordon and Jean-Pierre Nadal, &quot;Rigorous
Bounds to Retarded Learning,&quot; &lt;a
href=&quot;http://arxiv.org/abs/cond-mat/0201256&quot;&gt;cond-mat/0201256&lt;/a&gt;
	&lt;li&gt;Andrea Caponnetto and Alexander Rakhlin, &quot;Stability Properties of
Empirical Risk Minimization over Donsker
Classes&quot;, &lt;a
href=&quot;http://jmlr.csail.mit.edu/papers/volume7/caponnetto06a/caponnetto06a.pdf&quot;&gt;&lt;cite&gt;Journal
of Machine Learning Research&lt;/cite&gt; &lt;strong&gt;7&lt;/strong&gt; (2006): 2565--2583&lt;/a&gt;
	&lt;li&gt;Olivier Catoni, &quot;Improved Vapnik Cervonenkis bounds&quot;, &lt;a
href=&quot;http://arxiv.org/abs/math.ST/0410280&quot;&gt;math.ST/0410280&lt;/a&gt;
	&lt;li&gt;N. Cesa-Bianchi, A. Conconi and C. Gentile, &quot;On the Generalization
Ability of On-Line Learning Algorithms&quot;, &lt;cite&gt;IEEE
Transactions on Information Theory&lt;/cite&gt; &lt;strong&gt;50&lt;/strong&gt; (2004): 2050--2057
	&lt;li&gt;Nicolo Cesa-Bianchi, Yishay Mansour, and Gilles Stoltz,
&quot;Improved Second-Order Bounds for Prediction with Expert Advice&quot;,
&lt;a href=&quot;http://arxiv.org/abs/math.ST/0602629&quot;&gt;math.ST/0602629&lt;/a&gt; [&quot;This work
studies external regret in sequential prediction games with both positive and
negative payoffs. External regret measures the difference between the payoff
obtained by the forecasting strategy and the payoff of the best action.... new
and sharper regret bounds for the well-known exponentially weighted average
forecaster and for a new forecaster with a different multiplicative update
rule. ... no preliminary knowledge about the payoff sequence is needed, not
even its range; ... bounds are expressed in terms of sums of squared payoffs,
replacing larger first-order quantities appearing in previous bounds. In
addition, our most refined bounds have the natural and desirable property of
being stable under rescalings and general translations of the payoff
sequence.&quot;]
	&lt;li&gt;Olivier Chapelle, Bernhard Sch&amp;ouml;lkopf and Alexander Zien
(eds.), &lt;cite&gt;Semi-Supervised Learning&lt;/cite&gt;
[&lt;a href=&quot;http://mitpress.mit.edu/0-262-03358-5&quot;&gt;Blurb&lt;/a&gt;]
	&lt;li&gt;Felipe Cucker and Ding Xuan Zhou, &lt;cite&gt;Learning Theory: An
Approximation Theory Viewpoint&lt;/cite&gt;
[&lt;a href=&quot;http://cambridge.org/052186559X&quot;&gt;Blurb&lt;/a&gt;]
	&lt;li&gt;Arnak Dalalyan and Alexandre B. Tsybakov, &quot;Sparse Regression Learning by Aggregation and Langevin Monte-Carlo&quot;, &lt;a href=&quot;http://arxiv.org/abs/0903.1223&quot;&gt;arxiv:0903.1223&lt;/a&gt;
	&lt;li&gt;Vitaly Feldman, &quot;Robustness of Evolvability&quot;
[&lt;a href=&quot;http://www.eecs.harvard.edu/~vitaly/papers/F_EvolveRobust_09.pdf&quot;&gt;PDF
preprint&lt;/a&gt;]
	&lt;li&gt;Yoav Freund, Yishay Mansour and Robert E. Schapire, &quot;Generalization
bounds for averaged classifiers&quot;, &lt;cite&gt;Annals of
Statistics&lt;/cite&gt; &lt;strong&gt;32&lt;/strong&gt; (2004): 1698--1722 = &lt;a
href=&quot;http://arxiv.org/abs/math.ST/0410092&quot;&gt;math.ST/0410092&lt;/a&gt;
	&lt;li&gt;Alexander Gammerman, Vladimir Vovk, &quot;Hedging predictions in machine
learning&quot;, &lt;a href=&quot;http://arxiv.org/abs/cs.LG/0611011&quot;&gt;cs.LG/0611011&lt;/a&gt;
	&lt;li&gt;Carl Gold and Peter Sollich, &quot;Model Selection for Support Vector
Machine Classification,&quot; &lt;a
href=&quot;http://arxiv.org/abs/cond-mat/0203334&quot;&gt;cond-mat/0203334&lt;/a&gt;
	&lt;li&gt;&lt;a href=&quot;http://www.cs.cmu.edu/~shanneke/&quot;&gt;Steve Hanneke&lt;/a&gt;
		&lt;ul&gt;
		&lt;li&gt;&quot;Teaching Dimension and the Complexity of 
Active Learning&quot; [&lt;a href=&quot;http://www.cs.cmu.edu/~shanneke/docs/2007/hanneke-teaching-dimension.pdf&quot;&gt;PDF&lt;/a&gt;]
		&lt;li&gt;&quot;A Bound on the Label Complexity of Agnostic Active Learning&quot; [&lt;a href=&quot;http://www.cs.cmu.edu/~shanneke/docs/2007/hanneke-agnostic-active.pdf&quot;&gt;PDF&lt;/a&gt;]
		&lt;li&gt;&quot;The Cost Complexity of Active Learning&quot; [&lt;a href=&quot;http://www.cs.cmu.edu/~shanneke/downloads/cost-complexity-working-notes.pdf&quot;&gt;PDF&lt;/a&gt;]
		&lt;/ul&gt;
	&lt;li&gt;Daniel J. L. Herrmann and Dominik Janzing, &quot;Selection Criterion for
Log-Linear Models Using Statistical Learning
Theory&quot;, &lt;a href=&quot;http://arxiv.org/abs/math.ST/0302079&quot;&gt;math.ST/0302079&lt;/a&gt;
	&lt;li&gt;Sanjay Jain, Daniel Osherson, James S. Royer and Arun Sharma,
&lt;cite&gt;Systems That Learn: An Introduction to Learning Theory&lt;/cite&gt;
	&lt;li&gt;Yuri Kalnishkan, Volodya Vovk and Michael V. Vyugin, &quot;Loss
functions, complexities, and the Legendre transformation&quot;, &lt;a
href=&quot;http://dx.doi.org/10.1016/j.tcs.2003.11.005&quot;&gt;&lt;cite&gt;Theoretical Computer
Science&lt;/cite&gt; &lt;strong&gt;313&lt;/strong&gt; (2004): 195--207&lt;/a&gt;
	&lt;li&gt;Gerard Kerkyacharian and Dominique Picard, &quot;Thresholding in
Learning Theory&quot;, &lt;a
href=&quot;http://arxiv.org/abs/math.ST/0510271&quot;&gt;math.ST/0510271&lt;/a&gt;
	&lt;li&gt;Natalia Komarova and Igor Rivin, &quot;Mathematics of learning,&quot; &lt;a
href=&quot;http://arXiv.org/abs/math/0105235&quot;&gt;math.PR/0105235&lt;/a&gt;
	&lt;li&gt;Pirkko Kuusela, Daniel Ocone and Eduardo D. Sontag, &quot;Learning
Complexity Dimensions for a Continuous-Time Control System,&quot; &lt;a
href=&quot;http://arxiv.org/abs/math.OC/0012163&quot;&gt;math.OC/0012163&lt;/a&gt;
	&lt;li&gt;John Langford, &quot;Tutorial on Practical Prediction Theory for
Classification&quot;, &lt;a href=&quot;http://jmlr.csail.mit.edu/papers/v6/langford05a.html&quot;&gt;&lt;cite&gt;Journal of Machine Learning Research&lt;/cite&gt;
&lt;strong&gt;6&lt;/strong&gt; (2005): 273--306&lt;/a&gt; [For the PAC-Bayesian result]
	&lt;li&gt;A. Lecchini-Visintini, J. Lygeros, J. Maciejowski, &quot;Simulated
Annealing: Rigorous finite-time guarantees for optimization on continuous
domains&quot;, &lt;a href=&quot;http://arxiv.org/abs/0709.2989&quot;&gt;0709.2989&lt;/a&gt;
	&lt;li&gt;Guillaume Lecue, &quot;Simultaneous adaptation to the margin and to
complexity in classification&quot;, &lt;a
href=&quot;http://arxiv.org/abs/math.ST/0509696&quot;&gt;math.ST/0509696&lt;/a&gt;
	&lt;li&gt;Feng Liang, Sayan Mukherjee, Mike West, &quot;The Use of Unlabeled Data
in Predictive
Modeling&quot;, &lt;a href=&quot;http://arxiv.org/abs/0710.4618&quot;&gt;arxiv:0710.4618&lt;/a&gt;
= &lt;cite&gt;Statistical Science&lt;/cite&gt; &lt;strong&gt;22&lt;/strong&gt; (2007): 189--205
	&lt;li&gt;Gabor Lugosi and Marten Wegkamp, &quot;Complexity regularization via
localized random penalties&quot;, &lt;cite&gt;Annals of
Statistics&lt;/cite&gt; &lt;strong&gt;32&lt;/strong&gt; (2004): 1679--1697 = &lt;a
href=&quot;http://arxiv.org/abs/math.ST/0410091&quot;&gt;math.ST/0410091&lt;/a&gt;
	&lt;li&gt;Pascal Massart and &amp;Eacute;lodie N&amp;eacute;d&amp;eacute;lec, &quot;Risk
bounds for statistical
learning&quot;, &lt;a href=&quot;http://arxiv.org/abs/math.ST/0702683&quot;&gt;math.ST/0702683&lt;/a&gt;
= &lt;a href=&quot;http://dx.doi.org/10%2E1214/009053606000000786&quot;&gt;&lt;cite&gt;Annals of
Statistics&lt;/cite&gt;
&lt;strong&gt;34&lt;/strong&gt; (2006): 2326--2366&lt;/a&gt;
	&lt;li&gt;Andreas Maurer, &quot;A Note on the PAC Bayesian Theorem&quot;, &lt;a
href=&quot;http://arxiv.org/abs/cs.LG/0411099&quot;&gt;cs.LG/0411099&lt;/a&gt;
	&lt;li&gt;Shahar Mendelson
		&lt;ul&gt;
		&lt;li&gt;&quot;Improving the Sample Complexity Using Global
Data&quot;, &lt;cite&gt;IEEE Transactions on Information Theory&lt;/cite&gt; &lt;strong&gt;48&lt;/strong&gt;
(2002): 1977--1991
		&lt;li&gt;&quot;Lower Bounds for the Empirical Minimization Algorithm&quot;,
&lt;a href=&quot;http://dx.doi.org/10.1109/TIT.2008.926323&quot;&gt;&lt;cite&gt;IEEE Transactions on
Information Theory&lt;/cite&gt; &lt;strong&gt;54&lt;/strong&gt; (2008): 3797--3803&lt;/a&gt; [&quot;simple
argument ... that under mild geometric assumptions .... the empirical
minimization algorithm cannot yield a uniform error rate that is faster than
$1/\sqrt{k}$ in the function learning setup&quot;.]
		&lt;/ul&gt;
	&lt;li&gt;Shahar Mendelson and Gideon Schechtman, &quot;The shattering dimension
of sets of linear functionals&quot;, &lt;cite&gt;Annals of
Probability&lt;/cite&gt; &lt;strong&gt;32&lt;/strong&gt; (2004): 1746--1770 = &lt;a
href=&quot;http://arxiv.org/abs/math.PR/0410096&quot;&gt;math.PR/0410096&lt;/a&gt;
	&lt;li&gt;P. Mitra, C. A. Murthy and S. K. Pal, &quot;A probabilistic active
support vector learning algorithm&quot;, &lt;cite&gt;IEEE Transactions on Pattern Analysis
and Machine Intelligence&lt;/cite&gt; &lt;strong&gt;26&lt;/strong&gt; (2004): 413--418
	  &lt;li&gt;Sayan Mukherjee, Partha Niyogi, Tomaso Poggio and
Ryan Rifkin, &quot;Learning theory: stability is sufficient for generalization
and necessary and sufficient for consistency of empirical
risk minimization&quot;, &lt;cite&gt;Advances in Computational Mathematics&lt;/cite&gt;
&lt;strong&gt;25&lt;/strong&gt; (2006): 161--193
[&lt;a
href=&quot;http://cbcl.mit.edu/projects/cbcl/publications/ps/mukherjee-ACM-06.pdf&quot;&gt;PDF&lt;/a&gt;.
Thanks to Shiva Kaul for pointing this out to me.]
	&lt;li&gt;Roland Nilsson, Jose M. Pena, Johan Bjorkegren and Jesper Tegner,
&quot;Consistent Feature Selection for Pattern Recognition in Polynomial Time&quot;,
&lt;cite&gt;Journal of Machine Learning Research&lt;/cite&gt; &lt;strong&gt;8&lt;/strong&gt; (2007): 589--612
	&lt;li&gt;E. Parrado-Hernandez, I. Mora-Jimenez, J. Arenas-Garcia, A. R.
Figueiras-Vidal and A. Navia-Vazquez, &quot;Growing support vector classifiers with
controlled complexity,&quot; &lt;a
href=&quot;http://dx.doi.org/10.1016/S0031-3203(02)00351-5&quot;&gt;&lt;cite&gt;Pattern
Recognition&lt;/cite&gt;
&lt;strong&gt;36&lt;/strong&gt; (2003): 1479--1488&lt;/a&gt;
	&lt;li&gt;Maxim Raginsky, &quot;Achievability results for statistical learning
under communication
constraints&quot;, &lt;a href=&quot;http://arxiv.org/abs/0901.1905&quot;&gt;arxiv:0901.1905&lt;/a&gt;
	&lt;li&gt;Liva Ralaivola, Marie Szafranski, Guillaume Stempfel, &quot;Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary \beta-Mixing Processes&quot;, &lt;a href=&quot;http://arxiv.org/abs/0909.1933&quot;&gt;arxiv:0909.1933&lt;/a&gt;
	&lt;li&gt;Sebastian Risau-Gusman and Mirta B. Gordon
		&lt;ul&gt;
		&lt;li&gt;&quot;Generalization Properties of Finite Size Polynomial
Support Vector Machines,&quot; &lt;a
href=&quot;http://arxiv.org/abs/cond-mat/0002071&quot;&gt;cond-mat/0002071&lt;/a&gt; =?
&quot;Hierarchical Learning in Polynomial Support Vector Machines,&quot; &lt;cite&gt;Machine
Learning&lt;/cite&gt; &lt;strong&gt;46&lt;/strong&gt; (2002): 53--70
		&lt;li&gt;&quot;Learning curves for Soft Margin Classifiers,&quot; &lt;a
href=&quot;http://arxiv.org/abs/cond-mat/0203315&quot;&gt;cond-mat/0203315&lt;/a&gt;
		&lt;/ul&gt;
	&lt;li&gt;Igor Rivin, &quot;The performance of the batch learner algorithm,&quot; &lt;a
href=&quot;http://arxiv.org/abs/cs.LG/0201009&quot;&gt;cs.LG/0201009&lt;/a&gt;
	&lt;li&gt;Cynthia Rudin, &quot;Stability Analysis for Regularized Least Squares
Regression&quot;, &lt;a href=&quot;http://arxiv.org/abs/cs.LG/0502016&quot;&gt;cs.LG/0502016&lt;/a&gt;
	&lt;li&gt;Lorenzo Rosasco, Ernesto De Vito, Andrea Caponnetto, Michele Piana
and Alessandro Verri, &quot;Are Loss Functions All the Same?&quot;, &lt;a
href=&quot;http://neco.mitpress.org/cgi/content/abstract/16/5/1063&quot;&gt;&lt;cite&gt;Neural
Computation&lt;/cite&gt; &lt;strong&gt;16&lt;/strong&gt; (2004): 1063--1076&lt;/a&gt; [Ans.: pretty
much, apparently, at least if they're reasonably convex]
	&lt;li&gt;Glenn Shafer and Vladimir Vovk, &quot;A tutorial on conformal
prediction&quot;,
&lt;a href=&quot;http://arxiv.org/abs/0706.3188&quot;&gt;arxiv:0706.3188&lt;/a&gt;
	&lt;li&gt;Xuhui Shao, Vladimir Cherkassky and William Li, &quot;Measuring the
VC-Dimension Using Optimized Experimental Design,&quot; &lt;cite&gt;Neural
Computation&lt;/cite&gt; &lt;strong&gt;12&lt;/strong&gt; (2000): 1969--1986
	&lt;li&gt;I. Steinwart, &quot;Consistency of Support Vector Machines and Other
Regularized Kernel Classifiers&quot;, &lt;cite&gt;IEEE Transactions on Information
Theory&lt;/cite&gt; &lt;strong&gt;51&lt;/strong&gt; (2005): 128--142
	&lt;li&gt;Joe Suzuki, &quot;On Strong Consistency of Model Selection in
Classification&quot;, &lt;a href=&quot;http://dx.doi.org/10.1109/TIT.2006.883611&quot;&gt;&lt;cite&gt;IEEE
Transactions on Information Theory&lt;/cite&gt; &lt;strong&gt;52&lt;/strong&gt; (2006):
4767--4774&lt;/a&gt; [Based on information-theoretic criteria]
	&lt;li&gt;Sara A. van de Geer
		&lt;ul&gt;
		&lt;li&gt;&quot;On non-asymptotic bounds for estimation in
generalized linear models with highly correlated
design&quot;, &lt;a href=&quot;http://arxiv.org/abs/0709.0844&quot;&gt;0709.0844&lt;/a&gt;
		&lt;li&gt;&quot;High-dimensional generalized linear models and the
lasso&quot;, &lt;cite&gt;Annals of Statistics&lt;/cite&gt; &lt;strong&gt;36&lt;/strong&gt; (2008): 614--645
= &lt;a href=&quot;http://arxiv.org/abs/0804.0703&quot;&gt;arxiv:0804.0703&lt;/a&gt;
		&lt;/ul&gt;
	&lt;li&gt;V. Vapnik and O. Chapelle, &quot;Bounds on Error Expectation for Support
Vector Machines,&quot; &lt;cite&gt;Neural Computation&lt;/cite&gt; &lt;strong&gt;12&lt;/strong&gt; (2000):
2013--2036
	&lt;li&gt;Vladimir Vovk
		&lt;ul&gt;
		&lt;li&gt;&quot;Non-asymptotic calibration and resolution&quot;, &lt;a
href=&quot;http://arxiv.org/abs/cs.LG/0506004&quot;&gt;cs.LG/0506004&lt;/a&gt;
		&lt;li&gt;&quot;Defensive prediction with expert advice&quot;, &lt;a
href=&quot;http://arxiv.org/abs/cs.LG/0506041&quot;&gt;cs.LG/0506041&lt;/a&gt;
		&lt;li&gt;&quot;Competing with wild prediction rules&quot;,
&lt;a href=&quot;http://arxiv.org/abs/cs.LG/0512059&quot;&gt;cs.LG/0512059&lt;/a&gt;
		&lt;li&gt;&quot;Predictions as statements and
decisions&quot;, &lt;a href=&quot;http://arxiv.org/abs/cs.LG/0606093&quot;&gt;cs.LG/0606093&lt;/a&gt;
		&lt;li&gt;&quot;Leading strategies in competitive on-line prediction&quot;,
&lt;a href=&quot;http://arxiv.org/abs/cs.LG/0607134&quot;&gt;cs.LG/0607134&lt;/a&gt;
		&lt;li&gt;&quot;Competing with Markov prediction strategies&quot;,
&lt;a href=&quot;http://arxiv.org/abs/cs.LG/0607136&quot;&gt;cs.LG/0607136&lt;/a&gt;
		&lt;li&gt;&quot;Metric entropy in competitive on-line prediction&quot;,
&lt;a href=&quot;http://arxiv.org/abs/cs.LG/0609045&quot;&gt;cs.LG/0609045&lt;/a&gt;
		&lt;/ul&gt;
	&lt;li&gt;Vladimir Vovk, Alex Gammerman and Glenn Shafer, &lt;cite&gt;Algorithmic
Learning in a Random World&lt;/cite&gt; [&lt;a
href=&quot;http://www.springeronline.com/sgw/cda/frontpage/0,11855,5-0-22-43142242-0,00.html&quot;&gt;Blurb&lt;/a&gt;]
	&lt;li&gt;Vladimir Vovk, Akimichi Takemura, Glenn Shafer, &quot;Defensive
forecasting&quot;, &lt;a href=&quot;http://arxiv.org/abs/cs.LG/0505083&quot;&gt;cs.LG/0505083&lt;/a&gt;
[Also AIStats '05]
	&lt;li&gt;Alon Zakai and Ya'acov Ritov, &quot;Consistency and Localizability&quot;,
&lt;a href=&quot;http://jmlr.csail.mit.edu/papers/v10/zakai09a.html&quot;&gt;&lt;cite&gt;Journal of Machine Learning Research&lt;/cite&gt; &lt;strong&gt;10&lt;/strong&gt;
(2009): 827--856&lt;/a&gt;
	&lt;li&gt;Tong Zhang
		&lt;ul&gt;
		&lt;li&gt;&quot;Information-Theoretic Upper and Lower Bounds for
Statistical
Estimation&quot;, &lt;a href=&quot;http://dx.doi.org/10.1109/TIT.2005.864439&quot;&gt;&lt;cite&gt;IEEE
Transactions on Information Theory&lt;/cite&gt;
&lt;strong&gt;52&lt;/strong&gt; (2006): 1307--1321&lt;/a&gt;
		&lt;li&gt;&quot;Learning Bounds for Kernel Regression Using Effective
Data
Dimensionality&quot;, &lt;a
href=&quot;http://neco.mitpress.org/cgi/content/abstract/17/9/2077&quot;&gt;&lt;cite&gt;Neural
Computation&lt;/cite&gt; &lt;strong&gt;17&lt;/strong&gt; (2005): 2077--2098&lt;/a&gt;
		&lt;/ul&gt;
	&lt;/ul&gt;
</description>
  </item>
  </channel>
</rss>