<?xml version="1.0"?>
<!-- name="generator" content="blosxom/2.0" -->
<!DOCTYPE rss PUBLIC "-//Netscape Communications//DTD RSS 0.91//EN" "http://my.netscape.com/publish/formats/rss-0.91.dtd">

<rss version="0.91">
  <channel>
    <title>Notebooks</title>
    <link>http://bactra.org/notebooks</link>
    <description>Cosma's Notebooks</description>
    <language>en</language>

  <item>
    <title>Learning Theory (Formal, Computational or Statistical)</title>
    <link>http://bactra.org/notebooks/2012/03/18#learning-theory</link>
    <description>
&lt;P&gt;I qualify it to distinguish this area from the broader field
of &lt;a href=&quot;learning-inference-induction.html&quot;&gt;machine learning&lt;/a&gt;, which
includes much more with lower standards of proof, and from the theory of
learning in organisms, which might be quite different.

&lt;P&gt;The basic set-up is as follows.  We have a bunch of inputs and outputs, and
an unknown relationship between the two.  We do have a class of hypotheses
describing this relationship, and suppose one of them is correct.  (The
hypothesis class is always circumscribed, but may be infinite.)  A learning
algorithm takes in a set of inputs and outputs, its data, and produces a
hypothesis.  Generally we assume the data are generated by some random
process, and the hypothesis changes as the data change.  The key notion is that
of a &lt;em&gt;probably approximately correct&lt;/em&gt; learning algorithm --- one where,
if we supply enough data, we can get a hypothesis with an arbitrarily small
error, with a probability arbitrarily close to one.
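
&lt;P&gt;In symbols, one common way of writing this criterion is sketched below.  The
notation is an assumption introduced for illustration, not taken from the text
above: R(h) is the error of a hypothesis h, \hat{h}_n is the hypothesis the
algorithm returns from n data points, and m(\epsilon, \delta) is the amount of
data that counts as "enough".  For every accuracy \epsilon and every \delta in
(0,1),

```latex
\[
  \Pr\left( R(\hat{h}_n) \leq \epsilon \right) \geq 1 - \delta
  \qquad \text{for all } n \geq m(\epsilon, \delta),
\]
```

where the probability is over the random process generating the data.  The
"approximately" is \epsilon, and the "probably" is \delta.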

&lt;P&gt;Generally, PAC-results concern (1) the existence of a PAC algorithm, (2)
quantifying how much data we need, in terms of either accuracy or reliability,
or (3) devising new PAC algorithms with other desirable properties.  What
frustrates me about this literature, and the reason I don't devote more of my
research to it (aside, of course, from my sheer incompetence) is that almost
all of it assumes the data are statistically independent and identically
distributed.  Then PAC-like results follow essentially from extensions of the
ordinary Law of Large Numbers.  What's really needed, however, is something
more like an &lt;a href=&quot;ergodic-theory.html&quot;&gt;ergodic theorem&lt;/a&gt;, for
suitably-dependent data.  That, however,
gets &lt;a href=&quot;dependent-learning.html&quot;&gt;its own notebook&lt;/a&gt;.

&lt;P&gt;An interesting question (which I learned of from Vidyasagar's book) has to
do with the difference between distribution-free and distribution-dependent
bounds.  Generally, the latter are sharper, sometimes much sharper, but this
comes at the price of making more or less strong parametric assumptions about
the distribution.  (One might indeed think of the theory of parametric
statistical inference as learning theory with &lt;em&gt;very&lt;/em&gt; strong
distributional assumptions.)  However, even in the distribution-free set-up,
we &lt;em&gt;have&lt;/em&gt; a whole bunch of samples from the distribution, and
non-parametric density estimation is certainly possible --- could one, e.g.,
improve the bounds by using half the sample to estimate the distribution, and
then applying a distribution-dependent bound?  Or will the uncertainty in the
distributional estimate necessarily kill any advantage we might get from
learning about the distribution?  It feels like the latter would say something
pretty deep (and depressing) about the whole project of observational
science...
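
&lt;P&gt;A toy illustration of the gap, which does not settle the question above:
for the mean of n IID variables taking values in [0,1], Hoeffding's inequality
gives a distribution-free confidence radius, while Bernstein's inequality gives
one that depends on the variance.  The sketch below assumes the variance is
&lt;em&gt;known&lt;/em&gt;, which is exactly the kind of distributional knowledge at issue;
plugging in an estimated variance would need its own error accounted for.  The
function names and the example numbers are hypothetical.

```python
import math


def hoeffding_radius(n, delta):
    """Two-sided Hoeffding radius for the mean of n IID variables in [0, 1]."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))


def bernstein_radius(n, delta, variance, b=1.0):
    """Two-sided Bernstein radius, assuming a known variance.

    Assumes every variable stays within b of its mean.  Solves
        2 * exp(-n * t**2 / (2 * variance + 2 * b * t / 3)) = delta
    for t, which is a quadratic equation in t.
    """
    log_term = math.log(2.0 / delta)
    a = b * log_term / (3.0 * n)
    return a + math.sqrt(a * a + 2.0 * variance * log_term / n)


if __name__ == "__main__":
    n, delta = 1000, 0.05
    print("Hoeffding radius:", hoeffding_radius(n, delta))
    # 0.25 is the largest variance possible for a variable in [0, 1].
    for variance in (0.25, 0.05, 0.01):
        print("variance", variance, "Bernstein radius:",
              bernstein_radius(n, delta, variance))
```

When the variance is small the variance-dependent radius is much tighter; at
the largest possible variance it is no better than the distribution-free one.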


&lt;P&gt;See also:
	&lt;a href=&quot;concentration-of-measure.html&quot;&gt;Concentration of Measure&lt;/a&gt;;
	&lt;a href=&quot;decision-theory.html&quot;&gt;Decision Theory&lt;/a&gt;;
	&lt;a href=&quot;empirical-process-theory.html&quot;&gt;Empirical Process Theory&lt;/a&gt;;
	&lt;a href=&quot;ensemble-ml.html&quot;&gt;Ensemble Methods in Machine Learning&lt;/a&gt;;
	&lt;a href=&quot;bayesian-consistency.html&quot;&gt;Frequentist Consistency of Bayesian Procedures&lt;/a&gt;;
	&lt;a href=&quot;low-regret-learning.html&quot;&gt;Low-Regret Learning&lt;/a&gt;;
	&lt;a href=&quot;statistics.html&quot;&gt;Statistics&lt;/a&gt;;
	&lt;a href=&quot;structured-data.html&quot;&gt;Statistics of Structured Data&lt;/a&gt;;
	&lt;a href=&quot;universal-prediction.html&quot;&gt;Universal Prediction Algorithms&lt;/a&gt;


&lt;ul&gt;Recommended, bigger picture:
	&lt;li&gt;&lt;a href=&quot;http://www.kyb.mpg.de/~bousquet/&quot;&gt;Olivier Bousquet&lt;/a&gt;,
&lt;a href=&quot;http://www.lri.fr/~bouchero/&quot;&gt;St&amp;eacute;phane Boucheron&lt;/a&gt;
and &lt;a href=&quot;http://www.econ.upf.es/~lugosi/&quot;&gt;G&amp;aacute;bor Lugosi&lt;/a&gt;,
&quot;Introduction to Statistical Learning Theory&quot;
[&lt;a href=&quot;http://www.stat.cmu.edu/~larry/=sml2008/BBL.pdf&quot;&gt;PDF&lt;/a&gt;.  39 pp.
review on how to bound the error of your learning algorithms.]
	&lt;li&gt;Nicolo Cesa-Bianchi and Gabor Lugosi, &lt;cite&gt;Prediction, Learning,
and Games&lt;/cite&gt; [&lt;a href=&quot;../weblog/algae-2008-07.html#prediction&quot;&gt;Mini-review&lt;/a&gt;]
	&lt;li&gt;Nello Cristianini and John Shawe-Taylor, &lt;cite&gt;An Introduction to
Support Vector Machines&lt;/cite&gt; [While SVMs are one particular technology
among others, this book does an excellent job of crisply introducing the
general theory of learning, and showing its practicality.]
	&lt;li&gt;Michael J. Kearns and Umesh V. Vazirani, &lt;cite&gt;An Introduction to
Computational Learning Theory&lt;/cite&gt; [Review: &lt;a
href=&quot;../reviews/kearns-vazirani/&quot;&gt;How to Build a Better Guesser&lt;/a&gt;]
	&lt;li&gt;John Lafferty and Larry Wasserman, &lt;cite&gt;Statistical
Machine Learning&lt;/cite&gt; [Unpublished lecture notes]
	&lt;li&gt;&lt;a href=&quot;http://people.ee.duke.edu/~maxim/&quot;&gt;Maxim Raginsky&lt;/a&gt;, &lt;a href=&quot;http://people.ee.duke.edu/~maxim/teaching/spring11/&quot;&gt;Statistical Learning Theory&lt;/a&gt; [Class webpage, with
excellent notes and further readings]
	&lt;li&gt;V. N. Vapnik, &lt;cite&gt;The Nature of Statistical Learning
Theory&lt;/cite&gt; [Review: &lt;a href=&quot;../reviews/vapnik-nature/&quot;&gt;A Useful Biased
Estimator&lt;/a&gt;]
	&lt;li&gt;Mathukumalli Vidyasagar, &lt;cite&gt;A Theory of Learning and
Generalization: With Applications to Neural Networks and Control Systems&lt;/cite&gt;
[&lt;a href=&quot;../weblog/algae-209-01.html#vidyasagar&quot;&gt;Mini-review&lt;/a&gt;.]
	&lt;/ul&gt;

&lt;ul&gt;Recommended, close-ups (very misc.):
	&lt;li&gt;Terrence M. Adams and Andrew B. Nobel, &quot;Uniform Approximation of Vapnik-Chervonenkis Classes&quot;, &lt;a href=&quot;http://arxiv.org/abs/1010.4515&quot;&gt;arxiv:1010.4515&lt;/a&gt;
	&lt;li&gt;Alekh Agarwal, John C. Duchi, Peter L. Bartlett, Clement Levrard, &quot;Oracle inequalities for computationally budgeted model selection&quot; [&lt;a href=&quot;http://colt2011.sztaki.hu/colt2011_submission_82.pdf&quot;&gt;COLT 2011&lt;/a&gt;]
	&lt;li&gt;Moulinath Banerjee, &quot;Covering Numbers and VC dimension&quot; [review
notes, UM statistics department, 2004; &lt;a
href=&quot;http://www.stat.lsa.umich.edu/~jizhu/LearningTheory/MouliCNVC.ps&quot;&gt;PS&lt;/a&gt;]
	&lt;li&gt;Peter L. Bartlett and Shahar Mendelson, &quot;Rademacher and Gaussian
Complexities: Risk Bounds and Structural
Results&quot;, &lt;a href=&quot;http://jmlr.csail.mit.edu/papers/v3/bartlett02a.html&quot;&gt;&lt;cite&gt;Journal
of Machine Learning Research&lt;/cite&gt; &lt;strong&gt;3&lt;/strong&gt; (2002): 463--482&lt;/a&gt;
	&lt;li&gt;Jonathan Baxter, &quot;A Model of Inductive Bias Learning,&quot;
&lt;cite&gt;Journal of Artificial Intelligence Research&lt;/cite&gt; &lt;strong&gt;12&lt;/strong&gt;
(2000): 149--198 [How to learn what class of hypotheses you should be trying to
use, i.e., your inductive bias.  Assumes independence, again.]
	&lt;li&gt;Olivier Bousquet and Andr&amp;eacute; Elisseeff, &quot;Stability and
Generalization&quot;, &lt;a href=&quot;http://jmlr.csail.mit.edu/papers/v2/bousquet02a.html&quot;&gt;&lt;cite&gt;Journal
of Machine Learning Research&lt;/cite&gt; &lt;strong&gt;2&lt;/strong&gt; (2002): 499--526&lt;/a&gt; [A
very cool paper, resting on the idea that if your learning algorithm is stable
under small perturbations of the data, and you have a lot of data, then how
well it predicts the training data has to be a good estimate of how well it
will predict future data.  Clearly related to Pedro Domingos's
&quot;process-oriented evaluation&quot;, which they don't mention, but more formal.]
	&lt;li&gt;N. Cesa-Bianchi, A. Conconi and C. Gentile, &quot;On the Generalization
Ability of On-Line Learning
Algorithms&quot;, &lt;a href=&quot;http://dx.doi.org/10.1109/TIT.2004.833339&quot;&gt;&lt;cite&gt;IEEE
Transactions on Information Theory&lt;/cite&gt; &lt;strong&gt;50&lt;/strong&gt; (2004):
2050--2057&lt;/a&gt;
	&lt;li&gt;Lee-Ad Gottlieb, Aryeh Kontorovich, Robert Krauthgamer, &quot;Efficient Regression in Metric Spaces via Approximate Lipschitz Extension&quot;, &lt;a href=&quot;http://arxiv.org/abs/1111.4470&quot;&gt;arxiv:1111.4470&lt;/a&gt;
	&lt;li&gt;&lt;a href=&quot;http://www.cs.berkeley.edu/%7Ejordan/&quot;&gt;Michael
Jordan&lt;/a&gt;, &lt;a href=&quot;http://www.cs.berkeley.edu/~jordan/courses/210B-spring08/&quot;&gt;Theoretical
Statistics 210B&lt;/a&gt; [Link is to the Spring 2008 version of the course, which is
the latest I can find right now.]
	&lt;li&gt;Jing Lei, James Robins, Larry Wasserman, &quot;Efficient Nonparametric Conformal Prediction Regions&quot;, &lt;a href=&quot;http://arxiv.org/abs/1111.1418&quot;&gt;arxiv:1111.1418&lt;/a&gt;
	&lt;li&gt;Gabor Lugosi, Shahar Mendelson and Vladimir Koltchinskii, &quot;A note
on the richness of convex hulls of VC classes&quot;, &lt;cite&gt;Electronic Communications
in Probability&lt;/cite&gt; &lt;strong&gt;8&lt;/strong&gt; (2003): 18 [&quot;We prove the existence of
a class A of subsets of R&lt;sup&gt;d&lt;/sup&gt; of VC dimension 1 such that the symmetric
convex hull F of the class of characteristic functions of sets in A is rich in
the following sense. For any absolutely continuous probability measure m on
R&lt;sup&gt;d&lt;/sup&gt;, measurable set B and h &gt;0, there exists a function f in F such
that the measure of the symmetric difference of B and the set where f is
positive is less than h.&quot;  The astonishingly simple proof turns on the Borel
isomorphism theorem, which says that the Borel sets of any complete,
separable metric space are in one-to-one correspondence with the Borel sets of
the unit interval.]
	&lt;li&gt;David Pollard
		&lt;ul&gt;
		&lt;li&gt;&quot;Asymptotics via Empirical Processes&quot;,
&lt;cite&gt;Statistical Science&lt;/cite&gt; &lt;strong&gt;4&lt;/strong&gt; (1989): 341--354
		&lt;li&gt;&lt;cite&gt;Empirical Processes: Theory and Applications&lt;/cite&gt;
[&lt;a href=&quot;../weblog/algae-2008-07.html#pollard&quot;&gt;Review&lt;/a&gt;]
		&lt;/ul&gt;
	&lt;li&gt;Alexander Rakhlin, Karthik Sridharan, Ambuj Tewari, &quot;Online
Learning: Random Averages, Combinatorial Parameters, and
Learnability&quot;, &lt;a href=&quot;http://arxiv.org/abs/1006.1138&quot;&gt;arxiv:1006.1138&lt;/a&gt;
[This is not an easy paper to read, but the results are quite deep, and it
repays the effort.]
	&lt;li&gt;Philippe Rigollet and Xin Tong, &quot;Neyman-Pearson classification,
convexity and stochastic constraints&quot;, &lt;cite&gt;Journal of Machine Learning
Research&lt;/cite&gt; &lt;strong&gt;12&lt;/strong&gt; (2011):
2831--2855, &lt;a href=&quot;http://arxiv.org/abs/1102.5750&quot;&gt;arxiv:1102.5750&lt;/a&gt;
	&lt;li&gt;C. Scott and R. Nowak, &quot;A Neyman-Pearson Approach to Statistical
Learning&quot;, &lt;a href=&quot;http://dx.doi.org/10.1109/TIT.2005.856955&quot;&gt;&lt;cite&gt;IEEE
Transactions on Information Theory&lt;/cite&gt; &lt;strong&gt;51&lt;/strong&gt; (2005):
3806--3819&lt;/a&gt; [&lt;a href=&quot;../weblog/645.html&quot;&gt;Comments&lt;/a&gt;]
	&lt;li&gt;Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro and Karthik
Sridharan, &quot;Learnability, Stability and Uniform
Convergence&quot;, &lt;a href=&quot;http://jmlr.csail.mit.edu/papers/v11/shalev-shwartz10a.html&quot;&gt;&lt;cite&gt;Journal
of Machine Learning Research&lt;/cite&gt; &lt;strong&gt;11&lt;/strong&gt; (2010): 2635--2670&lt;/a&gt;
[If one looks at a broader domain than the usual supervised regression or
classification problems, uniform convergence of risks is not required for
learnability, but the existence of a stable learning algorithm is.  Of course
much turns on definitions here, and I'm not completely certain, after only one
reading, that I really buy all of theirs...]
	&lt;li&gt;Nathan Srebro, Karthik Sridharan, Ambuj Tewari, &quot;Smoothness, Low-Noise and Fast Rates&quot;, &lt;a href=&quot;http://arxiv.org/abs/1009.3896&quot;&gt;arxiv:1009.3896&lt;/a&gt;
	&lt;li&gt;J. Michael Steele, &lt;a href=&quot;http://www-stat.wharton.upenn.edu/~steele/Resources/Projects/SequenceProject/SequenceProjectIndex.htm&quot;&gt;The Scary Sequences Project&lt;/a&gt; [Notes
and references on individual-sequence prediction]
	&lt;li&gt;Sara van de Geer, &lt;cite&gt;Empirical Processes in M-Estimation&lt;/cite&gt;
	&lt;li&gt;Ramon van Handel, &quot;The universal Glivenko-Cantelli property&quot;,
&lt;a href=&quot;http://arxiv.org/abs/1009.4434&quot;&gt;arxiv:1009.4434&lt;/a&gt;
	&lt;li&gt;Vladimir Vovk, Alex Gammerman and Glenn Shafer, &lt;cite&gt;Algorithmic
Learning in a Random World&lt;/cite&gt; [&lt;a href=&quot;../weblog/algae-2011-08.html#conformal-prediction&quot;&gt;Mini-review&lt;/a&gt;]
	&lt;li&gt;Xiaojin Zhu, Timothy T. Rogers and Bryan R. Gibson, &quot;Human
Rademacher Complexity&quot;, in &lt;cite&gt;Advances in Neural Information Processing&lt;/cite&gt;, vol. 22 (NIPS 2009) [&lt;a href=&quot;http://books.nips.cc/papers/files/nips22/NIPS2009_0266.pdf&quot;&gt;PDF reprint&lt;/a&gt;]
	&lt;/ul&gt;

&lt;ul&gt;Modesty forbids me to recommend:
	&lt;li&gt;Daniel J. McDonald, CRS and Mark Schervish, &quot;Estimated VC Dimension
for Risk Bounds&quot;, &lt;a href=&quot;http://arxiv.org/abs/1111.3404&quot;&gt;arxiv:1111.3404&lt;/a&gt;
[&lt;a href=&quot;../weblog/839.html&quot;&gt;More&lt;/a&gt;]
	&lt;/ul&gt;


&lt;ul&gt;To read:
	&lt;li&gt;Andris Ambainis, &quot;Probabilistic and Team PFIN-type Learning:
General Properties&quot;, &lt;a
href=&quot;http://arxiv.org/abs/cs.LG/0504001&quot;&gt;cs.LG/0504001&lt;/a&gt;
	&lt;li&gt;Jean-Yves Audibert, &quot;Fast learning rates in statistical inference
through aggregation&quot;, &lt;cite&gt;Annals of Statistics&lt;/cite&gt; &lt;strong&gt;37&lt;/strong&gt;
(2009): 1591--1646, &lt;a
href=&quot;http://arxiv.org/abs/math.ST/0703854&quot;&gt;math.ST/0703854&lt;/a&gt;
	&lt;li&gt;Jean-Yves Audibert and Olivier Bousquet, &quot;Combining PAC-Bayesian
and Generic Chaining
Bounds&quot;, &lt;a href=&quot;http://jmlr.csail.mit.edu/papers/v8/audibert07a.html&quot;&gt;&lt;cite&gt;Journal
of Machine Learning Research&lt;/cite&gt; &lt;strong&gt;8&lt;/strong&gt; (2007): 863--889&lt;/a&gt;
	&lt;li&gt;Andrew R. Barron, Albert Cohen, Wolfgang Dahmen, Ronald A. DeVore, &quot;Approximation and learning by greedy algorithms&quot;, &lt;cite&gt;Annals of Statistics&lt;/cite&gt; &lt;strong&gt;36&lt;/strong&gt; (2008): 64--94, &lt;a href=&quot;http://arxiv.org/abs/0803.1718&quot;&gt;arxiv:0803.1718&lt;/a&gt;
	&lt;li&gt;G. Biau and L. Devroye, &quot;A Note on Density Model Size Testing&quot;,
&lt;cite&gt;IEEE Transactions on Information Theory&lt;/cite&gt; &lt;strong&gt;50&lt;/strong&gt;
(2004): 576--581 [Testing which member of a nested family of model classes you
need to use]
	&lt;li&gt;Gilles Blanchard, Olivier Bousquet, Pascal Massart, &quot;Statistical performance of support vector machines&quot;, &lt;cite&gt;Annals of Statistics&lt;/cite&gt; &lt;strong&gt;36&lt;/strong&gt; (2008): 489--531, &lt;a href=&quot;http://arxiv.org/abs/0804.0551&quot;&gt;arxiv:0804.0551&lt;/a&gt;
	&lt;li&gt;Gilles Blanchard and Francois Fleuret, &quot;Occam's hammer: a link
between randomized learning and multiple testing FDR
control&quot;, &lt;a href=&quot;http://arxiv.org/abs/math.ST/0608713&quot;&gt;math.ST/0608713&lt;/a&gt;
	&lt;li&gt;Arnaud Buhot and Mirta B. Gordon, &quot;Storage Capacity of the
Tilinglike Learning Algorithm,&quot; &lt;a
href=&quot;http://arxiv.org/abs/cond-mat/0008162&quot;&gt;cond-mat/0008162&lt;/a&gt;
	&lt;li&gt;Arnaud Buhot, Mirta B. Gordon and Jean-Pierre Nadal, &quot;Rigorous
Bounds to Retarded Learning,&quot; &lt;a
href=&quot;http://arxiv.org/abs/cond-mat/0201256&quot;&gt;cond-mat/0201256&lt;/a&gt;
	&lt;li&gt;Andrea Caponnetto and Alexander Rakhlin, &quot;Stability Properties of
Empirical Risk Minimization over Donsker
Classes&quot;, &lt;a
href=&quot;http://jmlr.csail.mit.edu/papers/volume7/caponnetto06a/caponnetto06a.pdf&quot;&gt;&lt;cite&gt;Journal
of Machine Learning Research&lt;/cite&gt; &lt;strong&gt;7&lt;/strong&gt; (2006): 2565--2583&lt;/a&gt;
	&lt;li&gt;Olivier Catoni
		&lt;ul&gt;
		&lt;li&gt;&quot;Improved Vapnik Cervonenkis bounds&quot;, &lt;a
href=&quot;http://arxiv.org/abs/math.ST/0410280&quot;&gt;math.ST/0410280&lt;/a&gt;
		&lt;li&gt;&lt;cite&gt;Statistical Learning Theory and Stochastic Optimization&lt;/cite&gt;
		&lt;li&gt;&lt;cite&gt;PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning&lt;/cite&gt; [&lt;a href=&quot;http://projecteuclid.org/euclid.lnms/1199996410&quot;&gt;Full-text open access&lt;/a&gt;]
		&lt;/ul&gt;
	&lt;li&gt;Olivier Chapelle, Bernhard Sch&amp;ouml;lkopf and Alexander Zien
(eds.), &lt;cite&gt;Semi-Supervised Learning&lt;/cite&gt;
[&lt;a href=&quot;http://mitpress.mit.edu/0-262-03358-5&quot;&gt;Blurb&lt;/a&gt;]
	&lt;li&gt;Felipe Cucker and Ding Xuan Zhou, &lt;cite&gt;Learning Theory: An
Approximation Theory Viewpoint&lt;/cite&gt;
[&lt;a href=&quot;http://cambridge.org/052186559X&quot;&gt;Blurb&lt;/a&gt;]
	&lt;li&gt;Arnak Dalalyan and Alexandre B. Tsybakov, &quot;Sparse Regression Learning by Aggregation and Langevin Monte-Carlo&quot;, &lt;a href=&quot;http://arxiv.org/abs/0903.1223&quot;&gt;arxiv:0903.1223&lt;/a&gt;
	&lt;li&gt;C. De Mol, E. De Vito, L. Rosasco, &quot;Elastic-Net Regularization in Learning Theory&quot;, &lt;a href=&quot;http://arxiv.org/abs/0807.3423&quot;&gt;arxiv:0807.3423&lt;/a&gt;
	&lt;li&gt;Joshua V Dillon, Krishnakumar Balasubramanian, Guy Lebanon, &quot;Asymptotic Analysis of Generative Semi-Supervised Learning&quot;, &lt;a href=&quot;http://arxiv.org/abs/1003.0024&quot;&gt;arxiv:1003.0024&lt;/a&gt;
	&lt;li&gt;Vitaly Feldman, &quot;Robustness of Evolvability&quot;
[&lt;a href=&quot;http://www.eecs.harvard.edu/~vitaly/papers/F_EvolveRobust_09.pdf&quot;&gt;PDF
preprint&lt;/a&gt;]
	&lt;li&gt;Majid Fozunbal, Ton Kalker, &quot;Decision Making with Side Information and Unbounded Loss Functions&quot;, &lt;a href=&quot;http://arxiv.org/abs/cs/0601115&quot;&gt;arxiv:cs/0601115&lt;/a&gt;
	&lt;li&gt;Yoav Freund, Yishay Mansour and Robert E. Schapire, &quot;Generalization
bounds for averaged classifiers&quot;, &lt;cite&gt;Annals of
Statistics&lt;/cite&gt; &lt;strong&gt;32&lt;/strong&gt; (2004): 1698--1722, &lt;a
href=&quot;http://arxiv.org/abs/math.ST/0410092&quot;&gt;math.ST/0410092&lt;/a&gt;
	&lt;li&gt;Alexander Gammerman, Vladimir Vovk, &quot;Hedging predictions in machine
learning&quot;, &lt;a href=&quot;http://arxiv.org/abs/cs.LG/0611011&quot;&gt;cs.LG/0611011&lt;/a&gt;
	&lt;li&gt;Carl Gold and Peter Sollich, &quot;Model Selection for Support Vector
Machine Classification,&quot; &lt;a
href=&quot;http://arxiv.org/abs/cond-mat/0203334&quot;&gt;cond-mat/0203334&lt;/a&gt;
	&lt;li&gt;Lee-Ad Gottlieb, Leonid Kontorovich, Elchanan Mossel,
&quot;VC bounds on the cardinality of nearly orthogonal function classes&quot;, &lt;a href=&quot;http://arxiv.org/abs/1007.4915&quot;&gt;arxiv:1007.4915&lt;/a&gt;
	&lt;li&gt;&lt;a href=&quot;http://www.cs.cmu.edu/~shanneke/&quot;&gt;Steve Hanneke&lt;/a&gt;
		&lt;ul&gt;
		&lt;li&gt;&quot;Teaching Dimension and the Complexity of 
Active Learning&quot; [&lt;a href=&quot;http://www.cs.cmu.edu/~shanneke/docs/2007/hanneke-teaching-dimension.pdf&quot;&gt;PDF&lt;/a&gt;]
		&lt;li&gt;&quot;A Bound on the Label Complexity of Agnostic Active Learning&quot; [&lt;a href=&quot;http://www.cs.cmu.edu/~shanneke/docs/2007/hanneke-agnostic-active.pdf&quot;&gt;PDF&lt;/a&gt;]
		&lt;li&gt;&quot;The Cost Complexity of Active Learning&quot; [&lt;a href=&quot;http://www.cs.cmu.edu/~shanneke/downloads/cost-complexity-working-notes.pdf&quot;&gt;PDF&lt;/a&gt;]
		&lt;li&gt;&quot;Rates of convergence in active learning&quot;,
&lt;a href=&quot;http://projecteuclid.org/euclid.aos/1291388378&quot;&gt;&lt;cite&gt;Annals of Statistics&lt;/cite&gt; &lt;strong&gt;39&lt;/strong&gt;
(2011): 333--361&lt;/a&gt;
		&lt;/ul&gt;
	&lt;li&gt;Nancy Heckman, &quot;The theory and application of penalized methods or Reproducing Kernel Hilbert Spaces made easy&quot;, &lt;a href=&quot;http://arxiv.org/abs/1111.1915&quot;&gt;arxiv:1111.1915&lt;/a&gt;
	&lt;li&gt;Daniel J. L. Herrmann and Dominik Janzing, &quot;Selection Criterion for
Log-Linear Models Using Statistical Learning
Theory&quot;, &lt;a href=&quot;http://arxiv.org/abs/math.ST/0302079&quot;&gt;math.ST/0302079&lt;/a&gt;
	&lt;li&gt;Don Hush, Clint Scovel and Ingo Steinwart, &quot;Stability of Unstable
Learning Algorithms&quot;, &lt;a href=&quot;http://dx.doi.org/10.1007/s10994-007-5004-z&quot;&gt;&lt;cite&gt;Machine Learning&lt;/cite&gt; &lt;strong&gt;67&lt;/strong&gt; (2007): 197--206&lt;/a&gt;
	&lt;li&gt;Sanjay Jain, Daniel Osherson, James S. Royer and Arun Sharma,
&lt;cite&gt;Systems That Learn: An Introduction to Learning Theory&lt;/cite&gt;
	&lt;li&gt;Yuri Kalnishkan, Volodya Vovk and Michael V. Vyugin, &quot;Loss
functions, complexities, and the Legendre transformation&quot;, &lt;a
href=&quot;http://dx.doi.org/10.1016/j.tcs.2003.11.005&quot;&gt;&lt;cite&gt;Theoretical Computer
Science&lt;/cite&gt; &lt;strong&gt;313&lt;/strong&gt; (2004): 195--207&lt;/a&gt;
	&lt;li&gt;Gerard Kerkyacharian and Dominique Picard, &quot;Thresholding in
Learning Theory&quot;, &lt;a
href=&quot;http://arxiv.org/abs/math.ST/0510271&quot;&gt;math.ST/0510271&lt;/a&gt;
	&lt;li&gt;Jussi Klemela and Enno Mammen, &quot;Empirical risk minimization in inverse problems&quot;, &lt;a href=&quot;http://projecteuclid.org/euclid.aos/1262271621&quot;&gt;&lt;cite&gt;Annals of Statistics&lt;/cite&gt; &lt;strong&gt;38&lt;/strong&gt; (2010): 482--511&lt;/a&gt;
	&lt;li&gt;Vladimir Koltchinskii, &lt;cite&gt;Oracle Inequalities in Empirical
Risk Minimization and Sparse Recovery Problems&lt;/cite&gt; [&lt;a href=&quot;http://www.springer.com/book/978-3-642-22146-0&quot;&gt;Blurb&lt;/a&gt;]
	&lt;li&gt;Natalia Komarova and Igor Rivin, &quot;Mathematics of learning,&quot; &lt;a
href=&quot;http://arXiv.org/abs/math/0105235&quot;&gt;math.PR/0105235&lt;/a&gt;
	&lt;li&gt;Pirkko Kuusela, Daniel Ocone and Eduardo D. Sontag, &quot;Learning
Complexity Dimensions for a Continuous-Time Control System,&quot; &lt;a
href=&quot;http://arxiv.org/abs/math.OC/0012163&quot;&gt;math.OC/0012163&lt;/a&gt;
	&lt;li&gt;John Langford, &quot;Tutorial on Practical Prediction Theory for
Classification&quot;, &lt;a href=&quot;http://jmlr.csail.mit.edu/papers/v6/langford05a.html&quot;&gt;&lt;cite&gt;Journal of Machine Learning Research&lt;/cite&gt;
&lt;strong&gt;6&lt;/strong&gt; (2005): 273--306&lt;/a&gt; [For the PAC-Bayesian result]
	&lt;li&gt;A. Lecchini-Visintini, J. Lygeros, J. Maciejowski, &quot;Simulated
Annealing: Rigorous finite-time guarantees for optimization on continuous
domains&quot;, &lt;a href=&quot;http://arxiv.org/abs/0709.2989&quot;&gt;0709.2989&lt;/a&gt;
	&lt;li&gt;Guillaume Lecu&amp;eacute;, &quot;Simultaneous adaptation to the margin and to
complexity in classification&quot;, &lt;a
href=&quot;http://arxiv.org/abs/math.ST/0509696&quot;&gt;math.ST/0509696&lt;/a&gt;
	&lt;li&gt;Guillaume Lecu&amp;eacute; and Shahar Mendelson
		&lt;ul&gt;
		&lt;li&gt;&quot;Aggregation via empirical risk minimization&quot;, &lt;a href=&quot;http://dx.doi.org/10.1007/s00440-008-0180-8&quot;&gt;&lt;cite&gt;Probability Theory and
Related Fields&lt;/cite&gt; &lt;strong&gt;145&lt;/strong&gt; (2009): 591--613&lt;/a&gt;
		&lt;li&gt;&quot;Sharper lower bounds on the performance of the empirical risk minimization algorithm&quot;, &lt;a href=&quot;http://projecteuclid.org/euclid.bj/1281099877&quot;&gt;&lt;cite&gt;Bernoulli&lt;/cite&gt; &lt;strong&gt;16&lt;/strong&gt; (2010): 605--613&lt;/a&gt;, &lt;a href=&quot;http://arxiv.org/abs/1102.4983&quot;&gt;arxiv:1102.4983&lt;/a&gt;
		&lt;/ul&gt;
	&lt;li&gt;Feng Liang, Sayan Mukherjee, Mike West, &quot;The Use of Unlabeled Data
in Predictive
Modeling&quot;, &lt;a href=&quot;http://arxiv.org/abs/0710.4618&quot;&gt;arxiv:0710.4618&lt;/a&gt;
= &lt;cite&gt;Statistical Science&lt;/cite&gt; &lt;strong&gt;22&lt;/strong&gt; (2007): 189--205
	&lt;li&gt;Gabor Lugosi and Marten Wegkamp, &quot;Complexity regularization via
localized random penalties&quot;, &lt;cite&gt;Annals of
Statistics&lt;/cite&gt; &lt;strong&gt;32&lt;/strong&gt; (2004): 1679--1697 = &lt;a
href=&quot;http://arxiv.org/abs/math.ST/0410091&quot;&gt;math.ST/0410091&lt;/a&gt;
	&lt;li&gt;Malik Magdon-Ismail, &quot;Permutation Complexity Bound on Out-Sample
Error&quot;, NIPS 2010 [&lt;a href=&quot;http://books.nips.cc/papers/files/nips23/NIPS2010_0070.pdf&quot;&gt;PDF reprint&lt;/a&gt;]
	&lt;li&gt;D&amp;ouml;rthe Malzahn and Manfred Opper, &quot;A statistical physics approach for the analysis of machine learning algorithms on real data&quot;, &lt;a href=&quot;http://dx.doi.org/10.1088/1742-5468/2005/11/P11001&quot;&gt;&lt;cite&gt;Journal of Statistical Mechanics: Theory and Experiment&lt;/cite&gt; (2005): P11001&lt;/a&gt;
	&lt;li&gt;Pascal Massart and &amp;Eacute;lodie N&amp;eacute;d&amp;eacute;lec, &quot;Risk
bounds for statistical
learning&quot;, &lt;a href=&quot;http://arxiv.org/abs/math.ST/0702683&quot;&gt;math.ST/0702683&lt;/a&gt;
= &lt;a href=&quot;http://dx.doi.org/10%2E1214/009053606000000786&quot;&gt;&lt;cite&gt;Annals of
Statistics&lt;/cite&gt;
&lt;strong&gt;34&lt;/strong&gt; (2006): 2326--2366&lt;/a&gt;
	&lt;li&gt;Andreas Maurer, &quot;A Note on the PAC Bayesian Theorem&quot;, &lt;a
href=&quot;http://arxiv.org/abs/cs.LG/0411099&quot;&gt;cs.LG/0411099&lt;/a&gt;
	&lt;li&gt;Shahar Mendelson
		&lt;ul&gt;
		&lt;li&gt;&quot;Improving the Sample Complexity Using Global
Data&quot;, &lt;cite&gt;IEEE Transactions on Information Theory&lt;/cite&gt; &lt;strong&gt;48&lt;/strong&gt;
(2002): 1977--1991
		&lt;li&gt;&quot;Lower Bounds for the Empirical Minimization Algorithm&quot;,
&lt;a href=&quot;http://dx.doi.org/10.1109/TIT.2008.926323&quot;&gt;&lt;cite&gt;IEEE Transactions on
Information Theory&lt;/cite&gt; &lt;strong&gt;54&lt;/strong&gt; (2008): 3797--3803&lt;/a&gt; [&quot;simple
argument ... that under mild geometric assumptions .... the empirical
minimization algorithm cannot yield a uniform error rate that is faster than
$1/\sqrt{k}$ in the function learning setup&quot;.]
		&lt;/ul&gt;
	&lt;li&gt;Shahar Mendelson and Gideon Schechtman, &quot;The shattering dimension
of sets of linear functionals&quot;, &lt;cite&gt;Annals of
Probability&lt;/cite&gt; &lt;strong&gt;32&lt;/strong&gt; (2004): 1746--1770 = &lt;a
href=&quot;http://arxiv.org/abs/math.PR/0410096&quot;&gt;math.PR/0410096&lt;/a&gt;
	&lt;li&gt;P. Mitra, C. A. Murthy and S. K. Pal, &quot;A probabilistic active
support vector learning algorithm&quot;, &lt;cite&gt;IEEE Transactions on Pattern Analysis
and Machine Intelligence&lt;/cite&gt; &lt;strong&gt;26&lt;/strong&gt; (2004): 413--418
	  &lt;li&gt;Sayan Mukherjee, Partha Niyogi, Tomaso Poggio and
Ryan Rifkin, &quot;Learning theory: stability is sufficient for generalization
and necessary and sufficient for consistency of empirical
risk minimization&quot;, &lt;cite&gt;Advances in Computational Mathematics&lt;/cite&gt;
&lt;strong&gt;25&lt;/strong&gt; (2006): 161--193
[&lt;a
href=&quot;http://cbcl.mit.edu/projects/cbcl/publications/ps/mukherjee-ACM-06.pdf&quot;&gt;PDF&lt;/a&gt;.
Thanks to Shiva Kaul for pointing this out to me.]
	&lt;li&gt;Roland Nilsson, Jose M. Pena, Johan Bjorkegren and Jesper Tegner,
&quot;Consistent Feature Selection for Pattern Recognition in Polynomial Time&quot;,
&lt;cite&gt;Journal of Machine Learning Research&lt;/cite&gt; &lt;strong&gt;8&lt;/strong&gt; (2007): 589--612
	&lt;li&gt;E. Parrado-Hernandez, I. Mora-Jimenez, J. Arenas-Garcia, A. R.
Figueiras-Vidal and A. Navia-Vazquez, &quot;Growing support vector classifiers with
controlled complexity,&quot; &lt;a
href=&quot;http://dx.doi.org/10.1016/S0031-3203(02)00351-5&quot;&gt;&lt;cite&gt;Pattern
Recognition&lt;/cite&gt;
&lt;strong&gt;36&lt;/strong&gt; (2003): 1479--1488&lt;/a&gt;
	&lt;li&gt;Vladimir Pestov, &quot;PAC learnability of a concept class under non-atomic measures: a problem by Vidyasagar&quot;, &lt;a href=&quot;http://arxiv.org/abs/1006.5090&quot;&gt;arxiv:1006.5090&lt;/a&gt;
	&lt;li&gt;Maxim Raginsky, &quot;Achievability results for statistical learning
under communication
constraints&quot;, &lt;a href=&quot;http://arxiv.org/abs/0901.1905&quot;&gt;arxiv:0901.1905&lt;/a&gt;
	&lt;li&gt;Liva Ralaivola, Marie Szafranski, Guillaume Stempfel, &quot;Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary \beta-Mixing Processes&quot;, &lt;a href=&quot;http://arxiv.org/abs/0909.1933&quot;&gt;arxiv:0909.1933&lt;/a&gt;
	&lt;li&gt;Sebastian Risau-Gusman and Mirta B. Gordon
		&lt;ul&gt;
		&lt;li&gt;&quot;Generalization Properties of Finite Size Polynomial
Support Vector Machines,&quot; &lt;a
href=&quot;http://arxiv.org/abs/cond-mat/0002071&quot;&gt;cond-mat/0002071&lt;/a&gt; =?
&quot;Hierarchical Learning in Polynomial Support Vector Machines,&quot; &lt;cite&gt;Machine
Learning&lt;/cite&gt; &lt;strong&gt;46&lt;/strong&gt; (2002): 53--70
		&lt;li&gt;&quot;Learning curves for Soft Margin Classifiers,&quot; &lt;a
href=&quot;http://arxiv.org/abs/cond-mat/0203315&quot;&gt;cond-mat/0203315&lt;/a&gt;
		&lt;/ul&gt;
	&lt;li&gt;Igor Rivin, &quot;The performance of the batch learner algorithm,&quot; &lt;a
href=&quot;http://arxiv.org/abs/cs.LG/0201009&quot;&gt;cs.LG/0201009&lt;/a&gt;
	&lt;li&gt;Dana Ron, &quot;Property Testing: A Learning Theory
Perspective&quot;, &lt;a href=&quot;http://dx.doi.org/10.1561/2200000004&quot;&gt;&lt;cite&gt;Foundations and Trends in Machine Learning&lt;/cite&gt;
&lt;strong&gt;1&lt;/strong&gt; (2008): 307--402&lt;/a&gt;
[&lt;a href=&quot;http://www.eng.tau.ac.il/~danar/Public-pdf/tut.pdf&quot;&gt;PDF preprint&lt;/a&gt;]
	&lt;li&gt;Benjamin I. P. Rubinstein, Aleksandr Simma, &quot;On the Stability of Empirical Risk Minimization in the Presence of Multiple Risk Minimizers&quot;, &lt;a href=&quot;http://arxiv.org/abs/1002.2044&quot;&gt;arxiv:1002.2044&lt;/a&gt;
	&lt;li&gt;Cynthia Rudin, &quot;Stability Analysis for Regularized Least Squares
Regression&quot;, &lt;a href=&quot;http://arxiv.org/abs/cs.LG/0502016&quot;&gt;cs.LG/0502016&lt;/a&gt;
	&lt;li&gt;Lorenzo Rosasco, Ernesto De Vito, Andrea Caponnetto, Michele Piana
and Alessandro Verri, &quot;Are Loss Functions All the Same?&quot;, &lt;a
href=&quot;http://neco.mitpress.org/cgi/content/abstract/16/5/1063&quot;&gt;&lt;cite&gt;Neural
Computation&lt;/cite&gt; &lt;strong&gt;16&lt;/strong&gt; (2004): 1063--1076&lt;/a&gt; [Ans.: pretty
much, apparently, at least if they're reasonably convex]
	&lt;li&gt;Yevgeny Seldin, Nicol&amp;ograve; Cesa-Bianchi, Fran&amp;ccedil;ois Laviolette, Peter Auer, John Shawe-Taylor, Jan Peters, &quot;PAC-Bayesian Analysis of the Exploration-Exploitation Trade-off&quot;, &lt;a href=&quot;http://arxiv.org/abs/1105.4585&quot;&gt;arxiv:1105.4585&lt;/a&gt;
	&lt;li&gt;Glenn Shafer and Vladimir Vovk, &quot;A tutorial on conformal
prediction&quot;,
&lt;a href=&quot;http://arxiv.org/abs/0706.3188&quot;&gt;arxiv:0706.3188&lt;/a&gt;
	&lt;li&gt;Xuhui Shao, Vladimir Cherkassky and William Li, &quot;Measuring the
VC-Dimension Using Optimized Experimental Design,&quot; &lt;cite&gt;Neural
Computation&lt;/cite&gt; &lt;strong&gt;12&lt;/strong&gt; (2000): 1969--1986
	&lt;li&gt;Xiaotong Shen and Lifeng Wang, &quot;Generalization error for multi-class margin classification&quot;, &lt;cite&gt;Electronic Journal of Statistics&lt;/cite&gt;
&lt;strong&gt;1&lt;/strong&gt; (2007): 307--330, &lt;a href=&quot;http://arxiv.org/abs/0708.3556&quot;&gt;arxiv:0708.3556&lt;/a&gt;
	&lt;li&gt;I. Steinwart, &quot;Consistency of Support Vector Machines and Other
Regularized Kernel Classifiers&quot;, &lt;cite&gt;IEEE Transactions on Information
Theory&lt;/cite&gt; &lt;strong&gt;51&lt;/strong&gt; (2005): 128--142
	&lt;li&gt;Ingo Steinwart and Andreas Christmann, &lt;cite&gt;Support
Vector Machines&lt;/cite&gt;
	&lt;li&gt;Joe Suzuki, &quot;On Strong Consistency of Model Selection in
Classification&quot;, &lt;a href=&quot;http://dx.doi.org/10.1109/TIT.2006.883611&quot;&gt;&lt;cite&gt;IEEE
Transactions on Information Theory&lt;/cite&gt; &lt;strong&gt;52&lt;/strong&gt; (2006):
4767--4774&lt;/a&gt; [Based on information-theoretic criteria]
	&lt;li&gt;Sara A. van de Geer
		&lt;ul&gt;
		&lt;li&gt;&quot;On non-asymptotic bounds for estimation in
generalized linear models with highly correlated
design&quot;, &lt;a href=&quot;http://arxiv.org/abs/0709.0844&quot;&gt;arxiv:0709.0844&lt;/a&gt;
		&lt;li&gt;&quot;High-dimensional generalized linear models and the
lasso&quot;, &lt;cite&gt;Annals of Statistics&lt;/cite&gt; &lt;strong&gt;36&lt;/strong&gt; (2008): 614--645
= &lt;a href=&quot;http://arxiv.org/abs/0804.0703&quot;&gt;arxiv:0804.0703&lt;/a&gt;
		&lt;/ul&gt;
	&lt;li&gt;Aad van der Vaart and Jon A. Wellner, &quot;A note on bounds for VC
dimensions&quot;, pp. 103--107 in Christian Houdr&amp;eacute;, Vladimir Koltchinskii,
David M. Mason and Magda Peligrad (eds.), &lt;cite&gt;High Dimensional Probability V:
The Luminy Volume&lt;/cite&gt; [&quot;We provide bounds for the VC dimension of class of
sets formed by unions, intersections, and products of VC classes of
sets&quot;.  &lt;a href=&quot;http://projecteuclid.org/euclid.imsc/1265119264&quot;&gt;Open
access&lt;/a&gt;]
	&lt;li&gt;V. Vapnik and O. Chapelle, &quot;Bounds on Error Expectation for Support
Vector Machines,&quot; &lt;cite&gt;Neural Computation&lt;/cite&gt; &lt;strong&gt;12&lt;/strong&gt; (2000):
2013--2036
	&lt;li&gt;Ulrike von Luxburg, Bernhard Schoelkopf, &quot;Statistical Learning Theory: Models, Concepts, and Results&quot;, &lt;a href=&quot;http://arxiv.org/abs/0810.4752&quot;&gt;arxiv:0810.4752&lt;/a&gt;
	&lt;li&gt;Huan Xu and Shie Mannor, &quot;Robustness and Generalization&quot;,
&lt;a href=&quot;http://dx.doi.org/10.1007/s10994-011-5268-1&quot;&gt;&lt;cite&gt;Machine
Learning&lt;/cite&gt; &lt;strong&gt;86&lt;/strong&gt; (2012): 391--423&lt;/a&gt;,
&lt;a href=&quot;http://arxiv.org/abs/1005.2243&quot;&gt;arxiv:1005.2243&lt;/a&gt;
	&lt;li&gt;Kai Yu and Tong Zhang, &quot;High Dimensional Nonlinear Learning using Local Coordinate Coding&quot;, &lt;a href=&quot;http://arxiv.org/abs/0906.5190&quot;&gt;arxiv:0906.5190&lt;/a&gt;
	&lt;li&gt;Alon Zakai and Ya'acov Ritov, &quot;Consistency and Localizability&quot;,
&lt;a href=&quot;http://jmlr.csail.mit.edu/papers/v10/zakai09a.html&quot;&gt;&lt;cite&gt;Journal of Machine Learning Research&lt;/cite&gt; &lt;strong&gt;10&lt;/strong&gt;
(2009): 827--856&lt;/a&gt;
	&lt;li&gt;Chao Zhang and Dacheng Tao, &quot;Generalization Bound for Infinitely Divisible Empirical Process&quot;, &lt;a href=&quot;http://jmlr.csail.mit.edu/proceedings/papers/v15/zhang11b.html&quot;&gt;&lt;cite&gt;Journal of Machine Learning Research&lt;/cite&gt; Workshops and Conference Proceedings &lt;strong&gt;15&lt;/strong&gt; (2011): 864--872&lt;/a&gt;
	&lt;li&gt;Tong Zhang
		&lt;ul&gt;
		&lt;li&gt;&quot;Covering Number Bounds of Certain Regularized Linear
Function
Classes&quot;, &lt;a href=&quot;http://jmlr.csail.mit.edu/papers/v2/zhang02b.html&quot;&gt;&lt;cite&gt;Journal
of Machine Learning Research&lt;/cite&gt;
&lt;strong&gt;2&lt;/strong&gt; (2002): 527--550&lt;/a&gt;
		&lt;li&gt;&quot;Learning Bounds for Kernel Regression Using Effective
Data
Dimensionality&quot;, &lt;a
href=&quot;http://neco.mitpress.org/cgi/content/abstract/17/9/2077&quot;&gt;&lt;cite&gt;Neural
Computation&lt;/cite&gt; &lt;strong&gt;17&lt;/strong&gt; (2005): 2077--2098&lt;/a&gt;
		&lt;li&gt;&quot;Information-Theoretic Upper and Lower Bounds for
Statistical
Estimation&quot;, &lt;a href=&quot;http://dx.doi.org/10.1109/TIT.2005.864439&quot;&gt;&lt;cite&gt;IEEE
Transactions on Information Theory&lt;/cite&gt;
&lt;strong&gt;52&lt;/strong&gt; (2006): 1307--1321&lt;/a&gt;
		&lt;/ul&gt;
	&lt;/ul&gt;
</description>
  </item>
  </channel>
</rss>
