<?xml version="1.0"?>
<!-- name="generator" content="blosxom/2.0" -->
<!DOCTYPE rss PUBLIC "-//Netscape Communications//DTD RSS 0.91//EN" "http://my.netscape.com/publish/formats/rss-0.91.dtd">

<rss version="0.91">
  <channel>
    <title>Notebooks</title>
    <link>http://bactra.org/notebooks</link>
    <description>Cosma's Notebooks</description>
    <language>en</language>

  <item>
    <title>Model Selection</title>
    <link>http://bactra.org/notebooks/2010/01/04#model-selection</link>
    <description>
&lt;P&gt;(Reader, please make your own suitably awful pun about the different senses
of &quot;model selection&quot; here, as a discouragement to those finding this page
through prurient searching.  Thank you.)

&lt;P&gt;In &lt;a href=&quot;statistics.html&quot;&gt;statistics&lt;/a&gt;
and &lt;a href=&quot;learning-inference-induction.html&quot;&gt;machine learning&lt;/a&gt;, &quot;model
selection&quot; is the problem of picking among different mathematical models which
all purport to describe the same data set.  This notebook will not (for now)
give advice on it; as usual, it's more of a place to organize my thoughts and
references...

&lt;P&gt;Classification of approaches to model selection (probably not really
exhaustive but I can't think of others, right now):
&lt;dl&gt;
&lt;dt&gt;Direct optimization of some measure of goodness of fit or risk on training
data.&lt;/dt&gt;
&lt;dd&gt;Seems implicit in a lot of work which points to marginal improvements in
&quot;the proportion of variance explained&quot;, mis-classification rates, &quot;perplexity&quot;,
etc.  Often, also, a recipe for over-fitting and chasing snarks.  What's wanted
is (almost always) some way of measuring the ability to generalize to new data,
and in-sample performance is a biased estimate of this.  Still,
with &lt;em&gt;enough&lt;/em&gt; data, if the gods
of &lt;a href=&quot;ergodic-theory.html&quot;&gt;ergodicity&lt;/a&gt; are kind, in-sample performance
is representative of generalization performance, so perhaps this will work
asymptotically, though in many cases the researcher will never even glimpse
Asymptopia across the Jordan.&lt;/dd&gt;

&lt;dt&gt;Optimize fit with model-dependent penalty&lt;/dt&gt;
&lt;dd&gt;Add on a term to each model which supposedly indicates its ability to
over-fit.  (Adjusted R^2, AIC, BIC, ..., all do this in terms of the number of
parameters.)  Sounds reasonable, but I wonder how many actually work better, in
practice, than direct optimization.  (See Domingos for some depressing evidence
on this score.)&lt;/dd&gt;
&lt;dd&gt;Classical two-part &lt;a href=&quot;mdl.html&quot;&gt;minimum description length&lt;/a&gt;
methods were penalties; I don't yet understand one-part MDL.&lt;/dd&gt;

&lt;dt&gt;Penalties which depend on the model &lt;em&gt;class&lt;/em&gt;&lt;/dt&gt;
&lt;dd&gt;Measure the capacity of a class of models to over-fit;
penalize &lt;em&gt;all&lt;/em&gt; models in that class accordingly, regardless of their
individual properties.  Outstanding example: Vapnik's &quot;structural risk
minimization&quot; (provably consistent under some circumstances).  Only
sporadically coincides with *IC-type penalties based on the number of
parameters.&lt;/dd&gt;

&lt;dt&gt;Cross-validation&lt;/dt&gt;
&lt;dd&gt;Estimate the ability to generalize to different data by, in fact, using
different data. Maybe the &quot;industry standard&quot; of machine learning.  Query, how
are we to know how much different data to use?&lt;/dd&gt;

&lt;dd&gt;Query, how are we to cross-validate when we have complex, relational data?
That is, I understand how to do it for independent samples, and I even
understand how to do it for &lt;a href=&quot;time-series.html&quot;&gt;time series&lt;/a&gt;, but I
do not understand how to do it
for &lt;a href=&quot;network-data-analysis.html&quot;&gt;networks&lt;/a&gt;, and I don't think I am
alone in this.  (Well, I understand how to do it for Erdos-Renyi networks,
because that's back to independent samples...)&lt;/dd&gt;

&lt;dt&gt;The method of sieves&lt;/dt&gt;
&lt;dd&gt;Directly optimize the fit, but within a constrained
class of models; relax the constraint as the amount of data grows.  If the
constraint is relaxed slowly enough, should converge on the truth.  (Ordinary
parametric inference, within a single model class, is a limiting case where the
constraint is relaxed infinitely slowly, and we converge on the pseudo-truth
within that class [provided we have a consistent estimator].)&lt;/dd&gt;

&lt;dt&gt;Encompassing models&lt;/dt&gt;
&lt;dd&gt;The sampling distribution of any estimator of any model class is a function
of the true distribution.  If the true model class has been well-estimated, it
should be able to predict what other, &lt;em&gt;wrong&lt;/em&gt; model classes will
estimate, but not vice versa.  In this sense the true model class &quot;encompasses
the predictions&quot; of the wrong ones.  (&quot;Truth is the criterion both of itself
and of error.&quot;)&lt;/dd&gt;

&lt;dt&gt;General or covering models&lt;/dt&gt;
&lt;dd&gt;Come up with a single model class which includes all the interesting model
classes as special cases; do ordinary estimation within it.  Getting a
consistent estimator of the additional parameters this introduces is often
non-trivial, and interpretability can be a problem.&lt;/dd&gt;

&lt;dt&gt;Model averaging&lt;/dt&gt;
&lt;dd&gt;Don't try to pick the best or correct model; use them all with different
weights.  Choose the weighting scheme so that if one is best, it will tend to be
more and more influential.  Often I think the improvement is not so much from
using multiple models as from smoothing, since estimates of
&lt;em&gt;the single best model&lt;/em&gt; are going to be more noisy than estimates
of &lt;em&gt;a bunch of models which are all pretty good&lt;/em&gt;.  (This leads
to &lt;a href=&quot;ensemble-ml.html&quot;&gt;ensemble methods&lt;/a&gt;.)&lt;/dd&gt;

&lt;dt&gt;Adequacy testing&lt;/dt&gt;
&lt;dd&gt;The correct model should be able to encode the data as uniform IID noise.
Test whether &quot;residuals&quot;, in the appropriate sense, are IID uniform.  Reject
models which can't hack it.  Possibly none of the models on offer is adequate;
this, too, is informative.  Or: models make specific probabilistic assumptions
(IID Gaussian noise, for example); test those.  Mis-specification testing.&lt;/dd&gt;
&lt;/dl&gt;
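&lt;P&gt;To make a few of these concrete, here is a minimal Python sketch (a toy
setup of my own, not taken from any of the references below) which scores
polynomial regressions of increasing degree on simulated data in three ways:
raw in-sample mean squared error, AIC and BIC under an assumed Gaussian noise
model, and 5-fold cross-validation.  The data-generating curve, the noise
level, the random seeds and the helper names (fit_poly, aic_bic, kfold_cv_mse,
compare) are all illustrative assumptions.
&lt;pre&gt;
```python
import numpy as np


def fit_poly(x, y, degree):
    """Least-squares polynomial fit; returns the coefficient vector."""
    return np.polyfit(x, y, degree)


def mse(coefs, x, y):
    """Mean squared error of the fitted polynomial on (x, y)."""
    return float(np.mean((np.polyval(coefs, x) - y) ** 2))


def aic_bic(train_mse, n, k):
    """AIC and BIC for Gaussian noise, up to a shared additive constant.

    k counts every fitted quantity: the coefficients plus the noise variance.
    """
    neg2loglik = n * np.log(train_mse)
    return neg2loglik + 2 * k, neg2loglik + k * np.log(n)


def kfold_cv_mse(x, y, degree, folds=5, seed=0):
    """Average held-out MSE over a random split into `folds` parts."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    errs = []
    for held_out in np.array_split(idx, folds):
        train = np.setdiff1d(idx, held_out)
        coefs = fit_poly(x[train], y[train], degree)
        errs.append(mse(coefs, x[held_out], y[held_out]))
    return float(np.mean(errs))


def compare(x, y, max_degree=8):
    """One row per degree: (degree, train MSE, AIC, BIC, CV MSE)."""
    n = len(x)
    rows = []
    for d in range(max_degree + 1):
        coefs = fit_poly(x, y, d)
        train = mse(coefs, x, y)
        aic, bic = aic_bic(train, n, d + 2)
        rows.append((d, train, aic, bic, kfold_cv_mse(x, y, d)))
    return rows


# Simulated data: a quadratic plus IID Gaussian noise (an assumption made
# purely for illustration).
rng = np.random.default_rng(42)
x = rng.uniform(-1, 1, 60)
y = 1.0 - 2.0 * x + 3.0 * x ** 2 + rng.normal(0, 0.3, 60)

rows = compare(x, y)
for d, train, aic, bic, cv in rows:
    print(f"degree={d}  train_mse={train:.4f}  AIC={aic:.1f}  "
          f"BIC={bic:.1f}  cv_mse={cv:.4f}")
```
&lt;/pre&gt;
&lt;P&gt;Because the polynomial classes are nested, the in-sample error cannot
increase with the degree, so direct optimization always prefers the largest
model; the penalized and cross-validated scores need not.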

&lt;P&gt;The machine-learning-ish literature on model selection doesn't seem to ever
talk about setting up experiments to select among models; or do I just not read
the right papers there?  (The statistical literature on experimental design
tends to talk about &quot;model discrimination&quot; rather than &quot;model selection&quot;.)

&lt;ul&gt;Recommended, big-picture:
	&lt;li&gt;Leo Breiman, &quot;Heuristics of Instability and Stabilization in Model
Selection,&quot; &lt;a href=&quot;http://dx.doi.org/10.1214/aos/1032181158&quot;&gt;&lt;cite&gt;Annals of Statistics&lt;/cite&gt; &lt;strong&gt;24&lt;/strong&gt; (1996):
2350--2383&lt;/a&gt;
	&lt;li&gt;Gerda Claeskens and Nils Lid Hjort, &lt;cite&gt;Model Selection
and Model Averaging&lt;/cite&gt;
	&lt;li&gt;&lt;a href=&quot;http://www.cs.washington.edu/homes/pedrod/&quot;&gt;Pedro
Domingos&lt;/a&gt;, &quot;The Role of Occam's Razor in Knowledge Discovery,&quot; &lt;cite&gt;Data
Mining and Knowledge Discovery,&lt;/cite&gt; &lt;strong&gt;3&lt;/strong&gt; (1999) [&lt;a
href=&quot;http://www.cs.washington.edu/homes/pedrod/dmkd99.ps.gz&quot;&gt;Online&lt;/a&gt;]
	&lt;li&gt;Trevor Hastie, Robert Tibshirani and Jerome Friedman, &lt;cite&gt;The
Elements of Statistical Learning: Data Mining, Inference, and Prediction&lt;/cite&gt;
	&lt;li&gt;C. R. Rao, Y. Wu, Sadanori Konishi and Rahul Mukerjee, &quot;On Model
Selection&quot;, in P. Lahiri (ed.), &lt;cite&gt;Model Selection&lt;/cite&gt;, pp. 1--64
[Thorough review paper, if from a rather old-school statistical-theory
perspective.  The rest of the volume is too Bayesian to be of interest to
me.  &lt;a href=&quot;http://www.jstor.org/stable/4356163&quot;&gt;JSTOR&lt;/a&gt;]
	&lt;li&gt;Aris Spanos, &quot;Curve-Fitting, the Reliability of Inductive
Inference and the Error-Statistical Approach&quot; [&lt;a
href=&quot;http://www.econ.vt.edu/Faculty/CVs_&amp;_Research/Aris%20Spanos%20-%20Working%20Papers/spanoscurve-fitting.pdf&quot;&gt;PDF
preprint&lt;/a&gt;]
	&lt;li&gt;V. N. (=Vladimir Naumovich) Vapnik, &lt;cite&gt;The Nature of
Statistical Learning Theory&lt;/cite&gt; [&lt;a href=&quot;../reviews/vapnik-nature/&quot;&gt;Review:
A Useful Biased Estimator&lt;/a&gt;]
	&lt;li&gt;Quang H. Vuong, &quot;Likelihood Ratio Tests for Model Selection and
Non-Nested Hypotheses&quot;, &lt;cite&gt;Econometrica&lt;/cite&gt; &lt;strong&gt;57&lt;/strong&gt; (1989):
307--333
	&lt;/ul&gt;

&lt;ul&gt;Recommended, close-ups:
	&lt;li&gt;Sylvain Arlot
		&lt;ul&gt;
		&lt;li&gt;&quot;V-fold cross-validation improved: V-fold
penalization&quot;,
&lt;a href=&quot;http://arxiv.org/abs/0802.0566&quot;&gt;arxiv:0802.0566&lt;/a&gt; [Seeing
cross-validation as a penalization method, and improving it accordingly by
strengthening the penalty term]
		&lt;li&gt;&quot;Model selection by resampling penalization&quot;,
&lt;a href=&quot;http://arxiv.org/abs/0906.3124&quot;&gt;arxiv:0906.3124&lt;/a&gt; =
&lt;cite&gt;Electronic Journal of Statistics&lt;/cite&gt; &lt;strong&gt;3&lt;/strong&gt; (2009):
557--624
		&lt;/ul&gt;
	&lt;li&gt;A. C. Atkinson and A. N. Donev, &lt;cite&gt;Optimum Experimental
Design&lt;/cite&gt; [&lt;a href=&quot;../reviews/atkinson-donev/&quot;&gt;Review&lt;/a&gt;]
	&lt;li&gt;Leo Breiman and Philip Spector, &quot;Submodel Selection and Evaluation
in Regression: The X-Random Case&quot;, &lt;cite&gt;International
Statistical Review&lt;/cite&gt; &lt;strong&gt;60&lt;/strong&gt; (1992): 291--319
[&lt;a href=&quot;http://www.jstor.org/stable/1403680&quot;&gt;JSTOR&lt;/a&gt;]
	&lt;li&gt;Prabir Burman, Edmond Chow and Deborah Nolan, &quot;A cross-validatory method for dependent data&quot;, &lt;a href=&quot;http://dx.doi.org/10.1093/biomet/81.2.351&quot;&gt;&lt;cite&gt;Biometrika&lt;/cite&gt; &lt;strong&gt;81&lt;/strong&gt; (1994): 351--358&lt;/a&gt; [&lt;a href=&quot;http://www.jstor.org/stable/2336965&quot;&gt;JSTOR&lt;/a&gt;]
	&lt;li&gt;Patrick S. Carmack, William R. Schucany, Jeffrey S. Spence, Richard
F. Gunst, Qihua Lin and Robert W. Haley, &quot;Far Casting Cross Validation&quot;
[Leave-one-out CV, with a constant-radius window skipped around each hold-out
point as well; this is designed to deal with correlations in time or in
space.  &lt;a href=&quot;http://smu.edu/statistics/TechReports/TR352.pdf&quot;&gt;PDF
preprint&lt;/a&gt;]
	&lt;li&gt;George Casella and Guido Consonni, &quot;Reconciling Model Selection and
Prediction&quot;, &lt;a href=&quot;http://arxiv.org/abs/0903.3620&quot;&gt;arxiv:0903.3620&lt;/a&gt; [&quot;It
is known that there is a dichotomy in the performance of model selectors. Those
that are consistent (having the &quot;oracle property&quot;) do not achieve the
asymptotic minimax rate for prediction error.  We look at this phenomenon
closely, and argue that the set of parameters on which this dichotomy occurs is
extreme, even pathological, and should not be considered when evaluating model
selectors.  We characterize this set, and show that, when such parameters are
dismissed from consideration, consistency and asymptotic minimaxity can be
attained simultaneously.&quot;  Comment: I agree; they show that you need a truly
bizarre sequence of local alternatives to get this behavior.]
	&lt;li&gt;Nicolo Cesa-Bianchi and Gabor Lugosi, &lt;cite&gt;Prediction, Learning,
and Games&lt;/cite&gt;
[&lt;a href=&quot;../weblog/algae-2008-07.html#prediction&quot;&gt;Mini-review&lt;/a&gt;.  For
avoiding model selection in favor of adaptively-weighted combinations of
models.]
	&lt;li&gt;Snigdhansu Chatterjee, Nitai D. Mukhopadhyay, &quot;Risk and resampling
under model
uncertainty&quot;, &lt;a href=&quot;http://arxiv.org/abs/0805.3244&quot;&gt;arxiv:0805.3244&lt;/a&gt; [an
interesting approach to model averaging with provably good frequentist
properties, via bootstrapping --- for a trivial linear-Gaussian problem; not
clear to me how to generalize]
	&lt;li&gt;Bruce E. Hansen, &quot;Challenges for Econometric Model
Selection&quot;, &lt;cite&gt;Econometric Theory&lt;/cite&gt; &lt;strong&gt;21&lt;/strong&gt; (2005): 60--68
[&quot;Standard econometric model selection methods are based on four fundamental
errors in approach: parametric vision, the assumption of a true
[data-generating process], evaluation based on fit, and ignoring the impact of
model uncertainty on inference. Instead, econometric model selection methods
should be based on a semiparametric vision, models should be viewed as
approximations, models should be evaluated based on their purpose, and model
uncertainty should be incorporated into inference
methods.&quot;  &lt;a href=&quot;http://www.ssc.wisc.edu/~bhansen/papers/et_05.html&quot;&gt;PDF&lt;/a&gt;]
	&lt;li&gt;Marcus Hutter, &quot;The Loss Rank Principle for Model Selection&quot;,
&lt;a href=&quot;http://arxiv.org/abs/math.ST/0702804&quot;&gt;math.ST/0702804&lt;/a&gt; [This is a
simplified form of &lt;a href=&quot;../reviews/mayo-error/&quot;&gt;Deborah Mayo's
&quot;severity&quot;&lt;/a&gt;.]
	&lt;li&gt;Pascal Lavergne and Quang H. Vuong, &quot;Nonparametric Selection of
Regressors: The Nonnested Case&quot;, &lt;cite&gt;Econometrica&lt;/cite&gt; &lt;strong&gt;64&lt;/strong&gt;
(1996): 207--219 [Picking which variables belong in a regression, by looking at
the error of non-parametric kernel
regressions.  &lt;a href=&quot;http://www.jstor.org/stable/2171929&quot;&gt;JSTOR&lt;/a&gt;]
	&lt;li&gt;&lt;a href=&quot;http://web.me.com/pascal.massart/Site/Home.html&quot;&gt;Pascal Massart&lt;/a&gt;, &lt;cite&gt;Concentration Inequalities and Model
Selection&lt;/cite&gt; [Using &lt;a href=&quot;empirical-process-theory.html&quot;&gt;empirical process theory&lt;/a&gt; to get finite-sample, i.e.,
non-asymptotic, risk bounds for various forms
of model selection.  Available for free as
a &lt;a href=&quot;http://eprints.pascal-network.org/archive/00002827/&quot;&gt;large PDF
preprint&lt;/a&gt;.]
	&lt;li&gt;Charles Mitchell and Sara van de Geer, &quot;General Oracle Inequalities
for Model
Selection&quot;, &lt;a href=&quot;http://dx.doi.org/10.1214/08-EJS254&quot;&gt;&lt;cite&gt;Electronic
Journal of Statistics&lt;/cite&gt; &lt;strong&gt;3&lt;/strong&gt; (2009): 176--204&lt;/a&gt; [Analyzes
a data-set splitting scheme (like cross-validation with only one &quot;fold&quot;)]
	&lt;li&gt;&lt;a href=&quot;http://www.mcmaster.ca/economics/racine/&quot;&gt;Jeffrey S. Racine&lt;/a&gt;
		&lt;ul&gt;
		&lt;li&gt;&quot;Feasible Cross-Validatory Model Selection for General Stationary Processes&quot;, &lt;cite&gt;Journal of Applied Econometrics&lt;/cite&gt;
&lt;strong&gt;12&lt;/strong&gt; (1997): 169--179
[&lt;a href=&quot;http://www.jstor.org/stable/2284910&quot;&gt;JSTOR&lt;/a&gt;.  This is closely
related to (maybe algebraically just a special case of?) the familiar trick
from splines of writing the CV criterion in terms of the
hat/influence/projection matrix.]
		&lt;li&gt;&quot;Consistent cross-validatory model-selection for dependent
data: hv-block
cross-validation&quot;, &lt;a href=&quot;http://dx.doi.org/10.1016/S0304-4076(00)00030-0&quot;&gt;&lt;cite&gt;Journal
of Econometrics&lt;/cite&gt; &lt;strong&gt;99&lt;/strong&gt; (2000): 39--61&lt;/a&gt;
		&lt;/ul&gt;
	&lt;li&gt;David J. Spiegelhalter, Nicola G. Best, Bradley P. Carlin and
Angelika van der Linde, &quot;Bayesian Measures of Model Complexity and
Fit&quot;, &lt;cite&gt;Journal of the Royal Statistical Society
B&lt;/cite&gt; &lt;strong&gt;64&lt;/strong&gt; (2002): 583--639
[&lt;a href=&quot;http://www.soe.ucsc.edu/~draper/DIC.pdf&quot;&gt;PDF reprint&lt;/a&gt;]
	&lt;li&gt;Ryan J. Tibshirani and Robert Tibshirani, &quot;A bias correction for
the minimum error rate in
cross-validation&quot;, &lt;a href=&quot;http://dx.doi.org/10.1214/08-AOAS224&quot;&gt;&lt;cite&gt;Annals
of Applied Statistics&lt;/cite&gt; &lt;strong&gt;3&lt;/strong&gt; (2009): 822--829&lt;/a&gt;
= &lt;a href=&quot;http://arxiv.org/abs/0908.2904&quot;&gt;arxiv:0908.2904&lt;/a&gt;
	&lt;li&gt;Sara van de Geer, &lt;cite&gt;Empirical Process Theory in
&lt;/cite&gt;M&lt;cite&gt;-Estimation&lt;/cite&gt;
	&lt;li&gt;Mark J. van der Laan and Sandrine Dudoit, &quot;Unified Cross-Validation
Methodology for Selection Among Estimators and a General Cross-Validated
Adaptive Epsilon-Net Estimator: Finite Sample Oracle Inequalities and Examples&quot;
[&lt;a href=&quot;http://www.bepress.com/ucbbiostat/paper130&quot;&gt;PDF working paper&lt;/a&gt;,
i.e., a 100-page tome.  The first part proves that multi-fold cross-validation
and the like will work for selecting the best estimator out of a finite set of
estimators (provided the loss function is nicely bounded and the data are IID).
The second part ingeniously turns this into a complete estimation procedure, by
effectively creating a discrete sieve and then using CV to say which part of
the sieve to use.  This is a very cool set of results, but (1) the limitations
to bounded loss functions make me nervous, and (2) the formulas appearing in
the finite-sample and even asymptotic bounds are &lt;em&gt;ugly&lt;/em&gt;.  On the other
hand, they &lt;em&gt;have&lt;/em&gt; finite-sample bounds! &amp;mdash; I wonder if the
bounded-and-IID restrictions could be lifted using the techniques in Jiang's
&quot;On Uniform Deviation Bounds&quot; (link and description
under &lt;a href=&quot;learning-theory.html&quot;&gt;Learning Theory&lt;/a&gt;), or those
in &lt;a href=&quot;../weblog/algae-2009-04.html#weak&quot;&gt;Dedecker et al.'s &lt;cite&gt;Weak
Dependence&lt;/cite&gt;&lt;/a&gt;.]
	&lt;li&gt;Aad W. van der Vaart, Sandrine Dudoit and Mark J. van der Laan,
&quot;Oracle inequalities for multi-fold cross
validation&quot;, &lt;a href=&quot;http://dx.doi.org/10.1524/stnd.2006/24.3.351&quot;&gt;&lt;cite&gt;Statistics
and Decisions&lt;/cite&gt; &lt;strong&gt;24&lt;/strong&gt; (2006): 351--371&lt;/a&gt; [Streamlined and
improved versions of the key results from the van der Laan/Dudoit tome.  Thanks
to Prof. van der Vaart for a reprint]
	&lt;/ul&gt;

&lt;ul&gt;To read:
	&lt;li&gt;Sylvain Arlot, &quot;Suboptimality of penalties proportional to the
dimension for model selection in heteroscedastic
regression&quot;, &lt;a href=&quot;http://arxiv.org/abs/0812.3141&quot;&gt;arxiv:0812.3141&lt;/a&gt;
	&lt;li&gt;Sylvain Arlot and Pascal Massart, &quot;Data-driven Calibration of
Penalties for Least-Squares
Regression&quot;, &lt;a href=&quot;http://jmlr.csail.mit.edu/papers/v10/arlot09a.html&quot;&gt;&lt;cite&gt;Journal
of Machine Learning Research&lt;/cite&gt; &lt;strong&gt;10&lt;/strong&gt; (2009): 245--279&lt;/a&gt;
	&lt;li&gt;Maria Maddalena Barbieri and James O. Berger, &quot;Optimal Predictive
Model Selection&quot;, &lt;a
href=&quot;http://arxiv.org/abs/math.ST/0406464&quot;&gt;math.ST/0406464&lt;/a&gt; = &lt;cite&gt;Annals
of Statistics&lt;/cite&gt; &lt;strong&gt;32&lt;/strong&gt; (2004): 870--897 [Unfortunately,
Bayesian]
	&lt;li&gt;Andrew Barron, Lucien Birg&amp;eacute;, and Pascal Massart, &quot;Risk
bounds for model selection via penalization&quot;, &lt;cite&gt;Probability Theory and
Related Fields&lt;/cite&gt; &lt;strong&gt;113&lt;/strong&gt; (1999): 301--413
	&lt;li&gt;Lucien Birg&amp;eacute;
		&lt;ul&gt;
		&lt;li&gt;&quot;The Brouwer Lecture 2005: Statistical estimation with
model
selection&quot;, &lt;a href=&quot;http://arxiv.org/abs/math.ST/0605187&quot;&gt;math.ST/0605187&lt;/a&gt;
		&lt;li&gt;&quot;Model selection for Poisson processes&quot;,
&lt;a href=&quot;http://arxiv.org/abs/math/0609549&quot;&gt;math/0609549&lt;/a&gt;
		&lt;/ul&gt;
	&lt;li&gt;Lucien Birg&amp;eacute; and Pascal Massart
		&lt;ul&gt;
		&lt;li&gt;&quot;Minimal Penalties for Gaussian
Model Selection&quot;, &lt;a
href=&quot;http://dx.doi.org/10.1007/s00440-006-0011-8&quot;&gt;&lt;cite&gt;Probability Theory and
Related Fields&lt;/cite&gt; &lt;strong&gt;138&lt;/strong&gt; (2007): 33--73&lt;/a&gt;
		&lt;li&gt;&quot;From model selection to adaptive estimation&quot;, pp. 55--87
in Pollard, Torgersen and Yang (eds.), &lt;cite&gt;Festschrift for Lucien Le Cam:
Research Papers in Probability and Statistics&lt;/cite&gt; (1997)
		&lt;/ul&gt;
	&lt;li&gt;Borowiak, &lt;cite&gt;Model Discrimination for Nonlinear Regression
Models&lt;/cite&gt;
	&lt;li&gt;P. Burman, &quot;A comparative study of ordinary cross-validation,
v-fold cross-validation and the repeated learning-testing methods&quot;,
&lt;cite&gt;Biometrika&lt;/cite&gt; &lt;strong&gt;76&lt;/strong&gt; (1989): 503--514
	&lt;li&gt;Alain Celisse, &quot;Model selection in density estimation via
cross-validation&quot;, &lt;a href=&quot;http://arxiv.org/abs/0811.0802&quot;&gt;arxiv:0811.0802&lt;/a&gt;
	&lt;li&gt;A. E. Clark and C. G. Troskie, &quot;Time Series and Model Selection&quot;,
&lt;a href=&quot;http://dx.doi.org/10.1080/03610910701884153&quot;&gt;&lt;cite&gt;Communications in Statistics: Simulation and Computation&lt;/cite&gt;
&lt;strong&gt;37&lt;/strong&gt; (2008): 766--771&lt;/a&gt; [Simulation study of the accuracy of
different information criteria]
	&lt;li&gt;Kevin A. Clarke, &quot;A Simple Distribution-Free 
Test for Nonnested Hypotheses&quot; [&lt;a href=&quot;http://www.rochester.edu/college/psc/clarke/ClarkePA.pdf&quot;&gt;PDF preprint&lt;/a&gt;]
	&lt;li&gt;Guilhem Coq, Olivier Alata, Marc Arnaudon and Christian Olivier,
&quot;An improved method for model selection based on Information Criteria&quot;, 
&lt;a href=&quot;http://arxiv.org/abs/math.ST/0702540&quot;&gt;math.ST/0702540&lt;/a&gt;
	&lt;li&gt;Pedro Domingos
		&lt;ul&gt;
		&lt;li&gt;&quot;Process-Oriented Estimation of Generalization Error&quot; [&lt;a href=&quot;http://www.cs.washington.edu/homes/pedrod/papers/ijcai99.pdf&quot;&gt;PDF&lt;/a&gt;]
		&lt;li&gt;&quot;A Process-Oriented Heuristic for Model Selection&quot;
[&lt;a
href=&quot;http://www.cs.washington.edu/homes/pedrod/papers/mlc98.pdf&quot;&gt;PDF&lt;/a&gt;]
		&lt;/ul&gt;
	&lt;li&gt;Sandrine Dudoit and Mark J. van der Laan, &quot;Asymptotics of Cross-Validated Risk Estimation in Estimator Selection and Performance Assessment&quot;,
&lt;cite&gt;Statistical Methodology&lt;/cite&gt; &lt;strong&gt;2&lt;/strong&gt; (2005): 131--154
[&lt;a href=&quot;http://www.bepress.com/ucbbiostat/paper126/&quot;&gt;preprint&lt;/a&gt;]
	&lt;li&gt;Hugo Jair Escalante, Manuel Montes, Luis Enrique Sucar, &quot;Particle
Swarm Model Selection&quot;,
&lt;a href=&quot;http://jmlr.csail.mit.edu/papers/v10/escalante09a.html&quot;&gt;&lt;cite&gt;Journal
of Machine Learning Research&lt;/cite&gt; &lt;strong&gt;10&lt;/strong&gt; (2009): 405--440&lt;/a&gt;
	&lt;li&gt;Jianqing Fan and Runze Li, &quot;Variable Selection via Nonconcave
Penalized Likelihood and its Oracle Properties&quot;, &lt;cite&gt;Journal of
the American Statistical Association&lt;/cite&gt; &lt;strong&gt;96&lt;/strong&gt; (2001): 1348--1360 [&lt;a href=&quot;http://www.orfe.princeton.edu/~jqfan/papers/01/penlike.pdf&quot;&gt;PDF reprint&lt;/a&gt; via Prof. Fan]
	&lt;li&gt;Magalie Fromont, &quot;Model selection by bootstrap penalization for
classification&quot;, &lt;a
href=&quot;http://dx.doi.org/10.1007/s10994-006-7679-y&quot;&gt;&lt;cite&gt;Machine
Learning&lt;/cite&gt;
&lt;strong&gt;66&lt;/strong&gt; (2007): 165--207&lt;/a&gt;
	&lt;li&gt;Christophe Giraud, &quot;Estimation of Gaussian graphs by model
selection&quot;, &lt;a href=&quot;http://arxiv.org/abs/0710.2044&quot;&gt;arxiv:0710.2044&lt;/a&gt;
	&lt;li&gt;Alexander Goldenshluger and Eitan Greenshtein, &quot;Asymptotically
minimax regret procedures in regression model selection and the magnitude of
the dimension
penalty&quot;, &lt;a href=&quot;http://dx.doi.org/10.1214/aos/1015957473&quot;&gt;&lt;cite&gt;Annals of
Statistics&lt;/cite&gt; &lt;strong&gt;28&lt;/strong&gt; (2000): 1620--1637&lt;/a&gt; [Hmmm.  Not sure
how relevant this will be to anything I'd need to do, given the assumptions
they load on.  Via Kevin Kelly.]
	&lt;li&gt;Christian Gourieroux and Alain Monfort, &quot;Testing, Encompassing, and
Simulating Dynamic Econometric Models&quot;, &lt;cite&gt;Econometric Theory&lt;/cite&gt;
&lt;strong&gt;11&lt;/strong&gt; (1995): 195--228 [&lt;a href=&quot;http://www.jstor.org/pss/3532571&quot;&gt;JSTOR&lt;/a&gt;]
	&lt;li&gt;Michael Kearns and Dana Ron, &quot;Algorithmic Stability and
Sanity-Check Bounds for Leave-One-Out Cross-Validation,&quot; &lt;a
href=&quot;http://neco.mitpress.org/cgi/content/abstract/11/6/1427&quot;&gt;&lt;cite&gt;Neural
Computation&lt;/cite&gt; &lt;strong&gt;11&lt;/strong&gt; (1999): 1427--1453&lt;/a&gt;
	&lt;li&gt;Nicholas M. Kiefer and Hwan-Sik Choi, &quot;Robust Model Selection in
Dynamic Models with an Application to Comparing Predictive Accuracy&quot;
[&lt;a href=&quot;http://papers.ssrn.com/sol3/papers.cfm?abstract_id=945144&quot;&gt;SSRN&lt;/a&gt;]
	&lt;li&gt;Sadanori Konishi and Genshiro Kitagawa, &quot;Asymptotic theory for
information criteria in model selection --- functional approach,&quot; &lt;a
href=&quot;http://dx.doi.org/10.1016/S0378-3758(02)00462-7&quot;&gt;&lt;cite&gt;Journal of
Statistical Planning and Inference&lt;/cite&gt; &lt;strong&gt;114&lt;/strong&gt; (2003):
45--61&lt;/a&gt;
	&lt;li&gt;&lt;a href=&quot;http://www.stat.yale.edu/~hl284/&quot;&gt;Hannes Leeb&lt;/a&gt;,
&quot;Conditional Predictive Inference Post Model Selection&quot;, &lt;cite&gt;Annals of
Statistics&lt;/cite&gt; &lt;strong&gt;37&lt;/strong&gt; (2009): 2838--2876
= &lt;a href=&quot;http://arxiv.org/abs/0908.3615&quot;&gt;arxiv:0908.3615&lt;/a&gt; [I heard Leeb
give a talk on this, but I should read the paper]
	&lt;li&gt;Hannes Leeb and Benedikt M. Poetscher
		&lt;ul&gt;
		&lt;li&gt;&quot;Can One Estimate The
Unconditional Distribution of Post-Model-Selection Estimators?&quot;,
&lt;a href=&quot;http://arxiv.org/abs/0704.1584&quot;&gt;arxiv:0704.1584&lt;/a&gt; [They claim the
answer is &quot;No&quot;.]
		&lt;li&gt;&quot;Model Selection and Inference: Facts and Fiction&quot;,
&lt;a href=&quot;http://dx.doi.org/10.1017/S0266466605050036&quot;&gt;&lt;cite&gt;Econometric
Theory&lt;/cite&gt; &lt;strong&gt;21&lt;/strong&gt; (2005): 21--59&lt;/a&gt;
[&lt;a href=&quot;http://www.stat.yale.edu/~hl284/ETAnniv.pdf&quot;&gt;PDF reprint&lt;/a&gt;]
		&lt;/ul&gt;
	&lt;li&gt;F. Liang and A. Barron, &quot;Exact Minimax Strategies for Predictive
Density Estimation, Data Compression, and Model Selection&quot;, &lt;a
href=&quot;http://dx.doi.org/10.1109/TIT.2004.836922&quot;&gt;&lt;cite&gt;IEEE Transactions on
Information Theory&lt;/cite&gt; &lt;strong&gt;50&lt;/strong&gt; (2004): 2708--2726&lt;/a&gt;
	&lt;li&gt;Abraham Meidan and Boris Levin, &quot;Choosing from Competing Theories
in Computerised Learning&quot;, &lt;cite&gt;Minds and Machines&lt;/cite&gt; &lt;strong&gt;12&lt;/strong&gt;
(2002): 119--129
	&lt;li&gt;Nicolai Meinshausen and Peter Buehlmann, &quot;Stability Selection&quot;,
&lt;a href=&quot;http://arxiv.org/abs/0809.2932&quot;&gt;arxiv:0809.2932&lt;/a&gt; [&quot;Estimation of
structure, such as in graphical modeling, cluster analysis or variable
selection, is notoriously difficult, especially for high-dimensional data. We
introduce the new method of stability selection.&quot;]
	&lt;li&gt;Grayham E. Mizon and Massimiliano Marcellino (eds.),
&lt;cite&gt;Progressive Modelling:  Non-nested Testing and Encompassing&lt;/cite&gt;
[&lt;a href=&quot;http://www.oup.com/us/catalog/general/subject/Economics/Econometrics/?view=usa&amp;ci=9780199257324&quot;&gt;Blurb, table of contents&lt;/a&gt;]
	&lt;li&gt;Ali Mohammad-Djafari, &quot;Model selection for inverse problems: Best
choice of basis functions and model order selection,&quot; &lt;a
href=&quot;http://arxiv.org/abs/physics/0111020&quot;&gt;physics/0111020&lt;/a&gt;
	&lt;li&gt;M. Pavlic and M. J. van der Laan, &quot;Fitting of mixtures with
unspecified number of components using cross validation distance
estimate&quot;, &lt;cite&gt;Computational Statistics and Data
Analysis&lt;/cite&gt; &lt;strong&gt;41&lt;/strong&gt; (2003): 413--428
	&lt;li&gt;Zacharias Psaradakis, Martin Sola, Fabio Spagnolo and Nicola Spagnolo, &quot;Selecting nonlinear time series models using information criteria&quot;,
&lt;a href=&quot;http://dx.doi.org/10.1111/j.1467-9892.2009.00614.x&quot;&gt;&lt;cite&gt;Journal of
Time Series Analysis&lt;/cite&gt;
&lt;strong&gt;30&lt;/strong&gt; (2009): 369--394&lt;/a&gt;
	&lt;li&gt;Pradeep Ravikumar, Martin J. Wainwright, John D. Lafferty,
&quot;High-Dimensional Graphical Model Selection Using $\ell_1$-Regularized Logistic
Regression&quot;, &lt;a href=&quot;http://arxiv.org/abs/0804.4202&quot;&gt;arxiv:0804.4202&lt;/a&gt;
	&lt;li&gt;Douglas Rivers and Quang H. Vuong, &quot;Model selection tests for
nonlinear dynamic
models&quot;, &lt;a href=&quot;http://dx.doi.org/10.1111/1368-423X.t01-1-00071&quot;&gt;&lt;cite&gt;The
Econometrics Journal&lt;/cite&gt; &lt;strong&gt;5&lt;/strong&gt; (2002): 1--39&lt;/a&gt;
	&lt;li&gt;Yiyuan She, &quot;Thresholding-based Iterative Selection
Procedures for Model Selection and Shrinkage&quot;, &lt;a href=&quot;http://arxiv.org/abs/0812.5061&quot;&gt;arxiv:0812.5061&lt;/a&gt;
	&lt;li&gt;David Shilane, Richard H. Liang and Sandrine Dudoit, &quot;Loss-Based
Estimation with Evolutionary Algorithms and Cross-Validation&quot;,
UC Berkeley Biostatistics Working Paper 227 [&lt;a href=&quot;http://www.bepress.com/ucbbiostat/paper227/&quot;&gt;Abstract, PDF&lt;/a&gt;]
	&lt;li&gt;Aris Spanos
		&lt;ul&gt;
		&lt;li&gt;&quot;Statistical Induction, Severe Testing, and Model
Validation&quot; [&lt;a href=&quot;http://www.error06.econ.vt.edu/spanos.pdf&quot;&gt;Preprint&lt;/a&gt;]
		&lt;li&gt;&quot;Statistical Model Specification vs. Model Selection: Akaike-type Criteria and the Reliability of Inference&quot; [preprint kindly
provided by Prof. Spanos]
		&lt;/ul&gt;
	&lt;li&gt;Tina Toni and Michael P. H. Stumpf
		&lt;ul&gt;
		&lt;li&gt;&quot;Parameter Inference and
Model Selection in Signaling Pathway Models&quot;, &lt;a href=&quot;http://arxiv.org/abs/0905.4468&quot;&gt;arxiv:0905.4468&lt;/a&gt;
		&lt;li&gt;&quot;Simulation-based model selection for dynamical systems in systems and population biology&quot;, &lt;a href=&quot;http://arxiv.org/abs/0911.1705&quot;&gt;arxiv:0911.1705&lt;/a&gt;
		&lt;/ul&gt;
	&lt;li&gt;Masayuki Uchida and Nakahiro Yoshida, &quot;Information Criteria in
Model Selection for Mixing Processes&quot;, &lt;cite&gt;Statistical Inference for
Stochastic Processes&lt;/cite&gt; &lt;strong&gt;4&lt;/strong&gt; (2001): 73--98 [&quot;The emphasis is
put on the use of the asymptotic expansion of the distribution of an estimator
based on the conditional Kullback-Leibler divergence for stochastic processes.
Asymptotic properties of information criteria and their improvement are
discussed.&quot;]
	&lt;li&gt;Tim van Erven, Peter Grunwald and Steven de Rooij, &quot;Catching Up Faster by Switching Sooner: A Prequential Solution to the AIC-BIC Dilemma&quot;, &lt;a href=&quot;http://arxiv.org/abs/0807.1005&quot;&gt;arxiv:0807.1005&lt;/a&gt;
	&lt;li&gt;Geert Verbeke, Geert Molenberghs, Caroline Beunckens, &quot;Formal and
Informal Model Selection with Incomplete Data&quot;, &lt;cite&gt;Statistical
Science&lt;/cite&gt; &lt;strong&gt;23&lt;/strong&gt; (2008): 201--218
= &lt;a href=&quot;http://arxiv.org/abs/0808.3587&quot;&gt;arxiv:0808.3587&lt;/a&gt;
	&lt;li&gt;Zijun Wang, &quot;Finite Sample Performances of the Model Selection Approach in Nonparametric Model Specification for Time Series&quot;, &lt;a href=&quot;http://dx.doi.org/10.1080/03610920802531314&quot;&gt;&lt;cite&gt;Communications in Statistics: Theory and Methods&lt;/cite&gt; &lt;strong&gt;38&lt;/strong&gt;
(2009): 2302--2330&lt;/a&gt;
	&lt;li&gt;Peng Zhao and Bin Yu, &quot;On Model Selection Consistency of Lasso&quot;,
&lt;a
href=&quot;http://jmlr.csail.mit.edu/papers/volume7/zhao06a/zhao06a.pdf&quot;&gt;&lt;cite&gt;Journal
of Machine Learning Research&lt;/cite&gt; &lt;strong&gt;7&lt;/strong&gt; (2006): 2541--2563&lt;/a&gt;
	&lt;/ul&gt;
</description>
  </item>
  </channel>
</rss>