February 07, 2012

It's Not the Heat that Gets to You, It's the Sustained Conjunction of Heat with Elevated Levels of Atmospheric Pollutants (Advanced Data Analysis from an Elementary Point of View)

In which spline regression becomes a matter of life and death in Chicago.

Assignment

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at February 07, 2012 10:31 | permanent link

Splines (Advanced Data Analysis from an Elementary Point of View)

Kernel regression controls the amount of smoothing indirectly by bandwidth; why not control the irregularity of the smoothed curve directly? The spline smoothing problem is a penalized least squares problem: minimize mean squared error, plus a penalty term proportional to average curvature of the function over space. The solution is always a continuous piecewise cubic polynomial, with continuous first and second derivatives. Altering the strength of the penalty moves along a bias-variance trade-off, from pure OLS at one extreme to pure interpolation at the other; changing the strength of the penalty is equivalent to minimizing the mean squared error under a constraint on the average curvature. To ensure consistency, the penalty/constraint should weaken as the data grows; the appropriate size is selected by cross-validation. An example with the data, including confidence bands. Writing splines as basis functions, and fitting as least squares on transformations of the data, plus a regularization term. A brief look at splines in multiple dimensions. Splines versus kernel regression.

Reading: Notes, chapter 7; Faraway, section 11.2.
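
A minimal sketch of the penalized least squares idea in the summary above, in Python with NumPy (the course's own examples are in R). This is a discrete analogue, not a true cubic smoothing spline: it penalizes squared second differences on an evenly spaced grid rather than integrated squared curvature. The function name `penalized_smooth` and the simulated data are illustrative assumptions, not taken from the notes.

```python
import numpy as np

def penalized_smooth(y, lam):
    """Minimize sum((y - m)^2) + lam * sum((second differences of m)^2)."""
    n = len(y)
    # (n-2) x n matrix whose rows compute m[i+2] - 2*m[i+1] + m[i]
    D = np.diff(np.eye(n), n=2, axis=0)
    # Setting the gradient to zero gives the linear system (I + lam * D'D) m = y
    return np.linalg.solve(np.eye(n) + lam * D.T @ D, y)

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)                      # evenly spaced inputs
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)

rough = penalized_smooth(y, 1e-8)   # almost no penalty
stiff = penalized_smooth(y, 1e9)    # very heavy penalty
ols_line = np.polyval(np.polyfit(x, y, 1), x)  # ordinary least-squares line
```

With the penalty near zero the fit essentially reproduces the data (the interpolation extreme); with a very heavy penalty it collapses onto the least-squares straight line (the OLS extreme). Intermediate values of `lam` trace out the bias-variance trade-off in between.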

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at February 07, 2012 10:30 | permanent link

February 02, 2012

Heteroskedasticity, Weighted Least Squares, and Variance Estimation (Advanced Data Analysis from an Elementary Point of View)

Weighted least squares estimates. Heteroskedasticity and the problems it causes for inference. How weighted least squares gets around the problems of heteroskedasticity, if we know the variance function. Estimating the variance function from regression residuals. An iterative method for estimating the regression function and the variance function together. Locally constant and locally linear modeling. Lowess.

Reading: Notes, chapter 6; Faraway, section 11.3.
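
A small sketch of weighted least squares when the variance function is taken as known, written in Python with NumPy rather than the course's R. The helper name `wls`, the simulated data, and the assumed noise model (standard deviation proportional to x) are all hypothetical choices for illustration.

```python
import numpy as np

def wls(X, y, w):
    """Weighted least squares: minimize sum_i w_i * (y_i - X_i @ beta)^2."""
    sw = np.sqrt(w)
    # Rescaling each row by sqrt(w_i) turns the weighted problem into plain OLS
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta

rng = np.random.default_rng(1)
x = np.linspace(1, 10, 200)
X = np.column_stack([np.ones_like(x), x])   # intercept and slope columns
sigma = 0.5 * x                              # assumed known: noise s.d. grows with x
y = 2.0 + 3.0 * x + rng.normal(scale=sigma)

beta_ols = wls(X, y, np.ones_like(x))    # equal weights reduce to OLS
beta_wls = wls(X, y, 1.0 / sigma**2)     # weights equal to inverse variances
```

When the variance function is not known, one option in the spirit of the summary above is to estimate it by smoothing the squared residuals from a first fit, then refit with the implied weights and iterate.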

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at February 02, 2012 10:30 | permanent link

January 31, 2012

Books to Read While the Algae Grow in Your Fur, January 2012

Attention conservation notice: I have no taste.

Stephen Greenblatt, The Swerve: How the World Became Modern
A rather rambling and formless, if amiable and enthusiastic, popular history of Lucretius's De Rerum Natura and its rediscovery during the Renaissance. The grandiosity of the subtitle is not, thankfully, insisted upon in the text, which in fact says rather little about the quite interesting history of how Lucretius was taken up, and Epicurean ideas were elaborated on, in early modern Europe. Passages of novelistic you-are-there detail, which Greenblatt admits are totally made up, are mercifully brief and fairly clearly marked as such. (Such claims of influence as he does make strike me as very thinly supported, though not clearly wrong.) Enjoyable, if slight, if you are prepared to care very deeply about books, and to sympathize with philosophical materialism.
(I am not sure why Greenblatt writes that the only manuscripts we have from the ancient world are those from Herculaneum preserved by the eruption of Mt. Vesuvius. In Egypt and other desert countries, manuscripts have survived from Roman, Ptolemaic and even earlier times, some of them rather famous. But he is not a classicist, and one hopes he is a bit more careful about his own period.)
Margaret C. Jacob, Strangers Nowhere in the World: The Rise of Cosmopolitanism in Early Modern Europe
On the positive side, the subject is important, and there were lots of interesting anecdotes and suggestions. Against that, it is far too scatter-shot and lacks not only a single global argument, but even much cohesion within individual chapters. It is also far too limited in scope, to the Enlightenment and its immediate predecessors in the 17th century. But if one wanted to look even at what was distinctive about that sort of cosmopolitanism, it's very strange to not even try to compare it to Latinate humanism and earlier medieval traditions, or the way the travels of learned artists spread styles and ideas during the Renaissance and before. (Comparison with any other part of the world is of course too much to expect of a Europeanist, even one interested in cosmopolitanism.) Finally, Jacob makes causal claims — e.g., that alchemical ideas in early-modern natural philosophy were displaced by mechanical ones because the latter were less politically troubling to monarchies — with a sweep and assurance totally out of proportion to anything she presents by way of evidence or argument. Over-all of little value to me, but perhaps of more use to specialists in the period.
Amar Bhidé, A Call for Judgment: Sensible Finance for a Dynamic Economy
Full-length review: Hayek contra Chicago.
Rachel Loden, Dick of the Dead
Not as good as her superb Hotel Imperium, but still great:
The Idiad

Shall I write a poem about you
And your epic struggle against stupidity?
Feh. But if the brain is a city
I too have rooms in the swampy part, surrounded by crocodiles.
The monarch butterflies sail down from the Canadian Rockies
To overwinter in Pacific Grove, pair off and fly away;
They bruise me. I get crankier.
If you are coming down through the narrows of the Saugatuck
Please text me beforehand,
And I will come out to meet you
As far as Palookaville.

Gerda Claeskens and Nils Lid Hjort, Model Selection and Model Averaging
Full-length review: How Can You Choose Just One?.
Shorter me: the best available review of model selection from a statistical standpoint. Presumes a reader with some knowledge of asymptotic statistics.
Shirley Jackson, The Haunting of Hill House
Exactly as good, as monstrous, and as ambiguous, as I remember it (unlike The Sundial). One mark of its excellence is that its things that go bump in the night are perfectly convincing, and yet the real horrors are all those of the all-too-human mind. I am not sure what point there is to other haunted house stories, really.
ObLinkage: Kit Whitfield on the first paragraph of the novel. Whitfield is exactly right about the way "small, unnerving echoes whisper back and forth along her pages". (Take, please take, the ending, for example.)
Patrick O'Brian, The Letter of Marque; The Thirteen Gun Salute; The Nutmeg of Consolation; Clarissa Oakes / The Truelove
Books to Read While the Algae Grow in Your Fur; Writing for Antiquity; The Great Transformation; The Commonwealth of Letters; Scientifiction and Fantastica; Enigmas of Chance; The Dismal Science

Posted by crshalizi at January 31, 2012 23:59 | permanent link

How the Hyracotherium Got Its Mass (Advanced Data Analysis from an Elementary Point of View)

In which we consider evolutionary trends in body size, aided by regression modeling and the bootstrap.

Assignment

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at January 31, 2012 19:11 | permanent link

The Bootstrap (Advanced Data Analysis from an Elementary Point of View)

Quantifying uncertainty by looking at sampling distributions. The bootstrap principle: sampling distributions under a good estimate of the truth are close to the true sampling distributions. Parametric bootstrapping. Non-parametric bootstrapping. Many examples. When does the bootstrap fail?

Reading: Notes, chapter 5 (R for figures and examples; pareto.R; wealth.dat); R for in-class examples
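
A minimal sketch of non-parametric bootstrapping in Python with NumPy (the notes themselves use R). The function name `bootstrap_se`, the number of replicates, and the simulated exponential sample are assumptions made for illustration only.

```python
import numpy as np

def bootstrap_se(data, statistic, n_boot=2000, seed=0):
    """Bootstrap standard error of `statistic`, resampling data with replacement."""
    rng = np.random.default_rng(seed)
    n = len(data)
    # Each replicate: draw n indices with replacement, recompute the statistic
    reps = np.array([statistic(data[rng.integers(0, n, size=n)])
                     for _ in range(n_boot)])
    return reps.std(ddof=1), reps

rng = np.random.default_rng(42)
sample = rng.exponential(scale=1.0, size=200)

se_median, _ = bootstrap_se(sample, np.median)   # no simple textbook formula
se_mean, _ = bootstrap_se(sample, np.mean)
# For the mean, the bootstrap should roughly agree with the plug-in formula
plug_in_se = sample.std(ddof=0) / np.sqrt(len(sample))
```

Comparing `se_mean` with `plug_in_se` is a sanity check on the resampling code; the point of the method is that the same code also handles statistics like the median, where no such simple formula is at hand.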

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at January 31, 2012 19:10 | permanent link

You think you want big data? You can't handle big data! (Next Week at the Statistics Seminar)

Fortunately, however, the methods of those who can handle big data are neither grotesque nor incomprehensible, and we will hear about them on Monday.

Alekh Agarwal, "Computation Meets Statistics: Trade-offs and Fundamental Limits for Large Data Sets"
Abstract: The past decade has seen the emergence of datasets of unprecedented scale, with both large sample sizes and dimensionality. Massive data sets arise in various domains, among them computer vision, natural language processing, computational biology, social networks analysis and recommendation systems, to name a few. In many such problems, the bottleneck is not just the number of data samples, but also the computational resources available to process the data. Thus, a fundamental goal in these problems is to characterize how estimation error behaves as a function of the sample size, number of parameters, and the computational budget available.
In this talk, I present three research threads that provide complementary lines of attack on this broader research agenda: (i) lower bounds for statistical estimation with computational constraints; (ii) interplay between statistical and computational complexities in structured high-dimensional estimation; and (iii) a computational budgeted framework for model selection. The first characterizes fundamental limits in a uniform sense over all methods, whereas the latter two provide explicit algorithms that exploit the interaction of computational and statistical considerations.
Joint work with John Duchi, Sahand Negahban, Clement Levrard, Pradeep Ravikumar, Peter Bartlett, and Martin Wainwright.
Time and place: 4--5 pm on Monday, 6 February 2012, in Scaife Hall 125

As always, the talk is free and open to the public.

Posted by crshalizi at January 31, 2012 19:00 | permanent link

"The Cut and Paste Process" (This Week at the Statistics Seminar)

Attention conservation notice: Only of interest if you (1) care about combinatorial stochastic processes and their statistical applications, and (2) will be in Pittsburgh on Wednesday afternoon.

It is only in very special weeks, when we have been very good, that we get two seminars.

Harry Crane, "The Cut-and-Paste Process"
Abstract: In this talk, we present the cut-and-paste process, a novel infinitely exchangeable process on the state space of partitions of the natural numbers whose sample paths differ from previously studied exchangeable coalescent (Kingman 1982; Pitman 1999) and fragmentation (Bertoin 2001) processes. Though it evolves differently, the cut-and-paste process possesses some of the same properties as its predecessors, including a unique equilibrium measure, associated measure-valued process, a Poisson point process construction and transition probabilities which can be described in terms of Kingman's paintbox process. A parametric subfamily is related to the Chinese restaurant process and we illustrate potential applications of this model to phylogenetic inference based on RNA/DNA sequence data. There are some natural extensions of this model to Bayesian inference, hidden Markov models and tree-valued Markov processes which we will discuss.
We also discuss how this process and its extensions fit into the more general framework of statistical modeling of structure and dependence via combinatorial stochastic processes, e.g. random partitions, trees and networks, and the practical importance of infinite exchangeability in this context.
Time and place: 4--5 pm on Wednesday, 1 February 2012, in Scaife Hall 125

As always, the talk is free and open to the public.

Enigmas of Chance

Posted by crshalizi at January 31, 2012 18:45 | permanent link

January 28, 2012

Scientific Community to Elsevier: Drop Dead

Attention conservation notice: Associate editor at a non-profit scientific journal endorses a call for boycotting a for-profit scientific journal publisher.

I have for years been refusing to publish in or referee for journals published by Elsevier; pretty much all of the commercial journal publishers are bad deals [1], but they are outrageously worse than most. Since learning that Elsevier had a business line in putting out publications designed to look like peer-reviewed journals, and calling themselves journals, but actually full of paid-for BS, I have had a form letter I use for declining requests to referee, letting editors know about this, and inviting them to switch to a publisher which doesn't deliberately seek to profit by corrupting the process of scientific communication.

I am thus extremely happy to learn from Michael Nielsen that Tim Gowers is organizing a general boycott of Elsevier, asking people to pledge not to contribute to its journals, referee for them, or do editorial work for them. You can sign up here, and I strongly encourage you to do so. There are fields where Elsevier does publish the leading journals, and where this sort of boycott would be rather more personally costly than it is in statistics, but there is precedent for fixing that. Once again, I strongly encourage readers in academia to join this.

(To head off the inevitable mis-understandings, I am not, today, calling for getting rid of journals as we know them. I am saying that Elsevier is ripping us off outrageously, that conventional journals can be published without ripping us off, and so we should not help Elsevier to rip us off.)

Disclaimer, added 29 January: As I should have thought went without saying, I am speaking purely for myself here, and not with any kind of institutional voice. In particular, I am not speaking for the Annals of Applied Statistics, or for the IMS, which publishes it. (Though if the IMS asked its members to join in boycotting Elsevier, I would be very happy.)

1: Let's review how scientific journals work, shall we? Scientists are not paid by journals to write papers: we do that as volunteer work, or more exactly, part of the money we get for teaching and from research grants is supposed to pay for us to write papers. (We all have day-jobs.) Journals are edited by scientists, who volunteer for this and get nothing from the publisher. (New editors get recruited by old editors.) Editors ask other scientists to referee the submissions; the referees are volunteers, and get nothing from the publisher (or editor). Accepted papers are typeset by the authors, who usually have to provide "camera-ready" copy. The journal publisher typically provides an electronic system for keeping track of submitted manuscripts and the refereeing process. Some of them also provide a minimal amount of copy-editing on accepted papers, of dubious value. Finally, the publisher actually prints the journal, and runs the server distributing the electronic version of the paper, which is how, in this day and age, most scientists read it. While the publisher's contribution isn't nothing, it's also completely out of proportion to the fees they charge, let alone economically efficient pricing. The whole thing would grind to a halt without the work done by scientists, as authors, editors and referees. That work, to repeat, is paid for either by our students or by our grants, not by the publisher. This makes the whole system of for-profit journal publication economically insane, a check on the dissemination of knowledge which does nothing to encourage its creation. Elsevier is simply one of the worst of these parasites.

Manual trackback: Cosmic Variance; Open A Vein; AgroEcoPeople

Learned Folly

Posted by crshalizi at January 28, 2012 11:15 | permanent link

January 27, 2012

Changing How Changes Change (Next Week at the Statistics Seminar)

Attention conservation notice: Only of interest if you (1) care about covariance matrices and (2) will be in Pittsburgh on Monday.

Since so much of multivariate statistics depends on patterns of correlation among variables, it is a bit awkward to have to admit that in lots of practical contexts, correlation matrices are just not very stable, and can change quite drastically. (Some people pay a lot to rediscover this.) It turns out that there are more constructive responses to this situation than throwing up one's hands and saying "that sucks", and on Monday a friend of the department and general brilliant-type-person will be kind enough to tell us about them:

Emily Fox, "Bayesian Covariance Regression and Autoregression"
Abstract: Many inferential tasks, such as analyzing the functional connectivity of the brain via coactivation patterns or capturing the changing correlations amongst a set of assets for portfolio optimization, rely on modeling a covariance matrix whose elements evolve as a function of time. A number of multivariate heteroscedastic time series models have been proposed within the econometrics literature, but are typically limited by lack of clear margins, computational intractability, and curse of dimensionality. In this talk, we first introduce and explore a new class of time series models for covariance matrices based on a constructive definition exploiting inverse Wishart distribution theory. The construction yields a stationary, first-order autoregressive (AR) process on the cone of positive semi-definite matrices.
We then turn our focus to more general predictor spaces and scaling to high-dimensional datasets. Here, the predictor space could represent not only time, but also space or other factors. Our proposed Bayesian nonparametric covariance regression framework harnesses a latent factor model representation. In particular, the predictor-dependent factor loadings are characterized as a sparse combination of a collection of unknown dictionary functions (e.g., Gaussian process random functions). The induced predictor-dependent covariance is then a regularized quadratic function of these dictionary elements. Our proposed framework leads to a highly-flexible, but computationally tractable formulation with simple conjugate posterior updates that can readily handle missing data. Theoretical properties are discussed and the methods are illustrated through an application to the Google Flu Trends data and the task of word classification based on single-trial MEG data.
Time and place: 4--5 pm on Monday, 30 January 2012, in Scaife Hall 125

As always, the talk is free and open to the public.

Enigmas of Chance

Posted by crshalizi at January 27, 2012 14:25 | permanent link

January 26, 2012

Smoothing Methods in Regression (Advanced Data Analysis from an Elementary Point of View)

The constructive alternative to complaining about linear regression is non-parametric regression. There are many ways to do this, but we will focus on the conceptually simplest one, which is smoothing; especially kernel smoothing. All smoothers involve local averaging of the training data. The bias-variance trade-off tells us that there is an optimal amount of smoothing, which depends both on how rough the true regression curve is, and on how much data we have; we should smooth less as we get more information about the true curve. Knowing the truly optimal amount of smoothing is impossible, but we can use cross-validation to select a good degree of smoothing, and adapt to the unknown roughness of the true curve. Detailed examples. Analysis of how quickly kernel regression converges on the truth. Using smoothing to automatically discover interactions. Plots to help interpret multivariate smoothing results. Average predictive comparisons.

Readings: Notes, chapter 4 (R); Faraway, section 11.1

Optional readings: Hayfield and Racine, "Nonparametric Econometrics: The np Package"; Gelman and Pardoe, "Average Predictive Comparisons for Models with Nonlinearity, Interactions, and Variance Components" [PDF]
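
A rough sketch of kernel smoothing with the bandwidth picked by leave-one-out cross-validation, in Python with NumPy. It uses a Nadaraya-Watson (locally constant) estimator with a Gaussian kernel; the function names, the bandwidth grid, and the simulated data are illustrative assumptions, and this is not the np package's interface.

```python
import numpy as np

def nw_smooth(x_train, y_train, x_eval, h):
    """Nadaraya-Watson estimate: kernel-weighted local average of y_train."""
    d = (x_eval[:, None] - x_train[None, :]) / h
    w = np.exp(-0.5 * d**2)                 # Gaussian kernel weights
    return (w @ y_train) / w.sum(axis=1)

def loocv_mse(x, y, h):
    """Leave-one-out CV error: predict each point from all the others."""
    d = (x[:, None] - x[None, :]) / h
    w = np.exp(-0.5 * d**2)
    np.fill_diagonal(w, 0.0)                # a point gets no weight on itself
    pred = (w @ y) / w.sum(axis=1)
    return np.mean((y - pred) ** 2)

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 100)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)

hs = np.array([0.01, 0.02, 0.05, 0.1, 0.2, 0.5])   # candidate bandwidths
cv = [loocv_mse(x, y, h) for h in hs]
best_h = hs[int(np.argmin(cv))]
fit = nw_smooth(x, y, x, best_h)
```

Large bandwidths average over most of the data and flatten the curve (high bias); small ones chase the noise (high variance). Cross-validation picks a bandwidth between those failure modes without needing to know the true roughness.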

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at January 26, 2012 10:30 | permanent link

Advantages of Backwardness (Advanced Data Analysis from an Elementary Point of View)

In which we try to discern whether poor countries grow faster.

Assignment, R, penn-select.csv data set

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at January 26, 2012 09:30 | permanent link

January 24, 2012

Model Evaluation: Error and Inference (Advanced Data Analysis from an Elementary Point of View)

Goals of statistical analysis: summaries, prediction, scientific inference. Evaluating predictions: in-sample error, generalization error; over-fitting. Cross-validation for estimating generalization error and for model selection. Justifying model-based inferences; Luther and Süleyman.

Reading: Notes, chapter 3 (R for examples and figures).
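
To make the contrast between in-sample error and cross-validated error concrete, here is a small sketch in Python with NumPy. It fits polynomials of increasing degree and compares the two error estimates; the helper `kfold_cv_mse`, the choice of five folds, and the simulated data are hypothetical and chosen only for illustration.

```python
import numpy as np

def kfold_cv_mse(x, y, degree, k=5, seed=0):
    """k-fold CV estimate of generalization MSE for a polynomial of given degree."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)
    errs = []
    for test_idx in folds:
        train = np.ones(len(x), dtype=bool)
        train[test_idx] = False             # hold this fold out of the fit
        coef = np.polyfit(x[train], y[train], degree)
        resid = y[test_idx] - np.polyval(coef, x[test_idx])
        errs.append(np.mean(resid ** 2))
    return np.mean(errs)

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, size=120)
y = np.sin(3 * x) + rng.normal(scale=0.3, size=x.size)

degrees = range(1, 9)
# In-sample error: fit and evaluate on the same data
in_sample = [np.mean((y - np.polyval(np.polyfit(x, y, d), x)) ** 2)
             for d in degrees]
# Cross-validated error: always evaluate on held-out points
cv = [kfold_cv_mse(x, y, d) for d in degrees]
```

In-sample error can only go down as the degree rises, because the models are nested, so it cannot be used to choose among them. The cross-validated error has no such guarantee, which is what makes it usable for model selection.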

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at January 24, 2012 10:30 | permanent link

The Truth About Linear Regression (Advanced Data Analysis from an Elementary Point of View)

Multiple linear regression: general formula for the optimal linear predictor. Using Taylor's theorem to justify linear regression locally. Collinearity. Consistency of ordinary least squares estimates under weak conditions. Linear regression coefficients will change with the distribution of the input variables: examples. Why R^2 is usually a distraction. Linear regression coefficients will change with the distribution of unobserved variables (omitted variable problems). Errors in variables. Transformations of inputs and of outputs. Utility of probabilistic assumptions; the importance of looking at the residuals. What "controlled for in a linear regression" really means.

Reading: Notes, chapter 2 (R for examples and figures); Faraway, chapter 1 (continued).
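
One claim in the summary above, that linear regression coefficients change with the distribution of the inputs, can be illustrated with a short simulation in Python with NumPy. The quadratic "true" regression function, the two input ranges, and the helper name `optimal_linear_slope` are assumptions picked for illustration, not the examples from the notes.

```python
import numpy as np

def optimal_linear_slope(x, y):
    """Slope of the best linear predictor of y from x: Cov(x, y) / Var(x)."""
    return np.cov(x, y, ddof=0)[0, 1] / np.var(x)

def true_regression(x):
    """A nonlinear regression function, held fixed across both experiments."""
    return x ** 2

rng = np.random.default_rng(7)
x_low = rng.uniform(0, 1, size=100_000)    # inputs concentrated near zero
x_high = rng.uniform(2, 3, size=100_000)   # same function, inputs shifted right

slope_low = optimal_linear_slope(x_low, true_regression(x_low))
slope_high = optimal_linear_slope(x_high, true_regression(x_high))
```

For inputs uniform on [a, b] and this quadratic function, the population slope works out to a + b, so the two estimates should come out near 1 and near 5 even though the relationship between x and y is identical in both cases. Only the distribution of x has changed.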

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at January 24, 2012 10:15 | permanent link

January 22, 2012

Dungeons and Debtors

Attention conservation notice: A silly idea about gamifying credit cards, which would be evil if it worked.

To make a profit in an otherwise competitive industry, it helps if you can impose switching costs on your customers, making them either pay to stop doing business with you, or give up something of value to them. There are whole books about this, written by respected economists [1].

This is why credit card companies are happy to offer rewards for use: accumulating points on a card, which would not move with you if you got a new card and transferred the balance, is an attempt to create switching costs. Unfortunately, from the point of view of the banks, people will redeem their points from time to time, so some money must be spent on the rewards. The ideal would be points which people would value but which would never cost the bank anything.

Item: Computer games are, deliberately, addictive. Social games are especially addictive.

Accordingly, if I were an evil and unscrupulous credit card company (but I repeat myself), I would create an online game, where people could get points either from playing the game, or from spending money with my credit card. For legal reasons, I think it would probably be best to allow the game to technically be open to everyone, but with a registration fee which is, naturally, waived for card-holders. Of course, the game software would be set up to announce on Facebook (etc.) whenever the player/debtor leveled up. I would also be tempted to award double points for fees, and triple for interest charges, but one could experiment with this. If they close their credit card account, they have to start the game over from the beginning.

The fact that online acquaintances can't tell whether the debtor is advancing through spending or through game-play helps keep the reward points worth having. It's true that the credit card company has to pay for the game's design (a one-time start-up cost) and the game servers, but these are fairly cheap, and the bank never has to cash out points in actual dollars or goods. The debtors themselves do all the work of investing the points with meaning and value. They impose the switching costs on themselves.

My plan is sheer elegance in its simplicity, and I will be speaking to an attorney about a business method patent first thing Monday.

1: Much can be learned about our benevolent new-media overlords from the fact that this book carries a blurb from Jeff Bezos of Amazon, and that Varian now works for Google.

Modest Proposals

Posted by crshalizi at January 22, 2012 10:15 | permanent link

Three-Toed Sloth:   Hosted, but not endorsed, by the Center for the Study of Complex Systems