June 16, 2009

Because You Really Wished I'd Write More About Books

... I present for your amusement my reviews of Flynn on the Flynn Effect, and Tett on synthetic collateralized debt obligations and their kin. The former is the original version of something which had to be drastically cut down to fit into American Scientist; the latter appears nowhere else. (But, y'know, make me an offer.)

Anyone who tries to take this as an opportunity to drag me back into arguing about IQ will be ignored.

Self-centered

Posted by crshalizi at June 16, 2009 15:37 | permanent link

On the Certainty of the Bayesian Fortune-Teller

Attention conservation notice: 2300 words of technical, yet pretentious and arrogant, dialogue on a point which came up in a manuscript-in-progress, as well as in my long-procrastinated review of Plight of the Fortune Tellers. Why don't you read that book instead?

Q: You really shouldn't write in library books, you know; and if you do, your marginalia should be more helpful, or less distracting, than just "wrong wrong wrong!"

A: No harm done; my pen and I are both transparent rhetorical devices. And besides, Rebonato is wrong in those passages.

Q: Really? Isn't his point that it's absurd to pretend you could actually estimate a something like a probability of an interest rate jump so precisely that there's a real difference between calling it 0.500 000 and calling it 0.499 967? Isn't it yet more absurd to think that you could get the 99.5 percent annual value-at-risk — the amount of money you'd expect to lose once in two thousand years — down to four significant figures, from any data set, let alone one that covers just five years and so omits "not only the Black Death, the Thirty Years' War, the Barbarian invasions, and the fall of the Roman Empire, but even the economic recession of 1991 — the only meaningful recession in the last twenty years" (as of 2006), to say nothing of the "famous corporate loan book crises of the Paleochristian era" (p. 218)?

A: Of course all that's absurd, and Rebonato is right to call people on it. By the time his book came out it was too late to do much good, but if people had paid attention to such warnings I dare say we wouldn't be quite so badly off now, and they had better listen in the future.

Q: So what's your problem. Oh, wait, let me guess: you're upset because Rebonato's a Bayesian, aren't you? Don't bother, I can tell that that's it. Look, we all know that you've got objections to that approach, but at this point I'm starting to think that maybe you have issues. Isn't this sort of reflexive hostility towards a whole methodology — something you must run into every day of work — awkward and uncomfortable? Embarrassing, even? Have you thought about seeking help?

A: Actually, I have a serious point to make here. What Rebonato wants is entirely right-headed, but it fits very badly with his Bayesianism, because Bayesian agents are never uncertain about probabilities; at least, not about the probability of any observable event.

Q: But isn't Bayesianism about representing uncertainty, and making decisions under uncertainty?

A: Yes, but Bayesian agents never have the kind of uncertainty that Rebonato (sensibly) thinks people in finance should have.

Q: Let me try to pin you down in black and white. [Opens notebook] I have here on one side of the page our old friend, the well-known the probability space Omega F. Prob. Coming out of it, in the middle, is a sequence of random variables X1, X2, ... , Xn, ... , which have some joint distribution or other. (And nothing really depends on its being a sequence, I could use a random field on a network or whatever you like, add in covariates, etc.) On the other side of the random variables, looking at them, I have a standard-issue Bayesian agent. The agent has a hypothesis space, each point m of which is a probability distribution for the random sequence. This hypothesis space is measurable, and the agent also has a probability measure, a.k.a. prior distribution, on this space. The agent uses Bayes's rule to update the distribution by conditioning, so it has a sequence of measures D0, D1, etc.

A: I think you are missing an "As you know, Bob", but yes, this is the set-up I have in mind.

Q: Now I pick my favorite observable event f, a set in the joint sigma-field of the Xi. For each hypothesis m, the probability m(f) is well-defined. The Bayesian thinks this is a random variable M(f), since it has a distribution D on the hypothesis space. How is that not being uncertain about the probability of f?

A: Well, in the first place —

Q: I am not interested in quibbles about D being a Dirac delta function.

A: Fine, assume that D doesn't put unit mass on any single hypothesis, and that it gives non-zero weight to hypotheses with different values of m(f). But remember how Bayesian updating works: The Bayesian, by definition, believes in a joint distribution of the random sequence X and of the hypothesis M. (Otherwise, Bayes's rule makes no sense.) This means that by integrating over M, we get an unconditional, marginal probability for f:

Pn(f) = EDn[M(f|X1=x1, X2=x2, ... , Xn=xn)]

Q: Wait, isn't that the denominator in Bayes's rule?

A: Not quite, that equation defines a measure — the predictive distribution — and the denominator in Bayes's rule is the density of that measure (with n=0) at the observed sequence.

Q: Oh, right, go on.

A: As an expectation value, Pn(f) is a completely precise number. The Bayesian has no uncertainty whatsoever in the probabilities it gives to anything observable.

Q: But won't those probabilities change over time, as it gets new data?

A: Yes, but this just means that the random variables aren't independent (under the Bayesian's distribution over observables). Integrating m with respect to the prior D0 gives us the infinite-dimensional distribution of a stochastic process, one which is not (in general) equal to any particular hypothesis, though of course it lies in their convex hull; the simple hypotheses are extremal points. If the individual hypothesis are (laws of) independent, identically-distributed random sequences, their mixture will be exchangeable. If the individual hypotheses are ergodic, their mixture will be asymptotically mean-stationary.

Q: Don't you mean "stationary" rather than "asymptotically mean-stationary"?

A: No; see chapter 25 here, or better yet that trifler's authority.

Q: You were saying.

A: Right. The Bayesian integrates out m and gets a stochastic process where the Xi are dependent. As far as anything observable goes, the Bayesian's predictions, and therefore its actions, are those of an agent which treats this stochastic process as certainly correct.

Q: What happens if the Bayesian agent uses some kind of hierarchical model, or the individual hypotheses are themselves exchangeable/stationary?

A: The only thing that would change, for these purposes, is the exact process the Bayesian is committed to. Convex mixtures of convex mixtures of points in C are convex mixtures of points in C.

Q: So to sum up, you're saying that the Bayesian agent is uncertain about the truth of the unobservable hypotheses (that's their posterior distribution), and uncertain about exactly which observable events will happen (that's their predictive distribution), but not uncertain about the probabilities of observables.

A: Right. (Some other time I'll explain how that helps make Bayesian models testable.) And — here's where we get back to Rebonato — all the things he is worried about, like values-at-risk and so forth, are probabilities of observable events. Put a Bayesian agent in the risk-modeling situation he talks about, and it won't just say that the 99.5% VaR is 109.7 million euros rather than 110 million, it will give you as many significant digits as you have time for.

Q: So let me read you something form p. 194--195:

Once frequentists accept (at a given statistical level of confidence) the point estimate of a quantity (say, a percentile), they tend to act as if the estimated number were the true value of the parameter. Remember that, for a frequentist, a coin cannot have a 40% chance of being biased. Either the coin is fair or it is biased. Either we are in a recession or we are not. We simply accept or reject these black-or-white statements at a certain confidence level... A Bayesian approach automatically tells us that a parameter (say, a percentile) has a whole distribution of possible values attached to it, and that extracting a single number out of this distribution (as I suggested above, the average, the median, the mode, or whatever) is a possibly sensible, but always arbitrary, procedure. No single number distilled from the posterior distribution is a primus inter pares: only the full posterior distribution enjoys this privileged status, and it is our choice what use to make of it.

This seems entirely reasonable; where do you think it goes wrong?

A: You mean, other than the fact that point estimates do not have "statistical levels of confidence", and that Rebonato has apparently forgotten about actual confidence intervals?

Q: Let's come back to that.

A: He is running together parameters of the unobserved hypotheses, and the properties of the predictive distribution on which the Bayesian acts. I can take any function I like of the hypothesis, g(m) say, and use it as a parameter of the distribution. If I have enough parameters gi and they're (algebraically) independent of each other, there's a 1-1 map between hypotheses and parameter vectors — parameter vectors are unique names for hypotheses. I could make parts of those names be readily-interpretable aspects of the hypothetical distributions, like various percentiles or biases. The distribution over hypotheses then gives me a distribution over percentiles conditional on the hypothesis M. But we don't know the true hypothesis, and on the next page Rebonato goes on to cast "ontological" doubt about whether it even exists. (How he can be uncertain about the state of something he thinks doesn't exist is a nice question.) We only have the earlier observations, so we need to integrate or marginalize out M, and this collapses the distribution of percentiles down to a single exact value for that percentile.

Q: Couldn't we avoid that integration somehow?

A: Integrating over the posterior distribution is the whole point of Bayesian decision theory.

Q: Let's go back to the VaR example. If you try estimating the size of once-in-two-thousand-year losses from five years of data, your posterior distribution has got to be pretty diffuse.

A: Actually, it can be arbitrarily concentrated by picking the right prior.

Q: Fine, for any reasonable prior it needs to be pretty diffuse. Shouldn't the Bayesian agent be able to use this information to avoid recklessness?

A: That depends on the loss function. If the loss involves which hypothesis happens to be true, sure, it'll make a difference. (That's how we get the classic proof that if the loss is the squared difference between the true parameter and the point estimate, the best decision is the posterior mean.) But if the loss function just involves what observable events actually take place, then no. Or, more exactly, it might make sense to show more caution if your posterior distribution is very diffuse, but that's not actually licensed by Bayesian decision theory; it is "irrational" and sets you up for a Dutch Book.

Q: Should I be worried about having a Dutch Book made against me?

A: I can't see why, but some people seem to find the prospect worrying.

Q: So what should people do?

A: I wish I had a good answer.. Many of Rebonato's actual suggestions — things like looking at a range of scenarios, robust strategies, not treating VaR as the only thing you need, etc. — make a lot of sense. (When he is making these practical recommendations, he does not counsel people to engage in a careful quantitative elicitation of their subject prior probabilities, and then calculate posterior distributions via Bayes's rule; I wonder why.) I would also add that there are such things as confidence intervals, which do let you make probabilistic guarantees about parameters.

Q: What on earth do you mean by a "probabilistic guarantee"?

A: That either the right value of the parameter is in the confidence set, or you get very unlucky with the data (how unlucky depends on the confidence level), or the model is wrong. Unlike coherence, coverage connects you to reality. This is basically why Haavelmo told the econometricians, back in the day, that they needed confidence intervals, not point estimates.

Q: So how did the econometricians come to make fetishes of unbiased point-estimators and significance tests of equality constraints?

A: No doubt for the same reason they became convinced that linear and logistic regression was all you'd ever need to deal with any empirical data ever.

Q: Anyway, that "get the model right" part seems pretty tricky.

A: Everyone is going to have to deal with that. (You certainly still have to worry about mis-specification with Bayesian updating.) You can test your modeling assumptions, and you can weaken them so you are less susceptible to mis-specification.

Q: Don't you get weaker conclusions — in this case, bigger confidence intervals — from weaker modeling assumptions?

A: That's an unavoidable trade-off, and it's certainly not evaded by going Bayesian (as Rebonato knows full well). With very weak, and therefore very defensible, modeling assumptions, the confidence interval on, say, the 99.5% VaR may be so broad that you can't devise any sensible strategy which copes with that whole range of uncertainty, but that's the math's way of telling you that you don't have enough data, and enough understanding of the data, to talk about once-in-two-thousand-year events. I suppose that, if they have financial engineers in the stationary state, they might eventually be able to look back on enough sufficiently-converged data to do something at the 99% or even 99.5% level.

Q: Wait, doesn't that suggest that there is a much bigger problem with all of this? The economy is non-stationary, right?

A: Sure looks like it.

Q: So how can we use statistical models to forecast it?

A: If you want someone to solve the problem of induction, the philosophy department is down the stairs and to the left.

Enigmas of Chance; The Dismal Science

Posted by crshalizi at June 16, 2009 09:50 | permanent link

Chaos, Complexity and Inference: 2009 Syllabus

See this earlier post or the course homepage for more. This should be an RSS feed for this page, so you can follow updates, which will generally be posted after lectures.

Final exam
take-home, due 12 May
Lecture 28, (Thursday, 30 April): Agents 4: A real-world example of agents on networks
Notes
Hedstrom and Aberg, "Quantitative research, agent-based modelling and theories of the social", ch. 6 (pp. 114--144) in Hedstrom, Dissecting the Social [PDF]; (*) Guttorp, ch. 4
Lecture 27, (Tuesday, 28 April): Networks 5: Community Discovery
No slides
Readings: Clauset, Moore and Newman, arxiv:0811.0484 (code); Girvan and Newman, arxiv:cond-mat/0112110; Hofman and Wiggins, arxiv:0709.3512; Newman, arxiv:physics/0605087; Reichardt and Bornholdt, arxiv:cond-mat/0603718;
Reichardt and White, arxiv:0708.0958 [discussion]
Lecture 26, (Thursday, 23 April): Complex networks 4: inference for network models
Slides
Reading: Hanneke and Xing, "Discrete Temporal Models of Social Networks" [PDF]; Handcock and Jones, "Likelihood-based inference for stochastic models of sexual network formation" [PDF]; Hunter, Goodreau and Handcock, "Goodness of Fit of Social Network Models" [PDF]; Middendorf, Ziv and Wiggins, "Inferring Network Mechanisms: The Drosophila melanogaster Protein Interaction Network", arxiv:q-bio/0408010; Newman, "Structure and Function", sections IV B and V; Newman, Strogatz and Watts, "Random graphs with arbitrary degree distributions and their applications", arxiv:cond-mat/0007235; Wiuf, Brameier, Hagberg and Stumpf, "A likelihood approach to analysis of network data", Proceedings of the National Academy of Sciences (USA) 103 (2006): 7566--7570 [discussion]
Lecture 25, (Tuesday, 21 April): Agents 3: social complexity
Slides
Reading: Flake, ch. 17; Miller and Page, chs. 10--11; Krugman, ch. 6; Salganik, Dodds and Watts, "Experimental study of inequality and unpredictability in an artificial cultural market", Science 311 (2006):854--856; (*) Skyrms and Pemantle, "A Dynamic Model of Social Network Formation", arxiv:math.PR/0404101
Lecture 24, (Thursday, 9 April): Complex networks 3: contagion on networks
slides
Reading: Guttorp, sec. 2.11; Newman, "Structure and Function", sec. VIII; Davis, Trapman, Leirs, Begon and Heesterbeek, "The abundance threshold for plague as a critical percolation phenomenon", Nature 454 (2008): 634--637; Bell, Maiden, Munoz-Solomando and Reddy, "'Mind control experiences' on the Internet: Implications for the psychiatric diagnosis of delusions", Psychopathology 39 (2006): 87--91 [PDF];
Smith and Novella, "HIV Denial in the Internet Era", PLoS Medicine 4:8 (2007): e256; (*) Kenah and Robins, "Second look at the spread of epidemics on networks", arxiv:q-bio.QM/0610057
Lecture 23, (Tuesday, 7 April): Agents 2: collective phenomena and self-organization
Slides
Reading: Flake, ch. 16; Miller and Page, ch. 9; Krugman, introduction and ch. 1; Healy, Walking to School; Perlstein, The Meaning of Box 722
Lecture 22, (Thursday, 2 April): Agent-based models 1
Slides
Reading: Miller and Page, chs. 6 and 7; Flake, ch. 12
Lecture 21, (Tuesday, 30 March): Complex networks 2: growth models
Slides
Reading: Newman, "Structure and function", sec. VII
Lecture 20, (Thursday, 26 March): Complex networks 1: basics, network properties
Slides
Definition of networks and basic terms. Pictures, emphasizing sex and death. Bipartite networks and the Galois lattice construction. The small world problem. What makes a network complex? The Erdos-Renyi model: first properties and weaknesses.
Watts, "The 'New' Science of Networks", Annual Review of Sociology 30 (2004): 243--270
Newman, "The Structure and Function of Complex Networks", SIAM Review 45 (2003): 167--256, arxiv:cond-mat/0303516 (through sec. VI, but skipping or skimming IV B and V)
Lecture 19, (Tuesday, 24 March): Inference from simulations 2: direct and indirect estimation
Slides and R
The method of simulated moments. Example with the logistic map. Issues with the MSM. The trade-off between tractable models and good models. The indirect inference trick: don't match moments, match a bad-but-tractable model. Examples of indirect inference. Convergence properties.
Reading: Smith, "Indirect Inference"; Kendall et al., "Population Cycles in the Pine Looper Moth: Dynamical Tests of Mechanistic Hypotheses"; (*) Gourieroux, Monfort and Renault, "Indirect Inference" [JSTOR]
Lecture 18, (Thursday, 19 March): Inference from simulations 1
The virtues and limits of simulations, especially stochastic simulations. Active testing: deliberately trying to break your model to gain (or lose!) confidence in it. Bootstrapping once again.
Reading: ch. 5 and appendix B of Miller and Page; Miller, "Active Nonlinear Tests (ANTs) of Complex Simulation Models"
Lecture 17, (Tuesday, 17 March): Inference in general: error statistics and severe testing
Slides
Errors and reliable inference. Statistics as principle argument: mostly arguments that certain errors are (or are not) absent or small, unless we're very unlucky. Kinds of errors in statistical inferences. Construction of reliable hypotheses: confidence sets, "probably approximately correct", more. Severe testing. Null hypotheses. Testing for mis-specification, especially by looking at residuals. The two chief world systems.
Reading: Mayo and Cox, "Frequentist statistics as a theory of inductive inference", arxiv:math.ST/0610846; Mayo and Spanos, "Methodology in Practice: Statistical Misspecification Testing" [PDF]; idem, "Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction" [journal]; Spanos, "Curve-Fitting, the Reliability of Inductive Inference and the Error-Statistical Approach" [journal]
Lecture 16, (Thursday, 5 March): Heavy-tailed distributions 4: Comparing models
Slides, R
Goodness-of-fit tests: Glivenko-Cantelli, "the fundamental theorem of statistics"; the Kolmgorov-Smirnov test; Monte Carlo version of the K-S test when parameters are estimated from data; general remarks on goodness-of-fit. Relative distribution methods for comparing distributions: their virtues; using relative distributions to check power laws; HTTP file size example. Vuong's likelihood-ratio test for model selection. The flight of the albatross. Zombie-hunting.
Readings: Clauset et al. continued; Handcock and Morris, "Relative Distribution Methods"; Freckleton and Sutherland, "Do power laws imply self-regulation?"; idem, "Do in-hospital waiting lists show self-regulation?" (thanks to Nick Watkins for point out those papers); Edwards et al. "Revisiting Lévy flight search patterns of wandering albatrosses, bumblebees and deer"; (*) Vuong, "Likelihood Ratio Tests for Model Selection and Non-Nested Hypotheses"
Lecture 15 (Tuesday, 3 March) Heavy-tailed distributions 3: Estimation
Slides
The right way to do it: maximum likelihood estimation for power laws; properties of the MLE. Nonparametric bootrstrapping for error estimation. The wrong way to do it: linear regression on a log-log plot; why it should never be used. Determining the scaling range. Examples with city sizes. Non-parametric density estimation, and special considerations for heavy-tailed distributions.
Readings: Clauset, Shalizi and Newman, "Power law distributions in empirical data", arxiv:0706.1062; (*) Markovitch and Krieger, "Nonparametric estimation of long-tailed density functions and its application to the analysis of World Wide Web traffic"
Lecture 14, (Thursday, 26 February): Heavy-tailed distributions 2: how they arise
Slides
"How the distributions got their tails". Deflating explanations: measuring ex instead of x; measuring x1/a instead of x; the central limit theorem for multiplicative terms. Everyday explanations: exponential growth from random times; why isn't Acoma, N.M., not the largest city in the US?; the Yule-Simon mechanism for random growth. Exciting and mysterious explanations: why physicists expect everything to follow a Gaussian, except at phase transitions; self-organized criticality; a skeptical note.
Readings: Newman, "Power laws", section IV; Krugman, ch. 3 and the last part of ch. 8; Mitzenmacher, "A Brief History of Generative Models for Power Law and Lognormal Distributions"; video of Mitzenmacher giving a talk on this material; Sornette, "Mechanism for Powerlaws without Self-Organization", arxiv:cond-mat/0110426
Lecture 13, (Tuesday, 24 February): Heavy-tailed distributions 1: what they are
Slides and R
Highly skewed and heavy tailed distributions: definitions. Real-data examples: words, cities, money, papers, phone calls, the Internet, earthquakes, wars, terrorism, solar flares ("die, puny humans!"). Models: the pure power laws (Pareto and Zipf distributions). Properties of power laws: scale-freedom, 80-20, order statistics. Inequality of condition in America. Modifications: generalized Pareto, Yule-Simon. Non-power-law distributions: truncated power laws, log-normal, stretched exponential/Weibull distribution.
Newman, "Power laws, Pareto distributions and Zipf's law", arxiv:cond-mat/0412004 (through section III)
Lecture 12, (Thursday, 19 February): Self-organization 2
Slides
Readings: Shalizi, Klinkner and Haslinger, "Quantifying Self-Organization with Optimal Predictors", arxiv:nlin.AO/0409024; Shalizi, Haslinger, Rouquier, Klinkner and Moore, "Automatic Filters for the Detection of Coherent Structure in Spatiotemporal Systems", arxiv:nlin.CG/0508001
Lecture 11, (Tuesday, 17 February): Cellular automata 2: excitable media; mechanisms and abstractions
Slides
The heart as an assemblage of excitable elements. Excitable media as an abstract mechanism. Philosophical excursus on "abstraction" and "mechanism". Examples of excitable media. Chemical oscillators: Belousov-Zhabotinsky type. Chemical oscillators: Turing type. City growth as Turing morphogenesis. Slime mold morphogenesis. The spiral ice-canyons of Mars. Cellular automata models of excitability: Greenberg-Hastings, cyclic CA. Some math for these models.
Readings: Fisch, Gravner and Griffeath, "Threshold-range scaling of excitable cellular automata" [PDF]; idem, "Cyclic Cellular Automata in Two Dimensions" [zipped PostScript]; Greenberg and Hastings, "Spatial Patterns for Discrete Models of Diffusion in Excitable Media" [JSTOR]; Griffeath, "Self-Organization of Random Cellular Automata: four snapshot" [zipped PostScript]; Krugman, ch. 2 and first part of ch. 8; ch. 4 of Guttorp
Lecture 10, (Thursday, 12 February): Cellular automata 1
Slides
What cellular automata are; basic elements. Examples: diffusion-limited aggregation, lattice gases, majority vote, Life. Mechanical self-reproduction.
Readings: Flake, ch. 15; Miller and Page, ch. 8
Lecture 9, (Tuesday, 10 February): Self-organization 1, some examples
Slides
Levels of description; micro/macro distinctions, aggregation. Emergence of higher levels from lower levels; how to be a good reductionist. Self-organization. Excursus into thermodynamic entropy and the second law of thermodynamics. How self-organization is compatible with the second law. Some examples from fluids. Classroom demo with fluids.
Reading: Miller and Page, ch. 4
Office of Charles and Ray Eames, Powers of Ten, narration by Philip Morrison
Lecture 8, (Thursday, 5 February): Randomness and determinism
Slides
Data compression and description length. Algorithmic information content as the length of the shortest effective description. Random sequences are incompressible; incompressible sequences have all the testable properties of randomness. "Any signal distinguishable from noise is insufficiently compressed." Algorithmic information content is uncomputable. Dynamical systems as sources of randomness; Brudno's theorem connecting algorithmic information content to entropy rate and so to Lyapunov exponents. "Any determinism distinguishable from randomness is insufficiently complicated."
Side-note: Algorithmic Information Content and Marginal Entropies
Readings: Flake, ch. 14; Smith, ch. 11; Poincaré, "Chance", from Science and Method [PDF]
Lecture 7 (Tuesday, 3 February): Information Theory
Slides
Entropy of random variables, interpretations (randomness, variability, uncertainty, coding). Mutual information, statistical independence, Markov properties. Relative entropy, a.k.a. Kullback-Leibler divergence. Relative entropy measures how easy it is to tell two distributions apart: statistical uses in sampling, large deviations, estimation (via Fisher information). Divergence is is negative expected log likelihood; maximum likelihood attempts to minimize divergence. Entropy rates measure dynamical randomness; connection between metric entropy rate, topological entropy rate, and Lyapunov exponents. Asymptotic equipartition; convergence of log likelihoods to their expected values.
Reading: Feldman, "Information Theory" [PDF]
M.C. Hawking, "Entropy", from Fear of a Black Hole [lyrics; mp3 (radio-safe Brief History of Rhyme version)]
Ray and Charles Eames, A Communications Primer (via Britta Gustafson)
Lecture 6 (Thursday, 29 January): Statistical Inference for Discrete Random Sequences
Slides
Inference for Markov chains: the likelihood function; the maximum likelihood estimate of the transition matrix; its convergence; MLE for parameterized Markov chains. Error estimation and hypothesis testing for Markov chains. Parametric bootrstrapping. Higher-order Markov chains. Random functions of Markov chains; stochastic finite automata and their inference. Variable length Markov chains. CSSR.
Handout: Maximum Likelihood Estimation for Markov Chains
Readings :Guttorp, 2.7--2.9 and 2.12 (I, II, III); Smith, ch. 10; Foulkes, "A Class of Machines Which Determine the Statistical Structure of a Sequence of Characters" [PDF]; (*) Smith, "The Maintenance of Uncertainty" [PDF]
Lecture 5 (Tuesday, 27 January): Symbolic Dynamics
Slides
Discrete coarse-grainings of the continuous state space. Symbol sequences determined by initial conditions. The shift map; moving the complexity from the update rule to the space. Motivation: "Continuous math is hard; let's go discretize." Refinement of partitions under dynamics. Generating partitions; discretizing continuous dynamics without loss of information. Symbolic dynamics as discrete stochastic processes; the full logistic map as the Bernoulli process. Allowed and forbidden words. The topological entropy rate; how many things can your process do? Formal languages as tools for describing discrete sequences ("we have ways of making your dynamics talk"). Regular expressions and their grammars. Regular expressions and their machines ("circles and arrows and a paragraph on the back of each one"). Examples of languages and machines from the logistic map. Finite-type vs. sofic processes.
Handout: More on the topological entropy rate
Readings: Daw, Finney and Tracy, "A review of symbolic analysis of experimental data" [reprint]
Lecture 4 (Thursday, 22 January): Attractor Reconstruction and Prediction
Slides, lorenz-generator.pl
"Geometry from a time series": how "time-delay embedding" and Takens's theorem let us reconstruct attractors without knowing equations. An example with the Lorenz system. Attractor reconstruction as a form of inference. Change of coordinates, change of observables. Choice of reconstruction settings, false nearest neighbor method. Prediction in the reconstructed state space. Nearest neighbor prediction; an example with the Lorenz system. Picking predictive settings via cross-validation.
Side-notes: "Smooth change of coordinates"; Nonlinear predictors
Readings: Kantz and Schreiber, chs. 3--4; Smith, chs. 7--9.
Lecture 3 (Tuesday, 20 January): Attractors and Mixing
Slides and R
Attractors as generalizations of fixed points and limit cycles. Geometry of attractors: where do trajectories go? The Hénon map, our first multi-dimensionsional dynamical system. Multiple dimensions and memory. Stretching and folding again. Stable and unstable directions. The Lorenz equations, our first continuous-time dynamical system. Lyapunov exponents to measure stability and instability. Probability on attractors. Invariant distributions again. Subtleties of convergence to the invariant distribution. Mixing and decay of correlations. A central limit theorem for mixing processes. CLT behavior of the logistic map.
Readings: Flake, ch. 11; Miller and Page, chs. 1--3; Smith, chs. 4--6.
Lecture 2 (Thursday, 15 January): More dynamics
Slides and R
Stability of fixed points and cycles. Bifurcations; bifurcation diagrams; the period-doubling route to chaos; periodic windows within chaos. Devaney's definition of chaos; how it implies sensitive dependence; stretching and folding. The Arnold cat map. Intermittency. More ergodicity: the evolution of densities and the Frobenius-Perron operator; time averages; the world's simplest ergodic theorem.
Readings: Flake, ch. 10 and sec. 11.1; Guttorp, ch. 1; Smith, chs. 1--3.
The Arnold Cat Map Movie (starring Marlowe the Cat, directed by Evelyn Sander)
Lecture 1 (Tuesday, 13 January): Introduction to the course; starting with dynamics
Slides and R
Administrivia. What is a model? What is a simulation? What is a dynamical system? What is chaos? First acquaintance with the logistic map. Fixed points and cycles. Exploring some properties of chaos.

Corrupting the Young; Complexity; Enigmas of Chance

Posted by crshalizi at June 16, 2009 09:24 | permanent link

May 31, 2009

Books to Read While the Algae Grow in Your Fur, May 2009

Chelsea Cain, Heartsick and Sweetheart
Psychological thriller with psycho killers and local color for Portland. (How accurate that is, I couldn't say.) Very well written, especially the characterization of the plucky-yet-self-destructive reporter, and the extremely creepy situation of the lead detective. There are some graphic scenes of torture, which honestly I skimmed through because I really couldn't take it. — Updated: The sequel is also good, but not, I think, quite as good. The explanation it gives for the central relationship makes sense, but I feel that relationship worked better in the previous book, where it was left mysterious (at least to me; maybe I'm slow).
Andrea Camilleri, August Heat
Jason Aaron and R. M. Guera, Scalped: Casino Boogie, Dead Mothers, The Gravel in Your Gut
Indian Country noir. Getting better, which is to say harder to read, as it goes. (Reading volume 1 would help.)
House of Mystery, vol. 1: Room and Boredom
Tales from the bar, a la Jorkens or the White Hart. Only the bar is in the dreamlands, or somewhere else between the worlds, and some of its regulars are actually trapped there, in a house one of them seem to have designed in her dreams... (I will be very upset, yet somewhat impressed, if they turn out to be winging the story, rather than taking it somewhere.)
Jamie McKelvie, Suburban Glamour
Neil Gaiman and Michael Zulli, The Last Temptation and The Facts in the Case of the Departure of Miss Finch
Faith Erin Hicks, Zombies Calling
The most charming, life-affirming, self-referential zombie movie ever; naturally, it's a comic book.
Severance
Shockingly smart, scary, and blackly funny, horror movie. (It reinforces every negative impression I have had about team-building exercises.)
Larry Samuelson, Evolutionary Games and Equilibrium Selection
How to figure out which equilibrium your game will end up at, under some not-too-laughably-implausible assumptions about individual decision-making, learning and imitation. I suspect that a lot of the results could be generalized to much broader classes of models, particularly in the large-population limit. (Cf. Kurtz below.)
Thomas G. Kurtz, Approximation of Population Processes
Like a kindergarten version of Ethier and Kurtz. This means that the reader is still expected to understand measure-theoretic probability and Markov processes pretty well, but not to necessarily care about the intricacies of the various Skorokhod topologies...
Sarah Graves, Dead Cat Bounce, Triple Witch, Wicked Fix, Repair to Her Grave
Brain-candy mysteries. Good for when one is lying in bed with the flu.
It's surely not an original observation, but there is a severe problem with setting a mystery series in a small town. These four books cover, if I recall correctly, two years of narrative time, featuring about three homicides a piece, in a town of under 2000 inhabitants, meaning the murder rate is about 3 per 1000 per year, which is 50 times the national mean, and almost half of the Lancet survey's point estimate of the 2003--2006 death rate from the invasion of Iraq. (Anyone who takes that as an invitation to try to drag me into the Lancet controversy will be ignored.) The only remotely plausible explanation is that the series' amateur sleuth is really a serial killer, and the novels are her elaborately-coded confessions. If anyone has approached the problem from this angle, I'd be interested in hearing about it. (The closest approach I can think of is Dexter.)
Jane Haddam, Living Witness
Haddam tackles Intelligent Design creationism, with her usual compelling characterization. (Some of the characters whose viewpoints the reader takes on are rather unpleasant people.) — It strikes me that Haddam has been writing this series for longer than some of my students have been alive, and I wonder whether (if it weren't for the marketing hook) the stories she's telling nowadays really need the continuity of the recurring detective...
(Minor continuity/background errors: here in Pennsylvania, we don't have "package stores", we have state-run wine and spirit stores; "Leibniz" is mis-spelled as "Liebniz"; in one passage the only number in Gregor's speed-dial is Bennis's, later the only two numbers are his voice-mail, then Bennis's. Obviously none of these matter at all.)
Justine Musk, Blood Angel
Brain-candy. — I believe this was a first novel, which would make sense of the fact that there's enough material in here for at least three books...

Books to Read While the Algae Grow in Your Fur; Scientifiction and Fantastica; Enigmas of Chance; The Dismal Science

Posted by crshalizi at May 31, 2009 23:59 | permanent link

April 30, 2009

Books to Read While the Algae Grow in Your Fur, April 2009

Jérôme Dedecker, Paul Doukhan, Gabriel Lang, José Rafael León R., Sana Louhichi and Clémentine Prieur, Weak Dependence: With Examples and Applications
As every school-child knows, a sequence of probability measures Dn converges "weakly" or "in distribution" on a limit D if Dnf converges on Df for every bounded and continuous test function f. In particular, if Dn is the joint distribution of two random variables from a time series separated by a time-lag n, with marginal distributions P and Q, then the series is asymptotically independent if Dn converges in distribution on the product measure P*Q.
(This is "weak" convergence because the Dn can look very different from the limiting D. A classic example: Dn puts probability 1/n on the points 1/n, 2/n, ... n/n = 1. This converges weakly to the uniform distribution on the unit interval [0,1], despite apparent obstacles like the latter giving probability 1 to the irrational numbers, while all the Dn give them probability 0.)
A large literature has grown up since the 1950s which concerns itself with properties of asymptotically-independent stochastic processes, usually measuring the approach to independence by means of various "mixing coefficients", along the lines introduced by Rosenblatt. As the name implies, if these coefficients go to zero asymptotically, then the process is strongly mixing. Having these mixing coefficients go to zero is very nice, since it implies that many of the nice properties of IID sequences (central limit theorems, PAC-learning results, etc.), carry over to the dependent-but-mixing sequences. Unfortunately, strong mixing is a very strong property, which is hard to prove and in many cases is known to fail — even when there is asymptotic independence!
This book, which summarizes and extends a series of recent papers by the authors (in various combinations) is about proving limit theorems — laws of large numbers, central limit theorems, empirical-process convergence, etc. — for asymptotically-independent but non-mixing processes, which the authors call "weakly dependent". Their theorems in chs. 6--10 can, in large part, be seen as instances of the trick that I learned as "blocking", and which they attribute to S. N. Bernstein. To calculate the time average of some function of a (stationary) process, for instance, divide the series up into a series of long blocks, plus padding or filler in between them. The global average is, trivially, equal to the average of the within-block averages, plus a remainder term for contributions from the filler. By stationarity, the blocks all have the same distribution, and by weak dependence the blocks are nearly independent — more nearly with longer fillers. So one can show that the global average will have the same behavior as it would in the IID case, if one can control (1) the corrections due to the remaining dependence among the blocks, and (2) the remainder due to the fillers between the blocks. Controlling (2) is fairly straightforward; the new stuff here comes from controlling (1), the residual dependence.
Following lines laid down in mixing theory, the authors tackle (1) by constructing deviation inequalities, bounds on the probability that sample values differ from expectations by more than a certain amount. Typically, these inequalities (chs. 4 and 5) have a more-or-less combinatorial flavor, though they also use coupling, and a lot of manipulation of cumulants (without making the connection to large deviations theory via cumulant generating functions). — As usual with such inequalities, it's rarely clear whether the various factors of 33 (or whatever) are for real, or just the best one can do with the current method of proof; but they suffice for many purposes, and I suppose it's better to have the constants be explicit than to bury them inside O() or o().
In mixing theory, the analogous bounds are stated in terms of the mixing coefficients. Here the new bounds are stated in terms of a range of new dependence measures (ch. 2), all denoted by utterly un-mnenomic Greek letters (again following the mixing-theory convention). Some of them play directly off the definition of convergence in distribution in terms of expectations of test functions: how big can the difference between Dnf and P*Qf get, when f is restricted to some suitably well-behaved, normalized class of test functions? The other set of measures are "projective". The conditional expectation of a function f, given X1, X2, etc., is the projection of f on to the space of functions calculable from X1, X2, etc.; the unconditional expectation of f is the projection of f on to the space of constant functions, which are calculable from nothing. (By "calculable from" I mean "measurable in the minimal sigma-algebra of".) If f is independent of X, then its conditional expectation is still its projection on to the constants. The other class of dependence coefficients, therefore, looks at how far conditional expectations depart from constancy — either maximizing over some family of test-functions, or by making future values of the time series itself the function in question.
Applications to non-parametric regression, spectral-density estimation, and various econometric problems occupy chs. 11--13, and ch. 3 shows that many standard and some non-standard time series models are weakly dependent, with estimates of the dependence coefficients.
— The book shows definite signs of its patch-up job origins. The prose is disjointed and slightly repetitive. The notation, too, is not always consistent; sometimes the generic dependence measure is c, sometimes it's \mu, and it can switch back and forth from one line to the next. A bigger annoyance is terminology. What the authors (and, it must be said, many probabilists) call "mixing" is more properly strong mixing, i.e., convergence of joint distributions to product measure in the strong, not the weak, topology. (If the Adversary can pick any pair of events, and you can show that that pair become independent as the separation between them grows, you have proved ordinary mixing. If the Adversary gets to pick a different pair of events with each separation, and you can still show asymptotic independence, that's strong mixing.) In dynamical systems and ergodic theory, "mixing" unmodified refers to weak mixing. (See for instance the detailed treatment in Lasota and Mackey.) Thus for someone of my background, the Bernoulli shift is one of the canonical mixing processes, but they give it as their first example of a non-mixing process! (Their proof, incidentally, is wrong, but easily fixed.) In fact, I'd say that the failure to conect what they have done to the work in dynamics on decay of correlations (e.g.) is a real lost opportunity for both fields.
In the end, however, these are all minor complaints. The authors have brought together a great deal of original and important work, as a result of which the classical probabilistic limit theorems can now be rigorously applied to a much wider range of stochastic models. It's cool stuff and I would be very happy to teach it. I strongly recommend the book to statisticians interested in inference for stochastic processes, learning theorists interested in dependent data, probabilists interested in new applications, and theoretical econometricians.
Warren Ellis and Paul Duffield, Freakangels, vol. 2
Jack Campbell, Lost Fleet, vol. 5: Relentless
Brain-candy, in the process of moving from the Anabasis to De Bello Civili. (Which does not mean it's a historical pastiche.) Earlier morsels: 1--3, 4.
David Freidel, Linda Schele and Joy Parker, Maya Cosmos: Three Thousand Years on the Shaman's Path
A detailed and intelligent, if very conjectural, exploration of ancient Maya cosmology, based on deciphering surviving inscriptions, artifacts, and extrapolation from the modern Maya. Marred by irritating and silly stabs at cultural relativism (mostly at the beginning and the end).
Peter D. Harrison, The Lords of Tikal: Rulers of an Ancient Maya City
History of the greatest Maya city from the viewpoint of its rulers (the only view we really have any access to), from the beginning to the end. Clear on the difference between what we actually know and what the Mayanists are merely guessing at. Very good on architecture.
Bruce Kitchens and Selim Tuncel, Finitary Measures for Subshifts of Finite Type and Sofic Systems
As you know, Bob, a shift dynamical system has a state-space consisting of infinite sequences (one or two sided), say x drawn from a finite alphabet. The time-evolution operator T, or shift, simply moves the sequence one step to the right, i.e., (Tx)n = xn+1. This moves all the complexity and distinctions between different systems into the set of initial conditions, and space of allowed sequences, rather than in the time-evolution rule. The full shift has all possible sequences among its initial conditions (and so the space never shrinks); sub-shifts restrict the set of initial conditions. Suppose that the allowed values of xn depend on xn-1, xn-2, xn-k, but not on earlier entries in x, and that this k is the same for all n and all sequences. Then we have a subshift of finite type, the type or order of the subshift being k. These are the symbolic-dynamical equivalents of Markov chains of order k. As with Markov chains, any higher-order subshift of finite type can be converted to a first-order subshift. (Define a new alphabet where each symbol is a block of length k from the original alphabet.) A sofic system can be defined in one of two equivalent ways. One introduces the notion of a follower set, the set of all one-sided infinite sequences which can succeed a given history; a subshift is sofic if it has only finitely many follower sets. Alternately, introduce the notion of factor maps — continuous functions from one sequence space to another which commute with the shift. A sofic system is the image of a subshift of finite type under a factor map. A strictly sofic system is one which is sofic, but not a subshift of finite type.
Said another way, for any sofic system there is a set of states, possibly hidden, where the current state determines what symbols can be seen next, and the current state plus the next symbol determines the next state. Sofic systems are (the languages generated by) deterministic finite automata.
All of this is at the purely topological, possible-or-not level. One can of course also add probability measures on sequences. A Markov measure is one under which the distribution of Xn depends on Xn-1, but not, given that, on previous symbols. (Similarly for higher-order chains.) In this monograph, Kitchens and Tuncel ask, what measures are to Markov measures as sofic systems are to subshifts of finite type? They call these measures finitary, and characterize them in several ways, including (1) measures where the number of follower distributions is finite, (2) factor-map images of Markov measures, (3) ones where the conditional distribution of the future is always independent of the remote past given a finite segment of the immediate past, though one of possibly variable, context-dependent length. (The last is what motivates the name "finitary".) They develop the theory of finitary measures in considerable detail, and in parallel to the usual theory of Markov measures, as that is found in ergodic theory and the thermodynamic formalism. Statistical aspects (the reconstruction of finitary measures from sample data) are not addressed — fortunately for me. Prior acquaintance with symbolic dynamics and the thermodynamic formalism is absolutely essential, and a lot of familiarity with manipulating semi-groups would help.
(Your best bet for obtaining this is directly from the publisher. However, the community would be better served by putting it on the arxiv.)
Misty Massey, Mad Kestrel
Brain-candy. More piracy and less romance than blurb implies.
John McGowan, Democracy's Children: Intellectuals and the Rise of Cultural Politics
Nowhere near as cohesive as the subtitle suggests. Rather it's a set of essays on the conditions, activities and aspirations of academic in the humanities (especially literature as such) in the US from the mid-1980s through the mid-1990s, along with some thoughts about how they ought to try relating to the larger society they're part of. (I am quite sympathetic to the latter.) I wish McGowan would write a book about "intellectuals and the rise of cultural politics"; I think it would be very interesting.
(His mid-1990s remarks on the Web are more amusing in retrospect than they would have been at the time, since he's gone from sniffing at it to guest-blogging for Bérubé. To be fair, the idea that English departments would need specialists in hypertext does seem rather dated — because we have all become specialists in hypertext.)
C. J. Sansom, Revelation
Michelle Sagara, Cast in Fury

Books to Read While the Algae Grow in Your Fur; Scientifiction and Fantastica; Writing for Antiquity; Enigmas of Chance; Mathematics; The Progressive Forces; The Commonwealth of Letters

Posted by crshalizi at April 30, 2009 23:59 | permanent link

April 28, 2009

That Word Does Not Exist In Any Language

Attention conservation notice: 1500 words on a dispute at the intersection of paleography and computational linguistics, two areas in which I have no qualifications; also a sermon on a text from Lichtenberg: "We must not seek to abstract from the busts of the great Greeks and Romans rules for the visible form of genius as long as we cannot contrast them with Greek blockheads."

Over the weekend, I read Mark Liberman's post at Language Log about the new Rao et al. paper in Science, claiming to show information-theoretically that the symbols recovered on artifacts from the Harappan civilization in the Indus Valley are in fact a form of writing, as had long been supposed but was denied a few years ago by Farmer, Sproat and Witzel.

What Rao et al. claimed to show is that the sequences of Indus symbols possess information-theoretic properties which are distinctive of written language, as opposed to other symbols sequences, say ones which are completely random (IID, their "type 1 nonlinguistic") or completely deterministic (their "type 2 nonlinguistic"). Specifically, they examined the conditional entropy of sequence pairs (i.e., the entropy of the next symbol given the previous one). The claim is that the Indus symbols have the same pattern for their conditional entropy as writing systems, which is clearly distinguishable from non-linguistic symbol sequences by these means.

As someone who is very, very into information theory (especially conditional entropy), I was intrigued, but also very puzzled by Mark's account, from which it seemed that Rao et al. had a huge gap where the core of their paper should be. Actually reading the paper convinced me that Mark's account was correct, and there was a huge logical fallacy. I'll reproduce Figure 1A from the paper and explain what I mean.

Rao et al. worked with a corpus of Indus Valley inscriptions, which recognizes 417 distinct symbol types. This is their "Indus" line. The other language lines come from different corpora in the indicated language. In each corpus, they filtered out the less common symbols, and then fit a first-order Markov chain. (Transition probabilities were estimated with a smoothing estimator rather than straight maximum likelihood.) Then they calculated the conditional entropy of the chain, using the estimated transition probabilities and the observed symbol frequencies (rather than say the invariant distribution of the chain); that's the vertical axis. The horizontal axis shows how many symbol types were retained --- i.e., "100 tokens" means that only the 100 most common symbols in the corpus were kept, and the chain was fit to those sequences. (This is not explained in the paper but was made clear in later correspondence between Sproat and the authors.) There are two lines for English, depending on whether "token" was taken to mean "character" (differentiating upper and lower case) or to mean "word".

The bottom curve shows the estimated conditional entropy from a purely deterministic sequence; the actual conditional entropy is in fact zero, so I presume that the upward trend is an artifact of the smoothed transition probabilities. The top curve, on the other hand, is from a uniform IID sequence --- here the real conditional entropy is the same as the marginal entropy, but both grow as N increases because the size of the symbol set grows. (I.e., this is an artifact of keeping only the most common symbols.)

Here is the flaw: there is no demonstration that only linguistic sequences have this pattern in their conditional entropies. Rao et al. have shown that two really extreme non-linguistic processes don't, but that's not a proof or even a plausibility argument. I would settle for an argument that non-linguistic processes have to be really weird to show this pattern, but even that is lacking. In Mayo's terms, they have not shown that this test has any severity.

Of course the fact that they haven't shown their test is severe doesn't mean that it isn't, in fact, severe. So, by way of procrastinating, I spent some time yesterday constructing a counter-example. My starting point was what Mark had done, generating a sequence of IID draws from a geometric distribution (rather than a uniform one) and subjecting it to the same analysis as Rao et al. As it happens, I had already written a function in R to fit Markov chains and calculate their log-likelihood, and here the conditional entropy is the negative log likelihood over the sequence length. (Admittedly this is only true using the maximum likelihood estimates for transition probabilities, rather than smoothed estimates as Rao et al. do, but my simulations had so much data this shouldn't matter.) Setting the rate of the geometric distribution to 0.075, here were my first results.

Mark Liberman and Richard Sproat did almost the same thing pretty much simultaneously, as you can see from the updates to Mark's post.

This was not entirely satisfactory, since (as Rao et al. point out in the online supplementary materials), there is a big gap between the marginal and conditional entropies for writing and for the Indus symbols. This was also, however, not too hard to finesse. In addition to the geometric sequence, I generated a Markov chain which alternated between the values +1 and -1, but where each positive or negative sign was 99% likely to be followed by the same sign. (That is, the signs were highly persistent.) I then multiplied the IID geometric variables (plus 1) by the Markov signs. This gave a larger set of symbols, but where knowing the sign of the current symbol (which "register" or "sub-vocabulary" it came from) was quite informative about the sign of the next symbol. (I added 1 to the geometric variables to exclude 0=-0, keeping the sub-vocabularies distinct.)


Pluses: marginal entropy; circles: conditional entropy

A third experiment takes after the fact that the Indus symbol sequences are all extremely short, at most a dozen characters or so. In stead of having a Markov chain for the sign, I used another, independent set of random draws, uniform on the integers from 2 to 6, to divide the sequence into blocks, and gave all the symbols in each block the same (coin-toss) sign.


Pluses: marginal entropy; circles: conditional entropy

(Because I'm doing everything with a single long sequence, I artificially introduce transitions from positive to negative signs, which lowers the gap between the conditional and unconditional entropies. If I wanted to do this properly, I'd re-write my Markov estimator so it used many short sequences; but that would be too much like real work.)

The mechanism producing the gap between conditional and unconditional entropies is that the marginal distribution of symbols is a mixture of several pure distributions, and which mixture component we draw from now influences which component we'll draw from next (so the sequence can be Markov, exchangeable, etc.). Given the mixture components, the symbols are independent and the conditional and unconditional entropies are equal. Without that knowledge, the first symbol in effect is a cue for figuring out the mixture component, reducing the entropy of the second. There is nothing specifically linguistic about this; any hidden Markov model does as much. It would, for instance, work if the "symbols" were characters in randomly-selected comic books, who cluster (though slightly imperfectly); if that's too low-brow, think about Renaissance paintings, and the odds of seeing St. John the Baptist as opposed to a swan in one which contains Leda.

I have made no attempt to match the quantitative details Rao et al. report for the Indus symbols, just the qualitative patterns. Were I to set out seriously to do so, I'd get rid of the geometric distribution, and instead use a hidden Markov model with more than two states, each state having a distinct output alphabet, the distribution of which would be a Zipf (as used by Liberman or Sproat) or a Yule-Simon. (I might also try block-exchangeable sequences, as in my third example.) But this would approach real work, rather than a few hours of procrastination, and I think the point is made. Perhaps the specific results Rao et al. report can only be replicated by making distributional assumptions which are very implausible for anything except language, but I'd say that the burden of proof is on them. If, for instance, they analyzed lots of real-world non-linguistic symbol systems (like my comic books) and showed that all of them had very different conditional entropy curves than did actual writing, that would be a start.

I should in the interest of full disclosure say that a number of years ago Farmer and I corresponded about his work on the development of pre-modern cosmologies, which I find interesting and plausible (though very conjectural). But if anything I hope Farmer et al. are wrong and the Indus Valley civilization was literate, and I'd be extra pleased if the language were related to Tamil. Unfortunately, if that's the case it will need to be shown some other way, because these conditional entropy calculations have, so far as I can see, no relevance to the question at all.

My code (in R) is here if you want to play with this, or check if I'm doing something stupid. In the unlikely event you want more, I suggest reading the reply of Farmer et al., Rahul Siddharthan (especially the comments), or Fernando Pereira; the last is probably the wisest.

Manual trackback: Metadatta; Language Hat

Enigmas of Chance; Writing for Antiquity

Posted by crshalizi at April 28, 2009 22:30 | permanent link

April 10, 2009

Next Week at the Statistics Seminar: Bayes, Bayes, Baked Beans, Sausage and Bayes

Next week at the CMU statistics seminar, we give you all the Bayes you could want and more:

Tom Griffiths, "Connecting human and machine learning via probabilistic models of cognition"
Abstract: Human performance defines the standard that machine learning systems aspire to in many areas, such as forming new concepts, making scientific discoveries, and learning language. This suggests that studying human cognition may be a good way to develop better learning algorithms, as well as providing basic insights into how the mind works. However, in order for ideas to flow easily from psychology to computer science and vice versa, we need a common language for describing human and machine learning. I will summarize recent work exploring the hypothesis that probabilistic models of cognition, which view learning as a form of statistical inference, provide such a language, including results that illustrate how novel ideas from statistics can inform cognitive science.
Date and place: Monday, 13 April 2009, 4-5 pm in Porter Hall 125C
Peter Orbanz, "Conjugate Projective Limits"
Abstract: Bayesian nonparametric models are essentially Bayesian models on infinite-dimensional spaces. Most work along these lines in statistics focusses on probability models over the simplex. In machine learning, the problem has recently received much attention as well, and attempts have been made to define models on a wider range of infinite-dimensional objects, including measures, functions, and infinite permutations and graphs.
In my talk, I will discuss the construction of nonparametric Bayesian models from finite-dimensional Bayes equations, roughly analogous to the Kolmogorov extension of systems of measures to their projective limits. I will present an extension theorem applicable to regular conditional probabilities. This can be used to study whether "conditional" properties of the finite-dimensional marginal models, such as conjugacy and sufficiency, carry over to the infinite-dimensional projective limit model, and to determine the functional form of the nonparametric Bayesian posterior if the model is conjugate.
Date and Place: Friday, 17 April 2009, 1-2 pm, Porter Hall A18A
Both talks are free and open to the public.

Enigmas of Chance; Minds, Brains, and Neurons

Posted by crshalizi at April 10, 2009 13:08 | permanent link

In Another Green World

I will be completely offline from the 11th to 20th, while I contemplate whether certain referees deserve to be fed to jaguars, or whether it would not be more humane to sacrifice them to the great feathered serpent. (I mean humane to the cats, of course.)

Self-centered

Posted by crshalizi at April 10, 2009 13:00 | permanent link

April 03, 2009

Next Week at the Statistics Seminar: "Methods and Models for Time-Dependent Relational Data"

Next week at the CMU statistics seminar:

Andrew C. Thomas, "Methods and Models for Time-Dependent Relational Data"
Abstract: In recent years there has been a proliferation of relational data, including studies of social networks, where outcomes are measured over time. Methods that use binary outcomes traditionally rely on Markov chain assumptions that indicate the generation, maintenance and severing of network ties, often relying on dichotomized perceptions of relationships. I present a hierarchical latent variable model for time-dependent relations that makes use of dynamic programming techniques for its solution, and discuss the expansion of this model to non-binary ties, with several real and synthetic examples of social and professional interactions.
Date and place: Monday, 6 April 2009, 4-5 pm in Porter Hall 125C
Free and open to the public.

Networks; Enigmas of Chance

Posted by crshalizi at April 03, 2009 13:37 | permanent link

March 31, 2009

Books to Read While the Algae Grow in Your Fur, March 2009

Diana Pharaoh Francis, The Cipher and The Black Ship
Fantasy brain-candy. The first book does not, despite the title, involve cryptography.
Fall of Cthulhu, vol. 4, Godwar
Pointless if you have not been following along (1, 2, 3).
Carrie Vaughn, Kitty Raises Hell
Patricia Briggs, Bone Crossed
Dean Baker, Plunder and Blunder: The Rise and Fall of the Bubble Economy
The best thing I've read on the current crisis: short, plainly written, and totally accurate. (You can get a sense of its contents here.)
Robert Sharer with Loa Traxler, The Ancient Maya, 6th edition
Massive (~800 pp.) textbook on Maya archaeology, supplemented with ethnohistory and ethnography. Covers the whole period from first settlement through the Spanish conquests, though naturally emphasizing the Classic period (+250 to +900 or +1100, depending on where you are).
Taylor Anderson, Into the Storm, Crusade and Maelstrom
More enjoyable than a trilogy of military SF novels which could be summarized as "what these lemurs need is a boatload of vintage honkeys" has any right being.
Felix Gilman, Thunderer: A Novel of High Fantasy
The city itself as the enchanted realm, with lost, mad and exploited gods, airships, music, feral children, and philosophes writing an encyclopedia. (He realizes that the ecology makes no sense.)

Books to Read While the Algae Grow in Your Fur; Scientifiction and Fantastica; The Dismal Science; The Progressive Forces; The Continuing Crises; Writing for Antiquity; Cthulhiana

Posted by crshalizi at March 31, 2009 23:59 | permanent link

March 27, 2009

Another Idle Question

How many of the people currently pushing or exploiting conspiracy theories about the introduction of a global currency also claim to support returning to the gold standard?

(And where's Dan Sperber when we need him?)

The Dismal Science; The Running Dogs of Reaction; Psychoceramics

Posted by crshalizi at March 27, 2009 21:10 | permanent link

March 26, 2009

Some Bayesian Finger-Puzzle Exercises, or: Often Wrong, Never In Doubt

Attention conservation notice: Clearing out my drafts folder. 600+ words on some examples that I cut from a recent manuscript. Only of interest to (bored) statisticians.

The theme here is to construct some simple yet pointed examples where Bayesian inference goes wrong, though the data-generating processes are well-behaved, and the priors look harmless enough. In reality, however, there is no such thing as an prior without bias, and in these examples the bias is so strong that Bayesian learning reaches absurd conclusions.

Example 1

The data Xi, i=1,2,3,..., come from a 50/50 mixture of two Gaussians, with means at -1 and +1, both with standard deviation 1. (They are independent and identically distributed.) The prior, by coincidence, is a 50/50 mix of two Gaussians, located at -1 and +1, both with standard deviation 1. So initially the posterior predictive distribution coincides exactly with the actual data-generating distribution. After n observations x1, ... xn, whose sum is z, the log-likelihood ratio L(+1)/L(-1) is e2z. Hence the posterior probability that the expectation is +1 is 1/(1+e-2z), and the posterior probability that the expectation is -1 is 1/(1+e2z). The sufficient statistic z itself follows an unbiased random walk, meaning that as n grows it tends to get further and further away from the origin, with a typical size growing roughly like n1/2. It does keep returning to the origin, at intervals dictated by the arc sine law, but it spends more and more of its time very far away from it. The posterior estimate of the mean thus wanders from being close to +1 to being close to -1 and back erratically, hardly ever spending time near zero, even though (from the law of large numbers) the sample mean converges to zero.

This figure shows typical sample paths for z, for the posterior probability of the +1 mode, and for the relative entropy of the predictive distribution from the data-generating distribution. (The latter is calculated by Monte Carlo since I've forgotten how to integrate, so some of the fuzziness is MC noise.) Here is the R code.

click for full-size PDF

Exercise 1: Confirm those calculations for the likelihood ratio and so for the posterior.

Exercise 2: Find the expected log-likelihood of an arbitrary-mean unit-variance Gaussian under this data-generating distribution.

Example 2

Keep the same data-generating distribution, but now let the prior be the conjugate prior for a Gaussian, namely another Gaussian, centered at zero. The posterior is then another Gaussian, which is a function of the sample mean, since the latter is a sufficient statistic for the problem.

Exercise 3: Find the mean and variance of the posterior distribution as functions of the sample mean. (You could look them up, but that would be cheating.)

As we get more and more data, the sample mean of converges almost surely to zero (by the law of large numbers), which here drives the mean and variance of the posterior to zero almost surely as well. In other words, the Bayesian becomes dogmatically certain that the data are distributed according to a standard Gaussian with mean 0 and variance 1. This is so even though the sample variance almost surely converges to the true variance, which is 2. This Bayesian, then, is certain that the data are really not that variable, and any time now will start settling down.

Exercise 4: Suppose that we take the prior from the previous example, set it to 0 on the interval [-1,+1], and increase the prior everywhere else by a constant factor to keep it normalized. Show that the posterior density at every point except -1 and +1 will go to zero. (Hint: use exercise 2 and see here.)

Update in response to e-mails, 27 March: No, I'm not saying that actual Bayesian statisticians are this dumb. A sensible practitioner would, as Andy Gelman always recommends, run a posterior predictive check, and discover that his estimated model looks nothing at all like the data. But that sort of thing is completely outside the formal apparatus of Bayesian inference. What amuses me in these examples is that the formal machinery becomes so certain while being so wrong, while starting from the right answer (and this while Theorem 5 from my paper still applies!). See the second post by Brad DeLong, linked to below.

Manual trackback: Brad DeLong; and again Brad DeLong (with a simpler version of example 1!); The Statistical Mechanic

Enigmas of Chance

Posted by crshalizi at March 26, 2009 10:45 | permanent link

March 24, 2009

Where Did the Steelworkers Go?

Attention conservation notice: back-of-the-envelope calculations about why the US has only about a fifth as many steelworkers now as it did in 1960. Not backed by any actual knowledge of the steel industry. Utterly untimely, it was, I think, prompted by a comment thread on Unfogged, but so long ago I can't remember which.

In 1960, US primary steel production was 91 million tons, of which 2.95 million tons were exported; it also imported 3.24 million tons. This part of the industry employed 530,000 people in all capacities, for an annual output of 170 tons/employee.

In 2007, US primary steel production was 98.1 million tons, with exports of 10.1 million tons and imports of 30.2 million tons. Employment was only 97,540 people, coming to 1005 tons/employee.

Exports and imports in 1960 were a wash, nearly enough, so let's suppose trade patterns had remained comparable and say that all of the net imports were to be made up by higher domestic production: (20.1 million tons)/(1005 tons/worker) = 20,000 extra workers. This would be a substantial increase, but it would still leave employment in steel at only 22% of its 1960 level. Where did the other four-fifths of the industry go?

The most obvious explanation is productivity. The industry in 2007 produced more than it did in 1960, with many fewer employees. In fact, output per employee grew 5.9 times over that period. A six-fold increase in productivity divided by a slight rise in total demand equals a roughly five-fold fall in employment.

Now, this calculation understates the effect of trade because it only considers net imports of steel. But steel is used as an input to producing many other things, and a washing machine made of steel shows up in this sort of official statistic as an import of a manufactured good, not an import of steel. So to really see what US steel production would be if we retained 1960 trade patterns, we'd need to see what the change in the (foreign*) steel content of US net imports has been. Since I don't have Leontief input-output matrices for the US and its trading partners in the two years, I can't do this.

Failing actual knowledge, I'll turn to guesswork. Suppose the steel content of imports was equal to net direct imports; this seems high, but what do I know? This would just add another 20,000 jobs, and bring us up to 26% of the size of the industry in 1960. To get the same level of employment in steel production now as in 1960, the net increase in the foreign steel content of our imports would have to satisfy

(530,000 workers) - (117,540 workers for domestic production and direct imports) = (increase in net indirect imports)/(1005 tons/worker)
or 414,522,300 tons, i.e., about 3.5 times total production plus net direct imports. This is highly implausible.

I conclude that domestic employment in steel production has collapsed largely because increases in productivity have not been matched by increases in demand. If someone can point out where this reasoning goes wrong, I'd appreciate it.

*: Foreign steel content, because if the washing machine is made abroad of steel exported by the US, replacing that washing machine with a US-made one will not increase the demand for American steel.

Sources: 2007 employment figure from BLS (NAICS code 3311). 1960 employment figure from Table 1 on p. 2 of Lebergott. (It does not, however, appear to be affected by some of the well-known problems with Lebergott's series for the 1930s.) Annual production, import and export figures from USGS.

Manual trackback: The Inverse Square Blog; Nothing Funny About Feldspar (with more facts; go read)

The Dismal Science

Posted by crshalizi at March 24, 2009 10:49 | permanent link

March 22, 2009

Special Function Invocation

O Hive Mind, o Lazy Web, Urania's child, I invoke thee! Is there a name for the function
\[ 
f_n(\theta) = \sum_{k=0}^{n}{{n \choose k} \theta^k {(1-\theta)}^{n-k} \log{k!}} 
 \]
i.e., for $ \mathbb{E}[\log{X!}] $ when X is binomially distributed?

Enigmas of Chance

Posted by crshalizi at March 22, 2009 21:52 | permanent link

March 17, 2009

Idle Question of the Day

Exactly what bad consequences would follow if laws were passed by the relevant countries rendering credit default swap contracts void henceforth? (That is, canceling all the outstanding wagers because the bookies went bust.)

Update, 22 March: Well, one bad consequence would evidently be agreeing with Ben Stein. A bit from that link (by Felix Salmon, not Stein) is worth quoting:

There's a good chance, just for starters, that every major bank in America would go bust overnight: after all, they've been packaging up and selling off the credit risk on their multi-trillion-dollar loan portfolios [for years]. If Stein got his way, all that credit risk would suddenly reappear on the banks' balance sheets, and there's nothing they could do about it. Genius. Remember that those super-senior CDOs were the safest bits of the credit that they sold off. Just imagine what their balance sheets would look like if all the risky bits reappeared.

The issue he's raising is that if the banks can't say that they're covered for the risk of their loans defaulting (via the credit default swaps), they need to hold more capital as a protection against default. So as a legal or regulatory issue, ending the swaps would make the banks worse off. Substantively, however, this only makes sense if the swaps would, in fact, protect banks in the event of defaults — if they actually shifted the risk to the swap-sellers. Since we have just had pretty dramatic demonstrations that this is not something to be counted on, it's not at all clear to me that the banks ought to be able to keep that risk off their balance sheets. (In other words, the real value of the swaps to the banks is zero, or next to zero.) In any case, this objection could be countered by combining ending credit default swaps with public guarantees of the banks' existing positions — which is effectively what's happening anyway, only without making it harder to repeat the mistake in the future.

More broadly, ending credit default swaps would mean that those who sold such swaps would lose their stream of payments (a flow) but gain back their collateral and reserves (a stock); conversely buyers of default protection would gain a cash flow but take a hit to their capital stocks. Right now one imagines that even those selling the "end of the world trade" might prefer to get out of the game; I'd be interested to see an estimate of the effects of this on the stability of the financial world right now.

There is also the possibility that eliminating the swaps would deprive us of information about how risky different debts are. The value we should place on this, however, depends on how well these markets actually succeed in aggregating information about risk. I'd say there is abundant cause for skepticism about this — especially when things are, in fact, dangerous. Economic theory does not, in fact, provide any reason to think that such markets will be dominated by those with the most accurate beliefs (or even that the market as a whole will be more accurate than the best-informed trader), unless you assume a complete set of markets, which is a reductio ad absurdum if ever there were one. (When markets are incomplete, more markets are not necessarily better.)

To be clear, I am not asserting that credit default swaps should be ended. I honestly don't think I know enough to have an opinion about that, and while I'm obviously skeptical about their value, some serious and credible people (i.e., ones who do not have a vested interest in the matter) who've studied them in more depth see merit to them. If this is what a world with efficiently-allocated risk looks like, though, I'd hate to see a messed-up one.

(Thanks to readers D.H. and son1 for comments and pointers.)

The Dismal Science; The Continuing Crises; Modest Proposals

Posted by crshalizi at March 17, 2009 22:22 | permanent link

Three-Toed Sloth:   Hosted, but not endorsed, by the Center for the Study of Complex Systems