Attention Conservation Notice: Over 3000 words of technical commentary on a paper on the statistical analysis of networks. Does a poor job of explaining things to those without background knowledge of networks, statistical inference and Markov chains. Includes some geeky jokes and many equations.
Yesterday in the Statistical Learning for Networks seminar, we discussed the following paper; what follows are a mix of my notes from before and after the discussion.
It's perhaps not completely clear from the abstract that their method works for a particular class of network growth models, which they call "duplication-attachment" (DA) models. These are pure growth models, where the network only expands, and never loses nodes or edges. (The network is assumed to be undirected, without self-loops or multiple edges between a given pair of nodes.) At each time step, we add one node. This is randomly chosen, with fixed probability, to be either a duplication or an attachment. If it's an attachment, the new node attaches to an old one, chosen uniformly over the graph (possibly with some fixed probability < 1). If it's a duplication event, we pick an existing node to duplicate, and the new one gets a certain probability of copying each of its model's links (independently), and a different probability of being linked to its model. It is entirely possible that a new node is added with no links to any existing node. Notice that nodes (and edges) have no intrinsic properties in this model; their names are arbitrary. Any two isomorphic graphs should therefore be assigned the same probability.
(The motivation for such models is that gene duplication is apparently fairly common, at least over evolutionary time, which would duplicate the interactions between genes, or their proteins. Attachment, here, is supposed to summarize all the other processes of network growth. There are several models of the evolution of protein interaction networks, e.g. those of Sole et al., 2002, and Vazquez et al., 2003, which are basically of the duplication-attachment type, and yield networks which at least qualitatively match some features of real-world examples, like degree distributions. These papers are not cited here.)
From any starting graph, it is easy to run the model forward and generate
new, larger random graphs; the probabilities involved are all pretty simple
uniform and binomial things. In fact, the current state of the graph
completely determines the distribution of future graphs, so this is a Markov
chain. The transition probabilities are fixed by the duplication and
attachment parameters, collectively
, and
these, together with a distribution over starting graphs, give us
a fully-specified stochastic model.
Normally, statistical inference for Markov chains is fairly straightforward, because most of the classical conveniences which make inference for IID data tractable are still available. (This is, after all, what led Markov to his chains!) So why then does the paper not end on the second page, with a citation "See Billingsley (1961)"? Because normally, when we observe a Markov chain, we observe a sequence of values from the chain, and that lets us estimate the transition parameters. Here, however, we have only the end-state, the final graph, as our data. The Markov chain will let us assign likelihoods to paths, but we don't know where we started from, and we don't know where we went from there, just how we ended up here.
Suppose we did know where we started from, some graph of
size
,
. (Remember here that each step of the chain adds one node,
so t really counts nodes, not clock-time. This is why it's natural to
start at
, as opposed to 0 or 1. The paper does not seem to
explain this point.) If we knew our initial state, then in principle we could
figure out the probability of reaching our final state from it, as a sum over
all possible paths:
is set of all growing sequences of
graphs which start at
and end
at
. This mathematical expression is a
mouthful, admittedly, but it's probably clearer in a picture.
to
the final, observed graph
. The chain tells us the probability
of each such path. Since we had to take one, and only one, of these paths, the
total probability of making the journey is the sum of the probabilities of all
the individual paths.
At this point, any physicists in the audience should be nodding their heads; what I've just said is that the likelihood, from a given starting configuration, is a sum over histories, or a path integral. Along with the authors, I'll return to how to evaluate this path integral presently, but first we need to figure out how to get that starting configuration.
If we had a known distribution over starting graphs, we could (in principle) just evaluate the likelihood conditional on each starting graph, and then take a weighted sum over graphs. This, however, is not what the authors do. (I'm really not sure where one would find such a distribution, other than another model for graph development. Bayesian practice would suggest picking something which led to easy computations, but this makes a mockery of any pretense to either modeling nature, or to representing incomplete prior knowledge.) Instead, they try to use the known dynamics of the DA model to fix on an unambiguous starting point, and do everything conditional on that.
They observe that you can take any graph, and, for each node, identify the
other nodes it could have been copied from. (If A could have been copied
directly from B, then A's neighbors must be a subset of B's [ignoring the link
between A and B, if any].) So, from any starting graph, you can recursively
remove nodes that could have arise through simple duplications. In general, at
each stage in this recursion there will be multiple nodes which could be
removed, and their choice is arbitrary. Remarkably enough, no
matter which choices one makes, the recursion always terminates at
the same graph. (More exactly, any two end-points are isomorphic to
each other, and so identical for statistical purposes.) The proof is basically
a fixed point theorem about a partial order defined on graphs through
deletion-of-duplicates, but they confine it to the supplementary
materials, so you can take it on trust (and they don't use
such lattice-theoretic language even there). This graph
--- the data, minus everything that could be pure duplication --- is what they
take as their starting point. This is the
to the data's
. Everything is then done conditional on
.
OK, we have our initial condition and our final condition, and we have our Markov chain, so all we've got to do so is evaluate the integral over paths linking the two.
Problem: There are too many paths. In the worst case, the number of paths is going to grow factorially with the number of nodes in the observed graph. Even though along each path we've just got to do some straight-forward multiplication, simply enumerating all the paths and summing over them will take us forever. (The authors discuss some algorithmic tricks for speeding up the exact calculation, but still get something super-exponential!) Thus, evaluating the path integral for the likelihood is intractable, even for a single parameter value.
Solution: Don't look at all the paths. Rather, sample some paths, say N of them, evaluate the likelihood along each, and average. Hope that this converges quickly (in N) to the exact value of the integral. This is, after all, how physicists approach many path integrals.
Problem: Even if N is fairly small, we need to examine
many settings of the parameter
. It could still kill us to
have to sample N distinct paths for each parameter value.
Solution: Use importance sampling. Draw a single path,
valid for all parameter values, and evaluate the likelihood in terms of the
value of an "importance weight" along this path. The weight has to be a
function of
, but it should be the only thing which is. We
do this here by writing the likelihood,
, as an
expectation with respect to a reference measure, which the authors write
. This reference measure is given by
another Markov chain, called the "driving chain"; despite its name, it
is not a member of the DA family of chains. The trick here is
that one sample of possible paths, generated according to this chain,
can be used to (approximately) evaluate the likelihood at all
parameter settings of the DA model.
The crucial equation is [3] on p. 7568
,
but this is wrong.) Let's unpack this a bit.
is the probability of producing the
graph
through the addition of the
node
, with parameter
setting
. (N.B.,
must be a "removable"
node, one which could have been added by duplication.)
is this transition probability, summed over all
possible
. The first two factors in S are what we
want, the probability we'd get moving forward along the path according to the
parameter
. The third term is the reciprocal of the
transition probabilities according to the driving chain. Its only job is to
cancel those probabilities out.
The algorithm for generating the ith sample path is then as
follows. Start with the observed graph
. Count backwards,
Pick a
node
to delete, with probability
proportional to
. (Once
again, this limits us to the "removable" nodes, the ones which could have been
produced by duplication.) Set
to be the result of deleting that node. Keep going
back until we hit the irreducible core,
. (We will always hit
this core, by the fixed point theorem proved in the supplementary results.)
Then
.
So, to summarize: We can generate a sample of paths connecting the observed final graph to the unobserved initial graph, according to the driving chain, and then approximate the likelihood for any parameter value by multiplying the importance weights along those paths and summing over paths. (The importance weights themselves even factor nicely.) We have thus solved the problem of evaluating a path integral, when we've got only one end of too many possible paths.
The trick used here to pull this off depended on having a uniquely-defined
starting point for all parameter settings, namely the
defined
through undoing duplications. (According to the authors, they took this from
papers on the coalescent process in population genetics, but I have not been
able to track down their references.) Strictly speaking, everything is
conditional on that starting point. Of this, more below.
Left unaddressed by the above is the question of how many paths we need to sample. Remember, the whole point is to not have to look at every possible path! If it turns out that accurate approximations to the likelihood require us to sample some substantial fraction, then this is all for nothing. However, the authors' figures reveal something quite remarkable. Whether N is 10 or 1000, the approximate likelihood changes very little (at least on a log scale), even with real data. This suggests that we don't, actually, need a lot of paths, but why?
For each i,
is an independent
realization of a random variable, whose distribution depends
only
(holding fixed the driving chain, and the initial and
final graphs). Since they are IID, we can apply the central limit theorem,
which tells us that their mean should converge at rate
.
Since that's noticeably smaller for N=1000 than for N=10, it must be the case
that the variance of the likelihood along the individual sample paths is
already pretty small. Why?
The lame physics answer is, "the principle of least action". There will be
optimal, most-probable paths, and they will dominate the sum, the others
tending to make negligible contributions. With high probability, a random sample
will pick out the most probable paths. Q.E.D. This argument could, perhaps,
be made less lame through an application of large deviations results,
specifically conditional-limit-theorem- (or "Gibbs's conditioning principle"-)
type results for Markov chains, which roughly say that if something improbable
(a passage from
to
) happens, it does so in
the least-improbable possible way, and deviations from that trajectory are
exponentially suppressed. <televangelist>In the name of Cramér,
in the name of Varadhan, in the name of Freidlin and Wentzell, I call on you,
Brother Argument, to be healed! Arise and prove! And, cf. Eyink
(2000).</televangelist>
An information-theoretic answer is to invoke the asymptotic equipartition
principle, a.k.a. the Shannon-McMillan-Breiman theorem. This says that
if
are generated according to a (well-behaved)
stochastic process, whose distribution is
, and
is a sufficiently well-behaved model, then
is the entropy rate of the data-generating process
, and
is the relative entropy rate
between the data source and the model, i.e., the asymptotic growth rate of the
Kullback-Leibler divergence. (For details, see Algoet and Cover, 1988, or
Gray, 1990). So
The biggest unclarity of all, probably, is the role of
.
Recall that we reached this by removing nodes which could have been
added by pure duplication. There is, however, no particular reason to think
that the actual growth of the graph ever passed through this state. It has the
advantage of giving us a unique starting point for the chain, but there are,
potentially, others. One, of course, is the trivial network consisting of a
single node! Another possibility (which came up in the discussion, I think
mostly due to Anna Goldenberg) is to
first remove potential duplicates, as the authors do, and then remove nodes
which have only one link to them, as clear attachments. This process of
unwinding the attachments could potentially be iterated, until no "danglers"
are left. This, too, is a uniquely-defined point. We can then go back to
removing nodes which are, now, potential duplicates, and so on. Someone (I
forget who) suggested that this might always terminate at the one-node network;
it would be nice to either show this or give a counter-example. But if there
is some principled reason, other than tractability, to use
their
, I can't figure out what it is from this paper.
Only using a growing network, and in particular only focusing on growth through duplication, certainly simplifies the computational problem, by reducing the number of possible paths which could terminate in the observed graph. Deletion of nodes and edges is however going to be very important in more biologically-plausible models, to say nothing of models of social or technological networks. Presumably the trick of using a backward-looking chain which stops when it hits a unique starting configuration could still be used with deletions --- I think the authors are hinting as much in their conclusion --- but it's not clear to me that a unique starting point is appropriate. With biological interaction networks, for example, one might argue that, e.g., metazoans have been around for a long time, so the distribution of networks ought to be close to stationary, and so starting configurations should be drawn from an invariant distribution of the appropriate chain...
This raises two further points, which are not un-related: the asymptotics of the DA model, and the biological utility of such a model. Run for a long time, the DA model will produce graphs of unbounded size, but it's not immediately obvious what these graphs will look like. In particular, what will be their degree distribution? The Barabasi-Albert model (Albert and Barabasi, 2002) produces scale-free distributions, <boosterism>because it uses the same mechanism as Herbert Simon's classic paper</boosterism> (Bornholdt and Ebel, 2001). This relies on a rich-get-richer dynamic, where nodes with high degree are more likely to attract new edges. My initial thought was that this wasn't present in the DA model, because targets for attachment and targets for duplication are both chosen uniformly. However, someone in the discussion --- I think, though I may be mis-remembering, that it was Tanzy Love --- pointed out that while high-degree nodes are no more likely to be copied than low-degree nodes, edges into high-degree nodes are more likely to be copied than edges into low-degree nodes. This is because if a node has degree k, there are k other nodes whose duplicated could end up linking to it. It may even be the case that this is falls under the theorems in Simon... Presumably the asymptotics would only become harder to handle if we added events deleting nodes or edges.
As for the biological utility, I'll repeat that none of the nodes have any identity of their own; only their role in the network of relations represented by the edges has any bearing on the model. "If this be structuralism, make the most of it": by turning it into a neutral model for the evolution of biological networks. After all, there is no reason here to duplicate certain nodes or edges, it's all just uniform chance. One key use of neutral models is to provide a background against which to detect adaptation; how could we do that here?
References:
Albert, Réka and Albert-László Barabási (2002),
"Statistical Mechanics of Networks", Reviews of Modern Physics
74 (2002): 47--97 = cond-mat/0106096
Algoet, Paul H., and Thomas M. Cover (1988), "A Sandwich Proof of the
Shannon-McMillan-Breiman Theorem", The Annals of
Probability 16: 899--909
Billingsley, Patrick (1961). Statistical Inference for Markov
Processes. Statistical Research Monographs, vol. 2. Chicago:
University of Chicago Press.
Bornholdt, Stefan and Holger Ebel (2001), "World-Wide Web scaling exponent
from Simon's 1955 model", Physical Review E 64:
035104 = cond-mat/0008465
Eyink, Gregory L. (2000), "A Variational Formulation of Optimal Nonlinear
Estimation", Methodology and Computing in Applied Probability
(submitted) = physics/0011049
Gray, Robert M. (1990), Entropy and Information Theory.
Berlin:
Springer-Verlag. Full text
free online.
Simon, Herbert A. (1955), "On a Class of Skew Distribution Functions",
Biometrika 42: 425--440
Solé Ricard V., Romualdo Pastor-Satorras, Eric Smith and Thomas
B. Kepler (2002), "A model of large-scale proteome evolution", Advances
in Complex Systems 5 (2002):
43--54 = cond-mat/0207311
Vázquez, A. and A. Flammini and A. Maritan and A. Vespignani (2003),
"Modeling of protein interaction networks", Complexus
1:
38--44 = cond-mat/0108043
Posted by crshalizi at November 29, 2006 02:30 | permanent link