<?xml version="1.0"?>
<!-- name="generator" content="blosxom/2.0" -->
<!DOCTYPE rss PUBLIC "-//Netscape Communications//DTD RSS 0.91//EN" "http://my.netscape.com/publish/formats/rss-0.91.dtd">

<rss version="0.91">
  <channel>
    <title>Notebooks</title>
    <link>http://bactra.org/notebooks</link>
    <description>Cosma's Notebooks</description>
    <language>en</language>

  <item>
    <title>Clustering</title>
    <link>http://bactra.org/notebooks/2012/04/15#clustering</link>
    <description>
&lt;P&gt;A topic in &lt;a href=&quot;data-mining.html&quot;&gt;data mining&lt;/a&gt; and
&lt;a href=&quot;statistics.html&quot;&gt;statistics&lt;/a&gt;: given a big bunch of data points,
assign them to a discrete set of groups in a way which somehow reflects the
natural divisions among them, without knowing in advance what the groups are.
This is the unsupervised counterpart to classification.  (You see where the
connection with &lt;a href=&quot;learning-inference-induction.html&quot;&gt;induction&lt;/a&gt; comes
in.)  This is an important subject, but one of the topics I most dislike
teaching in data mining, because the students' natural question is always &quot;how
do I know when my clustering algorithm is giving me a &lt;em&gt;good&lt;/em&gt; solution?&quot;,
and it's very hard to give them a reasonable answer.  I think this is because
most other data-mining problems are basically &lt;em&gt;predictive&lt;/em&gt;, and so one
can ask how good the prediction is; what's the best way to turn clustering into
a prediction problem?  (Probabilistic mixture models suggest themselves, of
course.)

&lt;ul&gt;Recommended, big picture:
	&lt;li&gt;David Hand, Heikki Mannila and Padhraic Smyth, &lt;cite&gt;Principles of
Data Mining&lt;/cite&gt; [The textbook I teach from; also a book I learned a lot from.
&lt;a href=&quot;../weblog/algae-2010-01.html#data-mining&quot;&gt;Comments&lt;/a&gt;]
	&lt;li&gt;Isabelle Guyon, Ulrike von Luxburg, and Robert C. Williamson,
&quot;Clustering: Science or Art?&quot; [&lt;a href=&quot;http://clusteringtheory.org/opinions/opinion-artorscience.pdf&quot;&gt;Online PDF&lt;/a&gt;]
	&lt;li&gt;Jacob Kogan, &lt;cite&gt;Introduction to Clustering Large and
High-Dimensional
Data&lt;/cite&gt; [&lt;a href=&quot;../weblog/algae-2009-08.html#kogan&quot;&gt;Comments&lt;/a&gt;]
	&lt;/ul&gt;

&lt;ul&gt;Recommended, close-ups:
	&lt;li&gt;Margareta Ackerman and Shai Ben-David, &quot;Measures of Clustering Quality: A Working Set of Axioms for Clustering&quot;, NIPS 2008 [&lt;a href=&quot;http://www.cs.uwaterloo.ca/~mackerma/ClusteringQualityPaper.pdf&quot;&gt;PDF&lt;/a&gt;.  A rebuttal
to Kleinberg's paper on clustering.  Thanks to &quot;arthegall&quot; and Ed Vielmetti
for pointers.]
	&lt;li&gt;David Arthur and Sergei Vassilvitskii, &quot;&lt;tt&gt;k-means++&lt;/tt&gt;: The Advantages of Careful Seeding&quot; [&lt;a href=&quot;http://www.stanford.edu/~darthur/kMeansPlusPlus.pdf&quot;&gt;PDF preprint&lt;/a&gt; via Dr. Arthur]
	&lt;li&gt;S&amp;eacute;bastien Bubeck, Ulrike von Luxburg, &quot;Nearest Neighbor
Clustering: A Baseline Method for Consistent Clustering with Arbitrary
Objective
Functions&quot;, &lt;a
href=&quot;http://jmlr.csail.mit.edu/papers/v10/bubeck09a.html&quot;&gt;&lt;cite&gt;Journal of
Machine Learning Research&lt;/cite&gt; &lt;strong&gt;10&lt;/strong&gt; (2009): 657--698&lt;/a&gt;
	&lt;li&gt;Jon Kleinberg, &quot;An Impossibility Theorem for Clustering&quot;,
&lt;cite&gt;Advances in Neural Information Processing Systems&lt;/cite&gt; 15 (NIPS 2002)
[Thanks to Aaron Clauset for pointing me to this.  PDF
from &lt;a href=&quot;http://www.cs.cornell.edu/home/kleinber/nips15.pdf&quot;&gt;Kleinberg&lt;/a&gt;
and from &lt;a href=&quot;http://books.nips.cc/papers/files/nips15/LT17.pdf&quot;&gt;NIPS&lt;/a&gt;.
But see Ackerman and Ben-David, above.]
	&lt;/ul&gt;

&lt;ul&gt;Modesty forbids me to recommend:
	&lt;li&gt;My &lt;a href=&quot;http://www.stat.cmu.edu/~cshalizi/350&quot;&gt;lecture notes
for my data mining class&lt;/a&gt; [However, many of them are based on lecture notes
originally written
by &lt;a href=&quot;http://research.microsoft.com/en-us/um/people/minka/&quot;&gt;Tom
Minka&lt;/a&gt;, and modesty does not forbid me from recommending his work.]
	&lt;/ul&gt;

&lt;ul&gt;To read:
	&lt;li&gt;L. Angelini, L. Nitti, M. Pellicoro and Sebastiano Stramaglia,
&quot;Cost functions for pairwise data clustering,&quot; &lt;a
href=&quot;http://arxiv.org/abs/cond-mat/0103414&quot;&gt;cond-mat/0103414&lt;/a&gt;
	&lt;li&gt;Hendrik Blockeel, Luc De Raedt and Jan Ramon, &quot;Top-down induction
of clustering trees,&quot;
&lt;a href=&quot;http://arxiv.org/abs/cs.LG/0011032&quot;&gt;cs.LG/0011032&lt;/a&gt;
	&lt;li&gt;Joachim M. Buhmann, &quot;Information theoretic model validation for clustering&quot;, &lt;a href=&quot;http://arxiv.org/abs/1006.0375&quot;&gt;arxiv:1006.0375&lt;/a&gt;
	&lt;li&gt;Gunnar Carlsson, Facundo M&amp;eacute;moli, &quot;Characterization,
Stability and Convergence of Hierarchical Clustering
Methods&quot;, &lt;a href=&quot;http://jmlr.csail.mit.edu/papers/v11/carlsson10a.html&quot;&gt;&lt;cite&gt;Journal
of Machine Learning Research&lt;/cite&gt; &lt;strong&gt;11&lt;/strong&gt; (2010): 1425--1470&lt;/a&gt;
	&lt;li&gt;Ricardo Fraiman, Badih Ghattas, Marcela Svarc, &quot;Clustering using Unsupervised Binary Trees: CUBT&quot;, &lt;a href=&quot;http://arxiv.org/abs/1011.2624&quot;&gt;arxiv:1011.2624&lt;/a&gt;
	&lt;li&gt;Clara Granell, Sergio Gomez, Alex Arenas, &quot;Mesoscopic analysis of networks: applications to exploratory analysis and data clustering&quot;, &lt;a href=&quot;http://arxiv.org/abs/1101.1811&quot;&gt;arxiv:1101.1811&lt;/a&gt;
	&lt;li&gt;Madjid Khalilian, Norwati Mustapha, &quot;Data Stream Clustering: Challenges and Issues&quot;, &lt;a href=&quot;http://arxiv.org/abs/1006.5261&quot;&gt;arxiv:1006.5261&lt;/a&gt;
	&lt;li&gt;Daniel Korenblum and David Shalloway, &quot;Macrostate Data Clustering&quot;,
&lt;a href=&quot;http://dx.doi.org/10.1103/PhysRevE.67.056704&quot;&gt;&lt;cite&gt;Physical Review E&lt;/cite&gt; &lt;strong&gt;67&lt;/strong&gt; (2003): 056704&lt;/a&gt;
[This sounds a lot like spectral clustering and diffusion maps]
	&lt;li&gt;Marta Luksza, Michael Lassig and Johannes Berg, &quot;Significance Analysis and Statistical Mechanics: An Application to Clustering&quot;, &lt;a href=&quot;http://dx.doi.org/10.1103/PhysRevLett.105.220601&quot;&gt;&lt;cite&gt;Physical Review Letters&lt;/cite&gt; &lt;strong&gt;105&lt;/strong&gt; (2010): 220601&lt;/a&gt;
	&lt;li&gt;Marina
Meila, &lt;a href=&quot;http://dx.doi.org/10.1007/s10994-011-5267-2&quot;&gt;&lt;cite&gt;Machine
Learning&lt;/cite&gt; &lt;strong&gt;86&lt;/strong&gt; (2012): 369--389&lt;/a&gt;
	&lt;li&gt;Vladimir Nikulin and Geoffrey J. McLachlan, &quot;Strong Consistency of
Prototype Based Clustering in Probabilistic
Space&quot;, &lt;a href=&quot;http://arxiv.org/abs/1004.3101&quot;&gt;arxiv:1004.3101&lt;/a&gt; [The
regularization they're using isn't at all obvious on first scan of the paper]
	&lt;li&gt;Christoph Pamminger and Sylvia Fr&amp;uuml;hwirth-Schnatter, &quot;Model-based Clustering of Categorical Time
Series&quot;, &lt;a href=&quot;http://dx.doi.org/10.1214/10-BA606&quot;&gt;&lt;cite&gt;Bayesian Analysis&lt;/cite&gt; &lt;strong&gt;5&lt;/strong&gt; (2010): 345--368&lt;/a&gt;
	&lt;li&gt;Alessandro Rinaldo, Aarti Singh, Rebecca Nugent, Larry Wasserman, &quot;Stability of Density-Based Clustering&quot;, &lt;a href=&quot;http://arxiv.org/abs/1011.2771&quot;&gt;arxiv:1011.2771&lt;/a&gt;
	&lt;li&gt;Alessandro Rinaldo and Larry Wasserman, &quot;Generalized Density
Clustering&quot;, &lt;cite&gt;Annals of Statistics&lt;/cite&gt; &lt;strong&gt;38&lt;/strong&gt;
(2010): 2678--2722, &lt;a href=&quot;http://arxiv.org/abs/0907.3454&quot;&gt;arxiv:0907.3454&lt;/a&gt;
	&lt;li&gt;Ohad Shamir and Naftali Tishby, &quot;Stability and model selection in k-means clustering&quot;, &lt;a href=&quot;http://dx.doi.org/10.1007/s10994-010-5177-8&quot;&gt;&lt;cite&gt;Machine Learning&lt;/cite&gt;
&lt;strong&gt;80&lt;/strong&gt; (2010): 213--243&lt;/a&gt;
	&lt;li&gt;Qing Song, &quot;A Robust Information Clustering Algorithm&quot;,
&lt;a href=&quot;http://neco.mitpress.org/cgi/content/abstract/17/12/2672&quot;&gt;&lt;cite&gt;Neural
Computation&lt;/cite&gt; &lt;strong&gt;17&lt;/strong&gt; (2005): 2672--2698&lt;/a&gt; [&quot;We focus on the
scenario of robust information clustering (RIC) based on the minimax
optimization of mutual information (MI). The minimization of MI leads to the
standard mass-constrained deterministic annealing clustering, which is an
empirical risk-minimization algorithm. The maximization of MI works out an
upper bound of the empirical risk via the identification of outliers (noisy
data points). Furthermore, we estimate the real risk VC-bound and determine an
optimal cluster number of the RIC based on the structural risk-minimization
principle. One of the main advantages of the minimax optimization of MI is that
it is a nonparametric approach, which identifies the outliers through the
robust density estimate and forms a simple data clustering algorithm based on
the square error of the Euclidean distance.&quot;]
	&lt;li&gt;Susanne Still and William Bialek, &quot;How many clusters? An
information theoretic perspective,&quot; &lt;a
href=&quot;http://arxiv.org/abs/physics/0303011&quot;&gt;physics/0303011&lt;/a&gt;
= &lt;a
href=&quot;http://neco.mitpress.org/cgi/content/abstract/16/12/2483&quot;&gt;&lt;cite&gt;Neural
Computation&lt;/cite&gt; &lt;strong&gt;16&lt;/strong&gt; (2004): 2483--2506&lt;/a&gt;
	&lt;li&gt;Wei Sun, Junhui Wang, and Yixin Fang, &quot;Regularized k-means
clustering of high-dimensional data and its asymptotic
consistency&quot;, &lt;a href=&quot;http://projecteuclid.org/euclid.ejs/1328280901&quot;&gt;&lt;cite&gt;Electronic
Journal of Statistics&lt;/cite&gt; &lt;strong&gt;6&lt;/strong&gt; (2012): 148--167&lt;/a&gt;
	&lt;li&gt;Nguyen Xuan Vinh, Julien Epps, James Bailey, &quot;Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance&quot;, &lt;a href=&quot;http://jmlr.csail.mit.edu/papers/v11/vinh10a.html&quot;&gt;&lt;cite&gt;Journal of Machine Learning Research&lt;/cite&gt; &lt;strong&gt;11&lt;/strong&gt; (2010): 2837--2854&lt;/a&gt;
	&lt;li&gt;Ulrike von Luxburg, &quot;A Tutorial on Spectral Clustering&quot;, &lt;a href=&quot;http://arxiv.org/abs/0711.0189&quot;&gt;arxiv:0711.0189&lt;/a&gt;
	&lt;li&gt;Junhui Wang, &quot;Consistent selection of the number of clusters via crossvalidation&quot;, &lt;a href=&quot;http://dx.doi.org/10.1093/biomet/asq061&quot;&gt;&lt;cite&gt;Biometrika&lt;/cite&gt; &lt;strong&gt;97&lt;/strong&gt; (2010): 893--904&lt;/a&gt;
	&lt;/ul&gt;
</description>
  </item>
  </channel>
</rss>
