<?xml version="1.0"?>
<!-- name="generator" content="blosxom/2.0" -->
<!DOCTYPE rss PUBLIC "-//Netscape Communications//DTD RSS 0.91//EN" "http://my.netscape.com/publish/formats/rss-0.91.dtd">

<rss version="0.91">
  <channel>
    <title>Notebooks</title>
    <link>http://bactra.org/notebooks</link>
    <description>Cosma's Notebooks</description>
    <language>en</language>

  <item>
    <title>Data Mining</title>
    <link>http://bactra.org/notebooks/2010/01/01#data-mining</link>
    <description>
&lt;P&gt;I've taught a course on this, so I ought to be able to describe it, oughtn't
I?  Data mining, more stuffily &quot;knowledge discovery in databases&quot;, is the art
of finding and extracting useful patterns in very large collections of data.
It's not quite the same as &lt;a href=&quot;learning-inference-induction.html&quot;&gt;machine
learning&lt;/a&gt;, because, while it certainly uses ML techniques, the aim is to
directly guide action (&lt;em&gt;praxis&lt;/em&gt;!), rather than to develop a technology
and theory of induction.  In some ways, in fact, it's closer to
what &lt;a href=&quot;statistics.html&quot;&gt;statistics&lt;/a&gt; calls &quot;exploratory data
analysis&quot;, though with certain advantages and limitations that come from having
really big data to explore.

&lt;P&gt;See also:
	&lt;a href=&quot;clinical-vs-actuarial.html&quot;&gt;Clinical and Actuarial Compared&lt;/a&gt;;
	&lt;a href=&quot;structured-data.html&quot;&gt;Statistics for Structured Data&lt;/a&gt;

&lt;ul&gt;Recommended, big picture:
	&lt;li&gt;Leo Breiman, &quot;Statistical Modeling: The Two Cultures&quot;,
&lt;cite&gt;Statistical Science&lt;/cite&gt; &lt;strong&gt;16&lt;/strong&gt; (2001): 199--231 [very
much including the discussion by others and the reply by Breiman]
	&lt;li&gt;David Hand, Heikki Mannila and Padhraic Smyth, &lt;cite&gt;Principles of
Data Mining&lt;/cite&gt; [The textbook I teach from; also a book I learned a lot from.
&lt;a href=&quot;../weblog/algae-2010-01.html#data-mining&quot;&gt;Comments&lt;/a&gt;]
	&lt;li&gt;Bernard E. Harcourt, &lt;cite&gt;Against Prediction: Profiling, Policing,
and Punishing in an Actuarial Age&lt;/cite&gt;
[&lt;a
href=&quot;http://www.press.uchicago.edu/cgi-bin/hfs.cgi/00/198485.ctl&quot;&gt;Blurb&lt;/a&gt;,
my &lt;a href=&quot;../reviews/against-prediction/&quot;&gt;review&lt;/a&gt;.
Precis as &lt;a href=&quot;http://papers.ssrn.com/sol3/papers.cfm?abstract_id=756945&quot;&gt;a
43 pp. PDF working paper&lt;/a&gt;]
	&lt;li&gt;Sholom M. Weiss and Nitin Indurkhya, &lt;cite&gt;Predictive Data
Mining: A Practical Guide&lt;/cite&gt; [Pedestrian, but it &lt;em&gt;is&lt;/em&gt; practical, and
adapted to the meanest, i.e. the managerial, understanding]
	&lt;/ul&gt;

&lt;ul&gt;Recommended, close-ups:
	&lt;li&gt;S&amp;eacute;bastien Bubeck and Ulrike von Luxburg, &quot;Nearest Neighbor
Clustering: A Baseline Method for Consistent Clustering with Arbitrary
Objective
Functions&quot;, &lt;a
href=&quot;http://jmlr.csail.mit.edu/papers/v10/bubeck09a.html&quot;&gt;&lt;cite&gt;Journal of
Machine Learning Research&lt;/cite&gt; &lt;strong&gt;10&lt;/strong&gt; (2009): 657--698&lt;/a&gt;
	&lt;li&gt;Aleks Jakulin and Ivan Bratko, &quot;Quantifying and Visualizing
Attribute Interactions&quot;, &lt;a
href=&quot;http://arxiv.org/abs/cs.AI/0308002&quot;&gt;cs.AI/0308002&lt;/a&gt;
	&lt;li&gt;Jon Kleinberg, Christos Papadimitriou and Prabhakar Raghavan, &quot;A
Microeconomic View of Data Mining&quot;, &lt;cite&gt;Data Mining and Knowledge
Discovery&lt;/cite&gt; &lt;strong&gt;2&lt;/strong&gt; (1998)
[&lt;a href=&quot;http://www.cs.cornell.edu/home/kleinber/dmkd98-seg.pdf&quot;&gt;PDF&lt;/a&gt;]
	&lt;li&gt;Kling, Scherson and Allen, &quot;Parallel Computing and Information
Capitalism,&quot; in Metropolis and Rota (eds.), &lt;cite&gt;A New Era in
Computation&lt;/cite&gt; (1992) [A batch of UC Irvine comp. sci. professors who write
like &lt;a href=&quot;sociology.html&quot;&gt;sociologists&lt;/a&gt;.  &quot; `Information capitalism'
refers to forms of organization in which data-intensive techniques and
computerization are key strategic resources for corporate production.&quot;]
	&lt;li&gt;Jacob Kogan, &lt;cite&gt;Introduction to Clustering Large and
High-Dimensional
Data&lt;/cite&gt; [&lt;a href=&quot;../weblog/algae-2009-08.html#kogan&quot;&gt;Comments&lt;/a&gt;]
	&lt;li&gt;Erik Larson, &lt;cite&gt;The Naked Consumer: How Our Private Lives Become Public Commodities&lt;/cite&gt;
	&lt;/ul&gt;



&lt;ul&gt;Recommended, close-ups on text-mining and the like:
	&lt;li&gt;David M. Blei, A. Y. Ng and Michael I. Jordan, &quot;Latent Dirichlet
allocation&quot;, &lt;cite&gt;Journal of Machine Learning Research&lt;/cite&gt; &lt;strong&gt;3&lt;/strong&gt; (2003): 993--1022
	&lt;li&gt;David M. Blei and John D. Lafferty, &quot;Correlated Topic Models&quot;,
&lt;cite&gt;NIPS&lt;/cite&gt; 2005
	&lt;li&gt;Chaitanya Chemudugunta, Padhraic Smyth and Mark Steyvers, &quot;Text
Modeling using Unsupervised Topic Models and Concept
Hierarchies&quot;, &lt;a href=&quot;http://arxiv.org/abs/0808.0973&quot;&gt;arxiv:0808.0973&lt;/a&gt;
	&lt;li&gt;Thomas Hofmann, &quot;Unsupervised Learning by Probabilistic Latent
Semantic Analysis&quot;, &lt;cite&gt;Machine Learning&lt;/cite&gt; &lt;strong&gt;42&lt;/strong&gt; (2001):
177--196
	&lt;li&gt;T. K. Landauer and S. T. Dumais, &quot;A Solution to Plato's Problem:
The Latent Semantic Analysis Theory of the Acquisition, Induction and
Representation of Knowledge&quot;, &lt;cite&gt;Psychological
Review&lt;/cite&gt; &lt;strong&gt;104&lt;/strong&gt; (1997): 211--240
	&lt;/ul&gt;


&lt;ul&gt;Modesty forbids me to recommend:
	&lt;li&gt;My &lt;a href=&quot;http://www.stat.cmu.edu/~cshalizi/350&quot;&gt;lecture notes
for my data mining class&lt;/a&gt; [However, many of them are based on lecture notes
originally written
by &lt;a href=&quot;http://research.microsoft.com/en-us/um/people/minka/&quot;&gt;Tom
Minka&lt;/a&gt;, and modesty does not forbid me from recommending his work.]
	&lt;/ul&gt;


&lt;ul&gt;To read:
	&lt;li&gt;Ian Ayres, &lt;cite&gt;Super Crunchers: Why Thinking-by-Numbers Is the
New Way to Be Smart&lt;/cite&gt; [Despite the &lt;em&gt;painful&lt;/em&gt; title, Ayres has done
cool applied work in social statistics]
	&lt;li&gt;David L. Banks and Yasmin H. Said, &quot;Data Mining in Electronic
Commerce&quot;, &lt;a href=&quot;http://arxiv.org/abs/math.ST/0609204&quot;&gt;math.ST/0609204&lt;/a&gt;
= &lt;cite&gt;Statistical Science&lt;/cite&gt; &lt;strong&gt;21&lt;/strong&gt; (2006): 234--246
	&lt;li&gt;Burnham, &lt;cite&gt;Rise of the Computer State&lt;/cite&gt;
	&lt;li&gt;Bertrand Clarke, Ernest Fokoue and Hao Helen
Zhang, &lt;cite&gt;Principles and Theory for Data Mining and Machine Learning&lt;/cite&gt;
[&lt;a href=&quot;http://www.springer.com/book/978-0-387-98134-5&quot;&gt;blurb&lt;/a&gt;]
	&lt;li&gt;Pavel Dmitriev and Carl Lagoze, &quot;Mining Generalized Graph Patterns
based on User
Examples&quot;, &lt;a href=&quot;http://arxiv.org/abs/cs.DS/0609153&quot;&gt;cs.DS/0609153&lt;/a&gt;
	&lt;li&gt;Usama Fayyad, Georges G. Grinstein and Andreas Wierse (eds.),
&lt;cite&gt;Information Visualization in Data Mining and Knowledge Discovery&lt;/cite&gt;
	&lt;li&gt;Ronen Feldman and James Sanger, &lt;cite&gt;The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data&lt;/cite&gt; [&lt;a href=&quot;http://cambridge.org/9780521836579&quot;&gt;Blurb&lt;/a&gt;]
	&lt;li&gt;Hillol Kargupta and Philip Chan (eds.), &lt;cite&gt;Advances in Distributed and Parallel Knowledge Discovery&lt;/cite&gt; [&lt;a href=&quot;http://mitpress.mit.edu/0262611554&quot;&gt;Blurb&lt;/a&gt;]
	&lt;li&gt;Hillol Kargupta, Anupam Joshi, Krishnamoorthy Sivakumar and Yelena
Yesha, &lt;cite&gt;Data Mining: Next Generation Challenges and Future
Directions&lt;/cite&gt; [&lt;a href=&quot;http://mitpress.mit.edu/0262612038&quot;&gt;Blurb&lt;/a&gt;]
	&lt;li&gt;Nicholas M. Kiefer and C. Erik Larson, &quot;Specification and
Informational Issues in Credit Scoring&quot;
[&lt;a href=&quot;http://papers.ssrn.com/sol3/papers.cfm?abstract_id=956628&quot;&gt;SSRN&lt;/a&gt;]
	&lt;li&gt;Martin Klein and Michael L. Nelson, &quot;Approximating Document
Frequency with Term Count
Values&quot;, &lt;a href=&quot;http://arxiv.org/abs/0807.3755&quot;&gt;arxiv:0807.3755&lt;/a&gt;
[Approximating inverse document frequency for the web (or other unsurveyable
corpora) by term frequency]
	&lt;li&gt;Daniel Korenblum and David Shalloway, &quot;Macrostate Data Clustering&quot;,
&lt;a href=&quot;http://dx.doi.org/10.1103/PhysRevE.67.056704&quot;&gt;&lt;cite&gt;Physical Review E&lt;/cite&gt; &lt;strong&gt;67&lt;/strong&gt; (2003): 056704&lt;/a&gt;
[This sounds a lot like spectral clustering and diffusion maps]
	&lt;li&gt;Colleen McCue, &lt;cite&gt;Data Mining and Predictive Analysis:
Intelligence Gathering and Crime Analysis&lt;/cite&gt; [To be shot after a fair trial]
	&lt;li&gt;Michalski, Bratko and Kubat (eds.), &lt;cite&gt;Machine Learning
and Data Mining: Methods and Applications&lt;/cite&gt;
	&lt;li&gt;Petra Kralj Novak, Nada Lavrac and Geoffrey I. Webb,
&quot;Supervised Descriptive Rule Discovery: A Unifying Survey of Contrast
Set, Emerging Pattern and Subgroup Mining&quot;, &lt;a href=&quot;&quot;&gt;&lt;cite&gt;Journal
of Machine Learning Research&lt;/cite&gt; &lt;strong&gt;10&lt;/strong&gt; (2009): 377--403&lt;/a&gt;
	&lt;li&gt;M. Pavlic and M. J. van der Laan, &quot;Fitting of mixtures with
unspecified number of components using cross validation distance
estimate&quot;, &lt;cite&gt;Computational Statistics and Data
Analysis&lt;/cite&gt; &lt;strong&gt;41&lt;/strong&gt; (2003): 413--428
	&lt;li&gt;Naren Ramakrishnan and Chris Bailey-Kellogg, &quot;Sampling Strategies
for Mining in Data-Scarce Domains,&quot;
&lt;a href=&quot;http://arxiv.org/abs/cs.CE/0204047&quot;&gt;cs.CE/0204047&lt;/a&gt;
	&lt;li&gt;Jeffrey Solka, &quot;Text Data Mining: Theory and Methods&quot;,
&lt;cite&gt;Statistics Surveys&lt;/cite&gt; &lt;strong&gt;2&lt;/strong&gt; (2008): 94--112
= &lt;a href=&quot;http://arxiv.org/abs/0807.2569&quot;&gt;arxiv:0807.2569&lt;/a&gt;
	&lt;li&gt;Daniel J. Solove, &quot;Data Mining and the Security-Liberty Debate&quot;
[&lt;a href=&quot;http://ssrn.com/abstract=990030&quot;&gt;SSRN/990030&lt;/a&gt;]
	&lt;li&gt;Andreas L. Symeonidis and Pericles A. Mitkas, &lt;cite&gt;Agent
Intelligence through Data Mining&lt;/cite&gt; [&lt;a
href=&quot;http://www.springeronline.com/sgw/cda/frontpage/0,11855,5-153-22-46687939-detailsPage%253Dppmmedia%257CaboutThisBook%257CaboutThisBook,00.html&quot;&gt;Blurb&lt;/a&gt;]
	&lt;li&gt;Joseph Turow, &lt;cite&gt;Niche Envy: Marketing Discrimination in the
Digital Age&lt;/cite&gt; [&lt;a href=&quot;http://mitpress.mit.edu/0262201658&quot;&gt;Blurb&lt;/a&gt;]
	&lt;li&gt;Johannes Wollbold, &quot;Attribute Exploration of Discrete Temporal
Transitions&quot;, &lt;a href=&quot;http://arxiv.org/abs/q-bio/0701009&quot;&gt;q-bio/0701009&lt;/a&gt;
	&lt;li&gt;Mohammed Javeed Zaki
		&lt;ul&gt;
		&lt;li&gt;&lt;cite&gt;Scalable Data Mining for Rules&lt;/cite&gt;
[Ph.D. thesis, U. of Rochester, 1998; on-line through NCSTRL]
		&lt;li&gt;&quot;SPADE: An Efficient Algorithm for Mining Frequent
Sequences,&quot; &lt;cite&gt;Machine Learning&lt;/cite&gt; &lt;strong&gt;42&lt;/strong&gt; (2001): 31--60
		&lt;/ul&gt;
	&lt;/ul&gt;
</description>
  </item>
  </channel>
</rss>