Data Mining
01 Jan 2010 10:48
I've taught a course on this, so I ought to be able to describe it, oughtn't I? Data mining, more stuffily "knowledge discovery in databases", is the art of finding and extracting useful patterns in very large collections of data. It's not quite the same as machine learning, because, while it certainly uses ML techniques, the aim is to directly guide action (praxis!), rather than to develop a technology and theory of induction. In some ways, in fact, it's closer to what statistics calls "exploratory data analysis", though with certain advantages and limitations that come from having really big data to explore.
See also: Clinical and Actuarial Compared; Statistics for Structured Data
- Recommended, big picture:
- Leo Breiman, "Statistical Modeling: The Two Cultures", Statistical Science 16 (2001): 199--231 [very much including the discussion by others and the reply by Breiman]
- David Hand, Heikki Mannila and Padhraic Smyth, Principles of Data Mining [The textbook I teach from; also a book I learned a lot from. Comments]
- Bernard E. Harcourt, Against Prediction: Profiling, Policing, and Punishing in an Actuarial Age [Blurb, my review. Precis as a 43 pp. PDF working paper]
- Sholom M. Weiss and Nitin Indurkhya, Predictive Data Mining: A Practical Guide [Pedestrian, but it is practical, and adapted to the meanest, i.e. the managerial, understanding]
- Recommended, close-ups:
- Sébastien Bubeck, Ulrike von Luxburg, "Nearest Neighbor Clustering: A Baseline Method for Consistent Clustering with Arbitrary Objective Functions", Journal of Machine Learning Research 10 (2009): 657--698
- Aleks Jakulin and Ivan Bratko, "Quantifying and Visualizing Attribute Interactions", cs.AI/0308002
- Jon Kleinberg, Christos Papadimitriou and Prabhakar Raghavan, "A Microeconomic View of Data Mining", Data Mining and Knowledge Discovery 2 (1998) [PDF]
- Kling, Scherson and Allen, "Parallel Computing and Information Capitalism," in Metropolis and Rota (eds.), A New Era in Computation (1992) [A batch of UC Irvine comp. sci. professors who write like sociologists. "`Information capitalism' refers to forms of organization in which data-intensive techniques and computerization are key strategic resources for corporate production."]
- Jacob Kogan, Introduction to Clustering Large and High-Dimensional Data [Comments]
- Erik Larson, The Naked Consumer: How Our Private Lives Become Public Commodities
- Recommended, close-ups on text-mining and the like:
- David M. Blei, A. Y. Ng and Michael I. Jordan, "Latent Dirichlet allocation", Journal of Machine Learning Research 3 (2003): 993--1022
- David M. Blei and John D. Lafferty, "Correlated Topic Models", NIPS 2005
- Chaitanya Chemudugunta, Padhraic Smyth, Mark Steyvers, "Text Modeling using Unsupervised Topic Models and Concept Hierarchies", arxiv:0808.0973
- Thomas Hofmann, "Unsupervised Learning by Probabilistic Latent Semantic Analysis", Machine Learning 42 (2001): 177--196
- T. K. Landauer and S. T. Dumais, "A Solution to Plato's Problem: The Latent Semantic Analysis Theory of the Acquisition, Induction and Representation of Knowledge", Psychological Review 104 (1997): 211--240
- Modesty forbids me to recommend:
- My lecture notes for my data mining class [However, many of them are based on lecture notes originally written by Tom Minka, and modesty does not forbid me from recommending his work.]
- To read:
- Ian Ayres, Super Crunchers: Why Thinking-by-Numbers Is the New Way to Be Smart [Despite the painful title, Ayres has done cool applied work in social statistics]
- David L. Banks and Yasmin H. Said, "Data Mining in Electronic Commerce", math.ST/0609204 = Statistical Science 21 (2006): 234--246
- Burnham, Rise of the Computer State
- Bertrand Clarke, Ernest Fokoue and Hao Helen Zhang, Principles and Theory for Data Mining and Machine Learning [blurb]
- Pavel Dmitriev and Carl Lagoze, "Mining Generalized Graph Patterns based on User Examples", cs.DS/0609153
- Usama Fayyad, Georges G. Grinstein and Andreas Wierse (eds.), Information Visualization in Data Mining and Knowledge Discovery
- Ronen Feldman and James Sanger, The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data [Blurb]
- Hillol Kargupta and Philip Chan (eds.), Advances in Distributed and Parallel Knowledge Discovery [Blurb]
- Hillol Kargupta, Anupam Joshi, Krishnamoorthy Sivakumar and Yelena Yesha, Data Mining: Next Generation Challenges and Future Directions [Blurb]
- Nicholas M. Kiefer and C. Erik Larson, "Specification and Informational Issues in Credit Scoring" [SSRN]
- Martin Klein and Michael L. Nelson, "Approximating Document Frequency with Term Count Values", arxiv:0807.3755 [Approximating inverse document frequency for the web (or other unsurveyable corpora) by term frequency]
- Daniel Korenblum and David Shalloway, "Macrostate Data Clustering", Physical Review E 67 (2003): 056704 [This sounds a lot like spectral clustering and diffusion maps]
- Colleen McCue, Data Mining and Predictive Analysis: Intelligence Gathering and Crime Analysis [To be shot after a fair trial]
- Michalski, Bratko and Kubat (eds.), Machine Learning and Data Mining: Methods and Applications
- Petra Kralj Novak, Nada Lavrac and Geoffrey I. Webb, "Supervised Descriptive Rule Discovery: A Unifying Survey of Contrast Set, Emerging Pattern and Subgroup Mining", Journal of Machine Learning Research 10 (2009): 377--403
- M. Pavlic and M. J. van der Laan, "Fitting of mixtures with unspecified number of components using cross validation distance estimate", Computational Statistics and Data Analysis 41 (2003): 413--428
- Naren Ramakrishnan and Chris Bailey-Kellogg, "Sampling Strategies for Mining in Data-Scarce Domains," cs.CE/0204047
- Jeffrey Solka, "Text Data Mining: Theory and Methods", Statistics Surveys 2 (2008): 94--112 = arxiv:0807.2569
- Daniel J. Solove, "Data Mining and the Security-Liberty Debate" [SSRN/990030]
- Andreas L. Symeonidis and Pericles A. Mitkas, Agent Intelligence through Data Mining [Blurb]
- Joseph Turow, Niche Envy: Marketing Discrimination in the Digital Age [Blurb]
- Johannes Wollbold, "Attribute Exploration of Discrete Temporal Transitions", q-bio/0701009
- Mohammed Javeed Zaki
- Scalable Data Mining for Rules [Ph.D. thesis, U. of Rochester, 1998; on-line through NCSTRL]
- "SPADE: An Efficient Algorithm for Mining Frequent Sequences," Machine Learning 42 (2001): 31--60
