This is a good text on machine learning techniques from both the statistics and the machine learning perspectives. The authors note that these fields have developed in parallel with many researchers and practitioners working in each, but few familiar with the full range of techniques in both disciplines. Some procedures, such as tree induction and nearest neighbor clustering techniques, have been developed independently in both fields. However, for the most part statistics has focused on hypothesis testing and machine learning has tried to optimize search through the space of possible hypotheses. This book presents techniques from both traditions.
The organizational structure of the book supports its use as either a comprehensive text or a modular reference. The first section's five chapters introduce the foundations of data mining. In addition to concepts and definitions, there are simple example data sets and accessible descriptions of how both raw data and final analyses are used in this field. A particularly well-written fifth chapter discusses how to evaluate data mining models. It discusses the rationale for holdout samples, the use of cross-validation procedures, and how to avoid over-fitting models. Machine learning texts frequently lack depth on this topic while statistics texts often fail to communicate the consequences of poorly-fitted models. This integration of perspectives is a good one.
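As a rough illustration of the cross-validation idea discussed in that chapter, the sketch below splits instance indices into k folds so that every instance is held out for testing exactly once. It is a minimal sketch, not the book's or Weka's implementation; the function name `k_fold_indices` and its parameters are assumptions made for this example.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation.

    n    -- number of instances in the data set
    k    -- number of folds (hypothetical default of 10)
    seed -- fixed seed so the shuffle is repeatable
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [idx[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        # Training set = every fold except the one held out for testing.
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


if __name__ == "__main__":
    for train, test in k_fold_indices(10, k=5):
        print("train:", sorted(train), "test:", sorted(test))
```

A model would be fitted on each `train` list and scored on the matching `test` list, and the k scores averaged. Because no instance is ever scored by a model that saw it during fitting, the average is a fairer estimate than accuracy on the training data.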
Chapters in the second section build on this foundation. Chapter 6 describes how to use ten different techniques to detect and describe patterns in large data sets. This section also describes how to prepare data for data mining, how to combine and transform variables to increase model accuracy, and how to improve prediction by combining different model types using bagging, boosting, and other aggregation techniques. A final chapter outlines directions of current and future research expanding our toolbox of techniques. The eight chapters of the third and final section are a detailed tutorial covering the Weka workbench of machine learning algorithms and data transformation tools.
This book has several communication strengths. The scope is broad for an introductory text. The Further Readings collections at each chapter's end are reasonably brief and point to current and in-depth sources. The text itself contains numerous example analyses and follows the useful strategy of analyzing the same data with several techniques. Its review of algorithms and formulas focuses on explaining how they work rather than on deriving them from general principles. A key strength is the book's close integration with Weka. This ensures that readers can step through analysis procedures, experiment with variations from default paths, and compare the performance of different formulations of the same research problems.
I recommend the book for readers introducing themselves to machine learning. It will take some of your time to learn the techniques and practice using them in Weka, but it will be time well spent. Don't skip the Weka section!
on 29 June 2013
This book is written by the software architects and developers of the WEKA machine learning tool. It is large and comprehensive. The authors introduce the correct terminology for each concept, and they follow through on every concept they mention, explaining its purpose and nuances in detail. The first 400 pages are dedicated to data mining and machine learning theory in an academic context. The examples use simple tabular data that is easy to understand and never runs longer than one page, so as not to overcomplicate the learning experience (KISS: keep it simple, stupid). There is, however, some maths that is over my head (the polynomials on page 231 and the Cartesian products on page 266). The remaining 200 pages focus on using the multi-platform WEKA tool, which the authors personally developed and have made open source. The hands-on WEKA section also includes exercises that are suitable for a classroom environment. The authors are not chancers looking to make some quick money; they are clearly experts and are very, very good at what they do. Highly recommended.
on 17 March 2015
This is a very well-written and easy-to-read book on data mining, and it is closely tied to the Weka data mining software.
I use this book for a module at my university. It is also very reasonably priced (circa £21.00), which is affordable for a student book.
The chapters follow a logical order: data input, output representations, data mining algorithms for supervised learning (classification) and unsupervised learning (clustering), and examples for Weka. It covers all the major models, including linear models, statistical models, rule representations and decision trees, as well as basic algorithms such as 1R (OneR) and various clustering methods such as K-Means.
There are a few places where the methods are not clear or not followed through, and I had to spend a bit of time replicating the authors' numbers. Some of the example figures seem coincidental, which makes it harder to apply a technique to a different data set. Decision tree induction with entropy as the splitting criterion is one example: the book stops short of working through the deeper levels of the tree, so I had to replicate the numbers myself to understand the method.
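For anyone in the same position, the entropy calculation can be replicated in a few lines. This is a minimal sketch, not the book's code: the helper names (`entropy`, `information_gain`, `best_split`) are assumptions, and the rows are the widely used 14-instance weather toy data typed in from memory rather than quoted from the book.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy reduction obtained by splitting rows (dicts) on attribute."""
    base = entropy([r[target] for r in rows])
    subsets = {}
    for r in rows:
        subsets.setdefault(r[attribute], []).append(r[target])
    # Weighted average entropy of the subsets produced by the split.
    remainder = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    return base - remainder


def best_split(rows, attributes, target):
    """Attribute with the highest information gain on these rows."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))


# Assumed toy data: (outlook, humidity, windy, play).
FIELDS = ("outlook", "humidity", "windy", "play")
RAW = [
    ("sunny", "high", False, "no"), ("sunny", "high", True, "no"),
    ("overcast", "high", False, "yes"), ("rainy", "high", False, "yes"),
    ("rainy", "normal", False, "yes"), ("rainy", "normal", True, "no"),
    ("overcast", "normal", True, "yes"), ("sunny", "high", False, "no"),
    ("sunny", "normal", False, "yes"), ("rainy", "normal", False, "yes"),
    ("sunny", "normal", True, "yes"), ("overcast", "high", True, "yes"),
    ("overcast", "normal", False, "yes"), ("rainy", "high", True, "no"),
]
rows = [dict(zip(FIELDS, r)) for r in RAW]

if __name__ == "__main__":
    # Root level: compare every attribute on the full data set.
    for a in ("outlook", "humidity", "windy"):
        print("root ", a, round(information_gain(rows, a, "play"), 3))
    # Deeper level: repeat the same calculation inside one branch.
    sunny = [r for r in rows if r["outlook"] == "sunny"]
    for a in ("humidity", "windy"):
        print("sunny", a, round(information_gain(sunny, a, "play"), 3))
```

On this data the root gains come out at roughly 0.247 bits for outlook, 0.152 for humidity and 0.048 for windy, so outlook is chosen first. The deeper levels are the same calculation applied to the rows that reach each branch, with the attribute already used left out.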
The k-nearest neighbour chapter lacks depth. It does not seem to cover how Voronoi tessellation boundary diagrams are created and instead jumps straight into kD-trees and so on. It could have more introduction and more coverage of KNN.
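For context, the basic method that a kD-tree later speeds up fits in a few lines. The following is a minimal brute-force sketch, assuming Euclidean distance and a simple majority vote; the function name `knn_predict` and the toy points are hypothetical and not taken from the book.

```python
from collections import Counter
from math import dist


def knn_predict(train, query, k=3):
    """Classify query by majority vote among its k nearest training points.

    train -- list of (point, label) pairs, each point a tuple of numbers
    query -- tuple of numbers with the same length as the training points
    """
    # Brute force: sort every training point by Euclidean distance to query.
    nearest = sorted(train, key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical two-cluster toy data.
train = [
    ((1, 1), "a"), ((1, 2), "a"), ((2, 1), "a"),
    ((6, 6), "b"), ((6, 7), "b"), ((7, 6), "b"),
]

if __name__ == "__main__":
    print(knn_predict(train, (2, 2)))
    print(knn_predict(train, (6, 5)))
```

With k=1 the decision regions of this classifier are exactly the Voronoi cells of the training points. A kD-tree changes only how the nearest points are found, not which ones are returned.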
On association rule mining, the book is not really clear on support and confidence, at least compared with other books on the subject that state the calculations more clearly. It could have better and more varied examples that apply the methods to different data sets.
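For reference, the two calculations as they are usually stated can be sketched as below. This assumes the common fraction-based definitions, which may differ from the book's notation (some texts report support as a raw count rather than a fraction). The function names and the shopping baskets are hypothetical.

```python
def count(transactions, itemset):
    """Number of transactions that contain every item in itemset."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t)


def support(transactions, itemset):
    """Fraction of all transactions that contain the itemset."""
    return count(transactions, itemset) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent: count(A and C) / count(A)."""
    both = set(antecedent) | set(consequent)
    return count(transactions, both) / count(transactions, antecedent)


# Hypothetical shopping baskets, one set of items per transaction.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]

if __name__ == "__main__":
    print("support(bread, milk)      =", support(baskets, {"bread", "milk"}))
    print("confidence(bread -> milk) =", confidence(baskets, {"bread"}, {"milk"}))
```

Here bread and milk appear together in 3 of the 5 baskets, so the support is 3/5. Bread appears in 4 baskets, so the confidence of the rule "bread implies milk" is 3/4.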
A worthwhile book to have on your bookshelf for machine learning applied to data mining.