Two Paths to Prediction
This review is from: Data Mining: Practical Machine Learning Tools and Techniques (Morgan Kaufmann Series in Data Management Systems) (Kindle Edition)
This is a good text on machine learning techniques from both the statistics and the machine learning perspectives. The authors note that these fields have developed in parallel, with many researchers and practitioners working in each but few familiar with the full range of techniques in both disciplines. Some procedures, such as tree induction and nearest-neighbor methods, have been developed independently in both fields. For the most part, however, statistics has focused on hypothesis testing, while machine learning has tried to optimize search through the space of possible hypotheses. This book presents techniques from both traditions.
The organizational structure of the book supports its use as either a comprehensive text or a modular reference. The first section's five chapters introduce the foundations of data mining. In addition to concepts and definitions, there are simple example data sets and accessible descriptions of how both raw data and final analyses are used in this field. A particularly well-written fifth chapter explains how to evaluate data mining models: the rationale for holdout samples, the use of cross-validation procedures, and how to avoid over-fitting models. Machine learning texts frequently lack depth on this topic, while statistics texts often fail to communicate the consequences of poorly fitted models. Bringing the two perspectives together serves this topic well.
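To make the evaluation idea concrete, here is a minimal Python sketch of k-fold cross-validation. It is my own illustration, not code from the book or from Weka: the function names, the toy one-feature data set, and the 1-nearest-neighbor learner are all assumptions chosen for brevity. It shows why accuracy measured on the training data can look perfect while held-out accuracy is lower.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def accuracy(model, predict, xs, ys, indices):
    """Fraction of the given instances that the model labels correctly."""
    return sum(predict(model, xs[i]) == ys[i] for i in indices) / len(indices)

def cross_validate(xs, ys, fit, predict, k=5, seed=0):
    """Hold out each fold once, train on the rest, and average the accuracies."""
    scores = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(accuracy(model, predict, xs, ys, held_out))
    return sum(scores) / k

# Hypothetical learner: 1-nearest-neighbor on a single numeric feature.
def fit_1nn(xs, ys):
    return list(zip(xs, ys))  # the "model" is simply the memorized training set

def predict_1nn(model, x):
    return min(model, key=lambda pair: abs(pair[0] - x))[1]

# Toy data (an assumption): "low" below 10, "high" otherwise, with two noisy labels.
xs = list(range(20))
ys = ["low" if x < 10 else "high" for x in xs]
ys[3], ys[15] = "high", "low"

# Accuracy on the training data itself: 1-NN memorizes every point, noise included.
resub_acc = accuracy(fit_1nn(xs, ys), predict_1nn, xs, ys, range(len(xs)))
# Cross-validated accuracy: every point is scored by a model that never saw it.
cv_acc = cross_validate(xs, ys, fit_1nn, predict_1nn, k=5)
print(resub_acc, cv_acc)
```

The memorizing learner scores perfectly on data it has already seen, but the noisy points are misclassified whenever they are held out, so the cross-validated figure is the more honest estimate.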
Chapters in the second section build on this foundation. Chapter 6 describes how to use ten different techniques to detect and describe patterns in large data sets. This section also describes how to prepare data for data mining, how to combine and transform variables to increase model accuracy, and how to improve prediction by combining different model types using bagging, boosting, and other aggregation techniques. A final chapter outlines directions of current and future research expanding our toolbox of techniques. The eight chapters of the third and final section are a detailed tutorial covering the Weka workbench of machine learning algorithms and data transformation tools.
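For readers unfamiliar with bagging, the following Python sketch gives the flavor of the idea: train several models on bootstrap resamples of the data and let them vote. It is an illustration of my own, not the book's or Weka's implementation. The one-feature threshold "stump" learner, the function names, and the toy data are all assumptions.

```python
import random
from collections import Counter

def fit_stump(xs, ys):
    """Find the threshold and pair of labels with the fewest training errors."""
    labels = sorted(set(ys))
    best = None
    for t in sorted(set(xs)):
        for left in labels:
            for right in labels:
                errors = sum((left if x <= t else right) != y
                             for x, y in zip(xs, ys))
                if best is None or errors < best[0]:
                    best = (errors, t, left, right)
    return best[1:]  # (threshold, label at or below it, label above it)

def predict_stump(model, x):
    t, left, right = model
    return left if x <= t else right

def bagging_fit(xs, ys, n_models=25, seed=0):
    """Train one stump per bootstrap sample (n draws with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        sample = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([xs[i] for i in sample],
                                [ys[i] for i in sample]))
    return models

def bagging_predict(models, x):
    """Majority vote across the ensemble."""
    votes = Counter(predict_stump(m, x) for m in models)
    return votes.most_common(1)[0][0]

# Toy data (an assumption): one numeric feature, two classes split at 10.
xs = list(range(20))
ys = ["low" if x < 10 else "high" for x in xs]

models = bagging_fit(xs, ys)
print(bagging_predict(models, 0), bagging_predict(models, 19))
```

Each stump sees a slightly different resample, so the individual thresholds vary; the vote averages out that variation. Boosting, by contrast, reweights the training instances between rounds and is not shown here.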
This book has several communication strengths. The scope is broad for an introductory text. The Further Readings collections at each chapter's end are reasonably brief and point to current and in-depth sources. The text itself contains numerous example analyses and follows the useful strategy of analyzing the same data with several techniques. Its review of algorithms and formulas focuses on explaining how they work rather than on deriving them from general principles. A key strength is the book's close integration with Weka. This ensures that readers can step through analysis procedures, experiment with variations from default paths, and compare the performance of different formulations of the same research problems.
I recommend the book for readers who are new to machine learning. It will take some time to learn the techniques and practice using them in Weka, but it will be time well spent. Don't skip the Weka section!