I would like to compare this book with a very similar one from O'Reilly, "Data Science for Business" by Foster Provost and Tom Fawcett.
Both books are organized around the Cross-Industry Standard Process for Data Mining (CRISP-DM), which groups data mining / predictive analytics project tasks into the following six distinct stages:
* Business Understanding: Define the project (e.g., what are the business and data modeling objectives, how are they aligned, what would be the target and/or input variables, what criteria would be used for evaluating the models, how would the models be deployed, etc.)
* Data Understanding: Examine the data; identify potential problems with the data
* Data Preparation: Fix problems in the data (e.g., decide what to do with outliers and missing values, standardize data formats, etc.); create derived variables; transform and/or normalize data
* Modeling: Build predictive or descriptive models
* Evaluation: Assess models; report on the expected effects of models
* Deployment: Use the models; monitor model performance
In terms of coverage, this book provides guidance for all of the above-mentioned stages, with particular attention to the Data Understanding, Data Preparation, and Evaluation stages, while the Provost and Fawcett book concentrates on the Business Understanding, Data Understanding, Modeling, and Evaluation stages.
This book covers more modeling algorithms than the Provost and Fawcett book, but both books tend to keep discussions of the covered algorithms to qualitative descriptions, instead of the in-depth mathematical discussions found in more theoretically oriented books. The Provost and Fawcett book does, however, provide better and slightly deeper descriptions of the algorithms it covers.
Both books cover Decision Trees, for example. Whereas this book only mentions that Decision Trees belong to a class of recursive partitioning algorithms that use concepts such as "Information Gain" or "Gini Index" as possible partitioning criteria, the Provost and Fawcett book goes further by illustrating how "Information Gain" can actually be computed using simple formulas and a small but still interesting dataset. By learning how to hand-build the resulting Decision Tree from scratch using that illuminating but simple example, readers of that book are likely to come away with more memorable insights about Decision Trees than they can acquire from this book.
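To give a flavor of the kind of calculation involved, here is a minimal sketch in Python of how Information Gain can be computed for a candidate split. It is my own illustration, not an example taken from either book: the `customers` dataset, its column names, and the helper functions `entropy` and `information_gain` are all made up for this purpose. It assumes the usual definition, i.e. the entropy of the parent node minus the size-weighted average entropy of the child nodes.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Parent entropy minus the size-weighted entropy of the child groups
    obtained by splitting `rows` on `attribute`."""
    parent = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    children = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return parent - children


# Hypothetical toy data (invented for illustration): did a customer churn?
customers = [
    {"contract": "monthly", "region": "north", "churned": "yes"},
    {"contract": "monthly", "region": "north", "churned": "yes"},
    {"contract": "monthly", "region": "south", "churned": "yes"},
    {"contract": "monthly", "region": "south", "churned": "no"},
    {"contract": "annual", "region": "north", "churned": "no"},
    {"contract": "annual", "region": "north", "churned": "no"},
    {"contract": "annual", "region": "south", "churned": "yes"},
    {"contract": "annual", "region": "south", "churned": "no"},
]

for attribute in ("contract", "region"):
    gain = information_gain(customers, attribute, "churned")
    print(attribute, round(gain, 3))
# prints:
# contract 0.189
# region 0.0
```

In this toy data, splitting on `contract` yields purer groups (3 of 4 monthly customers churned, 3 of 4 annual customers did not), so it has the higher Information Gain and would be chosen as the first split; `region` leaves both groups evenly mixed and gains nothing. A recursive partitioning algorithm would repeat the same comparison within each resulting group.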
Compared to the Provost and Fawcett book, which I think is the better book pedagogically speaking because it does more things right for its readers, this book could also probably benefit from the following suggested improvements:
* Be more selective regarding what gets discussed in the Business Understanding chapter. Although it is true that the project plan drawn up at this stage should include deliberations about how models are going to be evaluated, a few statements indicating this should suffice. There is probably no need to make specific mentions of things such as Lift, Gain, ROC, Area Under the Curve, and Confusion Matrices, which don't really get defined and discussed until much, much later in the book. By mentioning them early, the author is being laudably meticulous but risks unnecessarily distracting his readers with details they might not yet be equipped to process.
* Reconsider the ordering of some chapters. An earlier modeling chapter discusses Kohonen Self-Organizing Maps (SOM), a type of neural network, before the more basic neural networks discussion takes place in a later chapter. A chapter on how to interpret Clusters discusses the use of Decision Trees for this purpose before Decision Trees themselves are discussed in a later chapter. Having now read the book, I think reordering those chapters so that the basics come first would make the book more readable for those who may not yet know much about neural networks or decision trees.
* Consider using color graphics. Some passages in the book read as though readers were looking at color graphics, but in print those graphics are actually in grayscale.
For readers interested in knowing which modeling algorithms are covered in this book, here is the list: Itemsets and Association Rules (or Market Basket Analysis), Principal Components Analysis, Clustering (K-Means and Kohonen SOM), Decision Trees, Logistic Regression, Neural Networks, K-Nearest Neighbor, Naive Bayes, Linear Regression, and Model Ensembles (Bagging, Boosting, etc.).
Finally, for readers curious about possible prerequisites for this book, I would say they include basic knowledge of statistics, including an understanding of concepts such as z-scores, correlations, and ANOVA (Analysis of Variance), and some SQL concepts such as GROUP BY and WHERE clauses. The more you know about these concepts, the easier a time you will have reading this book.