on 16 January 2014
In a nutshell: If you are looking for a simple (but not simplistic) introduction to nearly all of the underlying data science fundamentals then look no further, because this is the book for you!
I work primarily as a software developer, and like to consider that I have good general knowledge and experience of what data science ('data analytics', 'big data' etc.) is through College/Uni education and also the modern press and blog posts etc. However, I often struggled to fully understand, and perhaps more importantly knit together and apply, the core fundamentals of the topic. This book has provided exactly the explanations and 'glue' that I required, in that it delivers a very well structured (and paced) introduction and overview of data science, and also how to think in a 'data-analytics' manner.
If you preview the book with the 'look inside' feature then what you see in the table of contents is exactly what you get. Every chapter delivers upon its title (and promised 'fundamental concepts'), and frequently builds superbly upon topics introduced in early chapters. You'll move seamlessly from understanding how to frame data science questions, to learning about correlation and segmentation, to model fitting and overfitting, and on to similarity and clustering. With a brief pause to discuss exactly 'what is a good model' you'll then be thrust back into learning about visualising model performance, evidence and probabilities and then how to explore mining text.
The concluding chapters draw upon and summarise how to practically choose and apply the techniques you've learnt, and provide great discussion on how to solve business problems through 'analytical engineering'. There is also some bonus discussion on other tools and techniques that build upon earlier concepts which you might find useful, data science and business strategy, and some general thinking points around topics such as the need for human intervention in data analysis and privacy and ethics.
The book is superbly written and reads very easily, which for the potentially dry topic of data science is worthy of praise alone. The majority of chapters took me each approximately an hour to read, and then another couple of hours to re-read and ponder upon (and sometimes looking at other provided references) to fully understand some of the more complex topics and how everything related together. Each chapter also provided plenty of pointers and experimentation ideas if I wanted to go away and practically explore the topic further (say, with the Mahout framework, or R, or scikit-learn/Pandas etc.). The book could probably be read by dipping in and out of chapters, but I think you'll get a whole lot more from a cover-to-cover reading.
In summary, this is a superb book for those looking for a solid and comprehensive introduction to data science and data analytics for business, and I'm sure that even the more experienced practitioners of the art will find something useful here. The book introduces topics in a perfect order, superbly builds your knowledge chapter after chapter, and constantly relates and reinforces the various techniques and tools you're learning as it progresses. I wish more text/learning books were written this well!
on 25 August 2013
Foster Provost and Tom Fawcett have set out to write the go-to reference on Big Data. The result is 'Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking', published by O'Reilly Media.
They have produced an authoritative book that is both a pathfinder and a lighthouse. It is a long, clearly-written book that shows what can be done using Big Data, where to go and what techniques to use to get it done, and what to watch out for.
Thank you for writing this book. The authors and their many references are already established and respected. The book brings the issues and their business applications together in one essential place. Already in just 1 month since release (25th July 2013) the eBook has gathered praise quotes from a dozen industry names. I am honoured to receive a complimentary review copy.
So to add to the recommendations, I pitch my review slightly differently: Who in business should buy this book? What does this book add to what we are already doing in business with Data and Data Mining?
On first reading, if you work in analysis, IT, Business Intelligence, Management Reporting, Marketing or SEO, I guarantee your reaction at some point will be 'I do that too'.
For me the 'Aha!' realisation came a few pages into chapter 2. The authors discuss database searches for the most profitable items in a business. All businesses do that every day! But not always in the way the academics think.
The book surprised me by covering a broader range of topics than I had previously considered to be Data Science. Here are some great success stories to illustrate what data science is. Buy the book to see how these things really work and how the leading companies are applying themselves to these challenges. These studies border on the commercially sensitive.
- How a supermarket can use their sales analysis to predict when people are expecting a baby, and so gain an advantage by making offers before their competitors.
- How advertisers use Facebook Likes to profile and segment their audience
- How Netflix make their movie recommendations
- How to compare web pages for plagiarism
- How to tell how far away a customer is from their mobile app
Chapter 10 talks about text analysis. In contrast to most of the book, I would say here that small and medium sized businesses are ahead of Google and the academics. While the search engines refine their algorithms to extract news and meaning from bare text, there is a whole industry sector manipulating the source data to fool the algorithms and keep one step ahead: it is called Search Engine Optimisation.
If you are just starting out in using Big Data for your business decisions, you need to know the importance of Maths. In particular there are 2 challenges in the mathematics that underpin Data Science that I should warn you about even if you do not read the book:
* One is causation and correlation. When you find the beer-buying customers are also the nappy-buying customers, that is just the first step towards some very careful thinking before you draw any conclusions about which is cause and which is effect and how you might adjust your marketing or product mix to assist your customers accordingly.
* The other is what is now called 'Overfitting'. Gaze hard enough and you will find trends in data just like you can find shapes in clouds or patterns on the back of your eyelids. If you search too hard through too much data, you invalidate correlation coefficients and confidence calculations. Or to put it another way, every cloud looks like something.
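To make the second warning concrete, here is a minimal Python sketch. It is my own illustration, not code from the book. It assumes a small data set of 20 rows and a growing pile of candidate columns, all of them pure random noise, and reports the strongest correlation with the target that a search turns up. The function names `pearson` and `best_spurious_correlation` are hypothetical.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def best_spurious_correlation(n_rows, n_features, seed=0):
    """Strongest |correlation| between a random target and random features.

    Every column is independent noise, so any correlation found is spurious.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_rows)]
    best = 0.0
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n_rows)]
        best = max(best, abs(pearson(feature, target)))
    return best


if __name__ == "__main__":
    # The more columns we search, the better the "best" pattern looks.
    for n_features in (1, 10, 100, 1000):
        strongest = best_spurious_correlation(20, n_features)
        print(n_features, round(strongest, 2))
```

Nothing in the data is related to anything else, yet the best match keeps improving as more columns are searched. That is the cloud-gazing effect in numbers.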
A great book. For everyone who can still manage their high-school level maths, I recommend you buy this book. For everyone else, I recommend you be aware of the book and the issues within it and get it on the corporate bookshelf. For myself I look forward to checking back regularly for future editions as the science develops. Five stars.
on 4 January 2014
This is an excellent introduction to Data Science for the person who wants to gain a good understanding of the subject and what it can do for business. The authors' language is straightforward and they have attempted to simplify statistical nomenclature to avoid losing their less statistically qualified readers.
Provost and Fawcett have put together a very accessible guide that explains what Data Science is and what it is not. Their work is realistic and practical. The book presents the theory behind the subject and includes practical applications that enhance the reader's understanding. Provost and Fawcett are excellent at sweeping away myths on the subject, myths that some commercial promoters of the subject may like to maintain. The power of Data Science is amazing enough without reliance on myth. The authors clearly state that Data Science is not the panacea for all ills, and that it cannot succeed without the understanding of people.
Not only does this book deal with the theoretical and technical concepts of the subject, but it discusses how the discipline can work within an organisation and with such issues as data privacy.
It is seldom one finds a book that is so clear and comprehensive.
As Foster Provost and Tom Fawcett explain in the Preface, they examine concepts that fall within one of three types:
"1. Concepts about how data science fits into the organization and the competitive landscape, including ways to attract, structure, and nurture data science teams; ways for thinking about how data science leads to competitive advantage; and tactical concepts for doing well with data science projects.
2. General ways of thinking data-analytically. These help in identifying appropriate data and consider appropriate methods. The concepts include the [begin italics] data mining process [end italics] as well as the collection of different [begin italics] high-level data mining tasks. [end italics]
3. General concepts for actually extracting knowledge from data, which undergird the vast array of data science tasks and their algorithms."
There you have the nature and extent of the WHAT on which the information, insights, and counsel focus. Provost and Fawcett devote most of their attention to explaining HOW to apply these concepts to achieve high-impact data mining driven by data-analytic thinking. I share their belief "that explaining data science around such fundamental concepts not only aids the reader, it also facilitates communication between and among business stakeholders and data scientists. It provides a shared vocabulary and enables both parties [data scientists and non-data scientists such as I] to understand each other better. The shared concepts lead to deeper discussions that may uncover critical issues otherwise missed."
These are among the dozens of business subjects and issues of special interest and value to me, also listed to indicate the scope of Provost and Fawcett's coverage.
o From Big Data 1.0 to Big Data 2.0 (Pages 9-13)
o From Business Problems to Data Mining Tasks (19-23)
o The Data Mining Process (26-34)
o Other Analytics Techniques and Technologies (Pages 35-41 and 187-208)
o Selecting Informative Attributes (49-56)
o Supervised Segmentation with Tree-Structured Models (62-67)
o Class Probability Estimation and Logistic "Regression" (97-100)
o Overfitting (113-119)
Note: This is a tendency to tailor models to the training data.
o Correlation of Similarity and Distance (142-144)
o Some Important Technical Details Relating to Similarities and Neighbors (157-161)
o Stepping Back: Solving a Business Problem Versus Data Exploration (183-185)
o A Key Analytical Framework: Expected Value (194-204)
o A Model of Evidence "Lift" (244-246)
o Decision Analytic Thinking II: Toward Analytic Engineering (279-289)
o Co-occurrences and Associations: Finding Items That Go Together (292-298)
o Bias, Variance, and Ensemble Methods (308-311)
o Sustaining Competitive Advantage with Data Science (318-323)
As I worked my way through the book a second time, in preparation to compose this review, I was again reminded of comments by Eric Schmidt, executive chairman of Google: "From the dawn of civilization until 2003, mankind generated five exabytes of data. Now we produce five exabytes every two days...and the pace is accelerating." Correspondingly, the challenges that this process of data accumulation creates will become even greater. Provost and Fawcett wrote this book for those who must manage this process but also to assist the efforts of instructors who are now preparing them to do so.
on 13 April 2014
I very much enjoyed reading this book. It is exceptionally well written. It gives the clearest and most understandable description of data science that I've read.
The book progresses very well, without missing a step, from discussing data-analytic thinking to principles, then techniques, then evaluation of the results of data science models. There are very good explanations of the terms and concepts such as classification, scoring, regression, nearest neighbours and clustering. It also explains the maths behind the techniques in an intuitive but robust way.
It explains the key data science techniques very well and places them into context. For example, it describes the terms of supervised methods such as predictors, feature attributes and outcomes and then describes supervised techniques such as decision trees and logistic regression.
It links the theory to the practical. For example, it explains what overfitting is, why it can occur, and practical techniques to avoid it, such as using a holdout dataset and fitting graphs.
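As a rough illustration of the holdout idea, here is a small Python sketch of my own, not taken from the book. It assumes synthetic data of the form y = x + noise and uses a simple k-nearest-neighbour average as the model, where a smaller k means a more complex model. The function names are hypothetical. Each row it produces is one point on a fitting graph: model complexity against error on the training data and on the holdout data.

```python
import random


def knn_predict(train, x, k):
    """Predict y at x as the mean y of the k training points nearest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / k


def mean_squared_error(train, data, k):
    """Mean squared error of the k-NN model (built on train) over data."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in data) / len(data)


def fitting_graph(n_points=200, ks=(1, 5, 25), seed=0):
    """Return (k, training error, holdout error) for each k in ks."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_points):
        x = rng.uniform(0, 10)
        data.append((x, x + rng.gauss(0, 2)))  # assumed relationship: y = x + noise
    split = n_points // 2
    train, holdout = data[:split], data[split:]  # holdout is never used for fitting
    return [
        (k, mean_squared_error(train, train, k), mean_squared_error(train, holdout, k))
        for k in ks
    ]


if __name__ == "__main__":
    for k, train_error, holdout_error in fitting_graph():
        print(k, round(train_error, 2), round(holdout_error, 2))
```

With k = 1 the model simply memorises the training points, so its training error is zero while its holdout error is not. That gap between the two columns is what a fitting graph makes visible.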
What I liked especially is that the book is example-driven and these examples are very well designed, described and illustrated. The initial examples for each section are designed to be as simple as possible and they are described in detail with lots of diagrams and visualisations. Other more realistic examples follow and these are always interesting.
on 17 February 2014
The book is very well written and explains the data mining tasks in good detail. It also shows you how to approach your data mining problem from both its business and technical sides. The missing star in my rating is just because the Kindle version has some glitches.
on 1 February 2016
Only 5 chapters in and I have already learnt so many new concepts that have revolutionised how I work with data. A clear and well-written book that will be in my library for a long time to come.
on 7 June 2015
Brilliant book. Strongly recommended.
on 27 January 2015