This book is aimed at helping you to get a job as a data scientist. The first part of that is trying to convince the reader that data science is something new, and that lots of people have been getting on the big data bandwagon but the author is the only person in the world who knows what data science is. Everyone else is just in the game to rebrand their tired old courses and methods, but Vincent is the man to see through their deceptions, save the reader and turn them into the real thing.
It took some doing, but by page 4 I already knew that the author is more into self-promotion than actually telling you anything useful. Perhaps page 66 makes the author's approach clearest, as he describes the weaknesses of various statistical methods and says that these weaknesses have been corrected in the last decade, while not giving a single reference to show why they were bad, or a single reference as to what has replaced them. Other than the unsupported comments that university-led research has not progressed, and that everyone is stuck using SAS because government says so, there is nothing much that anyone in data analysis will not already know. There have been advances in data science in finance and, equally importantly, in the security services, but these are classified. Another lesson from those "advances" is that they are as fallible as the methods he criticises. They failed to predict 9/11 or the current financial crisis. But without a single reference there is no way to assess any of his claims.
Maybe it is a Marmite book and some people will find it useful. For me, any author who takes such a personal and unsupported position without balancing any of the arguments is a bad scientist. This is one to avoid.
on 9 May 2014
Although I'd read the mixed reviews beforehand, I purchased it anyway with high hopes - after all, there aren't many other books out there that claim to focus on the career of a data scientist. However, I have to concur with some of the other reviewers that it's messy and not particularly enjoyable to read. To me it reads more like a website or a blog, with some rather controversial opinions, lots and lots of lists, and very few citations to other work in the main body of the text apart from links to Wikipedia(!). There is, however, a reading list in the final chapter, and some useful information sprinkled throughout, but arguably it's information that could already be obtained from the author's various websites. I'm going to order Max Shron's book, which seems to be more highly regarded, for a discussion of the softer skills of a data analyst/scientist.
Early on in this book we learn that the only people who have the right to call themselves 'data scientists' are those who work with more than 100 million rows of data. This should have served as a warning of things to come.
The author goes on to 'demonstrate' how data science should be considered as a new profession, with its own sky-high salary range. The writing often shows an astonishing level of arrogance, either on the page or between the lines. Any opportunities to demonstrate the author's superior knowledge are seized upon. He states that the book will be most useful for students, executives and entrepreneurs, but terms are not explained and no attempt is made to introduce concepts and technologies which might be unfamiliar: we are expected to climb up Dr Granville's 'learning cliff'. As a result, the only people who might be able to get through this and find any benefit are those who are already working on analysis of Facebook likes or Tweets.
The author would have been better talking the reader through the process, from building a big dataset, modelling, through the processing and storage challenges, what regression and correlation analyses do and why they are used, statistical teams and how the stats work. Instead, we get bits of useless information like "I have never worked from a business case" and "big data analysis is not to the faint of heart" and even, in the later chapters, the author's ideas for apparently unrelated projects like email encryption, improving Captcha systems, email marketing schemes (now we know who is to blame), etc. He's clearly a very intelligent guy but he is no teacher.
It's not a completely barren text. There are lists of free courses and other resources which certainly interested me (and from which I learned more than I did from this book). There are also snippets of apparently useful information. I learnt about using Excel to analyse the 'bumpiness' in data - although, typically from this author, I'm not sure what bumpiness is. His definition? "Bumpiness can be defined as the auto-correlation of lag 1". Enough said.
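For anyone left as puzzled as I was, here is a minimal sketch of what "auto-correlation of lag 1" usually means, assuming the standard textbook estimator rather than whatever Excel recipe the book has in mind. The function name `lag1_autocorrelation` and the two sample series are my own illustrations, not the author's. Each value is compared with the one that follows it: a result near +1 means neighbouring values move together, and a result near -1 means the series zig-zags.

```python
def lag1_autocorrelation(values):
    """Return the lag-1 autocorrelation of a sequence of numbers.

    Standard estimator (an assumption, not taken from the book):
    sum of products of consecutive deviations from the mean,
    divided by the sum of squared deviations.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(values) / n
    deviations = [v - mean for v in values]
    denominator = sum(d * d for d in deviations)
    if denominator == 0:
        raise ValueError("series is constant; autocorrelation is undefined")
    # Pair each deviation with the deviation of the next observation.
    numerator = sum(a * b for a, b in zip(deviations, deviations[1:]))
    return numerator / denominator


# Hypothetical sample series, purely for illustration.
zigzag = [1, -1, 1, -1, 1, -1]   # alternates every step
trend = [1, 2, 3, 4, 5]          # rises steadily

print(lag1_autocorrelation(zigzag))  # negative: neighbours move in opposite directions
print(lag1_autocorrelation(trend))   # positive: neighbours move together
```

Whether the book counts a high or a low value as "bumpy" is exactly the kind of thing it should have spelled out.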
The author of this book suffers from a common problem of academics: although he clearly knows his subject extremely well, he is not particularly good at communicating that knowledge, and the result is a poorly written, difficult to read mess which is crying out for the attention of a good editor. Additionally, most of the material presented here will require a good grounding in data analytics and statistics: although Mr Granville claims to offer an introduction into the topic for executives and Java programmers, even the introductory chapter is crammed full with bulleted lists of technical jargon which is only ever explained poorly, if at all. The premise of the book is that data science, and the analysis of "big data", is a rapidly evolving field, and many courses and guidebooks approach its problems using outdated tools and mindsets. Unfortunately, in making this point, the author has a tendency to come across as rather holier-than-thou. At times his sniping becomes petty, for example when he berates a competitor's book for including an example showing the reader how to build a Twitter "what the author calls a word cloud", he feels it necessary to point out that this "has nothing to do with cloud computing", at which point I felt like shouting "YES, AND CLOUD COMPUTING HAS NOTHING TO DO WITH A CUMULONIMBUS, WHAT IS YOUR POINT EXACTLY?"
All of which is a shame, as the book does contain plenty of valuable and seasoned wisdom, but it's so painful to get at that I would recommend that only advanced students of data analysis tackle this book. In the introduction the author states that "much of the text was initially published over the last three years on the Data Science Central website". To be frank, it shows: as a series of articles intended for a core technical audience, this stuff is fine, but as an introductory or bridge guide, it is not.
Sometimes Amazon Vine allows you to sample a new field of enterprise and thought, and get a good insight into its working, and see what you can learn from it and take back into your own sphere of thought and action. I ordered this book in this hope, but I was disappointed on this occasion.
This book is poorly written, and with much jargon. It had the elements of a highly technical in-group discussion, not a welcoming introduction. At the beginning and end of the book I am still not sure what a data scientist is or does, nor why people would want to employ one. I did have a feeling that this book was trying to make an intellectual territory grab, and making a start at defining something, but I wasn't clear about what he wanted to grab and why. There was no narrative to this work.
I couldn't see anything in it that would excite me about data science, nor what exactly distinguishes it from basic data analysis. I got a sense that it's a marriage of statistics and huge databases, but I wasn't sure how the data was gathered and checked, and then how relevant statistical analyses were chosen to draw valid inferences out from the data. Just because we have huge databases now, and computers to crunch the numbers, doesn't seem to me to be any reason to let go of the old basic questions around primary measurement accuracy, spread, and fair analysis of the patterns and statistics. But statisticians have asked these questions about their processes for many years. I didn't see what a data scientist added to these basic ideas and processes.
In the end this book seemed surprisingly anecdotal, and to lack a coherent narrative about what data science is, and what function it exists to meet. I didn't enjoy reading it, and didn't learn much from it.
I am really disappointed by this book, mostly because the entire premise is that 'Data Science' and jobs relating to data science only exist because of the rise of (so-called) 'Big Data' in recent years.
Data is not suddenly 'big data'. Sure, there's a lot more of it now that we're able to process and analyse, but the core skills that an analyst needs are no different if you're looking at a spreadsheet of a hundred rows or a database with terabytes of information stored within it. You still need to understand your data - how it looks, what changes look like, what changes you should be concerned about - no matter the size. This book, therefore, along with almost everything I have ever read about 'Big Data' (and I've read a lot!), is merely a thinly disguised promotional catalogue for specialist 'Big Data' software which will magically handle rows and rows of complex data for you.
It does not teach you the analytic basics - the key one of which is knowing when something significant is interesting, and equally knowing when something interesting is not significant, and deciding whether that matters. Without a grasp of analytics basics you can have no hope of developing your talents as an analyst, a data scientist, or any of the myriad new buzzword job titles doing the rounds.
Don't waste your time - there is so much freely available on the internet to get you started. If you find you have an affinity for analysis, then it's that natural talent that most employers are interested in. I've never had an interview as an analyst where they ask me to work a specialist big data programme, but I have had many interviews where I am given an Excel spreadsheet of data and asked to present some findings. Learn the craft, not the gimmicks.
When it says it targets those who haven't got a mathematical background, I'm not sure I agree: when a developer I know flicked through it, some of the statistics terms scared them off. Even if the full formulae aren't derived, the 'non-technical' bits still assume a level of statistical knowledge and are prone to using statistical terms, leaving the explanation relatively opaque.
While it berates many other books (repeatedly) for being statistical books dressed up as data science, it seems this is because the author has taken a view that 'Data Science' is a rather narrow niche that unsurprisingly closely matches his own background. On the repetition side, there are examples which are repeated in different chapters as well, and some of the theorems are explained multiple times, so the limited size of the book is limited further. Add in the lists of books, of sites to get data from, of salary ranges, and so on, and quite a few pages are taken up with information that will not directly improve your understanding of the subject, and which is liable to date badly.
Having taken some of the Coursera courses mentioned, I'd agree that they are of use, and indeed, they go into much more mathematical detail than this book does. The author here seems keen to promote his own patented method of 'hidden decision trees' without explaining why they are necessarily better than model-based methods, or, indeed, why they are supposed to scale that much better than other methods - the discussion of computational complexity falls rather short on detail of how the practical algorithms are inherently more amenable to clustered computation.
Additionally, the nature of the hidden decision tree sounds like it bears some comparison with specific types of neural network. The rejection of some methods and preference for model-free prediction seems based on the idea that other methods are best at linear separation, when neural networks, SVMs with kernel functions and so on can handle more complex boundaries. In fact the much-hyped 'patented' hidden decision tree model reminded me of the Recursive Deterministic Perceptron in its ability to handle completely arbitrary shapes and varying levels of complexity (with associated potential for over- or under-fitting), although how the multiple decision trees fit together is unclear, so it's hard to judge their relative merits.
There is doubtless some food for thought in the book, in terms of identifying key skills, the move away from relational databases, the different issues arising from very large datasets. However, it makes it sound like statistics and data analysis courses don't already cover this, when good ones do, and in greater depth than this book does.
If you want to know what areas you should learn about to be effective in the field, then this may be of interest. If you want some examples of areas where data science is being applied, or some suggestions of research areas that may be worthwhile pursuing for yourself, then this may be good, too. If you want to learn about a range of techniques, the relative merits of Big Data and NoSQL technologies (column databases, graph databases, key-value stores, etc), common statistical models and the issues they have with large datasets, or an overview on analysis using the most popular languages for numerical processing (such as R, Python, Matlab, etc), then this is not for you - as a result the book promises much but delivers more of a teaser or taster than the real thing.
The author is best at communicating his passion for 'Data Science' and his ideas of what techniques (i.e. his own home-grown ones) are most useful/scalable/reliable/'ready to provide limitless free energy', but less good at communicating how to apply the required techniques. For example, I've been told I need to explain the 'lift' from my analysis, but I only know from previous reading what that means and how to estimate it, not from here. Very disappointing, although not without purpose.
There may be some good stuff in here but it's well hidden. Full of lists, scenarios and surprisingly unstructured thoughts, I found this very hard going indeed. It's an odd contrast between sections of very precise calculation and very vague assertions - do this, and you'll triple your income - and I have to admit for once I ran out of steam just before the end.
on 23 September 2014
Is the collation and analysis of data a 'science'? Well, every discipline these days wants to be thought of as a science, so maybe it's as valid a one as anything else. I mean, if economics of all things can be accepted as a science, why can't data analysis? Whatever the worth of what is essentially a number crunching exercise and the analysis of data which could be as anomalous as it is subjectively relevant, this book doesn't really promote it in a very captivating way. In fact, after a while, you begin to realise that perhaps the only thing being promoted here is the author. It's also not a very fluid book to work through, as it's pretty structure-less, often more like a stream of consciousness than anything else, and suddenly descends at times without warning into the depths of involved stats and lists... lots of lists... one gets the impression that the author likes lists.
A saving grace for this book is that the author clearly loves - indeed he lives and breathes - data analysis. Although one is tempted to tell him to get out more, fair play to him. This book, though, is unfortunately a missed opportunity to make data science sexy for the uninitiated. Shame.
In common, it seems, with other Amazon reviewers, I found Vincent Granville's Developing Analytic Talent a rather uninspiring trip through the field of statistical analysis. Despite the blurb offering a positively glowing overview ('Learn what it takes to succeed in the most in-demand tech job. Harvard Business Review calls it the sexiest tech job of the 21st century.'), I didn't particularly feel excited by the author's rather plodding style or the stating of the obvious. To be fair, I just read it out of interest, not through a professional attachment, so perhaps students in the field might find the book a useful source of reference and pointers?