Data Analysis with Open Source Tools [Paperback]

Philipp K. Janert
4.0 out of 5 stars  See all reviews (2 customer reviews)
RRP: 25.99
Price: 18.19 & FREE Delivery in the UK. Details
You Save: 7.80 (30%)



Book Description

28 Nov 2010

Collecting data is relatively easy, but turning raw information into something useful requires that you know how to extract precisely what you need. With this insightful book, intermediate to experienced programmers interested in data analysis will learn techniques for working with data in a business environment. You'll learn how to look at data to discover what it contains, how to capture those ideas in conceptual models, and then feed your understanding back into the organization through business plans, metrics dashboards, and other applications.

Along the way, you'll experiment with concepts through hands-on workshops at the end of each chapter. Above all, you'll learn how to think about the results you want to achieve -- rather than rely on tools to think for you.

  • Use graphics to describe data with one, two, or dozens of variables
  • Develop conceptual models using back-of-the-envelope calculations, as well as scaling and probability arguments
  • Mine data with computationally intensive methods such as simulation and clustering
  • Make your conclusions understandable through reports, dashboards, and other metrics programs
  • Understand financial calculations, including the time-value of money
  • Use dimensionality reduction techniques or predictive analytics to conquer challenging data analysis situations
  • Become familiar with different open source programming environments for data analysis

"Finally, a concise reference for understanding how to conquer piles of data."--Austin King, Senior Web Developer, Mozilla

"An indispensable text for aspiring data scientists."--Michael E. Driscoll, CEO/Founder, Dataspora


Product details

  • Paperback: 540 pages
  • Publisher: O'Reilly Media; 1 edition (28 Nov 2010)
  • Language: English
  • ISBN-10: 0596802358
  • ISBN-13: 978-0596802356
  • Product Dimensions: 23.4 x 17.8 x 3.6 cm
  • Average Customer Review: 4.0 out of 5 stars  See all reviews (2 customer reviews)
  • Amazon Bestsellers Rank: 83,115 in Books (See Top 100 in Books)


Product Description

Book Description

A hands-on guide for programmers and data scientists

About the Author

After previous careers in physics and software development, Philipp K. Janert currently provides consulting services for data analysis, algorithm development, and mathematical modeling. He has worked for small start-ups and in large corporate environments, both in the U.S. and overseas. He prefers simple solutions that work to complicated ones that don't, and thinks that purpose is more important than process. Philipp is the author of "Gnuplot in Action - Understanding Data with Graphs" (Manning Publications), and has written for the O'Reilly Network, IBM developerWorks, and IEEE Software. He is named inventor on a handful of patents, and is an occasional contributor to CPAN. He holds a Ph.D. in theoretical physics from the University of Washington. Visit his company website at

Customer Reviews

Most Helpful Customer Reviews
10 of 10 people found the following review helpful
4.0 out of 5 stars Mixed opinion 17 Jan 2012
Format:Paperback|Verified Purchase
I have to agree with a lot of the US reviews. The book lacks focus.

The author wants to make a point about how important it is to understand the math behind real-world problems, but I was disappointed by his attempts to convey mathematical principles. Formulas may work for some people, but to me the book failed to point out why they are necessary, or how I can add value with them in the analyses I do. In this regard, the author overpays his dues to his academic background. I can see how the author studied physics and addresses people with similarly framed minds. But for these people, the book will be too trivial. The major disappointment for me was that the book failed to live up to the expectations raised by the subtitle "with Open Source Tools". I would have expected a range of cool tools to work with; instead it's GNU and R, and there is not a single end-to-end case of getting the data, figuring out the issue and then presenting it in a graph. Sometimes the style is too conversational, sometimes it is too strict and abstract. There are few moments when the two extremes touch. Other parts of the book, where the author shares his academic insights, felt awkward. The statement "You will never understand what mathematics is if you see it only as something you use to obtain certain results" will definitely find its way into my "Dictionary of Received Ideas".

Still, after all this criticism, I am giving it an average 4 stars. Why? There were some conversational parts that are helpful. This happens especially when the author highlights pitfalls and real-world applications of distribution laws and shows how to interpret graphical analysis (although he doesn't point out how it's done).
5 of 5 people found the following review helpful
4.0 out of 5 stars Very good 12 Dec 2011
By heltz
Format:Paperback|Verified Purchase
If you have no idea what a statistical analysis or data analysis is and you'd like to know how to do one, this is your book.
There are not many formulas, so if you want a book that gives you the mathematical basis, this is not for you.
This book presents the different methods and the tools you can use to apply them, with some specific examples.
It is an applied book that gets you going quickly.
The structure lets you pick the chapters you are interested in and come back later for other parts.

Definitely a good purchase.
Most Helpful Customer Reviews on (beta) 4.2 out of 5 stars  38 reviews
202 of 221 people found the following review helpful
2.0 out of 5 stars It falls short of initial expectations 7 Feb 2011
By J. Felipe Ortega Soto - Published on
This book is aimed at offering a practical, hands-on introduction to data analysis for pragmatic readers without strong scientific or statistical background. Some basic programming experience is required. The author provides many personal (and sometimes useful) comments about different tools and procedures in data analysis.

However, a careful reading reveals many problems, especially an obscure presentation of key concepts. In my opinion, the target audience for this book would be people without previous contact with data analysis. Hence the importance of presenting its core elements correctly. Otherwise, it's useless for them.

In particular:

- Few pages are actually dedicated to present open source tools supporting the different graphs and techniques included in the book. From the title, I expected a more complete tour through available open source tools for data analysis.

- No clues about how to obtain most of the graphs and results presented in the book. No related data sets are available for download, either. A book like this is useless if we cannot learn how to replicate all the examples.

- The formula of the variance for a sample is just wrong. One must divide by n-1 and not n; see "Applied Statistics and Probability for Engineers" (Montgomery and Runger 2006).

- The author presents one of the most obscure explanations for the median I've ever come across. Resorting to an RFC (RFC 2330) to explain such a simple concept is really awkward.

- In chapter 3 and Appendix B, natural logarithms (base e) are presented in the text, while graphs plot powers of 10. Definitely, not the right way to transmit correct concepts and methods.

- I concur with a previous review in that "Workshop" sections just present an ultra-short overview of some open source tools. A quick search in your favourite engine will display much more informative introductions (even quick start guides).

- Today, effective data analysis heavily depends on using the best possible implementation. While I might find it educational to learn some of these implementations, in a real situation it is much better to rely on precise implementations of algorithms already available (e.g. libraries in GNU R).
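
The n versus n-1 point in the list above is easy to check. What follows is a minimal Python sketch, not taken from the book: the data values are made up for illustration, and the standard library's `statistics` module is used only as a cross-check.

```python
import statistics

# Made-up sample, for illustration only.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

n = len(data)
mean = sum(data) / n
sum_sq_dev = sum((x - mean) ** 2 for x in data)

# Dividing by n gives the variance of the data treated as a full population.
population_variance = sum_sq_dev / n

# Dividing by n - 1 (Bessel's correction) gives the usual sample variance.
sample_variance = sum_sq_dev / (n - 1)

print(population_variance, statistics.pvariance(data))
print(sample_variance, statistics.variance(data))
```

The two estimates differ by a factor of n/(n-1), so the gap matters most for small samples and shrinks as n grows.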

All in all, I still recommend "R in a Nutshell" for a gentle introduction to data analysis with an open source tool (GNU R). It also has some inaccuracies and typos, but at least it's much more informative and clear. Besides, it does include an R package with all datasets and examples, ready to be installed and explored.
39 of 39 people found the following review helpful
3.0 out of 5 stars Full of insight, light on details 17 April 2011
By Code Monkey - Published on
This book covers such a wide range of topics that it necessarily skims over all of them, but it always hits all the major points that an introductory survey should. Each chapter has a straightforward tone, strikes the right balance between developing mathematical rigor and developing an intuitive understanding of data, and undeniably passes on the lessons of hard-earned, real-world experience. But a reader who is actually working on a real data problem will almost certainly come to the realization that the understanding gained is somewhat superficial - that it's going to take a lot more heavy reading (probably of books, papers, and software tools recommended in this book) to get any real work done!

The single biggest problem with this book is its misleading title. This book is not going to teach you how to use open source software to analyze data. There is only minimal information about how one would actually use the software tools being discussed. What you get is a brief commentary about what the author thinks each software package is good for. It's the same story as with the mathematical details: you will not find them here, but this book will give you an excellent idea of what to look for. So in the end it does leave you feeling just a little bit cheated, even though all the advice you got seems extremely well informed.

What this book does astonishingly well is communicate an attitude to data analysis that most textbooks (and nearly all the college courses I took) seem to miss. Nearly every chapter is a stream of stunningly insightful observations on how to approach data, without the mathematical detail that overwhelms most practicing programmers. I would recommend it to any reader who understands that truly useful insights are hard to come by, but detailed algorithms and formulae are easily found in the Internet Age. I wish the book were a few hundred pages shorter and that it corrected a few sloppy mistakes (like confusing revenue and profit), but I'm certainly glad I read it.
42 of 45 people found the following review helpful
3.0 out of 5 stars Good, not great. Prerequisites and chapter organization issues. 27 Jan 2011
By Peter Alfheim - Published on
The book is very good for the intermediate-to-advanced data analysts. Beginners beware: there are some important prerequisites that are not obvious before you buy it, and there are some organization problems.

First, the prerequisites. "I strongly recommend that you make it a habit to avoid all statistical language"..."Once we start talking about standard deviations, the clarity is gone." These are two sentences in the same passage from the Preface. The rest of that passage is similar. However, even the first chapters make heavy use of statistical language. Moreover, they assume that you already know statistics to the level of density estimation, noise, splines, and regression. Page 21 even features a footnote about the Fourier transform and Fourier convolution theorem. Clearly this book is not for the statistically shy or the mathematically shy in general, no matter what the Preface suggests. You also need to know Python and R.

Second, the chapter organization problems. There's a mismatch between the first part of each chapter, which introduces concepts and techniques, and the Workshop part of the same chapter, which uses software. I was expecting the Workshop to illustrate the implementation of the same concepts and techniques. It's not really so. The Workshop introduces Python and R facilities at a different (lower) speed than the rest of the chapter. One could even wonder why the Workshop is in the same chapter. I'd rather that each chapter consisted of a few detailed case studies that first introduce concepts and techniques and then illustrate them with software libraries.
11 of 11 people found the following review helpful
2.0 out of 5 stars Wrong enough to hurt 9 Aug 2012
By T. Carroll - Published on
Format:Paperback|Verified Purchase
While I'm not an expert in all the areas covered in this book, I am in a few. In those areas, this book is really wrong -- wrong enough to do actual damage.

For instance, when talking about regressions, the author claims that:

1) "Regression only makes sense when you want to use it as a prediction." This is very wrong. Any decent econometrics book will be almost entirely about counterexamples.
2) "Linear regression is appropriate only if the data can be described as a straight line." The "linear" in linear regression doesn't mean that at all. It just means that the form of the function must be linear in the coefficients to be calculated. In particular, y = a + b*x + c*x^2 will fit a parabola to the data.
3) "Historically, one of the attractions of linear regression has been that it is easy to calculate." It's easy to calculate for a single independent variable, but multivariate regressions are devilishly difficult to calculate because of numerical issues.
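
Point 2 can be illustrated with a short sketch: fitting the parabola y = a + b*x + c*x^2 is still a linear least-squares problem, because the unknowns a, b, c enter linearly. The Python below is an illustration written for this purpose, not code from the book; the function name `fit_quadratic` and the test data are hypothetical, and it solves the normal equations directly for simplicity.

```python
def fit_quadratic(xs, ys):
    """Least-squares fit of y = a + b*x + c*x**2 via the normal equations."""
    # Each design-matrix row is [1, x, x^2]; the model is linear in (a, b, c).
    rows = [[1.0, x, x * x] for x in xs]
    # Augmented 3x4 matrix for (X^T X) beta = X^T y.
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(3)]
        + [sum(r[i] * y for r, y in zip(rows, ys))]
        for i in range(3)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(3):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][3] / aug[i][i] for i in range(3)]


# Hypothetical noise-free data lying exactly on y = 1 + 2x + 3x^2.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [1.0 + 2.0 * x + 3.0 * x * x for x in xs]
a, b, c = fit_quadratic(xs, ys)
print(a, b, c)
```

Solving the normal equations is fine for a tiny, well-conditioned case like this one. With many correlated predictors the system can become ill-conditioned, which is the numerical difficulty point 3 alludes to; library routines typically rely on QR or SVD factorizations instead.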

This is where I really have problems, because multiple regression models are one of the most useful techniques available for understanding the effect of different factors, and the author just dismisses them out of hand.

There are other problems:

When talking about the CDF, he defines it as the integral of the histogram. The histogram is not the probability density function. The PDF is defined to integrate to 1 where the histogram integrates to something else.
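
The distinction can be made concrete with a small sketch. The Python below uses made-up data and a hypothetical `histogram` helper; it shows that raw bin counts integrate to (number of points) x (bin width), and that dividing by that quantity gives a density estimate whose running integral behaves like an empirical CDF.

```python
def histogram(data, lo, hi, bins):
    """Count values falling into `bins` equal-width bins on [lo, hi)."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in data:
        if lo <= x < hi:
            # min() guards against rounding pushing a value past the last bin.
            counts[min(int((x - lo) / width), bins - 1)] += 1
    return counts, width


# Made-up sample, for illustration only.
data = [0.5, 1.5, 1.7, 2.2, 2.4, 2.6, 3.1, 3.9]
counts, width = histogram(data, 0.0, 4.0, 4)

# Area under the raw histogram: total count times bin width, not 1.
raw_area = sum(c * width for c in counts)

# Normalizing by (n * width) turns counts into a density estimate.
n = sum(counts)
density = [c / (n * width) for c in counts]
density_area = sum(d * width for d in density)

# Running integral of the density: a step-wise CDF estimate ending at 1.
cdf = []
total = 0.0
for d in density:
    total += d * width
    cdf.append(total)

print(raw_area, density_area, cdf)
```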

The formula for standard deviation is wrong, and the formula for the exponential moving average is wrong too (a typesetting problem).
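
For reference, one commonly used recursive form of the exponential moving average is s_t = alpha * x_t + (1 - alpha) * s_(t-1). The sketch below implements that form in Python; it is an assumption that this matches what the book intended, since the review does not reproduce the book's formula, and seeding the average with the first observation is just one of several conventions.

```python
def exponential_moving_average(values, alpha):
    """Return the EMA series s_t = alpha * x_t + (1 - alpha) * s_(t-1)."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = []
    for x in values:
        if not smoothed:
            # Convention assumed here: seed with the first observation.
            smoothed.append(x)
        else:
            smoothed.append(alpha * x + (1.0 - alpha) * smoothed[-1])
    return smoothed


# Made-up series, for illustration only.
series = [10.0, 12.0, 11.0, 15.0]
ema = exponential_moving_average(series, 0.5)
print(ema)
```

A larger alpha tracks recent values more closely, while a smaller alpha smooths more heavily.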

So, my problem is that I find a lot of problems with the portions of the book I know. Can I trust the remainder of the book or should I be wary? In this case, I'm wary.
43 of 53 people found the following review helpful
5.0 out of 5 stars Wow! 22 Nov 2010
By Jeffrey K. Tyzzer - Published on
Format:Paperback|Verified Purchase
Lucid, learned, and full of insights--a great book on a difficult subject. When I pre-ordered this title, I expected it to be more cookbook-oriented. There are certainly cookbook aspects to it, but it goes way beyond that. For one, it's deep: Janert gives you solutions, sure, but you also get considerable background to go with them. I particularly like chapter 9's sagacious treatment of probability models, especially the section on power law distributions. For another, it's comprehensive--there is a lot of material here, and it's delivered with discipline and care. You can tell that Janert really pushed himself (maybe with a bit of help from his editor) when writing this book. Finally, this book has heart. Data analysis is a means to an end (albeit a wonderful, fascinating one), and the author does his best to ensure that we the reader keep the objective in mind--to inform and enlighten--all the while ensuring that we know enough to pick the right tool for the job. Chapter 16 is another stand-out, and I especially appreciated Janert's distinction here between operational and representative reports and his point about the former: good design emphasizes the content. That's a bit of Tufte-esque advice that we would all do well to remember.