Hadoop: The Definitive Guide [Paperback]

Tom White
3.3 out of 5 stars  See all reviews (3 customer reviews)

There is a newer edition of this item:
Hadoop: The Definitive Guide 4.3 out of 5 stars (6)
In stock.

Book Description

15 Jun 2009 | ISBN-10: 0596521979 | ISBN-13: 978-0596521974 | Edition: 1

Hadoop: The Definitive Guide helps you harness the power of your data. Ideal for processing large datasets, the Apache Hadoop framework is an open source implementation of the MapReduce algorithm on which Google built its empire. This comprehensive resource demonstrates how to use Hadoop to build reliable, scalable, distributed systems: programmers will find details for analyzing large datasets, and administrators will learn how to set up and run Hadoop clusters.

Complete with case studies that illustrate how Hadoop solves specific problems, this book helps you:

  • Use the Hadoop Distributed File System (HDFS) for storing large datasets, and run distributed computations over those datasets using MapReduce
  • Become familiar with Hadoop's data and I/O building blocks for compression, data integrity, serialization, and persistence
  • Discover common pitfalls and advanced features for writing real-world MapReduce programs
  • Design, build, and administer a dedicated Hadoop cluster, or run Hadoop in the cloud
  • Use Pig, a high-level query language for large-scale data processing
  • Take advantage of HBase, Hadoop's database for structured and semi-structured data
  • Learn ZooKeeper, a toolkit of coordination primitives for building distributed systems

If you have lots of data -- whether it's gigabytes or petabytes -- Hadoop is the perfect solution. Hadoop: The Definitive Guide is the most thorough book available on the subject.

"Now you have the opportunity to learn about Hadoop from a master -- not only of the technology, but also of common sense and plain talk." -- Doug Cutting, Hadoop Founder, Yahoo!

Product details

  • Paperback: 528 pages
  • Publisher: O'Reilly Media; 1 edition (15 Jun 2009)
  • Language: English
  • ISBN-10: 0596521979
  • ISBN-13: 978-0596521974
  • Product Dimensions: 2.6 x 18.1 x 23.1 cm
  • Average Customer Review: 3.3 out of 5 stars  See all reviews (3 customer reviews)
  • Amazon Bestsellers Rank: 873,981 in Books (See Top 100 in Books)

Product Description

Book Description

MapReduce for the Cloud

About the Author

Tom White has been an Apache Hadoop committer since February 2007, and is a member of the Apache Software Foundation. He works for Cloudera, a company set up to offer Hadoop support and training. Previously he was an independent Hadoop consultant, working with companies to set up, use, and extend Hadoop. He has written numerous articles for O'Reilly, java.net and IBM's developerWorks, and has spoken at several conferences, including ApacheCon 2008, where he spoke on Hadoop. Tom has a Bachelor's degree in Mathematics from the University of Cambridge and a Master's in Philosophy of Science from the University of Leeds, UK.

Customer Reviews

Most Helpful Customer Reviews
5 of 5 people found the following review helpful
4.0 out of 5 stars Excellent book: get it, but beware... 12 Nov 2009
This is really an excellent introduction to Hadoop and related projects (e.g. HBase, ZooKeeper, Pig, etc.). If you want to find out more about Hadoop and the Map/Reduce paradigm, this is definitely the best book you can get. If you want to get serious about Hadoop and Map/Reduce (this was my case), it is even better, because it covers a lot of details that you will not be able to find in the (rather scarce) online documentation.
If you are mainly interested in HBase, do not expect too much though. The book dedicates only one chapter to it, providing a very high-level overview that will not help you address or solve any real-life problem. Anyway, the title of the book is 'Hadoop', not 'HBase', so fair enough.
The main reason for holding back the fifth star is that the book is based on the 0.19.* API. The book suffered from bad timing: it was released only in June 2009, and as of November 2009 it is already obsolete, because the Hadoop 0.20 API changed so radically that most of the examples will only work if you are happy to use 90% deprecated classes in your code. I still think it is a 4-star book, because any decent developer would have little trouble moving from the 0.19.* to the 0.20.* API.
So, in short, go get it!
1 of 1 people found the following review helpful
2.0 out of 5 stars Poor Practical Examples 17 Aug 2010
By altuure
This book is good at going over the API specifications. It tries to explain most of the technical functions, and for a long book like this that can be very boring.
But if you are trying to understand the essence of the cloud and MapReduce, it is really poor. There are very few examples to go with the content, so it is very hard to get any benefit from the technical knowledge.
4.0 out of 5 stars A good overview of Hadoop 5 April 2011
Format: Kindle Edition
It's a good overview book I think, although the reader will need to follow up with the websites and where the community has gone since it was published. Detailed enough to get your feet wet but as with all technical subjects, the reader needs an update from the internet after reading.

Recommend it.
Most Helpful Customer Reviews on Amazon.com (beta)
Amazon.com: 4.3 out of 5 stars  15 reviews
47 of 49 people found the following review helpful
5.0 out of 5 stars Pigs and Elephants on the road to World Domination 13 July 2009
By Techie Evan - Published on Amazon.com
These days, one can't seem to attend technical conferences without hearing marketing-oriented speakers' world domination plans for their products. So imagine this: what if pigs and elephants are involved? Elephants would be Hadoop installations, and Pigs would be one of those animal-themed tools, smarter cousins of the elephants really, riding on top of Hadoops, directing them on how to perform their jobs. Would the world be a better place?

Hadoop is the brainchild of Doug Cutting, who named his creation after his kid's stuffed yellow elephant. Hadoop enables large datasets distributed over a cluster of machines to be processed in parallel. One machine or node in that cluster would usually house a JobTracker and a NameNode. The JobTracker schedules and manages processing jobs to be executed on the other machines, and the NameNode manages the metadata (e.g., file names and locations) of the datasets to be processed. The processing jobs are programmed in the form of Map and Reduce functions. Inputs are usually split into blocks to be processed in parallel by two or more identical mappers. The close-to-final outputs are then fed to one or more identical reducers, whose job is to perform any final transformations on the intermediate data to produce data summaries in the expected format. Several companies are using Hadoop to extract knowledge from their extensive data.
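
To make the map and reduce flow described above concrete, here is a minimal single-process sketch in plain Python. It is not the Hadoop API: the names map_fn, reduce_fn, and run_job are hypothetical, the "splits" are just slices of a list rather than HDFS blocks, and everything runs sequentially instead of across a cluster. It only illustrates the shape of a word-count job, with input split among identical mappers, intermediate pairs grouped by key, and each group summarized by a reducer.

```python
from collections import defaultdict


def map_fn(line):
    """Hypothetical mapper: emit a (word, 1) pair for every word in a line."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_fn(word, counts):
    """Hypothetical reducer: sum all the counts collected for one word."""
    return word, sum(counts)


def run_job(lines, num_splits=2):
    """Simulate a MapReduce job sequentially (an assumption for illustration)."""
    # Split the input; a real cluster would hand each split to a separate mapper.
    splits = [lines[i::num_splits] for i in range(num_splits)]

    # Map phase: every split is processed by the same map function.
    intermediate = []
    for split in splits:
        for line in split:
            intermediate.extend(map_fn(line))

    # Shuffle phase: group the intermediate values by key.
    grouped = defaultdict(list)
    for key, value in intermediate:
        grouped[key].append(value)

    # Reduce phase: one reduce call per key, visited in sorted key order.
    return dict(reduce_fn(key, values) for key, values in sorted(grouped.items()))


if __name__ == "__main__":
    sample = ["pig rides elephant", "elephant stores data", "pig queries data"]
    print(run_job(sample))
```

In a real Hadoop job the framework, not the user code, handles the splitting, the grouping, and the scheduling of mappers and reducers; the programmer supplies only the two functions.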

I've read this book and Jason Venner's Pro Hadoop book. Although I like both, I like this book better for the following reasons: more comprehensive coverage of topics, and more insider information on design rationales and how certain Hadoop features really work behind the scenes.

Here's a breakdown of and some commentaries on the book's contents:

Chapter One introduces Hadoop, its history and how it's different from similar tools or frameworks. Kinda dry. Chapter Two introduces the MapReduce programming model and its benefits when compared to, say, the use of Unix tools for achieving parallel processing of text files. This is also where readers are introduced to the concepts of map, combiner, and reduce functions, shuffle and sort, streaming, etc. Chapters Three and Four are all about the Hadoop Distributed Filesystem and Hadoop I/O, and the design decisions that were made to address performance, reliability, and safety concerns.

Chapter Five shows you how to develop, configure, test, run and tune a MapReduce Application. Good chapter but Jason Venner's book has better materials on testing and debugging MapReduce applications.

Chapters Six through Eight discuss how MapReduce really works behind the scenes, including advanced features. This is where you'll learn how flexible Hadoop is when it comes to handling different types of inputs and outputs in terms of numbers, sizes, formats, and usage scenarios. Excellent!

Chapters Nine and Ten are really good. They teach you how to set up and administer Hadoop clusters. There's even a brief but informative section on how to use Hadoop with Amazon EC2 servers.

Chapters 11-13 devote one chapter each to installing and interacting with frameworks built on top of Hadoop: Pig, HBase, and ZooKeeper. Chapter 14 provides case studies (e.g., how Facebook uses Hadoop to analyze ad campaign effectiveness).

Appendices A and B provide instructions on how to install Apache's Hadoop and Cloudera's distribution, respectively, and C gives you a runthrough of the steps to take when preparing to use the NCDC Weather Data used in the book.

Very thorough and well written book. 4.5 stars rating.
38 of 43 people found the following review helpful
3.0 out of 5 stars Partly succeeds 8 Sep 2009
By BillyJoeBob - Published on Amazon.com
Format: Paperback | Verified Purchase
Tom White certainly writes very well: this book is very readable. It is also quite comprehensive, falling somewhere between a tutorial and a reference.

That being said, I was ultimately rather disappointed. First, and most importantly, it was not clear to me after reading this book how I might use Hadoop for some of my projects, or if indeed they were good candidates for MapReduce. I feel it should have been possible to provide some generic guidance. Second, some chapters are written by other authors, and these did not uniformly provide the same quality of instruction, reading occasionally like advertisements.

I confess I am puzzled by the number of encapsulating and utility APIs that have grown up around Hadoop. Why do we need Pig, HBase, Hive, ZooKeeper and Cascading? Apparently because (according to what I have read here) bare Hadoop is hard to program with productively. Some indication of how these wrappers interact with each other would have been helpful.

As it is, I feel LESS urge to evangelize for Hadoop having read this book. Surely not the desired effect?
7 of 9 people found the following review helpful
5.0 out of 5 stars First 25 Pages Have You Up And Running! 24 Aug 2009
By Jonathan Zdziarski - Published on Amazon.com
I picked up this book to catch up on Hadoop, which the rest of my team has been using for several months. Unfortunately I was too busy with other projects to spend any time on MapReduce and thought it'd be a grueling process to be brought up to speed on it. Within the first 25 pages and about 3 hours, Tom had me up and running my first MapReduce job which I successfully adapted for a specific metric we were trying to generate. The book does a great job of breaking down Hadoop's complex pieces into easy to understand components, but doesn't try and pump you full of conceptual BS before it lets you touch real code.

If I were to make any suggestions it would be to start the book off with some simple instructions for installing and getting Hadoop up and running on a local machine, followed by some simple explanations of DFS and Hadoop's commands for managing the file system. I would also explain much earlier how to get your classes recognized by Hadoop for those a bit rusty at Java. Fortunately, the online Wiki was very good about providing instructions to get me going on a Mac, and that took a majority of OS-specific needs off the burden of the book. You will, no doubt, have to be intelligent to read this book, but if you're using Hadoop, there is already a prerequisite for technical proficiency you'll need to satisfy. Overall good job, Tom.
4 of 5 people found the following review helpful
4.0 out of 5 stars The elephant is tamed 30 April 2010
By JUG Lugano - Published on Amazon.com
Original review written by Paolo Canesi, JUG Lugano, [...]

Managing and analyzing huge data sets has become a very common problem in various areas of modern information technology, from different types of Web applications (social, financial, trading, ...) to applications for analyzing scientific data.

Distributed systems over a cluster of machines are almost a mandatory choice in such cases, but designing and implementing an effective solution in those areas may be troublesome and become a nightmare.

The Apache Hadoop Project is an infrastructure that helps the construction of reliable, scalable, distributed systems. Mainly known for its MapReduce and distributed file system (HDFS) subprojects, it actually includes other services that complement or extend them.

Tom White's "Hadoop: The Definitive Guide" is an enjoyable book which fully explains these complex technologies. The book is organized in such a way that the reader is gently guided into the Hadoop ecosystem. It begins with a couple of very readable chapters as a general introduction to the problems Hadoop is meant to solve and the main solutions to them (MapReduce and HDFS), then examines all its aspects closely, often describing what really happens behind the scenes and giving useful design suggestions and descriptions of common pitfalls. When reading this book you won't be overwhelmed by tons of lines of code: examples are short and yet effective.

This kind of structure makes it hard to classify the book as a mere tutorial or as a real reference guide; it is better considered a mix of the two. While this turns out to be a positive choice in many ways, it has some drawbacks: the reader is sometimes forced to go back and forth through the chapters and has to read the book almost entirely to get a full understanding. But this is perhaps the price to pay for a fluent and pleasant read.

Let's go quickly through the chapters:

The first chapter is a brief history of the Hadoop project, illustrating its main characteristics and comparing them to those of other similar technologies. Chapter two is a pleasant introduction to MapReduce. The third chapter breaks the continuity of the previous one by examining the Hadoop Distributed File System (the HDFS subproject) in detail. Chapter four takes a step down in the abstraction layers, covering the Hadoop I/O fundamentals (data integrity, compression, serialization and data structures) and explaining the design choices.

Chapters five to eight are an excellent source for learning Hadoop MapReduce in depth. They cover all aspects of it, from the practical ones, such as how to configure, run, test and debug MapReduce programs, to the more advanced and formal ones, like programming models, data formats, and sorting and joining tools.

The two following chapters list a few very interesting and useful suggestions for setting up and managing a Hadoop cluster, a precious resource for administrators.

Chapters eleven to thirteen cover Pig, HBase and ZooKeeper, subprojects under the Hadoop umbrella. Despite suffering from brevity, they are still interesting.

Chapter fourteen is there so that the reader does not feel alone: it presents important case studies using Hadoop (e.g. Yahoo, and other contributions from the Apache Hadoop community).

My final opinion is that "Hadoop: The Definitive Guide" is a very useful resource for those who want to learn how to ride the "pachydermic" Hadoop (like a "Mahout", perhaps?).
4 of 5 people found the following review helpful
5.0 out of 5 stars Don't understand all the other negative reviews 23 July 2009
By Timothy T. Wee - Published on Amazon.com
This is the book to get if you are actually doing something with Hadoop. It's been a lifesaver, and has answered all our questions of, "I wonder if I can do x in Hadoop?"
It gives a lot of information about the internals of Hadoop, which you will want to know when things go wrong or when you just want to get more out of Hadoop.
I don't normally post reviews much, but I think Tom White and this book deserve way more than 5 stars, so I'm not sure why it only has 3 stars on Amazon.