I'm not a database person but I've worked with SQL databases (esp. MySQL) and have read a few papers about non-relational databases, particularly Google's Bigtable. I understand the "web-scale" data challenge and see how a distributed, fault-tolerant, tunable open-source database like Cassandra can be an incredibly useful tool for addressing it. Therefore I was really looking forward to the publication of Eben Hewitt's Cassandra: The Definitive Guide. I was hoping that it would lay out all the important things a person would need to know in order to decide whether Cassandra made sense for their project and, if it did, how specifically they would use it.
Now that the book's out and I've had a chance to read it once through, I have to say that it does not meet my expectations. The author is clearly very interested in his subject and also very anxious to share insights not only into Cassandra but into modern non-relational databases in general (to the extent of including a 25-page appendix "The Nonrelational Landscape" at the end of the book). He does a pretty good job of explaining how Cassandra works at the level of distributed storage including scaling as well as availability and consistency. And though I haven't gone through the steps, he seems to give pretty good instructions for installing, configuring and monitoring a Cassandra cluster.
What he doesn't cover nearly as well as I was hoping (and would have expected from an O'Reilly book) is data modeling in Cassandra and the actual APIs for putting data into the database and getting data out (i.e. querying). It's not that he doesn't cover these subjects at all. In fact he devotes two chapters to data modeling (Chapter 3 The Cassandra Data Model and Chapter 4 Sample Application) and two to APIs (Chapter 7 Reading and Writing Data and Chapter 8 Clients), and these chapters contain a lot of useful information. The problem is that the information I really want is either mixed in with other information that is, for me, less important, or is too limited, or is not present at all.
Here are some things that I would have expected to be presented in reasonably full, coherent form in a "definitive guide" to Cassandra:
Column families, supercolumns and columns - what are they for, how do you use them effectively? Especially supercolumns, which, in conjunction with the intrinsically sparse data representation, allow you to blur the distinction between structure and data and store data in "wide" format and even as out-and-out row-specific lists. He touches on matters of this sort, including in the design patterns at the end of his Data Modeling chapter, but doesn't integrate them into a coherent account of how to use the Cassandra data representation model.
Lack of joins - what are the alternatives? He addresses this issue too, but mostly says, denormalize your tables and design for common queries - or even more bluntly, precompute the results of your common queries and put them into your database. This may be a good approach in some situations, but it leaves a lot of questions open: when do you precompute your query results, where and how do you do it, what triggers the computation, and how do you handle data changes that invalidate previously precomputed results (one of the problems that normalization and joins were originally designed to solve)? Also, I believe he does not say very much about implementing joins and other complex queries on the client side. Does Cassandra have properties that determine more vs. less efficient ways of doing this? How important is planning for locality in your column family organization? And what about using supercolumns to maintain lists/sets so that you don't have to assemble them at query time?
Primary API - what is it? As the author explains, Cassandra doesn't have a query language, so he can't offer a chapter on the Cassandra equivalent of, say, SQL for relational databases. But Cassandra does have an API that lets you put data in and get data out, if not also other things like creating and deleting column families, supercolumns and columns. I was really expecting a chapter (or appendix or whatever) listing out the complete set of API requests and responses, either in some language-neutral format or in terms of the "native" Cassandra language, i.e. Java, ideally with additional information on "bindings" for other client-side languages like PHP, Python and so on. Again the information is sort of there, but not pulled together.
Higher-level wrappers - what are they about? The author talks about Thrift and Avro as (at least somewhat) high-level languages for communicating with Cassandra, but doesn't lay out in any coherent way what those languages are. These tools may be very familiar to some, but I'm sure not to all. He does provide enough information - especially in the form of external links - to make it possible to start exploring these tools, but I would have expected the book to give a pretty good idea of what they're about without having to go off and read other material.
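To make the first two points a bit more concrete, here is the kind of small sketch I have in mind. It is plain Python with nested dictionaries, not Cassandra's API or any client library, and every name in it (users, following, friend_emails, the helper functions, the sample data) is made up for illustration. It assumes the usual description of the model: a column family maps row keys to columns, and a super column family adds one more level of nesting. It shows sparse rows, a "wide" row used as a row-specific list, a client-side join, and a precomputed (denormalized) copy going stale when the underlying data changes.

```python
from collections import defaultdict

# Hypothetical stand-in for a standard column family:
# row key -> {column name: value}. Rows are sparse, so each row
# holds only the columns it actually has.
users = {
    "alice": {"email": "alice@example.com", "city": "Oslo"},
    "bob": {"email": "bob@example.com"},  # no "city" column at all
}

# Hypothetical stand-in for a super column family:
# row key -> {supercolumn name: {column name: value}}.
# The column names themselves carry data (who is followed),
# which makes the row a per-user list.
following = defaultdict(lambda: defaultdict(dict))

# Hypothetical denormalized column family: the email is copied
# into the follower's row when the relationship is written.
friend_emails = defaultdict(dict)


def follow(follower, followee, since):
    """Record the relationship as a column in the follower's wide row."""
    following[follower]["friends"][followee] = since


def follow_denormalized(follower, followee):
    """Precompute the 'query result' at write time by copying the email."""
    friend_emails[follower][followee] = users[followee]["email"]


def friends_with_email_join(user):
    """Client-side 'join': one extra lookup in users per followed user."""
    return {name: users[name]["email"] for name in following[user]["friends"]}


follow("alice", "bob", "2010-11-01")
follow_denormalized("alice", "bob")

# Bob changes his email after the copy was made.
users["bob"]["email"] = "bob@new.example.com"

joined = friends_with_email_join("alice")   # sees the new address
precomputed = dict(friend_emails["alice"])  # still holds the old one
```

The join always reads current data but costs one lookup per friend; the precomputed copy is a single read but nothing in the sketch updates it when Bob's email changes. Who does that update, and when, is exactly the question I wanted the book to answer.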
While I am, overall, dissatisfied with the book, I found it both an interesting read and an engaging introduction to the world of Cassandra. It also undeniably offers a wealth of information, even if it's not exactly the information a person may be looking for. For this reason I'm rating it 3 stars.