NoSQL Distilled: A Review

Recently in my job at Wolters Kluwer I’ve started to look beyond the world of relational databases. The relational model forces our data into an unnatural structure, which through the years has become less and less practical to work with. The linked nature of our data is severely disregarded, and the inflexible table structure makes our new texts exceedingly hard to manage. Our team of engineers is therefore growing increasingly interested in non-relational databases. That’s why I picked up NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence by Martin Fowler and Pramod J. Sadalage, a book that promised me a clear and succinct introduction to the world of NoSQL databases. Here’s my summary and review.

The first chapters of NoSQL Distilled introduce the world of NoSQL, and the reasons for moving there. Everyone who works with relational databases has experienced the “impedance mismatch”: the difference between the data model of their applications and the relational model of their databases. For example, if the number of authors of a document can vary, relational databases force you to have a separate “authors” table to list all the names. And if your documents have links between them, you can’t follow a chain of links in your database without performing an excessive number of joins. Similarly, relational databases aren’t particularly well-equipped when it comes to handling extremely large volumes of data, like the web-scale data that companies like Facebook, Twitter or Google work with. It’s no surprise that these companies in particular have been actively exploring alternatives for their database needs.
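To make the authors example concrete, here is a minimal sketch of my own (it is not taken from the book). It uses Python’s built-in sqlite3 module; the table names, column names and sample data are all hypothetical. The point is only that the application wants one object with a list of authors, while the database hands back rows from two tables that have to be joined and reassembled.

```python
import sqlite3


def build_example_db():
    """Create an in-memory database with hypothetical documents and authors."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
        CREATE TABLE authors (
            document_id INTEGER NOT NULL REFERENCES documents(id),
            position INTEGER NOT NULL,
            name TEXT NOT NULL
        );
        INSERT INTO documents VALUES (1, 'Tax Guide'), (2, 'Case Commentary');
        INSERT INTO authors VALUES (1, 1, 'Alice'), (1, 2, 'Bob'), (2, 1, 'Carol');
        """
    )
    return conn


def load_document(conn, doc_id):
    """Reassemble one document and its author list from the two tables."""
    rows = conn.execute(
        "SELECT d.title, a.name "
        "FROM documents d LEFT JOIN authors a ON a.document_id = d.id "
        "WHERE d.id = ? ORDER BY a.position",
        (doc_id,),
    ).fetchall()
    if not rows:
        return None  # no such document
    # Every row repeats the title; the author names are spread over the rows.
    return {
        "title": rows[0][0],
        "authors": [name for _, name in rows if name is not None],
    }


if __name__ == "__main__":
    print(load_document(build_example_db(), 1))
```

One join is harmless, of course. The pain starts when every variable-length property needs its own table, or when following a chain of links means one more join per hop.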

Most NoSQL databases, as these alternatives are called, take aggregates as their central data structure. These aggregates typically map much more nicely to the data model used by applications. Document stores, for example, can store a varying number of authors as a list of variable length. Graph databases are the exception: they are not aggregate-oriented, but are optimized for walking graphs of linked entities. Moreover, because an aggregate is a natural unit of distribution, it’s also easy to run the aggregate-oriented databases on a cluster. Data can be distributed, “sharded”, across the nodes of the cluster, and technologies like MapReduce can be used to perform operations on the cluster.
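Here is a rough sketch of how aggregates, sharding and a map-reduce style computation fit together. It runs in a single Python process, so it only imitates a cluster. The shard count, the hashing scheme and the sample documents are my own assumptions, not how any particular database does it.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 3  # hypothetical cluster size

# Each document is one aggregate: the authors live inside it as a list.
DOCUMENTS = [
    {"id": "doc-1", "title": "Tax Guide", "authors": ["Alice", "Bob"]},
    {"id": "doc-2", "title": "Case Commentary", "authors": ["Alice"]},
    {"id": "doc-3", "title": "Annotated Code", "authors": ["Carol"]},
    {"id": "doc-4", "title": "Practice Notes", "authors": ["Bob", "Carol"]},
]


def shard_for(key, num_shards=NUM_SHARDS):
    """Pick a shard from a stable hash of the aggregate's key."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def distribute(documents, num_shards=NUM_SHARDS):
    """Place every aggregate, as a whole, on exactly one shard."""
    shards = [[] for _ in range(num_shards)]
    for doc in documents:
        shards[shard_for(doc["id"], num_shards)].append(doc)
    return shards


def map_authors(shard):
    """Map step: count documents per author using only local data."""
    counts = Counter()
    for doc in shard:
        counts.update(doc["authors"])
    return counts


def reduce_counts(partial_counts):
    """Reduce step: merge the per-shard counts into one total."""
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


if __name__ == "__main__":
    shards = distribute(DOCUMENTS)
    print(reduce_counts(map_authors(shard) for shard in shards))
```

Because each document travels as one unit, the map step never has to fetch data from another shard. That is what makes the aggregate a convenient unit of distribution.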

As the authors of NoSQL Distilled admit, this is not without its downsides. First, NoSQL databases rarely have the full “ACID” transactions (Atomic, Consistent, Isolated and Durable) of relational databases. Most importantly, when it comes to consistency, the so-called CAP theorem states that you can only get two of the three properties of Consistency, Availability and Partition tolerance. Most NoSQL databases are therefore “eventually consistent”, which means that copies of the data on different nodes may temporarily disagree after an update, but all inconsistencies will eventually be resolved. Second, NoSQL databases work without a schema. Whereas relational databases require the data in their tables to take a certain form, NoSQL databases are much more flexible. In many situations, this flexibility is a good thing. It allows quick changes to the data model in agile software development, for example. There’s a catch, however. In the end you need to know the type of data that resides in your database: you only want to allow your users to create and update data when their data conforms to that type. Because NoSQL databases move this responsibility to the application using the database, your developers are likely to write more code.
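That extra code might look something like the sketch below. It is my own illustration, with a plain dictionary standing in for a schemaless store; the required fields and the function names are hypothetical. The database would accept anything, so the application checks the shape of a document before writing it.

```python
# The implicit schema, kept in application code because the store has none.
REQUIRED_FIELDS = {"title": str, "authors": list}


def validate_document(doc):
    """Return a list of problems; an empty list means the document is acceptable."""
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in doc:
            errors.append(f"missing field: {field}")
        elif not isinstance(doc[field], expected_type):
            errors.append(f"{field} should be a {expected_type.__name__}")
    authors = doc.get("authors")
    if isinstance(authors, list) and not all(isinstance(a, str) for a in authors):
        errors.append("authors should contain only strings")
    return errors


def save(store, key, doc):
    """Write to the (hypothetical) store only if the document passes validation."""
    errors = validate_document(doc)
    if errors:
        raise ValueError("; ".join(errors))
    store[key] = doc


if __name__ == "__main__":
    store = {}  # stand-in for a schemaless database
    save(store, "doc-1", {"title": "Tax Guide", "authors": ["Alice", "Bob"]})
    print(validate_document({"title": "Untitled", "authors": "Alice"}))
```

Every application that writes to the same database needs to agree on these checks, which is exactly the work a relational schema would otherwise do for you.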

In the second part of the book, the authors discuss the main types of NoSQL databases, each with a concrete example. We learn about key-value stores (Riak), document databases (MongoDB), column-family stores (Cassandra) and graph databases (Neo4j). This is, of course, a simplification of the NoSQL landscape. Many key-value stores borrow properties from document databases, and some databases (like OrientDB) straddle two of the mentioned categories. Still, the overview successfully introduces the main distinctions between existing databases and can serve as a reference against which other databases can be evaluated. Of particular interest are the lists of typical use cases for every database. The NoSQL landscape is extremely varied, and the choice of database that you make must ultimately depend on the nature of your data, and the way in which it will be used.

The last part of NoSQL Distilled is less focused. The authors go into the “polyglot persistence” from the subtitle of their book, briefly discuss schema migrations, and introduce some more databases beyond the NoSQL world. This part contains some interesting bits and pieces, but the discussions are too superficial to be really helpful. The main point is that NoSQL databases will not completely replace their relational competitors. Rather, we’re evolving toward a polyglot situation where different databases serve different needs.

Like data and applications, different readers have varying needs. Because of its succinctness, developers who already have experience with NoSQL databases will likely not learn anything new from NoSQL Distilled. With just five or ten pages set apart for each database, the authors only scratch the surface of the many possibilities on offer. Newcomers, however, or people eager for a clear overview of the data models and database features of the NoSQL world, will be perfectly happy with this book. For that group, I can highly recommend it.
