The road to SQL started with Dr. E.F. Codd’s paper, “A Relational Model of Data for Large Shared Data Banks“, published in Communications of the ACM in June 1970. His colleagues at IBM, Donald Chamberlin and Raymond Boyce were working on a query language (originally named SQUARE, Specifying Queries As Relational Expressions) that culminated in the 1974 paper, “SEQUEL: A Structured English Query Language“. Since that time, SQL has become the dominant language for relational database systems. In recent years, frameworks and architectures have arrived on the programming scene that attempt to hide (or completely remove) the use of SQL and relational databases allowing developers to focus even more on user interfaces, business logic and platform support in our application development. We have also seen the rise of alternatives to relational databases, the so-called “NoSQL” data stores. Are we witnessing the end of SQL and relational databases?
In a December DDJ podcast interview by Mike Riley, I was asked the question: “With the growing trend of ORMs (Object Relational Mapping), some software developers are saying that SQL is becoming less relevant. What are your opinions about that statement?” Thinking over the holidays about that question and the implications including and beyond ORMs, I spent some time thinking more about frameworks like Ruby on Rails Active Record and Hibernate. These frameworks still rely on team members who work on the design, development and maintenance of relational databases. The work that Microsoft has done with LINQ also reduces the impedence mismatch between programming languages and database languages.
The “NoSQL movement” and Cloud based data stores are striving to completely remove developers from a reliance on the SQL language and relational databases. Some developers think that the NoSQL movement is something completely new. Object databases appeared in the 1980s and Ray Ozzie started the commercial document-centric datastore business with Lotus Notes in the 1990′s. Charlie Caro, Senior Software Engineer who works on Embarcadero’s InterBase SQL database engine, told me “At the time, it seemed almost heretical that a data model that ignored concurrency control could flourish. But Ozzie recognized the benefits of distribution, replication and ease-of-setup would outweigh the very, infrequent occurrence of update conflicts in managing documents and messages. Even so, there were opt-in capabilities if documents needed to be locked for safe modification without data loss. But the default was to just let it fly with no update coordination.”
NoSQL, according to WikiPedia, is “an umbrella term for a loosely defined class of non-relational data stores”. The term was first used by Rackspace employee Eric Evans. In his blog post last October, Eric talks about the name NoSQL (now generally known to mean Not Only SQL). The real nugget in the post is “the whole point of seeking alternatives is that you need to solve a problem that relational databases are a bad fit for.” Adam Keys in his The Real Adam blog post offered up an alternative term, “Post-Relational”. One goal of some NoSQL databases is to remove the overhead and memory footprint of relational databases. Other goals of NoSQL databases include: closer affinity with programming languages, use of web technologies and RPC calls for access and optional forms of data query.
In a recent blog post, “The “NoSQL” Discussion has Nothing to Do With SQL” Professor Michael Stonebraker discusses comparisons of SQL and NoSQL databases. SQL and NoSQL databases can be compared using any or all of the following features and characteristics. (Note: there are many more that can be added to this list for either or both types of databases. Post a comment with additional items that might characterize the differences between the two types):
- Vertically and Horizontally Scalable – relational databases (traditionally) reside on one server which can be scaled by adding more processors, memory and storage to provide scalability. Relational databases residing on multiple servers usually use replication to keep the databases synchronized. NoSQL databases can reside on a single server but more often are designed to work across a cloud of servers. (NoSQL: Distributed and Scalable Non-Relational Database Systems)
- Columns, Key/Value Store, Tuple Store – relational databases are usually comprised of a collection of columns in a table and/or view (fixed schema, join operations). NoSQL databases often store a combination of key and value pairs or Tuples (schema free, an ordered list of elements).
- In-Memory Dataset, On-Disk Storage – relational databases almost always reside on a disk drive or a storage area network. Sets of database rows are brought into memory as part of SQL select or stored procedure operations. Some (but not all) of the NoSQL databases are designed to exist in memory for speed and can be persisted to disk.
- Document-Oriented, Collection-Oriented, Column-Oriented, Object-Oriented, Set-Oriented, Row-Oriented – Document-oriented databases contain records of documents, fields and XML. Collection-oriented datasets provide closer affinity to object-oriented programming languages. Relational databases are characterized by data organized in tables, rows and columns (Column-Oriented). SQL select operations normally return cursors to a single row or a set of rows containing the columns specified. Object-Oriented databases have been around since the dawn of object-oriented programming but relational databases are by far the dominant database architecture in use today (and for years to come). Are object databases NoSQL databases? The rise of object-relational mapping (ORM) frameworks marries object-oriented programming to the larger number of relational databases. NoSQL databases data are often organized into objects, key/value pairs or tuples. NoSQL database election operations are often done in code or an interface.
In an email conversion, Charlie Caro told me the following: “If Facebook has to manage 100,000,000′s of user profiles, a distributed, context-free key-value store is probably the way to go. Simple lookup by large numbers of users but only a single updater of the profile by the owning user means a transactional database is probably overkill. Multiple readers, with one updater, needs no concurrency control. In many cases, the simpler setup and ease-of-use of the NoSQL solutions is what attracts its user community. SQL databases require more setup (schema et al) but those schemas are exactly what give relational databases higher performance opportunity in parallel DBMS implementations. This ease of use premium finds its way in programming languages as well. Many developers today prefer scripting languages over their compiled counterparts, which have static type checking for safer programs. Scripting languages are just so forgiving and easy to get started with and projects to compile scripts into .NET/Java intermediate byte codes are increasing their performance.” We both agreed that it’s really all about having the right tool for the job, and it always has been! No sense hammering a screw into the wall when you can use a screwdriver.
For years to come, most of us will still rely on and use relational databases and SQL. I certainly will. I will also continue to look for better ways to isolate and encapsulate data access in my applications. As always, any project decisions have to match with the user and business requirements. For new projects, I believe, we have genuine non-relational alternatives on the table (pun intended).
I covered some background on the SQL language and relational databases, the current and future for relational databases, the rise of frameworks that hide some of the complexities of database programming and the rise of the NoSQL databases. In this second installment I will take a walk through (some of) the currently available open and closed source NoSQL databases. Then, in part 3, I will point you to NoSQL Internet resources, past/upcoming events and offer some guidance for developers.
It’s amazing to see how many open and closed source alternative data stores have appeared. At the same time more are appearing every day. If I have left off one or more of your favorite NoSQL databases post a comment. Below you will find many different types of NoSQL databases: Document-Oriented, Collection-Oriented, Column-Oriented, Object-Oriented, Graph-Oriented, Set-Oriented, Row-Oriented and more.
|Description:||Modern, high performance, persistent graph database.|
|Storage:||Disk based, meta-data and data triples.|
|Description:||C language embeddable library for enterprise-grade, concurrent, transactional storage services. Thread safe to avoid data corruption or loss|
|Storage:||B-tree, hash table, persistent queue|
|API(s):||C, C++ and Java|
|Notes:||Use BerkleyDB XML layer on top of BerkleyDB for XML based applications. Comparison of BerkleyDB and relational databases|
|Type:||Sparse, distributed, persistent multidimensional sorted map.|
|Description:||Distributed storage system for structured data. Data model provides dynamic control over data layout and format. Data can live in memory or on disk.|
|Storage:||Data is stored as an uninterpreted array of bytes. Client applications can create structured and semi-structured data inside the byte arrays.|
|API(s):||Python, GQL, Sawzall API, REST, various.|
|Notes:||Overview: Bigtable: A Distributed Storage System for Structured Data (PDF format)|
|Type:||Dimensional hash table|
|Description:||Highly scalable distributed database. Combines Dynamo’s distributed design and Bigtable’s column family data model.|
|Storage:||Clusters of multiple keyspaces. The keyspace is a name space for column families. Columns are comprised of a name, value and timestamp.|
|API(s):||Java, Ruby, perl, Python, C#, Thrift framework.|
|Notes:||Open sourced by Facebook in 2008. Wiki, FAQ, Examples|
|Description:||Distributed database with incremental replication, bi-directional conflict detection and management.|
|Storage:||Ad-hoc and schema-free with a flat address space.|
|Notes:||CouchDB Introduction, Technical Overview|
|Description:||Java and .NET dual license (commercial and open source) object database.|
|Storage:||Data objects are stored in the way they are defined in the application.|
|API(s):||Java, .NET languages.|
|Notes:||db4o db4o database runtime engine, about db4o|
|Company/Org:||Millstone Creative Works|
|Description:||Schemaless database similar to Amazon’s SimpleDB. Open source, standalone Java application server.|
|Storage:||JSON data format, “bags” (similar to tables).|
|Description:||Open source Amazon Dynamo clone written in Erlang.|
|Storage:||Distributed key/valve store, Pluggable storage engines.|
|Description:||Distributed cache processes, partitions, replicates and manages data across servers.|
|Storage:||Data and database cache, “near cache” for local subset of data. Java persistent cache. Map reduce support.|
|API(s):||Java APIs, REST data service|
|Notes:||eXtreme Scale Document library web site|
|Type:||Hierarchical, multi-dimensional sparse arrays, content associative memory|
|Description:||Small footprint, multi-dimensional array with fill support for ACID transactions, optimistic concurrency and software transactional memory.|
|Storage:||Unstructured array of bytes. Can be Key/Value, document oriented, schema-less, dictionary or any other data model.|
|API(s):||Mumps, C/C++, SQL|
|Type:||Embedded storage library|
|Description:||Lightweight embedded database engine. Supports on disk and in memory databases.|
|Storage:||B+tree with variable length keys.|
|API(s):||C++, Python, .NET and Java|
|Notes:||hamsterdb FAQ, examples, tutorial|
|Type:||Sparse, distributed, persistent multidimensional sorted map.|
|Description:||Open source, distributed, column-oriented, “Bigtable like” store|
|Storage:||Data row has a sortable row key and an arbitrary number of columns, each containing arrays of bytes.|
|API(s):||Java API, Thrift API, RESTful API|
|Notes:||Part of Apache Hadoop project. HBase Wiki, FAQ|
|Type:||Sparse, distributed, persistent multidimensional sorted map.|
|Description:||High performance distributed data storage system designed to run on distributed filesystems (but can run on local filesystems). Modeled after Google Bigtable.|
|Storage:||Row key (primary key), column family, column qualifier, time stamp.|
|API(s):||C++, Thrift API, HQL|
|Notes:||Hypertable Architectural overview, FAQ|
|Description:||Scalable, highly available, peer to peer, data grid platform.|
|Storage:||Key/Value pair with optional expiration lifespan.|
|API(s):||Java, PHP, Python, Ruby, C|
|Notes:||Infinispan FAQ, Wiki|
|Description:||Internet graph database made up on nodes and edges. Supports in-memory and persistent storage alternatives including RDBMS, file system, file grid, and custom storage.|
|Storage:||Nodes (meshobjects) and edges (relationships). Meshobjects can have entity types, properties and participage in relationships. MeshObjects raise events.|
|API(s):||RESTful web services.|
|Notes:||InfoGrid Overview, FAQ|
|Description:||Distributed (master/slave) key-value data store delivering strong consistency, fault-tolerance and high availability.|
|Storage:||Uses BErkeleyDB library for For local storage. Key/Value pairs and their state are replicated to multiple servers.|
|API(s):||C/C++, Python, PHP, HTTP|
|Notes:||Keyspace Overview, FAQ|
|Description:||High performance, high realiability persistent storage engine for key/value object storage.|
|Storage:||Uses BerkeleyDB as storage library/backend.|
|API(s):||Memcache protocol, C, Python, Java, perl|
|Notes:||MemcacheDB complete guide (PDF format)|
|Description:||Multiuser distributed database including support for replication and dynamic reconfiguration.|
|Storage:||Organized as a set of tables made up of Erlang records. Tables also have properties including type location, persistence, etc.|
|Notes:||Mnesia Reference manual|
|Description:||Scalable, high-performance, open source, schema-free, document-oriented database|
|Storage:||JSON-like data schemas, Dynamic queries, Indexing, replication, MapReduc|
|Notes:||MongoDB Documentation Index|
|Description:||Embedded, small footprint, disk based, transactional graph database written in Java. Dual license – free and commercial.|
|Storage:||Graph-oriented data model with nodes, relationships and properties.|
|API(s):||Java, Python, Ruby, Scala, Groovy, PHP, RESTful API.|
|Notes:||Neo4J Wiki, API, FAQ|
|Description:||Key/Value store with the dataset kept in memory and saved to disk asynchronously. “not just another key-value DB”|
|Storage:||Values can be strings, lists sets and sorted sets.|
|API(s):||Python, Ruby, PHP, Erlang, Lua, C, C#, Java, Scala, perl|
|Description:||Scalable Web Service providing data storage, query and indexing in Amazon’s cloud.|
|Storage:||Items (like rows of data), Attributes (like column headers), and Values (can be multiple values)|
|Notes:||SimpleDB FAQ, Getting Started Guide, Developer Guide, API|
|Description:||Library (written in C) of functions for managing files of key/value pairs. Multi-thread support.|
|Storage:||Keys and Values can have variable byte length. Binary data and strings can be used as a key and a value.|
|API(s):||C, perl, Ruby, Java, Lua.|
|Notes:||Tokyo Cabinet Specifications, presentation (PDF format). Also available: Tokyo Tyrant (remote service), Tokyo Distopia (full text search), Tokyo Promenade (content management).|
|Description:||“It is basically just a big, distributed, persistent, fault-tolerant hash table.” High performance and availability.|
|Storage:||Each key is unique to a store. Each key can have at most one value. Supported types: JSON, string, identity, protobuf, java-serialization.|
|API(s):||Java, C++, custom clients|
|Notes:||Project Voldemort Wiki, Client how-to|
It’s one thing to have lots of choices for non-relational databases. Building up a NoSQL knowledge and experience base will definitely help managers, architects and developers compare and contrast what they already know about relational databases. Relational databases and the SQL language are still the architecture and lingua franca for the design, development and management of database applications. While we are still at the beginning of the use of databases in cloud infrastructures, we can move forward faster because of all of the work and collaboration that is taking place. Depending on the user and business requirements, we can choose between existing relational database technologies or the NoSQL alternatives.
Stay tuned to the conclusion (part three) of this blog series where I’ll give you links to additional NoSQL Internet resources, past/upcoming events, offer some guidance for developers and highlight some of the comments and answer some of the questions from the first two parts.
If you are looking for more information about NoSQL and non-relational databases, check out the following sites, blog posts, and articles:
- No to SQL? Anti-database movement gains steam, Eric Lai, Computerworld
- Dynamo: Amazon’s “highly available key-value store.” Werner Voegel, Amazon CTO, from his blog post and team article.
- Google BigTable: “distributed storage system for managing structured data.” Google Labs home page and paper.
- “Death to Relational Databases“, a generic intro to NoSQL by Ben Scofield, CodeMash January 14, 2010.
- Scalable Transactions for Web Applications in the Cloud, by Wei Zhou, Pierre Guillaume and Chi Chi-Hung. Euro-Par 2009 conference (and the PDF paper).
- Is Microsoft Feeling the “NoSQL” Heat?, by David Ramel for Redmond Developer News.
- It’s not NoSQL, it’s post-relational by Adam Keys, software developer and writer, on The Real Adam blog, August 2009.
- The Future Is Big Data in theCloud, by Ping Li, Accel Partners.
- The Dark Side of NoSQL from the Code Monkeyism blog
- NoSQL Ecosystem by Jonathan Ellis, on the RackSpace cloud blog.
- NoSql Databases – Part 1 – Landscape by Vineet Gupta, GM Software Engineering at Directi Group, on his blog.
- NoSQL meetup groups around the world from meetup.com.
- nosql-databases.org – website that is “Your Ultimate Guide to the Non-Relational Universe!”
- nosql-discussion Google web discussion group
There are several upcoming and recent NoSQL conferences and meet ups with great information and solutions for architects and developers. Here are just a few:
- NoSQL Live, March 11, 2010. Boston, Massachusetts. Hosted by 10gen (provides commercial support for MongoDB).
- Glue Conference 2010 (Gluecon), May 26-27, Broomfield, Colorado.
- Scandinavian Web Developer Conference 2010, June 2-3, Stockholm Sweden.
- ICOODB 2010 – 3rd International Conference on Objects and Databases, September 28-30, 2010, Frankfurt/Main, Germany. Workshops: NoSQL Workshop & Meetup 28th Sept 2010.
- FOSDEM – http://nosqldevroom.pbworks.com/NoSQL-devroom-Talks
- Oakland California NOSQL meet up November – 2009. On the meet up web site there are several links to papers that were presented including: No SQL is a Horseless Carriage, Project Voldemort: What’s New, Cassandra in a nutshell, CouchDB, MarkLogic Server, JCR in 15 minutes.
It has been great to see the comments and suggestions that have been posted on my Computerworld blog and also on Digg. Thank you to everyone for joining the discussion about relational and non-relational database development. Here are excerpts from the many comments.
- Emil Eifrem (Neo4j) commented: “You talk about scaling to size and handling Facebook’s 100M user profiles. That’s an important use case and one that for example a key-value store handles brilliantly. But it turns out most companies aren’t Facebook. You can categorize the four emerging categories of NOSQL databases (key-value stores, column family stores, document dbs and graph databases) along the axes of scaling to size and scaling to complexity. For more information about that, see this blog post. Graph databases (like e.g. Neo4j, which I’m involved with, or Sones) excels at representing complex and rapidly evolving domain models and then traversing them with high performance.”
- Mongo-DB Developer commented: “We have seen the most common use case to date being use of nosql solutions as operational data store of web infrastructure projects. By operational, I mean, problems with real time writes and reads (contrast with data warehousing with bulk occasional loading). For these sort of problems these solutions work well and also fit well with agile development methods where the somewhat ‘schemaless’ (or more accurately, columnless) nature of some of the solutions, and the dynamically typed nature of the storage, really helps.”
- Peter R commented: “I have already seen, in the domain I work in, the movement away from straight up SQL databases. XML databases are one technology that will be stealing a lot of SQL’s thunder (if they haven’t already). Do I think SQL will ever die? No. But the key is that there will be/are more options that need to be thought about when designing a system now.”
- Anonymous commented: “I agree object databases have a purpose. They are great for large datasets that need to be replicated and called by a key. However SQL provides a very important capability and that it is to be able to query data across a number of datasets very efficiently, this will be very hard to duplicate in a simple key value database.”
- Johannes Ernst commented: “One of the difficulties for “normal” developers with many of the NoSQL technologies that you’ve described so far has been the learning curve and the additional work required: e.g. it’s easy and everybody knows how to put “every customer can place one or more orders” into a relational database, but what if the only thing you have is keys and opaque values? Compared to many other NoSQL alternatives, graph databases provide a high level of abstraction, freeing developers to concentrate on their application, while still bringing many of the same NoSQL benefits. For example, in InfoGrid (http://infogrid.org/), a project I’m involved in, you can define “Customer” and “Order” and their relationship, and the InfoGrid graph database takes care of storing and retrieving data and enforcing the relationship. In our experience, that makes graph databases much more approachable to developers than many other NoSQL technologies.”
- Database-ed commented: “The problem is that when folks think about storing information that they need to retrieve, they are so ingrained to SQL that they fail to think of other means. The Facebook example is a case in point. Who is ever going to ask for an accurate report of every user in Facebook? If you miss something the first time you go looking, you can always present it later. The end user doesn’t know you lost it, they assume it didn’t exist at the time and now it does. Yet you still need to store the data for easy retrieval. One problem with SQL is that it ties you into the relationships. Facebook is about letting people build the relationships based on the fields they want to build them on, not the ones you might think of. I know, it can be done within the confines of SQL, but it is a lot harder to do when the size gets large.”
- Raptor007 commented: “Some tasks that are poorly serviced by SQL may get switched over to a new method, but other implementations that are perfectly suited to SQL will continue using it. As they quoted Eric Evans in the article, “the whole point of seeking alternatives is that you need to solve a problem that relational databases are a bad fit for.”
- Miracle Blue commented: “While I highly doubt there’s going to be any significant migration away from SQL and the like any time soon, I think more web developers will start experimenting with data stores and other data solutions as we move further into the cloud.”
- TheUnGod commented: “And as companies turn to ask their SQL DBAs what they think of this, they’ll say “lets stick with SQL.” Honestly, there are so many people that support SQL right now that will not switch any time soon this article is just bogus. You can’t make a switch like that until people can support it properly.”
- SteelChicken commented: “Document centric is pretty dumb if you plan on doing any sort of analytics and data mining. Great for workflow and such.”
- Angusm commented: “The significance of the NoSQL movement is that it adds new tools that offer better solutions to specific problems. The future probably belongs to NoSQL in the sense of ‘not-only SQL’, rather than ‘no SQL’. Don’t imagine that NoSQL solutions offer a free lunch though. I had an educational experience when I changed a view definition in a CouchDB data store and my first trivial query took an hour to come back. CouchDB can be pleasingly fast when all its indexes are built, but if you have to rebuild those indexes from scratch … well, let’s just say that’s not something you want to do on a live client-facing site.”
- Afex2win commented: “digg is one of the bigger proponents of Cassandra, a distributed data store in the vein of which the article is talking about. http://about.digg.com/blog/looking-future-cassandra“
- Drmangrum commented: “SQL will be around for awhile. It’s good at doing what it was designed to do. However, there are many times when people use SQL simply because there is nothing better out there. As data complexity rises, a new method for accessing and persisting that data will have to be investigated. Part of the problem with many of the alternate solutions is that few people know how to use them.”
For years to come, most of us will still use relational databases and SQL. At the same time, I will continue to look for better ways to design and implement databases in my applications. As always, any project decision has to match with the user and business requirements. For new projects, I believe, we now have genuine non-relational alternatives on the table (pun intended).
Programming is Life!