NoSQL ≠ open source

I thought we finished with trying to define NoSQL in 2010 but Martin Fowler has raised the question again with his recent post – although he has a good reason to do so since he is collaborating on a book on the subject.

Fowler’s list of common characteristics (which he acknowledges is not definitional) is as follows:

  • Not using the relational model (nor the SQL language)
  • Open source
  • Designed to run on large clusters
  • Based on the needs of 21st century web properties
  • No schema, allowing fields to be added to any record without controls
    You could argue about whether all NoSQL databases are designed to run on large clusters, but the characteristic from the list above that I would dispute is open source.

    While it is undoubtedly true to say that most NoSQL databases are open source, I don’t believe it defines them in the same way that other common characteristics do.

    The main argument for making open source licensing a requirement of NoSQL seems to me to be historical. The first NoSQL meeting, cited by Fowler, specified that it was about “open source, distributed, non-relational databases”.

    However, making open source licensing a defining characteristic of NoSQL would also exclude a number of products that would otherwise clearly fit the definition of NoSQL, as well as projects such as Google’s BigTable and Amazon’s Dynamo which were the genesis of much – although by no means all – of the momentum behind the NoSQL database movement.

    For the sake of argument let’s assume Amazon decided to release a version of Dynamo that could be deployed on-premise and for whatever reason decided not to release “Dynamo-on-premise” under an open source license.

    Is anyone seriously going to argue that a closed source “Dynamo-on-premise” wouldn’t be a NoSQL database?

    For what it’s worth, since our NoSQL, NewSQL and Beyond report, the description of NoSQL I have been using is:

  • A new breed of non-relational database products
  • sharing a rejection of fixed table schema and join operations
  • designed to meet scalability requirements of distributed architectures
  • and/or schema-less data management requirements
    Although, like Fowler, I would not claim this to be a definition.

    The Data Day, Today: Jan 10 2012

    Oracle OEMs Cloudera. The future of Apache CouchDB. And more.

    An occasional series of data-related news, views and links posts on Too Much Information. You can also follow the series @thedataday.

    * Oracle announced the general availability of Big Data Appliance, and an OEM agreement with Cloudera for CDH and Cloudera Manager.

    * The Future of Apache CouchDB Cloudant confirms intention to integrate the core capabilities of BigCouch into Apache CouchDB.

    * Reinforcing Couchbase’s Commitment to Open Source and CouchDB Couchbase CEO Bob Wiederhold attempts to clear up any confusion.

    * Hortonworks Appoints Shaun Connolly to Vice President of Corporate Strategy Former vice president of product strategy at VMware.

    * Splunk even more data with 4.3 Introducing the latest Splunk release.

    * Announcement of Percona XtraDB Cluster (alpha release) Based on Galera.

    * Bringing Value of Big Data to Business: SAP’s Integrated Strategy Forbes interview with Sanjay Poonen, President and corporate officer of SAP Global Solutions.

    * New Release of Oracle Database Firewall Extends Support to MySQL and Enhances Reporting Capabilities Self-explanatory.

    * Big data and the disruption curve “Many efforts are being funded by business units and not the IT department and money is increasingly being diverted from large enterprise vendors.”

    * Get your SQL Server database ready for SQL Azure Microsoft “codename” SQL Azure Compatibility Assessment.

    * An update on Apache Hadoop 1.0 Cloudera’s Charles Zedlewski helpfully explains Apache Hadoop branch numbering.

    * Xeround and the CAP Theorem So where does Xeround fit in the CAP Theorem?

    * Can Yahoo’s new CEO Thompson harness big data, analytics? Larry Dignan thinks Scott Thompson might just be the right guy for the job.

    * US Companies Face Big Hurdles in ‘Big Data’ Use “21% of respondents were unsure how to best define Big Data”

    * Schedule Your Agenda for 2012 NoSQL Events Alex Popescu updates his list of the year’s key NoSQL events.

    * DataStax take Apache Cassandra Mainstream in 2011; Poised for Growth and Innovation in 2012 The usual momentum round-up from DataStax.

    * Objectivity claimed significant growth in adoption of its graph database, InfiniteGraph and flagship object database, Objectivity/DB.

    * Cloudera Connector for Teradata 1.0.0 Self-explanatory.

    * For 451 Research clients

    # SAS delivers in-memory analytics for Teradata and Greenplum Market Development report

    # With $84m in funding, Opera sets out predictive-analytics plans Market Development report

    * Google News Search outlier of the day: First Dagger Fencing Competition in the World Scheduled for January 14, 2012

    And that’s the Data Day, today.

    How to provide a strongly consistent distributed database and not break CAP Theorem

    In the months since we coined the term NewSQL we have come to define it as referring to a new breed of relational database products designed to meet scalability requirements of distributed architectures, or improve performance so horizontal scalability is no longer a necessity, while maintaining support for SQL and ACID.

    During the recent round of NoSQL Road Show events it has emerged that this description could be taken to suggest that NewSQL products are able to provide consistency, availability and partition tolerance and therefore contravene the common understanding of CAP Theorem that “a distributed system can satisfy any two of these guarantees at the same time, but not all three.”

    How is it possible to provide strongly consistent distributed systems and not break CAP Theorem?

    For a start, CAP Theorem is not that simple. As others have pointed out – Cloudera’s Henry Robinson for example – CAP Theorem isn’t simply a case of “consistency, availability, partition tolerance. Pick two.”

    We know this to be the case, since while Amazon’s Dynamo (and the many NoSQL databases it has inspired) sacrifices consistency for availability, it does so with eventual consistency, not the total absence of consistency. Clearly it is possible to have systems that are partition tolerant, highly available and offer *a degree of consistency*.

    Partition tolerance is not something that can be relaxed in the same manner – in fact the proof of CAP Theorem relies on an assumption of partition tolerance. As Yammer engineer Coda Hale explains: “Partition Tolerance is mandatory in distributed systems. You cannot not choose it.”

    Similarly, Daniel Abadi has previously explained how CAP is not really about choosing two of three states, but about answering the question “if there is a partition, does the system give up availability or consistency?”

    Just as systems that sacrifice consistency retain a degree of consistency, Daniel also makes the point that systems that give up availability also do not do so in totality, noting that “availability is only sacrificed when there is a network partition.”
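
    Abadi's framing of the question can be illustrated with a toy model. The sketch below is only an illustration under simplifying assumptions: the `Mode` enum, the `handle_write` function and the returned strings are hypothetical names invented here, not part of any real database.

```python
from enum import Enum


class Mode(Enum):
    CP = "prefer consistency"   # refuse requests that cannot be coordinated
    AP = "prefer availability"  # keep serving, reconcile replicas later


def handle_write(mode: Mode, partitioned: bool) -> str:
    """Return what a toy replica does with a write request."""
    if not partitioned:
        # Normal operation: both kinds of system accept the write.
        return "accepted"
    if mode is Mode.CP:
        # The other side is unreachable, so availability is given up.
        return "rejected"
    # The write is taken locally and immediate consistency is given up.
    return "accepted, reconcile later"


for mode in Mode:
    for partitioned in (False, True):
        state = "partitioned" if partitioned else "healthy"
        print(mode.name, state, "->", handle_write(mode, partitioned))
```

    The only branch in which the two modes differ is the partitioned one, which is the point of the quote: availability is given up only while a partition exists.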

    As such, Daniel makes the point that the roles of consistency and availability in CAP are asymmetric, and that latency is the forgotten factor that re-balances the equation.

    Daniel has also returned to the issue of the tradeoff between latency and consistency in a more recent post, noting that, unlike availability vs consistency, “the latency vs. consistency tradeoff is present even during normal operations of the system.”

    The Apache Cassandra wiki actually makes this point very well:

    “The CAP theorem… states that you have to pick two of Consistency, Availability, Partition tolerance: You can’t have the three at the same time and get an acceptable latency. Cassandra values Availability and Partitioning tolerance (AP). Tradeoffs between consistency and latency are tunable in Cassandra. You can get strong consistency with Cassandra (with an increased latency).”

    This suggests that you can, in fact, have consistency, partition tolerance and availability at the same time, but that latency will suffer. ScaleDB’s Mike Hogan made that argument earlier this year in describing the ‘CAP event horizon’ – “the point at which latency for a clustered system exceeds that which is acceptable and then you must decide what concessions you are willing to make”.
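
    The tunable tradeoff described in the Cassandra wiki quote can be sketched with the usual quorum-overlap rule: with N replicas, a read that waits for R replicas is guaranteed to see a write acknowledged by W replicas when R + W > N. The code below is a simplified model, not Cassandra's implementation; the function names and the per-replica latencies are hypothetical values chosen for illustration.

```python
def overlaps(n: int, r: int, w: int) -> bool:
    """True when every read quorum intersects every write quorum (R + W > N)."""
    return r + w > n


def read_latency_ms(replica_latencies_ms: list[float], r: int) -> float:
    """A read that waits for r replicas finishes when the r-th fastest replies."""
    return sorted(replica_latencies_ms)[r - 1]


N = 3
latencies = [5.0, 12.0, 40.0]  # hypothetical per-replica response times

for r, w in [(1, 1), (2, 2), (3, 1)]:
    print(
        f"R={r} W={w}: overlap={overlaps(N, r, w)}, "
        f"read waits {read_latency_ms(latencies, r)} ms"
    )
```

    In this model, raising R buys the overlap guarantee at the cost of waiting for slower replicas, which is the consistency versus latency tradeoff in miniature.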

    NewSQL databases are not designed to avoid that CAP event horizon by being as available as eventually consistent systems – that *would* break CAP Theorem – but arguably they are designed to delay that CAP event horizon as much as possible by delivering systems that, in the event of a partition, are highly consistent and offer *a degree of availability*.

    Whether that degree of availability is suitable for your application will depend on your tolerance – not for partitions but for latency.

    The geographic distribution of NoSQL skills – just one more thing

    Hidden away amongst the details of our little tour around LinkedIn statistics on NoSQL and Hadoop skills was some interesting information on how many LinkedIn members list the various data management technologies in our sample in their profiles.

    Our original post contained the fact that there were 9,079 LinkedIn members with “Hadoop” in their member profiles, for example, compared to 366,084 with “MySQL” in their member profiles.

    Later posts showed there were 170 with “Membase” and 1,687 with “HBase”, 787 with “Apache Cassandra” and 376 with “Riak”, 6,048 with “MongoDB” and 2,152 with “Redis”, and finally, 1,844 with “CouchDB” and 268 with “Neo4j”.

    This gives us an interesting perspective on the relative adoption of the various NoSQL databases:

    If it wasn’t already obvious from the list above, the chart illustrates just how much more prevalent MongoDB skills are compared to the other NoSQL databases, followed by Redis, Apache CouchDB, Apache HBase and Apache Cassandra. The chart also illustrates that while HBase is the second most prevalent NoSQL skill set in the USA, it is only fourth overall given its lower prevalence in the rest of the world.
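
    For anyone who wants to reproduce the ordering, the short sketch below ranks the profile counts quoted above. The variable names are my own, and the counts are the LinkedIn search results from this series, so treat the output as a restatement of those figures rather than new data.

```python
# LinkedIn profile counts quoted in this series (search results as of December 1).
profile_counts = {
    "Membase": 170,
    "HBase": 1687,
    "Apache Cassandra": 787,
    "Riak": 376,
    "MongoDB": 6048,
    "Redis": 2152,
    "CouchDB": 1844,
    "Neo4j": 268,
}

# Sort from most to least frequently listed.
ranked = sorted(profile_counts.items(), key=lambda item: item[1], reverse=True)
top_count = ranked[0][1]

for name, count in ranked:
    # Express each result relative to the most common skill.
    print(f"{name:<17}{count:>6}  {count / top_count:6.1%} of the top result")
```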

    In response, a representative from a certain vendor notes “Some skills are more valued not because they are more prevalent, but because they are harder to achieve.” Make of that what you will.

    The geographic distribution of NoSQL skills: CouchDB and Neo4j

    Following last week’s post putting the geographic distribution of Hadoop skills, based on a search of LinkedIn members, in context, this week we will be publishing a series of posts looking in detail at the various NoSQL projects.

    The posts examine the geographic spread of LinkedIn members citing a specific NoSQL database in their member profiles, as of December 1, and provide an interesting illustration of the state of adoption for each.

    We’ve already taken a look at Membase and HBase; Apache Cassandra and Riak; and 10gen’s MongoDB and Redis.

    Part four brings the series to a close with a look at Apache CouchDB and Neo4j, which boast the most geographically diverse adoption of the NoSQL databases in our sample.

    The statistics showed that 36.4% of the 1,844 LinkedIn members with “CouchDB” in their member profiles are based in the US, while only 8.9% are in the Bay area, the lowest proportion of any of the NoSQL databases we looked at.

    The results also indicate that the UK is a particularly strong area for CouchDB skills, with 7.1%. Other hot-spots include Canada (4.1%), Germany (4.0%) and The Netherlands (3.1%).

    Neo4j is even more widely adopted, with only 36.2% of the 268 LinkedIn members with “Neo4j” in their member profiles based in the US, although 10.4% are in the Bay area.

    With 4.1%, Sweden is a hot-spot for Neo4j skills, as one might expect given that’s where it and Neo Technology originated. The UK is also strong with 9.7%, followed by India with 5.6% and the New York area with 4.9%.

    Since Neo4j originated in Europe it is of course an open question whether its higher adoption in the Rest of the World than the US is a sign of a greater spread of adoption, or a relative failure to infiltrate the US market. Given that the company already has an active presence in the US we are inclined towards the former.

    N.B. The size of the boxes is in proportion to the search result. World map image: Owen Blacker

    The geographic distribution of NoSQL skills: MongoDB and Redis

    Following last week’s post putting the geographic distribution of Hadoop skills, based on a search of LinkedIn members, in context, this week we will be publishing a series of posts looking in detail at the various NoSQL projects.

    The posts examine the geographic spread of LinkedIn members citing a specific NoSQL database in their member profiles, as of December 1, and provide an interesting illustration of the state of adoption for each.

    We’ve already taken a look at Membase and HBase, and Apache Cassandra and Riak. Part three examines the geographic spread of 10gen’s MongoDB and Redis.

    The statistics showed that 41.0% of the 6,048 LinkedIn members with “MongoDB” in their member profiles are based in the US, putting MongoDB in the top half of the table for geographic spread.

    Only 11.2% are in the Bay area, fewer than Hadoop, Membase, HBase, Cassandra, Riak and Redis. The results also indicate that the New York area is a hot-spot for MongoDB skills, with 6.2% – as one might expect given the location of 10gen’s HQ. Other hot-spots include Brazil (4.2%) and Ukraine (2.8%).

    Redis is even more widely adopted, with only 37% of the 2,152 LinkedIn members with “Redis” in their member profiles based in the US, although 12.0% are in the Bay area.

    Ukraine is also a hot-spot for Redis skills (3.8%), as are France (3.6%) and Spain (2.9%).

    The series will conclude later this week with CouchDB and Neo4j.

    N.B. The size of the boxes is in proportion to the search result. World map image: Owen Blacker

    The geographic distribution of NoSQL skills: Apache Cassandra and Riak

    Following last week’s post putting the geographic distribution of Hadoop skills, based on a search of LinkedIn members, in context, this week we will be publishing a series of posts looking in detail at the various NoSQL projects.

    The posts examine the geographic spread of LinkedIn members citing a specific NoSQL database in their member profiles, as of December 1, and provide an interesting illustration of the state of adoption for each.

    Following yesterday’s look at Membase and HBase, part two examines the geographic spread of Apache Cassandra and Basho Technologies’ Riak.

    The statistics showed that 52.2% of the 787 LinkedIn members with “Apache Cassandra” in their member profiles are based in the US (as previously explained, we had to use the ‘Apache’ qualifier with Cassandra to filter out people with the name Cassandra).

    A significant proportion (18.0%) of those are in the Bay area, although fewer than Hadoop, Membase and HBase. The results also indicate that Canada is a hot-spot for Apache Cassandra skills, with 4.1%, while Apache Cassandra is also making in-roads into Europe via France and Spain.

    Basho’s Riak is less dependent on the USA for adoption. The statistics showed that less than half – 45.5% – of the 376 LinkedIn members with “Riak” in their member profiles are based in the US, with only 13.0% in the Bay area.

    Riak hot-spots include the UK (6.9%) and Australia (4.3%), as well as the Boston area, in keeping with the company’s HQ.

    The series will continue later this week with MongoDB, CouchDB, Neo4j, and Redis.

    N.B. The size of the boxes is in proportion to the search result. World map image: Owen Blacker

    The geographic distribution of NoSQL skills: HBase and Membase

    Following last week’s post putting the geographic distribution of Hadoop skills, based on a search of LinkedIn members, in context, this week we will be publishing a series of posts looking in detail at the various NoSQL projects.

    The posts examine the geographic spread of LinkedIn members citing a specific NoSQL database in their member profiles, as of December 1, and provide an interesting illustration of the state of adoption for each.

    We begin this week’s series with Membase and HBase, the two projects that proved, like Apache Hadoop, to have significantly greater adoption in the USA compared to the rest of the world.

    The statistics showed that 58.2% of the 170 LinkedIn members with “Membase” in their member profiles are based in the US (as previously explained, we tried the same search with Couchbase, but with only 85 results we decided to use the Membase result set as it was more statistically relevant).

    As with Hadoop, a significant proportion (27.1%) of those are in the Bay area, the highest proportion of all the NoSQL databases we looked at. The results also indicate that Ukraine is a hot-spot for Membase skills, with 3.5%, while Membase adoption is lower in the UK (2.4%) than for other NoSQL databases.

    It should not be a great surprise that Apache HBase returned similar results to Apache Hadoop. The top eight individual regions for HBase were exactly the same as for Hadoop, although the UK (3.4%) is stronger for HBase, as is India (10.7%).

    The statistics showed that 57.0% of the 1,687 LinkedIn members with “HBase” in their member profiles are based in the US, with 25.0% in the Bay area (the third highest in our sample behind Hadoop and Membase).

    The series will continue later this week with MongoDB, Riak, CouchDB, Apache Cassandra, Neo4j, and Redis.

    N.B. The size of the boxes is in proportion to the search result. World map image: Owen Blacker

    The geographic distribution of Hadoop skills: in context

    NC State University’s Institute for Advanced Analytics recently published some interesting statistics on Apache Hadoop adoption based on a search of LinkedIn data.

    The statistics graphically illustrate what a lot of people were already pretty sure of: that the geographic distribution of Hadoop skills (and presumably therefore adoption) is heavily weighted in favour of the USA, and in particular the San Francisco Bay Area.

    The statistics showed that 64% of the 9,079 LinkedIn members with “Hadoop” in their member profiles (by no means perfect but an insightful measure nonetheless) are based in the US, and that a large proportion of those are in the Bay Area.

    The results are what we would expect to see given the relative level of immaturity of Apache Hadoop adoption, as well as the nature and location of the early Hadoop adopters and Hadoop-related vendors.

    The results got me thinking two things:
    - how does the geographic spread compare to a more maturely adopted project?
    - how does it compare to the various NoSQL projects?

    So I did some searching of LinkedIn to find out.

    To answer the first question I performed the same search for MySQL, as an example of a mature, widely-adopted open source project.

    The results show that just 32% of the 366,084 LinkedIn members with “MySQL” in their member profiles are based in the US (precisely half that of Hadoop) while only 4.4% are in the Bay area, compared to 28.2% of the 9,079 LinkedIn members with “Hadoop” in their member profiles.
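
    To put those percentages in perspective, the sketch below converts them into approximate headcounts. The helper name `approx_members` is my own, and because the quoted percentages are rounded the results are estimates rather than exact LinkedIn counts.

```python
def approx_members(total: int, share_pct: float) -> int:
    """Estimate a headcount from a total and a (rounded) percentage share."""
    return round(total * share_pct / 100)


hadoop_total, mysql_total = 9_079, 366_084

hadoop_us = approx_members(hadoop_total, 64)
mysql_us = approx_members(mysql_total, 32)
hadoop_bay = approx_members(hadoop_total, 28.2)
mysql_bay = approx_members(mysql_total, 4.4)

print(f"Hadoop: ~{hadoop_us:,} in the US, ~{hadoop_bay:,} in the Bay area")
print(f"MySQL:  ~{mysql_us:,} in the US, ~{mysql_bay:,} in the Bay area")
```

    The estimates are a reminder that the comparison is about proportions: in absolute terms there are still several times more MySQL profiles than Hadoop profiles in the Bay area.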

    The charts below illustrate the difference in geographic distribution between Hadoop and MySQL. The size of the boxes is in proportion to the search result.

    With regards to the second question, I also ran searches for MongoDB, Riak, CouchDB, Apache Cassandra*, Membase*, Neo4j, HBase, and Redis.

    I’ll be posting the results for each of those over the next week or so, but in the meantime, the graphic below shows the split between the USA and Rest of the World (ROW) for all ten projects.

    It illustrates, as I suspected, that the distribution of skills for NoSQL databases is more geographically dispersed than for Hadoop.
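
    The USA versus Rest of the World split can be tabulated from the US shares quoted across this series. This is a minimal sketch: the dictionary name and layout are my own, and the figures are simply the percentages reported in the individual posts.

```python
# Share of LinkedIn members based in the US, in percent, as quoted in this series.
us_share_pct = {
    "Hadoop": 64.0,
    "MySQL": 32.0,
    "Membase": 58.2,
    "HBase": 57.0,
    "Apache Cassandra": 52.2,
    "Riak": 45.5,
    "MongoDB": 41.0,
    "Redis": 37.0,
    "CouchDB": 36.4,
    "Neo4j": 36.2,
}

# List projects from most to least US-centric, with the ROW remainder.
for name, us in sorted(us_share_pct.items(), key=lambda kv: kv[1], reverse=True):
    row = round(100 - us, 1)
    print(f"{name:<17} USA {us:5.1f}%   ROW {row:5.1f}%")

# The NoSQL databases are everything except the two reference projects.
nosql = {k: v for k, v in us_share_pct.items() if k not in ("Hadoop", "MySQL")}
print("Every NoSQL database is less US-centric than Hadoop:",
      all(v < us_share_pct["Hadoop"] for v in nosql.values()))
```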

    I have some theories as to why that is – but I’d love to hear anyone else’s take on the results.

    *I had to use the ‘Apache’ qualifier with Cassandra to filter out anyone called Cassandra, while Membase returned a more statistically relevant result than Couchbase.

    World map image: Owen Blacker

    Forthcoming webinar: Real Enterprise NoSQL Applications

    On Wednesday, December 7, 2011 at 10am PT (6pm GMT) I’ll be taking part in a webinar with DataStax CTO and Apache Cassandra project chair Jonathan Ellis on the subject of Apache Cassandra: Real NoSQL Applications in the Enterprise Today.

    The session will shed light on real-world use cases for NoSQL databases by providing case studies from enterprise production users taking advantage of the massively scalable and highly-available architecture of Apache Cassandra.

    I’ll be summarising some of the findings from our NoSQL, NewSQL and Beyond research report, and exploring the drivers behind the development and adoption of NoSQL databases – explaining how the failure of existing suppliers to meet the performance, scalability and flexibility needs of large-scale data processing has led to the development and adoption of alternative data management technologies.

    Jonathan will provide more detail on Apache Cassandra and DataStax, and discuss a number of real-world projects, including Netflix, Backupify, Ooyala and Constant Contact.

    You can register for the event here and find more details about our NoSQL, NewSQL and Beyond research report here.