CouchDB – sink or swim?

CouchDB – up a creek without a paddle? Image source: bobbyfeind on Flickr

Almost a year ago Apache CouchDB creator Damien Katz announced that he would no longer be contributing to the CouchDB document database project he had created, choosing instead to focus on the development of Couchbase Server 2.0, which united CouchDB with Membase Server.

While the abandonment of an open source project by the person who created it is by no means unprecedented, it is still unusual enough to warrant a look at what has happened to CouchDB in the year that followed.

Surviving or thriving?

The first point to make is that the survival of CouchDB following Katz’s departure was never in doubt, thanks to the fact that it is an Apache Foundation project. One of the benefits of the foundation model is that it doesn’t depend on a dominant developer or vendor to keep a project moving forward.

Although it briefly appeared that Cloudant would fulfil the role of the major corporate backer of CouchDB with its BigCouch clustered CouchDB technology after Couchbase discontinued its own CouchDB distribution, the company instead refocused its attention on its CouchDB- and BigCouch-based managed service.

While developers from both Couchbase and Cloudant continue to contribute to the project, Apache CouchDB doesn’t have a lead corporate backer, nor does it need one. According to factoids gathered by Ohloh, there were 30 contributors to the Apache CouchDB project in the past 12 months, up from 18 in the prior 12 months, placing CouchDB in the top 2% of all project teams on Ohloh.

The question is not whether CouchDB is surviving, however, but whether it is thriving. That increase in contributor count would suggest so, but that’s by no means the full story. In contrast, the number of commits per month has declined in the past 12 months, representing, as Ohloh describes it, “a substantial decrease in development activity”. As the related chart illustrates, in fact, activity has pretty much flatlined since the beginning of the year.

Source: Ohloh

This should not be altogether surprising since the latest release went GA in April.

In response to a request for comment, a spokesperson on behalf of the Apache CouchDB PMC stated:

“Despite an unsettled start to the year, the CouchDB project and the
surrounding community continue to grow and evolve, with the release of
1.2.0 earlier this year, and the forthcoming 1.3.0, currently being
prepared for release. 1.3.0 includes in the last year alone, over 221
commits on the just the master branch, comprising 167 files changed,
5745 insertions, 2248 deletions — solid progress for a project with
22,000 lines of code total.”

Additionally, while the start of that flatline coincides with Katz’s departure from the project, it is not clear that the two are actually related. Ohloh figures indicate that Katz hadn’t actually committed code to the project since August 2010 and is only the eighth all-time most active committer to the project.

It is clear that there is still a lot of activity ongoing in the Apache CouchDB community, with the PMC citing rcouch, bigcouch, PouchDB, TouchDB frameworks for both iOS and Android, a Mac OS X binary installation, and

The PMC spokesperson added:

“Structurally, the project has added both committers and grown the
project management committee, and has been having regular meetings
through the last 2 months to improve communication within the team,
and help steer the community. A roadmap has been put together, and
Ubuntu-style time-scheduled releases are planned for 2013 to keep the
good oil flowing.”

However, in assessing the health of Apache CouchDB, we must look at adoption trends, as well as project activity.

Waving or drowning?

Searching mailing list archives using MarkMail indicates that there has been a decline in the number of messages to the developer, user and commits mailing lists in the past 12 months, although with increased activity on the commits list since July.

Additionally, figures from Indeed.com suggest that job activity related to CouchDB saw a sharp decline in the early months of the year, although also a recovery in recent months.

couchdb Job Trends graph (source: Indeed.com)

However, that activity is perhaps best viewed in the context of a comparison with another major NoSQL project – MongoDB for instance – which reveals that CouchDB job postings have more or less levelled off since the start of the year.

couchdb, mongodb Job Trends graph (source: Indeed.com)

We have also been tracking the traction of NoSQL projects via searches of LinkedIn member profiles. The latest figures, due to be published later this week, show that mentions of CouchDB in LinkedIn member profiles grew over 139% between December 2011 and today.

That sounds good, but again must be viewed in the context of the rest of the NoSQL ecosystem. The statistics show that mentions of a selection of other major NoSQL databases grew significantly faster in the same period.

So what are we to make of all the evidence? Clearly the Apache CouchDB project will survive, and the lack of updates in 2012 is not a major concern, although the level of interest in the project is not growing as fast as interest in other NoSQL technologies. My personal gut feel is that Apache CouchDB has the potential to become the PostgreSQL of the NoSQL generation: a solid, mature project with a large community of developers and an ecosystem of associated vendors that is often overshadowed by more commercially oriented alternatives but has a loyal and committed user base.

Key to this comparison standing up to long-term scrutiny will be the ability of the Apache CouchDB project to increase and maintain the level of development so that the Lines of code chart, above, better resembles that of PostgreSQL, below:

The comparison with PostgreSQL is also apt given the departure from the project of its creator. While many people do know the origins of the PostgreSQL project given that the original project leader is one of the most famous database experts in the world, I am sure a lot of PostgreSQL users wouldn’t know or care whether the project’s creator continued to be involved. Similarly, Katz’s departure from Apache CouchDB, while undoubtedly a short-term challenge, appears not to have had a significant impact on the project’s ongoing development.

Necessity is the mother of NoSQL

As we noted last week, necessity is one of the six key factors that are driving the adoption of alternative data management technologies identified in our latest long format report, NoSQL, NewSQL and Beyond.

Necessity is particularly relevant when looking at the history of the NoSQL databases. While it is easy for incumbent database vendors to dismiss the various NoSQL projects as development playthings, it is clear that the vast majority of NoSQL projects were developed by companies and individuals in response to the fact that the existing database products and vendors were not able to meet their requirements with regard to the other five factors: scalability, performance, relaxed consistency, agility and intricacy.

The genesis of much – although by no means all – of the momentum behind the NoSQL database movement can be attributed to two research papers: Google’s Bigtable: A Distributed Storage System for Structured Data, presented at the Seventh Symposium on Operating Systems Design and Implementation, in November 2006, and Amazon’s Dynamo: Amazon’s Highly Available Key-Value Store, presented at the 21st ACM Symposium on Operating Systems Principles, in October 2007.

The importance of these two projects is highlighted by The NoSQL Family Tree, a graphic representation of the relationships between (most of) the various major NoSQL projects:

Not only were the existing database products and vendors not suitable to meet their requirements, but Google and Amazon, as well as the likes of Facebook, LinkedIn, Powerset and Zvents, could not rely on the incumbent vendors to develop anything suitable, given the vendors’ desire to protect their existing technologies and installed bases.

Werner Vogels, Amazon’s CTO, has explained that as far as Amazon was concerned, the database layer required to support the company’s various Web services was too critical to be trusted to anyone else – Amazon had to develop Dynamo itself.

Vogels also pointed out, however, that this situation is suboptimal. The fact that Facebook, LinkedIn, Google and Amazon have had to develop and support their own database infrastructure is not a healthy sign. In a perfect world, they would all have better things to do than focus on developing and managing database platforms.

That explains why the companies have also all chosen to share their projects. Google and Amazon did so through the publication of research papers, which enabled the likes of Powerset, Facebook, Zvents and LinkedIn to create their own implementations.

These implementations were then shared through the publication of source code, which has enabled the likes of Yahoo, Digg and Twitter to collaborate with each other and additional companies on their ongoing development.

Additionally, the NoSQL movement also boasts a significant number of developer-led projects initiated by individuals – in the tradition of open source – to scratch their own technology itches.

Examples include Apache CouchDB, originally created by the now-CTO of Couchbase, Damien Katz, to be an unstructured object store to support an RSS feed aggregator; and Redis, which was created by Salvatore Sanfilippo to support his real-time website analytics service.

We would also note that even some of the major vendor-led projects, such as those of Couchbase and 10gen, have been heavily influenced by non-vendor experience. 10gen was founded by former DoubleClick executives to create the software they felt was needed at the digital advertising firm, while online gaming firm Zynga was heavily involved in the development of the original Membase Server memcached-based key-value store (now Elastic Couchbase).

In this context it is interesting to note, therefore, that while the majority of NoSQL databases are open source, the NewSQL providers have largely chosen to avoid open source licensing, with VoltDB being the notable exception.

These NewSQL technologies are no less a child of necessity than NoSQL, although it is a vendor’s necessity to fill a gap in the market, rather than a user’s necessity to fill a gap in its own infrastructure. It will be intriguing to see whether the various other NewSQL vendors will turn to open source licensing in order to grow adoption and benefit from collaborative development.

NoSQL, NewSQL and Beyond is available now from both the Information Management and Open Source practices (non-clients can apply for trial access). I will also be presenting the findings at the forthcoming Open Source Business Conference.

How soon is now? Corporate contributions and open source innovation in the context of NoSQL

In my role as part of The 451 Group’s Information Management practice I have recently initiated coverage on the various “NoSQL” databases, which are providing a fresh challenge to conventional relational databases (clients can get a good introduction to our coverage here, while non-clients can also see some of my thinking aloud over at our Too Much Information blog).

The rise of the NoSQL movement is also highly relevant in the context of open source software, however, especially in relation to two key issues:

1/ The (lack of) corporate user contributions
2/ Open source as a source of innovation (as opposed to disruption)

NoSQL is very much a user-led phenomenon and has occurred as the likes of Google, Amazon, Facebook, LinkedIn and Twitter have created their own distributed data management technologies to overcome the fact that traditional database products were not able to match their performance and scalability requirements.

Not all NoSQL databases are the product of companies that we would traditionally think of as users rather than developers, and not all NoSQL databases are open source, but there are a large number of projects that fulfill both criteria, such as Apache Cassandra (which originated at Facebook), Apache HBase (Yahoo), Hypertable (Zvents), Voldemort (LinkedIn) and FlockDB (Twitter).

Meanwhile there are a number of vendors and projects focused on adding persistence, replication, index and query capabilities to memcached, which was originally created by Danga Interactive to solve its database scalability issues.

This is also (mostly) not a matter of businesses creating projects in house and then simply throwing the code over the wall. At last week’s NoSQL EU event in London, Twitter’s analytics lead, Kevin Weil, discussed how Twitter is working with Digg to create real-time analytics for Cassandra. Kevin also recently Tweeted (naturally enough) about Hadoop-LZO, a project to bring splittable LZO compression to Hadoop, on which Twitter is collaborating with Cloudera and Facebook.

There are plenty of other examples of contributions being made by Twitter, Facebook, Digg and LinkedIn on their own open source pages, but in many ways the biggest thing here is not the individual contributions but the commitment to the overall culture of contribution and collaboration.

It is often said that open source developers begin by scratching their own itch, and that is most definitely true when we look at the motivations behind the creation of projects by the companies above, but there is also a culture and clear understanding that there is much to gain from collaboration.

The NoSQL technologies also undermine the suggestion that while open source can be used to commoditize established markets it is not good at innovation. While the likes of Cassandra and Voldemort – not to mention Neo4J, Redis, CouchDB, Riak and MongoDB – are undoubtedly operating within a larger established market, the longer we look at NoSQL the clearer it is that, far from commoditizing an established market, these technologies are being used to innovate beyond the realms of the established relational database and establish new database market segments.

