December 2nd, 2011 — Data management
NC State University’s Institute for Advanced Analytics recently published some interesting statistics on Apache Hadoop adoption based on a search of LinkedIn data.
The statistics graphically illustrate what a lot of people wer already pretty sure of: that the geographic distribution of Hadoop skills (and presumably therefore adoption) is heavily weighted in favour of the USA, and in particular the San Francisco Bay Area.
The statistics showed that 64% of the 9,079 LinkedIn members with “Hadoop” in their member profiles (by no means perfect but an insightful measure nonetheless) are based in the US, and that the vast majority of those are in the Bay Area.
The results are what we would expect to see given the relative level of immaturity of Apache Hadoop adoption, as well as the nature and location of the early Hadoop adopters and Hadoop-related vendors.
The results got me thinking two things:
- how does the geographic spread compare to a more maturely adopted project?
- how does it compare to the various NoSQL projects?
So I did some searching of LinkedIn to find out.
To answer the first question I performed the same search for MySQL, as an example of a mature, widely-adopted open source project.
The results show that just 32% of the 366,084 LinkedIn members with “MySQL” in their member profiles are based in the US (precisely half that of Hadoop) while only 4.4% are in the Bay area, compared to 28.2% of the 9,079 LinkedIn members with “Hadoop” in their member profiles.
The charts below illustrate the difference in geographic distribution between Hadoop and MySQL. The size of the boxes is in proportion to the search result (click each image for a larger version).
With regards to the second question, I also ran searches for MongoDB, Riak, CouchDB, Apache Cassandra*, Membase*, Neo4j, Hbase, and Redis.
I’ll be posting the results for each of those over the next week or so, but in the meantime, the graphic below shows the split between the USA and Rest of the World (ROW) for all ten projects.
It illustrates, as I suspected, that the distribution of skills for NoSQL databases is more geographically disperse than for Hadoop.
I have some theories as to why that is – but I’d love to hear anyone else’s take on the results.
*I had to use the ‘Apache’ qualifier with Cassandra to filer out anyone called Cassandra, while Membase returned a more statistically relevant result than Couchbase.
World map image: Owen Blacker
April 20th, 2011 — Data management
As we noted last week, necessity is one of the six key factors that are driving the adoption of alternative data management technologies identified in our latest long format report, NoSQL, NewSQL and Beyond.
Necessity is particularly relevant when looking at the history of the NoSQL databases. While it is easy for the incumbent database vendor to dismiss the various NoSQL projects as development playthings, it is clear that the vast majority of NoSQL projects were developed by companies and individuals in response to the fact that the existing database products and vendors were not suitable to meet their requirements with regards to the other five factors: scalability, performance, relaxed consistency, agility and intricacy.
The genesis of much – although by no means all – of the momentum behind the NoSQL database movement can be attributed to two research papers: Google’s BigTable: A Distributed Storage System for Structured Data, presented at the Seventh Symposium on Operating System Design and Implementation, in November 2006, and Amazon’s Dynamo: Amazon’s Highly Available Key-Value Store, presented at the 21st ACM Symposium on Operating Systems Principles, in October 2007.
The importance of these two projects is highlighted by The NoSQL Family Tree, a graphic representation of the relationships between (most of) the various major NoSQL projects:
Not only were the existing database products and vendors were not suitable to meet their requirements, but Google and Amazon, as well as the likes of Facebook, LinkedIn, PowerSet and Zvents, could not rely on the incumbent vendors to develop anything suitable, given the vendors’ desire to protect their existing technologies and installed bases.
Werner Vogels, Amazon’s CTO, has explained that as far as Amazon was concerned, the database layer required to support the company’s various Web services was too critical to be trusted to anyone else – Amazon had to develop Dynamo itself.
Vogels also pointed out, however, that this situation is suboptimal. The fact that Facebook, LinkedIn, Google and Amazon have had to develop and support their own database infrastructure is not a healthy sign. In a perfect world, they would all have better things to do than focus on developing and managing database platforms.
That explains why the companies have also all chosen to share their projects. Google and Amazon did so through the publication of research papers, which enabled the likes of Powerset, Facebook, Zvents and Linkedin to create their own implementations.
These implementations were then shared through the publication of source code, which has enabled the likes of Yahoo, Digg and Twitter to collaborate with each other and additional companies on their ongoing development.
Additionally, the NoSQL movement also boasts a significant number of developer-led projects initiated by individuals – in the tradition of open source – to scratch their own technology itches.
Examples include Apache CouchDB, originally created by the now-CTO of Couchbase, Damien Katz, to be an unstructured object store to support an RSS feed aggregator; and Redis, which was created by Salvatore Sanfilippo to support his real-time website analytics service.
We would also note that even some of the major vendor-led projects, such as Couchbase and 10gen, have been heavily influenced by non-vendor experience. 10gen was founded by former Doubleclick executives to create the software they felt was needed at the digital advertising firm, while online gaming firm Zynga was heavily involved in the development of the original Membase Server memcached-based key-value store (now Elastic Couchbase).
In this context it is interesting to note, therefore, that while the majority of NoSQL databases are open source, the NewSQL providers have largely chosen to avoid open source licensing, with VoltDB being the notable exception.
These NewSQL technologies are no less a child of necessity than NoSQL, although it is a vendor’s necessity to fill a gap in the market, rather than a user’s necessity to fill a gap in its own infrastructure. It will be intriguing to see whether the various other NewSQL vendors will turn to open source licensing in order to grow adoption and benefit from collaborative development.
NoSQL, NewSQL and Beyond is available now from both the Information Management and Open Source practices (non-clients can apply for trial access). I will also be presenting the findings at the forthcoming Open Source Business Conference.
February 8th, 2011 — Data management, M&A
The predicted consolidation of the NoSQL database landscape has begun. Membase and CouchOne have announced that they are merging to form Couchbase.
And in more interesting NoSQL news, Danish IT company Trifork has announced that it has acquired an 8% stake in Basho as part of the NoSQL vendor’s $7.4m series D round, and has become the European distributor for Riak.
The formation of Couchbase brings together to of the leading companies in the NoSQL space, and the complementary nature of the their technology and business plans highlights that the term NoSQL has been applied to many different database technologies which are being adopted for different reasons.
While Membase had focused on improving the performance of distributed applications through its Membase Server distributed database, CouchOne focused on developer interest in flexible document data stores and mobile applications, rather than performance at scale.
Additionally while Membase was focused on operational adoption with a small (albeit significant) developer community, the priority with CouchOne has been on growing adoption of Apache CouchDB, with commercial efforts only recently becoming the focus of attention.
The technology is also complementary. Couchbase will combine the Membase and CouchDB projects to form a new distributed document store project of the same name that combines the caching and clustering technology of Membase with the CouchDB document data store.
The result will be a new distributed document database covering a variety of use cases from mobile applications (Mobile Couchbase) to scalable clusters (Elastic Couchbase), with synchronization of data between the various Couchbase implementations enabled by CouchSync.
The merged company will be led by Bob Weiderhold, formerly CEO of Membase, while Damien Katz, formerly CEO of CouchOne and creator of the CouchDB database, becomes CTO.
Couchbase is claiming more than 200 customers, which would indicate phenomenal growth for both companies since the launch of their CouchOne Mobile and Membase Server products in September and October 2010 respectively.
Prior to the launch of those products they previously claimed just a handful of customers each, although CouchOne had signed up thousands of users to its free hosted services, so it had a large and willing audience ready for conversion.
Additionally the company claims millions of combined users since CouchDB has been included in every installation of the Ubuntu Linux distribution since late 2009 and Heroku (now part of Salesforce.com) offers a Membase-driven service to thousands of its hosting customers.
We previously predicted that we would see the NoSQL market both consolidate and proliferate this year, and it is worth noting that the merger of CouchOne and Membase will not result in a similar consolidation of open source projects.
While Couchbase.org can be expected to replace membase.org over time, the Couchbase project will be independent of the Apache CouchDB, which will not be impacted by the merger. Couchbase will continue to contribute to both CouchDB and also the memcached project.
While we’re on the subject of NoSQL, it is also interesting to see that Danish IT vendor Trifork has not only signed up to be European distributor of the Riak database, but has also taken a stake in Basho Technologies.
Trifork has acquired newly issued shares in Basho representing 8.35% of the company as part of its series D round, with an option to acquire an additional 3.96% at the end of Q1 2011.
February 25th, 2010 — Data management
As a company, The 451 Group has built its reputation on taking a lead in covering disruptive technologies and vendors. Even so, with a movement as hyped as NoSQL databases, it sometimes pays to be cautious.
In my role covering data management technologies for The 451 Group’s Information Management practice I have been keeping an eye on the NoSQL database movement for some time, taking the time to understand the nuances of the various technologies involved and their potential enterprise applicability.
That watching brief has now spilled over into official coverage, following our recent assessment of 10gen. I also recently had the chance to meet up with Couchio’s VP of business development, Nitin Borwankar (see coverage initiation of Couchio). I’ve also caught up with Basho Technologies sooner rather than later. A report on that is now imminent.
There are a couple of reasons why I have formally began covering the NoSQL databases. The first is the maturing of the technologies, and the vendors behind them, to the point where they can be considered for enterprise-level adoption. The second is the demand we are getting from our clients to provide our view of the NoSQL space and its players.
This is coming both from the investment community and from existing vendors, either looking for potential partnerships or fearing potential competition. The number of queries we have been getting related to NoSQL and big data have encouraged articulation of my thoughts, so look-out for a two-part spotlight on the implications for the operational and analytical database markets in the coming weeks.
The biggest reason, however, is the recognition that the NoSQL movement is a user-led phenomena. There is an enormous amount of hype surrounding NoSQL but for the most part it is not coming from vendors like 10gen, Couchio and Basho (although they may not be actively discouraging it) but from technology users.
A quick look at the most prominent key-value and column-table NoSQL data stores highlights this. Many of these have been created by user organizations themselves in order fill a void and overcome the limitations of traditional relational databases – for example Google (BigTable), Yahoo (Hbase), Zvents (Hypertable), LinkedIn (Voldemort), Amazon (Dynamo), and Facebook (Cassandra).
It has become clear that traditional database technologies do need meet the scalability and performance requirements of dealing with big data workloads, particularly at a scale experienced by social networking services.
That does raise the question of how applicable these technologies will be to enterprises that do not share the architecture of the likes of Google, Facebook and LinkedIn – at least in the short-term. Although there are users – Cassandra users include Rackspace, Digg, Facebook, and Twitter, for example.
What there isn’t – for the likes of Cassandra and Voldemort, at least – is vendor-based support. That inevitably raises questions about the general applicability of the key-value/column table stores. As Dave Kellog notes, “unless you’ve got Google’s business model and talent pool, you probably shouldn’t copy their development tendencies”.
Given the levels of adoption it seems inevitable that vendors will emerge around some of these projects, not least since, as Dave puts it, “one day management will say: ‘Holy Cow folks, why in the world are we paying programmers to write and support software at this low a level?’”
In the meantime, it would appear that the document-oriented data stores (Couchio’s CouchDB, 10gen’s MongoDB, Basho’s Riak) are much more generally applicable, both technologically and from a business perspective. UPDATE – You can also add Neo Technology and its graph database technology to that list).
In our forthcoming two-part spotlight on this space I’ll articulate in more detail our view on the differentiation of the various NoSQL databases and other big data technologies and their potential enterprise applicability. The first part, on NoSQL and operational databases, is here.