NoSQL LinkedIn Skills Index – March 2013

As Q1 comes to a close it’s time to take another look at our NoSQL LinkedIn Skills Index, based on the number of LinkedIn member profiles mentioning each of the NoSQL projects. This is the second update since we rebooted the analysis in September 2012 to account for more products and refine our search terms.

NoSQL_Mar

A few interesting statistics to pick out: Neo4j has, as predicted, jumped ahead of MarkLogic for sixth place. No other changes of position, but outside the top ten, shown here, Apache Accumulo continues to grow well.

In fact, Apache Accumulo had the fastest rate of growth for the second quarter in succession, just ahead of DynamoDB and OrientDB – once again – followed by Apache Cassandra and MongoDB.

MongoDB’s growth means that it once again extended its lead as the most popular NoSQL database, according to LinkedIn profile mentions. As the chart below illustrates, it now accounts for 46% of all mentions of NoSQL technologies in LinkedIn profiles, according to our sample, compared with 45% in December.

NoSQL_Mar2
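The share and growth figures quoted above come down to simple arithmetic on per-project mention counts. The following is a minimal sketch of that arithmetic, not the spreadsheet actually used for the index: the function names, the project names and every count are made up for illustration and are not our LinkedIn results.

```python
def mention_shares(counts):
    """Return each project's percentage share of all mentions."""
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}


def growth_rates(previous, current):
    """Return percentage growth per project between two snapshots."""
    return {
        name: 100.0 * (current[name] - previous[name]) / previous[name]
        for name in current
        if previous.get(name, 0) > 0  # skip projects with no earlier count
    }


# Hypothetical counts for two quarterly snapshots (NOT real index data).
december = {"ProjectA": 450, "ProjectB": 300, "ProjectC": 250}
march = {"ProjectA": 552, "ProjectB": 330, "ProjectC": 318}

shares = mention_shares(march)
growth = growth_rates(december, march)
fastest = max(growth, key=growth.get)

print(shares)
print(growth)
print("Fastest growing:", fastest)
```

Note that the project with the largest share is not necessarily the fastest growing one, which is why the index reports the two measures separately.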

The Data Day, Two days: January 11/14 2013

Navigating our illustrated database landscape map. And more

And that’s the Data Day, today.

The Data Day, A few days: January 2-4, 2013

Apache Cassandra and BigTop updates. And more

And that’s the Data Day, today.

NoSQL LinkedIn Skills Index – rebooted

I decided to reboot our analysis of NoSQL skills, according to LinkedIn search results.

There are two main reasons for doing so: the first iteration did not take in enough of the various NoSQL projects; and I have – with help – worked my way around the eccentricities of LinkedIn search to produce a more accurate result for Apache Cassandra.

The analysis therefore now incorporates a wider spectrum of NoSQL projects, the top ten most popular of which are displayed below. The chart illustrates the number of LinkedIn member profiles mentioning each of the NoSQL projects:

The main change from the previous results is the promotion of Apache Cassandra, thanks to our better search string, while MarkLogic is the first of our new additions to make the top ten.

What hasn’t changed is the dominance of MongoDB, which is way ahead of all the others. While I am not breaking out growth percentages versus previous counts due to the reboot, it is fair to say that MongoDB is outpacing many of its rivals. Neo4j and DynamoDB are also growing particularly well.

In fact, as can be seen from the chart below, MongoDB accounts for 43% of all mentions of NoSQL technologies in LinkedIn profiles, according to our sample.

The Data Day, Today: August 8 2012

Who loves Hadoop? Who doesn’t?

And that’s the Data Day, today.

The Data Day, Today: Apr 25 2012

Splunk soars on IPO. VMware acquires Cetas. Vertica retains autonomy. And more.

An occasional series of data-related news, views and links posts on Too Much Information. You can also follow the series @thedataday.

* For 451 Research clients

# Splunk IPO: $3bn and counting M&A Insight

# VMware snaps up Cetas Software for ‘big data’ analytics Deal Analysis

# HP’s Vertica retains its autonomy, continues integration with Autonomy Impact Report

# SAP makes long-awaited predictive analytics move of its own Impact Report

# Sanbolic pitches data management platform for server, desktop and database consolidation Impact Report

* Splunk IPO kills, lives up to expectations

* VMware acquires Cetas Software for Cloud and Big Data Analytics

* Opera Solutions Acquires Procurement Analytics Tools and Services from BIQ and Lexington Analytics

* Terascala Announces $14M Series B Funding Round Led by Strategic Partner Consortium

* Ravel Acquired by W2O Group To Expand Big Data Client Services And Enrich In-House Analytics and Insights Technology

* Teradata Active Data Warehouses Provide Private Cloud Benefits

* Pentaho Introduces New Interactive Visualization and Expanded Big Data Analytics

* Teradata Unveils New Purpose-Built Appliance for SAS High-Performance Analytics

* SAP Establishes Global Managing Board to Lead Company

* Oracle to Hadoop Under OneAppliance: GridIron Introduces First All-Flash Appliance Line With Unprecedented Performance to Tackle Unified Big Data Processing

* Lucid Imagination Technology Integration with SugarCRM Lets Customers Enjoy Improved Global Search Capabilities with Apache Lucene/Solr

* The Apache Software Foundation Announces Apache Cassandra v1.1

* Miso project: how it will help you make your own Guardian-style infographics and data visualisations

And that’s the Data Day, today.

Update on the relative popularity of NoSQL database skills

Back in December we ran a series of posts looking at the geographic distribution of NoSQL skills, according to the results of searching LinkedIn member profiles, culminating in a look at the relative overall popularity of the major NoSQL databases.

This week I took another look at LinkedIn to update the results for a forthcoming report, which gives us the opportunity to see how the results have changed over the past quarter:

While this provides us with an interesting opportunity to track LinkedIn profile mentions over time there isn’t a huge amount we can learn from this first update – other than that MongoDB seems to be increasing its dominance.

The only significant change that isn’t immediately obvious from looking at the chart is that Apache HBase has overtaken Apache CouchDB by a tiny margin to claim third place overall.

As we noted last time, however, Apache HBase is more reliant on the US than other NoSQL databases for its LinkedIn mentions: it is the second most prevalent NoSQL database mentioned in the USA but fourth in the rest of the world.

Two other points to take into consideration:

- The results for Apache Cassandra are probably disproportionately low since we have to search for the full phrase in order to avoid including people called Cassandra.

- Previously we only searched for Membase. This time we added together the search results for both Membase and Couchbase. This may mean the result for Couch/Membase is disproportionately high since some members probably listed both.

This is not meant to be a comprehensive analysis, however, but rather a snapshot of one particular data source.
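The two caveats above can be made concrete with a small sketch. It assumes, purely for illustration, that we had the text of each profile to hand; in practice LinkedIn search only gave us result counts. The `mentions` helper and the sample profiles are hypothetical.

```python
import re


def mentions(profile_text, phrase):
    """True if the profile contains the phrase as whole words, ignoring case."""
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return re.search(pattern, profile_text, flags=re.IGNORECASE) is not None


# Made-up profiles keyed by a made-up member id.
profiles = {
    1: "Cassandra Smith, project manager",
    2: "Engineer working with Apache Cassandra and Hadoop",
    3: "Runs Cassandra clusters in production",
    4: "Membase administrator",
    5: "Migrated from Membase to Couchbase",
}

# Caveat 1: the bare term also catches people called Cassandra,
# while the full phrase misses genuine users who omit "Apache".
bare = {pid for pid, text in profiles.items() if mentions(text, "Cassandra")}
full = {pid for pid, text in profiles.items() if mentions(text, "Apache Cassandra")}

# Caveat 2: adding two result counts double-counts members who list both.
membase = {pid for pid, text in profiles.items() if mentions(text, "Membase")}
couchbase = {pid for pid, text in profiles.items() if mentions(text, "Couchbase")}
summed = len(membase) + len(couchbase)
distinct = len(membase | couchbase)

print(sorted(bare), sorted(full))
print(summed, distinct)
```

With only counts available, the summed figure is an upper bound on the number of distinct members, which is why the Couch/Membase result may be disproportionately high.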

The Data Day, Today: Jan 10 2012

Oracle OEMs Cloudera. The future of Apache CouchDB. And more.

An occasional series of data-related news, views and links posts on Too Much Information. You can also follow the series @thedataday.

* Oracle announced the general availability of Big Data Appliance, and an OEM agreement with Cloudera for CDH and Cloudera Manager.

* The Future of Apache CouchDB Cloudant confirms intention to integrate the core capabilities of BigCouch into Apache CouchDB.

* Reinforcing Couchbase’s Commitment to Open Source and CouchDB Couchbase CEO Bob Wiederhold attempts to clear up any confusion.

* Hortonworks Appoints Shaun Connolly to Vice President of Corporate Strategy Former vice president of product strategy at VMware.

* Splunk even more data with 4.3 Introducing the latest Splunk release.

* Announcement of Percona XtraDB Cluster (alpha release) Based on Galera.

* Bringing Value of Big Data to Business: SAP’s Integrated Strategy Forbes interview with Sanjay Poonen, President and corporate officer of SAP Global Solutions.

* New Release of Oracle Database Firewall Extends Support to MySQL and Enhances Reporting Capabilities Self-explanatory.

* Big data and the disruption curve “Many efforts are being funded by business units and not the IT department and money is increasingly being diverted from large enterprise vendors.”

* Get your SQL Server database ready for SQL Azure Microsoft “codename” SQL Azure Compatibility Assessment.

* An update on Apache Hadoop 1.0 Cloudera’s Charles Zedlewski helpfully explains Apache Hadoop branch numbering.

* Xeround and the CAP Theorem So where does Xeround fit in the CAP Theorem?

* Can Yahoo’s new CEO Thompson harness big data, analytics? Larry Dignan thinks Scott Thompson might just be the right guy for the job.

* US Companies Face Big Hurdles in ‘Big Data’ Use “21% of respondents were unsure how to best define Big Data”

* Schedule Your Agenda for 2012 NoSQL Events Alex Popescu updates his list of the year’s key NoSQL events.

* DataStax take Apache Cassandra Mainstream in 2011; Poised for Growth and Innovation in 2012 The usual momentum round-up from DataStax.

* Objectivity claimed significant growth in adoption of its graph database, InfiniteGraph and flagship object database, Objectivity/DB.

* Cloudera Connector for Teradata 1.0.0 Self-explanatory.

* For 451 Research clients

# SAS delivers in-memory analytics for Teradata and Greenplum Market Development report

# With $84m in funding, Opera sets out predictive-analytics plans Market Development report

* Google News Search outlier of the day: First Dagger Fencing Competition in the World Scheduled for January 14, 2012

And that’s the Data Day, today.

User perspectives on NoSQL

The NoSQL EU event in London this week was a great event with interesting perspectives from both vendors – Basho, Neo Technology, 10gen, Riptano – and also users – The Guardian, the BBC, Amazon, Twitter. In particular I was interested in learning from the latter about how and why they ended up using alternatives to the traditional relational database model.

Some of the reasons for using NoSQL have been well-documented: Amazon CTO Werner Vogels talked about how the traditional database offerings were unable to meet the scalability Amazon.com requires. Filling a functionality void also explains why Facebook created Cassandra, Google created BigTable, and Twitter created FlockDB (etc etc). As Werner said, “We couldn’t bet the company on other companies building the answer for us.”

As Werner also explained, however, the motivation for creating Dynamo was also about enabling choice and ensuring that Amazon was not trying to force the relational database to do something it was not designed to do. “Choosing the right tool for the job” was a recurring theme at NoSQL EU.

Given the NoSQL name it is easy to assume that this means that the relational database is by default “the wrong tool”. However, the most important element in that statement is arguably not “tool”, but “job”. The Guardian discussed how it was using non-relational data tools to create new applications that complement its ongoing investment in the Oracle database.

For example, the Guardian’s application to manage the progress of crowdsourcing the investigation of MPs’ expenses is based on Redis, while the Zeitgeist trending news application runs on Google’s AppEngine, as did its live poll during the recent leaders’ election debate. Datablog, meanwhile, relies on Google Spreadsheets to serve up usable and downloadable data – we’ll ignore for a moment whether Google Spreadsheets is a NoSQL database ;-)

Long-term The Guardian is looking towards the adoption of a schema-free database to sit alongside its Oracle database and is investigating CouchDB. The overarching theme, as Matthew Wall and Simon Willison explained, is that the relational database is now just a component in the overall data management story, alongside data caching, data stores, search engines etc.

On the subject of choosing the right tool for the job, Basho’s engineering manager Bryan Fink pointed out that using NoSQL technology alongside relational SQL database technology may actually improve the performance of the SQL database, since storing data that does not need SQL features in a relational database slows down access to the data that does need them.

Another perspective on this came from Werner Vogels who noted that unlike database administrators/systems architects, users don’t care about where data resides or what model it uses – as long as they get the service they require. Werner explained that the Amazon.com homepage is a combination of 200-300 different services, with multiple data systems. Users do not think about data sources in isolation; they care about the amalgamated service.

This was also a theme that cropped up in the presentation by Enda Farrell, software architect at the BBC, who noted that the BBC’s homepage is a PHP application integrated with multiple data sources at multiple data centers, and also Twitter’s analytics lead Kevin Weil, who described Twitter’s use of Hadoop, Pig, HBase, Cassandra and FlockDB.

While the company is using HBase for low-latency analytic applications such as people search and moving to Cassandra from MySQL for its online applications, it uses its recently open-sourced FlockDB graph database to serve up data on followers and correlate the intersection of followers to (for example) ensure that Tweets between two people are only sent to the followers of both. (As something of an aside, Twitter is using Hadoop to store the 7TB of data it generates a day from Tweets, and Pig for non-real-time analytics).
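The follower-intersection idea can be illustrated with a minimal sketch using plain in-memory sets. This is not FlockDB’s API or Twitter’s implementation; the `reply_audience` function, the user names and the follower lists are all hypothetical.

```python
def reply_audience(followers, sender, recipient):
    """Return the users who follow both the sender and the recipient."""
    return followers.get(sender, set()) & followers.get(recipient, set())


# Made-up follower lists: user -> set of that user's followers.
followers = {
    "alice": {"carol", "dave", "erin"},
    "bob": {"dave", "erin", "frank"},
}

# A Tweet from alice to bob is delivered only to people following both.
audience = reply_audience(followers, "alice", "bob")
print(sorted(audience))
```

At Twitter’s scale the follower sets are far too large to intersect like this in a single process, which is presumably part of the reason for building a dedicated graph store.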

Kevin noted that the company is also working with Digg to build real-time analytics for Cassandra and will be releasing the results as open source, and also discussed how Twitter has made use of open source technologies created by others such as Facebook (both Cassandra and the Scribe log data aggregation server).

One of the issues that has arisen from the fact that organizations such as Amazon and Facebook have had to create their own data management technologies is the proliferation of NoSQL databases and a certain amount of wheel re-invention.

Werner explained that SmugMug creator Don MacAskill ended up being a MySQL expert not because he necessarily wanted to be, but because he had to be to keep his applications running.

“He doesn’t want to have to become an expert in Cassandra,” noted Werner. “What he wants is to have someone run it for him and take care of that.” Presumably Riptano, the new Cassandra vendor formed by Jonathan Ellis – project chair for the Cassandra database – will take care of that, but in the meantime Werner raised another long-term alternative.

“We shouldn’t all be doing this,” he said, adding that Dynamo is not as popular within Amazon Web Services as it once was as it is a product that requires configuration and management, rather than a service, and Amazon employees “have better things to do.”

Which raises the question – don’t Twitter, Facebook, the BBC, the Guardian et al have better things to do than developing and maintaining database architecture? In a perfect world, yes. But in a perfect world they’d all have strongly consistent, scalable distributed database systems/services that are suited to their various applications.

Interestingly, describing S3 as “a better key/value store than Dynamo”, Werner noted that SimpleDB and S3 are “a good start to provide that service”.

Looking forward to NoSQL EU

I was asked a few weeks ago whether I thought NoSQL was largely a US (and specifically West Coast) phenomenon. While it might seem that way for some of those in the bubble that is the Bay Area (and to be fair that’s where I was at the time), the answer is a definite “no”.

As if to prove it, NoSQL EU is being held in London next week with a great program of presentations from NoSQL vendors, projects and users.

April 20 features presentations on The Guardian’s use of NoSQL, as well as an overview from Alex Popescu of MyNoSQL, followed by presentations from Basho, 10gen, Rackspace and Neo Technology.

April 21 sees Amazon CTO Werner Vogels describing the birth of Dynamo, as well as presentations on the use of NoSQL databases from the BBC, Twitter, and Comcast. That is followed by presentations on Redis, Tokyo Cabinet (et al) and “the fate of the relational database”. Oh, and a panel debate moderated by some bloke called James Governor ;-)

Then on the 22nd there’s a day of workshops involving MongoDB, Redis, Riak and Neo4j.

It’s shaping up to be a great event and I’m really looking forward to it. If you’re going to be there and want to say hi (between sessions!) let me know.