VC funding for Hadoop and NoSQL tops $350m

451 Research has today published a report looking at the funding being invested in Apache Hadoop- and NoSQL database-related vendors. The full report is available to clients, but below is a snapshot of the report, along with a graphic representation of the recent up-tick in funding.

According to our figures, between the beginning of 2008 and the end of 2010 $95.8m had been invested in the various Apache Hadoop- and NoSQL-related vendors. That figure now stands at more than $350.8m, up 266%.

That statistic does not really do justice to the sudden uptick of interest, however. The figures indicate that funding for Apache Hadoop- and NoSQL-related firms has more than doubled since the end of August, at which point the total stood at $157.5m.
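For readers who want to check the arithmetic behind those two comparisons, here is a minimal sketch. It uses only the three totals quoted above; the variable and function names are ours and purely illustrative.

```python
# Funding totals quoted in the text, in millions of US dollars.
END_2010 = 95.8
END_AUGUST = 157.5
CURRENT = 350.8


def pct_increase(old, new):
    """Return the percentage increase from old to new."""
    return (new - old) / old * 100


growth_since_2010 = pct_increase(END_2010, CURRENT)
multiple_since_august = CURRENT / END_AUGUST

print(f"Since the end of 2010: up {growth_since_2010:.0f}%")
print(f"Since the end of August: {multiple_since_august:.2f}x")
```

A multiple above 2 for the second figure is what "more than doubled" means here.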

A substantial reason for that huge jump is the staggering $84m series A funding round raised by Apache Hadoop-based analytics service provider Opera Solutions.

The original commercial supporter of Apache Hadoop, Cloudera, has also contributed strongly with a recent $40m series D round. In addition, MapR Technologies raised $20m to invest in its Apache Hadoop distribution, while we know that Hortonworks also raised a substantial round (unconfirmed, but reportedly $20m) from Benchmark Capital and former parent Yahoo as it was spun off in June. Index Ventures also recently announced that it has become an investor in Hortonworks.

I am reliably informed that if you factor in Hortonworks’ two undisclosed rounds, the total funding for Hadoop and NoSQL vendors is actually closer to $400m.

The various NoSQL database providers have also played a part in the recent burst of investment, with 10gen raising a $20m series D round and Couchbase raising $15m. DataStax, which has interests in both Apache Cassandra and Apache Hadoop, raised an $11m series B round, while Neo Technology raised a $10.6m series A round. Basho Technologies raised $12.5m in series D funding in three chunks during 2011.

Additionally, there are a variety of associated players, including Hadoop-based analytics providers such as Datameer, Karmasphere and Zettaset, as well as hosted NoSQL firms such as MongoLab, MongoHQ and Cloudant.

One investor company name that crops up more than most in the list above is Accel Partners, which was an original investor in both Cloudera and Couchbase, and backed Opera Solutions via its Accel-KKR joint venture with Kohlberg Kravis Roberts.

It appears that those investments have merely whetted Accel’s appetite for big data, however, as the firm last week announced a $100m Big Data Fund to invest in new businesses targeting storage, data management and analytics, as well as data-centric applications and tools.

While Accel is the first VC shop that we are aware of to create a fund specifically for big data investments, we are confident both that it won’t be the last and that other VCs have already informally earmarked funds for data-related investments.

451 clients can get more details on funding and M&A involving more traditional database vendors, as well as our perspective on potential M&A suitors for the Hadoop and NoSQL players.

Scalable SQL: more than the mullet of the database world?

In the first part of our coverage on emerging database products and vendors we examined the new NoSQL databases and suggested that the incumbent database vendors would likely respond to the growing threat with a mix of in-memory and distributed caching technologies.

That is yet to happen, although it has only been a few months and the NoSQL databases have generated more noise than revenue at this stage, but in the meantime a new set of database vendors and products have emerged that could pose a more direct threat to the database incumbents while thwarting the potential of the NoSQL upstarts.

For want of a better phrase we have taken to referring to these products collectively as scalable SQL databases, and have just published a new spotlight report pulling together our various reports on the runners and riders.

Some of the vendors promise to deliver the scalability and flexibility promised by NoSQL while retaining the support for SQL queries and/or ACID (atomicity, consistency, isolation, durability). That is not an insignificant boast and it will be tough to offer the best of both worlds.

“SQL For Business, NoSQL For Partay!” is the explanation offered by MulletDB, a project that promises scalability and SQL queries. The danger is that scalable SQL ends up being the database equivalent of the celebrated mullet hairstyle or its business attire equivalent: the jacket and jeans.

One of the companies trying to avoid that problem is GenieDB (coverage). The London-based company’s GenieDB Engine is a fully replicated distributed database that combines a key-value store database with a ‘sharded’ memcached layer. Another example is Clustrix, which was founded in December 2006 to develop a new database appliance that would offer both scalability and durability in a single product.

Meanwhile VoltDB emerged earlier this summer with a transactional database management system that is designed to scale across clusters of industry-standard servers while retaining transactional integrity.

Additionally, Xeround has recently confirmed its intention to reposition its Intelligent Data Grid (IDG) technology as Xeround Data Service, a scalable SQL database with support for ACID-compliant transactional capabilities for cloud computing environments. New Technology/enterprise’s CloudTran, meanwhile, is designed to bring enterprise-level transaction management to GigaSpaces’ XAP in-memory data grid for on-premises deployment, and eventually any PaaS offering.

Meanwhile we are intrigued by VMware’s acquisition of distributed data management vendor GemStone and its positioning of GemFire as a next-generation data management layer for cloud applications, as well as the forthcoming introduction of SQL querying in GigaSpaces’ eXtreme Application Platform (XAP), which will enable in-memory management of relational data and initiatives.

It is very early stages for all these vendors, and they have yet to prove that they have truly solved the problem of consistency and partition tolerance. In the meantime there are plenty of other contenders waiting in line.

Akiban is promising that it has the secret to SQL scalability with an approach that pre-groups data in order to overcome latency, caching and data distribution issues. Another company currently in stealth mode is JustOne Database which is working on perfecting a new storage model in order to deliver the performance and scalability required to support transactions and analytics on the same data simultaneously.

That is also the goal of Tokutek, whose TokuDB MySQL storage engine is based on Fractal Tree indexing technology designed to reduce data-insertion times and improve the performance of MySQL for both read and write applications.

JustOne and Tokutek are part of a slightly different set of vendors we are viewing under the scalable SQL umbrella: those that promise to improve performance for appropriate workloads to the extent that the advanced scale-out capabilities promised by some NoSQL databases become irrelevant.

While we’re on the subject of existing database vendors that could be considered part of the scalable SQL set, it is also worth mentioning MarkLogic. The company has recently been associating itself with NoSQL, and while the fact that it does not support SQL makes it a better literal fit with NoSQL, the company’s support for ACID means that we would see it as an option for customers looking to improve performance without losing consistency, especially for unstructured or semi-structured data.*

As we previously noted, the rise of NoSQL has to some degree resulted from the inability of the MySQL database to scale consistently. It is therefore no surprise to see many of the scalable SQL vendors promising to improve the performance and scalability of MySQL, while others promote a clean-slate approach to address new big data management problems.

We have more details on each of the products and projects mentioned above (as well as some not mentioned), their potential use cases, how they relate to MySQL, and what potential impact they may have on the adoption of NoSQL technologies in the full report.

This is very much the start of our coverage of these vendors, however. Expect more coverage in the near future, as well as a wider perspective on the potential for alternatives to the incumbent database suppliers, into 2011.

*Additionally, since the absence of SQL is only really tangential to many of the projects and products referred to as NoSQL it seems to me to be appropriate to have a database that does not support SQL in the scalable SQL category.

Is Sybase buying Aleri?

Marc Adler and Marco Seiriö seem to think so.

Such a deal would seem a little strange coming less than a year after Sybase licensed the underlying complex event processing (CEP) engine for Sybase CEP from Coral8, immediately prior to Coral8’s acquisition by Aleri.

The terms of that licensing agreement provide a clue as to why Sybase would consider opening up its wallet again to snap up Aleri, however.

As Aleri insisted last March, “The licensing arrangement allows Sybase to embed CEP capabilities within and ONLY WITHIN Sybase products such as RAP”.

Sybase later confirmed (clients only) to us that this was indeed the arrangement and maintained that its strategy for CEP was to embed it within larger platform products.

As well as RAP – The Trading Edition, the company’s risk-analytics platform, Sybase also had plans to target opportunities in the telecommunications, healthcare and government sectors.

One justification for the acquisition of Aleri would be that it would allow Sybase to target those markets and other opportunities with a standalone CEP offering based on Aleri’s next-generation engine codenamed Ohio which is slated for roll-out in 2010 and is designed to include the best features from Aleri Streaming Platform and the Coral8 Engine and be backwards-compatible with both.

Then of course there are the Aleri/Coral8 assets beyond the core CEP engine, including the Aleri Studio visual modeling application, as well as dashboard and OLAP server capabilities, and packaged applications for risk and liquidity analysis and management.

As for why Aleri would sell out to Sybase – we certainly noted some trepidation from the company when we caught up (clients only) in September last year. While the company was buoyant about its plans for Ohio, it was reluctant to discuss details of customer wins/successes.

The only thing the company would say was that it had more than 80 customers, the number of combined customers when the merger closed.

At that point it was somewhat more confident, claiming (clients only) to be the largest pure-play CEP vendor in terms of headcount and customer base and revenue (although with none of the CEP vendors disclosing revenue figures, that last claim was always highly debatable).

The future of the database is… plaid?

Oracle has introduced a hybrid column-oriented storage option for Exadata with the release of Oracle Database 11g Release 2.

Ever since Mike Stonebraker and fellow researchers at MIT, Brandeis University, the University of Massachusetts and Brown University presented (PDF) C-Store, a column-oriented database, at the 31st VLDB Conference in 2005, the database industry has debated the relative merits of row- and column-store databases.

While row-based databases dominated the operational database market, column-based databases have made inroads in the analytic database space, with Vertica (based on C-Store) as well as Sybase, Calpont, Infobright, Kickfire, ParAccel and SenSage pushing column-based data warehousing products based on the argument that column-based storage favors the read performance required for query processing.

The debate took a fresh twist recently as former SAP chief executive Hasso Plattner presented a paper (PDF) calling for the use of in-memory column-based storage databases for both analytical and transaction processing.

As interesting as that is in theory, of more immediate interest is the fact that Oracle – so often the target of column-based database vendors – has introduced a hybrid column-oriented storage option with the release of Oracle Database 11g Release 2.

As Curt Monash recently noted there are a couple of approaches emerging to hybrid row/column stores.

Oracle’s approach, as revealed in a white paper (PDF) has been to add new hybrid columnar compression capabilities in its Exadata Storage servers.

This approach maintains row-based storage in the Oracle Database itself while enabling the use of column-storage to improve compression rates in Exadata, claiming a compression ratio of up to 10 without any loss of query performance and up to 40 for historical data.

As Oracle’s Kevin Closson explains in a blog post: “The technology, available only with Exadata storage, is called Hybrid Columnar Compression. The word hybrid is important. Rows are still used. They are stored in an object called a Compression Unit. Compression Units can span multiple blocks. Like values are stored in the compression unit with metadata that maps back to the rows.”
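To make the idea in that quote a little more concrete, here is a toy sketch of a compression unit: a batch of rows stored column by column, so that like values sit next to each other and compress well. It is emphatically not Oracle's format or algorithm, neither of which is described here. Run-length encoding stands in for whatever compression Exadata actually applies, and every name in the code is hypothetical.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive equal values into [value, count] pairs."""
    return [[value, len(list(run))] for value, run in groupby(values)]


def build_compression_unit(rows):
    """Store a batch of rows column by column (toy model, not Oracle's format).

    Row order is preserved, so position within each column is the
    "metadata that maps back to the rows".
    """
    columns = list(zip(*rows))
    return [run_length_encode(column) for column in columns]


def read_row(unit, index):
    """Rebuild one row by walking each column's runs to the given position."""
    row = []
    for encoded_column in unit:
        remaining = index
        for value, count in encoded_column:
            if remaining < count:
                row.append(value)
                break
            remaining -= count
    return tuple(row)


rows = [
    ("2009-09-01", "NYSE", "BUY"),
    ("2009-09-01", "NYSE", "BUY"),
    ("2009-09-01", "NYSE", "SELL"),
    ("2009-09-01", "LSE", "SELL"),
]
unit = build_compression_unit(rows)
stored_pairs = sum(len(column) for column in unit)
print(unit)
print(f"{len(rows) * 3} values stored as {stored_pairs} [value, count] pairs")
```

The rows are still recoverable in full, which is the "hybrid" part: the unit is organized around a set of rows, but laid out by column inside.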

Vertica took a different hybrid approach with the release of Vertica Database 3.5, which introduced FlexStore, a new version of the column-store engine, including the ability to group a small number of columns or rows together to reduce input/output bottlenecks. Grouping can be done automatically based on data size (grouped rows can use up to 1MB) to improve query performance of whole rows, or specified based on the nature of the column data (for example, bid, ask and date columns for a financial application) to improve query performance.

Likewise, the Ingres VectorWise project (previously mentioned here) will create a new storage engine for the Ingres Database, positioned as a platform for data-warehouse and analytic workloads, and make use of vectorized execution, which sees multiple instructions processed simultaneously. The VectorWise architecture makes use of Partition Attributes Across (PAX), which similarly groups multiple rows into blocks to improve processing, while storing the data in columns.
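The PAX idea is easy to sketch in a few lines. The following is an illustration of the general layout only, assuming fixed-size pages and in-memory Python lists; it says nothing about how VectorWise implements it, and the function names and the sample bid/ask data are hypothetical.

```python
def to_pax_pages(rows, rows_per_page):
    """Split rows into pages; inside each page keep one list per column."""
    pages = []
    for start in range(0, len(rows), rows_per_page):
        chunk = rows[start:start + rows_per_page]
        pages.append([list(column) for column in zip(*chunk)])
    return pages


def scan_column(pages, column_index):
    """Read a single column across all pages, skipping the other columns."""
    for page in pages:
        yield from page[column_index]


def fetch_row(pages, row_number, rows_per_page):
    """Rebuild a whole row; all of its values live in the same page."""
    page = pages[row_number // rows_per_page]
    offset = row_number % rows_per_page
    return tuple(column[offset] for column in page)


# Hypothetical (trade id, bid, ask) rows.
trades = [
    (1, 10.0, 10.5),
    (2, 10.1, 10.6),
    (3, 10.2, 10.7),
    (4, 10.3, 10.8),
    (5, 10.4, 10.9),
]
pages = to_pax_pages(trades, rows_per_page=2)
asks = list(scan_column(pages, 2))
last_trade = fetch_row(pages, 4, rows_per_page=2)
print(asks)
print(last_trade)
```

A column scan touches only the values it needs within each page, while fetching a whole row never has to leave a single page. That combination is the appeal of the approach.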

Update – Daniel Abadi has provided an overview of the different approaches to hybrid row-column architectures and suggests something I had suspected, that Oracle is also using the PAX approach, except outside the core database, while Vertica is using what he calls a fine-grained hybrid approach. He also speculates that Microsoft may end up going the third route, fractured mirrors – Update

Perhaps the future of the database may not be row- or column-based, but plaid.

Ingres launches project for in-memory, columnar, vectorized database engine

Interesting news from Ingres today that it is teaming up with VectorWise, a database engine spin-off from Amsterdam’s Centrum Wiskunde & Informatica (CWI) scientific research establishment, to collaborate on a new database kernel project.

The Ingres VectorWise project will create a new open source storage engine for the Ingres Database that will better enable it to be positioned as a platform for data warehouse and analytic workloads, although Ingres does not have detailed plans for the productization of the technology at this stage. The starting point for the project is the theory that modern multi-core parallel processors now look like, and behave like, symmetric multiprocessing (SMP) servers, and that on-chip memory is taking the place of RAM, but that database software has not been updated to take advantage of processor developments.

In order to do so Ingres and VectorWise will be collaborating on vectorized execution, which sees multiple instructions processed simultaneously, and in-cache processing, through which the execution occurs within the CPU cache and main memory is effectively treated like disk. The result, according to Ingres, is to reduce the I/O bottleneck for query processing. Additionally, the VectorWise engine enables on the fly decompression and operation handling in memory and includes a compressed column store.
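As a rough illustration of the difference in processing style, the sketch below contrasts row-at-a-time evaluation with evaluation over small blocks of column values. It assumes a simple sum of price times quantity and a hypothetical block size of 1,024 values. Pure Python will not show any real cache effect, so this conveys only the shape of the idea, not the VectorWise engine itself.

```python
def tuple_at_a_time(prices, quantities):
    """Evaluate the expression once per row, as a traditional engine would."""
    total = 0
    for price, quantity in zip(prices, quantities):
        total += price * quantity
    return total


def vector_at_a_time(prices, quantities, vector_size=1024):
    """Evaluate each operator over a whole block of column values at a time.

    vector_size is a hypothetical figure, meant to suggest a block small
    enough to stay in the CPU cache between operators.
    """
    total = 0
    for start in range(0, len(prices), vector_size):
        price_block = prices[start:start + vector_size]
        quantity_block = quantities[start:start + vector_size]
        # "Multiply" operator applied to the whole block.
        products = [p * q for p, q in zip(price_block, quantity_block)]
        # "Sum" operator applied to the whole block.
        total += sum(products)
    return total


# Integer prices (in cents) keep the two totals exactly comparable.
prices = list(range(1, 5001))
quantities = [i % 7 for i in range(5000)]

row_total = tuple_at_a_time(prices, quantities)
block_total = vector_at_a_time(prices, quantities)
print(row_total == block_total)
```

Both functions compute the same answer. The difference is that the second hands each operator a batch of values, which is the property a vectorized engine relies on to keep its working set in cache.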

It is claimed that the Ingres VectorWise project will deliver 10x performance increases over the current Ingres database.

VectorWise was spun off from CWI in 2008 to commercialize the X100 system previously created by its database architecture research group. Development of X100, now also known as VectorWise, has been led by respected research scientists Peter Boncz and Marcin Zukowski.

Ingres maintains that by working with the CWI research scientists it has proven that their theories are technically feasible in a commercial product. Bringing such a commercial product to general availability is the next step, and history has proven that can be easier said than done. With that caveat we are impressed with the vision and ambition that Ingres is demonstrating.