As well as contributing to the CAOS research practice here at The 451 Group I am also part of the information management team, with a focus on databases, data caching, CEP, and – from the start of this year – data warehousing.
I’ve covered data warehousing before, but in taking a fresh look at this space in recent months it’s been fascinating to see the variety of technologies and strategies that vendors are applying to the data warehousing problem. It’s also been interesting to compare the role that open source has played in the data warehousing market with the role it has played in the database market.
I’m preparing a major report on the data warehousing sector, for publication in the next couple of months. What follows is a rough outline of the role open source has played in the sector. Any comments or corrections much appreciated:
Unlike other sectors, where the role of open source has mostly been the disruption of incumbent proprietary vendors by commercial open source specialists, the impact of open source in the data warehousing sector has been more subtle, and arguably more pervasive.
Vendors such as Netezza and Greenplum have used the PostgreSQL database to build their data warehousing products, benefiting from the robust, mature PostgreSQL code base and reduced time to market. However, the end products of these development efforts are not open source.
For example, Netezza used PostgreSQL as the database scaffolding to reduce the time needed to create its Netezza Performance Server (NPS). The BSD license used by the PostgreSQL project enabled the company to do so without having to make the resulting database available under an open source license, and the majority of the PostgreSQL code has subsequently been replaced. Additionally, Aster Data makes use of PostgreSQL as a data store on each node of its nCluster massively parallel data warehouse.
Similarly, Greenplum used PostgreSQL as the basis for its massively parallel Greenplum Database, and also set up and supported the Bizgres distribution, with business intelligence and data warehousing-specific contributions made available under the BSD license. However, that project fizzled out and the website is now closed, although Greenplum’s use of PostgreSQL continues.
Another example of PostgreSQL usage comes from Paraccel, which used the PostgreSQL optimizer code in version 1.0 of its Analytic Database in order to improve time to market. That is now being replaced by a new optimizer called Omne, which is specifically designed to support the MPP columnar architecture of Paraccel and its compression capabilities, unlike the SMP PostgreSQL optimizer, which was extended to support MPP. While Omne retains some elements of the open source PostgreSQL optimizer code base, Paraccel claims it will remove all PostgreSQL code from its products with an update to the Omne technology in 2010.
Additionally Vertica, which was founded by Mike Stonebraker, creator of PostgreSQL and Ingres, is a commercial implementation of the C-Store academic research project, which was also licensed under BSD.
It is also worth mentioning that prior to its acquisition by Microsoft, DATAllegro made use of a commercial license of the open source Ingres database within its data warehousing appliances. DATAllegro actually did most of the early development work for its first appliance using PostgreSQL, but decided to change to Ingres late in 2004 to make use of partitioning capabilities, backup utilities and optimizer features. Needless to say, Ingres is being replaced by Microsoft SQL Server in Microsoft’s forthcoming Madison data warehouse appliances.
LucidDB is another, often overlooked open source database, and was purpose-built for data warehousing. Based on technology developed by Broadbase Software, the code was picked up by erstwhile business intelligence SaaS provider LucidEra and combined with the Eigenbase data management framework to create LucidDB. Following LucidEra’s recent demise the LucidDB code is not currently commercially supported, although the non-profit Eigenbase Foundation is continuing to sponsor its development.
Another often overlooked open source database – to the extent that I overlooked it – is MonetDB, a column-oriented database management system developed at Amsterdam’s Centrum Wiskunde & Informatica (CWI) scientific research establishment by many of the same researchers who went on to create VectorWise (see below). MonetDB, the company, was spun off from CWI in 2008 with the aim of disseminating the code and identifying commercial joint venture and collaboration projects to increase its adoption.
Finally, at least for now, July saw the launch of HadoopDB from Yale’s Database Research team. HadoopDB is designed to be an analytical database system that combines the scalability of Hadoop and the performance of parallel database systems and, according to Daniel Abadi, it is “an open source stack that includes PostgreSQL, Hadoop, and Hive, along with some glue between PostgreSQL and Hadoop, a catalog, a data loader, and an interface that accepts queries in MapReduce or SQL and generates query plans that are processed partly in Hadoop and partly in different PostgreSQL instances spread across many nodes in a shared-nothing cluster of machines.”
Despite this rampant use of open source code, it was not until Infobright launched Infobright Community Edition (ICE) in 2008 that we saw the first commercial open source vendor delivering its core warehouse software under an open source license. The Infobright columnar database acts as a storage engine for the MySQL database, turning it into a realistic option for data warehouses of more than 200GB, according to Infobright (Sun maintains that MySQL can perform as a stand-alone data-warehousing platform up to 2TB with the default MyISAM non-transactional storage engine).
While MySQL is not well known as a platform for data warehousing, Sun’s internal surveys indicate that data warehousing is the fifth-most-common use case for MySQL, which explains why it is not just Infobright that is looking to build a data warehousing business around MySQL.
Kickfire emerged in April 2008 with a beta version of its MySQL Appliance, which is built around the MySQL database and Kickfire’s SQL chip; the chip provides native instruction execution while operating directly out of memory on compressed data. Kickfire is targeting deployments in the 100GB-3TB range, while Infobright acts as a MySQL storage engine to enable use with up to 30TB of data. Infobright is developing a shared-everything, peer-to-peer architecture that will support up to 100 concurrent users and 100TB of data. Delivery is scheduled for the fourth quarter.
It remains to be seen whether Oracle will retain its commercial relationships with Kickfire and Infobright once its acquisition of Sun, and therefore MySQL, closes. One company that has already been impacted by the acquisition is Calpont, which had planned to make a big splash at the recent MySQL Conference & Expo with the launch of its new strategy to provide a data-warehousing storage engine for the MySQL database.
The plan, to offer an open source column-oriented storage engine that will provide the MySQL database with the capabilities to function as a data warehouse, scaling from capacities of 100GB to 100TB, remains in place, although the storage engine will be in beta testing for the foreseeable future while Calpont waits to see what Oracle will do.
The most recent open source entrant into the data warehousing market is Ingres, which has teamed up with VectorWise, another database-engine spin-off from Amsterdam’s CWI, to collaborate on a new database-kernel project designed to better enable it to be positioned as a platform for data-warehouse and analytic workloads. The resulting software will be fully open source, although Ingres does not have detailed plans for the productization of the technology at this stage. The VectorWise technology was originally known as X100 and was used initially as an extension module inside the MonetDB database.
While open source is playing an increasing role in the data warehousing market, PostgreSQL has primarily taken the role of lowering barriers to entry for new vendors by providing a platform for the development of data warehouse-specific capabilities on a proven database platform.
MySQL serves a similar role for Infobright, Kickfire and Calpont, but could also play a significant role in lowering barriers to entry for new data warehousing customers with small volumes of data.
Calpont turned its attention to MySQL and the midrange market in order to exploit the requirement for scalable data-warehousing capabilities from MySQL’s estimated 11 million users, as well as the fact that the low end of the market has not been well supported by the existing data-warehousing vendors.
Sun estimates that 90% of all data warehouses have 6TB of data or less, while Kickfire estimates there are 17,000 addressable accounts that are trying to use MySQL to create data warehouses with volumes greater than 50GB.
These estimates explain why Sun et al. see an opportunity for MySQL-based warehouses to grab a slice of the market based on low-cost systems targeting a large number of customers and small amounts of data – the complete inverse of the traditional focus for data warehousing requirements, which is based on high-cost systems supporting large amounts of data and a relatively small number of potential customers.
Additionally, Kickfire, Infobright and Calpont are looking to replicate the strategy MySQL successfully followed in the database market by targeting a market niche that is not being served by the incumbents and avoiding head-on competition with the likes of Teradata, IBM, Oracle and Netezza.