Open source’s role in lowering the barriers to data warehousing

As well as contributing to the CAOS research practice here at The 451 Group I am also part of the information management team, with a focus on databases, data caching, CEP, and – from the start of this year – data warehousing.

I’ve covered data warehousing before, but taking a fresh look at this space in recent months it’s been fascinating to see the variety of technologies and strategies that vendors are applying to the data warehousing problem. It’s also been interesting to compare the role that open source has played in the data warehousing market with its role in the database market.

I’m preparing a major report on the data warehousing sector, for publication in the next couple of months. What follows is a rough outline of the role open source has played in the sector. Any comments or corrections much appreciated:

Unlike other sectors, where the role of open source has mostly been the disruption of incumbent proprietary vendors by commercial open source specialists, the impact of open source in the data warehousing sector has been more subtle, and arguably more pervasive.

Vendors such as Netezza and Greenplum have used the PostgreSQL database to build their data warehousing products, benefiting from the robust, mature PostgreSQL code base and reduced time to market. However, the end products of these development efforts are not open source.

For example, Netezza used PostgreSQL as the database scaffolding to reduce the time needed to create its Netezza Performance Server (NPS). The BSD license used by the PostgreSQL project enabled the company to do so without having to make the resulting database available under an open source license, and the majority of the PostgreSQL code has subsequently been replaced. Additionally, Aster Data makes use of PostgreSQL as a data store on each node of its nCluster massively parallel data warehouse.

Similarly, Greenplum used PostgreSQL as the basis for its massively parallel Greenplum Database, and also set up and supported the Bizgres distribution, with business intelligence and data warehousing-specific contributions made available under the BSD license. However, that project fizzled out and the website is now closed, although Greenplum’s use of PostgreSQL continues.

Another example of PostgreSQL usage comes from Paraccel, which used the PostgreSQL optimizer code in version 1.0 of its Analytic Database in order to improve time to market. That is now being replaced by a new optimizer called Omne, which is specifically designed to support the MPP columnar architecture of Paraccel and its compression capabilities, unlike the SMP PostgreSQL optimizer, which was extended to support MPP. While Omne retains some elements of the open source PostgreSQL optimizer code base, Paraccel claims it will remove all PostgreSQL code from its products with an update to the Omne technology in 2010.

Additionally Vertica, which was founded by Mike Stonebraker, creator of PostgreSQL and Ingres, is a commercial implementation of the C-Store academic research project, which was also licensed under BSD.

It is also worth mentioning that prior to its acquisition by Microsoft, DATAllegro made use of a commercial license of the open source Ingres database within its data warehousing appliances. DATAllegro actually did most of the early development work for its first appliance using PostgreSQL, but decided to change to Ingres late in 2004 to make use of partitioning capabilities, backup utilities and optimizer features. Needless to say, Ingres is being replaced by Microsoft SQL Server in Microsoft’s forthcoming Madison data warehouse appliances.

LucidDB is another, often overlooked open source database, and was purpose-built for data warehousing. Based on technology developed by Broadbase Software, the code was picked up by erstwhile business intelligence SaaS provider LucidEra and combined with the Eigenbase data management framework to create LucidDB. Following LucidEra’s recent demise the LucidDB code is not currently commercially supported, although the non-profit Eigenbase Foundation is continuing to sponsor its development.

Another often overlooked open source database – to the extent that I overlooked it – is MonetDB, a column-oriented database management system developed at Amsterdam’s Centrum Wiskunde & Informatica (CWI) scientific research establishment by many of the same researchers who went on to create Vectorwise (see below). MonetDB, the company, was spun off from CWI in 2008 with the aim of disseminating the code and identifying commercial joint venture and collaboration projects to increase its adoption.

Finally, at least for now, July saw the launch of HadoopDB from Yale’s Database Research team. HadoopDB is designed to be an analytical database system that combines the scalability of Hadoop and the performance of parallel database systems and, according to Daniel Abadi, it is “an open source stack that includes PostgreSQL, Hadoop, and Hive, along with some glue between PostgreSQL and Hadoop, a catalog, a data loader, and an interface that accepts queries in MapReduce or SQL and generates query plans that are processed partly in Hadoop and partly in different PostgreSQL instances spread across many nodes in a shared-nothing cluster of machines.”

Despite this rampant use of open source code, it was not until Infobright launched Infobright Community Edition (ICE) in 2008 that we saw the first commercial open source vendor delivering its core warehouse software under an open source license. The Infobright columnar database acts as a storage engine for the MySQL database, turning it into a realistic option for data warehouses of more than 200GB, according to Infobright (Sun maintains that MySQL can perform as a stand-alone data-warehousing platform up to 2TB with the default MyISAM non-transactional storage engine).

While MySQL is not well known as a platform for data warehousing, Sun’s internal surveys indicate that data warehousing is the fifth-most-common use case for MySQL, which explains why it is not just Infobright that is looking to build a data warehousing business around MySQL.

Kickfire emerged in April 2008 with a beta version of its MySQL Appliance, which is built around the MySQL database and its SQL chip, which provides native instruction execution while operating directly out of memory on compressed data. Kickfire is targeting deployments in the 100GB-3TB range, while Infobright acts as a MySQL storage engine to enable use with up to 30TB of data. Infobright is developing a shared-everything, peer-to-peer architecture that will support up to 100 concurrent users and 100TB of data. Delivery is scheduled for the fourth quarter.

It remains to be seen whether Oracle will retain its commercial relationships with Kickfire and Infobright once its acquisition of Sun, and therefore MySQL, closes. One company that has already been affected by the acquisition is Calpont, which had planned to make a big splash at the recent MySQL Conference & Expo with the launch of its new strategy to provide a data-warehousing storage engine for the MySQL database.

The plan, to offer an open source column-oriented storage engine that will provide the MySQL database with the capabilities to function as a data warehouse, scaling from capacities of 100GB to 100TB, remains in place, although the storage engine will be in beta testing for the foreseeable future while Calpont waits to see what Oracle will do.

The most recent open source entrant into the data warehousing market is Ingres, which has teamed up with VectorWise, another database-engine spin-off from Amsterdam’s CWI, to collaborate on a new database-kernel project designed to better enable it to be positioned as a platform for data-warehouse and analytic workloads. The resulting software will be fully open source, although Ingres does not have detailed plans for the productization of the technology at this stage. The Vectorwise technology was originally known as X100 and was used initially as an extension module inside the MonetDB database.

While open source is playing an increasing role in the data warehousing market, PostgreSQL has primarily taken the role of lowering barriers to entry for new vendors by providing a platform for the development of data warehouse-specific capabilities on a proven database platform.

MySQL serves a similar role for Infobright, Kickfire and Calpont, but could also play a significant role in lowering barriers to entry for new data warehousing customers with small volumes of data.

Calpont turned its attention to MySQL and the midrange market in order to exploit the requirement for scalable data-warehousing capabilities from MySQL’s estimated 11 million users, as well as the fact that the low end of the market has not been well supported by the existing data-warehousing vendors.

Sun estimates that 90% of all data warehouses have 6TB of data or less, while Kickfire estimates there are 17,000 addressable accounts that are trying to use MySQL to create data warehouses with volumes greater than 50GB.

These estimates explain why Sun et al see an opportunity for MySQL-based warehouses to grab a slice of the market based on low-cost systems targeting a large number of customers with small amounts of data – the complete inverse of the traditional focus for data warehousing requirements, which is based on high-cost systems supporting large amounts of data and a relatively small number of potential customers.

Additionally, Kickfire, Infobright and Calpont are looking to replicate the strategy MySQL successfully followed in the database market by targeting a market niche that is not being served by the incumbents and avoiding head-on competition with the likes of Teradata, IBM, Oracle and Netezza.


16 comments ↓

#5 Roland Bouman on 08.06.09 at 2:18 pm

Hi! nice overview – thanks

“LucidDB is another, often overlooked open source database,”

Another one is MonetDB (http://monetdb.cwi.nl/) also from Amsterdam’s CWI

#6 Matthew Aslett on 08.07.09 at 12:57 am

Yes of course. And proving it is often overlooked, I overlooked it. I’ll add it later.
