January 31st, 2011 — Data management
There has been a spate of reports and blog posts recently speculating about the potential demise of the enterprise data warehouse (EDW) in the light of big data and evolving approaches to data management.
There are a number of connected themes that have led the likes of Colin White and Barry Devlin to ponder the future of the EDW, and as it happens I’ll be talking about these during our 451 Client event in San Francisco on Wednesday.
While my presentation doesn’t speak directly to the future of the EDW, it does cover the trends that are driving the reconsideration of the assumption that the EDW is, and should be, the central source of business intelligence in the enterprise.
As Colin points out, this is an assumption based on historical deficiencies with alternative data sources that evolved into best practices. “Although BI and BPM applications typically process data in a data warehouse, this is only because of… issues… concerning direct access [to] business transaction data. If these issues could be resolved then there would be no need for a data warehouse.”
The massive improvements in processing performance seen since the advent of data warehousing mean that it is now more practical to process data where it resides or is generated, rather than forcing data to be held in a central data warehouse.
For example, while distributed caching was initially adopted to improve the performance of Web and financial applications, it also provides an opportunity to perform real-time analytics on application performance and user behaviour (enabling targeted ads for example) long before the data get anywhere near the data warehouse.
While the central EDW approach has some advantages for data control, security and reliability, this has always been more theoretical than practical, as there is the need for regional and departmental data marts, and users continue to use local copies of data.
As we put it in last year’s Data Warehousing 2009-2013 report:
“The approach of many users now is not to stop those distributed systems from being created, but rather to ensure that they can be managed according to the same data-quality and security rules as the EDW.
With the application of cloud computing capabilities to on-premises infrastructure, users now have the promise of distributed pools of enterprise data that marry central management with distributed use and control, empowering business users to create elastic and temporary data marts without the risk of data-mart proliferation.”
The concept of the “data cloud” is nascent, but companies such as eBay are pushing in that direction, while also making use of data storage and processing technologies above and beyond traditional databases.
Hadoop is a prime example, but so too are the infrastructure components that are generating vast amounts of data that can be used by the enterprise to better understand how the infrastructure is helping or hindering the business in responding to changing demands.
For the 451 client event we have come up with the term ‘datastructure’ to describe these infrastructure elements. What is ‘datastructure’? It’s the machines that are responsible for generating machine-generated data.
While that may sound like we’ve just slapped a new label on existing technology, we believe that those data-generating machines will evolve over time to take advantage of improved available processing power with embedded data analytics capabilities.
Just as in-database analytics has enabled users to reduce data processing latency by taking the analytics to the data in the database, it seems likely that users will look to do the same for machine-generated data by taking the analytics to the data in the ‘datastructure’.
This ‘datastructure’ with embedded database and analytics capabilities therefore becomes part of the wider ‘data cloud’, alongside regional and departmental data marts, and the central business application data warehouse, as well as the ability to spin up and provision virtual data marts.
As Barry Devlin puts it: “A single logical storehouse is required with both a well-defined, consistent and integrated physical core and a loose federation of data whose diversity, timeliness and even inconsistency is valued.”
Making this work will require new data cloud management capabilities, as well as an approach to data management that we have called “total data”. As we previously explained:
“Total data is about more than data volumes. It’s about taking a broad view of available data sources and processing and analytic technologies, bringing in data from multiple sources, and having the flexibility to respond to changing business requirements…
Total data involves processing any data that might be applicable to the query at hand, whether that data is structured or unstructured, and whether it resides in the data warehouse, or a distributed Hadoop file system, or archived systems, or any operational data source – SQL or NoSQL – and whether it is on-premises or in the cloud.”
As for the end of the EDW, both Colin and Barry argue, and I agree, that what we are seeing does not portend the end of the EDW but recognition that the EDW is a component of business intelligence, rather than the source of all business intelligence itself.
January 10th, 2011 — Data management
Among the numerous prediction pieces doing the rounds at the moment, Bradford Stephens, founder of Drawn to Scale, suggested we could be in for continued proliferation of NoSQL database technologies in 2011, while Redmonk’s Stephen O’Grady predicted consolidation. I agree with both of them.
To understand how NoSQL could both proliferate and consolidate in 2011 it’s important to look at the small print. Bradford was talking specifically about open source tools, while Stephen was writing about commercially successful projects.
Given the levels of interest in NoSQL database technologies, the vast array of use cases, and the various interfaces and development languages – most of which are open source – I predict we’ll continue to see cross-pollination and the emergence of new projects as developers (corporate and individual) continue to scratch their own data-based itches.
However, I think we are also beginning to see a narrowing of the commercial focus on those projects and companies that have enough traction to generate significant business opportunities and revenue, and that a few clear leaders will emerge in the various NoSQL sub-categories (key-value stores, document stores, graph databases and distributed column stores).
We can see previous evidence of the dual impact of proliferation and consolidation in the Linux market. While commercial opportunities are dominated by Red Hat, Novell and Canonical, that has not stopped the continued proliferation of Linux distributions.
The main difference between NoSQL and Linux markets, of course, is that the various Linux distributions all have a common core, and the diversity in the NoSQL space means that we are unlikely to see proliferation on the scale of Linux.
However, I think we’ll see a similar two-tier market emerge with a large number of technically interesting and differentiated open source projects, and a small number of commercially-viable general-purpose category leaders.
November 1st, 2010 — Data management
When we published our 2008 report on the impact of open source on the database market the overall conclusion was that adoption had been widespread but shallow.
Since then we’ve seen increased adoption of open source software, as well as the acquisition of MySQL by Oracle. Perhaps the most significant shift in the market since early 2008 has been the explosion in the number of open source database and data management projects, including the various NoSQL data stores, and of course Hadoop and its associated projects.
On Tuesday, November 9, 2010 at 11:00 am EST I’ll be joining Robin Schumacher, Director of Product Strategy at EnterpriseDB, to present a webinar on navigating the changing landscape of open source databases.
Among the topics to be discussed are:
· the needs of organizations with hybrid mixed-workload environments
· how to choose the right tool for the job
· the involvement of user corporations (for better or for worse) in open source projects today.
You can find further details about the event and register here.
August 24th, 2010 — Data management
The data warehousing market will see a compound annual growth rate of 11.5% from 2009 through 2013 to reach a total of $13.2bn in revenues.
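For readers who want to check what that growth rate implies, the arithmetic can be sketched in a few lines of Python. This is a back-of-the-envelope illustration rather than a figure from the report: it assumes four annual compounding periods between 2009 and 2013, and the function names are hypothetical.

```python
def implied_base(final_value: float, rate: float, periods: int) -> float:
    """Back out the starting value implied by a final value and a CAGR."""
    return final_value / (1.0 + rate) ** periods


def cagr(start_value: float, final_value: float, periods: int) -> float:
    """Compound annual growth rate between a start and a final value."""
    return (final_value / start_value) ** (1.0 / periods) - 1.0


# Assumption: 2009 -> 2013 is treated as four annual compounding periods.
FINAL_2013_BN = 13.2   # total revenue projected for 2013, in $bn (from the post)
GROWTH_RATE = 0.115    # 11.5% CAGR (from the post)
PERIODS = 4

base_2009 = implied_base(FINAL_2013_BN, GROWTH_RATE, PERIODS)
print(f"Implied 2009 revenue: ${base_2009:.2f}bn")
print(f"Round-trip CAGR: {cagr(base_2009, FINAL_2013_BN, PERIODS):.1%}")
```

Under those assumptions the implied 2009 starting point works out at roughly $8.5bn.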
That is the main finding highlighted by the latest report from The 451 Group’s Information Management practice, which provides market-sizing information for the data-warehousing sector from 2009 to 2013.
The report includes revenue estimates and growth projections, and examines the business and technology trends driving the market.
It was put together with the assistance of Market Monitor – the new market-sizing service from The 451 Group and Tier1 Research. Props to Greg Zwakman and Elizabeth Nelson for their number-crunching.
Among the key findings, available via the executive summary (PDF), are:
- Four vendors dominate the data-warehouse market, with 93.6% of total revenue in 2010. These vendors are expected to retain their advantage and generate 92.2% of revenue in 2013.
- Analytic databases are now able to take advantage of greater processor performance at a lower cost, improving price/performance and lowering barriers to entry.
- With the application of cloud capabilities, users now have the promise of pools of enterprise data that marry central management with distributed use and control.
- Products that take advantage of improved hardware performance will drive revenue growth for all vendors, and will protect the market share of incumbents.
- As a result of systems performance improvements, data-warehousing vendors are also taking advantage of the opportunity to bring more advanced analytic capabilities to the DB engine.
- Although we expect many smaller vendors to grow at a much faster rate between now and 2013, it will not be at the expense of the market’s dominant vendors.
- While the Hadoop Core is not a direct alternative to traditional analytic DBs, the increased maturity of associated projects means that use cases for Hadoop- and MapReduce-enabled analytic DBs will overlap.
There is, of course, much more detail in the full report. 451 Group clients can download the report here, while non-clients can also use the same link to purchase the report, or request more information.
May 14th, 2010 — Data management, M&A
The 451 Group has published its take on the proposed acquisition of Sybase by SAP. The full report provides details on the deal, valuation and timing, as well as assessing the rationale and competitive impact in three core areas: data management, mobility, and applications.
As a taster, here’s an excerpt from our view of the deal from a database perspective:
The acquisition of Sybase significantly expands SAP’s interests in database technology, and the improved ability of the vendor to provide customers with an alternative to rival Oracle’s database products is, alongside mobile computing, a significant driver for the deal. Oracle and SAP have long been rivals in the enterprise application space, but Oracle’s dominance in the database market has enabled it to wield significant influence over SAP accounts. For instance, Oracle claims to be the most popular database for deploying SAP, and that two-thirds of all SAP customers run on Oracle Database. Buying a database platform of its own will enable SAP to break any perceived dependence on its rival, although this is very much a long-term play: Sybase’s database business is tiny compared to Oracle, which reported revenue from new licenses for database and middleware products of $1.2bn in the third quarter alone.
The long-term acquisition focus is on the potential for in-memory database technology, which has been a pet project for SAP cofounder and supervisory board chairman Hasso Plattner for some time. As the performance of systems hardware has improved, it is now possible to run more enterprise workloads in memory, rather than on disk. By using in-memory database technology, SAP is aiming to improve the performance of its transactional applications and BI software while also hoping to leapfrog rival Oracle, which has its disk-based database installed base to protect. Sybase also has a disk-based database installed base, but has been actively exploring in-memory database technology, and SAP can arguably afford to be much more aggressive about a long-term in-memory vision since its reliance on that installed base is much less than Sybase’s or Oracle’s.
SAP has already delivered columnar in-memory database technology to market via its Business Warehouse Accelerator (BWA) hardware-based acceleration engine and the SAP BusinessObjects Explorer data-exploration tool. Sybase has also delivered in-memory database technology for its transactional ASE database with the release of version 15.5 earlier this year. By acquiring Sybase, SAP has effectively delivered on Plattner’s vision of in-memory databases for both analytical and transaction processing, albeit with two different products. At this stage, it appears that SAP’s in-memory functionality will quickly be applied to the IQ analytic database while ASE will retain its own in-memory database features. Over time, expect R&D to focus on delivering column-based in-memory database technology for both operational and analytic workloads.
In addition, SAP touted the applicability of its in-memory database technology to Sybase’s complex-event-processing (CEP) technology and Risk Analytics Platform (RAP). Sybase was already planning to replicate the success of RAP in other verticals following its acquisition of CEP vendor Aleri in February, and we would expect SAP to accelerate that.
Meanwhile, SAP intends to continue to support databases from other vendors. In the short term, this will be a necessity since SAP’s application software does not currently run on Sybase’s databases. Technically, this should be easy to overcome, although clearly it will take time, and we would expect SAP to encourage its application and BI customers to move to Sybase ASE and IQ for new deployments in the long term. One of the first SAP products we would expect to see ported to Sybase IQ is the NetWeaver Business Warehouse (BW) model-driven data-warehouse environment. SAP’s own MaxDB is currently the default database for BW, although it enables deployment to Oracle, IBM DB2, Microsoft SQL Server, MaxDB, Teradata and Hewlett-Packard’s Neoview. Expect IQ to be added to that list sooner rather than later, and to potentially replace MaxDB as the default database.
I have some views on how SAP could accelerate the migration of its technology and users to Sybase’s databases but – for reasons that will become apparent – they will have to wait until next week.
March 15th, 2010 — Data management
One of the essential problems with covering the NoSQL movement is that the term describes not what the associated databases are, but what they are not (and doesn’t even do that very well, since SQL itself is in many cases orthogonal to the problem the databases are designed to solve).
It is interesting to see fellow analyst Curt Monash facing the same problem. As he notes, while there seems to be a common theme that “NoSQL is Foo without joins and transactions,” no one has adequately defined what “Foo” is.
Curt has proposed HVSP (High-Volume Simple Processing) as an alternative to NoSQL, and while I’m not jumping on the bandwagon just yet, it does pass the Ronseal test (it does what it says on the tin), and it also matches my view of what defines these distributed data store technologies.
Some observations:
I agree with Curt’s view that object-oriented and XML databases should not be considered part of this new breed of distributed data store technologies. There is a danger that NoSQL simply comes to mean non-relational.
I also agree that MapReduce and Hadoop should not be considered part of this category of data management technologies (which is somewhat ironic since if there is any technology for which the terms NoSQL or Not Only SQL are applicable, it is MapReduce).
The vendors associated with the NoSQL movement (Basho, Couchio and MongoDB) are in a problematic position. While they are benefiting from, and to some extent encouraging, interest in NoSQL, the overall term masks their individual benefits. My sense is they will look to move away from it sooner rather than later.
Memcached is not a key value store. It is a cache. Hence the name.
There are numerous categorizations of the various NoSQL technologies available on the Internet. Without wishing to add yet another to the mix, I have created another one – more for my benefit than anything else.
It includes a list of users for the various projects (where available), and also some sense of whether the various projects fit into CAP Theorem, an understanding of which is, to my mind, essential for understanding how and why the NoSQL/HVSP movement has emerged (look out for more on CAP Theorem in a follow-up post on alternatives to NoSQL).
Here’s my take, for those that are interested. As you can see there’s a graph database-shaped hole in my knowledge. I’m hoping to fill that sooner rather than later.
By the way, our Spotlight report introducing The 451 Group’s formal coverage of NoSQL databases will be available here imminently.
Update: VMware has announced that it has hired Redis creator Salvatore Sanfilippo, and is taking on the Redis key value store project. The image below has been updated to reflect that, as well as the launch of NorthScale’s Membase.

September 2nd, 2009 — Data management
Oracle has introduced a hybrid column-oriented storage option for Exadata with the release of Oracle Database 11g Release 2.
Ever since Mike Stonebraker and fellow researchers at MIT, Brandeis University, the University of Massachusetts and Brown University presented (PDF) C-Store, a column-oriented database at the 31st VLDB Conference, in 2005, the database industry has debated the relative merits of row- and column-store databases.
While row-based databases dominated the operational database market, column-based databases have made inroads in the analytic database space, with Vertica (based on C-Store) as well as Sybase, Calpont, Infobright, Kickfire, Paraccel and SenSage pushing column-based data-warehousing products, based on the argument that column-based storage favors the read performance required for query processing.
The debate took a fresh twist recently as former SAP chief executive Hasso Plattner presented a paper (PDF) calling for the use of in-memory column-based storage databases for both analytical and transaction processing.
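To make the row-versus-column argument concrete, here is a minimal sketch in Python. It is a toy model, not how any of the products named above work internally: the table, column names and the "values touched" counter are all hypothetical, and the counter is only a rough stand-in for I/O.

```python
# Hypothetical trade table: (trade_date, symbol, price, volume)
COLUMNS = ("trade_date", "symbol", "price", "volume")
TRADES = [
    ("2009-09-01", "ACME", 10.0, 100),
    ("2009-09-01", "ACME", 10.5, 250),
    ("2009-09-02", "INIT", 11.0, 75),
    ("2009-09-02", "INIT", 12.5, 300),
]


def to_columns(rows):
    """Pivot row tuples into one list per column (a toy column store)."""
    return {name: [row[i] for row in rows] for i, name in enumerate(COLUMNS)}


def avg_price_row_store(rows):
    """Average price from row storage: every field of every row is read."""
    touched = 0
    total = 0.0
    for row in rows:
        touched += len(row)  # the whole row is fetched to reach one field
        total += row[2]
    return total / len(rows), touched


def avg_price_column_store(columns):
    """Average price from column storage: only the price column is read."""
    prices = columns["price"]
    return sum(prices) / len(prices), len(prices)


row_avg, row_touched = avg_price_row_store(TRADES)
col_avg, col_touched = avg_price_column_store(to_columns(TRADES))
print(row_avg, row_touched)
print(col_avg, col_touched)
```

Both layouts give the same answer, but the analytic query reads a quarter of the values in the column layout. The flip side is that inserting one new trade touches a single row in the row layout and every column list in the column layout.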
As interesting as that is in theory, of more immediate interest is the fact that Oracle – so often the target of column-based database vendors – has introduced a hybrid column-oriented storage option with the release of Oracle Database 11g Release 2.
As Curt Monash recently noted there are a couple of approaches emerging to hybrid row/column stores.
Oracle’s approach, as revealed in a white paper (PDF), has been to add new hybrid columnar compression capabilities in its Exadata Storage servers.
This approach maintains row-based storage in the Oracle Database itself while enabling the use of column-storage to improve compression rates in Exadata, claiming a compression ratio of up to 10 without any loss of query performance and up to 40 for historical data.
As Oracle’s Kevin Closson explains in a blog post: “The technology, available only with Exadata storage, is called Hybrid Columnar Compression. The word hybrid is important. Rows are still used. They are stored in an object called a Compression Unit. Compression Units can span multiple blocks. Like values are stored in the compression unit with metadata that maps back to the rows.”
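The general idea in that description (keep a batch of rows together, but store like values next to each other so they compress well) can be sketched as follows. This is emphatically not Oracle's format or algorithm: the run-length encoding, the dictionary layout and the function names are all assumptions made for illustration.

```python
from itertools import groupby

# Hypothetical batch of rows: (country, status, order_id)
ORDERS = [
    ("US", "open", 1),
    ("US", "open", 2),
    ("US", "closed", 3),
    ("UK", "closed", 4),
]


def build_unit(rows):
    """Pack a batch of rows into a toy 'compression unit'.

    Each column is run-length encoded as (value, run_length) pairs, so
    repeated neighbouring values are stored once. The row count is kept
    as metadata alongside the encoded columns.
    """
    columns = zip(*rows)
    encoded = [
        [(value, len(list(run))) for value, run in groupby(column)]
        for column in columns
    ]
    return {"row_count": len(rows), "columns": encoded}


def read_rows(unit):
    """Expand each column's runs and zip them back into the original rows."""
    decoded = [
        [value for value, length in column for _ in range(length)]
        for column in unit["columns"]
    ]
    return list(zip(*decoded))


unit = build_unit(ORDERS)
stored_entries = sum(len(column) for column in unit["columns"])
print(unit)
print(stored_entries, "stored entries for", 3 * len(ORDERS), "values")
```

Rows are still the logical unit that goes in and comes out, but inside the unit the repetitive columns collapse to a handful of runs, while a unique column such as the order ID gains nothing.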
Vertica took a different hybrid approach with the release of Vertica Database 3.5, which introduced FlexStore, a new version of the column-store engine, including the ability to group a small number of columns or rows together to reduce input/output bottlenecks. Grouping can be done automatically based on data size (grouped rows can use up to 1MB) to improve the performance of queries that read whole rows, or specified explicitly based on the nature of the column data (for example, bid, ask and date columns for a financial application) to improve the performance of queries that read those columns together.
Likewise, the Ingres VectorWise project (previously mentioned here) will create a new storage engine for the Ingres Database, positioned as a platform for data-warehouse and analytic workloads, making use of vectorized execution, which sees each operation applied to a batch (vector) of values at a time. The VectorWise architecture makes use of Partition Attributes Across (PAX), which similarly groups multiple rows into blocks to improve processing, while storing the data in columns.
Update – Daniel Abadi has provided an overview of the different approaches to hybrid row-column architectures and suggests something I had suspected, that Oracle is also using the PAX approach, except outside the core database, while Vertica is using what he calls a fine-grained hybrid approach. He also speculates that Microsoft may end up going the third route, fractured mirrors – Update
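The PAX idea can be illustrated with a small Python sketch: rows are first grouped into fixed-size pages, and within each page the values are laid out column by column. This is a simplified model under assumed names and page sizes, not the VectorWise implementation.

```python
# Hypothetical quote table: (bid, ask, quote_date)
QUOTES = [
    (10.0, 10.2, "2009-09-01"),
    (10.1, 10.3, "2009-09-01"),
    (10.4, 10.6, "2009-09-02"),
    (10.3, 10.5, "2009-09-02"),
    (10.8, 11.0, "2009-09-03"),
]


def to_pax_pages(rows, rows_per_page):
    """Split rows into pages, then store each page column by column."""
    pages = []
    for start in range(0, len(rows), rows_per_page):
        chunk = rows[start:start + rows_per_page]
        # One 'minipage' per column, holding only this page's values.
        pages.append([list(column) for column in zip(*chunk)])
    return pages


def scan_column(pages, column_index):
    """Read a single column across all pages, skipping the other columns."""
    for page in pages:
        yield from page[column_index]


def fetch_row(pages, rows_per_page, row_id):
    """Rebuild one row; all of its values live in the same page."""
    page = pages[row_id // rows_per_page]
    offset = row_id % rows_per_page
    return tuple(minipage[offset] for minipage in page)


pages = to_pax_pages(QUOTES, rows_per_page=2)
print(list(scan_column(pages, 0)))
print(fetch_row(pages, 2, 3))
```

A column scan only walks the relevant minipages, while rebuilding a row never has to leave a single page, which is the compromise between the two layouts that the hybrid approaches are after.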
Perhaps the future of the database may not be row- or column-based, but plaid.