Entries from March 2010 ↓
March 22nd, 2010 — Data management
Gear6’s Mark Atwood is less than impressed with my recent statement: “Memcached is not a key value store. It is a cache. Hence the name.”
Mark has responded with a post in which he explains how memcached can be used as a key value store with the assistance of “persistent memcached” from Gear6, or by combining memcached with something like Tokyo Cabinet.
As much as I agree with Mark that other technologies can be used to turn memcached into a key value store, I can’t help thinking his post actually proves my point: that memcached itself is not a key value store.
Either way it brings me to the next post in the NoSQL series (see also The 451 Group’s recent Spotlight report), looking at what the existing technology providers are likely to do in response.
I spent last week in San Francisco at the Open Source Business Conference where David Recordon, head of open source initiatives at Facebook, outlined how the company makes use of various open source projects, including memcached and MySQL, to scale its infrastructure.
It was an interesting presentation, although the thing that stood out for me was that Recordon didn’t once mention Cassandra, the open source key value store created by Facebook, despite being asked directly about the company’s plans for what was rather quaintly referred to as “non-relational databases”.
In fact, this recent post from Recordon puts Cassandra in context: “we use it for Inbox search, but the majority of development is now being led by Digg, Rackspace, and Twitter”. It is technologies like MySQL and memcached that Facebook is scaling to provide its core horsepower.
The death of memcached, as they say, has been greatly exaggerated.
That said, it is clear that to some extent the rise of NoSQL can be explained by CAP Theorem and the inability of the MySQL database to scale consistently. Sharding is a popular method of increasing the scalability of the MySQL database to serve the requirements of high-traffic websites, but it’s manually intensive. The memcached distributed memory object-caching system can also be used to improve performance, but does not provide persistence.
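To make the two techniques above concrete, here is a minimal sketch in Python. It is illustrative only: the class and function names (VolatileCache, ShardedStore, cached_read) are hypothetical, plain dictionaries stand in for both memcached and the MySQL shards, and the hash-modulo routing is just one simple sharding scheme, not a description of how any particular site does it.

```python
import hashlib


class VolatileCache:
    """Hypothetical stand-in for a memcached-style cache: memory only."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value

    def restart(self):
        # Nothing is written to disk, so a restart loses every entry.
        self._data.clear()


class ShardedStore:
    """Hypothetical stand-in for several MySQL shards (one dict per shard)."""

    def __init__(self, num_shards):
        self.shards = [{} for _ in range(num_shards)]

    def shard_for(self, key):
        # The application has to route each key to a shard itself.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self.shards)

    def read(self, key):
        return self.shards[self.shard_for(key)].get(key)

    def write(self, key, value):
        self.shards[self.shard_for(key)][key] = value


def cached_read(cache, store, key):
    """Try the cache first; on a miss, read the shard and fill the cache."""
    value = cache.get(key)
    if value is None:
        value = store.read(key)
        if value is not None:
            cache.set(key, value)
    return value


store = ShardedStore(num_shards=4)
cache = VolatileCache()
store.write("user:42", "Alice")

first = cached_read(cache, store, "user:42")   # miss, served from the shard
cached = cache.get("user:42")                  # now present in the cache
cache.restart()                                # simulate a cache restart
after_restart = cache.get("user:42")           # the cached copy is gone
second = cached_read(cache, store, "user:42")  # the shard still has the data
```

The routing logic lives in the application, which is part of why sharding is labor-intensive: changing the number of shards changes where keys land. The cache, meanwhile, only ever holds copies, so the database remains the system of record.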
However, an alternative to throwing out investments in MySQL and memcached in favor of NoSQL is to improve the MySQL/memcached combination. A number of vendors, including Gear6 and NorthScale, are developing and delivering technologies that add persistence to memcached (see recent 451 Group coverage on Gear6 and NorthScale), while Schooner Information Technology (451 coverage) and Virident Systems (451 coverage) have taken an appliance-based approach to adding persistence.
Another approach would be to improve the performance of MySQL itself. ScaleDB (451 coverage) has a shared-disk storage engine for MySQL that promises to improve its scalability. We have also recently come across GenieDB (451 coverage), which is promising a massively distributed data storage engine for MySQL. Additionally, Tokutek’s TokuDB MySQL storage engine is based on Fractal Tree indexing technology that reduces data-insertion times, improving the performance of MySQL for both read and write applications.
As we noted in our recent assessment of Tokutek, while TokuDB is effectively an operational database technology, it does blur the line between operations and analytics since the company claims it delivers a performance improvement sufficient to run ad hoc queries against live data.
Beyond MySQL, while we expect the database incumbents to feel the impact of NoSQL in certain use cases, the lack of consistency (in the CAP Theorem sense) inevitably enables quick dismissal of their wider applicability. Additionally, we expect to see the data management vendors take steps to improve performance and scalability. One method is through the use of in-memory databases to improve performance for repeatedly accessed data; another is through the use of in-memory data grid caching technologies, which are designed to solve both performance and scalability issues.
Although these technologies do not provide the scalability required by Facebook, Amazon, et al., the question is, how many applications need that level of scalability? Returning again to CAP Theorem, if we assume that most applications do not require the levels of partition tolerance seen at Google, expect the incumbents to argue that what they lack in partition tolerance they can make up for in consistency and availability.
Somewhat inevitably, the requirements mandated by NoSQL advocates will be watered down for enterprise adoption. At that level, it may arguably be easier for incumbent vendors to sacrifice a little consistency and availability for partition tolerance than it will be for NoSQL projects to add consistency and availability.
Much will depend on the workload in question, which is something that is being hidden by debates that assume a confrontational relationship between SQL and NoSQL databases. As the example of Facebook suggests, there is room for both MySQL/memcached and NoSQL.
March 19th, 2010 — Storage
I’m going to be presenting the introductory session at a BrightTalk virtual conference on March 25 on the role and impact of the virtual server revolution on the storage infrastructure. Although it’s been evident for some time that the emergence of server virtualization has had — and continues to have — a meaningful impact on the storage world, the sheer pace of change here makes this a worthwhile topic to revisit. As the first presenter of the event — the conference runs all day — it’s my job to set the scene; as well as introducing the topic within the context of the challenges that IT and storage managers face, I’ll outline a few issues that will hopefully serve as discussion points throughout the day.
Deciding on which issues to focus on is actually a lot harder than it sounds (I only have 45 minutes) because, when you start digging into it, the impact of virtualization on storage is profound on just about every level: performance, capacity (and more importantly, capacity utilization), data protection and reliability, and management.
I’ll aim to touch on as many of these points as time allows, as well as provide some thoughts on the questions that IT and storage managers should be asking when considering how to improve their storage infrastructure to get the most out of an increasingly virtualized datacenter.
The idea is to make this a thought-provoking and interactive session. Register for the live presentation here: http://www.brighttalk.com/webcast/6907. After registering you will receive a confirmation email as well as a 24-hour reminder email. As a live attendee you will be able to interact with me by posing questions which I will be able to answer on air. If you are unable to watch live, the presentation will remain available via the link above for on-demand participation.
March 15th, 2010 — Data management
One of the essential problems with covering the NoSQL movement is that the term describes not what the associated databases are, but what they are not (and doesn’t even do that very well, since SQL itself is in many cases orthogonal to the problem the databases are designed to solve).
It is interesting to see fellow analyst Curt Monash facing the same problem. As he notes, while there seems to be a common theme that “NoSQL is Foo without joins and transactions,” no one has adequately defined what “Foo” is.
Curt has proposed HVSP (High-Volume Simple Processing) as an alternative to NoSQL, and while I’m not jumping on the bandwagon just yet, it does pass the Ronseal test (it does what it says on the tin), and it also matches my view of what defines these distributed data store technologies.
I agree with Curt’s view that object-oriented and XML databases should not be considered part of this new breed of distributed data store technologies. There is a danger that NoSQL simply comes to mean non-relational.
I also agree that MapReduce and Hadoop should not be considered part of this category of data management technologies (which is somewhat ironic since if there is any technology for which the terms NoSQL or Not Only SQL are applicable, it is MapReduce).
The vendors associated with the NoSQL movement (Basho, Couchio and MongoDB) are in a problematic position. While they are benefiting from, and to some extent encouraging, interest in NoSQL, the overall term masks their individual benefits. My sense is they will look to move away from it sooner rather than later.
Memcached is not a key value store. It is a cache. Hence the name.
There are numerous categorizations of the various NoSQL technologies available on the Internet. Without wishing to add yet another to the mix, I have created another one – more for my benefit than anything else.
It includes a list of users for the various projects (where available), and also some sense of whether the various projects fit into CAP Theorem, an understanding of which is, to my mind, essential for understanding how and why the NoSQL/HVSP movement has emerged (look out for more on CAP Theorem in a follow-up post on alternatives to NoSQL).
Here’s my take, for those that are interested. As you can see there’s a graph database-shaped hole in my knowledge. I’m hoping to fill that sooner rather than later.
By the way, our Spotlight report introducing The 451 Group’s formal coverage of NoSQL databases will be available here imminently.
Update: VMware has announced that it has hired Redis creator Salvatore Sanfilippo, and is taking on the Redis key value store project. The image below has been updated to reflect that, as well as the launch of NorthScale’s Membase.
March 4th, 2010 — eDiscovery
IQPC’s 3rd E-discovery conference for Financial Services felt like a spa day after LegalTech. You get your CLE credit in a room of less than 40 people while being fed gourmet cookies in a comfortable chair with an expensive view of Times Square – unlike LegalTech, where you spend half your time in an elevator of 40 people, and someone has pushed the button for every floor.
There were some noteworthy insights for anyone considering an investment in e-discovery software or services. We’ve been crunching numbers for our E-discovery User Survey this week, with some interesting results:
- the overwhelming majority of respondents were still performing every part of e-discovery primarily in-house
- but about half were planning to make an e-discovery purchase in the next year
- however a large number of them hadn’t finalized their choice of product or vendor.
So, how to choose? Well, in the wake of “Zubulake Revisited,” there is now more judicial guidance on the e-discovery process and certainly more at stake.
To get an idea of what the courts are looking for and how companies are adapting, I attended IQPC’s Judicial panel on avoiding sanctions, as well as the panel on building a corporate e-discovery response team, featuring e-discovery senior management from Lehman Brothers Holdings, Barclays, MetLife and Bank of New York.
A few takeaways:
- Judges on the sanctions panel were not sympathetic about high data volumes, saying “Lawyers just have to start dealing with it and make requests and responses appropriate.” They rejected objections to “burdensome” ESI production requests and criticized litigants for lying about production costs to avoid producing data. One recommended native file production to cut costs rather than requesting images or paper (!)
- Judges called for earlier preservation with a written legal hold, particularly in the wake of Scheindlin’s “Zubulake Revisited” opinion, which they called a “shot across the bow.” One claimed that some companies spent ten-digit numbers on preservation alone, especially if they’re caught at a late stage and can’t easily go back. That figure sounds like I must have misheard, but I don’t argue with judges.
- Judges criticized the lack of cross-functional IT and legal expertise at Meet and Confer and in collection of data. They recommended consultant Craig Ball’s 50 questions to prepare for Meet and Confer [pdf], and advised that e-discovery collections be supervised by someone who should anticipate having to testify in court.
- In the corporate E-discovery panel, Lehman Brothers Holdings (the entity responsible for administering all of Lehman’s litigation) reported standardizing on a single review platform for collaborating with all of its law firms, claiming that they initially had “big fights,” but eventually everyone accepted it – probably no mean feat considering the volume of litigation Lehman faces.
- Another panelist noted that more corporations are taking control of review as well as collection and the earlier stages of e-discovery. She advised that law firms should analyze legal issues but corporations handle the facts, doing as much review as possible in-house or outsourcing it at lower rates to cut costs.
- No one on the corporate panel had any major objection to using SaaS e-discovery or storing legal data in the cloud.
Food for thought. We are wrapping up our E-discovery User Survey this month and distributing results, which will also be included in our upcoming e-discovery long form report – contact us if you are interested in purchasing. And many thanks to everyone who has participated in the survey.
March 3rd, 2010 — eDiscovery
There’s a special section in this week’s Economist on information management, entitled Data, Data Everywhere. It’s always good when your area of interest and coverage is on the cover of such an illustrious magazine. However, I read it and downloaded the PDF (which you can do as a subscriber) and searched that, and to my surprise there are two significant words close to my heart that don’t appear anywhere in the report. They are:
- discovery (as a short hand for e-Discovery, or just on its own)
- governance (as in information governance)
I know the author, Kenneth Cukier; he’s an excellent technology journalist and thinker with years of experience (we both spent perhaps way too long at the various meetings that hosted the various fights for control of the internet’s domain name system (DNS) in the 90s that led to the creation of ICANN).
Ken’s focus in the report was more on the data deluge created by the internet and how that affects individuals, mainly in the context of being a consumer, exploring issues such as personal privacy, and how companies such as Google and Wal-Mart manipulate and profit from data. There was very little talk about the problems that creating, storing, searching, archiving and deleting information imposes on companies.
And although there is a section on new regulatory constraints, it was again focused mainly on privacy, personal information as a property right, and the integrity of information held about individuals by corporations, with a token nod on the need to preserve digital records, but again looking at it from a consumer’s perspective.
All important topics, for sure. But not the one that a lot of companies are spending a lot of money grappling with now and in the future.
Now I’m not naive, and didn’t expect a multi-page spread on litigation support or an exploration of what early case assessment means in a weekly magazine with such a broad readership as the Economist! But e-Discovery and, more recently, information governance are shooting up the list of priorities of many CIOs (the ‘i’ does stand for information, after all) as they realize that without appropriate litigation readiness and information governance in place they could find themselves in a financial and legal sinkhole. Given that, I thought the subject warranted at least a paragraph or two among the 14 pages of text.
Update: Clearwell’s CEO Aaref Hilaly posted something on the same subject at almost the same time as me.