The Data Day: February 10, 2017

SEE YOU IN COURT, THE FUTURE OF DATA AND ANALYTICS AT STAKE!

And that’s the data day, today.

The Data Day: January 27, 2017

Alternative data platforms and analytics facts.

And that’s the data day, today.

The Data Day: January 20, 2017

The same people who did the phony election data, and were so wrong, are now doing approval rating analytics. They are rigged just like before.

And that’s the data day, today.

The Data Day: October 14, 2016

Is Data Platforms and Analytics Poisoning Our Children?

And that’s the data day, today.

The Data Day, A few days: April 9-15, 2016

The past, present and future of commercial Hadoop distributions. And more.

And that’s the data day, today.

The Data Day, A few days: January 30-February 8, 2016

Investment funding for Hadoop and NoSQL in 2015. And more.

And that’s the data day, today.

Updated Data Platforms Map – January 2016

The January 2016 edition of the 451 Research Data Platforms Map is now available for download.

Initially designed to illustrate the complexity of the data platforms market, the latest version includes an updated index to help you navigate the complex array of current data platform providers.

There are numerous additions compared to the previous map, especially in the area of event/stream processing. We have also reconsidered our approach to Hadoop-as-a-service, narrowing it down to distinct Hadoop offerings rather than hosted Hadoop distributions.

We have also tried to clean up our approach to the convergence of Hadoop and search, although that remains a bit of a work in progress, to be honest. There’s also something in there for eagle-eyed Silicon Valley fans.

You can use this map to:

  • compare capabilities, offerings, and functionality.
  • understand where providers intersect and diverge.
  • identify shortlists of choices to suit enterprise needs.

The latest version of the map can be downloaded here.

The Data Day, A few days: November 24 – December 9, 2015

Toward a converged data platform, and more.

And that’s the data day, today.

The Data Day, A few days: October 17-23, 2015

TIBCO targets ‘analytics for all’. And more.

And that’s the data day, today.

Hadoop (disambiguation)

What is Hadoop?

It should be fairly simple: in the beginning there was the Hadoop Distributed File System, Hadoop MapReduce, and the Hadoop Common set of utilities. Even with the addition of Apache YARN in 2013, just four projects officially form the core of Apache Hadoop.

However, this is not what most people refer to when they use the term ‘Hadoop’. Instead, most people mean the Hadoop-related projects that are combined with the Hadoop core to create Hadoop distributions.

As 451 Research’s Periodic Table of Hadoop illustrates, there are at least 40 projects that could be considered part of the Hadoop ecosystem (our table comprises Hadoop-related Apache Software Foundation projects, as well as other open source projects included in more than one Hadoop distribution). So ‘Hadoop’ represents pretty much any combination drawn from 40 or more projects.

Image: 451 Research’s Periodic Table of Hadoop

Hadoop’s creator Doug Cutting has asserted that Hadoop will evolve over time from a batch-processing engine to encompass a set of replaceable components in a wider distributed data-processing ecosystem. At the same time the word ‘Hadoop’ has evolved to become a catch-all brand for that wider distributed data-processing ecosystem.

That is potentially confusing, especially for later mainstream adopters as they seek to get their heads around what Hadoop is and what it is for. However, that’s not what this blog post is about. I’m less interested in defining what Hadoop is than in identifying what isn’t Hadoop.

When is Hadoop not Hadoop?

Recent announcements from the original Hadoop commercial supporter, Cloudera, have highlighted the significance of this question. First it anointed Spark as the successor to MapReduce, then it launched Kudu, a new storage engine and potential alternative to the Hadoop Distributed File System (HDFS).

If the company’s plans for Spark and Kudu play out, pretty soon we could see a whole lot of ‘Hadoop deployments’ that make use of neither MapReduce nor HDFS – the primary initial Hadoop core projects. This isn’t just a potential outcome. Already today it is perfectly plausible that a ‘Hadoop deployment’ might not involve MapReduce or HDFS – it could involve Spark accessing data in AWS S3 for example.
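To make the point concrete, here is a minimal sketch that checks a deployment's component list against the four core projects named earlier. The function name, the dictionary keys, and the example component lists are all hypothetical, chosen purely for illustration; this is not an official taxonomy or anyone's real inventory tool.

```python
# Hypothetical illustration: the names and component lists below are
# assumptions based on the post's description, not an official taxonomy.

# The four projects the post identifies as the core of Apache Hadoop.
HADOOP_CORE = frozenset({"HDFS", "MapReduce", "Common", "YARN"})

# The two primary initial core projects called out in the post.
ORIGINAL_ENGINES = frozenset({"HDFS", "MapReduce"})


def describe_deployment(components):
    """Report which core projects a deployment uses and which it does not."""
    used = set(components)
    return {
        "core_used": sorted(used & HADOOP_CORE),
        "non_core": sorted(used - HADOOP_CORE),
        "uses_original_engines": bool(used & ORIGINAL_ENGINES),
    }


# A deployment of the kind described above: Spark reading data from AWS S3.
spark_on_s3 = describe_deployment({"Spark", "S3"})

# An early-style deployment, for comparison.
classic = describe_deployment({"HDFS", "MapReduce", "Common"})

print(spark_on_s3)
print(classic)
```

Under these assumptions, the Spark-on-S3 example reports no use of MapReduce or HDFS at all, yet it is still the sort of thing that commonly gets called a ‘Hadoop deployment’.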

Both Spark and Kudu are open source and are clearly part of the wider Hadoop ecosystem, but where do you draw the line in terms of what is and isn’t ‘Hadoop’?

Vendors are increasingly layering additional proprietary components on top of this Hadoop ecosystem for differentiation. MapR has most obviously blurred the lines between Hadoop and not Hadoop, but Cloudera Enterprise could also arguably be put in a ‘Hadoop+’ category along with things like Pivotal Big Data Suite and IBM BigInsights.

Then there are things that aren’t even claimed to be Hadoop but, on closer inspection, bear a close resemblance as ‘Hadoop’ evolves beyond its core. For example, the Stratio Platform is based on Apache Spark and other Apache projects including Flume and Kafka. It isn’t claimed to be Hadoop, but it enables data to be stored in the Hadoop Distributed File System (as well as AWS S3, Elasticsearch, MongoDB, Apache Cassandra, Redis, and relational databases), so it is surely part of the same wider family of data platforms.

If not Hadoop, then what?

So what should we call this wider family of data platforms – including Hadoop+ and ‘other’? Due to the pick-and-mix nature of the Hadoop ecosystem there is no easy way to answer that in terms of technology or use-cases. The products and services will be designed specifically to deliver a mix of data processing and storage capabilities, including MapReduce, SQL engines and stream processing, as well as HDFS, HBase, S3 and Kudu, and much more besides, both proprietary and open source.

Indeed it is probably easier to think about this not in terms of technologies but the symbols that represent them. If Hadoop was originally symbolised by an elephant then what symbol best conveys the category of data platforms based on the wider Hadoop ecosystem and beyond?

Given the veritable menagerie of animals (and inanimate objects) that represent the various Hadoop ecosystem projects – elephant, pig, bee, tortoise, falcon, giraffe, orca, squirrel, hippopotamus, antelope, phoenix, kylin, roadrunner, hummingbird – there is surely only one choice: the Chimera.

Image: Chimera di Arezzo
Source: Wikimedia

For those not acquainted with Greek mythology, the Chimera was a fire-breathing, multi-headed hybrid creature composed of the parts of more than one animal. While the Chimera was classically composed of the features of a lion, a snake and a goat, the term chimera can be used to describe any animal with parts taken from various animals.

As such it is perfect to symbolise the multi-headed hybrid Hadoop-based data platforms we see evolving. We are therefore tempted to use the term Chimeric Data Platform to describe this wider category of data platforms that are building on and expanding from Hadoop.

The fact that Merriam-Webster further defines chimera as “something that exists only in the imagination and is not possible in reality” is an added bonus that appeals to our sense of humour.