Google’s enterprise search: in the cloud & in a box

Google has changed the name and the scope of the Website search it offers to Website owners that want a little more than simply to know that their site is being indexed by Google, but don’t want to go as far as buying one of its blue or yellow search appliances. 451 clients can read what we thought of it here.

Google has three levels of Website search to offer organizations: a completely free option with no control over which parts of your website are indexed and when, known as Custom Search Edition/AdSense for Search (CSE/AFS); the newly rebranded Google Site Search; and the Google search appliances, sold in Mini and Search Appliance form factors, which can be used for both external-facing Website search and intranet search.

Google stopped issuing customer numbers for its appliances in October 2007. At that point it had sold to about 10,000 organizations. I suspect that number is around 11,500 now, though I don’t have any great methodology to back that up; I’m just extrapolating from previously issued growth figures. That’s an extraordinary number of organizations with a Google box.

To give some perspective, Autonomy has ~17,000 customers now. But the vast majority came from Verity. When Autonomy bought Verity in November 2005, Verity had about 15,000 customers (and Autonomy had about 1,000). But Verity got about 8,000 of those customers via its acquisition of Cardiff Software in February 2004. So in about 2.5 years Autonomy has added about 1,000 customers, but of course it has done a lot of up-selling to its base and doesn’t play in the low-cost search business anymore (mainly because of Google).

The actual number of Google appliances sold is higher, of course, as many organizations have multiple appliances. I’ll never forget standing, 18 months or so ago, in a room at a top-three Wall Street investment bank with its top ~25 technologists gathered, and seeing about six of them put up their hands when asked who had a Google appliance. Most of those appliances weren’t known about by their owners’ boss or by each other.

But Google appliance proliferation is commonplace in large organizations. The things are so cheap and so relatively easy to install that they are often bought under the radar of IT. The problem comes when times get tough (as they are in investment banking IT, that’s for sure) and the organization wants to wring more out of the assets it has, even if it didn’t know it had those assets until relatively recently.

That’s why we strongly expect Google to come out with some sort of management layer this year to handle this sort of unintended (by the customer, that is) proliferation. Watch this space.

Text analysis + content management = insight

We have long wondered why more content management vendors don’t fully embrace text analysis (or even enterprise search for that matter).

These guardians of most organizations’ unstructured data were beaten to the punch in terms of exploiting text by business intelligence companies, which are more accustomed to manipulating structured data. It’s great that the BI companies are starting (slowly) to embrace the idea of unlocking the value locked within unstructured text, but it’s somewhat bizarre that content management vendors didn’t get there first.

We said this many years ago, most coherently in mid-2005 in our report called Text-aware applications: the endgame for unstructured data (the clue’s in the title).

In that report we said:

“…while the penetration of content management systems is relatively high when compared with other ways of managing unstructured data, these systems do little at present to help analyze that unstructured data.”

and somewhat optimistically:

“Indeed, despite the CMS’s [content management systems] ability to organize, most implementations rarely attempt to push into anything that could be considered a semantic understanding of the content. This may be set to change, however, with some vendors, such as EMC, making headway in automatically parsing documents at a deeper level than just file-level metadata.”

That was a tad premature on our part.

Think about the main players and what they do to understand what resides in the documents they ‘manage.’

EMC Documentum - it has its content intelligence services classification engine, sure, and it bought a federated search product many moons ago, but neither is exactly front and center in its product strategy. And ILM (try searching on that now at EMC and see what you get) only dealt with file-level metadata, not semantic metadata. However, the X-Hive acquisition was an interesting one from this standpoint (see below for more on XML databases).

Vignette - bar an OEM relationship with Autonomy (which most vendors have), nothing much doing here, despite the need for Web content management to increase its understanding of the text it’s managing to make websites more attractive to advertisers (think of using text analysis to build links to other content automatically to keep visitors on the site longer).

Interwoven - Metatagger isn’t exactly at the bleeding edge any more, although the idea is sound.

IBM Filenet - here there is hope. IBM has taken a classifier it got from its iPhrase acquisition and used it to do initial classification to help determine what should or should not be deemed a record. IBM has all sorts of text analysis toys to play with and we expect more from it in the future.

Open Text - it once had five search engines, and was a pioneer in that space. But I’m not aware of anything it does to extract meaning from the content it manages.

Autonomy - Its tagline is ‘Meaning-based computing.’ It owns a powerful classification engine but now also owns records management and a bunch of other stuff. It’s the one company that checks most of the boxes here (but isn’t a document or Web content management vendor). But as the company currently refuses to talk to us, we’re in the dark as to which bit fits where and are unable to tell our clients what benefits Autonomy could bring them as a result. If the company cares to get in touch with me, I’m here.

This post was prompted partly by a recent conversation I had with Nstein. It is morphing from being a struggling text analysis vendor laden with debt (it’s publicly traded in Canada, so the numbers don’t lie) to a fast-growing combination of Web content management, digital asset management (via acquisitions in 2006 and 2007) and text analysis, built atop an XML database licensed from IxiaSoft. It’s focusing exclusively on the largest publishing companies, using the text analysis to automatically create links between new and archived content (thus pushing it up Google rankings). It competes with Mark Logic and Interwoven, mainly.
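
For readers who want a feel for what automatically linking new and archived content involves, here is a very rough Python sketch. It is not Nstein’s method (which I haven’t seen the inside of); it simply scores archived articles by Jaccard overlap of their words with a new article. The stopword list, the threshold and the sample archive are all made up for illustration, and real text analysis would work on entities and concepts rather than raw words.

```python
import re

# Tiny, hypothetical stopword list; a real system would use a far larger one.
STOPWORDS = {"the", "a", "an", "of", "in", "to", "and", "for", "on", "is"}

def terms(text):
    """Return the set of lowercase words in text, minus stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def related(new_article, archive, threshold=0.2):
    """Rank archived titles by Jaccard word overlap with the new article."""
    new_terms = terms(new_article)
    scored = []
    for title, body in archive.items():
        old_terms = terms(body)
        union = new_terms | old_terms
        score = len(new_terms & old_terms) / len(union) if union else 0.0
        if score >= threshold:
            scored.append((score, title))
    # Highest-scoring archived articles first.
    return [title for score, title in sorted(scored, reverse=True)]

# Made-up archive: title -> body text.
archive = {
    "Bank earnings 2007": "investment bank reports record earnings",
    "Pie recipes": "bake the pie until eggs are set",
}
print(related("Investment bank earnings fall", archive))
```

Word overlap is about the crudest signal available, which is rather the point: anything that understands the content can do considerably better.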

Any Gmail user who looks in their spam folder and sees ads for “Spam Swiss Pie - Bake 45-55 minutes or until eggs are set” can appreciate that crude keyword matching against content is next to useless.
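
To make the point concrete, here is a minimal Python sketch of crude keyword matching. It is an illustration only, not a description of how Gmail’s ad system works; the ad inventory, the sample text and the function name are all invented for the example.

```python
import re

# Hypothetical ad inventory: ad text keyed by the single keyword that triggers it.
ADS = {
    "spam": "Spam Swiss Pie - Bake 45-55 minutes or until eggs are set",
    "mortgage": "Compare mortgage rates today",
}

def match_ads(text, ads=ADS):
    """Return every ad whose trigger keyword appears as a word in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return [ad for keyword, ad in sorted(ads.items()) if keyword in words]

# The matcher sees the word "spam" and has no idea it means junk mail here.
folder_view = "Spam folder: messages in Spam for more than 30 days will be deleted"
print(match_ads(folder_view))
```

The matcher does exactly what it was told to do, which is the problem: it has no notion that the word means junk mail rather than tinned meat.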

There’s so much more that can be done here and so much insight being left on the table, whether it be in better website management to attract readers, voice of the customer analysis tied to BI, or government intelligence.

Tools that manage content need to understand that content - its language, its meaning, its sentiment. Otherwise, they are missing a trick.

Our take on M&A in enterprise search

I’ve gathered all my current thinking on potential M&A in enterprise search in a SectorIQ that we published earlier this week to our customers. In it, I look at four main potential targets plus a few smaller ones, and at a few of the likely acquirers. (This is the way we write all our Sector IQs, btw, and they’re a great way of getting a quick grasp on what might be coming down the pike in any particular sector of the IT industry.)

Fortunately those of you that are not our customers (yet!) are able to read it via our arrangement with the New York Times DealBook section. Click here to see the NY Times posting or go here to go straight to the report - and while you’re there, sign up for a trial of our M&A KnowledgeBase, where we’ve been collecting details of every IT, internet and telecoms deal since the start of 2002!

Finally, a quick word about the headline. We like to have some fun here at 451 with these things and while I appreciate that this one might have been pushing things a little in terms of clearly explaining what the report was about, when else would I be able to use it? ;)