||[Sep. 16th, 2004|11:41 am]
So for the past few days at work I've been working on a profile of the enterprise search market for one of our potential clients. I hit a surprise this morning when I saw some exciting technology in action, so I thought I'd talk about it here a bit.
First off, here's why search is important:
The internet is the largest repository of what's called 'unstructured data' ever to exist. Some people argue with the term unstructured data because all data is structured somehow, but come on--if you’ve read most people’s blogs you know what we mean when we say unstructured. It’s not a database, it’s a bunch of different file formats and content types, and no one program can read them all, let alone load them all at once.
All that data needs to be organized somehow in order to be useful. The traditional way of doing this is with a search engine, like Google. But the problem with traditional searching is that you have to know what you’re looking for. In the business world, that’s not good. You need to data mine--find out when things are happening that you don’t already know about. Traditionally the only way to do that is to search for a million specific things and set up disgusting numbers of alarm bells so that you can be reasonably sure that you catch everything significant. It’s costly and wasteful and very, very imperfect.
There are two factors I think matter: depth and breadth. Google is a very, very deep search engine--it can get you all kinds of relevancy, but pretty much only from text or XML documents. On the other hand, a company called Convera makes probably one of the broadest search engines out there--it’s under heavy and growing use by government because it makes a suite of products that can search text documents, perform voice recognition and search on audio calls, recognize objects in video recordings, and more. I would call the product of breadth and depth a “contextual coefficient,” which would act as a rough metric of how generally useful a search product is. Convera is one of the broadest out there, but it’s not deep enough yet to have a marketable contextual coefficient.
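To make the “contextual coefficient” idea a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the product names are hypothetical, the depth and breadth scores are made-up numbers on a 0-to-1 scale, and nobody has actually measured any real product this way. It only shows why multiplying the two factors rewards a product that is decent at both over one that is excellent at just one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchProduct:
    """A hypothetical search product with illustrative, made-up scores."""
    name: str
    depth: float    # assumed 0..1: how relevant results are within a medium
    breadth: float  # assumed 0..1: how many kinds of media it can search

    def contextual_coefficient(self) -> float:
        # The product of breadth and depth, as described in the post.
        return self.depth * self.breadth


# Hypothetical products, not measurements of any real engine.
products = [
    SearchProduct("deep_text_only", depth=0.9, breadth=0.2),
    SearchProduct("broad_but_shallow", depth=0.3, breadth=0.9),
    SearchProduct("balanced", depth=0.6, breadth=0.6),
]

# Rank from most to least generally useful under this metric.
ranked = sorted(products, key=lambda p: p.contextual_coefficient(), reverse=True)

for product in ranked:
    print(f"{product.name}: {product.contextual_coefficient():.2f}")
```

Because the score is a product rather than a sum, a product that is nearly useless on one axis scores low no matter how strong it is on the other, which is the point of the Convera comparison.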
So what you need is an automated system that searches through the unstructured content out there and tells you what’s relevant based on a low-precision question. The comparison I made in conversation with Will recently was to Star Trek--you should not have to say “Tea, Earl Grey, sixty-eight degrees, and in a cup this time!” “Tea, Earl Grey, Hot” should be sufficient. More importantly, though, the government should be able to have an automated program that searches all the available information from blogs, criminal networks, wiretaps, Carnivore, and so on, and have only the contextually dangerous stuff show up so it doesn’t end up arresting everyone under the sun on suspicion of terrorist plotting.
This is a picture of where we are, currently. As you can see, the highest-ranked thing on the graph is a manpower-generated content organizer, the extremely popular blog BoingBoing. Cory, Xeni et al. post cool stuff they find, and it takes a lot of man-hours to get all that stuff up there.
But technology is getting closer all the time.
This morning I had my first consumer-end encounter with this sort of predictive contextualization, and it was fantastic. I was searching for the ingenious High Fidelity quote in which John Cusack talks about women’s panties (They save the best pairs for when they know they’re gonna sleep with somebody, but he just wants the white cotton pairs now, etc. etc.). I thought the quote had the word ‘panties’ in it, so I was searching for combinations involving that word. Finally I hit upon the search "high fidelity quotes" panties and found my quote (a small piece of it, anyway) in the only English search result produced. But the interesting thing about this is that the located page does not, in fact, contain the word panties at all.
I have checked and rechecked the following assumption with other search results and I am now essentially positive that Google is smarter than ever. It has created a contextualized definition of the word ‘panties’ that includes references to women’s underwear. Since it couldn’t find pages that contain the word panties directly, it is instead using that contextual definition, built from the aggregate of the indexed content containing that word, and substituting some other word set (roughly, “women’s underwear”) in order to produce search results which might help me. And it succeeded.
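Here is a toy sketch in Python of the kind of substitution I am guessing at. To be clear, this is not Google's actual method, which I have no way of seeing; it is just one simple way the effect could be produced, by counting which words show up alongside the query word in a reference corpus and falling back to those when the word itself matches nothing. The corpus, the page names, the stopword list, and the function names are all made up for illustration.

```python
from collections import Counter

PUNCTUATION = ".,:;!?\"'()"
# Tiny illustrative stopword list (an assumption, not a standard one).
STOPWORDS = frozenset({"and", "other", "on", "the", "a", "of", "their", "they"})


def tokenize(text):
    """Lowercase the text and strip surrounding punctuation from each word."""
    words = (w.strip(PUNCTUATION).lower() for w in text.split())
    return [w for w in words if w]


def related_terms(term, corpus, top_n=1):
    """Return the words that most often co-occur with `term` in the corpus."""
    counts = Counter()
    for doc in corpus:
        words = set(tokenize(doc))
        if term in words:
            counts.update(words - {term} - STOPWORDS)
    # Sort by count, then alphabetically, so ties are resolved predictably.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ordered[:top_n]]


def search(term, pages, corpus, top_n=1):
    """Match pages on the literal term, else on its co-occurring terms."""
    term = term.lower()
    tokens = {name: set(tokenize(text)) for name, text in pages.items()}
    direct = [name for name, words in tokens.items() if term in words]
    if direct:
        return direct
    expansions = set(related_terms(term, corpus, top_n))
    return [name for name, words in tokens.items() if words & expansions]


# Made-up reference corpus that defines the word by its company.
corpus = [
    "panties and other underwear on sale",
    "women's underwear: panties, briefs",
    "cotton panties underwear guide",
]

# Made-up pages to search; neither contains the word "panties".
pages = {
    "quotes_page": "high fidelity quotes: they save their best underwear",
    "other_page": "high fidelity soundtrack listing",
}

print(search("panties", pages, corpus))
```

In this toy setup the query word never appears on either page, but the quotes page still comes back because “underwear” is the word that most often keeps company with “panties” in the corpus.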
It’s pretty incredible that the service can interpolate to that extent. What defines human intelligence except the ability to contextualize and make indirect links or ‘intuitive’ jumps? I think that sort of search relevancy is what will eventually lead directly to real, working, TNG-level AI, and it’s pretty beautiful. More importantly, it’s not as far off as you all think. When search technology produces strong relevancy from vague queries (or even completely unasked for) through a sufficient variety of media, what else can you say besides "Can I be your friend when you become self-aware?"