A Thoughtful Search Engine

Search engines are OK if you already know what you’re searching for: a hotel in Kyoto, the train times to Darmstadt, or Peter Stringer’s international career to date. You enter a couple of keywords, slap that search box, and voilà! If instead you just want to have fun, you can scan what companies want to advertise to you, or what many other folks have already searched for, or even have a gawk at what your friends want to recommend to you.

Isn’t this cool? Isn’t this a great way to navigate through much of the immense mediocrity of the world wide web?

Well, actually, no.

I know that an incredible resource of human experience, thought and wisdom really should be in front of me via my screen: but our search engines are for dummies, and they make it incredibly tedious to gain insights and to learn. Engineers, researchers, scientists, analysts, strategists, philosophers, historians, explorers, journalists, innovators, creators, artists, business case writers, case lawyers, authors, patent managers, medical case reviewers, poets, policy makers, thinkers – in fact, almost all of us – want to find inspiration, insights and ideas by browsing and exploring the wisdom and experience of humankind.

Give search engines a couple of keywords, abracadabras or shibboleths and they will return an intricate interlaced spaghetti noodle goulash of references and citations, biased by some apparently arbitrary mixture of advertisers and popularity. But to explore and learn, you have to scan most of these search results yourself, and mentally trash almost all of them whilst trying to discover the really valuable nugget lost in there somewhere. I don’t want a search engine: I want a discovery engine, an insight catalyst, and a thought provoker.

Here’s an example of what I mean.

Hans Modrow is a professional politician. He became the last communist premier of East Germany just after the Berlin Wall fell on the 9th of November 1989. Look him up on Google and you’ll get a screenful like this:

So, what was Modrow’s role in the fall of the Berlin Wall? How did he react to Helmut Kohl’s proposal to unite the currencies of East and West Germany? What was his role in dismantling the East German spy network after the Cold War? Let’s scroll on down Google’s findings and (forlornly) try to find out:

Ummmmmm. Define Hans Modrow? Like, yeah. Explore and discuss him? Perhaps. A Hans Modrow horoscope? Thanks, but don’t call me. Maybe we really can get his email, address and phone number (everything!)? I kinda doubt it...

Is Google then the right tool to explore and learn? Perhaps it is only really good for pointing us to a myriad of wiki entries, biographies, definitions, encyclopedia entries and other scraps that may or may not contain what we’re after.

Google does have a “related searches” feature. But it really is pretty dismal:

It has a “nearby” search option, so maybe we can discover related topics about Hans Modrow. But, this is what we get:

So: Google says there is nothing nearby.

Google even has a wonderfully weird “wonder” wheel:

How “wonder”ful is this spokey disk ? I’ll let you decide….

Bing really isn’t much better:

And as for Yahoo:

And you can try other, less well-known engines. Here’s Dogpile:

And I’ll let you decide what the result of a Dogpile is.

Last year, blekko was launched as a curated search engine. Human editors have carefully chosen a set of web sites and manually annotated the results of common searches with tag values (“slashtags”) to help provide a better experience. For example, searching for “Kona /weather” will give you the weather forecast for Kona. Crowdsourcing (as in Wikipedia) is used to increase the portfolio and accuracy of the slashtags.

So let’s try blekko:

Well, that gave us the usual interlacing of results that a traditional search engine would give us.

So let’s try and guide blekko a little by using one of its slashtags to help it:

These results are interesting, because they’re bringing to our attention other people relevant to Hans Modrow’s career.

So, let’s go further, and perhaps research Modrow’s involvement in the German currency crisis:

Oh dear. No slashtag on currency yet, although we’re invited to go and make it ourselves. Thanks.

We quickly can see how blekko’s curated search can help guide common queries, but may not always be as helpful when conducting research, trying to discover linkages and insights that few others – and indeed maybe nobody else – have yet seen.

It seems kinda ironic. The world wide web was originally proposed by Tim Berners-Lee in 1989 as a resource for sharing the results of physics research, enabling scientists to discover and draw on the work of their peers much more efficiently. But more than two decades later, the current search engines still really don’t seem to help this mission very efficiently.

Tim Berners-Lee has been promoting the semantic web as an emerging successor to the current web: the semantic web promises to make the web content of human mumblings more accessible and amenable to machine processing. It should make searching an insightful experience, including across scientific domains. The semantic web is a tremendous research initiative and vision, but it has yet to achieve widespread delivery (footnote disclosure: I’m on the advisory board of DERI).

A different perspective has resulted from Web 2.0 technologies: maybe it would be easier just to tag articles with the keywords therein, and thus guide searching and browsing. Delicious, Flickr and YouTube are all examples of websites which use tagging to help classify web content. The curating site blekko adds to this trend.

There are also software tools to automate tagging. Here’s the experience of the New York Times, in building an online archive of their articles:

“The New York Times Annotated Corpus contains over 1.8 million articles written and published by the New York Times between January 1, 1987 and June 19, 2007 with article metadata provided by the New York Times Newsroom, the New York Times Indexing Service and the online production staff at nytimes.com. The corpus includes:
• Over 1.8 million articles (excluding wire services articles that appeared during the covered period).
• Over 650,000 article summaries written by library scientists.
• Over 1,500,000 articles manually tagged by library scientists with tags drawn from a normalized indexing vocabulary of people, organizations, locations and topic descriptors.
• Over 275,000 algorithmically-tagged articles that have been hand verified by the online production staff at nytimes.com.
• Java tools for parsing corpus documents from .xml into a memory resident object.

As part of the New York Times’ indexing procedures, most articles are manually summarized and tagged by a staff of library scientists. This collection contains over 650,000 article-summary pairs which may prove to be useful in the development and evaluation of algorithms for automated document summarization. Also, over 1.5 million documents have at least one tag. Articles are tagged for persons, places, organizations, titles and topics using a controlled vocabulary that is applied consistently across articles. For instance if one article mentions “Bill Clinton” and another refers to “President William Jefferson Clinton”, both articles will be tagged with ‘CLINTON, BILL’.”
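The controlled-vocabulary idea in that last sentence can be sketched in a few lines. This is only an illustration of the principle, not the New York Times’ tooling: the vocabulary table, the function names and the sample sentences are all made up for the example, and real indexing relies on trained library scientists rather than substring matching.

```python
# Hypothetical controlled vocabulary: each canonical tag lists the name
# variants it covers. Only the 'CLINTON, BILL' example comes from the quote.
CONTROLLED_VOCABULARY = {
    "CLINTON, BILL": ["Bill Clinton", "President William Jefferson Clinton"],
}


def build_variant_index(vocabulary):
    """Map every lower-cased name variant to its canonical tag."""
    index = {}
    for tag, variants in vocabulary.items():
        for variant in variants:
            index[variant.lower()] = tag
    return index


def tag_article(text, variant_index):
    """Return the sorted canonical tags whose variants occur in the text."""
    lowered = text.lower()
    return sorted({tag for variant, tag in variant_index.items() if variant in lowered})


VARIANT_INDEX = build_variant_index(CONTROLLED_VOCABULARY)

# Two invented sentences that name the same person differently.
article_a = "Bill Clinton met reporters on Tuesday."
article_b = "President William Jefferson Clinton signed the bill."

tags_a = tag_article(article_a, VARIANT_INDEX)
tags_b = tag_article(article_b, VARIANT_INDEX)
```

Both sample sentences end up with the same tag, which is the whole point of a normalized vocabulary. Someone still has to build and maintain that variant table by hand.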

Wow! 1.8 million articles, of which 1.5 million have been read, carefully and expensively tagged by a dedicated team of archivists. At least one tag per article. Just 275,000 of the articles were tagged by some software tool and, even then, all these automated tags had to be checked by hand.

And all of this intense effort is just for 20 years of archives, from 1987 to 2007. But the full archive of the NY Times goes back to 1851 and has over 13 million articles!

Let’s look up “Hans Modrow” in the NY Times archive. We get:

followed by:

We’re getting back a somewhat random list of different articles, all of which have something to do with Hans Modrow, but without any clear structure to the results of the search: does sorting by “closest match” mean random order, or date order, or byte size, or what?

It seems such a good idea: we desperately need a better search tool, so let’s augment articles with meta-data, whether it be tags, ontologies, taxonomies, dictionaries, thesauri, whatever... This can all be automated, perhaps. Then, having gone to the bother of doing all this, sometimes the search results really don’t seem to be significantly better...

Why can’t we build an engine which can understand the information in situ, without the need for externally provided meta-data ?

We can. The team at Sophia Search have done just that. Here’s what Sophia gives us when we ask her what the New York Times archives have on Hans Modrow:

Immediately we can see different aspects of Modrow’s career. Sophia has automatically collected together articles on each of a number of themes relating to Modrow: Modrow’s involvement in the currency issue (50 articles); separately, a theme about Modrow’s involvement in the fall of the Berlin Wall (42 articles); separately, a theme about Modrow’s role in reaching out to the non-Communist opposition parties (37 articles); and so on.

Clicking into any of these themes (say, the second theme on the currency crisis after the fall of the Berlin Wall) gives associated neighbouring articles:

We can click through on any of the articles to scan and read them:

Our original discovery term “Hans Modrow” appears in blue in the article. The Sophia web interface has given us back the raw text of the article, having stripped out the formatting and so the layout looks a little ugly. However the URL to the original article is given just below the title and author, so we can easily retrieve the original:

Going back to the articles listed in the (second) theme as above, we can explore the neighbours of each document:

We learn that as Modrow assumed power, apologising contritely for the failures of the communist party in serving the citizens of East Germany, the Constitution was amended to remove the privileged position of the party:

OK. Sophia has carefully sorted all the NY Times articles relating to Hans Modrow into a group of different themes, with each theme pertaining to a different aspect of his career. Within each theme it’s pretty easy to explore what happened, and also to discover additional details and insights via the neighbours of any article which we choose to examine.

Pretty cool.

And Sophia has no need or use whatsoever for any ontology, taxonomy, human written tags, software generated tags, dictionaries, whatever. And yes, the core engine and algorithms are natural language independent: the Sophia guys have a Russian language version simply out of the box.

Pretty, pretty cool.

But here’s the best bit. Did you notice the “related documents” tab? Sophia thoughtfully suggests other documents to us which do NOT have our specified keyword terms, but nevertheless are pertinent to the topic and theme which we are exploring. She understands that we’re looking for certain search terms, and suggests additional items which are relevant even though they do not contain the search terms we gave...

So, clicking the “related documents” tab gives us a bunch of other articles about our chosen second theme (the fall of the Berlin Wall) which do not explicitly mention our search term (“Hans Modrow”) but which are still very useful and insightful:

Selecting the first of these, just as an example, confirms that the term “Hans Modrow” doesn’t actually appear:

(here’s the bottom half of the article:)

This is really neat. Not only does Sophia return articles containing the keywords we specify, but also articles which don’t have those keywords, but nevertheless are highly relevant.

Now, if you are an engineer, researcher, scientist, analyst, strategist, philosopher, historian, explorer, journalist, innovator, creator, artist, business case writer, case lawyer, author, patent manager, medical case reviewer, poet, policy maker, or thinker – and I suspect you are at least one of these – consider how useful Sophia is going to be if you use it with your favourite corpus.

Sophia is not an internet search – not yet, anyway… It (currently) is intended to help knowledge workers become far more productive and insightful with their chosen collections of unstructured data and documents.

How does Sophia do it? No taxonomies? No ontologies? No tags? No dictionaries? No meta-data at all? Nothing? And for any natural language?

Crudely, you can compare the Sophia core engine to a hash table. It computes a magic number based on the contents of each item, rather like a hash value. It turns out that documents which have related content come out with similar hash values…
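To make the hash-table analogy concrete, here is a minimal sketch of one well-known technique with that property, SimHash, where texts sharing many words tend to get fingerprints that differ in only a few bits. To be clear, this is a swapped-in illustration and not Sophia’s algorithm, which isn’t described here; the function names, the word-splitting rule and the choice of MD5 as the per-word hash are all assumptions made for the example.

```python
import hashlib
import re


def simhash(text, bits=64):
    """Return a SimHash fingerprint of the text (a sketch; bits should be <= 64)."""
    counts = [0] * bits
    for word in re.findall(r"[a-z]+", text.lower()):
        # Take the first 8 bytes of MD5 as a 64-bit hash of the word.
        digest = hashlib.md5(word.encode("utf-8")).digest()
        word_hash = int.from_bytes(digest[:8], "big")
        # Every word votes +1 or -1 on each bit position.
        for i in range(bits):
            counts[i] += 1 if (word_hash >> i) & 1 else -1
    # A fingerprint bit is set where the positive votes won.
    fingerprint = 0
    for i in range(bits):
        if counts[i] > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

You would compare two documents with `hamming_distance(simhash(doc_a), simhash(doc_b))`: the smaller the distance, the more alike the word content is likely to be. Unlike an ordinary hash table, the aim is for related items to collide, or nearly so.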

Or if you ever studied information theory, and in particular Shannon’s pioneering work, you’ll have an idea of the intrinsic information content of a stream of encoding symbols. Sophia builds on this approach to derive the actual entropy of the information within (the symbol streams given by) documents, and collects and curates articles having similar entropy.
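For anyone who hasn’t met Shannon’s measure, here is the textbook calculation as a small sketch: the entropy, in bits per symbol, of a stream of symbols based on how often each symbol occurs. It shows only the basic definition; how Sophia actually builds on entropy isn’t spelled out here, and the function name and the choice of characters as symbols are assumptions for the example.

```python
import math
from collections import Counter


def shannon_entropy(symbols):
    """Return the Shannon entropy, in bits per symbol, of a symbol sequence."""
    counts = Counter(symbols)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # H = sum over symbols of p * log2(1 / p), where p = count / total.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


uniform_two = shannon_entropy("abab")    # two equally likely symbols
uniform_four = shannon_entropy("abcd")   # four equally likely symbols
constant = shannon_entropy("aaaa")       # one symbol, no surprise at all
```

A stream that never varies carries no information, while one that spreads evenly over more symbols carries more bits per symbol. The function accepts any sequence, so a list of words works as well as a string of characters.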

Sophia is really, really, really cool. Give the Sophia guys a call: we can give you a live demo (and WebEx if you wish), and get a trial going on your chosen document corpus.

And oh further disclosure: I’m the Chair and also an investor.


About chrisjhorn
