Originally we conceived of Sophia as a semantic search engine, targeted at the enterprise. Sophia builds its indexes over a given corpus of unstructured data and documents, and then automatically clusters together documents and data on related themes. You can then use the search interface to browse and explore. Sophia will bring to your attention related documents even if they don’t explicitly match your chosen search terms. Sophia is remarkable in that its engine directly works for a range of natural languages and so does not require reconfiguration. There is no external meta-data required: no ontology, no dictionary, no thesaurus, no RDF. It just scans your data corpus and does it all on its own. I previously blogged an example, showing the basic search interface.
But as well as offering our own search capability, we quickly realised that Sophia could work to augment and improve other search engines. Whilst still providing the full search capability, we extended the core engine to automatically build and output tags, categories and themes for a given set of documents. This meta-data generated by the Sophia Digital Librarian can then be used to semantically augment other enterprise search engines and associated tools, including specifically the open source Lucene-Solr; Google Search Appliance; Microsoft FAST and SharePoint; MarkLogic and others.
I’ve been road testing the Digital Librarian against one or two other semantic tagging generators. In particular Thomson Reuters have a very interesting and competent generator in OpenCalais. This was acquired by Reuters from ClearForest in 2007, and has been extended and improved. In particular it added “social tagging” in 2009. It uses a top-down semantic approach, using natural language processing (currently the website says it supports English, Spanish and French), Artificial Intelligence techniques, and pretty enormous databases. Based on what I have read on their website, the social tagging functionality is so far limited to English and unavailable in French and Spanish.
Anyway, I set up Sophia and OpenCalais side by side to compare and contrast. Sophia does not (yet..) do entity extraction, which OpenCalais does well in identifying specific named individuals, locations, etc. On the other hand, OpenCalais does behave idiosyncratically from time to time…
OpenCalais is part of the Linked Open-Data cloud. According to its website when analyzing a document, it currently uses DBpedia, Wikipedia, Freebase, Reuters.com, GeoNames, Shopping.com, IMDB and LinkedMDB to help “understand” a given text.
Sophia, by contrast, has to be explicitly built over a corpus. As a test and online demonstration, Sophia offers the New York Times annotated archive corpus (used with permission), which contains 1.8M documents published between January 1, 1987 and June 19, 2007. You need to be registered to log in and try out Sophia, but it is free.
As a test case, I chose an article published within the timeline of the NYTimes archives, but not an article directly from the archives themselves. Instead (pretty much at random), I used a 19th July 1983 report from the BBC News service on the unveiling of a skeleton of a new dinosaur species at the Natural History Museum in London.
The intent of my experiment was to test the consistency of the reporting by both tools. Starting with just a single sentence from the BBC report, I added further sentences and compared the results as more and more text is given to each tool. A priori, I expected the results from each tool to initially present a fairly generic set of generated tags, and then to improve relative to the text and its contents, as more and more sentences are added.
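The incremental procedure described above can be sketched as a small harness. This is only an illustration under my own assumptions: `stability_profile`, `jaccard` and `toy_tagger` are hypothetical names, and the toy tagger is a stand-in for whichever real tool (Sophia or OpenCalais) you would actually call. It tags the first one, two, three... sentences and scores how much each tag set overlaps with the previous one, so a sudden drop flags the kind of instability I was looking for.

```python
from typing import Callable, Iterable, List, Optional, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Overlap between two tag sets: 1.0 means identical, 0.0 means disjoint."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def stability_profile(
    sentences: List[str],
    tagger: Callable[[str], Iterable[str]],
) -> List[float]:
    """Tag the first 1, 2, ... n sentences and compare each result with the previous one."""
    scores: List[float] = []
    previous: Optional[Set[str]] = None
    for n in range(1, len(sentences) + 1):
        text = " ".join(sentences[:n])
        tags = set(tagger(text))
        if previous is not None:
            scores.append(jaccard(previous, tags))
        previous = tags
    return scores


def toy_tagger(text: str) -> Set[str]:
    """Hypothetical stand-in for a real tagger: lower-cased words longer than six letters."""
    words = (w.strip(".,:;\"'()").lower() for w in text.split())
    return {w for w in words if len(w) > 6}


report = [
    "A huge new dinosaur skeleton has been unveiled to the media at the "
    "Natural History Museum in London.",
    "Plumber and amateur fossil hunter Bill Walker, 55, found a foot-long claw "
    "belonging to the flesh-eating beast at a clay pit in Surrey in January.",
]
profile = stability_profile(report, toy_tagger)
print(profile)  # one score per added sentence after the first
```

A well-behaved tagger should give scores that stay reasonably high as sentences are added; a score near zero means almost every earlier tag was discarded.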
The first sentence of the BBC report is “A huge new dinosaur skeleton has been unveiled to the media at the Natural History Museum in London.” Here’s what the (free) OpenCalais Viewer gave back:
The topic returned is “Hospitality Recreation” which is probably reasonable, based on the London Natural History Museum. The list of social tags also appears reasonable: John Gurche is an artist specializing in prehistoric life; and the Naturhistorisches Museum is in Vienna.
As a side-note, for some reason the OpenCalais Viewer currently only supports the Internet Explorer and Firefox browsers. I am using Firefox here, although my normal browser is Safari. When I try Safari on the same examples here, OpenCalais appears to work but in fact gives worse results than Firefox. So, I’m sticking to Firefox in the remaining examples below.
Anyway, let’s see what happens when we give OpenCalais the second sentence from the BBC article as well (“Plumber and amateur fossil hunter Bill Walker, 55, found a foot-long claw belonging to the flesh-eating beast at a clay pit in Surrey in January”):
As you might expect, the list of social tags improves and becomes more relevant.
Somehow, “he found the rock” and “he tapped it” have encouraged OpenCalais to believe that the article is something to do with Entertainment (well, yes, arguably); Games, and Electronic Games in particular! Primal Rage is a (pretty gruesome) early game involving dinosaurs: however I’m not at all certain that finding a rock and tapping it are a part of the game (can anyone clarify?). Equally, OpenCalais has decided to drop the Natural History Museum (even though it is explicitly given in the first sentence), and the United Kingdom.
It’s not at all clear, to me at least, why OpenCalais is behaving in this way. Certainly it is quirky, to be polite 🙂
Let’s throw in the fourth sentence (“Palaeontologists reconstructed it and dated the remains at 125 million years old, describing them as the find of the century”):
The fourth sentence has caused OpenCalais to return the same set of social tags although, interestingly, the weighting given to “Hospitality Recreation” has decreased from the three stars it had previously been given.
This is really, really weird! The addition of the fifth sentence has resulted in OpenCalais now thinking that collectively the five sentences are only about Hospitality Recreation. It has dropped all the other social tags which it previously had for the first four (or fewer) sentences – whether relating to dinosaurs, natural history or even electronic gaming!!
Throwing in the sixth sentence (“Group leader and head of the Dinosaur Department at the Natural History Museum Dr Alan Charig explained: ‘It is a totally new species of dinosaur. Even more important, this is the first record of any meat-eating dinosaur being found in rock this age anywhere in the world’”) does result in OpenCalais at last behaving as one might hope:
But overall: for this particular example taken from the BBC News, OpenCalais’s behavior is clearly unstable and unpredictable.
Let’s try Sophia. Remember that Sophia in this case is using the NYTimes archives – rather than the set of databases which OpenCalais is using as listed above.
Correlating against what is in the NYTimes archives, Sophia has categorised the opening sentence as relating to Museums, with sub-categories of the Museum of Modern Art and the Metropolitan Museum of Art. The Document Tags are tags which Sophia has found in the given sentence. The Semantic Tags are a list of other tags which Sophia believes are relevant, even though they do not explicitly appear in the text. The Neighbours list is a list of titles of specific (in this case) NYTimes articles which Sophia believes are relevant to the given text. These can of course be linked directly into the archive database to read the full text of the associated articles. Finally the Distance metric is a number between 0 and 1 indicating how “close” the given article is to the text supplied.
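To make the idea of a Neighbours list with a 0-to-1 Distance concrete, here is a minimal sketch. To be clear, this is not how Sophia computes its metric (that is described only loosely in this post); I have swapped in plain cosine distance over word counts as an assumed stand-in, and the function names and the three-article corpus are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

PUNCTUATION = ".,:;\"'()"


def word_counts(text: str) -> Counter:
    """Lower-cased word frequencies, ignoring surrounding punctuation."""
    words = (w.strip(PUNCTUATION).lower() for w in text.split())
    return Counter(w for w in words if w)


def cosine_distance(a: Counter, b: Counter) -> float:
    """Distance in [0, 1]: 0 for identical word profiles, 1 for no shared words."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    if norm == 0:
        return 1.0
    # Clamp to absorb floating-point rounding at the edges.
    return min(1.0, max(0.0, 1.0 - dot / norm))


def neighbours(query: str, corpus: Dict[str, str], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k corpus titles closest to the query, nearest first."""
    q = word_counts(query)
    scored = [(title, cosine_distance(q, word_counts(body))) for title, body in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Hypothetical mini-corpus, purely for illustration.
corpus = {
    "Fossil hunters find dinosaur claw": "fossil hunters find dinosaur claw in clay pit",
    "Museum opens modern art wing": "museum opens new wing for modern art",
    "Election results": "voters went to the polls on tuesday",
}
result = neighbours("dinosaur fossil claw found in clay pit", corpus, k=2)
for title, distance in result:
    print(f"{distance:.3f}  {title}")
```

Raw word overlap like this only matches shared terms; the point of the Semantic Tags described above is precisely that Sophia surfaces related material even when the words do not explicitly match, which this sketch does not attempt.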
The OpenCalais Viewer tool on its own does not have an equivalent of the Neighbours list, and I have not experimented with the rest of the OpenCalais toolkit to explore what articles OpenCalais might suggest as related to the given search content.
Note that the output generated here is purely for human checking and demonstration. Sophia can generate XML or other formats to provide appropriate input and guidance to established search engines (as I listed earlier).
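As a rough illustration of what feeding tags to another engine might look like, the sketch below wraps a set of tags in the XML update message shape that Solr accepts (`<add><doc><field name="...">`). I am not showing Sophia’s actual output format here: the function name, the field names `document_tag` and `semantic_tag`, the document id and the tag values are all assumptions for the example, and the target Solr schema would need matching multi-valued fields.

```python
import xml.etree.ElementTree as ET
from typing import Iterable


def tags_to_solr_xml(doc_id: str, tags: Iterable[str], semantic_tags: Iterable[str]) -> str:
    """Build a Solr-style <add><doc> update message carrying generated tags."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    ET.SubElement(doc, "field", name="id").text = doc_id
    # Hypothetical field names: one repeated field per tag value.
    for tag in tags:
        ET.SubElement(doc, "field", name="document_tag").text = tag
    for tag in semantic_tags:
        ET.SubElement(doc, "field", name="semantic_tag").text = tag
    return ET.tostring(add, encoding="unicode")


# Illustrative values only, not real Sophia output.
xml_message = tags_to_solr_xml(
    doc_id="bbc-dinosaur-report",
    tags=["dinosaur", "fossil"],
    semantic_tags=["bird"],
)
print(xml_message)
```

Building the message with an XML library rather than string concatenation means any awkward characters in a tag (ampersands, angle brackets) are escaped automatically.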
The introduction of Bill Walker in the second sentence has revised the topic and subtopics from museums to “Fossil”, “Fossil Record” and “Homo”. The tags and semantic tags are now more oriented to dinosaurs, fossils and paleontologists. “Bird” has emerged, driven by the mention of “claw” and the prehistoric heritage with dinosaurs. The Neighbours list has surfaced new article titles which appear more relevant to the two sentences which we have so far given from the BBC News article. A reminder that Sophia does not yet do entity extraction, and so Sophia has not explicitly identified “Bill Walker”.
The semantic tags have slightly changed, but the output is essentially the same. The Neighbours list has been further refined, with one or two articles rising in the distance order; others being dropped in favour of new articles apparently closer to the search text.
“Species” has been added as a new subtopic. “Homo erectus”, “discovery” and “artifacts” have emerged as closer tags. The Neighbours list continues to be refined.
The “homo erectus” and “homo sapiens” semantic tags have now been replaced by “human” as Sophia realizes the document is less about prehistoric “humans” than about prehistoric dinosaurs and today’s humans. The Neighbours list clearly has a number of articles relating to the discovery of dinosaur fossils, together with another relating to an exhibit in a natural history museum.
So: things are interesting. OpenCalais is a remarkably good tool, but sometimes appears to have unusual behavior. Sophia has limitations (like no specific entity extraction) but uses a very different approach indeed from OpenCalais to produce high quality tags.
A reminder that my intent was to test the stability of the two semantic tagging tools as more and more content is unveiled to them. Clearly one would expect that the more sample (search) text that can be given, the better the tools should perform. But in practice, it is not unusual for knowledge workers to supply relatively short search criteria. Applying an automatic semantic tagging system over short texts, such as in particular Twitter and other social media tools, is also pertinent, and we ourselves have some excellent results from Sophia (e.g. we built a new database from Twitter containing all tweets relating to Rory McIlroy’s US Open win, and automatically categorised and themed the very varied content :-)).
OpenCalais does appear, at least from this admittedly very limited example, to have some surprises. I’ve little doubt that OpenCalais will continue to improve: as I noted above, it is already a very good tool, albeit at this time largely limited to the English language. We will of course continue to extend Sophia and its Digital Librarian…
How does Sophia do it? Well, if you’ve ever read Dan Brown’s Da Vinci Code or Angels and Demons or The Lost Symbol, you’ll have thought about signs and symbols in language. Umberto Eco’s novels and analyses – such as in particular The Name of the Rose, or Foucault’s Pendulum or even Kant and the Platypus – play even more strongly to the linguistic importance of signs. Eco is a semiotician, and Sophia uses algorithms based on semiotics to identify and categorize documents. Per se, it does not have any knowledge about any specific natural language, and so instead analyses the patterns of words and constructs appearing in the documents which it is given, clustering together documents which have similar semiotics.
Contact the Sophia folks directly to learn more and try it out.