Tuesday, May 29, 2007

Is Google high on LSI?

Ah, the headline caught your attention :-) Well, don't worry. LSI is not a fancy new designer drug, and although the G company has a history of flying high, I doubt they are on anything stronger than Coke Zero.
But yesterday I came across this excellent post by fellow Blogspot blogger John Colascione.
In the post he gives some interesting examples of how Google has implemented LSI (Latent Semantic Indexing). Back in 2005 I had the great pleasure of working with Moses Martiny and Kenneth Vester at Mondosoft while they were writing their Master's thesis on one of my favourite topics of all time, Document Clustering. It was through them that I learned about LSI, which is quite an interesting approach to automatic keyword extraction.
With this technique you can get some amazing results: keywords extracted for documents that don't even contain the actual words - although they should have!
If I recall correctly, the basic approach is something like building a matrix of words by documents that covers the entire document collection, and then using an algorithm like SVD (singular value decomposition) to determine the most distinctive words for each document - even when the document doesn't contain those words. Funny stuff!
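To make that a bit more concrete, here is a minimal sketch of the idea in plain Python. Everything in it is an assumption made for illustration: the five-document toy corpus, the function names, the threshold of 0.3, and the use of power iteration with deflation as a stand-in for a proper SVD routine. It is certainly not how Google (or anyone's production system) does it. It builds the word-by-document matrix, keeps only the two strongest "concepts", and then checks which words get a noticeable weight in a document that never mentions them.

```python
import math

# Toy corpus (made up for illustration): rows are terms, columns are documents.
terms = ["rap", "eminem", "lyrics", "rdf", "php"]
A = [
    [1.0, 1.0, 0.0, 0.0, 0.0],  # rap
    [1.0, 0.0, 1.0, 0.0, 0.0],  # eminem
    [1.0, 1.0, 1.0, 0.0, 0.0],  # lyrics
    [0.0, 0.0, 0.0, 1.0, 1.0],  # rdf
    [0.0, 0.0, 0.0, 1.0, 1.0],  # php
]


def top_singular_triplets(matrix, k, iters=200):
    """Approximate the k largest (sigma, u, v) triplets of matrix.

    Uses power iteration with deflation, which is fine for a tiny example
    but is only a stand-in for a real SVD implementation.
    """
    rows, cols = len(matrix), len(matrix[0])
    residual = [row[:] for row in matrix]  # copy, so the input is untouched
    triplets = []
    for _ in range(k):
        v = [1.0 + 0.01 * j for j in range(cols)]  # deterministic start vector
        for _step in range(iters):
            # One step of power iteration on residual^T * residual.
            u = [sum(residual[i][j] * v[j] for j in range(cols)) for i in range(rows)]
            w = [sum(residual[i][j] * u[i] for i in range(rows)) for j in range(cols)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        u = [sum(residual[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        sigma = math.sqrt(sum(x * x for x in u))
        u = [x / sigma for x in u]
        triplets.append((sigma, u, v))
        # Deflate: remove the component just found before looking for the next.
        for i in range(rows):
            for j in range(cols):
                residual[i][j] -= sigma * u[i] * v[j]
    return triplets


def low_rank(matrix, k):
    """Rebuild the matrix from its k strongest singular triplets."""
    triplets = top_singular_triplets(matrix, k)
    rows, cols = len(matrix), len(matrix[0])
    return [
        [sum(s * u[i] * v[j] for s, u, v in triplets) for j in range(cols)]
        for i in range(rows)
    ]


def suggested_terms(terms, matrix, approx, doc, threshold=0.3):
    """Terms absent from document `doc` that the low-rank model still weights."""
    return [
        t for i, t in enumerate(terms)
        if matrix[i][doc] == 0 and approx[i][doc] > threshold
    ]


A_k = low_rank(A, k=2)
# Document 1 contains only "rap" and "lyrics".
suggestions = suggested_terms(terms, A, A_k, doc=1)
print(suggestions)
```

In this toy setup the document about "rap" and "lyrics" gets "eminem" suggested even though the word never appears in it, while "rdf" and "php" stay at roughly zero. The number of concepts to keep (k=2 here) is a tuning choice; with a real collection you would use a library SVD and far more dimensions.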

Naturally I couldn't read John's post without trying Google's solution myself, and although not every term has good LSI matches, there were some interesting ones. For instance, it would seem that the query "~rap" is connected to both "Eminem" and "Lyrics" as well as "Rdf Api for Php" (the last was obviously the most interesting hit in my humble opinion).

Anyway, it's cool that Google is playing around with this technology - just like all the other search giants (and challengers). Now, if only it were incorporated in the search in a better way than the tilde ("~") query line operator.


Alexey Rusakov said...

Wow, that is pretty cool. I didn't even know about the '~'.

It seems that content management is naturally connected to knowledge management and document management.

Allan Thræn said...

Yeah, I suppose it's all within the exciting world of information management :-)