Tuesday, May 29, 2007

Is Google high on LSI?

Ah, the headline caught your attention :-) Well, don't worry. LSI is not a new fancy designer-drug and although the G company has a history of flying high, I doubt they are on anything stronger than Coke Zero.
But yesterday I just came across this excellent post by fellow blogspot blogger, John Colascione.
In the post he brings some interesting examples on how Google has implemented LSI (Latent Semantic Indexing). Back in 2005 I had the great pleasure of working with Moses Martiny and Kenneth Vester at Mondosoft while they were writing their Master Thesis on one of my favourite topics of all time, Document Clustering. I remember how I through them learned about LSI which is quite an interesting approach to automatic keyword extraction.
With this technique you can get some amazing results of keywords extracted from documents that doesn't even contain the actual words - although it should have!
If I recall correctly the basic approach is something like making a matrix of documents and words containing the entire document collection, and then use an algorithm like SVD to determine the most distinctive words for each document - even without the document containing the words. Funny stuff!

Naturally I couldn't read John's post without trying Google solution myself, and although it's not every term that has good LSI matches, there was some interesting ones. For instance it would seem that the word "~rap" is connected to both "Eminem" and "Lyrics" as well as "Rdf Api for Php" (the last was obviously the most interesting hit in my humble opinion).

Anyway, it's cool that Google is playing around with this technology - just as all the other search giants (and challengers). Now, if only it was incorporated in the search in a better way than the tilde ("~") query line operator.

WCF: Duplex is awesome!

One of my first assignments in the new job has been to lookup into a couple of the fun new features in .NET 3.0, like WCF and WWF.

So far, I've found Communication Foundation really interesting, albeit a bit annoying to work with.
It's main force is it's flexibility. Where you would usually have to decide on either building a Web Service, your own custom coded TCP Server or use remoting, you can now just build a standard service and then just put in the configuration file which protocol it should use (like HTTP, TCP, Named Pipes, MSMQ, etc) - or well, at least thats the theory. The downside to this is of course that since there's a lot of stuff thats configurable, you really need to understand all the configuration concepts (like Bindings, MetadataExchange, Endpoints, Security, Contracts, etc) properly and configure it well in order to use it.
Another potential problem could be the performance of this communication since all the communication is handled using SOAP (which means that there's a lot of XML serialization and deserialization going on).

It would also seem that a couple of the problems known from WebServices has been addressed. For instance you now no longer need to put the webservices in IIS in order to use web-services - they'll just open a HTTP port for you and act as their own server. It also looks like error-handling has been improved and it looks like there's now some cross-service exception-handling (although not perfect. An ApplicationException thrown from the server will appear to be a "FaultException" on the client - but perhaps I'm missing something here).

The most awesome feature I've stumbled across until now in WCF is the possibility to make Duplex services, e.g. services that are able to initiate communication with the client.
Sure, you could yourself make each client a service as well as a client and then let them exchange connection information, but now this functionality is build into the communications framework.
Naturally this requires some coding/configuration inconviniences, but once they are done it's easy to implement a state-of-the-art Observer pattern across various machines.
Setup a service, allow multiple clients to call the service to register themself as subscribers to various events, and then let the Service notify them when the event occurs.
Jeff Barnes has put a great article on Codeproject with an example of this.

Sunday, May 20, 2007

New Job!

After about a month filled with interesting job interviews and exiting offers, I finally decided where I'm going to work!
As of tomorrow I'm working as a Software Architect at Infopaq International.

I find Infopaq to be a really interesting company, mainly because of their ideas and infrastructure. They monitor a lot (!) of media, write news resumees and distribute the relevant news to their customers, along with doing some really fancy media analyzing. In other words, they seem to have tons of data and information to play around with - something thats always fascinated me. I'm looking very much forward into taking the step from Information Retrieval into working for a genuine Information Provider.

On top of that they seem like a fun company with a lot of clever guys and with some great ideas and visions. I can't wait to get started and learn all about this new company in details.

Wednesday, May 16, 2007

Playing around with Embedded Objects in IE

Yesterday I felt like playing around a bit with embedded objects in IE - you know, showing Windows Forms in a browser. An old .NET trick that I've used a couple of times.

For some reason it has never quite become the market standard it was supposed to (probably because it's ugly, inefficient and very browser specific) - but it would have been nice with a good alternative to java applets!

I made a quick adaptation of the SudokuSolver from my previus post to see how it would work as an embedded object. This is what I did:
  1. Made a Windows Forms Library project
  2. Made a Windows Forms User Control
  3. Moved the UI from the Sudoku application to the new User Control (as well as the Code Behind)
  4. Compiled and put on a web-server
  5. Made a HTML Page that includes it as an embedded object, and put the page on the same webserver (this is important)
This is the HTML I used, notice how you provide it with the URL to the Windows Forms DLL, and then the full path (including namespace) to the control to display:


<object id="SudokuControl" height="240" width="206"
classid="http://www.thraen.dk/Download/SudokuWinLib.dll#SudokuWinLib.Sudoku">
</object>


If you are watching this blog in IE, and you have .NET 2.0 installed, and your security settings is just right, there is a chance that you might actually see the Sudoku Solver here:


Solving Sudokus as CSP with Forward Checking

When you think about it, a Sudoku is a perfect example of a Constraint Satisfaction Problem, and so it made good sense that the final programming assignment in the IAIP course I'm taking was to solve a Sudoku as a CSP, using forward checking.

The basic idea is to define the problem as a set of variables (in this case 9*9=81 variables), where each has a finite domain of potential values (by default values 1-9 in this Sudoku example) and then there's a number of constraints connecting the variables. For the Sudoku these constraints ensure that the numbers 1-9 are distinct in every horizontal and vertical line, as well as in the same 3x3 square.

Once the problem is defined as a CSP there's several approaches to solving it. The assignment this time was to implement Forward Checking. The basic idea with forward checking is to make a backtracking depth-first search in the solution space, but every time a variable is assigned, the domains of all the affected variables (due to constraints) will be modified. If a domain becomes totally empty, we backtrack. It all sounds terribly complicated but it's really a straightforward approach, assigning the variables one at a time and whenever something is assigned revise the possible values for the rest of the affected variables.

As usual we were supplied with a nice codebase to extend to (in fact all we had to do was more or less to implement the Forward Checking part), but also as usual the entire codebase was in Java. This time it made the most sense to simply port it to C# and it didn't take much more than an hour to port the main solver class and even refactor a little.

The report can be seen here.
The project files can be found here.

Saturday, May 12, 2007

Solving NQueens using Reduced Ordered Binary Decisision Diagrams


Here's one of the things I've been too busy to post:
At my AI course we've completed our last 2 assignments. The one that I'll post here is the N-Queens problem solver, that helps solving the n-queens problem using BDD's.
BDD's is a really clever technique that can help solve some satisfiability problems quite fast.

The N-Queens problem is a classic problem, described lots of places. The derives from the 8-queens problem, thats how to fit 8-queens onto an 8x8 chess board without any of them threatening the others (according to the rules of chess).
It can be solved in many ways, but BDD's is a quite efficient way of doing it.
The basic idea is to build a binary decision diagram, where every node corresponds to a variable in the problem, construct the diagram according to the rules (constraints) that apply to the specific problem and along the way reduce the diagram according to a couple of rules (and to avoid duplicate node-childnode patterns).
How we choose to assign the nodes can be found in our report (as always written with Thomas Gravgaard and Peter Thygesen). Also, a final web-version of our solution can be tried out here.

The joy of being unemployed!

Ahh..posting again... I know, it's been a while, but now I hope to be back again. Now I've been through the first month of unemployed life, and I must admit that it has been nothing like I had imagined. It's been crazy! No time for sitting in front of the local supermarket drinking 6-packs of beers all day (as I had been told unemployed people are supposed to do), no time for playing around with technologies and ideas (as I had hoped) and no time for blogging either...Well, perhaps it's just a bad excuse, but I've been really, really busy.
During the first couple of weeks I got contacted by 40+ companies that wanted to discuss job oppertunities! Early on I was really eager and polite and noted everybody down and did my best to keep track of them, but I must admit that after a couple of weeks I started to sort them roughly at the initial contact. For instance I turned down all consultancy companies, cause I don't feel like body-shopping. At least not now.
But even with thoroughly sorting it mounted to around 15 initial interviews I had to go to. This part is really terrific and quite an ego boost (currently my bloated ego has reached a size where it can hardly fit into living room), that can definetly be recommended. Each interview was around 1,5h - most of the time talking about one of my favourite subject, me. As you can imagine it took quite a lot of time to go through all these talks, and following up on them afterwards. Meanwhile I constantly kept track of all advantages and disadvantages to each job, and after a couple of weeks of initial interviews I had a list of 10 companies offering me 15 different jobs, ranging from software architect, product manager, senior developer to researcher.
Then it started getting tricky!
Cause all these jobs on my shortlist sounded really good and I could easily see myself enjoying each and every one of them. But only one can be picked. So I decided to measure companies on more criterias and narrowed down the list to 5 companies that I decided to have a second talk with.
Now I'm down to considering 3 companies, and of those I must admit I do have a favourite - but I'm not telling which - at least not until the contract is signed and everybody's happy!