Allan on Technology: March 2007

Thursday, March 22, 2007

Why, oh WHY?!?!?!

Currently I'm totally in the state of chock after finding out that C# supports GOTO statements.
To some people it's probably old news (that says more about them than me), but I just found out. Let me for the record state that the only way I found out was when looking through some code on codeproject.
I'm sure there's those that claim that cases exist where it actually makes sense to use GOTO, but oddly enough these people can usually never think of them, when asked for an example.
But even if there exists good usages, we've got to stop up and ask ourselves if it's really worth ruining such a nice language as C# for?! There is always another way!
Mr. Hejlsberg, in the future just say NO :-)

Wednesday, March 21, 2007

Another day another tool: XPath

Here's a small 5-minute tool I made the other day, simply because I needed it and it was faster to make it than finding a good one online that does what I want it to (although I'm sure hundreds of them exist).

It's a simple XPath tester, that can test an XPath against a piece of XML, or a URL pointing to XML.
It even attempts to do some simple syntax highlighting of the XML.

The tool is online at www.mizar.dk/XPath. (In case you are wondering, Mizar.dk is an old domain I currently use all my silly projects and for various other demo/lab purposes. It's actually a leftover from a webdesign company I had together with Jesper years ago).

The XPath tester can also be used for simply showing XML, and it accepts a couple of query-line parameters so you can send link to XML with applied XPaths like this.
The query-line parameters are: XML (raw xml for source) XMLURL (if you instead want to retrieve xml from a specific location) and finally XPATH (guess what thats for).

Enjoy!

Monday, March 19, 2007

J# to the Rescue

I've always wondered what the purpose of J# was. I mean, Java and C# are quite similar, so I'd suppose that if people (for some weird reason) would prefer to code Java (instead of drinking it), they'd do so purely because they wanted to work within the java environment and the java VM. If they eventually decided to overcome their Microsoft fears and turn to the .NET framework, C# would seem like a natural choice. So it has indeed been puzzling my mind why J# is included with Visual Studio.
But today I found a use for it, and it really saved my day!
The next programming assignment (after the Connect 4 game) in the AI class I'm attending was handed out today and even though the assignment looks fairly interesting (implementation of DPLL algorithm to check satisfiable of Boolean expressions in CNF form) the source code that we are supposed to base our solution on was pure old-fashioned java. Of course, we are free to choose which language to solve the task with, but if we choose anything else than Java we'll have quite a programming task ahead of us, just making the code-base that we're supposed to extend on. After first considering learning java I was happy to remember that there might be another way out :-)
With shaking hands and a heart beating way too fast I started VS2005 and created a J# Class Library project. Then I copied the original source java code into newly created .jsl files with identical file names (copy+paste). As you probably can imagine the excitement was overwhelming when I tried to build for the first, but surprisingly there was only very few compiler errors...
A java HashMapSet that I discovered should be changed to an ArrayList in .NET, a couple of framework methods where Mocca (sorry, Java) used wrong casing, and finally a method that appeared to have been renamed in .NET, and I was good to go.
All that was left to do then, was to create a C# project in the same solution, reference the J#-library and I was good to go - well almost. In order to inherit one of the java-classes and override the methods we're supposed to implement we also needed to reference "vjslib" in our c# project in order to recognize the parameters to the methods we were overriding.

All in all I was very happily surprised to find that J# actually cured my pain.

Now we (me, Thomas and Peter) just need to actually solve the programming assignment we were given :-)

Connect 4: The code

Yesterday I handed in my code for the Connect4 game. Since the deadline has now passed I figured I might as well post it here, including the document I wrote to describe it. In case you want to play it again, do it here.

Download the code for this article here.

Introduction

In order to complete the assignment of making an implementation of a computer player for the game "Connect four", I started out by making an environment, a state model and a simple console based interface to test it.

The console based interface works by showing the current state as 7 columns with 6 rows in each. When it's the users turn, he/she should choose one of the 7 columns, by using the numbers 1-7. It is a very simple and basic UI because I wanted to my time on developing the search algorithm that will determine the computers moves.

To run the code, compile the project with a C# 2.0 compiler and run the executable. I've included solution and project files for VS2005 as well.

The console program also contains some experimental functionality that allows you to set up the computer to play against itself. This code was used when I was optimizing the weights of the evaluation algorithm, but is now commented out.

The Connect Four game has been "solved" by Victor Allis, and there is a "Perfect Play" path you can follow, that will force a win to the starting player. However I have chosen to disregard that playing strategy in this implementation since the goal was to learn about AI game-playing.

Class overview

Connect4: The main game class, that holds the current state and controls the flow of the game. In turn calls each of the two players, typically Computer and Human and executes their actions.

StateType: A state, containing the board, the number of moves made, and who is next in turn. It also contains methods to clone itself and perform a move, as well as evaluate if it is in a game-over state.

Computer: The class that holds all the logic related to the search algorithm. It is called like a player using the PlayerTurn(StateType state) method, and then begins to perform a modified MiniMax search with Alpha-Beta pruning to determine which action to choose. Its functionality is described below.

Human: The Human player. The PlayerTurn(StateType state) method is called by Connect4 every time it's the users turn. This then presents the state to the console and waits for input. When an action is retrieved it returns the action, a move is made and the turn changes.

Search Algorithm

The Search algorithm is a modified MiniMax algorithm with Alpha-Beta pruning. It is implemented in the "Computer" class, which controls the computer player.

Using double recursion is performs a depth-first search in the state-space until the cut-off method evaluates to true. At this point it uses the heuristic evaluation method to determine how close it (the computer player) is to winning. This value is passed up through the state tree, where each node either selects the maximum or the minimum of its values, depending on if it represents the computers turn or the players turn. The procedure is in fact quite similar to how a typical human player subconsciously would play the game: First enumerate the possible actions from the current state, then estimate the opponents move from any of the resulting states, and decide which would leave you in the most favorable state. However human (non-expert) players will typically only look 2-3 moves ahead, while the computer can search a much larger state-space.

In order to further improve the performance Alpha Beta pruning has been applied. This essentially is to remember the best (or worst) options for each node, so no time will be wasted exploring branches of the trees that's already identified as path that will not be played.

Heuristic Evaluation method

The Evaluation method is supposed to evaluate how good a given state is for the computer, e.g. how close is it to winning the game. After having experimented with several different models I found an approach that suited me the best in evaluation a state. In order to win the player must have 4 fields in a row, horizontally, vertically or diagonally. To find out how close both players was to this, I decided to examine all possible combinations of 4 fields that appear in a row in the 7*6 field board. First I would break it down to lines, horizontally, vertically and diagonally and evaluate each line. All lines I examine will of course need to be 4 fields or longer.

An example, a line with 7 fields would lead to the following possible winning combinations:

1 2 3 4 5 6 7 => 1234, 2345, 3456, 4567

For each of these combinations my evaluation method would examine the amount of them taken by each player. If none of the players or both of the players has selected fields in the same 4-field combination, the combination will be instantly discarded. Otherwise it will add to a state-score depending on the number of fields taken (how close it is to be a 4-in-a-row). Depending on the number, and if it's computer or player fields, a certain weight will be applied.

Determining the evaluation weights

Initially I started out with weights that was the square of the number of occupied fields in each block of four (e.g. 3-in-a-row would give a score of 9, 2 would be 4, 1=1) and if state resulted in 4 in a row it would get a big bonus / punishment of 10000/-10000.

However, it quickly became evident that I would at least need to raise the punishment if the player had 4 in a row, or take into account whose turn it was, to keep the computer player from focusing solely on its own block-building. After that optimization it became increasingly difficult to manually fine-tune the optimization of weights, since no matter how I adjusted them, I still couldn't beat the computer.

So I set up a game to allow two computer players to play against each other. The first player had the starting advantage, and the current weights. The second player would have almost the same weights, except one little difference at a time. If the second player would win in spite of the first player having the starter’s advantage, I would use the new weights as the current weights.

After having gone through this manual iteration 15-20 times, I decided that it was fairly optimized. If time had permitted I could imagine automating this process, perhaps using Simulated Annealing. However I felt that this was outside the scope of the current assignment.

Cut-off Method

Since the state space will typically be too large to explore in reasonable time, it's necessary to cut-off the search at a certain point.

First of all the cut-off method should be able to detect if we've actually reached a winning state, in which case it should always cut-off, alternatively it should determine if it is feasible to stop the search at the current level.

After experimentation I decided on a simple cut-off mechanism that cuts of the search after looking 6 moves ahead. 7 moves also seem possible within the allowed time-frame but I'm too impatient to wait more than 1-2 seconds for every move when playing.

An ideal approach would of course be a more intelligent cut-off mechanism that would keep searching until all available time was used, however time did not allow for such a solution at this time.

Friday, March 16, 2007

MondoSearch Result Authentication

A very typical request I often hear from customers and partners is the ability to return only the results that the current user is allowed to see. This desire is very natural, but can often present quite a challenge to 3rd party search engines like MondoSearch. The problem is that it varies a lot from each individual setup how authorization works, and hence no general solution can be made. We can only deliver specific solutions of authorization to specific systems (like we have done for EPiServer or Sitecore) or provide general toolkits/examples that makes it easier to custom-build an integration.
The problem with authenticated problems can really be divided into two sub problems:

Indexing secure content
Searching in secure content

Indexing isn't that big of a problem. There's many ways to make that content available to the search engine. MondoSearch has built-in support for basic-authorization, challenge-response (integrated authorization) and forms log in, just as well as it's quite easy in many CMS systems to override the security if the client originates from a specific IP, or has a specific HTTP Request setting. Generally we see only very few problems in actually indexing the content. The only thing that can be tricky is when the content on the individual pages vary based on who is logged in. In order to handle that, would require the Search Engine to index the same URL, as all the different users that can access it. Luckily pages with user-dependent contents are typically portal pages that are not all that interesting to index. The articles, documents and database content that's interesting to index are not a problem.

Searching in Secure Content is really the main challenge when it comes to authenticated contents. Even though security for the individual pages typically is checked when you try to access a page, it can still be quite revealing when the title (and perhaps description) of a page is displayed on the result page. In fact, to be totally safe, a user who doesn't have access to certain documents must not even know of their existence from the result page! (Suppose I searched for "invasion plan Iran" on Pentagon's website and was told that there were 10.000 documents I that matched the phrase, but none I was allowed to see).
In order to achieve this there's generally three approaches:

Authentication by filtering. Store access rights when indexing the documents and use them in the search
Authentication by exclusion. When performing a search, manually check that the current user has permission to see each of the results, before returning it.
Rules based authentication. Where a number of specific filters is defined for each user-group.

In general I prefer to use Filtering to perform search result authentication.
With MondoSearch this typically means adding Meta-tags (/data) to all documents defining which groups / users are allowed to view them. And perhaps even which groups/users have specifically denied access.
A Meta-tag like that could look something like this:
<meta name="ALLOW" contents=";53;124;351;33;12341"/>
Then, on the result page, all you'll need is a piece of code that extracts the user-id and the group-ids of the current user and then adding search filters to the search query. Suppose we have a user with user-id "42" and who belongs to the group "users" (id: 351) who performs a search that returns a document with the above meta-tag. The MQL that is sent to the search engine would then have to have these filters added:

"... FILTERS ALLOW CONTAINS ';42;' OR ALLOW CONTAINS ';351;' ...."

To also enforce DENY is a bit more tricky, but certainly just as doable.
The obvious benefits here are: It's very (!) fast, it's clean, it's easy
However there's also a number of downsides:

Not all CMS systems support outputting permission-lists to the crawler
If access-rules change, they will not be propagated to the index until next crawl
It typically doesn't work for non-html documents like Office and PDF (since it's kinda hard dynamically to attach meta-data to these types). However there is a number of workaround to this problem.

The alternative to filtering, is exclusion which in my eyes is definitely not pretty, but sometimes necessary. Authentication by exclusion calls for a custom method is defined that checks if the current user has access to a given URL. A pointer (delegate) of this method is then passed to the search engine that will call it and evaluate every result in the result-set.
The obvious problem is the performance of this solution. On a result-set of 10 pages, with a fast-checking method, it can be acceptable, but often result-sets can be very large. Imagine having to call a custom-made method for every one of 100.000 results - or worse!!
Another problem is that in order to pass a delegate to the search engine the search-engine needs to be installed on the same server as the CMS - something that doesn't always fit into the desired machine architecture.
Of course the performance can be increased of such a method in some cases: intelligent caching, only check the results on the first page, etc. but in my experience it's never a really good solution. In my eyes the only really acceptable use of this is as a compliment to the filtering search (for instance to check access for non-html documents) - or where no other solution works.
In order to set this up on a MondoSearch template, assign a method handler to the "OnAuthorize" event in the SearchControl, like this: OnAuthorize="CheckAuthorization" .
Then define the method elsewhere:


public bool CheckAuthorization(string url){
   return true;
}

The last authentication method I will briefly touch in this post is to use a number of rules.
The idea here is that by applying knowledge about the security setup on a website, a couple of simple rules might do the trick.
Imagine a simple setup where only two types of visitors exist on a web-site: logged-in and not-logged-in, and that all the content that only the logged-in users were allowed to see is in the sub-directory "/secure".
In this case you could simply apply some additional MQL when a visitor performs a search:
if(!logged-in){ mql+="FILTERS @CHANNEL!='secure'"; }

This is an ideal approach, but it doesn't work on all sites.

Thursday, March 15, 2007

Performance: ArrayList vs List<>

Quite recently I was trying to optimize/cleanup some .net 2.0 code that had recently been moved from framework 1.1. Naturally various old fashioned Arraylists was scattered over most of the code, and I began the tedious work of tidying up and using generic lists.
As I was doing this I began to wonder how big a performance difference there actually is with various list types, and I decided to investigate this.
The three list types I decided to check were these:

ArrayList. Leftover from .net 1.0, basically a dynamically growing array of objects. Everything has to be boxed to an object in order to use this.
List<>. The generic List in 2.0. For valuetypes, this doesn't box them, but stores them quite effiecently.
Normal arrays. For comparison.

So far I've only decided to test adding/setting the values. I expect similar measurements can be made with regards to removing, accessing and finding. Later on it might also be interesting to look at searching and sorting.
I decided to compare how the ArrayList and ListType does, when adding strings, integers and finally when adding strings after the list has been initialized to the expected size. Just for fun I also tested how the ArrayList performed in framework 1.1.

For the integer tests I added the number 42 to the collections, and for the string tests I added "42".
These are the results I got(times are in ms, running on my laptop):

Items	1 mil	2 mil	4 mil	8 mil
ArrayList Add (int) 1.1	200	500	1121	3000
ArrayList Add (int)	150	450	1100	2600
ArrayList Add (string) 1.1	40	150	250	530
ArrayList Add (string)	70	150	290	630
ArrayList Add (string) PreSize	20	60	110	230
List Add (int)	30	50	120	230
List Add (string)	60	120	280	520
List Add (string) PreSize	20	50	100	200
Array (int) Set	10	20	30	60
Array (string) Set	20	30	70	170

Graphically the data looks something like this:

The conclusions I draw from this is as follows:

Always use arrays or List's when possible, when working with valuetypes. ArrayLists are naughty. The big overhead is probably due to the boxing that takes place (an Int32 object has to be created for every number, and a reference to it is stored in the ArrayList).
When working with reference-type data (like strings) it's typically not worth the effort to change it to Lists unless you are dealing with really big amounts of data. The List is only around 110 ms faster than the arraylist when dealing with 8 mil. items and both seem to scale linearly.
When possible it really does help a lot to initialize your list or ArrayList to the expected size. It makes the performance approach that of a normal array.
When Microsoft claims that they've improved performance a great deal for collection-types in framework 2.0, it's only true for the "new" types like List. There's almost no difference in ArrayList performance!

Tuesday, March 6, 2007

MOSS vs EPiServer

Here's another reason for always making sure your feed-reader is up to speed. Today I had almost missed this great post by Patrick Tisseghem because I had forgotten to update the url to his feed. Luckily I got tipped off (thanks, Jesper!).
I already know EPiServer quite well - due to my attachement to the "MondoSearch for EPiServer" project. And just a week ago I spend all day getting an intriguing presentation of our other product suite Ontolica's upcoming version for MOSS. As you can imagine I read Patricks post carefully - it's always interesting to get his view on things. And the geek-cruise does sound like a lot of fun (Hint, Hint)!

AI: Connect 4 Game

I have been kinda slow in posting these last couple of weeks. One of the reasons is of course that I've been busy coding (as always). This time however, it's actually real homework thats been keeping me busy.

In the AI course I'm taking at ITU we were assigned the task of making our own version of "Connect 4" - the well known board game where you drop coins into a board from the top and try to get four in a row, horizontally, vertically or diagonally - and preferably before your opponent.

I know that this game has already been "solved" and there exist a perfect solution for it. But nevertheless it's still interesting to make an algorithm that calculates the computers move.

In order to do this I've used a variant of the MiniMax algorithm, optimized with Alpha/Beta pruning. The basic idea is to search through a tree of possible actions and thereby thinking ahead to find the best possible action in order to go from an initial state to a goal state (win).
A state would typically consist of current game-board, and who's turn it is. From a state there's often 7 possible actions - each of the columns that it's possible to choose. If a column is full the number of possible actions decrease.
So, in theory, every time it's the computers time to move we would like to build a tree, starting at the initial (current) state and then branching out with the possible actions. Each of the possible actions will lead to a new node, that indicates the opponents move. The opponent again have a number of possible actions, and we assume that he will always pick the action that is best for him (= worst for us, the computer). This means that we can build a double-recursive method to traverse this tree, taking turns choosing the action with the maximum and minimum outcomes for us and thereby in the end evaluate which of the current actions available to us is the best (= where it's most likely we will win, even if the opponent does his best to screw it up).

If you haven't played around with algorithms like this before it might sound kinda complicated but it's really just applying the same method as most of us do in our heads when playing a game like this: "So, let me see.....if I select this column, then he will most likely select that column which means that I'm certain to win in the following move". However, where as humans often have difficulty thinking more than a couple of moves ahead, computers often have a better chance.

But there's one catch: Even with a game as simple as Connect 4, thinking several moves ahead scales terrible. Except for when the columns starts to fill up, there's 7 possible actions. For each of these actions the opponent can again choose 7 actions bringing us up to 7^2 (=49) states we need to consider. I've found that a typical game of Connect 4 often goes to at least 30 moves before a winner is found, meaning that we would have to examine something like 7^30 (=22539340290692258087863249) states. Of course Alpha-Beta pruning can help a lot on that, but in the end it'll still be too much to calculate in reasonable time.
Thats why it's a bit of a modified Minimax with AB pruning algorithm I use.
Instead of searching all states till it reaches a terminal state it will search until it reaches a certain cut-off depth. At that time it will apply some heuristic evaluation to the cut-off states to determine how close it is to winning in those states.
By experimentation I've so far found that to think 6-7 moves ahead makes the most sense performance wise. Perhaps even 8 will be a possibility if I optimize some more and allow the computer more time to think for each move.
The trick to achieving success even after "only" thinking 6 moves ahead is to have a good evaluation function that gives an accurate impression of how well the computer is doing compared to the player. I seem to have found one that works pretty well (so far only one of my friends have reported to have beat the computer once). The details of the evaluation function, along with code for the rest of my project you'll have to wait for until the project is officially handed in (there's going to be a computer vs computer contest in my class - not unlike the poker robot tournament I'm working on).

As for now, you can try out my game here! You always get to start first, you are red computer is blue. (And yes, I realize there are a couple of bugs still - no need to report them I'm working to fix them).