As you might have noticed, the last couple of days have been kind of slow, tech-blog-wise.
This is of course due to the birth of our wonderful boy, Maximilian, who was born Monday morning.
I expect it'll be a couple of weeks before I have time to write tech-blog entries again :-)
Check my personal blog for personal updates.
.NET coding, Search, Information Access, CMS Systems, AJAX, Information Retrieval, Content Management
Wednesday, July 25, 2007
Tuesday, July 17, 2007
NXT: My first 'bot
As I mentioned earlier I recently got a cool Lego MindStorms NXT to play with.
Now, with the help of Jesper, we managed to build the basic humanoid robot - and it's awesome!
I've coded it to do a few things:
- Whenever the light is on, Mr. Bot will say "Goodmorning" and start walking.
- When the light is turned off he'll say "Goodnight" and stop walking.
- When he's 12 cm from an obstacle he'll say "Please" (...move) and stop.
- When he hears a noise (like somebody clapping their hands) he'll look around to see what's going on.
- If his hand-button is pushed he'll greet you with the words "Have a nice day".
My initial idea was to build in some simple learning mechanism (perhaps reinforcement learning) but I haven't gotten around to that yet. So far I'm just blinded by all the possibilities :-)
As handsome as Mr. Bot is now, you'd better take a good look, 'cause now he's being taken apart. Jesper suggested a really fun use for the NXT - something that every household needs: So now I'm going to build a machine that can automatically color-sort M&M's - and perhaps send out maydays if Olga comes too close to the candy!
WCF: Sharing Types between Server and Client
A discussion I've run into time and time again over the last few months of working with WCF is whether to use the generated proxy classes client-side or to think of something else (like inheriting from the proxy classes, creating your own proxies, or somehow trying to make the proxy classes identical to the source classes).
I guess the discussions arise because people aren't sure whether they should think of WCF as being like Web Services, which are loosely coupled (generated client proxies), or like Remoting, which often has a tighter coupling between the interacting participants (shared DLL).
Until now I've mainly been a fan of the loose coupling because of three things:
- It's the easiest and fastest: just click the right buttons in your VS and you're set! (Yes, I can be quite lazy at times, but remember that lazy developers are often the best.)
- There's no dependency on a specific DLL version between the client and the server. If the server gets an update that breaks the service contract, the client just has to regenerate its proxy and you're set.
- I haven't seen a clean and pretty alternative solution before. Mostly it's been messy.
Friday, July 13, 2007
Automatic Language Detection
A classic task when dealing with textual information is to automatically identify which language a text is written in (no, geeks - it's not a question of VB or C# - I mean human languages!).
Here's my attempt at a very simple, yet useful approach: character-bigram statistics.
I've basically compiled extensive statistics on the frequency of all bigrams in several languages, and using them it's now possible to determine which language a given text resembles the most.
Try out my Language Detector here!
The text corpus I used was another classic, the proceedings of the European Parliament over several years (can be found here).
My first step was to construct a class to contain bigram statistics for some text (LangStat).
In the class I also included code to determine the Euclidean distance between two sets of bigram statistics (useful when trying to determine which language a text is most similar to). I implemented it as an operator overload for "-", so you can always determine the distance between two bigram statistics by simply subtracting one from the other.
//Calculates the Euclidean distance between two LangStats
public static double operator -(LangStat a, LangStat c)
{
    double tot = 0;
    foreach (Bigram b in a.Bigrams.Keys)
    {
        //Only bigrams present in both statistics contribute;
        //bigrams missing from c are skipped
        if (c.Bigrams.ContainsKey(b))
        {
            //Relative frequency of the bigram on each side
            double me = (double)a.Bigrams[b] / a.Count;
            double them = (double)c.Bigrams[b] / c.Count;
            tot += Math.Pow(Math.Abs(me - them), 2);
        }
    }
    return Math.Sqrt(tot);
}
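To make the bigram idea concrete, here is a minimal, self-contained sketch. BigramProfile, Distance and Closest are hypothetical names of mine, not the LangStat class from the solution. Unlike the operator above, this variant also counts bigrams that occur in only one of the two profiles; that is my assumption about what is useful, not the post's behavior.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for LangStat; illustrative only.
public sealed class BigramProfile
{
    private readonly Dictionary<string, int> _counts = new Dictionary<string, int>();
    private int _total;

    // Counts every pair of adjacent characters in the lower-cased text.
    public void AddText(string text)
    {
        string t = text.ToLowerInvariant();
        for (int i = 0; i + 1 < t.Length; i++)
        {
            string bigram = t.Substring(i, 2);
            int n;
            _counts.TryGetValue(bigram, out n); // n stays 0 when the bigram is new
            _counts[bigram] = n + 1;
            _total++;
        }
    }

    // Relative frequency of a bigram; 0 when it was never seen.
    private double Frequency(string bigram)
    {
        int n;
        _counts.TryGetValue(bigram, out n);
        return _total == 0 ? 0.0 : (double)n / _total;
    }

    // Euclidean distance over the union of bigrams seen in either profile.
    public static double Distance(BigramProfile a, BigramProfile b)
    {
        var keys = new HashSet<string>(a._counts.Keys);
        keys.UnionWith(b._counts.Keys);
        double sum = 0;
        foreach (string k in keys)
        {
            double d = a.Frequency(k) - b.Frequency(k);
            sum += d * d;
        }
        return Math.Sqrt(sum);
    }

    // Returns the name of the language profile closest to the sample, or null if none given.
    public static string Closest(BigramProfile sample, IDictionary<string, BigramProfile> languages)
    {
        string best = null;
        double bestDistance = double.MaxValue;
        foreach (KeyValuePair<string, BigramProfile> kv in languages)
        {
            double d = Distance(sample, kv.Value);
            if (d < bestDistance)
            {
                bestDistance = d;
                best = kv.Key;
            }
        }
        return best;
    }
}
```

In use, you would build one profile per language from the corpus, one profile from the text to classify, and ask Closest for the best match.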
Then I built a console trainer application that is able to load the corpus text files for a given language, clean up any unwanted tags in them and then add the text to a bigram statistic.
When it's done, it uses System.CodeDom to generate source code for a class that inherits from LangStat, but which is specific to the current language. That way I'll have my languages precompiled and ready to be compared to custom textual content.
This might not be the most efficient approach, but it sure was fun to play around with CodeDom (an interesting namespace that I get to use far too seldom).
static void Main(string[] args)
{
    string lang = "sv";
    string langname = "Swedish";
    string[] files = Directory.GetFiles((...language folder...));

    //Build language statistics from file-corpus
    LangStat l = new LangStat();
    foreach (string f in files)
    {
        Console.WriteLine("Examining file: " + f);
        StreamReader sr = new StreamReader(f);
        string s = sr.ReadToEnd();
        sr.Close();
        //File loaded
        s = Regex.Replace(s, "<[^>]*>", " ", RegexOptions.Multiline);
        //Tags removed
        l.AddText(s);
    }

    //Generate Code
    System.CodeDom.CodeNamespace ns =
        new System.CodeDom.CodeNamespace("Allan.Language.Detection");
    CodeTypeDeclaration tp = new CodeTypeDeclaration(langname);
    tp.BaseTypes.Add(typeof(LangStat));
    tp.IsClass = true;
    ns.Types.Add(tp);

    CodeConstructor cc = new CodeConstructor();
    cc.Attributes = MemberAttributes.Public;
    tp.Members.Add(cc);
    cc.BaseConstructorArgs.Add(
        new CodePrimitiveExpression(l.Bigrams.Count));
    foreach (Bigram b in l.Bigrams.Keys)
    {
        //Could be done much nicer, but I'm in a hurry
        cc.Statements.Add(
            new CodeSnippetExpression(
                "_bigrams.Add(new Bigram('" + b.A + "','" + b.B + "')," +
                l.Bigrams[b].ToString() + ")"));
    }
    cc.Statements.Add(
        new CodeAssignStatement(
            new CodeVariableReferenceExpression("_count"),
            new CodePrimitiveExpression(l.Count)));

    System.CodeDom.Compiler.ICodeGenerator gen =
        new CSharpCodeProvider().CreateGenerator();
    StreamWriter sw = File.CreateText(langname + ".cs");
    gen.GenerateCodeFromNamespace(ns, sw,
        new System.CodeDom.Compiler.CodeGeneratorOptions());
    sw.Close();
}
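For readers who haven't used CodeDom, here is a stripped-down sketch of the same generate-a-class idea. CodeDomSketch and GenerateClass are hypothetical names, and the generated class has no base type, so this is not the trainer from the solution. It calls GenerateCodeFromNamespace directly on the provider rather than going through CreateGenerator(), which is marked obsolete in later framework versions. On newer .NET runtimes I'd expect this to need the System.CodeDom package.

```csharp
using System;
using System.CodeDom;
using System.CodeDom.Compiler;
using System.IO;
using Microsoft.CSharp;

// Hypothetical helper; illustrative only.
public static class CodeDomSketch
{
    // Returns C# source for a class with an int field "_count"
    // that is assigned the given value in its public constructor.
    public static string GenerateClass(string namespaceName, string className, int count)
    {
        var ns = new CodeNamespace(namespaceName);
        var type = new CodeTypeDeclaration(className) { IsClass = true };
        ns.Types.Add(type);

        // private int _count;
        type.Members.Add(new CodeMemberField(typeof(int), "_count"));

        // public ClassName() { this._count = <count>; }
        var ctor = new CodeConstructor { Attributes = MemberAttributes.Public };
        ctor.Statements.Add(
            new CodeAssignStatement(
                new CodeFieldReferenceExpression(new CodeThisReferenceExpression(), "_count"),
                new CodePrimitiveExpression(count)));
        type.Members.Add(ctor);

        using (var provider = new CSharpCodeProvider())
        using (var writer = new StringWriter())
        {
            provider.GenerateCodeFromNamespace(ns, writer, new CodeGeneratorOptions());
            return writer.ToString();
        }
    }
}
```

Calling CodeDomSketch.GenerateClass("Demo.Languages", "Swedish", 42) returns the source text as a string, which could then be written to a .cs file.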
Finally I just had to build a simple Windows testing app that compares the entered text to the languages. Download the solution here.
Wednesday, July 11, 2007
Majestic
A major problem for most global search engines is the simple fact that the net grows so rapidly that no matter how many server farms they build, pages are being created or updated faster than the search engines can detect and index them.
I recently came across Majestic, which has a really interesting approach to this problem: distributed crawlers. They've made a simple crawler client that distributes the indexing among volunteers who provide spare bandwidth and computer time to this noble task, in much the same way as some people donate time and bandwidth to the SETI@home project or my personal favourite, the search for the next Mersenne prime.
However, the idea of distributing the search seems really useful. Now, if only they had done something novel on the search end instead of just copying Google, I would have been thrilled. But I like the idea anyway. Check it out at http://www.majestic12.co.uk
Oh yeah, while you're there, check out the C# source for their HTML Parser. It's awesome. Fast and furious!
Tuesday, July 10, 2007
Code Challenge: Michael the Math Maniac
Time for another summer code-challenge. Hopefully this one is a bit easier than the last one :-)
Mr. Michael was a lucky man, 'cause today, 20070710 (ISO standard), was his birthday!
But Michael wasn't your average lucky birthday boy. He was a Math Maniac. And on this special day, he was wondering: How many of the numbers between 0 and 1.000.000.000 contain the digits "20070710" (in that order) somewhere within the number?
Design a method int CountNumbers(int min, int max, int SequenceToFind); that returns the count of numbers which contains the SequenceToFind.
1st prize goes to first valid entry, 2nd prize to best performing entry.
The prizes are still "honour & mocking rights".
May the best developer win.
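To pin down what the method is supposed to return, here is a naive baseline. It is only a sketch under my own reading of the task: min and max are both inclusive, and "contains" means the decimal digits of SequenceToFind appear consecutively in the decimal form of the number. The MathManiac class name is made up. It is obviously not a contender for the performance prize.

```csharp
using System;
using System.Globalization;

// Hypothetical container class; illustrative only.
public static class MathManiac
{
    // Brute force: converts every number in [min, max] to text and searches it.
    public static int CountNumbers(int min, int max, int sequenceToFind)
    {
        string needle = sequenceToFind.ToString(CultureInfo.InvariantCulture);
        int count = 0;
        // long loop variable so that i++ cannot overflow when max == int.MaxValue
        for (long i = min; i <= max; i++)
        {
            if (i.ToString(CultureInfo.InvariantCulture).Contains(needle))
            {
                count++;
            }
        }
        return count;
    }
}
```

A faster entry would presumably count matching digit patterns directly instead of visiting every number.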
Friday, July 6, 2007
Code Challenge Results: No luck for the Hash-Party
So far there haven't been many entries to the latest Code Challenge, so I suppose I might have overestimated the abilities of you, my honorable readers.
In fact, the only entry I received was from Peter Thygesen, and he admits to actually just having adopted an algorithm by Paul Hsieh.
However, just for fun I compared it to the built-in string hashing algorithm (.GetHashCode()).
The comparison I did was fairly simple: I took 1.000.000 fairly random unique strings (well - actually Guids as strings) and timed how long it took, cumulatively, to run the algorithms. I also checked how many duplicate hash codes each algorithm produced.
It turns out they were pretty equal.
The built-in algorithm had 114 duplicate hash codes and took 15275 ms, while Mr. Thygesen's entry had 115 duplicates and took 15318 ms.
Thanks for playing, Peter - but I think we have to declare this a no-win :-)
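A comparison along those lines could look roughly like the sketch below. HashComparison, Measure and MakeGuidStrings are hypothetical names; this is not the harness that produced the numbers above. Note that the measured time includes the HashSet bookkeeping used to detect duplicates, so only the relative difference between two hash functions is meaningful.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical measuring harness; illustrative only.
public static class HashComparison
{
    public sealed class Result
    {
        public int Duplicates;   // inputs whose hash code was already seen
        public long ElapsedMs;   // wall-clock time for the whole run
    }

    // Hashes every input once, counting hash codes that were already produced.
    public static Result Measure(IList<string> inputs, Func<string, int> hash)
    {
        var seen = new HashSet<int>();
        int duplicates = 0;
        Stopwatch sw = Stopwatch.StartNew();
        foreach (string s in inputs)
        {
            if (!seen.Add(hash(s)))
            {
                duplicates++;
            }
        }
        sw.Stop();
        return new Result { Duplicates = duplicates, ElapsedMs = sw.ElapsedMilliseconds };
    }

    // Generates n GUIDs as strings to use as test input.
    public static List<string> MakeGuidStrings(int n)
    {
        var list = new List<string>(n);
        for (int i = 0; i < n; i++)
        {
            list.Add(Guid.NewGuid().ToString());
        }
        return list;
    }
}
```

Usage would be something like HashComparison.Measure(HashComparison.MakeGuidStrings(1000000), s => s.GetHashCode()), run once per algorithm on the same list of strings.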
EPiServer 5 CMS - First impressions
A couple of weeks ago I wanted to check out how the new EPiServer 5 looked, so I downloaded a free trial version of the RC2.
It comes in two flavors. There's the traditional installer that installs the Manager, which allows you to set up new EPiServer websites with a default look & feel, but on top of that there's also a new Visual Studio integration available that I instantly knew I just had to try out.
The install itself was very (!) easy, and without any problems or hiccups I had a lot of new features in my Visual Studio.
For instance I now had the possibility of creating a new EPiServer Project, which I instantly did.
This template created a blank EPiServer website, db, etc. for me, ready to use.
It's really clear that with this new release the clever guys at EPiServer have been focusing a lot on improving the quality of life for all the developers out there who use it as an everyday tool to make websites.
At the same time EPiServer is now even more tightly coupled with the newest Microsoft technologies, basing their CMS on standard ASP.NET 2.0 things like Master pages and ASP.NET User/Role configuration. They've also done a tremendous job of integrating Workflow Foundation into the core functionality - and this seems like one of the best usages of WWF I've seen so far.
Seen from a developer perspective, the new SDK makes me think of EPiServer as a huge toolbox that gives me a lot of tools to efficiently create cool websites and web functionality in a standard ASP.NET way, while taking care of a lot of the tedious details. But from an Editor / Administrator perspective you still get the well-known intuitive web-based interface for administering and editing the website. Cool.
The editor and administrator interface hasn't changed all that much since the last version and the entry point is still the "famous" right-click menu for logged-in editors. It seems to me like the Editor interface hasn't gotten all that much work done except for a paint job and perhaps some improved versioning/comparison features (however, I could be mistaken, having never been a real-life editor :-) ). That's okay, though. Rome wasn't built in a day and I certainly prefer the improved SDK and architectural changes.
Yes, I am the kind of guy who cares more about what's under the hood of my car than the color, shape and sexiness of its exterior. However, it still wouldn't hurt to give a bit of attention to improving the (already good) usability for editors and administrators in a future version. Perhaps AJAX is a good approach here.
While I'm at it, here are a few more things for my wishlist for future versions: WCF support for easier data / functionality access and a couple of nice fully-featured demo sites / templates for the SDK. It would be nice to have a couple of ready-to-go samples as VS Templates.
All in all I'm very impressed with the RC2 version of EPiServer 5 and I can't wait to play around with it some more. Don't be surprised if a couple of modules start appearing on this blog for free download in the near future. EPiServer continues to be a powerful workhorse in the CMS world, not as flashy and shiny as some competitors but intuitive, strong and flexible.
Zattoo is awesome!
Yesterday I came across Zattoo, which is a really cool p2p live-TV service. It's a bit the same concept as Joost, but with Zattoo it's not on demand. Instead you get high-quality streaming of live channels... and quite a lot of them already! It's easy to get started and it works surprisingly well.
I'll definitely remember that I have it installed next time my wife wants to watch "America's Next Top Model" when I wanna watch the news!
Speaking of news and online TV, I've already become a regular viewer of DR Update (sorry, Danish-only news). Good quality, and nice to see news videos produced specifically for the web. Way to go DR!
Monday, July 2, 2007
Code Challenge: Fun with Hash
No, this is not what you expected, crackhead. This post doesn't include getting high on anything stronger than your coding skills. It's time for another code challenge!
The challenge
Sometimes it can be very handy to make a small fingerprint of a piece of textual information so you can easily compare it to other pieces of text and check if they are identical without doing a full textual comparison.
The friendly folks at Microsoft have even been kind enough to include a "GetHashCode()" method in the .NET framework, but in this challenge I kindly ask you to ignore that.
The challenge is to code your own method that returns an integer hashcode for any string, so that two identical strings will have the same hashcode and the probability of two different strings sharing the same fingerprint is as small as possible.
So, write a method with the signature: static int MakeHash(string s); in C# and post it as a comment here.
Any posts that use GetHashCode(), MD5, or any other built-in hashing mechanism are disqualified, along with posts that are almost identical to prior entries.
Post before Friday and I'll make comparisons between the submissions on two different parameters:
- Performance
- Duplicate Hashcodes for non-identical strings
Good luck, Gentlemen!
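To show the expected shape of an entry, here is one possible MakeHash. It is a sketch using 32-bit FNV-1a, a well-known public-domain hash, applied to the two bytes of each UTF-16 character; it is not something from this blog or from any submission, and the HashSketch class name is made up.

```csharp
using System;

// Hypothetical container class; illustrative only.
public static class HashSketch
{
    // 32-bit FNV-1a over the UTF-16 code units of s (low byte first).
    // Throws NullReferenceException if s is null.
    public static int MakeHash(string s)
    {
        unchecked
        {
            uint hash = 2166136261;          // FNV offset basis
            foreach (char c in s)
            {
                hash ^= (byte)c;             // low byte of the character
                hash *= 16777619;            // FNV prime
                hash ^= (byte)(c >> 8);      // high byte of the character
                hash *= 16777619;
            }
            return (int)hash;                // reinterpret the 32 bits as a signed int
        }
    }
}
```

Identical strings always give the same value, since the result depends only on the characters; how it fares on speed and duplicates would have to be measured.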
Sunday, July 1, 2007
Happy birthday to me!
Today is my birthday. I love birthdays... lots of cake and many presents. The last couple of years (okay... ever since I lost my childhood innocence) the presents have gotten more and more "boring" (= practical and nice but not really play-toys).
Being the eternal kid that I am, I was naturally extremely pleased this year when my wonderful wife (!!!) gave me Mindstorms NXT.
I can't wait to start playing around with it and code C# applications for it.
A quick googling showed that this could be a good place to start!