Friday, July 13, 2007

Automatic Language Detection

A classic task when dealing with textual information is to automatically identify which language a text is written in (no, geeks - it's not a question of VB or C# - I mean human languages!).
Here's my attempt at a very simple, yet useful approach: character-bigram statistics.
I've gathered extensive statistics on the frequency of every character bigram (a pair of adjacent characters) in several languages, and with those it's possible to determine which language a given text resembles the most.
Try out my Language Detector here!
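
To make "character-bigram statistics" concrete, here is a minimal sketch of the counting step. It is not the LangStat class from the solution: the class name BigramSketch, the method CountBigrams, the string keys and the lower-casing are all assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class BigramSketch
{
    // Counts every pair of adjacent characters in the text.
    // Lower-casing is an assumption of this sketch, not something the post specifies.
    public static Dictionary<string, int> CountBigrams(string text)
    {
        var counts = new Dictionary<string, int>();
        string t = text.ToLowerInvariant();
        for (int i = 0; i + 1 < t.Length; i++)
        {
            string key = t.Substring(i, 2);
            int n;
            counts.TryGetValue(key, out n); // n stays 0 when the bigram is new
            counts[key] = n + 1;
        }
        return counts;
    }

    public static void Main()
    {
        foreach (KeyValuePair<string, int> kv in CountBigrams("banana"))
            Console.WriteLine(kv.Key + ": " + kv.Value);
    }
}
```

For "banana" the bigrams are ba, an, na, an, na, so "ba" is counted once and "an" and "na" twice each. Dividing each count by the total number of bigrams gives the relative frequencies that get compared between languages.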

The text corpus I used was another classic: the proceedings of the European Parliament over several years (can be found here).

My first step was to construct a class to hold the bigram statistics for some text (LangStat).
In the class I also included code to determine the Euclidean distance between two sets of bigram statistics (useful when trying to determine which language a text is most similar to). I implemented it as an operator overload for "-", so you can always get the distance between two sets of bigram statistics by simply subtracting one from the other.

//Calculates euclidean distance between two LangStats
public static double operator -(LangStat a, LangStat c)
{
    //Operator overload
    double tot = 0;
    foreach (Bigram b in a.Bigrams.Keys)
    {
        if (c.Bigrams.ContainsKey(b))
        {
            //Bigram exists in remote
            double me = (double)a.Bigrams[b] / a.Count;
            double them = (double)c.Bigrams[b] / c.Count;
            tot += Math.Pow(Math.Abs(me - them), 2);
        }
    }
    return Math.Sqrt(tot);
}

Then I built a console trainer application that loads the corpus text files for a given language, cleans up any unwanted tags in them and then adds the text to a bigram statistic.

When it's done, it uses System.CodeDom to generate source code for a class that inherits LangStat, but which is specific to the current language. That way I'll have my languages precompiled and ready to be compared to custom textual content.
This might not be the most efficient approach, but it sure was fun to play around with CodeDom (an interesting namespace that I get to use far too seldom).

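//Needs: using System; using System.CodeDom; using System.IO; using System.Text.RegularExpressions; using Microsoft.CSharp;
static void Main(string[] args)
{
    string lang = "sv";
    string langname = "Swedish";
    string[] files = Directory.GetFiles((...language folder...));

    //Build language statistics from file-corpus
    LangStat l = new LangStat();
    foreach (string f in files)
    {
        Console.WriteLine("Examining file: " + f);
        string s;
        using (StreamReader sr = new StreamReader(f))
        {
            s = sr.ReadToEnd();
        }
        //File loaded
        s = Regex.Replace(s, "<[^>]*>", " ", RegexOptions.Multiline);
        //Tags removed
        //(the call that adds s to the statistics in l is missing from this listing)
    }

    //Generate Code
    System.CodeDom.CodeNamespace ns =
        new System.CodeDom.CodeNamespace("Allan.Language.Detection");
    CodeTypeDeclaration tp = new CodeTypeDeclaration(langname);
    tp.IsClass = true;
    //Assumed wiring, missing from this listing: inherit LangStat and register the type
    tp.BaseTypes.Add("LangStat");
    ns.Types.Add(tp);
    CodeConstructor cc = new CodeConstructor();
    cc.Attributes = MemberAttributes.Public;
    tp.Members.Add(cc); //assumed, missing from this listing
    //(the start of the next statement is missing from this listing)
    //    ...new CodePrimitiveExpression(l.Bigrams.Count));
    foreach (Bigram b in l.Bigrams.Keys)
    {
        //Could be done much nicer, but I'm in a hurry
        //(the tail of this snippet is reconstructed: the stored count, then the closing parenthesis)
        cc.Statements.Add(new CodeSnippetExpression(
            "_bigrams.Add(new Bigram('" + b.A + "','" + b.B + "')," +
            l.Bigrams[b] + ")"));
    }
    cc.Statements.Add(new CodeAssignStatement(
        new CodeVariableReferenceExpression("_count"),
        new CodePrimitiveExpression(l.Count)));
    System.CodeDom.Compiler.ICodeGenerator gen =
        new CSharpCodeProvider().CreateGenerator();
    //Dispose the writer so the generated file is flushed to disk
    using (StreamWriter sw = File.CreateText(langname + ".cs"))
    {
        gen.GenerateCodeFromNamespace(ns, sw,
            new System.CodeDom.Compiler.CodeGeneratorOptions());
    }
}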


Finally I just had to build a simple Windows test app that compares the text you type against each of the languages. Download the solution here.
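
The comparison step can be sketched roughly as follows. This is not the code from the solution: NearestLanguageSketch, Profile, Distance and Detect are hypothetical names, the two training strings are tiny stand-ins for a real corpus, and the distance here is taken over the union of bigrams (a bigram missing from one side counts as frequency 0), which is an assumption of this sketch rather than a copy of the operator shown earlier.

```csharp
using System;
using System.Collections.Generic;

public static class NearestLanguageSketch
{
    // Relative bigram frequencies: each count divided by the total number of bigrams.
    public static Dictionary<string, double> Profile(string text)
    {
        var freq = new Dictionary<string, double>();
        string t = text.ToLowerInvariant();
        int total = Math.Max(t.Length - 1, 0);
        for (int i = 0; i + 1 < t.Length; i++)
        {
            string key = t.Substring(i, 2);
            double f;
            freq.TryGetValue(key, out f); // f stays 0 when the bigram is new
            freq[key] = f + 1.0 / total;
        }
        return freq;
    }

    // Euclidean distance over the union of bigrams in both profiles.
    public static double Distance(Dictionary<string, double> a, Dictionary<string, double> b)
    {
        double sum = 0;
        foreach (KeyValuePair<string, double> kv in a)
        {
            double other;
            b.TryGetValue(kv.Key, out other);
            sum += (kv.Value - other) * (kv.Value - other);
        }
        foreach (KeyValuePair<string, double> kv in b)
            if (!a.ContainsKey(kv.Key))
                sum += kv.Value * kv.Value;
        return Math.Sqrt(sum);
    }

    // Returns the name of the language profile closest to the text.
    public static string Detect(string text, Dictionary<string, Dictionary<string, double>> languages)
    {
        Dictionary<string, double> p = Profile(text);
        string best = null;
        double bestDist = double.MaxValue;
        foreach (KeyValuePair<string, Dictionary<string, double>> lang in languages)
        {
            double d = Distance(p, lang.Value);
            if (d < bestDist) { bestDist = d; best = lang.Key; }
        }
        return best;
    }

    public static void Main()
    {
        // Tiny stand-in training texts; real profiles would come from a large corpus.
        var languages = new Dictionary<string, Dictionary<string, double>>();
        languages["English"] = Profile("the quick brown fox jumps over the lazy dog and then the cat");
        languages["Swedish"] = Profile("flygande bäckasiner söka hwila på mjuka tuvor och det är bra");
        Console.WriteLine(Detect("the dog and the fox", languages));
    }
}
```

With profiles this small the result is only a toy; the approach depends on training each language on a large amount of text so the frequencies are stable.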


Anonymous said...

Under what license can i use your source code?

Allan Thræn said...