Friday, January 26, 2007

On-the-fly conversion of MSOffice to raw text

So, this is what I needed the HttpModule for: a module that converts Office documents (and other non-HTML formats) to text/html on the fly. I can imagine several scenarios where this could come in useful (but I'll keep all the great ideas to myself for now).
The POC works like this: a user requests a non-HTML document (Office, PDF, etc.) located on a website and adds "convert=true" to the query string (http://myserver/test.doc?convert=true). The document is then automatically converted to text and returned.
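
Note that the module further down only checks that a "convert" key is present, so "convert=false" would trigger a conversion too. If you want the flag to mean what it says, a stricter check could look roughly like this. This is only a sketch: ConvertFlag and ShouldConvert are names made up for the illustration, not part of the module or of Eyal's library.

```csharp
using System;

public static class ConvertFlag
{
    // Hypothetical helper: returns true only when the query-string value
    // is literally "true" (in any casing). A missing key arrives as null
    // and yields false.
    public static bool ShouldConvert(string queryValue)
    {
        return string.Equals(queryValue, "true", StringComparison.OrdinalIgnoreCase);
    }
}
```

Inside the module it would be called as ConvertFlag.ShouldConvert(context.Request.QueryString["convert"]) in place of the null check.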

To do this we'll use IFilters, and to access them easily from C# I found this neat little library by Eyal Post on CodeProject.

The code for my module is here:


using System;
using System.IO;
using System.Web;
using EPocalipse.IFilter;

namespace Allan.Tools
{
    public class IFilterModule : IHttpModule
    {
        // Hook into the pipeline once the request has been authorized.
        public void Init(HttpApplication application)
        {
            application.AuthorizeRequest +=
                new EventHandler(this.Application_AuthorizeRequest);
        }

        private void Application_AuthorizeRequest(object sender, EventArgs e)
        {
            HttpApplication app = (HttpApplication)sender;
            HttpContext context = app.Context;

            // Only convert when "convert" is present in the query string.
            if (context.Request.QueryString["convert"] != null)
            {
                string s;
                // FilterReader extracts the text of the requested file;
                // the using block closes it even if reading throws.
                using (TextReader tr = new FilterReader(context.Request.PhysicalPath))
                {
                    s = tr.ReadToEnd();
                }
                context.Response.Write(s);
                // Stop here so the original document is not sent as well.
                context.Response.End();
            }
        }

        public void Dispose()
        {
            // Nothing to clean up.
        }
    }
}
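
One thing to watch: ASP.NET sends responses as text/html by default, so any "<" or "&" in the extracted text will be interpreted by the browser, and line breaks will collapse. A possible way around that is to encode the text and wrap it in a pre element before writing it out. The following is a minimal sketch, not part of the module above; TextToHtml and Wrap are hypothetical names, and WebUtility.HtmlEncode assumes .NET 4.0 or later (on older versions HttpUtility.HtmlEncode from System.Web does the same job).

```csharp
using System.Net;

public static class TextToHtml
{
    // Hypothetical helper: escapes HTML-significant characters and wraps
    // the result in <pre> so the browser keeps the original line breaks.
    public static string Wrap(string text)
    {
        return "<pre>" + WebUtility.HtmlEncode(text ?? string.Empty) + "</pre>";
    }
}
```

In the module, context.Response.Write(TextToHtml.Wrap(s)) would then replace the plain Write call.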


To build it, download Eyal's library and reference it in the project. After building the module, register it in the <system.web> section of your web.config like this (the part after the comma in the type attribute is the assembly name, so use the name of your own assembly there):

<httpModules>
    <add type="Allan.Tools.IFilterModule,IFilterHTTPModuleTest" name="IFilterModule"/>
</httpModules>


Enjoy!

1 comment:

Alexey Rusakov said...

Cool idea! Too bad you cannot display images from the word files