Friday, January 26, 2007

On-the-fly conversion of MSOffice to raw text

So, this is what I needed the HTTPModule for: an HttpModule that converts Office documents (and other weird formats) to text/html on the fly. I can imagine several scenarios where this could come in useful (but I'll keep all the great ideas to myself for now).
The proof of concept goes something like this: a user requests a non-HTML document (Office, PDF, etc.) located on a website, with "convert=true" in the query string (http://myserver/test.doc?convert=true). The document is then automatically converted to text and returned.

To do this we'll use IFilters, and to access them easily from C# I found this neat little library by Eyal Post on CodeProject.

The code for my module is here:

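using System;
using System.IO;
using System.Web;
using EPocalipse.IFilter;

namespace Allan.Tools
{
    public class IFilterModule : IHttpModule
    {
        public void Init(HttpApplication application)
        {
            application.AuthorizeRequest +=
                new EventHandler(this.Application_AuthorizeRequest);
        }

        private void Application_AuthorizeRequest(object sender, EventArgs e)
        {
            HttpApplication app = (HttpApplication)sender;
            HttpContext context = app.Context;

            // Only convert when "convert" is present in the query string.
            if (context.Request.QueryString["convert"] != null)
            {
                // FilterReader extracts the text through the registered IFilter.
                using (TextReader tr = new FilterReader(context.Request.PhysicalPath))
                {
                    string s = tr.ReadToEnd();
                    // Return the extracted text instead of the original file.
                    context.Response.ContentType = "text/html";
                    context.Response.Write("<pre>" + HttpUtility.HtmlEncode(s) + "</pre>");
                }
                // Skip the rest of the pipeline so the raw document is not sent.
                app.CompleteRequest();
            }
        }

        public void Dispose()
        {
        }
    }
}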

To build it, download Eyal's library and reference it in the project. After building the module, register it in your web.config like this:

<add type="Allan.Tools.IFilterModule,IFilterHTTPModuleTest" name="IFilterModule"/>
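
For context, here is a minimal sketch of where that element would sit. It assumes the classic ASP.NET pipeline (IIS 6, or IIS 7 in classic mode), where modules are registered under system.web/httpModules; the rest of your web.config will of course look different. The assembly name IFilterHTTPModuleTest is simply the one from my test project.

```xml
<configuration>
  <system.web>
    <httpModules>
      <!-- type="Namespace.Class, AssemblyName" -->
      <add type="Allan.Tools.IFilterModule,IFilterHTTPModuleTest" name="IFilterModule"/>
    </httpModules>
  </system.web>
</configuration>
```

Note that the module only sees requests that IIS actually hands to ASP.NET, so extensions such as .doc or .pdf probably need to be mapped to the ASP.NET ISAPI handler first. Under the IIS 7 integrated pipeline the registration would go in system.webServer/modules instead.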


1 comment:

Alexey Rusakov said...

Cool idea! Too bad you cannot display images from the word files