Friday, March 16, 2007

MondoSearch Result Authentication

A very typical request I hear from customers and partners is the ability to return only the results that the current user is allowed to see. This desire is very natural, but it can present quite a challenge to 3rd party search engines like MondoSearch. The problem is that how authorization works varies a lot from one setup to the next, and hence no general solution can be made. We can only deliver specific authorization solutions for specific systems (like we have done for EPiServer or Sitecore) or provide general toolkits/examples that make it easier to custom-build an integration.
The problem with authenticated content can really be divided into two sub-problems:

  • Indexing secure content
  • Searching in secure content
Indexing isn't that big of a problem. There are many ways to make that content available to the search engine. MondoSearch has built-in support for basic authentication, challenge-response (integrated authentication) and forms log in, and in many CMS systems it's also quite easy to override the security if the client originates from a specific IP, or has a specific HTTP request setting. Generally we see only very few problems in actually indexing the content. The only thing that can be tricky is when the content on the individual pages varies based on who is logged in. Handling that would require the search engine to index the same URL once for each of the different users that can access it. Luckily, pages with user-dependent content are typically portal pages that are not all that interesting to index. The articles, documents and database content that are interesting to index are not a problem.

Searching in secure content is really the main challenge when it comes to authenticated content. Even though security for the individual pages is typically checked when you try to access a page, it can still be quite revealing when the title (and perhaps description) of a page is displayed on the result page. In fact, to be totally safe, a user who doesn't have access to certain documents must not even know of their existence from the result page! (Suppose I searched for "invasion plan Iran" on the Pentagon's website and was told that there were 10,000 documents that matched the phrase, but none I was allowed to see.)
In order to achieve this there are generally three approaches:
  • Authentication by filtering. Store access rights when indexing the documents and use them as filters in the search.
  • Authentication by exclusion. When performing a search, manually check that the current user has permission to see each of the results before returning it.
  • Rules based authentication. A number of specific filters are defined for each user group.
In general I prefer to use Filtering to perform search result authentication.
With MondoSearch this typically means adding meta-tags (or meta-data) to all documents defining which groups / users are allowed to view them. And perhaps even which groups / users are specifically denied access.
A Meta-tag like that could look something like this:
<meta name="ALLOW" content=";53;124;351;33;12341;"/>
Then, on the result page, all you'll need is a piece of code that extracts the user-id and the group-ids of the current user and then adds search filters to the search query. Suppose we have a user with user-id "42" who belongs to the group "users" (id: 351) and who performs a search that returns a document with the above meta-tag. The MQL that is sent to the search engine would then have to have these filters added:

"... FILTERS ALLOW CONTAINS ';42;' OR ALLOW CONTAINS ';351;' ...."
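
Both halves of this scheme can be sketched in a few lines of C#. This is only an illustration: the class and method names are made up for the example, the tag uses the standard HTML "content" attribute, and the id list is closed with a trailing semicolon so that the last id also matches a CONTAINS ';id;' test. The filter syntax is simply copied from the example above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of MondoSearch.
public static class AllowFilter
{
    // Renders the ALLOW meta-tag for a document. Every id is wrapped in
    // semicolons (";53;124;") so that a CONTAINS ';53;' test cannot
    // accidentally match a longer id such as 153 or 530.
    public static string BuildMetaTag(IEnumerable<int> allowedIds)
    {
        string value = ";" + string.Join(";", allowedIds) + ";";
        return "<meta name=\"ALLOW\" content=\"" + value + "\"/>";
    }

    // Builds the filter clause for the current user: one CONTAINS test
    // for the user-id and one per group-id, joined with OR.
    public static string BuildFilter(int userId, IEnumerable<int> groupIds)
    {
        IEnumerable<string> tests = new[] { userId }
            .Concat(groupIds)
            .Select(id => "ALLOW CONTAINS ';" + id + ";'");
        return "FILTERS " + string.Join(" OR ", tests);
    }
}
```

Calling AllowFilter.BuildFilter(42, new[] { 351 }) gives the same clause as in the example above, ready to be appended to the rest of the query.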

To also enforce DENY is a bit more tricky, but certainly just as doable.
The obvious benefits here are: it's very (!) fast, it's clean and it's easy.
However, there are also a number of downsides:
  • Not all CMS systems support outputting permission-lists to the crawler
  • If access-rules change, they will not be propagated to the index until next crawl
  • It typically doesn't work for non-html documents like Office and PDF (since it's kinda hard to attach meta-data to these types dynamically). However, there are a number of workarounds to this problem.
The alternative to filtering is exclusion, which in my eyes is definitely not pretty, but sometimes necessary. Authentication by exclusion calls for a custom method that checks whether the current user has access to a given URL. A pointer (delegate) to this method is then passed to the search engine, which will call it to evaluate every result in the result-set.
The obvious problem is the performance of this solution. On a result-set of 10 pages, with a fast-checking method, it can be acceptable, but often result-sets can be very large. Imagine having to call a custom-made method for every one of 100,000 results - or worse!!
Another problem is that in order to pass a delegate to the search engine, the search engine needs to be installed on the same server as the CMS - something that doesn't always fit into the desired machine architecture.
Of course, the performance of such a method can be improved in some cases: intelligent caching, only checking the results on the first page, etc. But in my experience it's never a really good solution. In my eyes the only really acceptable use of this is as a complement to the filtering search (for instance to check access for non-html documents) - or where no other solution works.
In order to set this up on a MondoSearch template, assign a method handler to the "OnAuthorize" event in the SearchControl, like this: OnAuthorize="CheckAuthorization" .
Then define the method elsewhere:


public bool CheckAuthorization(string url)
{
    // Placeholder: return true only if the current user may see this URL.
    return true;
}
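
As a rough sketch of the "intelligent caching" idea mentioned earlier, the slow permission check can be wrapped so that each URL is only checked once per user. Everything here is an assumption made for illustration: the CachedAuthorizer class is hypothetical, the real check is passed in as a delegate, and how the method gets wired to OnAuthorize depends on the template.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical wrapper: remembers the verdict for each URL so the slow
// underlying check runs at most once per URL. Create one instance per
// user (or session), and throw it away when access rules change.
// Not thread-safe.
public class CachedAuthorizer
{
    private readonly Func<string, bool> _check;   // the real (slow) check
    private readonly Dictionary<string, bool> _cache =
        new Dictionary<string, bool>(StringComparer.Ordinal);

    public CachedAuthorizer(Func<string, bool> check)
    {
        _check = check ?? throw new ArgumentNullException(nameof(check));
    }

    // Same shape as CheckAuthorization: URL in, verdict out.
    public bool CheckAuthorization(string url)
    {
        bool allowed;
        if (!_cache.TryGetValue(url, out allowed))
        {
            allowed = _check(url);   // cache miss: ask the CMS
            _cache[url] = allowed;
        }
        return allowed;
    }
}
```

The cache only helps when the same URLs show up again (paging, repeated searches). It does nothing for the first pass over a huge result-set, which is why I still see this as a complement to filtering.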

The last authentication method I will briefly touch on in this post is to use a number of rules.
The idea here is that by applying knowledge about the security setup on a website, a couple of simple rules might do the trick.
Imagine a simple setup where only two types of visitors exist on a web-site: logged-in and not-logged-in, and where all the content that only the logged-in users are allowed to see is in the sub-directory "/secure".
In this case you could simply apply some additional MQL when a visitor performs a search:
if (!loggedIn) { mql += " FILTERS @CHANNEL!='secure'"; }
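
With more than two visitor types, the same idea can be kept in a small rule table. The sketch below is only an illustration: the RuleFilters class and the group names "anonymous" and "member" are made up, and the only filter clause in it is the one from the example above. An unknown group raises an error rather than silently getting unfiltered results.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical rule table: visitor group -> filter clause to append.
public static class RuleFilters
{
    private static readonly Dictionary<string, string> Rules =
        new Dictionary<string, string>
        {
            { "anonymous", "FILTERS @CHANNEL!='secure'" },
            { "member", "" }   // logged-in visitors: no extra filter
        };

    // Returns the query with the group's rule appended. Throws for a
    // group without a rule, so a missing entry can never expose content.
    public static string Apply(string mql, string visitorGroup)
    {
        string rule;
        if (!Rules.TryGetValue(visitorGroup, out rule))
        {
            throw new ArgumentException("No rule for group: " + visitorGroup);
        }
        return rule.Length == 0 ? mql : mql.TrimEnd() + " " + rule;
    }
}
```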

This is an ideal approach, but it doesn't work on all sites.
