Introduction
At LBi we build web sites on a wide range of platforms and technologies. For many .Net sites that have a content management element, we find EPiServer to be a great and cost-effective framework.
Web sites are becoming increasingly dependent on search technology. Where search was once limited to locating content on the site, it is now integral to aggregating site content and providing navigational constructs.
Facetted search is an important tool, allowing users to explore information by navigating multiple axes independently. While LBi use a number of high end enterprise facetted information retrieval engines, these are typically very expensive and fit poorly with the price point that makes EPiServer attractive.
In the open source arena, Apache Lucene and Apache Solr create an interesting opportunity to deliver enhanced search functionality at the low end of the market more economically than was previously possible. Many CMS products now integrate this technology as their search engine of choice. Although a Lucene integration for EPiServer already exists, it is the richer functionality of Solr, including its facetted functions, that is more interesting.
With this in mind LBi have explored and developed a reusable deployment of Solr tailored specifically to an EPiServer installation.
Goals
EPiSolr is the name given to the deployment package of Solr and associated .Net integration components that have been developed by LBi. Integrating commercial search technologies can be very expensive and often hard to develop, manage and deploy.
Amongst the key objectives for EPiSolr are:
- Simple deployment of Solr on a .Net or Unix platform
- Seamless integration into an EPiServer installation, even when retrofitted to an existing site
- Minimal configuration and development requirements
- Provide robust full text and facetted search functionality
Some of the features of Solr which make it attractive as an economic search platform include:
- Zoned full text search
- Hit highlighting
- Faceted search & Analysis
- Caching
- Replication
- Pluggable Architecture
- Real-time Updates
- Presented via XML/HTTP and JSON APIs
EPiSolr Platform
The platform comprises a number of components.
Figure 1: Logical architecture of an EPiSolr site
Apache Solr
The core Solr package exists as a Java runtime which can run on Unix or Windows. Sadly, the Windows deployment of Solr is very basic and lacks the necessary components to properly deploy it on the Windows platform. Thankfully, the necessary components do exist in other packages and they have been aggregated to derive a runtime which can be deployed appropriately in a production environment under Unix or Windows.
Deployment merely requires copying the latest directory structure and executing a service installation script. The core deployment has already been configured for standard EPiServer page data, security and meta-data constructs, spell checking, highlighting and auto suggest. When developing a new site, all that is required is to add the definitions of the specific EPiServer content type properties that need to be searched or have facets built against them.
SolrNet
Communication with Solr is normally via query string and XML over HTTP.
SolrNet is an open source .Net API layer for interacting with Solr which provides an object and interface model to programme against, rather than composing query string and XML requests and interpreting XML responses.
SolrNet is used by EPiSolr and is deployed as a DLL with the target site package.
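To illustrate the raw query string interface that SolrNet wraps, the following is a minimal sketch in Python rather than .Net. It only composes a request URL and does not contact a server. The base URL and the field names category and page_type are assumptions made for the example and are not taken from the EPiSolr schema; the parameter names (q, rows, wt, facet, facet.field, facet.mincount) are standard Solr request parameters.

```python
from urllib.parse import urlencode


def build_facet_query_url(base_url, text, facet_fields, rows=10):
    """Compose a Solr /select URL asking for results plus facet counts."""
    params = [
        ("q", text),              # the full text query
        ("rows", rows),           # number of documents to return
        ("wt", "xml"),            # response format: XML over HTTP
        ("facet", "true"),        # switch faceting on
        ("facet.mincount", 1),    # hide facet values with no matches
    ]
    # facet.field may be repeated, once per field to facet on.
    params += [("facet.field", name) for name in facet_fields]
    return base_url.rstrip("/") + "/select?" + urlencode(params)


# Hypothetical base URL and field names, for illustration only.
url = build_facet_query_url(
    "http://localhost:8983/solr",
    "annual report",
    ["category", "page_type"],
)
print(url)
```

An API layer such as SolrNet removes the need to assemble and escape these parameters by hand, and to parse the XML that comes back.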
SolrTools
Although extensions to Solr support the indexing of data files, their reliability and support for file types are limited.
LBi’s SolrTools component converts data files to text streams using Microsoft’s IFilter interface. Through this, EPiSolr can index file content and include it in the Solr index, either independently or attached to an EPiServer page.
Again, SolrTools is deployed as a DLL as part of the site.
EPiSolr
EPiSolr is the glue which joins EPiServer to Solr.
EPiSolr is a pluggable architecture which allows customisation and extension of content type and property indexing behaviour.
The default deployment has extensive customisation options. However, the default content type and property handlers intelligently index most cases and only require configuration to change or enhance their behaviour. When behaviour cannot be supported by the default handlers, extensions can be deployed for individual properties or for content types as a whole.
All primitive EPiServer constructs are indexed, including Categories and Access Control Lists (ACLs).
EPiSolr is responsible for hooking the EPiServer events and for populating and managing the Solr search index.
Index management is asynchronous, with indexing operations running independently of editorial or publishing activity.
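The general pattern of decoupling indexing from publishing can be sketched as a queue drained by a background worker. This is a generic Python illustration of the idea, not the EPiSolr implementation, which is a .Net component. The class IndexQueue, its method names and the page ids are all hypothetical.

```python
import queue
import threading


class IndexQueue:
    """Hypothetical sketch: publish events enqueue work, a worker indexes it."""

    def __init__(self, index_fn):
        self._index_fn = index_fn          # callable that indexes one page id
        self._pending = queue.Queue()      # FIFO queue of page ids to index
        self.errors = []                   # failures recorded by the worker
        worker = threading.Thread(target=self._run, daemon=True)
        worker.start()

    def on_page_published(self, page_id):
        """Called from the publish event; returns without waiting for indexing."""
        self._pending.put(page_id)

    def _run(self):
        while True:
            page_id = self._pending.get()
            try:
                self._index_fn(page_id)
            except Exception as exc:       # keep the worker alive on failure
                self.errors.append((page_id, exc))
            finally:
                self._pending.task_done()

    def wait_until_idle(self):
        """Block until every queued page has been processed."""
        self._pending.join()


indexed = []
index_queue = IndexQueue(indexed.append)   # stand-in for a real Solr update
for page_id in (101, 102, 103):
    index_queue.on_page_published(page_id)
index_queue.wait_until_idle()
print(indexed)
```

Because the editor's publish action only places an item on the queue, a slow or temporarily unavailable search server does not hold up editorial work.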
EPiSolr has its own configuration handler and section in the web.config.
Table 1: Example EPiSolr Configuration
Deployment is achieved by including the DLL in the site binaries and including the appropriate web.config sections.
EPiSolr hooks into the EPiServer event model by being installed as an HttpHandler, which provides a convenient mechanism to control and instantiate the entry point for service registration.
EPiSolrAdminPlugin
The EPiSolrAdminPlugin provides the administrative interface to EPiSolr. It provides tools to selectively re-index content and execute diagnostic queries.
The component and its interfaces are implemented as EPiServer plugins, all compiled into a single DLL. The DLL uses a VirtualPathProvider to deliver the admin page templates and a PagePlugIn bootstrap to initialise the VirtualPathProvider. As such, deployment merely requires the DLL to be included with the site and no extra configuration.
In Action
Executing searches and extracting facets is straightforward through SolrNet. Specific abstractions for individual facets can be created to make the implementation typed. The majority of the work in incorporating search and facetted functionality lies in creating the user experience.
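As an illustration of what a typed facet abstraction might look like, the sketch below converts the facet section of a Solr JSON response into typed values. It is written in Python rather than against SolrNet, and the FacetValue type, the parse_facet_field helper, the category field and the sample response are all invented for the example. The only thing it relies on is Solr's default JSON layout for field facets, a flat list alternating value and count.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class FacetValue:
    """One facet entry: a field value and the number of matching documents."""
    value: str
    count: int


def parse_facet_field(response, field):
    """Turn Solr's flat [value, count, value, count, ...] list into FacetValues."""
    flat = response["facet_counts"]["facet_fields"].get(field, [])
    return [
        FacetValue(str(value), int(count))
        for value, count in zip(flat[0::2], flat[1::2])
    ]


# Hand-written sample response, for illustration only (not real results).
sample = json.loads("""
{
  "response": {"numFound": 17, "start": 0, "docs": []},
  "facet_counts": {
    "facet_queries": {},
    "facet_fields": {"category": ["news", 12, "events", 5]}
  }
}
""")

categories = parse_facet_field(sample, "category")
for facet in categories:
    print(facet.value, facet.count)
```

With the facet data in a typed form, the remaining effort is in the presentation layer, which is where most of the work lies.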
Controls to commoditise user interface constructs for facetted navigation are being developed to further speed development and reduce costs.
Below is a screenshot of a site which uses EPiSolr almost entirely in order to manage, segment and navigate its complex subscription repository.
In addition to providing multiple axes of breakdown of the document collection, it is also able to determine the differences between the subscription result sets, indicating which documents the visitor is entitled to and identifying those that require an additional subscription.
Figure 2: EPiSolr in action
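One way such a comparison between result sets could be made is sketched below in Python. This is not how the pictured site does it; it assumes, purely for illustration, that each result document carries a hypothetical subscription field and that the visitor's held subscriptions are known. The document ids and subscription names are made up.

```python
def split_by_entitlement(docs, held_subscriptions):
    """Partition result ids into those the visitor can open and those locked."""
    entitled, locked = [], []
    for doc in docs:
        if doc["subscription"] in held_subscriptions:
            entitled.append(doc["id"])
        else:
            locked.append(doc["id"])
    return entitled, locked


# Made-up result documents, for illustration only.
docs = [
    {"id": "doc-1", "subscription": "basic"},
    {"id": "doc-2", "subscription": "premium"},
    {"id": "doc-3", "subscription": "basic"},
]

entitled, locked = split_by_entitlement(docs, {"basic"})
print(entitled, locked)
```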
Conclusions
Solr is no Endeca or FAST, and it is very important to differentiate it from high end facetted navigation and taxonomic analysis products. However, we have been very pleasantly surprised by what can be achieved using Solr and EPiServer. Viewed pragmatically, Solr and EPiSolr provide a valuable, cost-effective entry point to vastly improving the value of data and ease of use in a data-centric EPiServer deployment. Especially in these challenging times, from a cost-benefit perspective, in all but the most demanding scenarios it is hard to see how Solr could be ignored as an important option for search and facetted navigation.
Deployment of Solr and EPiSolr is even easier than that of EPiServer itself. With every saving there is a cost, and the only concern is that, because Solr is open source, you may have to diagnose and fix runtime issues yourself. However, there is a huge community investing in and evolving Solr, so hopefully this will be a minor and irrelevant point.
With EPiSolr and Solr, LBi look forward to being able to offer their clients a whole new range of richer content managed search solutions while keeping delivery and ongoing costs highly competitive.