Diagram Views

Apache's Lucene Search At A Glance

Brad McDavid
#Hosting Insights, #Open Source
Published on October 24, 2017

Search is a critical component of a website and can go well beyond a simple search box with results. Today, we look at Apache's free and open source Lucene search.

In today’s web landscape, search is a critical component that goes beyond a simple box where text is entered and results are displayed. Many sections of a website may use search without the visitor even knowing. Common examples include news archive listings, which may go back decades, and lists of content related to the page currently being viewed. Many different search engines can provide this functionality for a website, but in this article we are looking at the free and open source Lucene.

Lucene is a widely used search library built in Java, which allows it to run on many different platforms. It has even been ported to run natively on the .NET stack. At its most basic level, Lucene stores a collection of documents in what is called an index. Each document in the index contains a list of name-value pairs called fields. Field values may be stored in the index for retrieval or sorting, and may also be analyzed (broken into searchable terms), which is useful for free-form content searching. Documents can then be searched using a query language, a small domain-specific language (DSL), to find matching results.
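To make the index, document, and field vocabulary concrete, here is a minimal sketch in Python. It is a toy model and not Lucene's actual API: the ToyIndex class, its methods, and the sample documents are all hypothetical, and "analysis" is reduced to lowercasing and splitting on whitespace.

```python
from collections import defaultdict


class ToyIndex:
    """A toy stand-in for a Lucene index; not Lucene's actual API."""

    def __init__(self):
        self.stored = {}                  # doc id -> stored field values
        self.postings = defaultdict(set)  # (field, term) -> ids of matching docs

    def add(self, doc_id, fields, analyzed=()):
        # Keep every field value so it can be returned with a search hit.
        self.stored[doc_id] = dict(fields)
        for name, value in fields.items():
            if name in analyzed:
                # Simplified analysis: lowercase, then split on whitespace.
                terms = value.lower().split()
            else:
                # Non-analyzed fields are indexed as a single exact term.
                terms = [value]
            for term in terms:
                self.postings[(name, term)].add(doc_id)

    def search(self, field, term):
        # Return the ids of documents whose field contains the term.
        return sorted(self.postings.get((field, term), set()))


index = ToyIndex()
index.add(1, {"title": "Lucene At A Glance", "category": "Open Source"}, analyzed={"title"})
index.add(2, {"title": "Hosting at scale", "category": "Hosting Insights"}, analyzed={"title"})

print(index.search("title", "at"))              # [1, 2]
print(index.search("category", "Open Source"))  # [1]
print(index.search("category", "open"))         # []
```

The last two searches show why the analyzed setting matters: the category field was indexed as one exact value, so only the full string matches, while the analyzed title field can be found by any single word.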

A common pitfall with Lucene field values is that no underlying data type exists. All field values are simply treated as strings, which is often problematic for .NET DateTime fields. If the default string conversion is used for this type, it returns a value such as 10/23/2017 1:35:40 PM. Worse yet, this conversion depends on the culture information in use, so the same conversion done on a web server set to the United Kingdom (en-GB culture) returns 23/10/2017 13:35:40. Neither of these values lends itself to searching or sorting, because comparing them as strings does not match chronological order. A better approach is to use the format yyyyMMddHHmmss, which returns 20171023133540 and does not vary between cultures. Because the most significant parts come first, this format also sorts correctly by date and time when stored as a string value. Lucene does not perform these conversions itself; the responsibility falls on the software using Lucene to create the indexable documents.
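The sorting problem is easy to reproduce outside of .NET. The sketch below uses Python rather than C#, and the US_STYLE format string is only an approximation of the en-US default conversion (it zero-pads the hour, for example). The helper name to_index_value and the sample dates are made up for illustration.

```python
from datetime import datetime

# Assumed format strings: US_STYLE approximates a month-first, 12-hour
# culture format; SORTABLE is the Python equivalent of yyyyMMddHHmmss.
US_STYLE = "%m/%d/%Y %I:%M:%S %p"
SORTABLE = "%Y%m%d%H%M%S"


def to_index_value(moment: datetime) -> str:
    """Convert a datetime to a culture-independent, sortable string."""
    return moment.strftime(SORTABLE)


stamps = [
    datetime(2017, 10, 23, 13, 35, 40),
    datetime(2016, 12, 1, 9, 0, 0),
    datetime(2017, 2, 5, 18, 30, 0),
]

chronological = sorted(stamps)

# Sort by the string form, which is how a string-only index compares values.
us_sorted = sorted(stamps, key=lambda m: m.strftime(US_STYLE))
sortable_sorted = sorted(stamps, key=to_index_value)

print(to_index_value(stamps[0]))         # 20171023133540
print(us_sorted == chronological)        # False
print(sortable_sorted == chronological)  # True
```

With the month-first format, December 2016 sorts after October 2017 because "12" is compared before the year is ever reached. The yyyyMMddHHmmss form avoids this because string order and chronological order agree.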

Over the years, I’ve encountered some poor implementations of Lucene. These implementations are typically difficult to configure or troubleshoot, or they rely on default string conversions for typed values, which often results in missing or outdated documents that cannot be searched the way the developer would expect. They usually fall short on functionality as well, such as indexing uploaded media files, PDFs, and Microsoft Word documents. These implementations also often reside on the same server as the website, where resources must be shared, which may impact overall performance and speed.

In recent years, I’ve encountered much better search engines that are built on top of Lucene and offer functionality to compensate for the previously mentioned shortcomings. Elasticsearch and Apache Solr are the most prevalent of these in today’s websites. This article isn’t a comparison between them, since countless comparisons already exist. Both allow a website's indexes to live on a server separate from the web server. This frees up resources on the web server, much like hosting the persistent data store (SQL) on its own server.

At Diagram, we typically use Elasticsearch because of its ease of configuration, REST-based API, and available NuGet packages. Elasticsearch also excels when the content models, typically plain old CLR objects (POCOs), can be serialized to JSON documents. This serialization does a much better job of converting values for the underlying Lucene index fields. Add-ons also exist for Elasticsearch to extend its functionality, such as Mapper Attachments, which consumes raw binary data from files like PDFs and Microsoft Office documents and makes their content indexable as fields of a document.
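As a rough illustration of why JSON serialization helps, the sketch below turns a content model into a JSON document with typed, culture-independent values. It is written in Python and does not use any Elasticsearch client. The NewsArticle model, its fields, and the to_json_document helper are hypothetical, and emitting dates in ISO 8601 form is an assumption about what a sensible serializer would do.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime


@dataclass
class NewsArticle:
    """Hypothetical content model, standing in for a POCO."""
    title: str
    published: datetime
    view_count: int


def to_json_document(article: NewsArticle) -> str:
    """Serialize a content model to a JSON document ready for indexing."""

    def encode(value):
        # json cannot serialize datetime on its own, so emit ISO 8601,
        # which is the same regardless of the server's culture settings.
        if isinstance(value, datetime):
            return value.isoformat()
        raise TypeError(f"Cannot serialize {type(value).__name__}")

    return json.dumps(asdict(article), default=encode)


doc = to_json_document(
    NewsArticle("Lucene At A Glance", datetime(2017, 10, 24, 9, 0, 0), 42)
)
print(doc)
# {"title": "Lucene At A Glance", "published": "2017-10-24T09:00:00", "view_count": 42}
```

The number stays a number and the date takes one unambiguous form, so the search engine receiving the document can map each field to an appropriate type instead of guessing from a culture-specific string.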

Apache Lucene is a great search tool because it is free and open source. However, when Lucene isn't properly implemented, it can lead to slower website performance and slower search speeds. Choosing a search engine built on Lucene, such as Elasticsearch or Apache Solr, frees up resources and allows the indexes to be kept separate from your website server. Have you dabbled in Lucene? I'd enjoy hearing about your experiences with it. Leave a comment below!