Faceted search: choosing good facet suggestions

This post was originally made on the Assanka blog. Assanka was acquired by the Financial Times in January 2012, and became what is now FT Labs.

Faceted search is everywhere, making our online shopping experience easier, organising our photos, searching our DVD collection. By showing filters in categories, you can allow users to search the way they want to, rather than in a prescribed category hierarchy of your choice. We’ve recently used faceted search in a number of applications and found that one area is a particular challenge: choosing the suggested values under each category.

FT Tilt's faceting uses a blend of strategies
The site Assanka recently launched for FT Tilt uses a variety of faceting strategies to give the most appropriate suggestions in each category.

Most implementations of faceting always show the same suggestions in any given category, but may change the filters available based on the search the user has done so far. For example, if you’re searching a computer supplies retailer for a new hard disk, most of the products matching your search will have a ‘capacity’ property, so the retailer will probably offer you a capacity filter with all available capacities listed. It probably won’t offer you a ‘resolution’ filter, because resolution is not relevant to hard disks, and your existing search keywords have probably eliminated any product (like a monitor) for which a resolution choice would make sense.
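
As a rough sketch of that filter choice, assume each result is a dictionary of property names to values, and that a filter is worth showing when at least half of the results carry the property. The function name and the 50% threshold are our own illustration, not how any particular retailer does it.

```python
from collections import Counter

def filters_to_show(results, min_coverage=0.5):
    """Return the property names carried by at least min_coverage of the results."""
    if not results:
        return []
    # Count how many results carry each property name.
    counts = Counter(name for item in results for name in item)
    threshold = min_coverage * len(results)
    return sorted(name for name, n in counts.items() if n >= threshold)

# Hypothetical results for a hard disk search; one stray item has a resolution.
results = [
    {"manufacturer": "Acme", "capacity": "500GB"},
    {"manufacturer": "Bolt", "capacity": "1TB"},
    {"manufacturer": "Acme", "capacity": "300GB", "resolution": "1080p"},
]
shown = filters_to_show(results)
print(shown)
```

Here ‘resolution’ appears on only one of three results, so it falls below the threshold and no resolution filter is offered.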

So it’s pretty easy to choose which filters to display. The more difficult problem is choosing which facets to display within each filter.

Sometimes it’s impractical to show all possible facets in a filter. eBay, for example, allows you to restrict your search by seller, and of course there are millions of those. So there are a number of possible strategies:

  1. Display every option available.
  2. Display just a few hard-coded options, such as aggregate options designed to always match something (e.g. eBay’s ‘Top rated sellers’ option), or editorially chosen ‘top picks’. Either way, the point is that the suggestions are a subset of the full range available, and are not sensitive to the context created by the user’s search query.
  3. Use the list of options as per (1) or (2), but hide any that, if selected, would produce no results. This refinement makes the suggestions context-sensitive in a very rudimentary way, in that they are at least reacting to the current state of the user’s search, but still do not surface any long-tail options when they become more relevant.
  4. Generate options based on running the query the user has put together so far, including any facets selected, and analysing the resultset for the top facet refinements in each filter category.
  5. As per (4), but for each filter category, run the search excluding any options already selected in that category.
  6. As per (4) or (5), but where the values are all numeric, determine the facets not from the frequency of occurrence of specific values, but by analysing the distribution of values within the resultset and constructing boundaries that provide a sensible number of divisions with approximately the same number of results in each division.

There are arguments for and against all of these, depending on the distribution of the metadata within your document index.

An e-commerce site will typically have a set of properties available for its products, where each property on a product has one value, and any given product won’t use all the available property names. The values available for each property will also typically form a set that covers the whole available range efficiently. For example, a hard disk will have a ‘capacity’ property, where the options might include 200GB, 300GB, 500GB and 1TB. Only one of these can apply to any given hard disk. A hard disk, as previously discussed, will also not make use of other available properties, such as resolution, that might apply to other types of product, such as monitors. A property like ‘pages per minute’ is really only going to be used on a very small subset of your product catalogue (only printers). ‘Capacity’ might get a bit more limelight as it applies to USB sticks, RAM and storage appliances as well as hard disks (though consider that the option values required in these types of products might be in a different range), and some properties like ‘manufacturer’ would apply to virtually the entire product catalogue.

Sometimes, a product might have more than one applicable value in the same category. Take an example category “Special offers available”. A single product might qualify for “Buy one get one free” as well as “Free delivery”. This kind of thing actually happens more in the non-retail world, and a better example would be a film library, where a film may have more than one actor, more than one screenwriter, more than one content advisory. Where this is the case, filtering on one actor does not necessarily mean you’ve excluded all the others from the resultset.
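
To make the multi-valued case concrete, here is a small sketch with made-up film data, assuming each film stores its actors as a list. The helper names are ours. After filtering on one actor, the other actors on the matching films are still present in the resultset.

```python
from collections import Counter

# Hypothetical film library: each film can carry several actors.
films = [
    {"title": "Film A", "actors": ["Actor 1", "Actor 2"]},
    {"title": "Film B", "actors": ["Actor 1", "Actor 3"]},
    {"title": "Film C", "actors": ["Actor 2"]},
]

def filter_by(items, category, value):
    """Keep the items that carry the given value in a multi-valued category."""
    return [item for item in items if value in item[category]]

def value_counts(items, category):
    """Count how many items carry each value of the category."""
    return Counter(v for item in items for v in item[category])

selected = filter_by(films, "actors", "Actor 1")
remaining = value_counts(selected, "actors")
print(remaining)
```

Both ‘Actor 2’ and ‘Actor 3’ survive the filter, so they remain valid refinements. With a single-valued property like capacity, every other value would have dropped to zero.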

The distribution is also relevant. In the ‘capacity’ example, we could expand the number of values to include more granularity below 50GB to allow for solid state devices, and above 1TB to allow for storage appliances, but in any given search, results will tend to form an unequal distribution with a peak, or multiple peaks, in particular capacities. Across a range of hard disks, at time of writing 100-500GB would likely be the most popular range. On the other hand, take a category like ‘Actor’ in the case of a film library, and the distribution looks a lot flatter. An actor can only do so many films, and there are a lot of actors, so there isn’t a strong head to this distribution – it’s all tail.

Looking at each filter category in turn, the logic for deciding how to choose suggested values therefore comes down to a number of questions about how the category is used to classify your content:

  1. Are there few enough values that you could display them all together?
  2. Do the values form a continuous range (like capacity) or are they discrete options (like actor)?
  3. Is it possible for any single item of content to have more than one value from the same category?
  4. Is there a ‘head’ of a few values (few enough to display all of them together) which, combined, apply to a majority of your content?

It’s important to stress that these are not decisions to take for your site as a whole – they need to be applied to each filter category individually.
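
One way to answer those four questions from data rather than by guesswork is to profile each category in the index. The sketch below assumes every item maps a category name to a list of values. It treats ‘all values are numbers’ as a stand-in for ‘continuous range’, which is only an approximation, and `max_display` is an arbitrary limit of our own choosing.

```python
from collections import Counter

def profile_category(items, category, max_display=10):
    """Answer the four questions above for a single filter category."""
    values_per_item = [item.get(category, []) for item in items]
    counts = Counter(v for values in values_per_item for v in values)
    # The 'head' is the set of most frequent values we could display together.
    head = {v for v, _ in counts.most_common(max_display)}
    covered = sum(1 for values in values_per_item if head.intersection(values))
    return {
        "few_enough_to_list": len(counts) <= max_display,
        # Assumption: numeric values are treated as a continuous range.
        "numeric": bool(counts) and all(isinstance(v, (int, float)) for v in counts),
        "multi_valued": any(len(values) > 1 for values in values_per_item),
        "head_coverage": covered / len(items) if items else 0.0,
    }

# Hypothetical film library with a multi-valued 'actor' category.
films = [
    {"actor": ["A", "B"]},
    {"actor": ["A", "B"]},
    {"actor": ["A"]},
    {"actor": ["C"]},
    {"actor": ["D", "A"]},
]
profile = profile_category(films, "actor", max_display=2)
print(profile)
```

A `head_coverage` above 0.5 would mean the head applies to a majority of the content. The profile is computed per category, in line with the point that these decisions are not site-wide.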

There is one final consideration. Is your faceting feature intended to help users narrow their search, change it, or broaden it?

Going back to the possible strategies for determining facets, displaying every option available works for small categories, and using hard-coded option groups like ‘top 100 sellers’ is basically a way of displaying every option available by consolidating many options into one. Doing this where necessary (and then also hiding any options that would result in no matches) gives you about the best solution you are likely to get without going context-sensitive.

Amazon BBFC ratings

The DVD search on Amazon.co.uk displays all possible options in the BBFC rating category, and hides any that don’t apply to the results. In this case, the results found by the search only contain films rated PG, 15 and 12.
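
That behaviour (strategy 3) can be sketched in a few lines, assuming a fixed list of options in display order and a single-valued ‘rating’ property on each result. The ratings list and film data here are illustrative only.

```python
def visible_options(fixed_options, results, category):
    """Keep the fixed options, in their original order, that match at least one result."""
    present = {item.get(category) for item in results}
    return [option for option in fixed_options if option in present]

# Illustrative fixed list of ratings, in display order.
BBFC_RATINGS = ["U", "PG", "12", "15", "18"]

# Hypothetical results of the user's current search.
films = [
    {"title": "Film A", "rating": "PG"},
    {"title": "Film B", "rating": "15"},
    {"title": "Film C", "rating": "12"},
]
shown = visible_options(BBFC_RATINGS, films, "rating")
print(shown)
```

The options stay in their fixed order and no counts are involved. The only context sensitivity is that ‘U’ and ‘18’ disappear because selecting them would return nothing.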

It starts to get interesting when you have a lot of possible values, in a flattish distribution, and want to present specific, context sensitive options. The sense of ‘context’ depends on whether you want to help broaden or narrow the search.

Take a search for the location “UK” and animal “Dog”, which will give you results referring to dogs in the UK. One of the facet categories might be location, in which there is one option that is already a term in the search – UK. Determining the facets to suggest in the location category by only looking at the results returned in the existing search would only yield locations that co-exist on items tagged with UK and dogs, so your filter refinements would be places like “Manchester”, “Birmingham”, “Liverpool”, “London”, “Cardiff”, “Edinburgh”. This helps the searcher to refine their search.
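
Here is a minimal sketch of that narrowing approach (strategy 4), assuming each document maps a category to a list of tags and a query maps a category to the values already selected. We also assume that values already in the query are left out of the suggestions. The function names and sample documents are made up for illustration.

```python
from collections import Counter

def search(items, query):
    """Return the items that carry every selected value in every queried category."""
    return [
        item for item in items
        if all(set(values) <= set(item.get(cat, ())) for cat, values in query.items())
    ]

def narrowing_facets(items, query, category, limit=6):
    """Suggest the most frequent values of `category` within the current results."""
    results = search(items, query)
    selected = set(query.get(category, ()))
    counts = Counter(
        v for item in results for v in item.get(category, ()) if v not in selected
    )
    return [v for v, _ in counts.most_common(limit)]

docs = [
    {"location": ["UK", "London"], "animal": ["Dog"]},
    {"location": ["UK", "Manchester"], "animal": ["Dog"]},
    {"location": ["UK", "London"], "animal": ["Dog"]},
    {"location": ["France", "Paris"], "animal": ["Dog"]},
    {"location": ["UK", "Cardiff"], "animal": ["Cat"]},
]
query = {"location": ["UK"], "animal": ["Dog"]}
suggestions = narrowing_facets(docs, query, "location")
print(suggestions)
```

Paris never appears, because no document tagged with it is also tagged UK, and Cardiff is excluded because its only document is about cats.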

Lovefilm

The film search on Lovefilm presents facets that narrow your search. It also arranges the facets in a hierarchy, though this is an illusion.

However, using strategy 5, you run the search once on “UK Dogs” to produce the results to show the user, but you run it again on “Dogs”, excluding the term from the locations category, to generate location facets. This time you get locations globally that co-exist with Dogs, and the suggestions for facets in any given category do not change if you choose to add a facet from that category to your search. In this case the suggestions would be more likely to be “Paris”, “New York”, “France”, “Boston”, “London”, “United States”. This helps to broaden the search by showing the user good suggestions for other locations that also have lots of dogs that they may not have considered.
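
A sketch of the second query in strategy 5, under the same assumptions as before: documents map categories to lists of tags, and a query maps categories to selected values. The only change is that the category being faceted is dropped from the query before counting. Names and data are illustrative.

```python
from collections import Counter

def matches(item, query):
    """True if the item carries every selected value in every queried category."""
    return all(set(values) <= set(item.get(cat, ())) for cat, values in query.items())

def broadening_facets(items, query, category, limit=6):
    """Count values of `category` after removing that category from the query."""
    relaxed = {cat: values for cat, values in query.items() if cat != category}
    counts = Counter(
        v for item in items if matches(item, relaxed) for v in item.get(category, ())
    )
    return [v for v, _ in counts.most_common(limit)]

docs = [
    {"location": ["UK", "London"], "animal": ["Dog"]},
    {"location": ["France", "Paris"], "animal": ["Dog"]},
    {"location": ["France", "Paris"], "animal": ["Dog"]},
    {"location": ["United States", "New York"], "animal": ["Dog"]},
    {"location": ["France", "Lyon"], "animal": ["Cat"]},
]
with_uk = broadening_facets(docs, {"location": ["UK"], "animal": ["Dog"]}, "location")
without_uk = broadening_facets(docs, {"animal": ["Dog"]}, "location")
print(with_uk)
```

Because the location term is ignored when building location facets, `with_uk` and `without_uk` are identical, which is exactly the property described above: picking a location does not change the location suggestions.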

Finally, where the values are numeric, it may be appropriate to produce dynamic facets as range boundaries, based on an analysis of the values that exist within the resultset. You still follow one or other of the above strategies to get a resultset that either narrows or broadens the search, but then analyse the list of values and construct divisions that partition the data evenly into a small number of ranges. Doing this with a search for, say, the product type “Hard disks” and the capacity “100-150GB”, the suggested capacity facets would subdivide the selected range, with narrow boundaries covering the most popular capacities, and wider ranges where there are fewer results.
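
One simple way to build such boundaries is equal-frequency binning: walk the distinct values in order and close a range once it holds its fair share of the results that are left. This is a sketch of that general technique, not necessarily what any given search engine does, and the capacities below are invented.

```python
from collections import Counter

def equal_frequency_ranges(values, divisions=4):
    """Return (low, high, count) ranges holding roughly equal numbers of results.

    Identical values are never split across two ranges.
    """
    counts = sorted(Counter(values).items())  # (value, occurrences) in value order
    ranges = []
    remaining = len(values)                   # results not yet placed in a range
    low, in_range = None, 0
    for value, n in counts:
        if low is None:
            low = value
        in_range += n
        slots_left = divisions - len(ranges)
        # Close the range once it holds its share of what is left.
        if in_range * slots_left >= remaining:
            ranges.append((low, value, in_range))
            remaining -= in_range
            low, in_range = None, 0
    return ranges

# Hypothetical capacities (GB) of hard disks matching a 100-150GB search.
capacities = [100, 110, 118, 120, 120, 120, 122, 124, 125, 128, 140, 150]
ranges = equal_frequency_ranges(capacities, divisions=4)
print(ranges)
```

With this data each range holds three results, but the ranges around 120GB are only a few gigabytes wide while the last one stretches from 128GB to 150GB. A heavily repeated value can still make the ranges uneven, since it has to stay in one piece.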

Ebuyer

Ebuyer’s product search contains lots of faceting of numeric categories, some of which are presented in dynamic ranges, and some of which are treated as standard terms.

For numeric categories like this, it’s also worth considering the sort order of the facet list. For categories of non-numeric values, and sometimes even for discrete numeric values (say ‘film speed’ or ‘quantity per box’), it’s generally best to present them in decreasing order of popularity within the search context you’ve chosen. For numeric categories where you’re constructing range facets (e.g. ‘capacity’, ‘pixel density’, ‘brightness’), they should instead be presented in value order.
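
The two orderings can be sketched as follows, assuming each facet is a pair of a label (or a `(low, high)` tuple for a range) and a result count. The function name and sample counts are ours.

```python
def order_facets(facets, is_range):
    """Order range facets by value, and all other facets by decreasing popularity."""
    if is_range:
        # Labels are (low, high) bounds, so sorting by label gives value order.
        return sorted(facets, key=lambda facet: facet[0])
    return sorted(facets, key=lambda facet: facet[1], reverse=True)

# Discrete values: most popular first.
actors = [("Actor A", 3), ("Actor B", 9), ("Actor C", 5)]
by_popularity = order_facets(actors, is_range=False)

# Range facets in GB: lowest range first, whatever the counts are.
capacity = [((500, 1000), 40), ((100, 250), 12), ((250, 500), 25)]
by_value = order_facets(capacity, is_range=True)

print(by_popularity)
print(by_value)
```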

Conclusion

There are many ways of faceting search results. Take some time to choose the one that suits your application best, and provides the best search experience for your users. We use the open source search engine Xapian, which is excellent at doing all the faceting described in this post, and I’d like to publicly acknowledge Richard Boulton for his excellent advice when we were designing faceting strategy for FT Tilt.