
Search Engine Relevance

Relevance is such a broad topic that I will group the bullet points into three buckets: query understanding and query rewrite, query evaluation and ranking, and relevance feedback.

Query understanding and query rewrite:

  • we want to understand the intent of the query, so that we can find the most relevant results
  • the query should be tokenized with the same tokenization algorithm used by the index, and also normalized for plurals, -ing forms, synonyms, acronyms, etc. (see the rewrite sketch after this list)
  • annotate the query with location, time, etc.
  • add query demand signals, such as a category demand histogram and token weights
  • attach boolean constraints such as category and price range
  • identify facets and prepare for faceted search, e.g. Color=White
  • add in-session and personalized relevance parameters, e.g. interests, or whether the user is a budget buyer
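
A minimal sketch of such a rewrite step in Python, with a toy synonym table and a naive singularizer (both are made-up placeholders; a real system would reuse the index's own analyzers):

    # Toy query rewrite: tokenize, normalize plurals, expand synonyms.
    # SYNONYMS and the plural rule are hypothetical stand-ins.
    SYNONYMS = {"tv": ["television"], "laptop": ["notebook"]}

    def singularize(token):
        # Naive plural handling; real engines use the index's stemmer.
        return token[:-1] if token.endswith("s") and len(token) > 2 else token

    def rewrite_query(raw_query):
        tokens = [singularize(t) for t in raw_query.lower().split()]
        # Keep each original token and OR in its synonyms.
        return [[t] + SYNONYMS.get(t, []) for t in tokens]

    print(rewrite_query("Red TVs"))  # [['red'], ['tv', 'television']]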

Query evaluation and ranking:

  • evaluate/execute the query tree, which usually has AND/OR/NOT nodes over operators such as CONTAINS, GREATER_THAN, LESS_THAN, PHRASE, EXISTS, etc. (see the query-tree sketch after this list)
  • collect the results in a heap so they can be ranked
  • depending on how complex the search engine is, there can be multiple ranking rounds
  • the first round is a cheap, quick sort applied to all results; the sort keys are static relevance indicators, e.g. static quality score, click-through rate, product review score, seller credentials
  • the second round can be a simple machine-learned formula with a few parameters, applied to the top N results from the first round
  • the third round is extensive and can be somewhat expensive, e.g. a probabilistic decision forest, and is applied only to the top few pages of results (see the cascade sketch below)
  • the last round involves business logic, such as diversification, or category or format mixing (based on rules, statistics, or both), so we can handle problems like too many similar items, too many newly listed items, too many items from the same seller, or too many/too few results from the top categories
  • we can segment machine-learning data by geo location, user segment, query segment, category, etc., which means we need a lot of data
  • the above strategy works well for popular, frequent queries, but for tail queries the sample size is too small for machine learning, and an old-fashioned relevance model works better, such as the Vector Space Model, a probabilistic model, or a language model; a regular tf-idf formula plus weights for the boolean constraints (category, price range) can work, or the query term bi-grams of a long query can “vote” for the top categories (see the tf-idf sketch below)
  • pagination raises two problems that need special consideration: 1. how to reproduce the multi-round ranking state for later pages; 2. a large page number can blow up internal memory usage!
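
A minimal sketch of the query-tree evaluation and heap collection described above, assuming a toy tuple encoding of the tree and in-memory documents (a real engine would walk inverted-index postings instead):

    import heapq

    # Nodes: ("AND"/"OR", [children]), ("NOT", child), or leaf operators.
    def evaluate(node, doc):
        op = node[0]
        if op == "AND":
            return all(evaluate(c, doc) for c in node[1])
        if op == "OR":
            return any(evaluate(c, doc) for c in node[1])
        if op == "NOT":
            return not evaluate(node[1], doc)
        if op == "CONTAINS":
            _, field, term = node
            return term in doc.get(field, "")
        if op == "LESS_THAN":
            _, field, limit = node
            return doc.get(field, float("inf")) < limit
        raise ValueError("unknown operator: " + op)

    def top_k(docs, query, score, k):
        heap = []  # min-heap of (score, doc_id); keeps the best k seen
        for doc in docs:
            if evaluate(query, doc):
                heapq.heappush(heap, (score(doc), doc["id"]))
                if len(heap) > k:
                    heapq.heappop(heap)  # evict the current worst
        return sorted(heap, reverse=True)

    docs = [{"id": 1, "title": "white iphone case", "price": 9.99},
            {"id": 2, "title": "black iphone case", "price": 25.0}]
    query = ("AND", [("CONTAINS", "title", "iphone"),
                     ("LESS_THAN", "price", 20.0)])
    print(top_k(docs, query, lambda d: 1.0 / d["price"], 10))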
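
And a sketch of the multi-round cascade, where the three scorers are hypothetical stand-ins for the static sort, the small learned formula, and the expensive model:

    # Hypothetical cascade: a cheap static sort over all results, a small
    # learned formula over the top N, and an expensive model (e.g. a
    # decision forest) over the top few pages only.
    def static_score(doc):            # round 1: precomputed signals
        return 0.7 * doc["quality"] + 0.3 * doc["ctr"]

    def small_ml_score(doc):          # round 2: few-parameter formula
        return 0.5 * static_score(doc) + 0.5 * doc["review_score"]

    def expensive_model_score(doc):   # round 3: placeholder for the forest
        return small_ml_score(doc)

    def rank(results, n=1000, pages=3, page_size=50):
        r1 = sorted(results, key=static_score, reverse=True)
        r2 = sorted(r1[:n], key=small_ml_score, reverse=True)
        top = pages * page_size
        r3 = sorted(r2[:top], key=expensive_model_score, reverse=True)
        return r3 + r2[top:] + r1[n:]  # re-attach the untouched tails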
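
For the tail-query fallback, a bare-bones tf-idf scorer; the +1 smoothing is an arbitrary choice, and weights for the boolean constraints would be added on top:

    import math

    def tf_idf_score(query_terms, doc_terms, doc_freq, num_docs):
        # Log-scaled term frequency times smoothed inverse document frequency.
        score = 0.0
        for term in query_terms:
            tf = doc_terms.count(term)
            if tf == 0:
                continue
            idf = math.log((num_docs + 1) / (doc_freq.get(term, 0) + 1))
            score += (1 + math.log(tf)) * idf
        return score

    doc = "white iphone x case for iphone".split()
    print(tf_idf_score(["iphone", "case"], doc,
                       {"iphone": 900, "case": 400}, 10_000))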

Relevance feedback:

  • user behavior data (minus identifying information!), such as page views, click-through rates, and conversion rates, should be collected at runtime and data-mined later.
  • explicit user feedback (“is this page helpful?”) can provide a different viewpoint.
  • human judgment is still essential for evaluating the relevance of existing and new algorithms.
  • time decay (recency weighting) should be applied to many of these numbers (see the decay sketch after this list).
  • feedback data is so large that injecting all of it into the search index can drain a lot of resources, so update speeds can vary:
    • views and transactions should be real-time;
    • decayed counts can be updated once a day;
    • user trustworthiness and forward estimates of shipping and handling time can be updated every 24 hours, or on demand.
    • aggregated query demand and category demand for different regions/use cases can be updated every 24 hours, or on demand, e.g. for an “iPhone X” product launch.
    • offline-computed document clusters, related-query recommendations, etc. can be updated every 24 hours.
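
A minimal sketch of the time decay mentioned above, using an exponential half-life (the 7-day value is an arbitrary assumption; each signal would be tuned separately):

    HALF_LIFE_DAYS = 7.0  # hypothetical; tune per signal

    def decayed_count(raw_count, event_age_days):
        # Exponential decay: an event loses half its weight every
        # HALF_LIFE_DAYS days.
        return raw_count * 0.5 ** (event_age_days / HALF_LIFE_DAYS)

    # 100 clicks from two weeks ago count as 25 "fresh" clicks.
    print(decayed_count(100, 14))  # 25.0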

 

