Note: the Vietnamese version of this article is at the link below.

https://duongnt.com/query-boosting-elasticsearch-vie

Query boosting in Elasticsearch

Elasticsearch is a popular full-text search engine. It supports a rich set of queries that can be combined in a logical fashion. This allows us to write fine-grained queries to search for the exact documents we need.

However, not all subqueries inside a compound query are created equal, and sometimes we want to put more or less emphasis on some of them. In those situations, we can use query boosting to control how much each subquery contributes to the final relevance score of a document.

Set up a test environment

Run Elasticsearch locally

It is recommended to run Elasticsearch using Docker. Please follow this guide to set up an Elasticsearch cluster.

https://www.elastic.co/guide/en/elasticsearch/reference/current/run-elasticsearch-locally.html

This guide also walks you through installing Kibana so that we can have an easy way to interact with Elasticsearch. This article will assume that you use Kibana’s Dev Tools to send requests to Elasticsearch.

Creating some test data

We will create an index called footballer with ten documents. Each of them is a player with their name, age, position, and salary in thousands of euros.

PUT footballer/_bulk
{ "create": { } }
{ "name": "Ronaldo","position":"fw", "age": 38, "salary": 4430}
{ "create": { } }
{ "name": "Messi","position":"fw", "age": 36, "salary": 1440}
{ "create": { } }
{ "name": "Sancho","position":"lw", "age": 23, "salary": 373}
{ "create": { } }
{ "name": "Antony","position":"lw", "age": 23, "salary": 200}
{ "create": { } }
{ "name": "Salah","position":"rw", "age": 30, "salary": 350}
{ "create": { } }
{ "name": "Vinicius Junior","position":"lw", "age": 22, "salary": 354}
{ "create": { } }
{ "name": "Mahrez","position":"rw", "age": 32, "salary": 160}
{ "create": { } }
{ "name": "Rashford","position":"fw", "age": 25, "salary": 247}
{ "create": { } }
{ "name": "Bukayo Saka","position":"rw", "age": 21, "salary": 70}
{ "create": { } }
{ "name": "Gnabry","position":"rw", "age": 27, "salary": 365}

We can verify that all data has been inserted with a match_all query. This should return all ten documents.

GET /footballer/_search
{
  "query": {
    "match_all": {}
  }
}

Finding a suitable winger

The first query

Imagine that you are a manager and you want to find a winger. You don’t care if they are a left winger or a right winger, but you want someone 23 years old or younger because your club is rebuilding. This is a query you can use.

GET /footballer/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "bool": {
            "should": [
              { "term": { "position": "rw" }},
              { "term": { "position": "lw" }}
            ]
          }
        },
        {
          "range": {
            "age": { "lte": 23 }
          }
        }
      ]
    }
  }
}

And here is the result. Four players match our conditions: Sancho, Antony, Vinicius Junior, and Bukayo Saka.

"hits": [
  {
    "_score": 2.1451323,
    "_source": {
      "name": "Sancho",
      "position": "lw",
      "age": 23
    }
  },
  {
    "_score": 2.1451323,
    "_source": {
      "name": "Antony",
      "position": "lw",
      "age": 23
    }
  },
  {
    "_score": 2.1451323,
    "_source": {
      "name": "Vinicius Junior",
      "position": "lw",
      "age": 22
    }
  },
  {
    "_score": 1.8938179,
    "_source": {
      "name": "Bukayo Saka",
      "position": "rw",
      "age": 21
    }
  }
]

Notice that the score of Bukayo Saka is lower than the rest. This is because our index has four right wingers (rw) but only three left wingers (lw). Since the term rw is more common, Elasticsearch assigns it a lower inverse document frequency (IDF). As a result, documents matching lw get a higher score.
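The two term scores can be reproduced by hand. The sketch below assumes Elasticsearch's default BM25 similarity and a single-shard index, and it assumes that the term-frequency part of BM25 cancels out for these one-word position values, so that the term score reduces to the IDF. The function name bm25_idf is only illustrative.

```python
import math

def bm25_idf(total_docs: int, docs_with_term: int) -> float:
    """IDF as used by BM25: ln(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))

TOTAL_DOCS = 10  # documents in the footballer index

idf_rw = bm25_idf(TOTAL_DOCS, 4)  # four right wingers
idf_lw = bm25_idf(TOTAL_DOCS, 3)  # three left wingers

print(round(idf_rw, 4))  # 0.8938
print(round(idf_lw, 4))  # 1.1451
```

Adding the 1.0 that the age range clause appears to contribute gives roughly 1.8938 and 2.1451, which lines up with the _score values above.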

Fine-tuning our query

Now let’s say you actually prefer right wingers, but you want to keep an open door for left wingers as well. In this case, you can use query boosting to boost the independent score of the subquery { "term": { "position": "rw" }}. Let’s apply a boost value of 2 here.

GET /footballer/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "bool": {
            "should": [
              { "term": { "position": { "value": "rw", "boost": 2 }}},
              { "term": { "position": "lw" }}
            ]
          }
        },
        {
          "range": {
            "age": { "lte": 23 }
          }
        }
      ]
    }
  }
}

We still receive the same four players, but now Bukayo Saka jumps to the top of our list. Because he is a right winger, his score is boosted from 1.8938179 to 2.7876358.

"hits": [
  {
    "_score": 2.7876358,
    "_source": {
      "name": "Bukayo Saka",
      "position": "rw"
    }
  },
  {
    "_score": 2.1451323,
    "_source": {
      "name": "Sancho",
      "position": "lw"
    }
  },
  {
    "_score": 2.1451323,
    "_source": {
      "name": "Antony",
      "position": "lw"
    }
  },
  {
    "_score": 2.1451323,
    "_source": {
      "name": "Vinicius Junior",
      "position": "lw"
    }
  }
]

Boosting multiple subqueries at once

We are not limited to boosting just one subquery. Let’s change the conditions again. We still prefer right wingers. But now we want either someone 23 years old or younger, or someone with a salary of 200 or less, with a preference for cheap players. Below is the new query.

GET /footballer/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "bool": {
            "should": [
              { "term": { "position": { "value": "rw", "boost": 2 }}},
              { "term": { "position": "lw" }}
            ]
          }
        },
        {
          "bool": {
            "should": [
              { "range": { "age": { "lte": 23 }}},
              { "range": { "salary": {"lte": 200, "boost": 2 }}}
            ]
          }
        }
      ]
    }
  }
}

We see some interesting changes. In third place we have Mahrez. Although he is not young, he is still prioritized over players like Sancho or Vinicius Junior because he is a right winger and has a salary of no more than 200; both conditions are boosted.

"hits": [
  {
    "_score": 4.787636,
    "_source": {
      "name": "Bukayo Saka",
      "position": "rw",
      "age": 21,
      "salary": 70
    }
  },
  {
    "_score": 4.145132,
    "_source": {
      "name": "Antony",
      "position": "lw",
      "age": 23,
      "salary": 200
    }
  },
  {
    "_score": 3.7876358,
    "_source": {
      "name": "Mahrez",
      "position": "rw",
      "age": 32,
      "salary": 160
    }
  },
  {
    "_score": 2.1451323,
    "_source": {
      "name": "Sancho",
      "position": "lw",
      "age": 23,
      "salary": 373
    }
  },
  {
    "_score": 2.1451323,
    "_source": {
      "name": "Vinicius Junior",
      "position": "lw",
      "age": 22,
      "salary": 354
    }
  }
]
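The ranking above can be re-derived offline. The sketch below is not how Elasticsearch is implemented; it assumes the term scores seen in the earlier results (0.8938179 for rw, 1.1451323 for lw), a score of 1.0 for every matching range clause, and a plain sum across matching clauses. The score function is a hypothetical helper.

```python
# Term scores copied from the earlier results; range clauses are assumed to score 1.0.
RW_SCORE = 0.8938179
LW_SCORE = 1.1451323

# (name, position, age, salary) for the ten documents in the index.
PLAYERS = [
    ("Ronaldo", "fw", 38, 4430), ("Messi", "fw", 36, 1440),
    ("Sancho", "lw", 23, 373), ("Antony", "lw", 23, 200),
    ("Salah", "rw", 30, 350), ("Vinicius Junior", "lw", 22, 354),
    ("Mahrez", "rw", 32, 160), ("Rashford", "fw", 25, 247),
    ("Bukayo Saka", "rw", 21, 70), ("Gnabry", "rw", 27, 365),
]

def score(position, age, salary):
    """Sum the clause scores; return None when either must group has no match."""
    position_score = 0.0
    if position == "rw":
        position_score = RW_SCORE * 2  # boosted term clause
    elif position == "lw":
        position_score = LW_SCORE

    profile_score = 0.0
    if age <= 23:
        profile_score += 1.0
    if salary <= 200:
        profile_score += 1.0 * 2  # boosted range clause

    if position_score == 0.0 or profile_score == 0.0:
        return None
    return position_score + profile_score

results = []
for name, position, age, salary in PLAYERS:
    total = score(position, age, salary)
    if total is not None:
        results.append((name, round(total, 4)))
results.sort(key=lambda item: item[1], reverse=True)

for name, total in results:
    print(name, total)
```

Under these assumptions the script lists Bukayo Saka, Antony, Mahrez, Sancho, and Vinicius Junior, in the same order as the hits above.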

One subquery can end up dominating the rest

When using query boosting, we must be careful to prevent one subquery from dominating everything else. In the next version, we will search for players that satisfy any one of the following conditions.

  • Is a forward.
  • Is a right winger.
  • Is a left winger.
  • Is 23 years old or younger.
  • Has a salary not greater than 200.

And we will give a big boost to being a forward.

GET /footballer/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "bool": {
            "should": [
              { "term": { "position": { "value": "fw", "boost": 10 }}},
              { "term": { "position": { "value": "rw", "boost": 2 }}},
              { "term": { "position": "lw" }},
              { "range": { "age": { "lte": 23 }}},
              { "range": { "salary": {"lte": 200, "boost": 2 }}}
            ]
          }
        }
      ]
    }
  }
}

The query above will return all documents in our index, and here are the first three.

"hits": [
  {
    "_score": 11.451323,
    "_source": {
      "name": "Ronaldo",
      "position": "fw",
      "age": 38,
      "salary": 4430
    }
  },
  {
    "_score": 11.451323,
    "_source": {
      "name": "Messi",
      "position": "fw",
      "age": 36,
      "salary": 1440
    }
  },
  {
    "_score": 11.451323,
    "_source": {
      "name": "Rashford",
      "position": "fw",
      "age": 25,
      "salary": 247
    }
  },
  //... omitted
]

As expected, the first three results are the three forwards: Ronaldo, Messi, and Rashford. Even though they only match one condition (position), they are still ranked higher than players that match three conditions (position, age, salary) like Bukayo Saka or Antony.
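To get a feel for how much boost is too much, we can estimate the break-even point. The sketch below assumes the unboosted fw term score is 1.1451323 (the 11.451323 above divided by the boost of 10) and that Bukayo Saka still scores 4.787636 in this query, since he matches the same three clauses as in the previous section.

```python
FW_TERM_SCORE = 1.1451323     # assumed unboosted score of the fw term clause
BEST_WINGER_SCORE = 4.787636  # Bukayo Saka: boosted rw + age + boosted salary

# Score of a forward who matches nothing but the boosted position clause.
forward_score = FW_TERM_SCORE * 10

# Smallest fw boost at which such a forward ties with the best winger.
break_even_boost = BEST_WINGER_SCORE / FW_TERM_SCORE

print(round(forward_score, 3))     # 11.451
print(round(break_even_boost, 2))  # 4.18
```

In other words, under these assumptions any boost above roughly 4.2 on the fw clause is already enough to push every forward above a winger who matches three conditions.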

Some other notable points

A boost factor can be smaller than one

The only constraint on the boost factor is that it must be non-negative. If we boost a query with a factor smaller than one then documents matching that query would still be selected, but they would have a low ranking. For example, if we change the subquery in the previous section to { "term": { "position": { "value": "fw", "boost": 0.5 }}} then Ronaldo, Messi and Rashford will be ranked last.

A boost factor can even be zero. In that case, documents matching the boosted query are still selected, but that query won’t contribute anything to the final relevance score. For example, changing the subquery above to { "term": { "position": { "value": "fw", "boost": 0 }}} means Ronaldo, Messi and Rashford will be selected with a relevance score of 0.
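Both cases can be checked with a small offline model of the five-clause query from the previous section. As before, this is only a sketch: the term scores are taken from the earlier results, range clauses are assumed to score 1.0, and matching clauses are assumed to be summed. The rank function is a hypothetical helper.

```python
# Assumed unboosted term scores, taken from the earlier results.
TERM_SCORE = {"fw": 1.1451323, "rw": 0.8938179, "lw": 1.1451323}

# (name, position, age, salary) for the ten documents in the index.
PLAYERS = [
    ("Ronaldo", "fw", 38, 4430), ("Messi", "fw", 36, 1440),
    ("Sancho", "lw", 23, 373), ("Antony", "lw", 23, 200),
    ("Salah", "rw", 30, 350), ("Vinicius Junior", "lw", 22, 354),
    ("Mahrez", "rw", 32, 160), ("Rashford", "fw", 25, 247),
    ("Bukayo Saka", "rw", 21, 70), ("Gnabry", "rw", 27, 365),
]

def rank(fw_boost: float):
    """Score every player against the five should clauses and sort by score."""
    boosts = {"fw": fw_boost, "rw": 2.0, "lw": 1.0}
    scored = []
    for name, position, age, salary in PLAYERS:
        total = TERM_SCORE[position] * boosts[position]
        if age <= 23:
            total += 1.0
        if salary <= 200:
            total += 2.0  # range score 1.0 times boost 2
        scored.append((name, round(total, 4)))
    return sorted(scored, key=lambda item: item[1], reverse=True)

half = rank(0.5)
zero = rank(0)

print(half[-3:])  # the three forwards, each with a small positive score
print(zero[-3:])  # the three forwards again, each with a score of 0.0
```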

Why don’t we use the boosting query type?

Some might have noticed that we apply the boost directly to individual subqueries instead of using the boosting query type supported by Elasticsearch. Personally, I prefer applying the boost directly for the following reasons:

  • The boosting query requires both a positive and a negative query. I feel this is an unnecessary constraint, especially in simpler cases where we only need to increase the relevance score of a single subquery.
  • The boosting query only allows for decreasing the relevance score of documents matching the negative query. It does not support increasing the relevance score of the positive query.
  • Applying the boost directly aligns with the usual query construction approach in Elasticsearch, making it easier to integrate into existing code or queries.

How does Elasticsearch apply the boost?

A common misconception is that the final score of all documents matching a boosted query will be multiplied by the boost factor. But we already know this is not the case. In an earlier example, the score of Bukayo Saka was only increased from 1.8938179 to 2.7876358 even though we applied a boost factor of 2. The actual scoring process is below:

  • Each subquery in a compound query independently scores each document.
  • If a subquery is boosted, only its independent score is multiplied by the boost factor.
  • Then the scores of all matching subqueries are added together to determine the final relevance score for each document.
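Applied to Bukayo Saka, the steps above work out as follows. The two independent scores are inferred from the earlier results (0.8938179 for the rw term clause and 1.0 for the age range clause), and for the numbers in this article the combination step works out to a plain sum.

```python
rw_term_score = 0.8938179  # independent score of the rw term clause
age_range_score = 1.0      # assumed score of the age range clause

before = rw_term_score + age_range_score     # no boost
after = rw_term_score * 2 + age_range_score  # boost applied to the rw clause only
naive = before * 2                           # what the misconception would predict

print(round(before, 4))  # 1.8938
print(round(after, 4))   # 2.7876
print(round(naive, 4))   # 3.7876
```

Only the first term is doubled, which is why the final score grows by 0.8938179 instead of doubling.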

Conclusion

Query boosting is an interesting technique in Elasticsearch. It lets you control the impact of each query on the final score of matching documents as well as fine-tune the relevance of search results. However, when applying query boosting, we need to understand how it interacts with the overall scoring algorithm.

A software developer from Vietnam, currently living in Japan.
