Improving Elasticsearch-based Autocomplete

Introduction

Search as you type

Match phrase query

Disjunction max query

Conclusion

Introduction

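Recently I investigated the autocomplete feature of our system, because there were many complaints that it returned irrelevant results. The approach we used was quite naive: our backend wrapped the query in wildcards and executed it as a query_string over the fields __title, title and commonInfo.RealName. The index we searched contained an entity with __title equal to 3 foxes, yet autocomplete for the query 3 foxes suggested BRN-3 / QCK 3 / 19 foxes, AC d / 3 foxes and 3 BRW / 1 foxes. The exact match was nowhere to be found!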
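For reference, the naive approach described above might look roughly like the sketch below. The backend code is not shown in this article, so the function name `build_naive_query` and the exact way the text is wrapped in wildcards are assumptions; only the `query_string` query type and the field names come from the text.

```python
import json

# Fields named in the text above.
FIELDS = ["__title", "title", "commonInfo.RealName"]


def build_naive_query(text: str, size: int = 3) -> dict:
    """Build a query_string body with every token wrapped in wildcards.

    Wrapping each whitespace-separated token is an assumption made for
    illustration; the real backend may have wrapped the text differently.
    """
    wildcarded = " ".join(f"*{token}*" for token in text.split())
    return {
        "size": size,
        "query": {
            "query_string": {
                "query": wildcarded,
                "fields": FIELDS,
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_naive_query("3 foxes"), indent=2))
```

Because query_string combines terms with OR by default, a query like this accepts any document that contains a token with 3 or a token with foxes, which is consistent with the loose suggestions listed above.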

So I picked 3 foxes as my relevance baseline and turned my attention to the specific Elasticsearch queries that can help with autocomplete.

Search as you type

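As the name suggests, search as you type looks like a natural fit for autocomplete. First, I changed the mapping of my __title field to search_as_you_type and ran a bool_prefix query taken straight from the documentation.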

{
  "_source": [
    "__title"
  ],
  "from": 0,
  "size": 3,
  "query": {
    "multi_match": {
      "query": "3 foxe",
      "type": "bool_prefix",
      "fields": [
        "__title",
        "__title._2gram",
        "__title._3gram"
      ]
    }
  }
}

Now that's better!

{
    "took": 21,
    "timed_out": false,
    "_shards": {
        "total": 33,
        "successful": 33,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 272,
            "relation": "eq"
        },
        "max_score": 4.5528774,
        "hits": [
            {
                "_index": "data",
                "_type": "_doc",
                "_id": "cdf7f3aded8745d1827e9c92dea1e8b7",
                "_score": 4.5528774,
                "_source": {
                    "__title": "3 oxe/ 3 foxes"
                }
            },
            {
                "_index": "data",
                "_type": "_doc",
                "_id": "1a42873cead94f18a31d0b102b4fbdcd",
                "_score": 4.285463,
                "_source": {
                    "__title": "3 foxes"
                }
            },
            {
                "_index": "data",
                "_type": "_doc",
                "_id": "6f1588bbe1a440028af1de4337bf8fac",
                "_score": 3.9906564,
                "_source": {
                    "__title": "9 mfx/ 3 oxe/ 3 foxes"
                }
            }
        ]
    }
}

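But this is still not good enough, because the exact match only ranks second. Since the out-of-the-box solution did not help, I decided to read more about the search_as_you_type mapping and its n-gram fields. After some reading, I learned that these n-grams are essentially shingles: contiguous sequences of two or three words extracted from the text. Queried through bool_prefix, which matches the query terms in any order and scores in-order matches higher, they let an autocomplete query find words typed out of order. The downside is that the Elasticsearch cluster needs extra resources to store the n-grams, mostly in the form of a larger index, which may affect the cluster's health. The fancy search_as_you_type mapping simply means that the n-gram fields are created for you automatically.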
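To make the n-gram subfields more concrete, here is a minimal sketch. The mapping body uses the documented search_as_you_type field type; the `shingles` helper is a simplified stand-in for what Elasticsearch's analysis chain produces (it only lowercases and splits on whitespace), so treat it as an illustration rather than the actual implementation.

```python
import json

# Mapping that turns __title into a search_as_you_type field.
# Elasticsearch then maintains the __title._2gram, __title._3gram and
# __title._index_prefix subfields on its own (max_shingle_size defaults to 3).
MAPPING = {
    "mappings": {
        "properties": {
            "__title": {"type": "search_as_you_type"}
        }
    }
}


def shingles(text: str, size: int) -> list[str]:
    """Return contiguous word sequences of the given size.

    Simplified assumption: tokens are lowercased and split on whitespace,
    which is cruder than a real Elasticsearch analyzer.
    """
    words = text.lower().split()
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]


if __name__ == "__main__":
    print(json.dumps(MAPPING, indent=2))
    print(shingles("apt 3 foxes", 2))
    print(shingles("apt 3 foxes", 3))
```

Every extra shingle is an additional indexed term, which is where the storage overhead mentioned above comes from.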

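Since typing words out of order is not my use case, I decided not to mess with it and to improve relevance at query time rather than at index time.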

Match phrase query

To improve the relevance of exact matches, I switched to the match phrase prefix query.

{
  "_source": [
    "__title"
  ],
  "from": 0,
  "size": 3,
  "query": {
    "match_phrase_prefix": {
      "__title": {
        "query": "3 foxe"
      }
    }
  }
}

Now that's what I was looking for!

{
    "took": 10,
    "timed_out": false,
    "_shards": {
        "total": 33,
        "successful": 33,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 28,
            "relation": "eq"
        },
        "max_score": 12.053555,
        "hits": [
            {
                "_index": "data",
                "_type": "_doc",
                "_id": "1a42873cead94f18a31d0b102b4fbdcd",
                "_score": 12.053555,
                "_source": {
                    "__title": "3 foxes"
                }
            },
            //omitted for brevity
        ]
    }
}

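Can we stop here? Not so fast! As you may remember, our autocomplete uses 3 fields, but so far we have only checked one of them. So how do we combine multiple fields? Since match_phrase_prefix does not support multiple fields, my first guess was a plain old bool query.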

{
   "_source":[
      "__title",
      "title",
      "commonInfo.RealNameShort"
   ],
   "explain":false,
   "from":0,
   "size":3,
   "query":{
      "bool":{
         "should":[
            {
               "match_phrase_prefix":{
                  "__title":{
                     "query":"3 foxe"
                  }
               }
            },
            {
               "match_phrase_prefix":{
                  "title":{
                     "query":"3 foxe"
                  }
               }
            },
            {
               "match_phrase_prefix":{
                  "commonInfo.RealNameShort":{
                     "query":"3 foxe"
                  }
               }
            }
         ]
      }
   }
}

The result is:

{
    "took": 13,
    "timed_out": false,
    "_shards": {
        "total": 33,
        "successful": 33,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 28,
            "relation": "eq"
        },
        "max_score": 28.880083,
        "hits": [
            {
                "_index": "data",
                "_type": "_doc",
                "_id": "15e4e503cc1d4284aeb34664cb61c5ae",
                "_score": 28.880083,
                "_source": {
                    "__title": "apt 3 foxes",
                    "commonInfo": {
                        "RealNameShort": "apt 3 foxes"
                    },
                    "title": "apartmetnt 3 foxes"
                }
            },
            {
                "_index": "data",
                "_type": "_doc",
                "_id": "83b2653a851c4ca19d3df0410ab1c41f",
                "_score": 26.242756,
                "_source": {
                    "__title": "rest/ 3 foxes",
                    "commonInfo": {
                        "RealNameShort": "rest/ 3 foxes"
                    },
                    "title": "restaraunt/ 3 foxes"
                }
            },
            {
                "_index": "data",
                "_type": "_doc",
                "_id": "1a42873cead94f18a31d0b102b4fbdcd",
                "_score": 23.940828,
                "_source": {
                    "__title": "3 foxes",
                    "title": "3 completely irrelevant to real name words",
                    "commonInfo": {
                        "RealNameShort": "3 foxes"
                    }
                }
            }
        ]
    }
}

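Huh? What happened? Let's run the same query with "explain": true to understand. Since the output is large, I will focus only on the important parts. For the top document we notice: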

"value": 10.268458,
"description": "weight(__title:\"3 (foxe foxes)\" in 1156) [PerFieldSimilarity], result of:",
...
"value": 8.497357,
"description": "weight(title:\"3 foxes\" in 1156) [PerFieldSimilarity], result of:",
...
"value": 10.114267,
"description": "weight(commonInfo.RealNameShort:\"3 (foxes foxe)\" in 1156) 
                [PerFieldSimilarity], result of:",

And this is the document we expected to rank first:

"value": 12.053555,
"description": "weight(__title:\"3 (foxe foxes)\" in 1180) [PerFieldSimilarity], result of:",
...
"value": 11.887274,
"description": "weight(commonInfo.RealNameShort:\"3 (foxes foxe)\" in 1180) 
                [PerFieldSimilarity], result of:",
...
"value": 0.0,
"description": "match on required clause, product of:",

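So, as expected, the document containing 3 foxes in __title gets the highest score of any single field, and it gets it from __title. But since apt 3 foxes has a reasonably relevant match in every field of interest, its combined score outranks the desired document. If only we could somehow rank documents by their single most relevant match!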
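The ranking above can be checked by hand, because a bool query adds up the scores of its matching should clauses. The sketch below redoes that arithmetic with the per-field weights copied from the explain output; the helper name `bool_should_score` is made up for illustration.

```python
# Per-field weights copied from the explain output above.
APT_3_FOXES = {
    "__title": 10.268458,
    "title": 8.497357,
    "commonInfo.RealNameShort": 10.114267,
}
THREE_FOXES = {
    "__title": 12.053555,
    "commonInfo.RealNameShort": 11.887274,
    # "title" does not match the phrase, so it contributes nothing.
}


def bool_should_score(field_scores: dict[str, float]) -> float:
    """A bool query sums the scores of all matching should clauses."""
    return sum(field_scores.values())


if __name__ == "__main__":
    print(round(bool_should_score(APT_3_FOXES), 4))
    print(round(bool_should_score(THREE_FOXES), 4))
```

Within floating-point rounding, the two sums reproduce the reported scores of 28.880083 and 23.940828: three decent matches beat two strong ones.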

Disjunction max query

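In fact, there is a query for exactly this situation: the disjunction max (dis_max) query. Let's try it, following the example from the documentation: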

{
  "_source":[
      "__title",
      "title",
      "commonInfo.RealNameShort"
  ],
  "explain":false,
  "from":0,
  "size":3,
  "query": {
    "dis_max": {
      "queries": [
        {
          "match_phrase_prefix": {
            "__title": {
              "query": "3 foxe"
            }
          }
        },
        {
          "match_phrase_prefix": {
            "title": {
              "query": "3 foxe"
            }
          }
        },
        {
          "match_phrase_prefix": {
            "commonInfo.RealNameShort": {
              "query": "3 foxe"
            }
          }
        }
      ],
      "tie_breaker": 0.7
    }
  }
}

Still not good, but at least the scores are closer.

{
    "took": 13,
    "timed_out": false,
    "_shards": {
        "total": 33,
        "successful": 33,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 28,
            "relation": "eq"
        },
        "max_score": 23.296595,
        "hits": [
            {
                "_index": "data",
                "_type": "_doc",
                "_id": "15e4e503cc1d4284aeb34664cb61c5ae",
                "_score": 23.296595,
                "_source": {
                    "__title": "apt 3 foxes",
                    "commonInfo": {
                        "RealNameShort": "apt 3 foxes"
                    },
                    "title": "apartmetnt 3 foxes"
                }
            },
            {
                "_index": "data",
                "_type": "_doc",
                "_id": "83b2653a851c4ca19d3df0410ab1c41f",
                "_score": 21.053097,
                "_source": {
                    "__title": "rest/ 3 foxes",
                    "commonInfo": {
                        "RealNameShort": "rest/ 3 foxes"
                    },
                    "title": "restaraunt/ 3 foxes"
                }
            },
            {
                "_index": "data",
                "_type": "_doc",
                "_id": "1a42873cead94f18a31d0b102b4fbdcd",
                "_score": 20.374645,
                "_source": {
                    "__title": "3 foxes",
                    "title": "3 completely irrelevant to real name words",
                    "commonInfo": {
                        "RealNameShort": "3 foxes"
                    }
                }
            }
        ]
    }
}

What the tie_breaker parameter does is not obvious at first glance. Let's tweak it to find out. First, we set it to 1.

{
    "took": 13,
    "timed_out": false,
    "_shards": {
        "total": 33,
        "successful": 33,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 28,
            "relation": "eq"
        },
        "max_score": 28.880083,
        "hits": [
            {
                "_index": "data",
                "_type": "_doc",
                "_id": "15e4e503cc1d4284aeb34664cb61c5ae",
                "_score": 28.880083,
                "_source": {
                    "__title": "apt 3 foxes",
                    "commonInfo": {
                        "RealNameShort": "apt 3 foxes"
                    },
                    "title": "apartmetnt 3 foxes"
                }
            },
            {
                "_index": "data",
                "_type": "_doc",
                "_id": "83b2653a851c4ca19d3df0410ab1c41f",
                "_score": 26.242756,
                "_source": {
                    "__title": "rest/ 3 foxes",
                    "commonInfo": {
                        "RealNameShort": "rest/ 3 foxes"
                    },
                    "title": "restaraunt/ 3 foxes"
                }
            },
            {
                "_index": "data",
                "_type": "_doc",
                "_id": "1a42873cead94f18a31d0b102b4fbdcd",
                "_score": 23.940828,
                "_source": {
                    "__title": "3 foxes",
                    "title": "3 completely irrelevant to real name words",
                    "commonInfo": {
                        "RealNameShort": "3 foxes"
                    }
                }
            }
        ]
    }
}

So, as we can see, increasing it takes us in the wrong direction. Let's remove it completely.

{
    "took": 15,
    "timed_out": false,
    "_shards": {
        "total": 33,
        "successful": 33,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 28,
            "relation": "eq"
        },
        "max_score": 12.053555,
        "hits": [
            {
                "_index": "data",
                "_type": "_doc",
                "_id": "1a42873cead94f18a31d0b102b4fbdcd",
                "_score": 12.053555,
                "_source": {
                    "__title": "3 foxes",
                    "title": "3 completely irrelevant to real name words",
                    "commonInfo": {
                        "RealNameShort": "3 foxes"
                    }
                }
            },
            //omitted for brevity
        ]
    }
}

Success! This is exactly what we wanted!
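The three experiments above are consistent with how dis_max is documented to score: the best clause's score plus tie_breaker times the scores of the other matching clauses. The sketch below replays that formula on the per-field weights from the explain output; the helper name `dis_max_score` is made up for illustration.

```python
# Per-field weights copied from the explain output earlier in the article.
APT_3_FOXES = [10.268458, 8.497357, 10.114267]
THREE_FOXES = [12.053555, 11.887274]


def dis_max_score(clause_scores: list[float], tie_breaker: float = 0.0) -> float:
    """Best clause score plus tie_breaker times the remaining clause scores."""
    best = max(clause_scores)
    return best + tie_breaker * (sum(clause_scores) - best)


if __name__ == "__main__":
    for tie_breaker in (0.7, 1.0, 0.0):
        apt = dis_max_score(APT_3_FOXES, tie_breaker)
        exact = dis_max_score(THREE_FOXES, tie_breaker)
        winner = "3 foxes" if exact > apt else "apt 3 foxes"
        print(f"tie_breaker={tie_breaker}: apt={apt:.4f} exact={exact:.4f} -> {winner}")
```

With tie_breaker set to 1 the formula collapses into the plain sum of the bool query, and with the default of 0 only the best-matching field counts, which is why the exact match finally comes out on top.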

Conclusion

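When implementing autocomplete with Elasticsearch, do not jump straight to the brute-force query_string approach. Explore the rich Elasticsearch query language first. Leveraging the search_as_you_type mapping at index time may not be a silver bullet: its main purpose is to handle search queries with out-of-order words by creating the n-gram fields for you. So if your requirements are more relaxed, a bool_prefix or match_phrase_prefix query on its own may be enough.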

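When combining autocomplete over multiple fields, you can use the dis_max query type. Its score is the best clause's score plus tie_breaker times the scores of the other matching clauses, so increasing the tie_breaker parameter increases the influence of all fields on the resulting score, while the default of 0 ranks documents by their best-matching field alone.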
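As a recap, the final query can be generated for any list of fields. This is a minimal sketch: the function name `build_autocomplete_query` and its parameters are hypothetical, while the request body mirrors the dis_max request shown earlier with the tie_breaker left at 0.

```python
import json


def build_autocomplete_query(
    text: str,
    fields: list[str],
    tie_breaker: float = 0.0,
    size: int = 3,
) -> dict:
    """Build a dis_max request with one match_phrase_prefix clause per field."""
    return {
        "_source": fields,
        "from": 0,
        "size": size,
        "query": {
            "dis_max": {
                "queries": [
                    {"match_phrase_prefix": {field: {"query": text}}}
                    for field in fields
                ],
                "tie_breaker": tie_breaker,
            }
        },
    }


if __name__ == "__main__":
    body = build_autocomplete_query(
        "3 foxe", ["__title", "title", "commonInfo.RealNameShort"]
    )
    print(json.dumps(body, indent=2))
```

The printed body can be sent to the index's _search endpoint with whichever HTTP client or Elasticsearch client library you already use.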

Finally, whenever you are unsure why query results do not match your expectations, you can turn to the "explain": true query parameter.
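Explain output is verbose, so it can help to pull out just the per-field weights. The sketch below walks the nested value / description / details structure that Elasticsearch returns in each hit's _explanation. The `SAMPLE` tree is a hypothetical, heavily trimmed example built from the numbers in this article (its top-level description in particular is illustrative), and `field_weights` is a made-up helper name.

```python
def field_weights(explanation: dict) -> dict[str, float]:
    """Collect {field: score} for every 'weight(field:...' node in the tree."""
    found: dict[str, float] = {}
    description = explanation.get("description", "")
    if description.startswith("weight("):
        # The field name sits between "weight(" and the first colon.
        field = description[len("weight("):].split(":", 1)[0]
        found[field] = explanation["value"]
    for child in explanation.get("details", []):
        found.update(field_weights(child))
    return found


# Hypothetical, trimmed explanation tree for the "3 foxes" document.
SAMPLE = {
    "value": 12.053555,
    "description": "max of:",
    "details": [
        {
            "value": 12.053555,
            "description": 'weight(__title:"3 (foxe foxes)" in 1180) '
                           "[PerFieldSimilarity], result of:",
            "details": [],
        },
        {
            "value": 11.887274,
            "description": 'weight(commonInfo.RealNameShort:"3 (foxes foxe)" in 1180) '
                           "[PerFieldSimilarity], result of:",
            "details": [],
        },
    ],
}

if __name__ == "__main__":
    for field, score in field_weights(SAMPLE).items():
        print(f"{field}: {score}")
```

Comparing these per-field numbers between two documents is usually enough to see which clause is responsible for an unexpected ranking.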

https://www.codeproject.com/Articles/5323717/Improving-Elasticsearch-based-Autocomplete
