Contents
Introduction
Search-as-you-type
Match phrase prefix query
Disjunction max query
Conclusion
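Introduction
Recently I investigated the autocomplete feature of our system, because there were many complaints that it returned irrelevant results. Our approach was quite naive: the backend wrapped the query in wildcards and executed it as a query_string against the fields __title, title and commonInfo.RealName. The index we searched contained an entity with __title equal to 3 foxes, yet the autocomplete query 3 foxes suggested BRN-3 / QCK 3 / 19 foxes, AC d / 3 foxes, 3 BRW / 1 foxes. The exact match was nowhere to be found!
So I picked 3 foxes as my relevance baseline and turned to the specific Elasticsearch queries that are meant to help with autocomplete.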
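To make the naive starting point concrete, here is a minimal Python sketch of the request body such a backend might build. The field names come from the text above; the helper name and the exact wrapping ("*" on both sides of the whole input) are assumptions, since the original backend code is not shown.

```python
import json

# Field names are taken from the article. The helper name and the exact
# wildcard wrapping ("*<input>*") are assumptions made for illustration.
AUTOCOMPLETE_FIELDS = ["__title", "title", "commonInfo.RealName"]


def naive_autocomplete_query(user_input: str, size: int = 3) -> dict:
    """Build a query_string request body that wraps the input in wildcards."""
    return {
        "size": size,
        "query": {
            "query_string": {
                "query": f"*{user_input}*",
                "fields": AUTOCOMPLETE_FIELDS,
            }
        },
    }


body = naive_autocomplete_query("3 foxes")
print(json.dumps(body, indent=2))
```

With this wrapping, the input most likely ends up as two independent wildcard terms, *3 and foxes*, combined with the default OR operator, which would explain why titles such as BRN-3 / QCK 3 / 19 foxes are suggested.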
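Search-as-you-type
As the name suggests, search-as-you-type looks like a natural fit for autocomplete. First, I changed the mapping of my __title field to search_as_you_type and ran the bool_prefix query straight from the documentation.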
{
"_source": [
"__title"
],
"from": 0,
"size": 3,
"query": {
"multi_match": {
"query": "3 foxe",
"type": "bool_prefix",
"fields": [
"__title",
"__title._2gram",
"__title._3gram"
]
}
}
}
Now that is better!
{
"took": 21,
"timed_out": false,
"_shards": {
"total": 33,
"successful": 33,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 272,
"relation": "eq"
},
"max_score": 4.5528774,
"hits": [
{
"_index": "data",
"_type": "_doc",
"_id": "cdf7f3aded8745d1827e9c92dea1e8b7",
"_score": 4.5528774,
"_source": {
"__title": "3 oxe/ 3 foxes"
}
},
{
"_index": "data",
"_type": "_doc",
"_id": "1a42873cead94f18a31d0b102b4fbdcd",
"_score": 4.285463,
"_source": {
"__title": "3 foxes"
}
},
{
"_index": "data",
"_type": "_doc",
"_id": "6f1588bbe1a440028af1de4337bf8fac",
"_score": 3.9906564,
"_source": {
"__title": "9 mfx/ 3 oxe/ 3 foxes"
}
}
]
}
}
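However, this is still not good enough, because the exact match only comes second. Since the out-of-the-box solution did not help, I decided to read more about the search_as_you_type mapping and n-gram fields. After some reading, I learned that the n-grams here are basically sequences of adjacent words (shingles) extracted from the text. Combined with the bool_prefix query, which matches terms in any order, they make it possible to handle words typed out of order in my autocomplete query. The downside is that the Elasticsearch cluster spends additional resources storing the n-grams, which can affect cluster health. The fancy search_as_you_type mapping simply means that the n-gram fields are created automatically.
Since typing words out of order is not my use case, I decided not to mess with it and to improve relevance at query time rather than at index time.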
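To illustrate what those automatically created n-gram fields hold, here is a small Python sketch. The mapping dictionary uses the documented search_as_you_type field type; the shingles helper is a hypothetical simplification that splits on whitespace, whereas Elasticsearch runs the field's analyzer (the standard analyzer by default), so real tokens can differ.

```python
# Mapping body that turns __title into a search_as_you_type field.
# Elasticsearch then maintains shingle subfields for it, which it names
# _2gram and _3gram, alongside the root field.
TITLE_MAPPING = {
    "mappings": {
        "properties": {
            "__title": {"type": "search_as_you_type"}
        }
    }
}


def shingles(text: str, size: int) -> list[str]:
    """Return word n-grams (shingles) of the given size.

    Hypothetical simplification: lowercases and splits on whitespace
    instead of running a real Elasticsearch analyzer.
    """
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


print(shingles("apt 3 foxes", 2))
print(shingles("apt 3 foxes", 3))
```

Each extra shingle is one more term to index per document, which is where the additional cost of the mapping comes from.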
Match phrase prefix query
To improve the relevance of the exact match, I switched to the match phrase prefix query.
{
"_source": [
"__title"
],
"from": 0,
"size": 3,
"query": {
"match_phrase_prefix": {
"__title": {
"query": "3 foxe"
}
}
}
}
Now this is what I was looking for!
{
"took": 10,
"timed_out": false,
"_shards": {
"total": 33,
"successful": 33,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 28,
"relation": "eq"
},
"max_score": 12.053555,
"hits": [
{
"_index": "data",
"_type": "_doc",
"_id": "1a42873cead94f18a31d0b102b4fbdcd",
"_score": 12.053555,
"_source": {
"__title": "3 foxes"
}
},
//omitted for brevity
]
}
}
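Can we wrap up here? Not so fast! As you may remember, our autocomplete uses three fields, but we have only checked one of them. So how do we combine multiple fields? Since match_phrase_prefix does not support multiple fields, the first guess is a plain old bool query.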
{
"_source":[
"__title",
"title",
"commonInfo.RealNameShort"
],
"explain":false,
"from":0,
"size":3,
"query":{
"bool":{
"should":[
{
"match_phrase_prefix":{
"__title":{
"query":"3 foxe"
}
}
},
{
"match_phrase_prefix":{
"title":{
"query":"3 foxe"
}
}
},
{
"match_phrase_prefix":{
"commonInfo.RealNameShort":{
"query":"3 foxe"
}
}
}
]
}
}
}
The result is:
{
"took": 13,
"timed_out": false,
"_shards": {
"total": 33,
"successful": 33,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 28,
"relation": "eq"
},
"max_score": 28.880083,
"hits": [
{
"_index": "data",
"_type": "_doc",
"_id": "15e4e503cc1d4284aeb34664cb61c5ae",
"_score": 28.880083,
"_source": {
"__title": "apt 3 foxes",
"commonInfo": {
"RealNameShort": "apt 3 foxes"
},
"title": "apartmetnt 3 foxes"
}
},
{
"_index": "data",
"_type": "_doc",
"_id": "83b2653a851c4ca19d3df0410ab1c41f",
"_score": 26.242756,
"_source": {
"__title": "rest/ 3 foxes",
"commonInfo": {
"RealNameShort": "rest/ 3 foxes"
},
"title": "restaraunt/ 3 foxes"
}
},
{
"_index": "data",
"_type": "_doc",
"_id": "1a42873cead94f18a31d0b102b4fbdcd",
"_score": 23.940828,
"_source": {
"__title": "3 foxes",
"title": "3 completely irrelevant to real name words",
"commonInfo": {
"RealNameShort": "3 foxes"
}
}
}
]
}
}
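Huh? What happened? Let's run the same query with "explain": true to understand. Since the output is large, I will focus only on the important parts. In the top document we notice: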
"value": 10.268458,
"description": "weight(__title:\"3 (foxe foxes)\" in 1156) [PerFieldSimilarity], result of:",
...
"value": 8.497357,
"description": "weight(title:\"3 foxes\" in 1156) [PerFieldSimilarity], result of:",
...
"value": 10.114267,
"description": "weight(commonInfo.RealNameShort:\"3 (foxes foxe)\" in 1156) [PerFieldSimilarity], result of:",
And here is the document we expected to be at the top:
"value": 12.053555,
"description": "weight(__title:\"3 (foxe foxes)\" in 1180) [PerFieldSimilarity], result of:",
...
"value": 11.887274,
"description": "weight(commonInfo.RealNameShort:\"3 (foxes foxe)\" in 1180) [PerFieldSimilarity], result of:",
...
"value": 0.0,
"description": "match on required clause, product of:",
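So, as we expected, for the document that has 3 foxes in __title, the highest-scoring field is __title. But because apt 3 foxes contains a somewhat relevant match in every field of interest, it outranks the desired document. If only we could somehow rank documents by their single most relevant match!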
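The ranking can be checked by hand: a bool query adds up the scores of all matching should clauses. The short Python sketch below sums the per-field weights quoted in the explain output and compares them with the reported _score values. The variable names are mine, and the 0.0 entry stands for the title field, which contributes nothing for the 3 foxes document.

```python
import math

# Per-field weights copied from the explain output quoted in the article,
# in the order __title, title, commonInfo.RealNameShort.
per_field_scores = {
    "apt 3 foxes": [10.268458, 8.497357, 10.114267],
    "3 foxes": [12.053555, 0.0, 11.887274],
}

# _score values reported for the bool/should query.
reported = {"apt 3 foxes": 28.880083, "3 foxes": 23.940828}

totals = {title: sum(scores) for title, scores in per_field_scores.items()}

for title, total in totals.items():
    # Small differences come from rounding in Elasticsearch's float arithmetic.
    matches = math.isclose(total, reported[title], abs_tol=1e-4)
    print(f"{title}: sum={total:.6f} reported={reported[title]} matches={matches}")
```

Three moderate matches add up to more than two strong ones, so the document with the best single-field match loses.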
Disjunction max query
In fact, we can try the Disjunction max query for exactly this case. Let's try the example from the documentation:
{
"_source":[
"__title",
"title",
"commonInfo.RealNameShort"
],
"explain":false,
"from":0,
"size":3,
"query": {
"dis_max": {
"queries": [
{
"match_phrase_prefix":{
"__title":{
"query":"3 foxe"
}
}
},
{
"match_phrase_prefix":{
"title":{
"query":"3 foxe"
}
}
},
{
"match_phrase_prefix":{
"commonInfo.RealNameShort":{
"query":"3 foxe"
}
}
}
],
"tie_breaker": 0.7
}
}
}
Still not good, but at least the scores are closer.
{
"took": 13,
"timed_out": false,
"_shards": {
"total": 33,
"successful": 33,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 28,
"relation": "eq"
},
"max_score": 23.296595,
"hits": [
{
"_index": "data",
"_type": "_doc",
"_id": "15e4e503cc1d4284aeb34664cb61c5ae",
"_score": 23.296595,
"_source": {
"__title": "apt 3 foxes",
"commonInfo": {
"RealNameShort": "apt 3 foxes"
},
"title": "apartmetnt 3 foxes"
}
},
{
"_index": "data",
"_type": "_doc",
"_id": "83b2653a851c4ca19d3df0410ab1c41f",
"_score": 21.053097,
"_source": {
"__title": "rest/ 3 foxes",
"commonInfo": {
"RealNameShort": "rest/ 3 foxes"
},
"title": "restaraunt/ 3 foxes"
}
},
{
"_index": "data",
"_type": "_doc",
"_id": "1a42873cead94f18a31d0b102b4fbdcd",
"_score": 20.374645,
"_source": {
"__title": "3 foxes",
"title": "3 completely irrelevant to real name words",
"commonInfo": {
"RealNameShort": "3 foxes"
}
}
}
]
}
}
The effect of the tie_breaker parameter is not obvious. Let's tweak it to find out. First, we set it to 1.
{
"took": 13,
"timed_out": false,
"_shards": {
"total": 33,
"successful": 33,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 28,
"relation": "eq"
},
"max_score": 28.880083,
"hits": [
{
"_index": "data",
"_type": "_doc",
"_id": "15e4e503cc1d4284aeb34664cb61c5ae",
"_score": 28.880083,
"_source": {
"__title": "apt 3 foxes",
"commonInfo": {
"RealNameShort": "apt 3 foxes"
},
"title": "apartmetnt 3 foxes"
}
},
{
"_index": "data",
"_type": "_doc",
"_id": "83b2653a851c4ca19d3df0410ab1c41f",
"_score": 26.242756,
"_source": {
"__title": "rest/ 3 foxes",
"commonInfo": {
"RealNameShort": "rest/ 3 foxes"
},
"title": "restaraunt/ 3 foxes"
}
},
{
"_index": "data",
"_type": "_doc",
"_id": "1a42873cead94f18a31d0b102b4fbdcd",
"_score": 23.940828,
"_source": {
"__title": "3 foxes",
"title": "3 completely irrelevant to real name words",
"commonInfo": {
"RealNameShort": "3 foxes"
}
}
}
]
}
}
So, as we can see, increasing it takes us in the wrong direction. Let's remove it completely.
{
"took": 15,
"timed_out": false,
"_shards": {
"total": 33,
"successful": 33,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 28,
"relation": "eq"
},
"max_score": 12.053555,
"hits": [
{
"_index": "data",
"_type": "_doc",
"_id": "1a42873cead94f18a31d0b102b4fbdcd",
"_score": 12.053555,
"_source": {
"__title": "3 foxes",
"title": "3 completely irrelevant to real name words",
"commonInfo": {
"RealNameShort": "3 foxes"
}
}
},
//omitted for brevity
]
}
}
Success! This is exactly what we wanted!
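The three experiments fit the documented dis_max scoring rule: take the best matching clause's score and add tie_breaker times the scores of the other matching clauses. Here is a minimal Python sketch that reproduces the numbers above from the per-field weights in the explain output; the function name is hypothetical.

```python
import math


def dis_max_score(clause_scores, tie_breaker=0.0):
    """Best clause score plus tie_breaker times the other matching clauses."""
    matching = [s for s in clause_scores if s > 0]
    if not matching:
        return 0.0
    best = max(matching)
    return best + tie_breaker * (sum(matching) - best)


# Per-field weights quoted in the explain output
# (__title, title, commonInfo.RealNameShort).
apt_3_foxes = [10.268458, 8.497357, 10.114267]
exact_3_foxes = [12.053555, 0.0, 11.887274]

for tie_breaker in (0.7, 1.0, 0.0):
    print(
        tie_breaker,
        round(dis_max_score(apt_3_foxes, tie_breaker), 6),
        round(dis_max_score(exact_3_foxes, tie_breaker), 6),
    )
```

With tie_breaker set to 1 the result equals the bool sum, and with the default of 0 only the best field counts, which is why the exact match finally wins with 12.053555.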
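Conclusion
When implementing autocomplete with Elasticsearch, do not jump straight to the raw query_string approach. Explore the rich Elasticsearch query language first. Leveraging the search_as_you_type mapping at index time may not be a silver bullet: its main purpose is to handle search queries with out-of-order words by creating the n-gram fields for you. So if more relaxed results are acceptable for you, using only the bool_prefix query type or the match_phrase_prefix query type may be enough.
When combining autocomplete across multiple fields, you can use the dis_max query type. In that case, increasing the tie_breaker parameter increases how much all the fields, and not only the best-matching one, influence the resulting score.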
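As a sketch of how a backend could assemble this query for any set of fields, the Python helper below builds the same dis_max body used earlier. The function name and its parameters are assumptions; tie_breaker is simply left out unless a value is passed, so Elasticsearch applies its default of 0.0.

```python
import json


def dis_max_autocomplete(user_input, fields, tie_breaker=None, size=3):
    """Build a dis_max body with one match_phrase_prefix clause per field."""
    queries = [
        {"match_phrase_prefix": {field: {"query": user_input}}}
        for field in fields
    ]
    dis_max = {"queries": queries}
    if tie_breaker is not None:
        dis_max["tie_breaker"] = tie_breaker
    return {
        "_source": list(fields),
        "from": 0,
        "size": size,
        "query": {"dis_max": dis_max},
    }


request = dis_max_autocomplete(
    "3 foxe", ["__title", "title", "commonInfo.RealNameShort"]
)
print(json.dumps(request, indent=2))
```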
Finally, whenever you wonder why query results do not match your expectations, you can turn to the "explain": true query parameter.
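Explain output is verbose, so it can help to reduce it to the per-field weights. The Python sketch below walks an explanation tree, which consists of nested value, description and details entries, and collects every weight(...) node. The helper name is hypothetical, and the sample tree is a hand-trimmed reconstruction built only from the values quoted in this article, not a full Elasticsearch response.

```python
def field_weights(explanation):
    """Collect (field, value) pairs from weight(...) nodes of an explanation."""
    found = []
    description = explanation.get("description", "")
    if description.startswith("weight("):
        # 'weight(__title:"3 (foxe foxes)" in 1156) ...' -> '__title'
        field = description[len("weight("):].split(":", 1)[0]
        found.append((field, explanation["value"]))
    for child in explanation.get("details", []):
        found.extend(field_weights(child))
    return found


# Hand-trimmed sample assembled from the values quoted in the article.
sample = {
    "value": 28.880083,
    "description": "sum of:",
    "details": [
        {"value": 10.268458,
         "description": 'weight(__title:"3 (foxe foxes)" in 1156) [PerFieldSimilarity], result of:',
         "details": []},
        {"value": 8.497357,
         "description": 'weight(title:"3 foxes" in 1156) [PerFieldSimilarity], result of:',
         "details": []},
        {"value": 10.114267,
         "description": 'weight(commonInfo.RealNameShort:"3 (foxes foxe)" in 1156) [PerFieldSimilarity], result of:',
         "details": []},
    ],
}

weights = field_weights(sample)
for field, value in weights:
    print(f"{field}: {value}")
```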
https://www.codeproject.com/Articles/5323717/Improving-Elasticsearch-based-Autocomplete