
24 Deep Dive into Search Technology: Implementing index-time Search Suggestions with ngram Tokenization

Dongguo丶 · Published 2021-11-21 20:29:20

1. How ngram and index-time search suggestions work

edge_ngram and ngram are two tokenizers that ship with Elasticsearch, and they typically come into play when defining index mappings: once the gram lengths are configured, the tokenizer can be assigned directly to an analyzer's tokenizer setting.

What is ngram

Preparing the data at index time means choosing the right analysis chain, and the tool used here for partial matching is the n-gram. An n-gram can be thought of as a sliding window over a word, where n is the length of the window. If we n-gram the word quick, the result depends on the chosen length n:

ngram starts at every character of the input and emits tokens at each configured gram length, which makes it suitable for both prefix and infix search.

For example, quick yields ngrams at five different lengths:

  • ngram length=1: q u i c k
  • ngram length=2: qu ui ic ck
  • ngram length=3: qui uic ick
  • ngram length=4: quic uick
  • ngram length=5: quick

POST _analyze
{
  "tokenizer": "ngram",
  "text": "quick"
}

Response

{
  "tokens": [
    {
      "token": "q",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "qu",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 1
    },
    {
      "token": "u",
      "start_offset": 1,
      "end_offset": 2,
      "type": "word",
      "position": 2
    },
    {
      "token": "ui",
      "start_offset": 1,
      "end_offset": 3,
      "type": "word",
      "position": 3
    },
    {
      "token": "i",
      "start_offset": 2,
      "end_offset": 3,
      "type": "word",
      "position": 4
    },
    {
      "token": "ic",
      "start_offset": 2,
      "end_offset": 4,
      "type": "word",
      "position": 5
    },
    {
      "token": "c",
      "start_offset": 3,
      "end_offset": 4,
      "type": "word",
      "position": 6
    },
    {
      "token": "ck",
      "start_offset": 3,
      "end_offset": 5,
      "type": "word",
      "position": 7
    },
    {
      "token": "k",
      "start_offset": 4,
      "end_offset": 5,
      "type": "word",
      "position": 8
    }
  ]
}

By default the ngram tokenizer emits N-grams with a minimum length of 1 and a maximum length of 2, so the output of the query above is [ q, qu, u, ui, i, ic, c, ck, k ].

Some of the ngram tokenizer's settings can also be customized:

  • min_gram: the minimum length of the character sequences produced; defaults to 1
  • max_gram: the maximum length of the character sequences produced; defaults to 2
  • token_chars: the character classes that should be kept in a token; the input is split on any character not in these classes. Defaults to [], i.e. keep all characters
    • letter - e.g. a, b, 字
    • digit - e.g. 3, 7
    • whitespace - e.g. " ", "\n"
    • punctuation - e.g. !, "
    • symbol - e.g. $, √

The values of min_gram and max_gram should be chosen for the use case. A common ngram scenario is autocomplete: if suggestions fire on a single character, too many of them may match to be useful. Performance is the other consideration, since producing large numbers of ngrams takes more disk space and more time to search (see the sketch below).
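As a minimal sketch of these settings (the index, tokenizer, and analyzer names here are invented for illustration), the following defines a custom ngram tokenizer that keeps only letters and digits and emits 2- to 3-character grams:

PUT /ngram_demo
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 3,
          "token_chars": [ "letter", "digit" ]
        }
      },
      "analyzer": {
        "my_ngram_analyzer": {
          "type": "custom",
          "tokenizer": "my_ngram_tokenizer"
        }
      }
    }
  }
}

GET /ngram_demo/_analyze
{
  "analyzer": "my_ngram_analyzer",
  "text": "ab-12"
}

Because token_chars excludes punctuation, the - splits ab-12 into ab and 12 before ngramming, so the output is just [ab, 12].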

What is edge ngram

With quick, the grams are anchored at the first character and extended from there:

q qu qui quic quick

Edge ngram further splits each word, and the resulting grams are what implement the prefix search-suggestion feature.

edge_ngram starts from the first character only and emits progressively longer prefixes up to the configured length, which suits prefix-match scenarios such as looking up order numbers, phone numbers, or postal codes.

POST _analyze
{
  "tokenizer": "edge_ngram",
  "text": "quick"
}

Response

{
  "tokens": [
    {
      "token": "q",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "qu",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 1
    }
  ]
}

By default the edge_ngram tokenizer emits N-grams with a minimum length of 1 and a maximum length of 2: [q, qu]

Edge ngram accepts the same configuration options as ngram: min_gram, max_gram, and token_chars, with the same defaults described above.

Two documents, hello world and hello we, are tokenized as:

h he hel hell hello

w wo wor worl world we

If you search for hello w, both doc1 and doc2 contain the grams hello and w at matching positions, so doc1 and doc2 are returned.

At search time there is no longer any need to take a prefix and scan the entire inverted index; the prefix is simply looked up in the inverted index as a term, and if it is there, the document matches, exactly like an ordinary match full-text query.
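For contrast, a minimal sketch of the query-time alternative this avoids: a prefix query has to walk the term dictionary at search time rather than performing a single term lookup (shown against a hypothetical index demo_index with a plain text field title indexed by the standard analyzer):

GET /demo_index/_search
{
  "query": {
    "prefix": {
      "title": "hello"
    }
  }
}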

2. Trying out ngram

Rebuild the index with "max_gram": 3

DELETE /my_index
PUT /my_index
{
    "settings": {
        "analysis": {
            "filter": {
                "autocomplete_filter": { 
                    "type":     "edge_ngram",
                    "min_gram": 1,
                    "max_gram": 3
                }
            },
            "analyzer": {
                "autocomplete": {
                    "type":      "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "autocomplete_filter" 
                    ]
                }
            }
        }
    }
}

Test the autocomplete analyzer

GET /my_index/_analyze
{
  "analyzer": "autocomplete",
  "text": "hello world"
}

Response

{
  "tokens": [
    {
      "token": "h",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "he",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "hel",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "w",
      "start_offset": 6,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "wo",
      "start_offset": 6,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "wor",
      "start_offset": 6,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}

With "max_gram": 3, hello only produces grams up to hel, so a full-word query for hello would find nothing. Rebuild my_index with "max_gram": 20

DELETE /my_index

Define the custom autocomplete analyzer built on an edge_ngram filter

PUT /my_index
{
    "settings": {
        "analysis": {
            "filter": {
                "autocomplete_filter": { 
                    "type":     "edge_ngram",
                    "min_gram": 1,
                    "max_gram": 20
                }
            },
            "analyzer": {
                "autocomplete": {
                    "type":      "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "autocomplete_filter" 
                    ]
                }
            }
        }
    }
}

Test the autocomplete analyzer

GET /my_index/_analyze
{
  "analyzer": "autocomplete",
  "text": "hello world"
}

Response

{
  "tokens": [
    {
      "token": "h",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "he",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "hel",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "hell",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "hello",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "w",
      "start_offset": 6,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "wo",
      "start_offset": 6,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "wor",
      "start_offset": 6,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "worl",
      "start_offset": 6,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "world",
      "start_offset": 6,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}

Create the mapping. The title field is analyzed with autocomplete at index time, but search_analyzer is set to standard so that query strings are not themselves ngrammed at search time (otherwise a short query would expand into grams and match far too broadly):

PUT /my_index/_mapping/my_type
{
  "properties": {
      "title": {
          "type":     "text",
          "analyzer": "autocomplete",
          "search_analyzer": "standard"
      }
  }
}
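To double-check which analyzer is actually applied to the mapped field, _analyze can also be pointed at the field itself rather than at an analyzer name:

GET /my_index/_analyze
{
  "field": "title",
  "text": "hello world"
}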

Insert test data

PUT /my_index/my_type/1
{
  "title" : "hello we"
}

PUT /my_index/my_type/2
{
  "title" : "hello win"
}

PUT /my_index/my_type/3
{
  "title" : "hello world"
}

PUT /my_index/my_type/4
{
  "title" : "hello dog"
}

Search for hello w with a match query:

GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "title": "hello w"
    }
  }
}

Response

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 0.9055367,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "2",
        "_score": 0.9055367,
        "_source": {
          "title": "hello win"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "3",
        "_score": 0.3758317,
        "_source": {
          "title": "hello world"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 0.3594392,
        "_source": {
          "title": "hello we"
        }
      }
    ]
  }
}
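By default a match query uses or semantics across the analyzed terms. If every term should be required, for example so that only titles containing both hello and a word starting with w are returned, the operator can be set to and:

GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "title": {
        "query": "hello w",
        "operator": "and"
      }
    }
  }
}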