
24 Deep Dive into Search Technology: Implementing index-time Search Suggestions with ngram Tokenization

Dongguo丶 · Published 2021-11-21 20:29:20

1. How ngram and index-time search suggestions work

edge_ngram and ngram are two tokenizers that ship with Elasticsearch, and they typically come into play when defining index mappings: once the gram lengths are configured, the tokenizer can be assigned directly to an analyzer's tokenizer setting.

What is ngram

Preparing the data at index time means choosing the right analysis chain, and the tool used here for partial matching is the n-gram. An n-gram can be thought of as a sliding window over a word, where n is the length of the window. If we n-gram the word quick, the result depends on the chosen length n:

ngram starts at every character of the input and emits tokens at each configured gram length, which makes it suitable for both prefix and infix search.

For example, quick yields ngrams at five different lengths:

  • ngram length=1: q u i c k
  • ngram length=2: qu ui ic ck
  • ngram length=3: qui uic ick
  • ngram length=4: quic uick
  • ngram length=5: quick

POST _analyze
{
  "tokenizer": "ngram",
  "text": "quick"
}

Response

{
  "tokens": [
    {
      "token": "q",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "qu",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 1
    },
    {
      "token": "u",
      "start_offset": 1,
      "end_offset": 2,
      "type": "word",
      "position": 2
    },
    {
      "token": "ui",
      "start_offset": 1,
      "end_offset": 3,
      "type": "word",
      "position": 3
    },
    {
      "token": "i",
      "start_offset": 2,
      "end_offset": 3,
      "type": "word",
      "position": 4
    },
    {
      "token": "ic",
      "start_offset": 2,
      "end_offset": 4,
      "type": "word",
      "position": 5
    },
    {
      "token": "c",
      "start_offset": 3,
      "end_offset": 4,
      "type": "word",
      "position": 6
    },
    {
      "token": "ck",
      "start_offset": 3,
      "end_offset": 5,
      "type": "word",
      "position": 7
    },
    {
      "token": "k",
      "start_offset": 4,
      "end_offset": 5,
      "type": "word",
      "position": 8
    }
  ]
}

By default the ngram tokenizer emits N-grams with a minimum length of 1 and a maximum length of 2, so the output of the query above is [ q, qu, u, ui, i, ic, c, ck, k ].

Some of the ngram tokenizer's settings can also be customized:

  • min_gram: the minimum length of the character sequences produced; defaults to 1
  • max_gram: the maximum length of the character sequences produced; defaults to 2
  • token_chars: the character classes that should be kept in a token; the input is split on any character not in these classes. Defaults to [], i.e. keep all characters
    • letter - e.g. a, b, 字
    • digit - e.g. 3, 7
    • whitespace - e.g. " ", "\n"
    • punctuation - e.g. !, "
    • symbol - e.g. $, √

The values of min_gram and max_gram should be chosen for the use case. A common ngram scenario is autocomplete: if suggestions fire on a single character, too many of them may match to be useful. Performance is the other consideration, since producing large numbers of ngrams takes more disk space and more time to search (see the sketch below).
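As a minimal sketch of these settings (the index, tokenizer, and analyzer names here are invented for illustration), the following defines a custom ngram tokenizer that keeps only letters and digits and emits 2- to 3-character grams:

PUT /ngram_demo
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 3,
          "token_chars": [ "letter", "digit" ]
        }
      },
      "analyzer": {
        "my_ngram_analyzer": {
          "type": "custom",
          "tokenizer": "my_ngram_tokenizer"
        }
      }
    }
  }
}

GET /ngram_demo/_analyze
{
  "analyzer": "my_ngram_analyzer",
  "text": "ab-12"
}

Because token_chars excludes punctuation, the - splits ab-12 into ab and 12 before ngramming, so the output is just [ab, 12].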

What is edge ngram

With quick, the grams are anchored at the first character and extended from there:

q qu qui quic quick

Edge ngram further splits each word, and the resulting grams are what implement the prefix search-suggestion feature.

edge_ngram starts from the first character only and emits progressively longer prefixes up to the configured length, which suits prefix-match scenarios such as looking up order numbers, phone numbers, or postal codes.

POST _analyze
{
  "tokenizer": "edge_ngram",
  "text": "quick"
}

Response

{
  "tokens": [
    {
      "token": "q",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "qu",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 1
    }
  ]
}

By default the edge_ngram tokenizer emits N-grams with a minimum length of 1 and a maximum length of 2: [q, qu]

Edge ngram accepts the same configuration options as ngram: min_gram, max_gram, and token_chars, with the same defaults described above.

Two documents, hello world and hello we, are tokenized as:

h he hel hell hello

w wo wor worl world we

If you search for hello w, both doc1 and doc2 contain the grams hello and w at matching positions, so doc1 and doc2 are returned.

At search time there is no longer any need to take a prefix and scan the entire inverted index; the prefix is simply looked up in the inverted index as a term, and if it is there, the document matches, exactly like an ordinary match full-text query.
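For contrast, a minimal sketch of the query-time alternative this avoids: a prefix query has to walk the term dictionary at search time rather than performing a single term lookup (shown against a hypothetical index demo_index with a plain text field title indexed by the standard analyzer):

GET /demo_index/_search
{
  "query": {
    "prefix": {
      "title": "hello"
    }
  }
}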

2. Trying out ngram

Rebuild the index with "max_gram": 3

DELETE /my_index
PUT /my_index
{
    "settings": {
        "analysis": {
            "filter": {
                "autocomplete_filter": { 
                    "type":     "edge_ngram",
                    "min_gram": 1,
                    "max_gram": 3
                }
            },
            "analyzer": {
                "autocomplete": {
                    "type":      "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "autocomplete_filter" 
                    ]
                }
            }
        }
    }
}

Test the autocomplete analyzer

GET /my_index/_analyze
{
  "analyzer": "autocomplete",
  "text": "hello world"
}

Response

{
  "tokens": [
    {
      "token": "h",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "he",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "hel",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "w",
      "start_offset": 6,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "wo",
      "start_offset": 6,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "wor",
      "start_offset": 6,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}

With "max_gram": 3, hello only produces grams up to hel, so a full-word query for hello would find nothing. Rebuild my_index with "max_gram": 20

DELETE /my_index

Define the custom autocomplete analyzer built on an edge_ngram filter

PUT /my_index
{
    "settings": {
        "analysis": {
            "filter": {
                "autocomplete_filter": { 
                    "type":     "edge_ngram",
                    "min_gram": 1,
                    "max_gram": 20
                }
            },
            "analyzer": {
                "autocomplete": {
                    "type":      "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "autocomplete_filter" 
                    ]
                }
            }
        }
    }
}

Test the autocomplete analyzer

GET /my_index/_analyze
{
  "analyzer": "autocomplete",
  "text": "hello world"
}

Response

{
  "tokens": [
    {
      "token": "h",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "he",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "hel",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "hell",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "hello",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "w",
      "start_offset": 6,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "wo",
      "start_offset": 6,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "wor",
      "start_offset": 6,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "worl",
      "start_offset": 6,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "world",
      "start_offset": 6,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}

Create the mapping. The title field is analyzed with autocomplete at index time, but search_analyzer is set to standard so that query strings are not themselves ngrammed at search time (otherwise a short query would expand into grams and match far too broadly):

PUT /my_index/_mapping/my_type
{
  "properties": {
      "title": {
          "type":     "text",
          "analyzer": "autocomplete",
          "search_analyzer": "standard"
      }
  }
}
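To double-check which analyzer is actually applied to the mapped field, _analyze can also be pointed at the field itself rather than at an analyzer name:

GET /my_index/_analyze
{
  "field": "title",
  "text": "hello world"
}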

Insert test data

PUT /my_index/my_type/1
{
  "title" : "hello we"
}

PUT /my_index/my_type/2
{
  "title" : "hello win"
}

PUT /my_index/my_type/3
{
  "title" : "hello world"
}

PUT /my_index/my_type/4
{
  "title" : "hello dog"
}

Search for hello w with a match query:

GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "title": "hello w"
    }
  }
}

Response

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 0.9055367,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "2",
        "_score": 0.9055367,
        "_source": {
          "title": "hello win"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "3",
        "_score": 0.3758317,
        "_source": {
          "title": "hello world"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 0.3594392,
        "_source": {
          "title": "hello we"
        }
      }
    ]
  }
}
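By default a match query uses or semantics across the analyzed terms. If every term should be required, for example so that only titles containing both hello and a word starting with w are returned, the operator can be set to and:

GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "title": {
        "query": "hello w",
        "operator": "and"
      }
    }
  }
}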