edge_ngram and ngram are two tokenizers that ship with Elasticsearch and are commonly used when defining index mappings: once the gram lengths are configured, the tokenizer can be assigned directly to an analyzer's tokenizer setting.
What is an ngram?
Preparing data at index time means choosing the right analysis chain, and the tool used here for partial matching is the n-gram. An n-gram can be thought of as a sliding window over a word, where n is the length of the window. If we n-gram the word quick, the result depends on the length chosen for n:
The ngram tokenizer emits grams starting from every character position, according to the configured lengths, which makes it suitable for prefix and infix search.
For example, quick has ngrams at 5 different lengths:
ngram length=1: q u i c k
ngram length=2: qu ui ic ck
ngram length=3: qui uic ick
ngram length=4: quic uick
ngram length=5: quick
POST _analyze
{
  "tokenizer": "ngram",
  "text": "quick"
}
Response:
{
  "tokens": [
    {
      "token": "q",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "qu",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 1
    },
    {
      "token": "u",
      "start_offset": 1,
      "end_offset": 2,
      "type": "word",
      "position": 2
    },
    {
      "token": "ui",
      "start_offset": 1,
      "end_offset": 3,
      "type": "word",
      "position": 3
    },
    {
      "token": "i",
      "start_offset": 2,
      "end_offset": 3,
      "type": "word",
      "position": 4
    },
    {
      "token": "ic",
      "start_offset": 2,
      "end_offset": 4,
      "type": "word",
      "position": 5
    },
    {
      "token": "c",
      "start_offset": 3,
      "end_offset": 4,
      "type": "word",
      "position": 6
    },
    {
      "token": "ck",
      "start_offset": 3,
      "end_offset": 5,
      "type": "word",
      "position": 7
    },
    {
      "token": "k",
      "start_offset": 4,
      "end_offset": 5,
      "type": "word",
      "position": 8
    }
  ]
}
By default the ngram tokenizer produces N-grams with a minimum length of 1 and a maximum length of 2, so the output of the request above is [ q, qu, u, ui, i, ic, c, ck, k ].
The ngram tokenizer also accepts a few custom settings:
- min_gram: the minimum length of the character sequences produced; defaults to 1
- max_gram: the maximum length of the character sequences produced; defaults to 2
- token_chars: the character classes that may be included in a token; characters outside these classes are treated as split points. Defaults to [], i.e. keep all characters
  - letter - e.g. a, b, 字
  - digit - e.g. 3, 7
  - whitespace - e.g. " ", "\n"
  - punctuation - e.g. !, "
  - symbol - e.g. $, √
The values of min_gram and max_gram should be chosen according to the use case. A common use case for ngram is autocompletion: if completions are offered for a single character, too many suggestions may match and the results are not very meaningful. The other thing to consider is performance: producing large numbers of ngrams takes more disk space and makes searches slower.
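As a rough sketch of such a configuration (the index name demo_ngram and the names infix_tokenizer / infix_analyzer below are made up for illustration), a custom ngram tokenizer limited to letters and digits could be defined like this:
PUT /demo_ngram
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "infix_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 3,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      },
      "analyzer": {
        "infix_analyzer": {
          "type": "custom",
          "tokenizer": "infix_tokenizer",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  }
}
Analyzing a value such as HL-1234 with infix_analyzer would then produce grams like hl, 12, 123, 23, 234, 34 (the hyphen is a split point because it is neither a letter nor a digit), which supports infix as well as prefix matching.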
With edge ngrams, the grams are anchored at the first letter of the word. For quick this gives:
q qu qui quic quick
Splitting every word further into these edge ngrams, and indexing the resulting grams, is how prefix-search suggestions are implemented.
edge_ngram emits grams starting only from the first character, according to the configured lengths, which makes it suitable for prefix-matching scenarios such as searching order numbers, phone numbers or postal codes.
POST _analyze
{
  "tokenizer": "edge_ngram",
  "text": "quick"
}
Response:
{
  "tokens": [
    {
      "token": "q",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "qu",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 1
    }
  ]
}
By default the edge_ngram tokenizer likewise produces N-grams with a minimum length of 1 and a maximum length of 2, hence the output [ q, qu ].
The edge_ngram tokenizer supports the same configuration options as ngram: min_gram (default 1), max_gram (default 2) and token_chars (default [], i.e. keep all characters, with the same letter / digit / whitespace / punctuation / symbol classes listed above).
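As an example of the prefix-matching use case mentioned above (a sketch; the index name phone_index and the names phone_edge_tokenizer / phone_prefix_analyzer are made up for illustration), an edge_ngram tokenizer limited to digits could back prefix search on phone numbers:
PUT /phone_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "phone_edge_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 3,
          "max_gram": 11,
          "token_chars": [
            "digit"
          ]
        }
      },
      "analyzer": {
        "phone_prefix_analyzer": {
          "type": "custom",
          "tokenizer": "phone_edge_tokenizer"
        }
      }
    }
  }
}
A number such as 13812345678 would be indexed as the terms 138, 1381, ..., 13812345678, so a user typing at least the first three digits can be matched with a plain term lookup (pair this with a non-ngram search_analyzer, as in the mapping example later in this article).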
Suppose two documents, hello world (doc1) and hello we (doc2), are indexed with such an edge_ngram analyzer. Their words are tokenized as:
hello -> h he hel hell hello
world -> w wo wor worl world
we -> w we
If we search for hello w, both doc1 and doc2 contain the terms hello and w, and the positions also line up, so doc1 and doc2 are returned.
At search time there is no longer any need to take a prefix and scan the whole inverted index; the prefix is simply looked up in the inverted index like any other term, and if it is found the document matches, exactly like a normal match full-text query.
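To make the contrast concrete, here is a rough sketch (the index name some_index and the sub-fields title.raw and title.autocomplete are made up for illustration). A prefix query against a field analyzed with the standard analyzer has to enumerate matching terms in the term dictionary at query time:
GET /some_index/_search
{
  "query": {
    "prefix": {
      "title.raw": "w"
    }
  }
}
whereas a match query against a field indexed with edge ngrams is an ordinary term lookup, because w is already a term in the inverted index:
GET /some_index/_search
{
  "query": {
    "match": {
      "title.autocomplete": "w"
    }
  }
}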
2. Experiment: rebuild the index with an edge_ngram filter, setting "max_gram": 3
DELETE /my_index
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 3
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "autocomplete_filter"
          ]
        }
      }
    }
  }
}
Analyze some text with the autocomplete analyzer:
GET /my_index/_analyze
{
  "analyzer": "autocomplete",
  "text": "hello world"
}
Response:
{
  "tokens": [
    {
      "token": "h",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "he",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "hel",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "w",
      "start_offset": 6,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "wo",
      "start_offset": 6,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "wor",
      "start_offset": 6,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
Rebuild the my_index index, this time setting "max_gram": 20.
DELETE /my_index
Define the custom edge_ngram-based analyzer autocomplete:
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "autocomplete_filter"
          ]
        }
      }
    }
  }
}
Analyze some text with the autocomplete analyzer:
GET /my_index/_analyze
{
  "analyzer": "autocomplete",
  "text": "hello world"
}
Response:
{
  "tokens": [
    {
      "token": "h",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "he",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "hel",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "hell",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "hello",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "w",
      "start_offset": 6,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "wo",
      "start_offset": 6,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "wor",
      "start_offset": 6,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "worl",
      "start_offset": 6,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "world",
      "start_offset": 6,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
Create the mapping:
PUT /my_index/_mapping/my_type
{
  "properties": {
    "title": {
      "type": "text",
      "analyzer": "autocomplete",
      "search_analyzer": "standard"
    }
  }
}
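Note that search_analyzer is set to standard: the query string should not be run through the edge_ngram filter again at search time, otherwise a query like hello w would itself be expanded into prefixes; only the indexed side is expanded. You can check what the search side sees with a quick _analyze call (a sketch):
GET /my_index/_analyze
{
  "analyzer": "standard",
  "text": "hello w"
}
which returns just the two terms hello and w.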
Insert some test documents:
PUT /my_index/my_type/1
{
  "title": "hello we"
}
PUT /my_index/my_type/2
{
  "title": "hello win"
}
PUT /my_index/my_type/3
{
  "title": "hello world"
}
PUT /my_index/my_type/4
{
  "title": "hello dog"
}
GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "title": "hello w"
    }
  }
}
Response:
{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 0.9055367,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "2",
        "_score": 0.9055367,
        "_source": {
          "title": "hello win"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "3",
        "_score": 0.3758317,
        "_source": {
          "title": "hello world"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 0.3594392,
        "_source": {
          "title": "hello we"
        }
      }
    ]
  }
}
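If you also want the query terms to be adjacent and in order (so that, for example, a document containing the words in reverse order would not match), a match_phrase query can be used instead of match; because the edge ngrams keep the position of the word they came from, the phrase hello w still lines up against hello we, hello win and hello world. A sketch:
GET /my_index/my_type/_search
{
  "query": {
    "match_phrase": {
      "title": "hello w"
    }
  }
}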