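Sample values: C3D0-KD345 C3K5-DFG65 C4I8-UI365 C4I8-UC365
Search for C3 --> the first two values should both be returned --> this is a search by string prefix
Create a new index manually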
PUT my_index
{
"mappings": {
"my_type": {
"properties": {
"title": {
"type": "keyword"
}
}
}
}
}
Index the sample data
PUT /my_index/my_type/1
{
"title" : "C3-D0-KD345"
}
PUT /my_index/my_type/2
{
"title" : "C3-K5-DFG65"
}
PUT /my_index/my_type/3
{
"title" : "C4-I8-UI365"
}
PUT /my_index/my_type/4
{
"title" : "C4-I8-UC365"
}
Prefix search
GET my_index/my_type/_search
{
"query": {
"prefix": {
"title": {
"value": "C3"
}
}
}
}
Response
{
"took": 19,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 1,
"hits": [
{
"_index": "my_index",
"_type": "my_type",
"_id": "2",
"_score": 1,
"_source": {
"title": "C3-K5-DFG65"
}
},
{
"_index": "my_index",
"_type": "my_type",
"_id": "1",
"_score": 1,
"_source": {
"title": "C3-D0-KD345"
}
}
]
}
}
GET my_index/my_type/_search
{
"query": {
"prefix": {
"title": {
"value": "C4"
}
}
}
}
Response
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 1,
"hits": [
{
"_index": "my_index",
"_type": "my_type",
"_id": "4",
"_score": 1,
"_source": {
"title": "C4-I8-UC365"
}
},
{
"_index": "my_index",
"_type": "my_type",
"_id": "3",
"_score": 1,
"_source": {
"title": "C4-I8-UI365"
}
}
]
}
}
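2. How prefix search works
A prefix query does not compute a relevance score; every hit gets the constant score 1. The only difference between a prefix query and a prefix filter is that the filter caches its bitset
It scans the entire inverted index; an example follows
The shorter the prefix, the more docs match and have to be processed, and the worse the performance, so use as long a prefix as possible
How is a prefix search executed, and why is it slow?
match: C3-D0-KD345 C3-K5-DFG65 C4-I8-UI365 C4-I8-UC365
For full-text search every string has to be analyzed into tokens. Inspect the tokens: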
GET _analyze
{
"text": "C3-D0-KD345",
"analyzer": "standard"
}
Response
{
"tokens": [
{
"token": "c3",
"start_offset": 0,
"end_offset": 2,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "d0",
"start_offset": 3,
"end_offset": 5,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "kd345",
"start_offset": 6,
"end_offset": 11,
"type": "<ALPHANUM>",
"position": 2
}
]
}
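Inverted index after analysis (term --> docs):
c3 --> doc1, doc2
d0 --> doc1
kd345 --> doc1
k5 --> doc2
dfg65 --> doc2
c4 --> doc3, doc4
i8 --> doc3, doc4
ui365 --> doc3
uc365 --> doc4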
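Look up c3 in the inverted index --> as soon as the term c3 is found, the search can stop, because its posting list already holds the only 2 docs that contain c3 --> there is no need to look at any other term
So match usually performs very well
prefix works on the unanalyzed values: C3-D0-KD345 C3-K5-DFG65 C4-I8-UI365 C4-I8-UC365
C3 --> the scan first reaches C3-D0-KD345: good, one term with the prefix C3 --> but it has to keep going, because C3-K5-DFG65 comes later and there may be many more terms with that prefix --> finding one matching term is not a reason to stop --> the search only ends once the whole inverted index has been scanned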
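The contrast described above can be sketched in Python. This is a toy model that follows the description in these notes, not Elasticsearch's or Lucene's actual implementation; the function names (`build_analyzed_index`, `build_keyword_index`, `match_lookup`, `prefix_scan`) and the simplified tokenizer are assumptions made for illustration.

```python
import re
from collections import defaultdict

# The four sample titles from the notes, keyed by doc id.
DOCS = {
    1: "C3-D0-KD345",
    2: "C3-K5-DFG65",
    3: "C4-I8-UI365",
    4: "C4-I8-UC365",
}

def build_analyzed_index(docs):
    """Inverted index for an analyzed field (assumed: lowercase, split on non-alphanumerics)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            index[token].add(doc_id)
    return index

def build_keyword_index(docs):
    """Inverted index for a keyword field: the whole string is a single term."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        index[text].add(doc_id)
    return index

def match_lookup(index, token):
    """match: one direct lookup of the term, then stop."""
    return sorted(index.get(token, set()))

def prefix_scan(index, prefix):
    """prefix: test every term against the prefix, as described in the notes."""
    hits = set()
    terms_checked = 0
    for term, doc_ids in index.items():
        terms_checked += 1
        if term.startswith(prefix):
            hits |= doc_ids
    return sorted(hits), terms_checked

analyzed = build_analyzed_index(DOCS)
keyword = build_keyword_index(DOCS)

print(match_lookup(analyzed, "c3"))  # [1, 2]
print(prefix_scan(keyword, "C3"))    # ([1, 2], 4)
```

In this model the match lookup touches one term, while the prefix scan checks all 4 terms even though only 2 match. With millions of distinct terms that difference is what makes prefix queries slow.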
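In practice there are cases that full-text search cannot handle
For example:
C3D0-KD345 C3K5-DFG65 C4I8-UI365 C4I8-UC365
Tokens (for C3D0-KD345):
c3d0 kd345
c3 --> match --> look through the whole inverted index: is there a term c3? No
c3 --> use prefix, but prefix performs poorly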
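A small Python sketch of why match finds nothing here. The helper `simple_analyze` is a hypothetical, rough stand-in for the standard analyzer (lowercase, split on non-alphanumeric characters); the real analyzer follows Unicode segmentation rules, so treat this only as an approximation that happens to agree for these values.

```python
import re

def simple_analyze(text):
    """Rough stand-in for the standard analyzer: lowercase, split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", text.lower())

titles = ["C3D0-KD345", "C3K5-DFG65", "C4I8-UI365", "C4I8-UC365"]

# All distinct terms that would end up in the inverted index.
terms = {token for title in titles for token in simple_analyze(title)}

print(simple_analyze("C3D0-KD345"))                     # ['c3d0', 'kd345']
print("c3" in terms)                                    # False
print(sorted(t for t in terms if t.startswith("c3")))   # ['c3d0', 'c3k5']
```

There is no term c3, so an exact term lookup comes back empty; only a prefix test over the terms finds c3d0 and c3k5.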
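3. Wildcard search: similar to prefix search, but more powerful
C3-D0-KD345 C3-K5-DFG65 C4-I8-UI365 C4-I8-UC365
Match docs whose title starts with C3- and whose last character is 5
C3-*5
Wildcards can express more complex fuzzy-matching patterns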
GET my_index/my_type/_search
{
"query": {
"wildcard": {
"title": {
"value": "C3-*5"
}
}
}
}
?: exactly one arbitrary character; *: zero or more arbitrary characters
Response
{
"took": 0,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 1,
"hits": [
{
"_index": "my_index",
"_type": "my_type",
"_id": "2",
"_score": 1,
"_source": {
"title": "C3-K5-DFG65"
}
},
{
"_index": "my_index",
"_type": "my_type",
"_id": "1",
"_score": 1,
"_source": {
"title": "C3-D0-KD345"
}
}
]
}
}
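Performance is just as poor: the entire inverted index has to be scanned
Search for titles that start with C4 and end with U?365
? stands for any single character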
GET my_index/my_type/_search
{
"query": {
"wildcard": {
"title": {
"value": "C4*U?365"
}
}
}
}
Response
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 1,
"hits": [
{
"_index": "my_index",
"_type": "my_type",
"_id": "4",
"_score": 1,
"_source": {
"title": "C4-I8-UC365"
}
},
{
"_index": "my_index",
"_type": "my_type",
"_id": "3",
"_score": 1,
"_source": {
"title": "C4-I8-UI365"
}
}
]
}
}
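The wildcard semantics used in this section (? for one character, * for zero or more) can be sketched in Python by translating the pattern into a regular expression and testing every term. This only illustrates the matching rules; `wildcard_to_regex` and `wildcard_scan` are hypothetical helpers, and escape handling for literal `*` or `?` is deliberately left out.

```python
import re

def wildcard_to_regex(pattern):
    """Translate a wildcard pattern: '*' -> '.*', '?' -> '.', everything else literal."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))

def wildcard_scan(terms, pattern):
    """Test every term against the pattern; the whole term must match."""
    regex = wildcard_to_regex(pattern)
    return [term for term in terms if regex.fullmatch(term)]

# Keyword terms from the sample index.
TERMS = ["C3-D0-KD345", "C3-K5-DFG65", "C4-I8-UI365", "C4-I8-UC365"]

print(wildcard_scan(TERMS, "C3-*5"))     # ['C3-D0-KD345', 'C3-K5-DFG65']
print(wildcard_scan(TERMS, "C4*U?365"))  # ['C4-I8-UI365', 'C4-I8-UC365']
```

As with the prefix case, every term is tested, which matches the note that wildcard performance is just as poor.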
4. Regexp search
GET /my_index/my_type/_search
{
"query": {
"regexp": {
"title": "C[0-9].+"
}
}
}
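C[0-9].+
[0-9]: any digit in the given range; [a-z]: any letter in the given range; .: any single character; +: one or more repetitions of the preceding element
Response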
{
"took": 4,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 4,
"max_score": 1,
"hits": [
{
"_index": "my_index",
"_type": "my_type",
"_id": "2",
"_score": 1,
"_source": {
"title": "C3-K5-DFG65"
}
},
{
"_index": "my_index",
"_type": "my_type",
"_id": "4",
"_score": 1,
"_source": {
"title": "C4-I8-UC365"
}
},
{
"_index": "my_index",
"_type": "my_type",
"_id": "1",
"_score": 1,
"_source": {
"title": "C3-D0-KD345"
}
},
{
"_index": "my_index",
"_type": "my_type",
"_id": "3",
"_score": 1,
"_source": {
"title": "C4-I8-UI365"
}
}
]
}
}
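The pattern above can be tried out in Python as a rough illustration. Two assumptions to keep in mind: Lucene's regexp syntax is not identical to Python's `re` module (they agree for this simple pattern), and the pattern is applied to the whole term, which is modelled here with `fullmatch`. The helper `regexp_scan` is hypothetical.

```python
import re

# Keyword terms from the sample index.
TERMS = ["C3-D0-KD345", "C3-K5-DFG65", "C4-I8-UI365", "C4-I8-UC365"]

def regexp_scan(terms, pattern):
    """Test every term; the pattern must match the whole term, not just a part of it."""
    regex = re.compile(pattern)
    return [term for term in terms if regex.fullmatch(term)]

print(regexp_scan(TERMS, "C[0-9].+"))  # all four terms
print(regexp_scan(TERMS, "C3"))        # [] because 'C3' alone does not cover a whole term
print(regexp_scan(TERMS, "C3.*"))      # ['C3-D0-KD345', 'C3-K5-DFG65']
```

The second call shows why a bare prefix is not enough as a regexp: the rest of the term has to be covered by something like `.*` or `.+`.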
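wildcard and regexp work on the same principle as prefix: they scan the whole index, so performance is poor
These are covered mainly as an introduction to some advanced search syntax. In real applications, avoid them whenever you can, because they are too slow.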