22 Deep Dive into Search Techniques: Prefix, Wildcard, and Regexp Search in Practice

Dongguo丶 Published: 2021-11-21 20:06:39

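1. Prefix search

C3D0-KD345, C3K5-DFG65, C4I8-UI365, C4I8-UC365

Searching for C3 should return the first two values: the search matches on the prefix of the string.

Create a new index by hand, with title mapped as a keyword field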

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "title": {
          "type": "keyword"
        }
      }
    }
  }
}

Index some data

PUT /my_index/my_type/1
{
  "title" : "C3-D0-KD345"
}

PUT /my_index/my_type/2
{
  "title" : "C3-K5-DFG65"
}

PUT /my_index/my_type/3
{
  "title" : "C4-I8-UI365"
}

PUT /my_index/my_type/4
{
  "title" : "C4-I8-UC365"
}

Run a prefix search

GET my_index/my_type/_search
{
  "query": {
    "prefix": {
      "title": {
        "value": "C3"
      }
    }
  }
}

Response

{
  "took": 19,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "2",
        "_score": 1,
        "_source": {
          "title": "C3-K5-DFG65"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 1,
        "_source": {
          "title": "C3-D0-KD345"
        }
      }
    ]
  }
}

GET my_index/my_type/_search
{
  "query": {
    "prefix": {
      "title": {
        "value": "C4"
      }
    }
  }
}

Response

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "4",
        "_score": 1,
        "_source": {
          "title": "C4-I8-UC365"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "3",
        "_score": 1,
        "_source": {
          "title": "C4-I8-UI365"
        }
      }
    ]
  }
}
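As a rough mental model of the two queries above, here is a minimal Python sketch (not Elasticsearch code). It assumes each keyword title is stored as one exact, case-sensitive term; the names TITLES and prefix_search are made up for this illustration.

```python
# Toy model: every keyword value is kept as a single, un-analyzed term.
TITLES = {
    1: "C3-D0-KD345",
    2: "C3-K5-DFG65",
    3: "C4-I8-UI365",
    4: "C4-I8-UC365",
}

def prefix_search(prefix):
    """Return the ids of the docs whose title starts with `prefix`."""
    return sorted(doc_id for doc_id, title in TITLES.items()
                  if title.startswith(prefix))

print(prefix_search("C3"))  # [1, 2]
print(prefix_search("C4"))  # [3, 4]
print(prefix_search("c3"))  # [] : nothing is lowercased on a keyword field
```

The last call is the practical point: on a keyword field the prefix has to be written in the same case as the stored value.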

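2. How prefix search works

A prefix query does not compute a relevance score; every hit gets the default score of 1. The only difference between a prefix query and a prefix filter is that the filter caches its bitset.

A prefix query scans the entire inverted index; the example below walks through why.

The shorter the prefix, the more docs it matches and the more docs have to be processed, so performance gets worse. Use as long a prefix as you can.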
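To make the effect of prefix length concrete, the following small Python sketch counts how many of the four sample titles each prefix matches. It is only an illustration on the toy data; TERMS and count_matches are hypothetical names, not part of any Elasticsearch API.

```python
# The four sample titles, treated as whole keyword terms.
TERMS = ["C3-D0-KD345", "C3-K5-DFG65", "C4-I8-UI365", "C4-I8-UC365"]

def count_matches(prefix):
    """Count the terms that start with `prefix`."""
    return sum(1 for term in TERMS if term.startswith(prefix))

for prefix in ["C", "C3", "C3-D"]:
    print(prefix, count_matches(prefix))
# C 4
# C3 2
# C3-D 1
```

Each extra character narrows the candidate set, which is the reason for preferring long prefixes.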

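How does a prefix search actually execute, and why is it slow?

First, compare it with match:

C3-D0-KD345, C3-K5-DFG65, C4-I8-UI365, C4-I8-UC365

For full-text search, every string has to be analyzed into tokens. To see the tokens: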

GET _analyze
{
  "text": "C3-D0-KD345",
  "analyzer": "standard"
}

Response

{
  "tokens": [
    {
      "token": "c3",
      "start_offset": 0,
      "end_offset": 2,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "d0",
      "start_offset": 3,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "kd345",
      "start_offset": 6,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}

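The inverted index over all four docs then looks like this:

term     docs
c3       doc1, doc2
d0       doc1
kd345    doc1
k5       doc2
dfg65    doc2
c4       doc3, doc4
i8       doc3, doc4
ui365    doc3
uc365    doc4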

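Scan the inverted index for c3. Once the c3 term has been found, the scan can stop: only 2 docs contain c3 and both are already in hand, so there is no need to look at any other term.

That is why match usually performs very well.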
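The match side of the comparison can be sketched in a few lines of Python. This is a toy model under stated assumptions: analyze is a crude stand-in for the standard analyzer (split on non-alphanumerics, then lowercase), and DOCS, build_inverted_index, and match are illustrative names rather than real Elasticsearch functions.

```python
import re
from collections import defaultdict

DOCS = {
    1: "C3-D0-KD345",
    2: "C3-K5-DFG65",
    3: "C4-I8-UI365",
    4: "C4-I8-UC365",
}

def analyze(text):
    """Crude stand-in for the standard analyzer on these sample strings."""
    return [tok.lower() for tok in re.split(r"[^0-9A-Za-z]+", text) if tok]

def build_inverted_index(docs):
    """Map each token to the set of doc ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyze(text):
            index[token].add(doc_id)
    return index

INDEX = build_inverted_index(DOCS)

def match(query):
    """Analyze the query, then look each token up directly in the index."""
    hits = set()
    for token in analyze(query):
        hits |= INDEX.get(token, set())
    return sorted(hits)

print(match("C3"))     # [1, 2]
print(match("kd345"))  # [1]
```

Because the query is reduced to whole tokens, answering it is one lookup per token; no other terms need to be examined.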

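prefix does not analyze anything; it works on the whole stored strings:

C3-D0-KD345, C3-K5-DFG65, C4-I8-UI365, C4-I8-UC365

Searching for the prefix C3 first hits C3-D0-KD345. Good, that is one string with the prefix. But the scan cannot stop there, because C3-K5-DFG65 comes later and there may be many more strings with the same prefix. Finding one matching term is never enough to stop: the scan has to continue until the whole inverted index has been covered.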
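The difference in work can be sketched by counting how many terms each approach examines. This Python snippet follows the simplified description above (a plain walk over a list of terms); real Lucene internals are more sophisticated, and exact_lookup, prefix_scan, TOKENS, and TITLES are hypothetical names used only here.

```python
# Analyzed tokens (what match looks at) and whole titles (what prefix looks at).
TOKENS = ["c3", "d0", "kd345", "k5", "dfg65", "c4", "i8", "ui365", "uc365"]
TITLES = ["C3-D0-KD345", "C3-K5-DFG65", "C4-I8-UI365", "C4-I8-UC365"]

def exact_lookup(terms, wanted):
    """Stop at the first term equal to `wanted`; return (found, terms examined)."""
    for examined, term in enumerate(terms, start=1):
        if term == wanted:
            return True, examined
    return False, len(terms)

def prefix_scan(terms, prefix):
    """Collect every term with the prefix; a hit never ends the scan early."""
    matches = []
    examined = 0
    for term in terms:
        examined += 1
        if term.startswith(prefix):
            matches.append(term)  # found one, but more may follow
    return matches, examined

print(exact_lookup(TOKENS, "c3"))  # (True, 1)
print(prefix_scan(TITLES, "C3"))   # (['C3-D0-KD345', 'C3-K5-DFG65'], 4)
```

In this model the exact lookup is done after one term, while the prefix scan touches all four titles even though only two of them match.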

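So why use prefix at all? Because some real-world cases cannot be handled by full-text search.

For example:

C3D0-KD345, C3K5-DFG65, C4I8-UI365, C4I8-UC365

After analysis, the first string becomes:

c3d0 kd345

c3 with match: there is no c3 term anywhere in the inverted index, so nothing is found.

c3 with prefix: it works, but prefix performs poorly.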
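A short Python sketch of this case, again as a toy model: tokenize is an assumed approximation of the standard analyzer (split on non-alphanumerics, lowercase), and the variable names are illustrative only.

```python
import re

CODES = ["C3D0-KD345", "C3K5-DFG65", "C4I8-UI365", "C4I8-UC365"]

def tokenize(text):
    """Approximate the standard analyzer for these sample strings."""
    return [tok.lower() for tok in re.split(r"[^0-9A-Za-z]+", text) if tok]

# Every token produced by analyzing all four strings.
all_tokens = {tok for code in CODES for tok in tokenize(code)}

print(tokenize("C3D0-KD345"))  # ['c3d0', 'kd345']
print("c3" in all_tokens)      # False : a match on c3 has nothing to hit

# Prefix matching on the whole, un-analyzed strings still works.
print([code for code in CODES if code.startswith("C3")])
# ['C3D0-KD345', 'C3K5-DFG65']
```

No token equals c3, so a term-level match cannot succeed; only matching against the raw string prefix finds the two codes.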

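3. Wildcard search

Similar to prefix search, but more powerful.

C3-D0-KD345, C3-K5-DFG65, C4-I8-UI365, C4-I8-UC365

Match the documents whose title starts with C3- and ends with 5:

C3-*5

Wildcards let you express more complex fuzzy-matching semantics.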

GET my_index/my_type/_search
{
  "query": {
    "wildcard": {
      "title": {
        "value": "C3-*5"
      }
    }
  }
}

? matches any single character; * matches zero or more characters.

Response

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "2",
        "_score": 1,
        "_source": {
          "title": "C3-K5-DFG65"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 1,
        "_source": {
          "title": "C3-D0-KD345"
        }
      }
    ]
  }
}

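Performance is just as poor: the whole inverted index has to be scanned.

Next, search for titles that start with C4 and end with U?365, where ? stands for any single character.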

GET my_index/my_type/_search
{
  "query": {
    "wildcard": {
      "title": {
        "value": "C4*U?365"
      }
    }
  }
}

Response

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "4",
        "_score": 1,
        "_source": {
          "title": "C4-I8-UC365"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "3",
        "_score": 1,
        "_source": {
          "title": "C4-I8-UI365"
        }
      }
    ]
  }
}
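The ? and * semantics used in both wildcard queries can be sketched in Python by translating the pattern into a regular expression and requiring it to cover the whole title. This is an illustration of the matching rules only, not how Elasticsearch implements wildcard; wildcard_to_regex and wildcard_search are hypothetical helpers.

```python
import re

TITLES = {
    1: "C3-D0-KD345",
    2: "C3-K5-DFG65",
    3: "C4-I8-UI365",
    4: "C4-I8-UC365",
}

def wildcard_to_regex(pattern):
    """Translate * (zero or more chars) and ? (exactly one char) to a regex."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))  # everything else is literal
    return re.compile("".join(parts), re.DOTALL)

def wildcard_search(pattern):
    """Return the ids of the docs whose whole title matches the pattern."""
    regex = wildcard_to_regex(pattern)
    return sorted(doc_id for doc_id, title in TITLES.items()
                  if regex.fullmatch(title))

print(wildcard_search("C3-*5"))     # [1, 2]
print(wildcard_search("C4*U?365"))  # [3, 4]
print(wildcard_search("C4*U?65"))   # [] : ? must consume exactly one character
```

The last pattern fails because after U there are four characters left in each title, while ?65 accounts for only three.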

4. Regexp search

GET /my_index/my_type/_search 
{
  "query": {
    "regexp": {
      "title": "C[0-9].+"
    }
  }
}

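In the pattern C[0-9].+:

[0-9]  a digit in the given range
[a-z]  a letter in the given range
.      any single character
+      one or more repetitions of the preceding element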

Response

{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 1,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "2",
        "_score": 1,
        "_source": {
          "title": "C3-K5-DFG65"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "4",
        "_score": 1,
        "_source": {
          "title": "C4-I8-UC365"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 1,
        "_source": {
          "title": "C3-D0-KD345"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "3",
        "_score": 1,
        "_source": {
          "title": "C4-I8-UI365"
        }
      }
    ]
  }
}
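The same pattern can be tried against the four titles in Python. This is a sketch under two assumptions: that Python's re module treats this particular pattern the same way as the Elasticsearch regexp syntax (the two syntaxes differ in general), and that the pattern must match the entire term, which is how regexp queries are anchored. regexp_search is a made-up helper name.

```python
import re

TITLES = {
    1: "C3-D0-KD345",
    2: "C3-K5-DFG65",
    3: "C4-I8-UI365",
    4: "C4-I8-UC365",
}

def regexp_search(pattern):
    """Return the ids of the docs whose whole title matches `pattern`."""
    regex = re.compile(pattern)
    return sorted(doc_id for doc_id, title in TITLES.items()
                  if regex.fullmatch(title))

print(regexp_search("C[0-9].+"))  # [1, 2, 3, 4]
print(regexp_search("C3.+"))      # [1, 2]
print(regexp_search("C[0-9]"))    # [] : the pattern has to cover the whole term
```

The third call shows why the trailing .+ is needed: without it the pattern describes a two-character term, and none of the titles is that short.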

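wildcard and regexp work the same way as prefix: they scan the whole index, so performance is poor.

The goal here is mainly to introduce some advanced search syntax. In real applications, avoid these queries whenever you can; they are simply too slow.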
