22 A Deep Dive into Search Techniques: Prefix, Wildcard, and Regexp Search in Practice

Dongguo丶 Published: 2021-11-21 20:06:39

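1. Prefix search

C3-D0-KD345
C3-K5-DFG65
C4-I8-UI365
C4-I8-UC365

Searching for C3 should return the first two values: the search matches on the prefix of the string.

Create a new index by hand, mapping title as keyword so that each value is stored as a single unanalyzed term: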

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "title": {
          "type": "keyword"
        }
      }
    }
  }
}

Index the sample data

PUT /my_index/my_type/1
{
  "title" : "C3-D0-KD345"
}

PUT /my_index/my_type/2
{
  "title" : "C3-K5-DFG65"
}

PUT /my_index/my_type/3
{
  "title" : "C4-I8-UI365"
}

PUT /my_index/my_type/4
{
  "title" : "C4-I8-UC365"
}

Run a prefix search

GET my_index/my_type/_search
{
  "query": {
    "prefix": {
      "title": {
        "value": "C3"
      }
    }
  }
}

Response

{
  "took": 19,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "2",
        "_score": 1,
        "_source": {
          "title": "C3-K5-DFG65"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 1,
        "_source": {
          "title": "C3-D0-KD345"
        }
      }
    ]
  }
}

Search with the prefix C4:

GET my_index/my_type/_search
{
  "query": {
    "prefix": {
      "title": {
        "value": "C4"
      }
    }
  }
}

Response

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "4",
        "_score": 1,
        "_source": {
          "title": "C4-I8-UC365"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "3",
        "_score": 1,
        "_source": {
          "title": "C4-I8-UI365"
        }
      }
    ]
  }
}

2. How prefix search works

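A prefix query does not compute a relevance score; every hit gets the constant score of 1. The only difference between a prefix query and a prefix filter is that the filter caches its bitset.

A prefix search has to scan the inverted index instead of looking up a single term; an example follows.

The shorter the prefix, the more docs it matches; the more docs there are to process, the worse the performance. Use the longest prefix you can.

How does a prefix search actually execute, and why is it slow? Compare it with match first.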

match:

C3-D0-KD345
C3-K5-DFG65
C4-I8-UI365
C4-I8-UC365

For full-text search, every string has to be tokenized. Use the _analyze API to inspect the tokens:

GET _analyze
{
  "text": "C3-D0-KD345",
  "analyzer": "standard"
}

Response

{
  "tokens": [
    {
      "token": "c3",
      "start_offset": 0,
      "end_offset": 2,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "d0",
      "start_offset": 3,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "kd345",
      "start_offset": 6,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}

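The resulting inverted index for the four titles:

term    docs
c3      doc1, doc2
d0      doc1
kd345   doc1
k5      doc2
dfg65   doc2
c4      doc3, doc4
i8      doc3, doc4
ui365   doc3
uc365   doc4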

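Searching for c3 looks up the term c3 in the inverted index. Once c3 has been found, the search can stop: only two docs contain c3 and both have been found, so there is no need to examine any other term.

This is why match usually performs very well.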
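The exact-term lookup described above can be sketched in a few lines of Python. This is a toy model, not Elasticsearch code: `tokenize` is a hypothetical, rough stand-in for the standard analyzer (split on non-alphanumerics, lowercase), and `build_inverted_index` and `match_term` are illustrative names.

```python
import re
from collections import defaultdict

# Sample titles from this post, keyed by document id.
DOCS = {
    1: "C3-D0-KD345",
    2: "C3-K5-DFG65",
    3: "C4-I8-UI365",
    4: "C4-I8-UC365",
}


def tokenize(text):
    """Rough approximation of the standard analyzer for these titles."""
    return [tok.lower() for tok in re.split(r"[^0-9A-Za-z]+", text) if tok]


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, title in docs.items():
        for term in tokenize(title):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def match_term(index, term):
    """Exact lookup: a single probe, with no need to visit other terms."""
    return index.get(term.lower(), [])


INDEX = build_inverted_index(DOCS)
print(match_term(INDEX, "c3"))
```

The lookup touches only the entry for c3, which is the reason the text gives for match being fast.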

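prefix works on the unanalyzed values:

C3-D0-KD345
C3-K5-DFG65
C4-I8-UI365
C4-I8-UC365

Searching for the prefix c3 first reaches C3-D0-KD345: good, that is one string with the prefix. But the search has to continue, because C3-K5-DFG65 follows and there may be many more strings with the same prefix. Finding one matching term is not a reason to stop; in this simplified picture the scan ends only after the whole inverted index has been covered. What matters in practice is that every term sharing the prefix must be visited, so the cost grows with the number of matching terms.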
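A minimal sketch of that behaviour, assuming the unanalyzed titles are kept as a sorted list of terms. This is a simplified model and not Lucene's actual implementation; `prefix_terms` is a hypothetical helper. It shows that the walk continues past the first hit, and that a shorter prefix visits more terms.

```python
from bisect import bisect_left

# Unanalyzed keyword terms, kept in sorted order.
TERMS = sorted(["C3-D0-KD345", "C3-K5-DFG65", "C4-I8-UI365", "C4-I8-UC365"])


def prefix_terms(sorted_terms, prefix):
    """Return every term that starts with `prefix`.

    The walk begins at the first candidate and keeps going after each hit;
    it stops only when a term no longer carries the prefix.
    """
    start = bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break  # sorted order: no later term can match
        matches.append(term)
    return matches


for prefix in ("C", "C3", "C4-I8-UI"):
    print(prefix, len(prefix_terms(TERMS, prefix)))
```

With these four titles, the prefix C visits all four terms, C3 visits two, and C4-I8-UI visits one, which matches the advice to prefer long prefixes.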

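Prefix search still has its place, because some real-world cases cannot be solved with full-text search.

For example:

C3D0-KD345
C3K5-DFG65
C4I8-UI365
C4I8-UC365

Tokens for the first value:

c3d0 kd345

c3 with match: even after scanning the whole inverted index, is there a hit? No, because no term is exactly c3.

c3 with prefix: it matches, but prefix performs poorly.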
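The difference can be reproduced with a small Python model. It assumes a hypothetical `tokenize` helper that only approximates the standard analyzer (split on non-alphanumerics, lowercase); `exact_hit` and `prefix_hits` are illustrative names, not Elasticsearch APIs.

```python
import re

TITLES = ["C3D0-KD345", "C3K5-DFG65", "C4I8-UI365", "C4I8-UC365"]


def tokenize(text):
    """Rough approximation of the standard analyzer for these titles."""
    return [tok.lower() for tok in re.split(r"[^0-9A-Za-z]+", text) if tok]


# All distinct terms produced by tokenizing the titles.
TERMS = {term for title in TITLES for term in tokenize(title)}


def exact_hit(terms, query):
    """match-style lookup: the query must equal a whole term."""
    return query.lower() in terms


def prefix_hits(terms, prefix):
    """prefix-style lookup: any term beginning with the prefix counts."""
    return sorted(t for t in terms if t.startswith(prefix.lower()))


print(exact_hit(TERMS, "c3"))
print(prefix_hits(TERMS, "c3"))
```

The exact lookup finds nothing because the terms are c3d0 and c3k5, while the prefix lookup returns both.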

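3. Wildcard search

Wildcard search is similar to prefix search but more powerful.

C3-D0-KD345
C3-K5-DFG65
C4-I8-UI365
C4-I8-UC365

Match documents whose title starts with C3- and ends with 5:

C3-*5

Wildcards can express more complex fuzzy-matching semantics.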

GET my_index/my_type/_search
{
  "query": {
    "wildcard": {
      "title": {
        "value": "C3-*5"
      }
    }
  }
}

?: any single character; *: zero or more characters

Response

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "2",
        "_score": 1,
        "_source": {
          "title": "C3-K5-DFG65"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 1,
        "_source": {
          "title": "C3-D0-KD345"
        }
      }
    ]
  }
}

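Performance is just as poor: the whole inverted index has to be scanned.

Search for titles that start with C4 and end with U?365, where ? stands for any single character: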

GET my_index/my_type/_search
{
  "query": {
    "wildcard": {
      "title": {
        "value": "C4*U?365"
      }
    }
  }
}

Response

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "4",
        "_score": 1,
        "_source": {
          "title": "C4-I8-UC365"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "3",
        "_score": 1,
        "_source": {
          "title": "C4-I8-UI365"
        }
      }
    ]
  }
}
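
The two wildcard patterns above can be checked locally with Python's `fnmatch.fnmatchcase`, which uses the same meaning for `?` and `*` and matches against the whole string. This is only an emulation under that assumption: `fnmatch` additionally treats `[...]` as a character set, which the patterns here do not use, and `wildcard_search` is a hypothetical helper.

```python
from fnmatch import fnmatchcase

TITLES = ["C3-D0-KD345", "C3-K5-DFG65", "C4-I8-UI365", "C4-I8-UC365"]


def wildcard_search(titles, pattern):
    """Return titles that match the pattern from start to end.

    `?` matches exactly one character, `*` matches zero or more.
    Matching is case-sensitive, like a query on an unanalyzed keyword value.
    """
    return [title for title in titles if fnmatchcase(title, pattern)]


print(wildcard_search(TITLES, "C3-*5"))
print(wildcard_search(TITLES, "C4*U?365"))
```

Every title is tested against the pattern, which mirrors the cost the text attributes to wildcard queries.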

4. Regexp search

GET /my_index/my_type/_search
{
  "query": {
    "regexp": {
      "title": "C[0-9].+"
    }
  }
}

Pattern: C[0-9].+

[0-9]: any digit in the given range
[a-z]: any letter in the given range
.: any single character
+: one or more of the preceding character

Response

{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 1,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "2",
        "_score": 1,
        "_source": {
          "title": "C3-K5-DFG65"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "4",
        "_score": 1,
        "_source": {
          "title": "C4-I8-UC365"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 1,
        "_source": {
          "title": "C3-D0-KD345"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "3",
        "_score": 1,
        "_source": {
          "title": "C4-I8-UI365"
        }
      }
    ]
  }
}
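
The pattern C[0-9].+ can be tried out locally with Python's `re` module. This is a sketch under two assumptions: the pattern must match the entire value, so `fullmatch` is used, and the pattern sticks to constructs that Python and Lucene regular expressions interpret the same way. `regexp_search` is a hypothetical helper.

```python
import re

TITLES = ["C3-D0-KD345", "C3-K5-DFG65", "C4-I8-UI365", "C4-I8-UC365"]


def regexp_search(titles, pattern):
    """Return titles that the pattern matches from the first to the last character."""
    compiled = re.compile(pattern)
    return [title for title in titles if compiled.fullmatch(title)]


print(regexp_search(TITLES, "C[0-9].+"))
print(regexp_search(TITLES, "C3.+"))
```

With the sample data, C[0-9].+ accepts all four titles, in line with the four hits in the response, while C3.+ narrows the result to the two C3 titles.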

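wildcard and regexp work on the same principle as prefix: they scan the index instead of looking up a single term, so they perform poorly.

This section is mainly an introduction to some advanced search syntax. In real applications, avoid these queries whenever you can; they are simply too slow.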
