11. Deep Dive into Search Techniques: Case Study on Implementing the best fields Strategy for Multi-Field Search with dis_max

Dongguo丶, published 2021-11-20 08:13:47

1. Add a content field to the post data

POST /forum/article/_bulk
{ "update": { "_id": "1"} }
{ "doc" : {"content" : "i like to write best elasticsearch article"} }
{ "update": { "_id": "2"} }
{ "doc" : {"content" : "i think java is the best programming language"} }
{ "update": { "_id": "3"} }
{ "doc" : {"content" : "i am only an elasticsearch beginner"} }
{ "update": { "_id": "4"} }
{ "doc" : {"content" : "elasticsearch and hadoop are all very good solution, i am a beginner"} }
{ "update": { "_id": "5"} }
{ "doc" : {"content" : "spark is best big data solution based on scala ,an programming language similar to java"} }

Response

{
  "took": 81,
  "errors": false,
  "items": [
    {
      "update": {
        "_index": "forum",
        "_type": "article",
        "_id": "1",
        "_version": 6,
        "result": "updated",
        "_shards": {
          "total": 2,
          "successful": 1,
          "failed": 0
        },
        "status": 200
      }
    },
    {
      "update": {
        "_index": "forum",
        "_type": "article",
        "_id": "2",
        "_version": 6,
        "result": "updated",
        "_shards": {
          "total": 2,
          "successful": 1,
          "failed": 0
        },
        "status": 200
      }
    },
    {
      "update": {
        "_index": "forum",
        "_type": "article",
        "_id": "3",
        "_version": 6,
        "result": "updated",
        "_shards": {
          "total": 2,
          "successful": 1,
          "failed": 0
        },
        "status": 200
      }
    },
    {
      "update": {
        "_index": "forum",
        "_type": "article",
        "_id": "4",
        "_version": 6,
        "result": "updated",
        "_shards": {
          "total": 2,
          "successful": 1,
          "failed": 0
        },
        "status": 200
      }
    },
    {
      "update": {
        "_index": "forum",
        "_type": "article",
        "_id": "5",
        "_version": 3,
        "result": "updated",
        "_shards": {
          "total": 2,
          "successful": 1,
          "failed": 0
        },
        "status": 200
      }
    }
  ]
}

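2. Search for posts whose title or content contains java or solution

The request below is a multi-field search: the same query string is matched against both title and content.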

GET /forum/article/_search
{
    "query": {
        "bool": {
            "should": [
                { "match": { "title": "java solution" }},
                { "match": { "content":  "java solution" }}
            ]
        }
    }
}

Response

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 0.8849759,
    "hits": [
      {
        "_index": "forum",
        "_type": "article",
        "_id": "2",
        "_score": 0.8849759,
        "_source": {
          "articleID": "KDKE-B-9947-#kL5",
          "userID": 1,
          "hidden": false,
          "postDate": "2017-01-02",
          "tag": [
            "java"
          ],
          "tag_cnt": 1,
          "view_cnt": 50,
          "title": "this is java blog",
          "content": "i think java is the best programming language"
        }
      },
      {
        "_index": "forum",
        "_type": "article",
        "_id": "4",
        "_score": 0.7120095,
        "_source": {
          "articleID": "QQPX-R-3956-#aD8",
          "userID": 2,
          "hidden": true,
          "postDate": "2017-01-02",
          "tag": [
            "java",
            "elasticsearch"
          ],
          "tag_cnt": 2,
          "view_cnt": 80,
          "title": "this is java, elasticsearch, hadoop blog",
          "content": "elasticsearch and hadoop are all very good solution, i am a beginner"
        }
      },
      {
        "_index": "forum",
        "_type": "article",
        "_id": "5",
        "_score": 0.56008905,
        "_source": {
          "articleID": "DHJK-B-1395-#Ky5",
          "userID": 3,
          "hidden": false,
          "postDate": "2021-11-11",
          "tag": [
            "elasticsearch"
          ],
          "tag_cnt": 1,
          "view_cnt": 10,
          "title": "this is spark blog",
          "content": "spark is best big data solution based on scala ,an programming language similar to java"
        }
      },
      {
        "_index": "forum",
        "_type": "article",
        "_id": "1",
        "_score": 0.26742277,
        "_source": {
          "articleID": "XHDK-A-1293-#fJ3",
          "userID": 1,
          "hidden": false,
          "postDate": "2017-01-01",
          "tag": [
            "java",
            "hadoop"
          ],
          "tag_cnt": 2,
          "view_cnt": 30,
          "title": "this is java and elasticsearch blog",
          "content": "i like to write best elasticsearch article"
        }
      }
    ]
  }
}

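3. Analyzing the results

The content of doc5 contains both java and solution, so we might expect doc5 to rank near the top. Instead, doc2 and doc4 rank ahead of it.

The relevance score of each document is computed as: the sum of the scores of the matching queries, multiplied by the number of matched queries, divided by the total number of queries.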

Take doc4 and doc5 as examples.

Score for doc4:

{ "match": { "title": "java solution" }}: for doc4, title gets a score
{ "match": { "content": "java solution" }}: for doc4, content also gets a score

So the two scores are added together, for example 1.1 + 1.2 = 2.3
matched queries = 2
total queries = 2

2.3 * 2 / 2 = 2.3

Score for doc5:

{ "match": { "title": "java solution" }}: for doc5, title gets no score
{ "match": { "content": "java solution" }}: for doc5, content gets a score

So only one query produces a score, for example 2.3
matched queries = 1
total queries = 2

2.3 * 1 / 2 = 1.15

Score of doc5 = 1.15 < score of doc4 = 2.3
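
The arithmetic above can be reproduced with a short Python sketch. It only applies the formula as stated in this section to the illustrative numbers 1.1, 1.2 and 2.3. The function name bool_should_score and the use of None for a non-matching clause are assumptions made for this example, and the real score depends on the Elasticsearch version and similarity settings.

```python
def bool_should_score(clause_scores):
    """Apply the formula from the text to one document.

    clause_scores holds one entry per should clause; None means the
    clause did not match the document (a convention of this sketch).
    """
    matched = [s for s in clause_scores if s is not None]
    if not matched:
        return 0.0
    # sum of matched scores * matched clauses / total clauses
    return sum(matched) * len(matched) / len(clause_scores)


# Illustrative scores from the text, not real Elasticsearch scores.
doc4_score = bool_should_score([1.1, 1.2])   # title and content both match
doc5_score = bool_should_score([None, 2.3])  # only content matches

print(round(doc4_score, 2), round(doc5_score, 2))
```

Under this formula doc4 outranks doc5 even though doc5 has the single strongest field match.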

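4. The best fields strategy: dis_max

The best fields strategy means that a document in which a single field matches as many of the keywords as possible should rank first, rather than a document in which many fields each match only a few of the keywords.

The dis_max query looks at the scores of its sub-queries and uses only the highest one as the document's score.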

For doc4:

{ "match": { "title": "java solution" }}: title gets a score, 1.1
{ "match": { "content": "java solution" }}: content also gets a score, 1.2
Take the maximum: 1.2

For doc5:

{ "match": { "title": "java solution" }}: title gets no score
{ "match": { "content": "java solution" }}: content gets a score, 2.3
Take the maximum: 2.3

Now the score of doc4 = 1.2 < the score of doc5 = 2.3, so doc5 ranks higher, which is what we want.

The scores above are not real scores; they are only an example to illustrate the situation.
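
The same illustrative numbers can be run through the dis_max rule described above. This is a minimal Python sketch, assuming no tie_breaker is set. The function name dis_max_score and the use of None for a non-matching sub-query are conventions of this example, not part of Elasticsearch.

```python
def dis_max_score(clause_scores):
    """Return the highest sub-query score for one document.

    None marks a sub-query that did not match (a convention of this
    sketch). A document with no matching sub-query scores 0.0 here.
    """
    matched = [s for s in clause_scores if s is not None]
    return max(matched) if matched else 0.0


# Illustrative scores from the text, not real Elasticsearch scores.
doc4_best = dis_max_score([1.1, 1.2])   # title 1.1, content 1.2
doc5_best = dis_max_score([None, 2.3])  # title no match, content 2.3

print(doc4_best, doc5_best)
```

With only the best field counted, doc5 moves ahead of doc4.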

GET /forum/article/_search
{
    "query": {
        "dis_max": {
            "queries": [
                { "match": { "title": "java solution" }},
                { "match": { "content":  "java solution" }}
            ]
        }
    }
}

Response

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 0.68640786,
    "hits": [
      {
        "_index": "forum",
        "_type": "article",
        "_id": "2",
        "_score": 0.68640786,
        "_source": {
          "articleID": "KDKE-B-9947-#kL5",
          "userID": 1,
          "hidden": false,
          "postDate": "2017-01-02",
          "tag": [
            "java"
          ],
          "tag_cnt": 1,
          "view_cnt": 50,
          "title": "this is java blog",
          "content": "i think java is the best programming language"
        }
      },
      {
        "_index": "forum",
        "_type": "article",
        "_id": "5",
        "_score": 0.56008905,
        "_source": {
          "articleID": "DHJK-B-1395-#Ky5",
          "userID": 3,
          "hidden": false,
          "postDate": "2021-11-11",
          "tag": [
            "elasticsearch"
          ],
          "tag_cnt": 1,
          "view_cnt": 10,
          "title": "this is spark blog",
          "content": "spark is best big data solution based on scala ,an programming language similar to java"
        }
      },
      {
        "_index": "forum",
        "_type": "article",
        "_id": "4",
        "_score": 0.5565415,
        "_source": {
          "articleID": "QQPX-R-3956-#aD8",
          "userID": 2,
          "hidden": true,
          "postDate": "2017-01-02",
          "tag": [
            "java",
            "elasticsearch"
          ],
          "tag_cnt": 2,
          "view_cnt": 80,
          "title": "this is java, elasticsearch, hadoop blog",
          "content": "elasticsearch and hadoop are all very good solution, i am a beginner"
        }
      },
      {
        "_index": "forum",
        "_type": "article",
        "_id": "1",
        "_score": 0.26742277,
        "_source": {
          "articleID": "XHDK-A-1293-#fJ3",
          "userID": 1,
          "hidden": false,
          "postDate": "2017-01-01",
          "tag": [
            "java",
            "hadoop"
          ],
          "tag_cnt": 2,
          "view_cnt": 30,
          "title": "this is java and elasticsearch blog",
          "content": "i like to write best elasticsearch article"
        }
      }
    ]
  }
}

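The results confirm that doc5 now ranks ahead of doc4.

However, doc2 still ranks ahead of doc5, which shows that other factors also influence the results returned by the best fields dis_max query.

To understand why, recall the term frequency/inverse document frequency algorithm covered earlier, TF/IDF for short.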
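
To compare the two rankings side by side, the _score values from the two responses above can be sorted in a few lines of Python. The scores are copied from the responses in this post; the dictionary and function names are made up for this sketch.

```python
# _score per document _id, copied from the bool/should response.
bool_scores = {"2": 0.8849759, "4": 0.7120095, "5": 0.56008905, "1": 0.26742277}

# _score per document _id, copied from the dis_max response.
dis_max_scores = {"2": 0.68640786, "5": 0.56008905, "4": 0.5565415, "1": 0.26742277}


def ranking(scores):
    """Return document ids ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


print(ranking(bool_scores))     # ['2', '4', '5', '1']
print(ranking(dis_max_scores))  # ['2', '5', '4', '1']
```

Only doc4 and doc5 swap places; doc2 stays first under both queries.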
