POST /forum/article/_bulk
{ "update": { "_id": "1"} }
{ "doc" : {"content" : "i like to write best elasticsearch article"} }
{ "update": { "_id": "2"} }
{ "doc" : {"content" : "i think java is the best programming language"} }
{ "update": { "_id": "3"} }
{ "doc" : {"content" : "i am only an elasticsearch beginner"} }
{ "update": { "_id": "4"} }
{ "doc" : {"content" : "elasticsearch and hadoop are all very good solution, i am a beginner"} }
{ "update": { "_id": "5"} }
{ "doc" : {"content" : "spark is best big data solution based on scala ,an programming language similar to java"} }
响应结果
{
"took": 81,
"errors": false,
"items": [
{
"update": {
"_index": "forum",
"_type": "article",
"_id": "1",
"_version": 6,
"result": "updated",
"_shards": {
"total": 2,
"successful": 1,
"failed": 0
},
"status": 200
}
},
{
"update": {
"_index": "forum",
"_type": "article",
"_id": "2",
"_version": 6,
"result": "updated",
"_shards": {
"total": 2,
"successful": 1,
"failed": 0
},
"status": 200
}
},
{
"update": {
"_index": "forum",
"_type": "article",
"_id": "3",
"_version": 6,
"result": "updated",
"_shards": {
"total": 2,
"successful": 1,
"failed": 0
},
"status": 200
}
},
{
"update": {
"_index": "forum",
"_type": "article",
"_id": "4",
"_version": 6,
"result": "updated",
"_shards": {
"total": 2,
"successful": 1,
"failed": 0
},
"status": 200
}
},
{
"update": {
"_index": "forum",
"_type": "article",
"_id": "5",
"_version": 3,
"result": "updated",
"_shards": {
"total": 2,
"successful": 1,
"failed": 0
},
"status": 200
}
}
]
}
2、搜索title或content中包含java或solution的帖子
下面这个就是multi-field搜索,多字段搜索
GET /forum/article/_search
{
"query": {
"bool": {
"should": [
{ "match": { "title": "java solution" }},
{ "match": { "content": "java solution" }}
]
}
}
}
响应结果
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 4,
"max_score": 0.8849759,
"hits": [
{
"_index": "forum",
"_type": "article",
"_id": "2",
"_score": 0.8849759,
"_source": {
"articleID": "KDKE-B-9947-#kL5",
"userID": 1,
"hidden": false,
"postDate": "2017-01-02",
"tag": [
"java"
],
"tag_cnt": 1,
"view_cnt": 50,
"title": "this is java blog",
"content": "i think java is the best programming language"
}
},
{
"_index": "forum",
"_type": "article",
"_id": "4",
"_score": 0.7120095,
"_source": {
"articleID": "QQPX-R-3956-#aD8",
"userID": 2,
"hidden": true,
"postDate": "2017-01-02",
"tag": [
"java",
"elasticsearch"
],
"tag_cnt": 2,
"view_cnt": 80,
"title": "this is java, elasticsearch, hadoop blog",
"content": "elasticsearch and hadoop are all very good solution, i am a beginner"
}
},
{
"_index": "forum",
"_type": "article",
"_id": "5",
"_score": 0.56008905,
"_source": {
"articleID": "DHJK-B-1395-#Ky5",
"userID": 3,
"hidden": false,
"postDate": "2021-11-11",
"tag": [
"elasticsearch"
],
"tag_cnt": 1,
"view_cnt": 10,
"title": "this is spark blog",
"content": "spark is best big data solution based on scala ,an programming language similar to java"
}
},
{
"_index": "forum",
"_type": "article",
"_id": "1",
"_score": 0.26742277,
"_source": {
"articleID": "XHDK-A-1293-#fJ3",
"userID": 1,
"hidden": false,
"postDate": "2017-01-01",
"tag": [
"java",
"hadoop"
],
"tag_cnt": 2,
"view_cnt": 30,
"title": "this is java and elasticsearch blog",
"content": "i like to write best elasticsearch article"
}
}
]
}
}
3、结果分析
在doc5的content中既有java又有solution,所以我们可能期望doc5会排在前面,结果是doc2,doc4排在了前面
计算每个document的relevance score公式为:每个query的分数,乘以matched query数量,除以总query数量
以doc4和doc5为例
算一下doc4的分数
{ “match”: { “title”: “java solution” }},针对doc4,title是有一个分数的 { “match”: { “content”: “java solution” }},针对doc4,content也是有一个分数的
所以是两个分数加起来,比如说,1.1 + 1.2 = 2.3 matched query数量 = 2 总query数量 = 2
2.3 * 2 / 2 = 2.3
算一下doc5的分数
{ “match”: { “title”: “java solution” }},针对doc5,title是没有分数的 { “match”: { “content”: “java solution” }},针对doc5,content是有一个分数的
所以说,只有一个query是有分数的,比如2.3 matched query数量 = 1 总query数量 = 2
2.3 * 1 / 2 = 1.15
doc5的分数 = 1.15 < doc4的分数 = 2.3
4、best fields策略,dis_maxbest fields策略,就是说,搜索到的结果,应该是某一个field中匹配到了尽可能多的关键词,被排在前面;而不是尽可能多的field匹配到了少数的关键词,排在了前面
dis_max语法,直接取多个query中,只选取分数最高的那一个query的分数即可
{ “match”: { “title”: “java solution” }},针对doc4,title是有一个分数的,1.1 { “match”: { “content”: “java solution” }},针对doc4,content也是有一个分数的,1.2 取最大分数,1.2
{ “match”: { “title”: “java solution” }},针对doc5,title是没有分数的 { “match”: { “content”: “java solution” }},针对doc5,content是有一个分数的,2.3 取最大分数,2.3
然后doc4的分数 = 1.2 < doc5的分数 = 2.3,所以doc5就可以排在更前面的地方,符合我们的需要
上面的分数并不是真正的分数,只是举个例子说明这种情况。
GET /forum/article/_search
{
"query": {
"dis_max": {
"queries": [
{ "match": { "title": "java solution" }},
{ "match": { "content": "java solution" }}
]
}
}
}
响应结果
{
"took": 0,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 4,
"max_score": 0.68640786,
"hits": [
{
"_index": "forum",
"_type": "article",
"_id": "2",
"_score": 0.68640786,
"_source": {
"articleID": "KDKE-B-9947-#kL5",
"userID": 1,
"hidden": false,
"postDate": "2017-01-02",
"tag": [
"java"
],
"tag_cnt": 1,
"view_cnt": 50,
"title": "this is java blog",
"content": "i think java is the best programming language"
}
},
{
"_index": "forum",
"_type": "article",
"_id": "5",
"_score": 0.56008905,
"_source": {
"articleID": "DHJK-B-1395-#Ky5",
"userID": 3,
"hidden": false,
"postDate": "2021-11-11",
"tag": [
"elasticsearch"
],
"tag_cnt": 1,
"view_cnt": 10,
"title": "this is spark blog",
"content": "spark is best big data solution based on scala ,an programming language similar to java"
}
},
{
"_index": "forum",
"_type": "article",
"_id": "4",
"_score": 0.5565415,
"_source": {
"articleID": "QQPX-R-3956-#aD8",
"userID": 2,
"hidden": true,
"postDate": "2017-01-02",
"tag": [
"java",
"elasticsearch"
],
"tag_cnt": 2,
"view_cnt": 80,
"title": "this is java, elasticsearch, hadoop blog",
"content": "elasticsearch and hadoop are all very good solution, i am a beginner"
}
},
{
"_index": "forum",
"_type": "article",
"_id": "1",
"_score": 0.26742277,
"_source": {
"articleID": "XHDK-A-1293-#fJ3",
"userID": 1,
"hidden": false,
"postDate": "2017-01-01",
"tag": [
"java",
"hadoop"
],
"tag_cnt": 2,
"view_cnt": 30,
"title": "this is java and elasticsearch blog",
"content": "i like to write best elasticsearch article"
}
}
]
}
}
从结果来看的确doc5排在了doc4前面
但是doc2依然排在了doc5前面,说明还是有其他的因素影响到了最佳字段dis_max返回的结果,
这就需要回想一下之前学过的 term frequency/inverse document frequency算法,简称为TF/IDF算法