The previous chapter covered the basic use of search; interested readers can refer to Elasticsearch Core Concepts (4): Simple Use of Search. In this chapter we introduce analyzers.
Overview: what is an analyzer, and which analyzers are built in?

What is an analyzer
- A tool that splits a piece of user-input text into multiple terms according to a certain logic
- Example: The best 3-points shooter is Curry !
Commonly used built-in analyzers
- standard analyzer
- simple analyzer
- whitespace analyzer
- stop analyzer
- language analyzer
- pattern analyzer
standard analyzer: the standard analyzer; it is the default and is used whenever no analyzer is specified
Request:
curl -X POST "http://172.25.45.150:9200/nba/_analyze" -H 'Content-Type: application/json' -d '
{
  "analyzer": "standard",
  "text": "The best 3-points shooter is Curry !"
}'
Response:
{
  "tokens": [
    { "token": "the",     "start_offset": 0,  "end_offset": 3,  "type": "<ALPHANUM>", "position": 0 },
    { "token": "best",    "start_offset": 4,  "end_offset": 8,  "type": "<ALPHANUM>", "position": 1 },
    { "token": "3",       "start_offset": 9,  "end_offset": 10, "type": "<NUM>",      "position": 2 },
    { "token": "points",  "start_offset": 11, "end_offset": 17, "type": "<ALPHANUM>", "position": 3 },
    { "token": "shooter", "start_offset": 18, "end_offset": 25, "type": "<ALPHANUM>", "position": 4 },
    { "token": "is",      "start_offset": 26, "end_offset": 28, "type": "<ALPHANUM>", "position": 5 },
    { "token": "curry",   "start_offset": 29, "end_offset": 34, "type": "<ALPHANUM>", "position": 6 }
  ]
}
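For intuition, the standard analyzer's behavior on plain English text can be approximated in a few lines of Python. This is only a rough sketch: the real analyzer uses Unicode text segmentation, which this regex does not replicate.

```python
import re

def standard_like(text):
    # Lowercase, then split on any run of non-alphanumeric characters,
    # roughly mimicking the standard analyzer on simple English text.
    return [t for t in re.split(r"[^0-9A-Za-z]+", text.lower()) if t]

print(standard_like("The best 3-points shooter is Curry !"))
# ['the', 'best', '3', 'points', 'shooter', 'is', 'curry']
```

Note that the digit token "3" survives and everything is lowercased, matching the response above.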
simple analyzer: the simple analyzer splits the text into terms on every character that is not a letter (so digits such as the 3 here are dropped), and lowercases every term.
Request:
curl -X POST "http://172.25.45.150:9200/nba/_analyze" -H 'Content-Type: application/json' -d '
{
  "analyzer": "simple",
  "text": "The best 3-points shooter is Curry !"
}'
Response:
{
  "tokens": [
    { "token": "the",     "start_offset": 0,  "end_offset": 3,  "type": "word", "position": 0 },
    { "token": "best",    "start_offset": 4,  "end_offset": 8,  "type": "word", "position": 1 },
    { "token": "points",  "start_offset": 11, "end_offset": 17, "type": "word", "position": 2 },
    { "token": "shooter", "start_offset": 18, "end_offset": 25, "type": "word", "position": 3 },
    { "token": "is",      "start_offset": 26, "end_offset": 28, "type": "word", "position": 4 },
    { "token": "curry",   "start_offset": 29, "end_offset": 34, "type": "word", "position": 5 }
  ]
}
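The simple analyzer's letter-only tokenization is easy to sketch locally (an illustrative approximation, not the real Lucene tokenizer):

```python
import re

def simple_like(text):
    # Keep only runs of letters and lowercase them; digits such as "3"
    # disappear, which matches the simple analyzer's letter tokenizer.
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]

print(simple_like("The best 3-points shooter is Curry !"))
# ['the', 'best', 'points', 'shooter', 'is', 'curry']
```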
whitespace analyzer: splits the text into terms whenever it encounters whitespace; case and punctuation are left untouched.
Request:
curl -X POST "http://172.25.45.150:9200/nba/_analyze" -H 'Content-Type: application/json' -d '
{
  "analyzer": "whitespace",
  "text": "The best 3-points shooter is Curry !"
}'
Response:
{
  "tokens": [
    { "token": "The",      "start_offset": 0,  "end_offset": 3,  "type": "word", "position": 0 },
    { "token": "best",     "start_offset": 4,  "end_offset": 8,  "type": "word", "position": 1 },
    { "token": "3-points", "start_offset": 9,  "end_offset": 17, "type": "word", "position": 2 },
    { "token": "shooter",  "start_offset": 18, "end_offset": 25, "type": "word", "position": 3 },
    { "token": "is",       "start_offset": 26, "end_offset": 28, "type": "word", "position": 4 },
    { "token": "Curry",    "start_offset": 29, "end_offset": 34, "type": "word", "position": 5 },
    { "token": "!",        "start_offset": 35, "end_offset": 36, "type": "word", "position": 6 }
  ]
}
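In Python terms, the whitespace analyzer behaves essentially like `str.split()` (a sketch; the real analyzer also tracks character offsets):

```python
def whitespace_like(text):
    # Split only on whitespace; case and punctuation are preserved,
    # so "3-points", "Curry" and "!" come through unchanged.
    return text.split()

print(whitespace_like("The best 3-points shooter is Curry !"))
# ['The', 'best', '3-points', 'shooter', 'is', 'Curry', '!']
```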
stop analyzer: very similar to the simple analyzer; the only difference is that the stop analyzer also removes stop words, using the English stop word list by default.
- stopwords: a predefined list of stop words, e.g. the, a, an, this, of, at; all of these are discarded
Request:
curl -X POST "http://172.25.45.150:9200/nba/_analyze" -H 'Content-Type: application/json' -d '
{
  "analyzer": "stop",
  "text": "The best 3-points shooter is Curry !"
}'
Response:
{
  "tokens": [
    { "token": "best",    "start_offset": 4,  "end_offset": 8,  "type": "word", "position": 1 },
    { "token": "points",  "start_offset": 11, "end_offset": 17, "type": "word", "position": 2 },
    { "token": "shooter", "start_offset": 18, "end_offset": 25, "type": "word", "position": 3 },
    { "token": "curry",   "start_offset": 29, "end_offset": 34, "type": "word", "position": 5 }
  ]
}
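The stop analyzer can be sketched as "simple analyzer plus a stop word filter". The stop word set below is a tiny hand-picked subset for illustration, not the full English list Elasticsearch ships with:

```python
import re

# A small subset of the English stop word list, for illustration only.
STOPWORDS = {"the", "a", "an", "this", "of", "at", "is"}

def stop_like(text):
    # Letter tokenizer plus lowercasing (like the simple analyzer),
    # then drop any token found in the stop word set.
    tokens = [t.lower() for t in re.findall(r"[A-Za-z]+", text)]
    return [t for t in tokens if t not in STOPWORDS]

print(stop_like("The best 3-points shooter is Curry !"))
# ['best', 'points', 'shooter', 'curry']
```

Note how "the" and "is" vanish, while the surviving tokens keep their original positions in the real response above (positions 1, 2, 3, 5).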
language analyzer: analyzers for specific languages, e.g. english for English. Built-in languages: arabic, armenian, basque, bengali, brazilian, bulgarian, catalan, cjk, czech, danish, dutch, english, estonian, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, latvian, lithuanian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, thai.
Request:
curl -X POST "http://172.25.45.150:9200/nba/_analyze" -H 'Content-Type: application/json' -d '
{
  "analyzer": "english",
  "text": "The best 3-points shooter is Curry !"
}'
Response:
{
  "tokens": [
    { "token": "best",    "start_offset": 4,  "end_offset": 8,  "type": "<ALPHANUM>", "position": 1 },
    { "token": "3",       "start_offset": 9,  "end_offset": 10, "type": "<NUM>",      "position": 2 },
    { "token": "point",   "start_offset": 11, "end_offset": 17, "type": "<ALPHANUM>", "position": 3 },
    { "token": "shooter", "start_offset": 18, "end_offset": 25, "type": "<ALPHANUM>", "position": 4 },
    { "token": "curri",   "start_offset": 29, "end_offset": 34, "type": "<ALPHANUM>", "position": 6 }
  ]
}
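The english analyzer removes stop words and also stems: "points" becomes "point" and "Curry" becomes "curri". The real implementation uses the full Porter stemmer; the toy sketch below only applies the two reductions visible in this particular output, just to show the idea:

```python
import re

# Small illustrative stop word set, not the full English list.
STOPWORDS = {"the", "is", "a", "an", "of", "at"}

def english_like(text):
    # Toy approximation of the english analyzer: tokenize, lowercase,
    # drop stop words, then apply two Porter-style reductions:
    # strip a plural "s" and rewrite a trailing "y" as "i".
    out = []
    for t in re.split(r"[^0-9A-Za-z]+", text.lower()):
        if not t or t in STOPWORDS:
            continue
        if t.endswith("s"):
            t = t[:-1]
        elif t.endswith("y"):
            t = t[:-1] + "i"
        out.append(t)
    return out

print(english_like("The best 3-points shooter is Curry !"))
# ['best', '3', 'point', 'shooter', 'curri']
```

Stemming is why a search for "point" can match a document containing "points": both sides reduce to the same term.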
pattern analyzer: splits text into terms with a regular expression; the default pattern is \W+ (runs of non-word characters).
Request:
curl -X POST "http://172.25.45.150:9200/nba/_analyze" -H 'Content-Type: application/json' -d '
{
  "analyzer": "pattern",
  "text": "The best 3-points shooter is Curry !"
}'
Response:
{
  "tokens": [
    { "token": "the",     "start_offset": 0,  "end_offset": 3,  "type": "word", "position": 0 },
    { "token": "best",    "start_offset": 4,  "end_offset": 8,  "type": "word", "position": 1 },
    { "token": "3",       "start_offset": 9,  "end_offset": 10, "type": "word", "position": 2 },
    { "token": "points",  "start_offset": 11, "end_offset": 17, "type": "word", "position": 3 },
    { "token": "shooter", "start_offset": 18, "end_offset": 25, "type": "word", "position": 4 },
    { "token": "is",      "start_offset": 26, "end_offset": 28, "type": "word", "position": 5 },
    { "token": "curry",   "start_offset": 29, "end_offset": 34, "type": "word", "position": 6 }
  ]
}
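Since \W+ is ordinary regex syntax, the pattern analyzer's default behavior is straightforward to sketch with `re.split` (again an approximation; the real analyzer uses Java regex semantics):

```python
import re

def pattern_like(text, pattern=r"\W+"):
    # Split on the pattern (default \W+, runs of non-word characters)
    # and lowercase, as the pattern analyzer does by default.
    return [t for t in re.split(pattern, text.lower()) if t]

print(pattern_like("The best 3-points shooter is Curry !"))
# ['the', 'best', '3', 'points', 'shooter', 'is', 'curry']
```

The hyphen in "3-points" is a non-word character, so it becomes a split point, just like in the standard analyzer's output here.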
Now that the main built-in analyzers have been introduced, let's put one to use in a query.
Request (create an index that uses a custom analyzer):
curl -X PUT "http://172.25.45.150:9200/my_index" -H 'Content-Type: application/json' -d '
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": { "type": "whitespace" }
      }
    }
  },
  "mappings": {
    "properties": {
      "name":      { "type": "text" },
      "team_name": { "type": "text" },
      "position":  { "type": "keyword" },
      "play_year": { "type": "keyword" },
      "jerse_no":  { "type": "keyword" },
      "title":     { "type": "text", "analyzer": "my_analyzer" }
    }
  }
}'
Add test data (POST, since we let Elasticsearch generate the document id):
curl -X POST "http://172.25.45.150:9200/my_index/_doc" -H 'Content-Type: application/json' -d '
{
  "name": "Curry",
  "team_name": "Warriors",
  "position": "point guard",
  "play_year": "10",
  "jerse_no": "30",
  "title": "The best 3-point shooter is Curry!"
}'
Query:
curl -X POST "http://172.25.45.150:9200/my_index/_search" -H 'Content-Type: application/json' -d '
{
  "query": {
    "match": {
      "title": "Curry!"
    }
  }
}'
You will find that searching with title=Curry! returns the document, while title=Curry returns nothing. The whitespace analyzer split the stored title on whitespace only, producing the term Curry! with the punctuation attached, so the query term Curry has no matching term in the index.
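The mismatch can be reproduced locally. With whitespace tokenization, the stored title yields the term Curry!, and a match query analyzes the query string with the same analyzer before looking terms up, so the query term Curry never equals it. The helper below is a hypothetical sketch; the real lookup happens inside Lucene's inverted index:

```python
def whitespace_terms(text):
    # The whitespace analyzer keeps punctuation attached to the token.
    return text.split()

# Terms indexed for the document's title field.
index_terms = set(whitespace_terms("The best 3-point shooter is Curry!"))

# A match query analyzes the query string the same way, then looks
# each resulting term up in the index.
print("Curry!" in index_terms)  # True  -> document found
print("Curry" in index_terms)   # False -> no hit
```

Had the title used the standard analyzer instead, both "Curry!" and "Curry" would reduce to the term curry and both queries would match.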