Elasticsearch Core Concepts (5): Introduction to and Use of Analyzers

陈橙橙丶 · Published 2022-02-18 17:20:08

Introduction to and Use of Analyzers

The previous chapter introduced the basic use of search; interested readers can refer to Elasticsearch Core Concepts (4): Simple Use of Search. In this chapter we introduce analyzers and how to use them.

Overview

What is an analyzer, and which analyzers are built in?

  • What is an analyzer?

    • A tool that splits a piece of user-supplied text into multiple terms according to a set of rules
    • For example: The best 3-points shooter is Curry !
  • Commonly used built-in analyzers

    • standard analyzer
    • simple analyzer
    • whitespace analyzer
    • stop analyzer
    • language analyzer
    • pattern analyzer

    standard analyzer: the standard analyzer; it is the default and is used whenever no analyzer is specified

    • Request:

      curl -X POST "http://172.25.45.150:9200/nba/_analyze" -H 'Content-Type:application/json' -d '
      {
          "analyzer":"standard",
          "text":"The best 3-points shooter is Curry !"
      }
      '
      
    • Response

      {
          "tokens": [
              {
                  "token": "the",
                  "start_offset": 0,
                  "end_offset": 3,
                  "type": "",
                  "position": 0
              },
              {
                  "token": "best",
                  "start_offset": 4,
                  "end_offset": 8,
                  "type": "",
                  "position": 1
              },
              {
                  "token": "3",
                  "start_offset": 9,
                  "end_offset": 10,
                  "type": "",
                  "position": 2
              },
              {
                  "token": "points",
                  "start_offset": 11,
                  "end_offset": 17,
                  "type": "",
                  "position": 3
              },
              {
                  "token": "shooter",
                  "start_offset": 18,
                  "end_offset": 25,
                  "type": "",
                  "position": 4
              },
              {
                  "token": "is",
                  "start_offset": 26,
                  "end_offset": 28,
                  "type": "",
                  "position": 5
              },
              {
                  "token": "curry",
                  "start_offset": 29,
                  "end_offset": 34,
                  "type": "",
                  "position": 6
              }
          ]
      }
      

    simple analyzer: the simple analyzer splits text into terms at any character that is not a letter (so a character such as the digit 3 is simply dropped), and lowercases every term

    • Request:

      curl -X POST "http://172.25.45.150:9200/nba/_analyze" -H 'Content-Type:application/json' -d '
      {
          "analyzer":"simple",
          "text":"The best 3-points shooter is Curry !"
      }
      '
      
    • Response

      {
          "tokens": [
              {
                  "token": "the",
                  "start_offset": 0,
                  "end_offset": 3,
                  "type": "word",
                  "position": 0
              },
              {
                  "token": "best",
                  "start_offset": 4,
                  "end_offset": 8,
                  "type": "word",
                  "position": 1
              },
              {
                  "token": "points",
                  "start_offset": 11,
                  "end_offset": 17,
                  "type": "word",
                  "position": 2
              },
              {
                  "token": "shooter",
                  "start_offset": 18,
                  "end_offset": 25,
                  "type": "word",
                  "position": 3
              },
              {
                  "token": "is",
                  "start_offset": 26,
                  "end_offset": 28,
                  "type": "word",
                  "position": 4
              },
              {
                  "token": "curry",
                  "start_offset": 29,
                  "end_offset": 34,
                  "type": "word",
                  "position": 5
              }
          ]
      }
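
    The simple analyzer is essentially just the lowercase tokenizer with no extra token filters, so the same result can be reproduced by asking _analyze for a bare tokenizer. A minimal sketch (same cluster address as above; the output should be the same as the simple analyzer response):

      curl -X POST "http://172.25.45.150:9200/nba/_analyze" -H 'Content-Type:application/json' -d '
      {
          "tokenizer":"lowercase",
          "text":"The best 3-points shooter is Curry !"
      }
      '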
      

    whitespace analyzer: the whitespace analyzer splits text into terms wherever it encounters whitespace

    • Request:

      curl -X POST "http://172.25.45.150:9200/nba/_analyze" -H 'Content-Type:application/json' -d '
      {
          "analyzer":"whitespace",
          "text":"The best 3-points shooter is Curry !"
      }
      '
      
    • Response

      {
          "tokens": [
              {
                  "token": "The",
                  "start_offset": 0,
                  "end_offset": 3,
                  "type": "word",
                  "position": 0
              },
              {
                  "token": "best",
                  "start_offset": 4,
                  "end_offset": 8,
                  "type": "word",
                  "position": 1
              },
              {
                  "token": "3-points",
                  "start_offset": 9,
                  "end_offset": 17,
                  "type": "word",
                  "position": 2
              },
              {
                  "token": "shooter",
                  "start_offset": 18,
                  "end_offset": 25,
                  "type": "word",
                  "position": 3
              },
              {
                  "token": "is",
                  "start_offset": 26,
                  "end_offset": 28,
                  "type": "word",
                  "position": 4
              },
              {
                  "token": "Curry",
                  "start_offset": 29,
                  "end_offset": 34,
                  "type": "word",
                  "position": 5
              },
              {
                  "token": "!",
                  "start_offset": 35,
                  "end_offset": 36,
                  "type": "word",
                  "position": 6
              }
          ]
      }
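
    If you want whitespace-only splitting but lowercased terms (so that Curry and curry would both match), the whitespace tokenizer can be combined with the lowercase token filter. A minimal sketch of such a test, assuming the same cluster as above:

      curl -X POST "http://172.25.45.150:9200/nba/_analyze" -H 'Content-Type:application/json' -d '
      {
          "tokenizer":"whitespace",
          "filter":["lowercase"],
          "text":"The best 3-points shooter is Curry !"
      }
      '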
      

    stop analyzer: the stop analyzer is very similar to the simple analyzer; the only difference is that it also removes stop words, using the English stop word list by default

    • stopwords is the predefined list of stop words; words such as the, a, an, this, of, at are all dropped
    • Request:

      curl -X POST "http://172.25.45.150:9200/nba/_analyze" -H 'Content-Type:application/json' -d '
      {
          "analyzer":"stop",
          "text":"The best 3-points shooter is Curry !"
      }
      '
      
    • Response

      {
          "tokens": [
              {
                  "token": "best",
                  "start_offset": 4,
                  "end_offset": 8,
                  "type": "word",
                  "position": 1
              },
              {
                  "token": "points",
                  "start_offset": 11,
                  "end_offset": 17,
                  "type": "word",
                  "position": 2
              },
              {
                  "token": "shooter",
                  "start_offset": 18,
                  "end_offset": 25,
                  "type": "word",
                  "position": 3
              },
              {
                  "token": "curry",
                  "start_offset": 29,
                  "end_offset": 34,
                  "type": "word",
                  "position": 5
              }
          ]
      }
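
    The stop word list is configurable: when a stop analyzer is defined in the index settings, its stopwords parameter accepts either a predefined list such as _english_ or an explicit array. A minimal sketch, using a hypothetical index name stop_demo:

      curl -X PUT "http://172.25.45.150:9200/stop_demo" -H 'Content-Type:application/json' -d '
      {
          "settings":{
              "analysis":{
                  "analyzer":{
                      "my_stop_analyzer":{
                          "type":"stop",
                          "stopwords":["the","is","a"]
                      }
                  }
              }
          }
      }
      '

    It can then be tested with POST /stop_demo/_analyze and "analyzer":"my_stop_analyzer".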
      

    language analyzer: a language-specific analyzer, for example english. Language analyzers typically remove stop words and apply stemming, which is why the response below contains point and curri. Built-in languages: arabic, armenian, basque, bengali, brazilian, bulgarian, catalan, cjk, czech, danish, dutch, english, estonian, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, latvian, lithuanian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, thai

    • Request:

      curl -X POST "http://172.25.45.150:9200/nba/_analyze" -H 'Content-Type:application/json' -d '
      {
          "analyzer":"english",
          "text":"The best 3-points shooter is Curry !"
      }
      '
      
    • Response

      {
          "tokens": [
              {
                  "token": "best",
                  "start_offset": 4,
                  "end_offset": 8,
                  "type": "",
                  "position": 1
              },
              {
                  "token": "3",
                  "start_offset": 9,
                  "end_offset": 10,
                  "type": "",
                  "position": 2
              },
              {
                  "token": "point",
                  "start_offset": 11,
                  "end_offset": 17,
                  "type": "",
                  "position": 3
              },
              {
                  "token": "shooter",
                  "start_offset": 18,
                  "end_offset": 25,
                  "type": "",
                  "position": 4
              },
              {
                  "token": "curri",
                  "start_offset": 29,
                  "end_offset": 34,
                  "type": "",
                  "position": 6
              }
          ]
      }
      

    pattern analyzer: the pattern analyzer splits text into terms using a regular expression; the default pattern is \W+ (any sequence of non-word characters)

    • Request:

      curl -X POST "http://172.25.45.150:9200/nba/_analyze" -H 'Content-Type:application/json' -d '
      {
          "analyzer":"pattern",
          "text":"The best 3-points shooter is Curry !"
      }
      '
      
    • Response

      {
          "tokens": [
              {
                  "token": "the",
                  "start_offset": 0,
                  "end_offset": 3,
                  "type": "word",
                  "position": 0
              },
              {
                  "token": "best",
                  "start_offset": 4,
                  "end_offset": 8,
                  "type": "word",
                  "position": 1
              },
              {
                  "token": "3",
                  "start_offset": 9,
                  "end_offset": 10,
                  "type": "word",
                  "position": 2
              },
              {
                  "token": "points",
                  "start_offset": 11,
                  "end_offset": 17,
                  "type": "word",
                  "position": 3
              },
              {
                  "token": "shooter",
                  "start_offset": 18,
                  "end_offset": 25,
                  "type": "word",
                  "position": 4
              },
              {
                  "token": "is",
                  "start_offset": 26,
                  "end_offset": 28,
                  "type": "word",
                  "position": 5
              },
              {
                  "token": "curry",
                  "start_offset": 29,
                  "end_offset": 34,
                  "type": "word",
                  "position": 6
              }
          ]
      }
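
    The regular expression is configurable: a pattern analyzer defined in the index settings accepts a pattern parameter (and optionally lowercase). A minimal sketch that splits on commas instead of \W+, using a hypothetical index name pattern_demo:

      curl -X PUT "http://172.25.45.150:9200/pattern_demo" -H 'Content-Type:application/json' -d '
      {
          "settings":{
              "analysis":{
                  "analyzer":{
                      "comma_analyzer":{
                          "type":"pattern",
                          "pattern":",",
                          "lowercase":true
                      }
                  }
              }
          }
      }
      '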
      
Choosing an Analyzer

The main built-in analyzers have now been covered; next, let us use an analyzer in an actual index and query.

  • Request (create an index that uses an analyzer)

    curl -X PUT "http://172.25.45.150:9200/my_index" -H 'Content-Type:application/json' -d '
    {
        "settings":{
            "analysis":{
                "analyzer":{
                    "my_analyzer":{
                        "type":"whitespace"
                    }
                }
            }
        },
        "mappings":{
            "properties":{
                "name":{
                    "type":"text"
                },
                "team_name":{
                    "type":"text"
                },
                "position":{
                    "type":"keyword"
                },
                "play_year":{
                    "type":"keyword"
                },
                "jerse_no":{
                    "type":"keyword"
                },
                "title":{
                    "type":"text",
                    "analyzer":"my_analyzer"
                }
            }
        }
    }
    '
    
  • Add test data

    curl -X POST "http://172.25.45.150:9200/my_index/_doc" -H 'Content-Type:application/json' -d '
    {
        "name":"库里",
        "team_name":"勇士",
        "position":"组织后卫",
        "play_year":"10",
        "jerse_no":"30",
        "title":"The best 3-point shooter is Curry!"
    }
    '
    
  • Query

    curl -X POST "http://172.25.45.150:9200/my_index/_search" -H 'Content-Type:application/json' -d '
    {
        "query":{
            "match":{
                "title":"Curry!"
            }
        }
    }
    '
    

    You will find that searching with title=Curry! returns the document, while title=Curry returns nothing: the whitespace analyzer split the text on whitespace only, so the indexed term is Curry! (punctuation included), and the query term Curry cannot match it. This can be checked directly, as in the sketch below.
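
    A quick way to confirm this is to ask _analyze how the title field is tokenized; the response should contain the term Curry! rather than Curry (a sketch, assuming the index created above):

      curl -X POST "http://172.25.45.150:9200/my_index/_analyze" -H 'Content-Type:application/json' -d '
      {
          "field":"title",
          "text":"The best 3-point shooter is Curry!"
      }
      '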
