Elasticsearch Core Concepts (5): Introduction to and Use of Analyzers

陈橙橙丶 · Published 2022-02-18 17:20:08

Introduction to and Use of Analyzers

The previous chapter introduced the basic use of search; interested readers can refer to Elasticsearch Core Concepts (4): Simple Use of Search. In this chapter we introduce analyzers and how to use them.

Overview

What is an analyzer, and which analyzers are built in?

  • What is an analyzer?

    • A tool that splits a piece of user-supplied text into multiple terms according to a set of rules
    • For example: The best 3-points shooter is Curry !
  • Commonly used built-in analyzers

    • standard analyzer
    • simple analyzer
    • whitespace analyzer
    • stop analyzer
    • language analyzer
    • pattern analyzer

    standard analyzer: the standard analyzer; it is the default and is used whenever no analyzer is specified

    • Request:

      curl -X POST "http://172.25.45.150:9200/nba/_analyze" -H 'Content-Type:application/json' -d '
      {
          "analyzer":"standard",
          "text":"The best 3-points shooter is Curry !"
      }
      '
      
    • Response

      {
          "tokens": [
              {
                  "token": "the",
                  "start_offset": 0,
                  "end_offset": 3,
                  "type": "",
                  "position": 0
              },
              {
                  "token": "best",
                  "start_offset": 4,
                  "end_offset": 8,
                  "type": "",
                  "position": 1
              },
              {
                  "token": "3",
                  "start_offset": 9,
                  "end_offset": 10,
                  "type": "",
                  "position": 2
              },
              {
                  "token": "points",
                  "start_offset": 11,
                  "end_offset": 17,
                  "type": "",
                  "position": 3
              },
              {
                  "token": "shooter",
                  "start_offset": 18,
                  "end_offset": 25,
                  "type": "",
                  "position": 4
              },
              {
                  "token": "is",
                  "start_offset": 26,
                  "end_offset": 28,
                  "type": "",
                  "position": 5
              },
              {
                  "token": "curry",
                  "start_offset": 29,
                  "end_offset": 34,
                  "type": "",
                  "position": 6
              }
          ]
      }
      

    simple analyzer: the simple analyzer splits text into terms at any character that is not a letter (so a character such as the digit 3 is simply dropped), and lowercases every term

    • Request:

      curl -X POST "http://172.25.45.150:9200/nba/_analyze" -H 'Content-Type:application/json' -d '
      {
          "analyzer":"simple",
          "text":"The best 3-points shooter is Curry !"
      }
      '
      
    • Response

      {
          "tokens": [
              {
                  "token": "the",
                  "start_offset": 0,
                  "end_offset": 3,
                  "type": "word",
                  "position": 0
              },
              {
                  "token": "best",
                  "start_offset": 4,
                  "end_offset": 8,
                  "type": "word",
                  "position": 1
              },
              {
                  "token": "points",
                  "start_offset": 11,
                  "end_offset": 17,
                  "type": "word",
                  "position": 2
              },
              {
                  "token": "shooter",
                  "start_offset": 18,
                  "end_offset": 25,
                  "type": "word",
                  "position": 3
              },
              {
                  "token": "is",
                  "start_offset": 26,
                  "end_offset": 28,
                  "type": "word",
                  "position": 4
              },
              {
                  "token": "curry",
                  "start_offset": 29,
                  "end_offset": 34,
                  "type": "word",
                  "position": 5
              }
          ]
      }
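
    The simple analyzer is essentially just the lowercase tokenizer with no extra token filters, so the same result can be reproduced by asking _analyze for a bare tokenizer. A minimal sketch (same cluster address as above; the output should be the same as the simple analyzer response):

      curl -X POST "http://172.25.45.150:9200/nba/_analyze" -H 'Content-Type:application/json' -d '
      {
          "tokenizer":"lowercase",
          "text":"The best 3-points shooter is Curry !"
      }
      '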
      

    whitespace analyzer: the whitespace analyzer splits text into terms wherever it encounters whitespace

    • Request:

      curl -X POST "http://172.25.45.150:9200/nba/_analyze" -H 'Content-Type:application/json' -d '
      {
          "analyzer":"whitespace",
          "text":"The best 3-points shooter is Curry !"
      }
      '
      
    • Response

      {
          "tokens": [
              {
                  "token": "The",
                  "start_offset": 0,
                  "end_offset": 3,
                  "type": "word",
                  "position": 0
              },
              {
                  "token": "best",
                  "start_offset": 4,
                  "end_offset": 8,
                  "type": "word",
                  "position": 1
              },
              {
                  "token": "3-points",
                  "start_offset": 9,
                  "end_offset": 17,
                  "type": "word",
                  "position": 2
              },
              {
                  "token": "shooter",
                  "start_offset": 18,
                  "end_offset": 25,
                  "type": "word",
                  "position": 3
              },
              {
                  "token": "is",
                  "start_offset": 26,
                  "end_offset": 28,
                  "type": "word",
                  "position": 4
              },
              {
                  "token": "Curry",
                  "start_offset": 29,
                  "end_offset": 34,
                  "type": "word",
                  "position": 5
              },
              {
                  "token": "!",
                  "start_offset": 35,
                  "end_offset": 36,
                  "type": "word",
                  "position": 6
              }
          ]
      }
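
    If you want whitespace-only splitting but lowercased terms (so that Curry and curry would both match), the whitespace tokenizer can be combined with the lowercase token filter. A minimal sketch of such a test, assuming the same cluster as above:

      curl -X POST "http://172.25.45.150:9200/nba/_analyze" -H 'Content-Type:application/json' -d '
      {
          "tokenizer":"whitespace",
          "filter":["lowercase"],
          "text":"The best 3-points shooter is Curry !"
      }
      '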
      

    stop analyzer: the stop analyzer is very similar to the simple analyzer; the only difference is that it also removes stop words, using the English stop word list by default

    • stopwords is the predefined list of stop words; words such as the, a, an, this, of, at are all dropped
    • Request:

      curl -X POST "http://172.25.45.150:9200/nba/_analyze" -H 'Content-Type:application/json' -d '
      {
          "analyzer":"stop",
          "text":"The best 3-points shooter is Curry !"
      }
      '
      
    • Response

      {
          "tokens": [
              {
                  "token": "best",
                  "start_offset": 4,
                  "end_offset": 8,
                  "type": "word",
                  "position": 1
              },
              {
                  "token": "points",
                  "start_offset": 11,
                  "end_offset": 17,
                  "type": "word",
                  "position": 2
              },
              {
                  "token": "shooter",
                  "start_offset": 18,
                  "end_offset": 25,
                  "type": "word",
                  "position": 3
              },
              {
                  "token": "curry",
                  "start_offset": 29,
                  "end_offset": 34,
                  "type": "word",
                  "position": 5
              }
          ]
      }
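
    The stop word list is configurable: when a stop analyzer is defined in the index settings, its stopwords parameter accepts either a predefined list such as _english_ or an explicit array. A minimal sketch, using a hypothetical index name stop_demo:

      curl -X PUT "http://172.25.45.150:9200/stop_demo" -H 'Content-Type:application/json' -d '
      {
          "settings":{
              "analysis":{
                  "analyzer":{
                      "my_stop_analyzer":{
                          "type":"stop",
                          "stopwords":["the","is","a"]
                      }
                  }
              }
          }
      }
      '

    It can then be tested with POST /stop_demo/_analyze and "analyzer":"my_stop_analyzer".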
      

    language analyzer: a language-specific analyzer, for example english. Language analyzers typically remove stop words and apply stemming, which is why the response below contains point and curri. Built-in languages: arabic, armenian, basque, bengali, brazilian, bulgarian, catalan, cjk, czech, danish, dutch, english, estonian, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, latvian, lithuanian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, thai

    • Request:

      curl -X POST "http://172.25.45.150:9200/nba/_analyze" -H 'Content-Type:application/json' -d '
      {
          "analyzer":"english",
          "text":"The best 3-points shooter is Curry !"
      }
      '
      
    • Response

      {
          "tokens": [
              {
                  "token": "best",
                  "start_offset": 4,
                  "end_offset": 8,
                  "type": "",
                  "position": 1
              },
              {
                  "token": "3",
                  "start_offset": 9,
                  "end_offset": 10,
                  "type": "",
                  "position": 2
              },
              {
                  "token": "point",
                  "start_offset": 11,
                  "end_offset": 17,
                  "type": "",
                  "position": 3
              },
              {
                  "token": "shooter",
                  "start_offset": 18,
                  "end_offset": 25,
                  "type": "",
                  "position": 4
              },
              {
                  "token": "curri",
                  "start_offset": 29,
                  "end_offset": 34,
                  "type": "",
                  "position": 6
              }
          ]
      }
      

    pattern analyzer: the pattern analyzer splits text into terms using a regular expression; the default pattern is \W+ (any sequence of non-word characters)

    • Request:

      curl -X POST "http://172.25.45.150:9200/nba/_analyze" -H 'Content-Type:application/json' -d '
      {
          "analyzer":"pattern",
          "text":"The best 3-points shooter is Curry !"
      }
      '
      
    • Response

      {
          "tokens": [
              {
                  "token": "the",
                  "start_offset": 0,
                  "end_offset": 3,
                  "type": "word",
                  "position": 0
              },
              {
                  "token": "best",
                  "start_offset": 4,
                  "end_offset": 8,
                  "type": "word",
                  "position": 1
              },
              {
                  "token": "3",
                  "start_offset": 9,
                  "end_offset": 10,
                  "type": "word",
                  "position": 2
              },
              {
                  "token": "points",
                  "start_offset": 11,
                  "end_offset": 17,
                  "type": "word",
                  "position": 3
              },
              {
                  "token": "shooter",
                  "start_offset": 18,
                  "end_offset": 25,
                  "type": "word",
                  "position": 4
              },
              {
                  "token": "is",
                  "start_offset": 26,
                  "end_offset": 28,
                  "type": "word",
                  "position": 5
              },
              {
                  "token": "curry",
                  "start_offset": 29,
                  "end_offset": 34,
                  "type": "word",
                  "position": 6
              }
          ]
      }
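
    The regular expression is configurable: a pattern analyzer defined in the index settings accepts a pattern parameter (and optionally lowercase). A minimal sketch that splits on commas instead of \W+, using a hypothetical index name pattern_demo:

      curl -X PUT "http://172.25.45.150:9200/pattern_demo" -H 'Content-Type:application/json' -d '
      {
          "settings":{
              "analysis":{
                  "analyzer":{
                      "comma_analyzer":{
                          "type":"pattern",
                          "pattern":",",
                          "lowercase":true
                      }
                  }
              }
          }
      }
      '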
      
Choosing an Analyzer

The main built-in analyzers have now been covered; next, let us use an analyzer in an actual index and query.

  • Request (create an index that uses an analyzer)

    curl -X PUT "http://172.25.45.150:9200/my_index" -H 'Content-Type:application/json' -d '
    {
        "settings":{
            "analysis":{
                "analyzer":{
                    "my_analyzer":{
                        "type":"whitespace"
                    }
                }
            }
        },
        "mappings":{
            "properties":{
                "name":{
                    "type":"text"
                },
                "team_name":{
                    "type":"text"
                },
                "position":{
                    "type":"keyword"
                },
                "play_year":{
                    "type":"keyword"
                },
                "jerse_no":{
                    "type":"keyword"
                },
                "title":{
                    "type":"text",
                    "analyzer":"my_analyzer"
                }
            }
        }
    }
    '
    
  • Add test data

    curl -X POST "http://172.25.45.150:9200/my_index/_doc" -H 'Content-Type:application/json' -d '
    {
        "name":"库里",
        "team_name":"勇士",
        "position":"组织后卫",
        "play_year":"10",
        "jerse_no":"30",
        "title":"The best 3-point shooter is Curry!"
    }
    '
    
  • Query

    curl -X POST "http://172.25.45.150:9200/my_index/_search" -H 'Content-Type:application/json' -d '
    {
        "query":{
            "match":{
                "title":"Curry!"
            }
        }
    }
    '
    

    You will find that searching with title=Curry! returns the document, while title=Curry returns nothing: the whitespace analyzer split the text on whitespace only, so the indexed term is Curry! (punctuation included), and the query term Curry cannot match it. This can be checked directly, as in the sketch below.
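
    A quick way to confirm this is to ask _analyze how the title field is tokenized; the response should contain the term Curry! rather than Curry (a sketch, assuming the index created above):

      curl -X POST "http://172.25.45.150:9200/my_index/_analyze" -H 'Content-Type:application/json' -d '
      {
          "field":"title",
          "text":"The best 3-point shooter is Curry!"
      }
      '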
