
32 Mastering IK Chinese Analysis: Modifying the IK Analyzer Source for MySQL-Based Hot Dictionary Updates

Published: 2021-11-23 22:07:38

Hot updates

So far, every new word has to be added to the ES extension dictionary by hand, which is painful: (1) after every change you have to restart ES for it to take effect, which is tedious; (2) ES is distributed and may run on hundreds of nodes, and you cannot log in to every node and edit the file one by one.

The goal: without stopping ES, add new words in some external store and have ES hot-load them immediately.

Common hot-update approaches

(1) Modify the IK analyzer source code so that it periodically and automatically reloads the dictionary from MySQL. (2) Use the hot-update mechanism natively supported by the IK analyzer: deploy a web server that exposes an HTTP endpoint and signals dictionary changes through the Last-Modified and ETag response headers.


This article uses the first approach; the IK project community on GitHub does not really recommend the second, considering it not very stable.

1. Download the source code

https://github.com/medcl/elasticsearch-analysis-ik/tree/v5.6.0

Pick the tag that matches your Elasticsearch version.


Download the zip or git clone the repository.


The IK analyzer is a standard Java Maven project; import it into your IDE and you can browse the source directly.

2. Modify the source code

For reference, the official remote hot-update implementation lives in the org.wltea.analyzer.dic.Monitor class; the changes below follow the same idea but pull words from MySQL instead.

Note: the version used here is 5.6.0, so the code may differ from other versions.

First, make sure the elasticsearch version in pom.xml matches the Elasticsearch version you are running.


The overall idea: start a background thread that periodically scans the tables defined in MySQL and loads their contents into the dictionaries.

Start the scanning thread in Dictionary#initial

Dictionary class, around line 160: in the initialization method of the Dictionary singleton, create our custom thread and start it. The HotDictReloadThread class is an endless loop that keeps calling Dictionary.getSingleton().reLoadMainDict() to reload the dictionaries.

new Thread(new HotDictReloadThread()).start(); 


Create the new HotDictReloadThread class:

package org.wltea.analyzer.dic;

import org.apache.logging.log4j.Logger;
import org.wltea.analyzer.help.ESPluginLoggerFactory;

public class HotDictReloadThread implements Runnable {

    private static final Logger logger = ESPluginLoggerFactory.getLogger(HotDictReloadThread.class.getName());

    @Override
    public void run() {
        // Keep reloading the dictionaries forever; the loop is throttled by the
        // Thread.sleep(jdbc.reload.interval) call inside loadMySQLExtDict/loadMySQLStopwordDict.
        while (true) {
            logger.info("=====reload hot dic from mysql======");
            Dictionary.getSingleton().reLoadMainDict();
        }
    }
}

reLoadMainDict() simply reloads the main dictionary and the stopword dictionary:

void reLoadMainDict() {
    logger.info("start to reload ik dict.");
    // Load into a fresh instance so reloading does not disturb the dictionary currently in use
    Dictionary tmpDict = new Dictionary(configuration);
    tmpDict.configuration = getSingleton().configuration;
    // load the main dictionary
    tmpDict.loadMainDict();
    // load the stopword dictionary
    tmpDict.loadStopWordDict();
    _MainDict = tmpDict._MainDict;
    _StopWords = tmpDict._StopWords;
    logger.info("reload ik dict finished.");
}
Dictionary#loadMainDict: custom loading of the main dictionary from MySQL

Dictionary class, around line 400: add this.loadMySQLExtDict(); to load the hot-update dictionary from MySQL.


/**
 * Load the main dictionary and the extension dictionaries
 */
private void loadMainDict() {
    // create the main dictionary instance
    _MainDict = new DictSegment((char) 0);
    // read the main dictionary file
    Path file = PathUtils.get(getDictRoot(), Dictionary.PATH_DIC_MAIN);
    loadDictFile(_MainDict, file, false, "Main Dict");
    // load the extension dictionaries
    this.loadExtDict();
    // load the remote custom dictionaries
    this.loadRemoteExtDict();
    // load the dictionary from MySQL
    this.loadMySQLExtDict();
}

private static Properties prop = new Properties();

static {
    try {
        Class.forName("com.mysql.jdbc.Driver");
    } catch (ClassNotFoundException e) {
        logger.error("error", e);
    }
}

/**
 * Load the hot-update dictionary from MySQL
 */
private void loadMySQLExtDict() {
    Connection conn = null;
    Statement stmt = null;
    ResultSet rs = null;
    try {
        // read the JDBC settings from the plugin's config directory
        Path file = PathUtils.get(getDictRoot(), "jdbc-reload.properties");
        prop.load(new FileInputStream(file.toFile()));

        logger.info("[==========]jdbc-reload.properties");
        for (Object key : prop.keySet()) {
            logger.info("[==========]" + key + "=" + prop.getProperty(String.valueOf(key)));
        }

        logger.info("[==========]query hot dict from mysql, " + prop.getProperty("jdbc.reload.sql") + "......");

        conn = DriverManager.getConnection(
                prop.getProperty("jdbc.url"),
                prop.getProperty("jdbc.user"),
                prop.getProperty("jdbc.password"));
        stmt = conn.createStatement();
        rs = stmt.executeQuery(prop.getProperty("jdbc.reload.sql"));

        while (rs.next()) {
            String theWord = rs.getString("word");
            logger.info("[==========]hot word from mysql: " + theWord);
            _MainDict.fillSegment(theWord.trim().toCharArray());
        }

        // pause until the next reload cycle
        Thread.sleep(Integer.valueOf(String.valueOf(prop.get("jdbc.reload.interval"))));
    } catch (Exception e) {
        logger.error("error", e);
    } finally {
        if (rs != null) {
            try {
                rs.close();
            } catch (SQLException e) {
                logger.error("error", e);
            }
        }
        if (stmt != null) {
            try {
                stmt.close();
            } catch (SQLException e) {
                logger.error("error", e);
            }
        }
        if (conn != null) {
            try {
                conn.close();
            } catch (SQLException e) {
                logger.error("error", e);
            }
        }
    }
}

These methods load a custom DB config file and query MySQL over JDBC. Note that I am using MySQL 8.0; the code registers the legacy driver class Class.forName("com.mysql.jdbc.Driver"), which still works with Connector/J 8.x but logs a deprecation warning (the new class name is com.mysql.cj.jdbc.Driver). Dictionary.java also needs the corresponding imports: java.sql.Connection, java.sql.DriverManager, java.sql.Statement, java.sql.ResultSet, java.sql.SQLException, java.io.FileInputStream and java.util.Properties.

The corresponding jdbc-reload.properties file:


jdbc.url=jdbc:mysql://localhost:3306/es_test?serverTimezone=GMT
jdbc.user=root
jdbc.password=root
# SQL for loading hot words
jdbc.reload.sql=select word from hot_words
# SQL for loading stopwords
jdbc.reload.stopword.sql=select stopword as word from hot_stopwords
# reload interval in milliseconds
jdbc.reload.interval=1000

Create the two corresponding tables in the database (a minimal DDL sketch follows below):

hot_words


hot_stopwords

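A minimal DDL sketch for the two tables, assuming nothing beyond what the reload SQL needs (the word and stopword column names come from jdbc.reload.sql and jdbc.reload.stopword.sql above; the id column, VARCHAR length and charset are assumptions):

CREATE TABLE hot_words (
    id   BIGINT AUTO_INCREMENT PRIMARY KEY,   -- surrogate key, assumed
    word VARCHAR(64) NOT NULL                 -- column read by jdbc.reload.sql
) DEFAULT CHARSET = utf8mb4;                  -- utf8mb4 so Chinese words are stored correctly

CREATE TABLE hot_stopwords (
    id       BIGINT AUTO_INCREMENT PRIMARY KEY,   -- surrogate key, assumed
    stopword VARCHAR(64) NOT NULL                 -- column read by jdbc.reload.stopword.sql
) DEFAULT CHARSET = utf8mb4;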

Dictionary#loadStopWordDict: custom loading of the stopword dictionary from MySQL

Dictionary class, around line 645: add this.loadMySQLStopwordDict(); to load stopwords from MySQL.


this.loadMySQLStopwordDict();
/**
 * Load stopwords from MySQL
 */
private void loadMySQLStopwordDict() {
    Connection conn = null;
    Statement stmt = null;
    ResultSet rs = null;
    try {
        Path file = PathUtils.get(getDictRoot(), "jdbc-reload.properties");
        prop.load(new FileInputStream(file.toFile()));

        logger.info("[==========]jdbc-reload.properties");
        for (Object key : prop.keySet()) {
            logger.info("[==========]" + key + "=" + prop.getProperty(String.valueOf(key)));
        }

        logger.info("[==========]query hot stopword dict from mysql, " + prop.getProperty("jdbc.reload.stopword.sql") + "......");

        conn = DriverManager.getConnection(
                prop.getProperty("jdbc.url"),
                prop.getProperty("jdbc.user"),
                prop.getProperty("jdbc.password"));
        stmt = conn.createStatement();
        rs = stmt.executeQuery(prop.getProperty("jdbc.reload.stopword.sql"));

        while (rs.next()) {
            String theWord = rs.getString("word");
            logger.info("[==========]hot stopword from mysql: " + theWord);
            _StopWords.fillSegment(theWord.trim().toCharArray());
        }

        // pause until the next reload cycle
        Thread.sleep(Integer.valueOf(String.valueOf(prop.get("jdbc.reload.interval"))));
    } catch (Exception e) {
        logger.error("error", e);
    } finally {
        if (rs != null) {
            try {
                rs.close();
            } catch (SQLException e) {
                logger.error("error", e);
            }
        }
        if (stmt != null) {
            try {
                stmt.close();
            } catch (SQLException e) {
                logger.error("error", e);
            }
        }
        if (conn != null) {
            try {
                conn.close();
            } catch (SQLException e) {
                logger.error("error", e);
            }
        }
    }
}
3. Build the plugin with mvn package


After a successful build, the plugin archive is at target\releases\elasticsearch-analysis-ik-5.6.0.zip.


4. Unpack the rebuilt IK archive

Empty out the existing IK plugin directory under elasticsearch.

Unzip the newly built archive into the IK plugin directory.


Copy the MySQL driver jar into the IK plugin directory; make sure the driver version matches your MySQL server.


5. Adjust the JDBC configuration (edit jdbc-reload.properties in the deployed plugin directory to point at your database)


6. Restart ES

Watch the logs: they will show the messages we print, including which configuration was loaded and which words and stopwords were pulled in.

PathUtils cannot be resolved in the IDE

For example, Dictionary uses the PathUtils class, but the project can never download its dependency.


Replace it with import org.elasticsearch.common.io.PathUtils;

Exception on startup

Problem: java.security.AccessControlException: access denied ("java.lang.RuntimePermission" "setContextClassLoader")

Solution

This exception is caused by Java security permissions.

Find the JDK that ES uses; here it is 1.8.0_301:

java version "1.8.0_301"
Java(TM) SE Runtime Environment (build 1.8.0_301-b09)
Java HotSpot(TM) 64-Bit Server VM (build 25.301-b09, mixed mode)

Go to the JDK installation directory, then into jre\lib\security.

On my machine that is E:\software\java\jdk\jdk1.8.0_301\jre\lib\security. Open java.policy, add permission java.security.AllPermission; as the last line inside the grant block, then restart ES.

7. Add words and stopwords in MySQL

First, analyze the test phrase before adding anything:
GET _analyze
{
  "text": "一人我饮酒醉",
  "analyzer": "ik_max_word"
}

Response:

{ "tokens": [ { "token": "一人", "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 0 }, { "token": "一", "start_offset": 0, "end_offset": 1, "type": "TYPE_CNUM", "position": 1 }, { "token": "人我", "start_offset": 1, "end_offset": 3, "type": "CN_WORD", "position": 2 }, { "token": "人", "start_offset": 1, "end_offset": 2, "type": "COUNT", "position": 3 }, { "token": "我", "start_offset": 2, "end_offset": 3, "type": "CN_CHAR", "position": 4 }, { "token": "饮酒", "start_offset": 3, "end_offset": 5, "type": "CN_WORD", "position": 5 }, { "token": "酒醉", "start_offset": 4, "end_offset": 6, "type": "CN_WORD", "position": 6 } ] } 

Add a new word to the hot_words table, for example with an INSERT like the one below:

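A hypothetical INSERT that seeds the phrase used in the next query; any word works, and the reload thread picks it up within jdbc.reload.interval:

-- add the test phrase as a hot word
INSERT INTO hot_words (word) VALUES ('一人我饮酒醉');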

The new word then shows up in the elasticsearch.bat console output, albeit as garbled Chinese characters.


8. Analysis test: verify that the hot update takes effect

Query again:

GET _analyze
{
  "text": "一人我饮酒醉",
  "analyzer": "ik_max_word"
}

Response:

{ "tokens": [ { "token": "一人我饮酒醉", "start_offset": 0, "end_offset": 6, "type": "CN_WORD", "position": 0 }, { "token": "一人", "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 1 }, { "token": "一", "start_offset": 0, "end_offset": 1, "type": "TYPE_CNUM", "position": 2 }, { "token": "人我", "start_offset": 1, "end_offset": 3, "type": "CN_WORD", "position": 3 }, { "token": "人", "start_offset": 1, "end_offset": 2, "type": "COUNT", "position": 4 }, { "token": "我", "start_offset": 2, "end_offset": 3, "type": "CN_CHAR", "position": 5 }, { "token": "饮酒", "start_offset": 3, "end_offset": 5, "type": "CN_WORD", "position": 6 }, { "token": "酒醉", "start_offset": 4, "end_offset": 6, "type": "CN_WORD", "position": 7 } ] } 
Stopwords
GET _analyze
{
  "text": "俺是个好人",
  "analyzer": "ik_max_word"
}

Response:

{ "tokens": [ { "token": "俺", "start_offset": 0, "end_offset": 1, "type": "CN_CHAR", "position": 0 }, { "token": "是", "start_offset": 1, "end_offset": 2, "type": "CN_CHAR", "position": 1 }, { "token": "个", "start_offset": 2, "end_offset": 3, "type": "CN_CHAR", "position": 2 }, { "token": "好人", "start_offset": 3, "end_offset": 5, "type": "CN_WORD", "position": 3 } ] } 

Add stopwords to the hot_stopwords table, for example:

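A hypothetical INSERT that seeds the stopwords which disappear from the result below ('是' and '个'):

-- add the two stopwords used in this test
INSERT INTO hot_stopwords (stopword) VALUES ('是'), ('个');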

The new stopwords likewise show up in the elasticsearch.bat console output.


Query again:

GET _analyze
{
  "text": "俺是个好人",
  "analyzer": "ik_max_word"
}

Response:

{ "tokens": [ { "token": "俺", "start_offset": 0, "end_offset": 1, "type": "CN_CHAR", "position": 0 }, { "token": "好人", "start_offset": 3, "end_offset": 5, "type": "CN_WORD", "position": 1 } ] } 