
32 Mastering IK Chinese Analysis: Modifying the IK Analyzer Source for MySQL-Based Hot Dictionary Updates

Published: 2021-11-23 22:07:38

Hot updates

So far, every new word has to be added to the ES extension dictionary by hand, which is painful: (1) after every change you have to restart ES for it to take effect, which is tedious; (2) ES is distributed and may run on hundreds of nodes, and you cannot log in to every node and edit the file one by one.

The goal: without stopping ES, add new words in some external store and have ES hot-load them immediately.

Common hot-update approaches

(1) Modify the IK analyzer source code so that it periodically and automatically reloads the dictionary from MySQL. (2) Use the hot-update mechanism natively supported by the IK analyzer: deploy a web server that exposes an HTTP endpoint and signals dictionary changes through the Last-Modified and ETag response headers.


This article uses the first approach; the IK project community on GitHub does not really recommend the second, considering it not very stable.

1. Download the source code

https://github.com/medcl/elasticsearch-analysis-ik/tree/v5.6.0

Pick the tag that matches your Elasticsearch version.


Download the zip or git clone the repository.


The IK analyzer is a standard Java Maven project; import it into your IDE and you can browse the source directly.

2. Modify the source code

For reference, the official remote hot-update implementation lives in the org.wltea.analyzer.dic.Monitor class; the changes below follow the same idea but pull words from MySQL instead.

Note: the version used here is 5.6.0, so the code may differ from other versions.

First, make sure the elasticsearch version in pom.xml matches the Elasticsearch version you are running.


The overall idea: start a background thread that periodically scans the tables defined in MySQL and loads their contents into the dictionaries.

Start the scanning thread in Dictionary#initial

Dictionary class, around line 160: in the initialization method of the Dictionary singleton, create our custom thread and start it. The HotDictReloadThread class is an endless loop that keeps calling Dictionary.getSingleton().reLoadMainDict() to reload the dictionaries.

new Thread(new HotDictReloadThread()).start(); 


Create the new HotDictReloadThread class:

package org.wltea.analyzer.dic;

import org.apache.logging.log4j.Logger;
import org.wltea.analyzer.help.ESPluginLoggerFactory;

public class HotDictReloadThread implements Runnable {

    private static final Logger logger = ESPluginLoggerFactory.getLogger(HotDictReloadThread.class.getName());

    @Override
    public void run() {
        // Keep reloading the dictionaries forever; the loop is throttled by the
        // Thread.sleep(jdbc.reload.interval) call inside loadMySQLExtDict/loadMySQLStopwordDict.
        while (true) {
            logger.info("=====reload hot dic from mysql======");
            Dictionary.getSingleton().reLoadMainDict();
        }
    }
}

reLoadMainDict() simply reloads the main dictionary and the stopword dictionary:

void reLoadMainDict() {
    logger.info("start to reload ik dict.");
    // Load into a fresh instance so reloading does not disturb the dictionary currently in use
    Dictionary tmpDict = new Dictionary(configuration);
    tmpDict.configuration = getSingleton().configuration;
    // load the main dictionary
    tmpDict.loadMainDict();
    // load the stopword dictionary
    tmpDict.loadStopWordDict();
    _MainDict = tmpDict._MainDict;
    _StopWords = tmpDict._StopWords;
    logger.info("reload ik dict finished.");
}
Dictionary#loadMainDict: custom loading of the main dictionary from MySQL

Dictionary class, around line 400: add this.loadMySQLExtDict(); to load the hot-update dictionary from MySQL.


/**
 * Load the main dictionary and the extension dictionaries
 */
private void loadMainDict() {
    // create the main dictionary instance
    _MainDict = new DictSegment((char) 0);
    // read the main dictionary file
    Path file = PathUtils.get(getDictRoot(), Dictionary.PATH_DIC_MAIN);
    loadDictFile(_MainDict, file, false, "Main Dict");
    // load the extension dictionaries
    this.loadExtDict();
    // load the remote custom dictionaries
    this.loadRemoteExtDict();
    // load the dictionary from MySQL
    this.loadMySQLExtDict();
}

private static Properties prop = new Properties();

static {
    try {
        Class.forName("com.mysql.jdbc.Driver");
    } catch (ClassNotFoundException e) {
        logger.error("error", e);
    }
}

/**
 * Load the hot-update dictionary from MySQL
 */
private void loadMySQLExtDict() {
    Connection conn = null;
    Statement stmt = null;
    ResultSet rs = null;
    try {
        // read the JDBC settings from the plugin's config directory
        Path file = PathUtils.get(getDictRoot(), "jdbc-reload.properties");
        prop.load(new FileInputStream(file.toFile()));

        logger.info("[==========]jdbc-reload.properties");
        for (Object key : prop.keySet()) {
            logger.info("[==========]" + key + "=" + prop.getProperty(String.valueOf(key)));
        }

        logger.info("[==========]query hot dict from mysql, " + prop.getProperty("jdbc.reload.sql") + "......");

        conn = DriverManager.getConnection(
                prop.getProperty("jdbc.url"),
                prop.getProperty("jdbc.user"),
                prop.getProperty("jdbc.password"));
        stmt = conn.createStatement();
        rs = stmt.executeQuery(prop.getProperty("jdbc.reload.sql"));

        while (rs.next()) {
            String theWord = rs.getString("word");
            logger.info("[==========]hot word from mysql: " + theWord);
            _MainDict.fillSegment(theWord.trim().toCharArray());
        }

        // pause until the next reload cycle
        Thread.sleep(Integer.valueOf(String.valueOf(prop.get("jdbc.reload.interval"))));
    } catch (Exception e) {
        logger.error("error", e);
    } finally {
        if (rs != null) {
            try {
                rs.close();
            } catch (SQLException e) {
                logger.error("error", e);
            }
        }
        if (stmt != null) {
            try {
                stmt.close();
            } catch (SQLException e) {
                logger.error("error", e);
            }
        }
        if (conn != null) {
            try {
                conn.close();
            } catch (SQLException e) {
                logger.error("error", e);
            }
        }
    }
}

These methods load a custom DB config file and query MySQL over JDBC. Note that I am using MySQL 8.0; the code registers the legacy driver class Class.forName("com.mysql.jdbc.Driver"), which still works with Connector/J 8.x but logs a deprecation warning (the new class name is com.mysql.cj.jdbc.Driver). Dictionary.java also needs the corresponding imports: java.sql.Connection, java.sql.DriverManager, java.sql.Statement, java.sql.ResultSet, java.sql.SQLException, java.io.FileInputStream and java.util.Properties.

The corresponding jdbc-reload.properties file:


jdbc.url=jdbc:mysql://localhost:3306/es_test?serverTimezone=GMT
jdbc.user=root
jdbc.password=root
# SQL for loading hot words
jdbc.reload.sql=select word from hot_words
# SQL for loading stopwords
jdbc.reload.stopword.sql=select stopword as word from hot_stopwords
# reload interval in milliseconds
jdbc.reload.interval=1000

Create the two corresponding tables in the database (a minimal DDL sketch follows below):

hot_words


hot_stopwords

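A minimal DDL sketch for the two tables, assuming nothing beyond what the reload SQL needs (the word and stopword column names come from jdbc.reload.sql and jdbc.reload.stopword.sql above; the id column, VARCHAR length and charset are assumptions):

CREATE TABLE hot_words (
    id   BIGINT AUTO_INCREMENT PRIMARY KEY,   -- surrogate key, assumed
    word VARCHAR(64) NOT NULL                 -- column read by jdbc.reload.sql
) DEFAULT CHARSET = utf8mb4;                  -- utf8mb4 so Chinese words are stored correctly

CREATE TABLE hot_stopwords (
    id       BIGINT AUTO_INCREMENT PRIMARY KEY,   -- surrogate key, assumed
    stopword VARCHAR(64) NOT NULL                 -- column read by jdbc.reload.stopword.sql
) DEFAULT CHARSET = utf8mb4;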

Dictionary#loadStopWordDict: custom loading of the stopword dictionary from MySQL

Dictionary class, around line 645: add this.loadMySQLStopwordDict(); to load stopwords from MySQL.


this.loadMySQLStopwordDict();
/**
 * Load stopwords from MySQL
 */
private void loadMySQLStopwordDict() {
    Connection conn = null;
    Statement stmt = null;
    ResultSet rs = null;
    try {
        Path file = PathUtils.get(getDictRoot(), "jdbc-reload.properties");
        prop.load(new FileInputStream(file.toFile()));

        logger.info("[==========]jdbc-reload.properties");
        for (Object key : prop.keySet()) {
            logger.info("[==========]" + key + "=" + prop.getProperty(String.valueOf(key)));
        }

        logger.info("[==========]query hot stopword dict from mysql, " + prop.getProperty("jdbc.reload.stopword.sql") + "......");

        conn = DriverManager.getConnection(
                prop.getProperty("jdbc.url"),
                prop.getProperty("jdbc.user"),
                prop.getProperty("jdbc.password"));
        stmt = conn.createStatement();
        rs = stmt.executeQuery(prop.getProperty("jdbc.reload.stopword.sql"));

        while (rs.next()) {
            String theWord = rs.getString("word");
            logger.info("[==========]hot stopword from mysql: " + theWord);
            _StopWords.fillSegment(theWord.trim().toCharArray());
        }

        // pause until the next reload cycle
        Thread.sleep(Integer.valueOf(String.valueOf(prop.get("jdbc.reload.interval"))));
    } catch (Exception e) {
        logger.error("error", e);
    } finally {
        if (rs != null) {
            try {
                rs.close();
            } catch (SQLException e) {
                logger.error("error", e);
            }
        }
        if (stmt != null) {
            try {
                stmt.close();
            } catch (SQLException e) {
                logger.error("error", e);
            }
        }
        if (conn != null) {
            try {
                conn.close();
            } catch (SQLException e) {
                logger.error("error", e);
            }
        }
    }
}
3. Build the plugin with mvn package


After a successful build, the plugin archive is at target\releases\elasticsearch-analysis-ik-5.6.0.zip.


4. Unpack the rebuilt IK archive

Empty out the existing IK plugin directory under elasticsearch.

Unzip the newly built archive into the IK plugin directory.


Copy the MySQL driver jar into the IK plugin directory; make sure the driver version matches your MySQL server.


5. Adjust the JDBC configuration (edit jdbc-reload.properties in the deployed plugin directory to point at your database)


6. Restart ES

Watch the logs: they will show the messages we print, including which configuration was loaded and which words and stopwords were pulled in.

PathUtils cannot be resolved in the IDE

For example, Dictionary uses the PathUtils class, but the project can never download its dependency.


Replace it with import org.elasticsearch.common.io.PathUtils;

Exception on startup

Problem: java.security.AccessControlException: access denied ("java.lang.RuntimePermission" "setContextClassLoader")

Solution

This exception is caused by Java security permissions.

Find the JDK that ES uses; here it is 1.8.0_301:

java version "1.8.0_301"
Java(TM) SE Runtime Environment (build 1.8.0_301-b09)
Java HotSpot(TM) 64-Bit Server VM (build 25.301-b09, mixed mode)

Go to the JDK installation directory, then into jre\lib\security.

On my machine that is E:\software\java\jdk\jdk1.8.0_301\jre\lib\security. Open java.policy, add permission java.security.AllPermission; as the last line inside the grant block, then restart ES.

7. Add words and stopwords in MySQL

First, analyze the test phrase before adding anything:
GET _analyze
{
  "text": "一人我饮酒醉",
  "analyzer": "ik_max_word"
}

Response:

{ "tokens": [ { "token": "一人", "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 0 }, { "token": "一", "start_offset": 0, "end_offset": 1, "type": "TYPE_CNUM", "position": 1 }, { "token": "人我", "start_offset": 1, "end_offset": 3, "type": "CN_WORD", "position": 2 }, { "token": "人", "start_offset": 1, "end_offset": 2, "type": "COUNT", "position": 3 }, { "token": "我", "start_offset": 2, "end_offset": 3, "type": "CN_CHAR", "position": 4 }, { "token": "饮酒", "start_offset": 3, "end_offset": 5, "type": "CN_WORD", "position": 5 }, { "token": "酒醉", "start_offset": 4, "end_offset": 6, "type": "CN_WORD", "position": 6 } ] } 

Add a new word to the hot_words table, for example with an INSERT like the one below:

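A hypothetical INSERT that seeds the phrase used in the next query; any word works, and the reload thread picks it up within jdbc.reload.interval:

-- add the test phrase as a hot word
INSERT INTO hot_words (word) VALUES ('一人我饮酒醉');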

The new word then shows up in the elasticsearch.bat console output, albeit as garbled Chinese characters.


8. Analysis test: verify that the hot update takes effect

Query again:

GET _analyze
{
  "text": "一人我饮酒醉",
  "analyzer": "ik_max_word"
}

Response:

{ "tokens": [ { "token": "一人我饮酒醉", "start_offset": 0, "end_offset": 6, "type": "CN_WORD", "position": 0 }, { "token": "一人", "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 1 }, { "token": "一", "start_offset": 0, "end_offset": 1, "type": "TYPE_CNUM", "position": 2 }, { "token": "人我", "start_offset": 1, "end_offset": 3, "type": "CN_WORD", "position": 3 }, { "token": "人", "start_offset": 1, "end_offset": 2, "type": "COUNT", "position": 4 }, { "token": "我", "start_offset": 2, "end_offset": 3, "type": "CN_CHAR", "position": 5 }, { "token": "饮酒", "start_offset": 3, "end_offset": 5, "type": "CN_WORD", "position": 6 }, { "token": "酒醉", "start_offset": 4, "end_offset": 6, "type": "CN_WORD", "position": 7 } ] } 
Stopwords
GET _analyze
{
  "text": "俺是个好人",
  "analyzer": "ik_max_word"
}

Response:

{ "tokens": [ { "token": "俺", "start_offset": 0, "end_offset": 1, "type": "CN_CHAR", "position": 0 }, { "token": "是", "start_offset": 1, "end_offset": 2, "type": "CN_CHAR", "position": 1 }, { "token": "个", "start_offset": 2, "end_offset": 3, "type": "CN_CHAR", "position": 2 }, { "token": "好人", "start_offset": 3, "end_offset": 5, "type": "CN_WORD", "position": 3 } ] } 

Add stopwords to the hot_stopwords table, for example:

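A hypothetical INSERT that seeds the stopwords which disappear from the result below ('是' and '个'):

-- add the two stopwords used in this test
INSERT INTO hot_stopwords (stopword) VALUES ('是'), ('个');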

The new stopwords likewise show up in the elasticsearch.bat console output.


Query again:

GET _analyze
{
  "text": "俺是个好人",
  "analyzer": "ik_max_word"
}

Response:

{ "tokens": [ { "token": "俺", "start_offset": 0, "end_offset": 1, "type": "CN_CHAR", "position": 0 }, { "token": "好人", "start_offset": 3, "end_offset": 5, "type": "CN_WORD", "position": 1 } ] } 