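Adding new words by hand to the ES extension dictionary every time is painful: (1) ES has to be restarted after each addition before the words take effect; (2) ES is distributed and may have hundreds of nodes, so editing the dictionary node by node is not practical.
What we want is to add new words somewhere outside ES, without stopping it, and have ES hot-load them right away.
Common hot-update schemes: (1) modify the IK analyzer source so that it reloads the word list from MySQL at a fixed interval; (2) use the hot-update mechanism that IK supports natively: deploy a web server exposing an HTTP endpoint, and signal dictionary changes through the Last-Modified and ETag HTTP response headers.
This post uses the first scheme. The second is reportedly not recommended by the IK community on GitHub, which considers it not very stable.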
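For comparison, the endpoint needed by the second scheme can be sketched with the JDK's built-in `com.sun.net.httpserver`. This is only an illustration: the class name `RemoteDictServer`, the path `/hot_words.txt`, and the version-counter ETag are assumptions, not something taken from IK.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical in-memory word-list server (names and path are assumptions).
class RemoteDictServer {
    private String words = "";        // one word per line, UTF-8
    private long version = 0;         // bumped on every update, reused as the ETag
    private String lastModified = ""; // HTTP date of the latest update

    // Replace the word list and change both validators so pollers notice.
    synchronized void update(String newWords) {
        words = newWords;
        version++;
        lastModified = DateTimeFormatter.RFC_1123_DATE_TIME
                .format(ZonedDateTime.now(ZoneOffset.UTC));
    }

    // Bind to localhost; port 0 lets the OS pick a free port.
    HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", port), 0);
        server.createContext("/hot_words.txt", this::handle);
        server.start();
        return server;
    }

    private synchronized void handle(HttpExchange ex) throws IOException {
        byte[] body = words.getBytes(StandardCharsets.UTF_8);
        ex.getResponseHeaders().set("Last-Modified", lastModified);
        ex.getResponseHeaders().set("ETag", "\"" + version + "\"");
        ex.getResponseHeaders().set("Content-Type", "text/plain; charset=utf-8");
        if ("HEAD".equals(ex.getRequestMethod())) {
            ex.sendResponseHeaders(200, -1); // headers only, no body
        } else {
            ex.sendResponseHeaders(200, body.length);
            try (OutputStream out = ex.getResponseBody()) {
                out.write(body);
            }
        }
        ex.close();
    }
}
```

A client that polls this endpoint only has to compare the two headers with the values it saw last time and re-download the body when either one changes.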
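1. Download the source: https://github.com/medcl/elasticsearch-analysis-ik/tree/v5.6.0
Pick the version that matches your ES.
Either download the archive or git clone.
The IK analyzer is a standard Java Maven project; import it into an IDE and you can browse the source.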
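2. Modify the source
The official hot-update implementation lives in the class org.wltea.analyzer.dic.Monitor; the code in this step shows the changes made for the MySQL approach.
Note: the version used here is 5.6.0, so details may differ in other versions.
First, the elasticsearch version in the pom must match the ES version you actually run.
Overall, the task is to start a background thread that scans the tables defined in MySQL and loads their data.
Start the scanning thread in Dictionary#initial (Dictionary class, around line 160). This is the initialization method of the Dictionary singleton; create our custom thread here and start it. The HotDictReloadThread class is simply an infinite loop that keeps calling Dictionary.getSingleton().reLoadMainDict() to reload the dictionaries.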
new Thread(new HotDictReloadThread()).start();
Create the HotDictReloadThread class:
package org.wltea.analyzer.dic;

import org.apache.logging.log4j.Logger;
import org.wltea.analyzer.help.ESPluginLoggerFactory;

public class HotDictReloadThread implements Runnable {

    private static final Logger logger = ESPluginLoggerFactory.getLogger(HotDictReloadThread.class.getName());

    @Override
    public void run() {
        // Loop forever. The pause between rounds comes from the Thread.sleep
        // calls inside the MySQL loaders in Dictionary, not from this loop.
        while (true) {
            logger.info("=====reload hot dic from mysql======");
            Dictionary.getSingleton().reLoadMainDict();
        }
    }
}
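Because the loop above relies on the loaders to sleep, a round that fails before reaching `Thread.sleep` (for example when MySQL is unreachable) is retried immediately. One possible variant, not what this post uses, is to let a scheduler own the interval. The sketch below assumes a hypothetical helper class `ScheduledReloader`.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: runs a reload task at a fixed delay on one daemon thread.
class ScheduledReloader {

    static ScheduledExecutorService start(Runnable reload, long intervalMillis) {
        // Daemon thread so the reloader never blocks JVM shutdown.
        ThreadFactory daemon = r -> {
            Thread t = new Thread(r, "hot-dict-reload");
            t.setDaemon(true);
            return t;
        };
        ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor(daemon);

        // An uncaught exception would cancel all later runs, so catch and report it.
        Runnable guarded = () -> {
            try {
                reload.run();
            } catch (RuntimeException e) {
                System.err.println("reload failed: " + e);
            }
        };

        // The delay is measured from the end of one run to the start of the next.
        pool.scheduleWithFixedDelay(guarded, 0, intervalMillis, TimeUnit.MILLISECONDS);
        return pool;
    }
}
```

Inside the `org.wltea.analyzer.dic` package this could be called as `ScheduledReloader.start(() -> Dictionary.getSingleton().reLoadMainDict(), 1000)`; the `Thread.sleep` calls in the loaders would then no longer be needed.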
reLoadMainDict() simply loads the main dictionary and the stop-word dictionary:
void reLoadMainDict() {
    logger.info("start to reload ik dict.");
    // Load into a fresh instance to limit the impact on the dictionary currently in use
    Dictionary tmpDict = new Dictionary(configuration);
    tmpDict.configuration = getSingleton().configuration;
    // Load the main dictionary
    tmpDict.loadMainDict();
    // Load the stop-word dictionary
    tmpDict.loadStopWordDict();
    _MainDict = tmpDict._MainDict;
    _StopWords = tmpDict._StopWords;
    logger.info("reload ik dict finished.");
}
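The idea in reLoadMainDict() is "build off to the side, then swap the reference", so readers never see a half-filled dictionary. A minimal stand-alone sketch of that pattern follows; `SwappableDict` and its `HashSet` storage are assumptions for illustration, since IK itself stores words in `DictSegment` trees.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical holder showing the build-then-swap pattern.
class SwappableDict {
    // volatile: a reader thread sees either the old set or the new one, never a mix.
    private volatile Set<String> words = Collections.emptySet();

    boolean contains(String word) {
        return words.contains(word);
    }

    void reload(List<String> freshWords) {
        // Step 1: fill a private set that no reader can see yet.
        Set<String> tmp = new HashSet<>(freshWords);
        // Step 2: publish it with a single reference assignment.
        words = Collections.unmodifiableSet(tmp);
    }
}
```

Note that reLoadMainDict() swaps `_MainDict` and `_StopWords` in two separate assignments, so for a brief moment a reader may combine the new main dictionary with the old stop words.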
Dictionary#loadMainDict: also load the main dictionary from MySQL
Dictionary class, around line 400: add this.loadMySQLExtDict(); to load the hot-update words from MySQL.
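/**
 * Load the main dictionary and the extension dictionaries
 */
private void loadMainDict() {
    // Create the main dictionary instance
    _MainDict = new DictSegment((char) 0);
    // Read the main dictionary file
    Path file = PathUtils.get(getDictRoot(), Dictionary.PATH_DIC_MAIN);
    loadDictFile(_MainDict, file, false, "Main Dict");
    // Load the extension dictionaries
    this.loadExtDict();
    // Load the remote custom dictionaries
    this.loadRemoteExtDict();
    // Load the hot-update dictionary from MySQL (the added line)
    this.loadMySQLExtDict();
}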
private static Properties prop = new Properties();

static {
    try {
        // Register the MySQL JDBC driver once, when the class is loaded
        Class.forName("com.mysql.jdbc.Driver");
    } catch (ClassNotFoundException e) {
        logger.error("error", e);
    }
}

/**
 * Load the hot-update dictionary from MySQL
 */
private void loadMySQLExtDict() {
    try {
        Path file = PathUtils.get(getDictRoot(), "jdbc-reload.properties");
        // Close the stream after loading so repeated reloads do not leak file handles
        try (FileInputStream in = new FileInputStream(file.toFile())) {
            prop.load(in);
        }

        logger.info("[==========]jdbc-reload.properties");
        for (Object key : prop.keySet()) {
            logger.info("[==========]" + key + "=" + prop.getProperty(String.valueOf(key)));
        }

        logger.info("[==========]query hot dict from mysql, " + prop.getProperty("jdbc.reload.sql") + "......");

        // try-with-resources closes rs, stmt and conn in reverse order, even on failure
        try (Connection conn = DriverManager.getConnection(
                     prop.getProperty("jdbc.url"),
                     prop.getProperty("jdbc.user"),
                     prop.getProperty("jdbc.password"));
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(prop.getProperty("jdbc.reload.sql"))) {
            while (rs.next()) {
                String theWord = rs.getString("word");
                logger.info("[==========]hot word from mysql: " + theWord);
                _MainDict.fillSegment(theWord.trim().toCharArray());
            }
        }

        // Wait for the configured interval (milliseconds) before the next round
        Thread.sleep(Integer.valueOf(String.valueOf(prop.get("jdbc.reload.interval"))));
    } catch (Exception e) {
        logger.error("error", e);
    }
}
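This method loads the custom DB config file and queries MySQL over JDBC. Note that I am using MySQL 8.0 here. With Connector/J 8.x the driver class is com.mysql.cj.jdbc.Driver; the legacy name used in Class.forName("com.mysql.jdbc.Driver") still loads but is deprecated.
The corresponding jdbc-reload.properties file: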
jdbc.url=jdbc:mysql://localhost:3306/es_test?serverTimezone=GMT
jdbc.user=root
jdbc.password=root
# SQL that loads the hot words
jdbc.reload.sql=select word from hot_words
# SQL that loads the stop words
jdbc.reload.stopword.sql=select stopword as word from hot_stopwords
# Hot-reload interval, in milliseconds
jdbc.reload.interval=1000
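The loaders read these keys on every round, so a missing key or a bad interval only shows up as a logged exception at reload time. The sketch below parses the file once and fails fast instead. The class `ReloadConfig` is hypothetical; only the property keys come from the file above.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.Properties;

// Hypothetical typed view of jdbc-reload.properties.
class ReloadConfig {
    final String url;
    final String user;
    final String password;
    final String wordSql;
    final String stopwordSql;
    final long intervalMillis;

    ReloadConfig(Properties p) {
        url = require(p, "jdbc.url");
        user = require(p, "jdbc.user");
        password = require(p, "jdbc.password");
        wordSql = require(p, "jdbc.reload.sql");
        stopwordSql = require(p, "jdbc.reload.stopword.sql");
        // Thread.sleep takes milliseconds, so the interval is kept in that unit.
        intervalMillis = Long.parseLong(require(p, "jdbc.reload.interval"));
        if (intervalMillis <= 0) {
            throw new IllegalArgumentException("jdbc.reload.interval must be positive");
        }
    }

    // Returns the trimmed value or throws if the key is absent or blank.
    static String require(Properties p, String key) {
        String value = p.getProperty(key);
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalArgumentException("missing property: " + key);
        }
        return value.trim();
    }

    static ReloadConfig parse(Reader in) throws IOException {
        Properties p = new Properties();
        p.load(in); // lines starting with '#' are treated as comments
        return new ReloadConfig(p);
    }
}
```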
Create the two corresponding tables in the database:
hot_words
hot_stopwords
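The post does not show the table definitions. Judging only from the two SQL statements in jdbc-reload.properties, `hot_words` needs a `word` column and `hot_stopwords` needs a `stopword` column. The sketch below is one guess at a schema; the `id` column, the `VARCHAR(255)` type, and the class `HotDictSchema` are all assumptions.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical schema helper; only the table and column names used by the
// reload SQL come from the post, everything else is an assumption.
class HotDictSchema {
    static final String CREATE_HOT_WORDS =
            "CREATE TABLE IF NOT EXISTS hot_words ("
            + "id BIGINT AUTO_INCREMENT PRIMARY KEY, "
            + "word VARCHAR(255) NOT NULL)";

    static final String CREATE_HOT_STOPWORDS =
            "CREATE TABLE IF NOT EXISTS hot_stopwords ("
            + "id BIGINT AUTO_INCREMENT PRIMARY KEY, "
            + "stopword VARCHAR(255) NOT NULL)";

    // Runs both statements on an already opened connection.
    static void createTables(Connection conn) throws SQLException {
        try (Statement stmt = conn.createStatement()) {
            stmt.executeUpdate(CREATE_HOT_WORDS);
            stmt.executeUpdate(CREATE_HOT_STOPWORDS);
        }
    }
}
```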
Dictionary class, around line 645 (in the stop-word loading method): add this.loadMySQLStopwordDict(); to load stop words from MySQL
this.loadMySQLStopwordDict();
/**
 * Load stop words from MySQL
 */
private void loadMySQLStopwordDict() {
    try {
        Path file = PathUtils.get(getDictRoot(), "jdbc-reload.properties");
        // Close the stream after loading so repeated reloads do not leak file handles
        try (FileInputStream in = new FileInputStream(file.toFile())) {
            prop.load(in);
        }

        logger.info("[==========]jdbc-reload.properties");
        for (Object key : prop.keySet()) {
            logger.info("[==========]" + key + "=" + prop.getProperty(String.valueOf(key)));
        }

        logger.info("[==========]query hot stopword dict from mysql, " + prop.getProperty("jdbc.reload.stopword.sql") + "......");

        // try-with-resources closes rs, stmt and conn in reverse order, even on failure
        try (Connection conn = DriverManager.getConnection(
                     prop.getProperty("jdbc.url"),
                     prop.getProperty("jdbc.user"),
                     prop.getProperty("jdbc.password"));
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(prop.getProperty("jdbc.reload.stopword.sql"))) {
            while (rs.next()) {
                String theWord = rs.getString("word");
                logger.info("[==========]hot stopword from mysql: " + theWord);
                _StopWords.fillSegment(theWord.trim().toCharArray());
            }
        }

        // Wait for the configured interval (milliseconds) before the next round
        Thread.sleep(Integer.valueOf(String.valueOf(prop.get("jdbc.reload.interval"))));
    } catch (Exception e) {
        logger.error("error", e);
    }
}
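Both loaders hand each row to `fillSegment(char[])`. In IK, `DictSegment` is a node of a dictionary tree (a trie), so adding a word means walking or creating one child node per character and marking the last node as a word end. The sketch below shows that idea in isolation; `TrieSketch` is a simplified stand-in, not IK's actual `DictSegment` code.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for a dictionary-tree node (not IK's real implementation).
class TrieSketch {
    private final Map<Character, TrieSketch> children = new HashMap<>();
    private boolean wordEnd; // true if the path from the root to this node spells a word

    // Insert a word: descend one node per character, creating nodes on demand.
    void fill(char[] chars) {
        TrieSketch node = this;
        for (char c : chars) {
            node = node.children.computeIfAbsent(c, k -> new TrieSketch());
        }
        node.wordEnd = true;
    }

    // Exact-match lookup: the whole string must end on a node marked as a word.
    boolean contains(String s) {
        TrieSketch node = this;
        for (char c : s.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return false;
            }
        }
        return node.wordEnd;
    }
}
```

Because words are only ever added to the tree, deleting a row from the MySQL table takes effect only when the next reload rebuilds the dictionary from scratch, which is what reLoadMainDict() does through its temporary instance.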
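3. Build the package with mvn package
After a successful build, the archive is at target\releases\elasticsearch-analysis-ik-5.6.0.zip
Empty the IK plugin directory under elasticsearch.
Unzip the generated archive into that directory.
Put the MySQL driver jar into the IK directory as well; make sure it matches your MySQL version.
Watch the logs: they show everything we print, such as which configuration was loaded and which words and stop words were loaded.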
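Problem: PathUtils is flagged as an error in the IDE. Dictionary, for example, uses the PathUtils class, but its dependency never resolves in the project.
Fix: switch the import to import org.elasticsearch.common.io.PathUtils;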
Problem: an exception at startup: java.security.AccessControlException: access denied ("java.lang.RuntimePermission" "setContextClassLoader")
Fix: this exception is caused by Java security permissions.
Find the JDK that ES runs on; here I am using 1.8.0_301:
java version "1.8.0_301"
Java(TM) SE Runtime Environment (build 1.8.0_301-b09)
Java HotSpot(TM) 64-Bit Server VM (build 25.301-b09, mixed mode)
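Go to the JDK install directory, then into jre\lib\security.
On my machine that is E:\software\java\jdk\jdk1.8.0_301\jre\lib\security. Open java.policy, add permission java.security.AllPermission; as the last line inside the grant block, then restart ES. That resolves the error.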
GET _analyze
{
"text": "一人我饮酒醉",
"analyzer": "ik_max_word"
}
Response:
{
"tokens": [
{
"token": "一人",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 0
},
{
"token": "一",
"start_offset": 0,
"end_offset": 1,
"type": "TYPE_CNUM",
"position": 1
},
{
"token": "人我",
"start_offset": 1,
"end_offset": 3,
"type": "CN_WORD",
"position": 2
},
{
"token": "人",
"start_offset": 1,
"end_offset": 2,
"type": "COUNT",
"position": 3
},
{
"token": "我",
"start_offset": 2,
"end_offset": 3,
"type": "CN_CHAR",
"position": 4
},
{
"token": "饮酒",
"start_offset": 3,
"end_offset": 5,
"type": "CN_WORD",
"position": 5
},
{
"token": "酒醉",
"start_offset": 4,
"end_offset": 6,
"type": "CN_WORD",
"position": 6
}
]
}
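The same check can be scripted, which is handy for re-running it after each dictionary change. The sketch below assumes ES is reachable at a base URL such as `http://localhost:9200`; the class `AnalyzeClient` is hypothetical and uses only the JDK's `HttpURLConnection`.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper that posts a request to the _analyze endpoint.
class AnalyzeClient {

    // Minimal JSON string quoting: escapes only backslash and double quote,
    // which is enough for plain words but not for control characters.
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Builds the same body as the console request above.
    static String body(String text, String analyzer) {
        return "{\"text\":" + quote(text) + ",\"analyzer\":" + quote(analyzer) + "}";
    }

    // Sends the request and returns the raw JSON response as a string.
    static String analyze(String baseUrl, String text, String analyzer) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(baseUrl + "/_analyze").openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body(text, analyzer).getBytes(StandardCharsets.UTF_8));
        }
        try (InputStream in = conn.getInputStream()) {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            for (int n; (n = in.read(chunk)) != -1; ) {
                buf.write(chunk, 0, n);
            }
            return new String(buf.toByteArray(), StandardCharsets.UTF_8);
        }
    }
}
```

For example, `AnalyzeClient.analyze("http://localhost:9200", "一人我饮酒醉", "ik_max_word")` would return the token list as a JSON string, assuming a node is listening on that address.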
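Add a word to the dictionary
The loaded word shows up in the elasticsearch.bat console, although the Chinese characters are garbled.
Query again: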
GET _analyze
{
"text": "一人我饮酒醉",
"analyzer": "ik_max_word"
}
Response:
{
"tokens": [
{
"token": "一人我饮酒醉",
"start_offset": 0,
"end_offset": 6,
"type": "CN_WORD",
"position": 0
},
{
"token": "一人",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 1
},
{
"token": "一",
"start_offset": 0,
"end_offset": 1,
"type": "TYPE_CNUM",
"position": 2
},
{
"token": "人我",
"start_offset": 1,
"end_offset": 3,
"type": "CN_WORD",
"position": 3
},
{
"token": "人",
"start_offset": 1,
"end_offset": 2,
"type": "COUNT",
"position": 4
},
{
"token": "我",
"start_offset": 2,
"end_offset": 3,
"type": "CN_CHAR",
"position": 5
},
{
"token": "饮酒",
"start_offset": 3,
"end_offset": 5,
"type": "CN_WORD",
"position": 6
},
{
"token": "酒醉",
"start_offset": 4,
"end_offset": 6,
"type": "CN_WORD",
"position": 7
}
]
}
Stop words
GET _analyze
{
"text": "俺是个好人",
"analyzer": "ik_max_word"
}
Response:
{
"tokens": [
{
"token": "俺",
"start_offset": 0,
"end_offset": 1,
"type": "CN_CHAR",
"position": 0
},
{
"token": "是",
"start_offset": 1,
"end_offset": 2,
"type": "CN_CHAR",
"position": 1
},
{
"token": "个",
"start_offset": 2,
"end_offset": 3,
"type": "CN_CHAR",
"position": 2
},
{
"token": "好人",
"start_offset": 3,
"end_offset": 5,
"type": "CN_WORD",
"position": 3
}
]
}
Add stop words
elasticsearch.bat
Query again:
GET _analyze
{
"text": "俺是个好人",
"analyzer": "ik_max_word"
}
Response:
{
"tokens": [
{
"token": "俺",
"start_offset": 0,
"end_offset": 1,
"type": "CN_CHAR",
"position": 0
},
{
"token": "好人",
"start_offset": 3,
"end_offset": 5,
"type": "CN_WORD",
"position": 1
}
]
}