自然语言处理(NLP)之用深度学习实现命名实体识别(NER)

几乎所有的NLP都依赖一个强大的语料库，本项目实现NER的语料库如下(文件名为train.txt，一共42000行，这里只展示前15行，可以在文章最后的Github地址下载该语料库)：

played on Monday ( home team in CAPS ) : VBD IN NNP ( NN NN IN NNP ) : O O O O O O O O O O American League NNP NNP B-MISC I-MISC Cleveland 2 DETROIT 1 NNP CD NNP CD B-ORG O B-ORG O BALTIMORE 12 Oakland 11 ( 10 innings ) VB CD NNP CD ( CD NN ) B-ORG O B-ORG O O O O O TORONTO 5 Minnesota 3 TO CD NNP CD B-ORG O B-ORG O …

简单介绍下该语料库的结构：该语料库一共42000行，每三行为一组，其中，第一行为英语句子，第二行为每个句子的词性（关于英语单词的词性，可参考文章：NLP词形还（自然语言处理(NLP)之英文单词词性还原_IT之一小佬的博客-CSDN博客），第三行为NER系统的标注，具体的含义会在之后介绍。

我们的NER项目的名称为NLP_NER，结构如下：

项目中每个文件的功能如下：

utils.py: 项目配置及数据导入
data_processing.py: 数据探索
Bi_LSTM_Model_training.py: 模型创建及训练
Bi_LSTM_Model_predict.py: 对新句子进行NER预测

项目配置

第一步，是项目的配置及数据导入，在utils.py文件中实现，完整的代码如下：

import pandas as pd
import numpy as np

CORPUS_PATH = './data/train.txt'

KEYS_MODEL_SAVE_PATH = './data/bi_lstm_ner.h5'
WORD_DICTIONARY_PATH = './data/word_dictionary.pk'
INVERSE_WORD_DICTIONARY_PATH = './data/inverse_word_dictionary.pk'
LABEL_DICTIONARY_PATH = './data/label_dictionary.pk'
OUTPUT_DICTIONARY_PATH = './data/output_dictionary.pk'

CONSTANTS = [
    KEYS_MODEL_SAVE_PATH,
    WORD_DICTIONARY_PATH,
    INVERSE_WORD_DICTIONARY_PATH,
    LABEL_DICTIONARY_PATH,
    OUTPUT_DICTIONARY_PATH
]


# load data from corpus to from pandas DataFrame
def load_data():
    with open(CORPUS_PATH, 'r') as f:
        text_data = [text.strip() for text in f.readlines()]
    text_data = [text_data[k].split('\t') for k in range(0, len(text_data))]
    index = range(0, len(text_data), 3)

    # Transforming data to matrix format for neural network
    input_data = list()
    for i in range(1, len(index) - 1):
        rows = text_data[index[i - 1]: index[i]]
        sentence_no = np.array([i] * len(rows[0]), dtype=str)
        rows.append(sentence_no)
        rows = np.array(rows).T
        input_data.append(rows)

    input_data = pd.DataFrame(np.concatenate([item for item in input_data]), columns=['word', 'pos', 'tag', 'sent_no'])

    return input_data


if __name__ == '__main__':
    data = load_data()
    print(data)

在该代码中，先是设置了语料库文件的路径CORPUS_PATH，KERAS模型保存路径KERAS_MODEL_SAVE_PATH，以及在项目过程中会用到的三个字典的保存路径（以pickle文件形式保存）WORD_DICTIONARY_PATH，LABEL_DICTIONARY_PATH， OUTPUT_DICTIONARY_PATH。然后是load_data()函数，它将语料库中的文本以Pandas中的DataFrame结构展示出来，运行结果如下：

            word  pos     tag sent_no
0         played  VBD       O       1
1             on   IN       O       1
2         Monday  NNP       O       1
3              (    (       O       1
4           home   NN       O       1
...          ...  ...     ...     ...
201110        75   CD       O   13997
201111      .409   CD       O   13997
201112        28   CD       O   13997
201113   CENTRAL  NNP  B-MISC   13998
201114  DIVISION  NNP  I-MISC   13998

[201115 rows x 4 columns]

在该数据框中，word这一列表示文本语料库中的单词，pos这一列表示该单词的词性，tag这一列表示NER的标注，sent_no这一列表示该单词在第几个句子中。

数据探索

接着，第二步是数据探索，即对输入的数据（input_data）进行一些数据review，完整的代码（data_processing.py）如下:

import pickle
import numpy as np
from collections import Counter
from itertools import accumulate
from operator import itemgetter
import matplotlib.pyplot as plt
import matplotlib as mpl
from utils import CONSTANTS, load_data

#  设置matplotlib绘图时的字体
mpl.rcParams['font.sans-serif'] = ['SimHei']


#  数据查看
def data_review():
    #  导入数据
    input_data = load_data()

    #  基本的数据review
    sent_num = input_data['sent_no'].astype(np.int).max()
    print('一共有%s个句子。' % sent_num)

    vocabulary = input_data['word'].unique()
    print('一个有%d个单词。' % len(vocabulary))
    print('前10个单词为：%s' % vocabulary[:11])

    pos_arr = input_data['tag'].unique()
    print('单词的词性列表：%s.' % pos_arr)

    df = input_data[['word', 'sent_no']].groupby('sent_no').count()
    sent_len_list = df['word'].tolist()
    print('句子长度及出现的频数字典：\n%s.' % dict(Counter(sent_len_list)))

    #  绘制句子长度及出现频数统计图
    sort_sent_len_dict = sorted(dict(Counter(sent_len_list)).items(), key=itemgetter(0))
    sent_no_data = [item[0] for item in sort_sent_len_dict]
    sent_count_data = [item[1] for item in sort_sent_len_dict]
    plt.bar(sent_no_data, sent_count_data)
    plt.title('句子长度及出现频数统计图')
    plt.xlabel('句子长度')
    plt.ylabel('句子长度出现的频数')
    plt.savefig('./data/句子长度及出现频数统计图.png')
    plt.close()

    #  绘制句子长度累计分布函数（CDF）
    sent_pentage_list = [(count / sent_num) for count in accumulate(sent_count_data)]

    #  寻找分位点为quantile的句子长度
    quantile = 0.9992

    #  print(list(sent_pentage_list))
    for length, per in zip(sent_no_data, sent_pentage_list):
        if round(per, 4) == quantile:
            index = length
            break
    print('分位点为%s的句子长度为：%d' % (quantile, index))

    #  绘制CDF
    plt.plot(sent_no_data, sent_pentage_list)
    plt.hlines(quantile, 0, index, colors="c", linestyles="dashed")
    plt.vlines(index, 0, quantile, colors="c", linestyles="dashed")
    plt.text(0, quantile, str(quantile))
    plt.text(index, 0, str(index))
    plt.title("句子长度累积分布函数图")
    plt.xlabel("句子长度")
    plt.ylabel("句子长度累积频率")
    plt.savefig("./data/句子长度累积分布函数图.png")
    plt.close()


#  数据处理
def data_processing():
    #  数据导入
    input_data = load_data()

    #  标签及词汇表
    labels, vocabulary = list(input_data['tag'].unique()), list(input_data['word'].unique())

    #  字典列表
    word_dictionary = {word: i + 1 for i, word in enumerate(vocabulary)}
    inverse_word_vocabulary = {i + 1: word for i, word in enumerate(vocabulary)}
    label_dictionary = {laber: i + 1 for i, laber in enumerate(labels)}
    output_dictionary = {i + 1: labels for i, labels in enumerate(labels)}

    dict_list = [word_dictionary, inverse_word_vocabulary, label_dictionary, output_dictionary]

    #  保存为pickle形式
    for dict_item, path in zip(dict_list, CONSTANTS[1:]):
        with open(path, 'wb') as f:
            pickle.dump(dict_item, f)


if __name__ == '__main__':
    data_review()

调用data_review()函数，输出的结果如下：

一共有13998个句子。
一个有24339个单词。
前10个单词为：['played' 'on' 'Monday' '(' 'home' 'team' 'in' 'CAPS' ')' ':' 'American']
单词的词性列表：['O' 'B-MISC' 'I-MISC' 'B-ORG' 'I-ORG' 'B-PER' 'B-LOC' 'I-PER' 'I-LOC'
 'sO'].
句子长度及出现的频数字典：
{10: 501, 5: 769, 9: 841, 6: 639, 4: 794, 37: 105, 21: 228, 40: 78, 23: 230, 38: 112, 25: 207, 18: 212, 19: 197, 8: 977, 2: 1141, 41: 74, 20: 221, 11: 395, 7: 999, 30: 183, 34: 141, 16: 225, 13: 339, 15: 275, 3: 620, 29: 214, 22: 221, 14: 291, 31: 202, 26: 224, 33: 167, 24: 210, 27: 188, 42: 63, 39: 98, 17: 229, 1: 177, 35: 130, 36: 119, 12: 316, 32: 167, 48: 19, 51: 8, 28: 199, 46: 19, 52: 9, 47: 22, 44: 42, 43: 51, 113: 1, 49: 15, 45: 39, 50: 16, 58: 2, 69: 1, 59: 2, 53: 5, 66: 1, 71: 1, 72: 1, 54: 4, 55: 9, 57: 2, 62: 2, 67: 1, 124: 1, 80: 1, 56: 2, 60: 3, 78: 1}.
分位点为0.9992的句子长度为：60

在该语料库中，一共有13998个句子，比预期的42000/3=14000个句子少两个。一个有24339个单词，单词量还是蛮大的，当然，这里对单词没有做任何处理，直接保留了语料库中的形式（后期可以继续优化）。我们需要注意的是，NER的标注列表为[‘O’ ,‘B-MISC’, ‘I-MISC’, ‘B-ORG’ ,‘I-ORG’, ‘B-PER’ ,‘B-LOC’ ,‘I-PER’, ‘I-LOC’,‘sO’]，因此，本项目的NER一共分为四类：PER（人名），LOC（位置），ORG（组织）以及MISC，其中B表示开始，I表示中间，O表示单字词，不计入NER，sO表示特殊单字词。

接下来，让我们考虑下句子的长度，这对后面的建模时填充的句子长度有有参考作用。句子长度及出现频数的统计图如下：

可以看到，句子长度基本在60以下，当然，这也可以在输出的句子长度及出现频数字典中看到。那么，我们是否可以选在一个标准作为后面模型的句子填充的长度呢？答案是，利用出现频数的累计分布函数的分位点，在这里，我们选择分位点为0.9992,对应的句子长度为60，如下图：

接着是数据处理函数data_processing()，它的功能主要是实现单词、标签字典，并保存为pickle文件形式，便于后续直接调用。

建模

在第三步中，我们建立Bi-LSTM模型来训练训练，完整的Python代码（Bi_LSTM_Model_training.py）如下：

import pickle
import numpy as np
import pandas as pd
from utils import CONSTANTS, load_data
from data_processing import data_processing
from keras.utils import np_utils, plot_model
from keras.models import Sequential
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Bidirectional, LSTM, Dense, Embedding, TimeDistributed


#  模型输入数据
def input_data_for_model(input_shape):
    #  数据导入
    input_data = load_data()

    #  数据处理
    data_processing()

    #  导入字典
    with open(CONSTANTS[1], 'rb') as f:
        word_dictionary = pickle.load(f)
    with open(CONSTANTS[2], 'rb') as f:
        inverse_word_dictionary = pickle.load(f)
    with open(CONSTANTS[3], 'rb') as f:
        label_dictionary = pickle.load(f)
    with open(CONSTANTS[4], 'rb') as f:
        output_dictionary = pickle.load(f)
    vocab_size = len(word_dictionary.keys())
    label_size = len(label_dictionary.keys())

    #  处理输入数据

    aggregate_function = lambda input: [(word, pos, label) for word, pos, label in
                                        zip(input['word'].values.tolist(), input['pos'].values.tolist(),
                                            input['tag'].values.tolist())]
    grouped_input_data = input_data.groupby('sent_no').apply(aggregate_function)
    sentences = [sentence for sentence in grouped_input_data]

    x = [[word_dictionary[word[0]] for word in sent] for sent in sentences]
    x = pad_sequences(maxlen=input_shape, sequences=x, padding='post', value=0)
    y = [[label_dictionary[word[2]] for word in sent] for sent in sentences]
    y = pad_sequences(maxlen=input_shape, sequences=y, padding='post', value=0)
    y = [np_utils.to_categorical(label, num_classes=label_size + 1) for label in y]

    return x, y, output_dictionary, vocab_size, label_size, inverse_word_dictionary


#  定义深度学习模型：Bi-LSTM
def creat_bi_lstm(vocab_size, label_size, input_shape, output_dim, n_unite, out_act, activation):
    model = Sequential()
    model.add(Embedding(input_dim=vocab_size + 1, output_dim=output_dim, input_length=input_shape, mask_zero=True))
    model.add(Bidirectional(LSTM(units=n_unite, activation=activation, return_sequences=True)))
    model.add(TimeDistributed(Dense(label_size + 1, activation=out_act)))
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

    return model


#  模型训练
def model_train():
    #  将数据集分为训练集和测试集，占比为9：1
    input_shape = 60
    x, y, output_dictionary, vocab_size, label_size, inverse_word_dictionary = input_data_for_model(input_shape)
    train_end = int(len(x) * 0.9)
    train_x, train_y = x[0:train_end], np.array(y[0: train_end])
    test_x, test_y = x[train_end:], np.array(y[train_end:])

    #  模型输入参数
    activation = 'selu'
    out_act = 'softmax'
    n_units = 100
    batch_size = 32
    epochs = 10
    output_dim = 20

    #  模型训练
    lstm_model = creat_bi_lstm(vocab_size, label_size, input_shape, output_dim, n_units, out_act, activation)
    lstm_model.fit(train_x, train_y, epochs=epochs, batch_size=batch_size, verbose=1)

    #  模型保存
    model_save_path = CONSTANTS[0]
    lstm_model.save(model_save_path)
    plot_model(lstm_model, to_file='./data/lstm_model.png')

    #  在测试集上的效果
    N = test_x.shape[0]  # 测试的条数
    avg_accuracy = 0  # 预测的平均准确率
    for start, end in zip(range(0, N, 1), range(1, N + 1, 1)):
        sentence = [inverse_word_dictionary[i] for i in test_x[start] if i != 0]
        y_predict = lstm_model.predict(test_x[start:end])
        input_sequences, output_sequences = [], []
        for i in range(0, len(y_predict[0])):
            output_sequences.append(np.argmax(y_predict[0][i]))
            input_sequences.append(np.argmax(test_y[start][i]))

        eval = lstm_model.evaluate(test_x[start:end], test_y[start:end])
        print('Test Accuracy: loss = %0.6f accuracy = %0.2f%%' % (eval[0], eval[1] * 100))
        avg_accuracy += eval[1]
        output_sequences = ' '.join([output_dictionary[key] for key in output_sequences if key != 0]).split()
        input_sequences = ' '.join([output_dictionary[key] for key in input_sequences if key != 0]).split()
        output_input_comparison = pd.DataFrame([sentence, output_sequences, input_sequences]).T
        print(output_input_comparison.dorpna())
        print('#' * 80)

        avg_accuracy /= N
        print("测试样本的平均预测准确率：%.2f%%." % (avg_accuracy * 100))


if __name__ == '__main__':
    model_train()

在上面的代码中，先是通过input_data_for_model()函数来处理好进入模型的数据，其参数为input_shape，即填充句子时的长度。然后是创建Bi-LSTM模型create_Bi_LSTM()，模型的示意图如下：

最后，是在输入的数据上进行模型训练，将原始的数据分为训练集和测试集，占比为9:1，训练的周期为10次。

模型训练

运行上述模型训练代码，一共训练10个周期，训练时间大概为500s，在训练集上的准确率达99%以上，在测试集上的平均准确率为93%以上。以下是最后几个测试集上的预测结果：

Epoch 1/10
394/394 [==============================] - 13s 29ms/step - loss: 0.2133 - accuracy: 0.8241
Epoch 2/10
394/394 [==============================] - 11s 29ms/step - loss: 0.0603 - accuracy: 0.9191
Epoch 3/10
394/394 [==============================] - 11s 29ms/step - loss: 0.0292 - accuracy: 0.9670
Epoch 4/10
394/394 [==============================] - 12s 30ms/step - loss: 0.0157 - accuracy: 0.9840
Epoch 5/10
394/394 [==============================] - 12s 31ms/step - loss: 0.0093 - accuracy: 0.9904
Epoch 6/10
394/394 [==============================] - 12s 31ms/step - loss: 0.0063 - accuracy: 0.9935
Epoch 7/10
394/394 [==============================] - 12s 30ms/step - loss: 0.0043 - accuracy: 0.9955
Epoch 8/10
394/394 [==============================] - 12s 29ms/step - loss: 0.0032 - accuracy: 0.9964
Epoch 9/10
394/394 [==============================] - 11s 29ms/step - loss: 0.0022 - accuracy: 0.9978
Epoch 10/10
394/394 [==============================] - 12s 30ms/step - loss: 0.0014 - accuracy: 0.9988
1/1 [==============================] - 0s 337ms/step - loss: 0.1548 - accuracy: 0.9375
Test Accuracy: loss = 0.154795 accuracy = 93.75%

该模型在原始数据上的识别效果还是可以的。训练完模型后，BASE_DIR中的所有文件如下：

模型预测

最后，也许是整个项目最为激动人心的时刻，因为，我们要在新数据集上测试模型的识别效果。预测新数据的识别结果的完整Python代码（Bi_LSTM_Model_predict.py）如下：

# Import the necessary modules
import pickle
import numpy as np
from utils import CONSTANTS
from keras.preprocessing.sequence import pad_sequences
from keras.models import load_model
from nltk import word_tokenize

# 导入字典
with open(CONSTANTS[1], 'rb') as f:
    word_dictionary = pickle.load(f)
with open(CONSTANTS[4], 'rb') as f:
    output_dictionary = pickle.load(f)

try:
    # 数据预处理
    input_shape = 60
    sent = 'New York is the biggest city in America.'
    new_sent = word_tokenize(sent)
    new_x = [[word_dictionary[word] for word in new_sent]]
    x = pad_sequences(maxlen=input_shape, sequences=new_x, padding='post', value=0)

    # 载入模型
    model_save_path = CONSTANTS[0]
    lstm_model = load_model(model_save_path)

    # 模型预测
    y_predict = lstm_model.predict(x)

    ner_tag = []
    for i in range(0, len(new_sent)):
        ner_tag.append(np.argmax(y_predict[0][i]))

    ner = [output_dictionary[i] for i in ner_tag]
    print(new_sent)
    print(ner)

    # 去掉NER标注为O的元素
    ner_reg_list = []
    for word, tag in zip(new_sent, ner):
        if tag != 'O':
            ner_reg_list.append((word, tag))

    # 输出模型的NER识别结果
    print("NER识别结果：")
    if ner_reg_list:
        for i, item in enumerate(ner_reg_list):
            if item[1].startswith('B'):
                end = i + 1
                while end

自然语言处理(NLP)之用深度学习实现命名实体识别(NER)

最近更新

热门博客

[ 申请 ]友情链接：