
Exploring Named Entity Recognition (NER), Part 1

段智华, published 2020-09-03 19:24:44


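Named-entity recognition (NER), also known as entity identification, entity chunking, or entity extraction, is a subtask of information extraction. It locates the named entities mentioned in unstructured text and classifies them into predefined categories such as person names, organizations, locations, medical terms, time expressions, quantities, monetary values, and percentages.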
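To make "locate and classify" concrete, here is a minimal sketch of the kind of output an NER system produces: character spans paired with a category. The dictionary lookup is only a toy stand-in, not the model explored in this series, and the lexicon, the function name `find_entities`, and the sample sentence are all illustrative assumptions.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon: surface string -> entity category.
LEXICON: Dict[str, str] = {
    "1型糖尿病": "Disease",
    "2型糖尿病": "Disease",
}


def find_entities(text: str, lexicon: Dict[str, str]) -> List[Tuple[int, int, str, str]]:
    """Return (start, end, category, surface) for every lexicon match, end exclusive."""
    found = []
    for surface, category in lexicon.items():
        start = text.find(surface)
        while start != -1:
            end = start + len(surface)
            found.append((start, end, category, surface))
            start = text.find(surface, end)
    # Sort by start offset so entities come out in reading order.
    return sorted(found)


# Made-up sentence, used only to illustrate the output shape.
sample = "患者确诊2型糖尿病,排除1型糖尿病。"
entities = find_entities(sample, LEXICON)
for entity in entities:
    print(entity)
```

A real system replaces the lookup with a trained sequence model, but the output keeps the same shape: where each entity starts and ends, and which category it belongs to.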

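Contents
    • Setting up a TensorFlow 1.x virtual environment
    • Data collection and cleaning
    • Automatic annotation: converting text into a format for deep learning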

Setting up a TensorFlow 1.x virtual environment

Create the virtual environment:

E:\>python -m venv 2020_vms_tensorflow_1

Activate the virtual environment:

E:\>cd E:\2020_vms_tensorflow_1\Scripts

E:\2020_vms_tensorflow_1\Scripts>activate.bat
(2020_vms_tensorflow_1) E:\2020_vms_tensorflow_1\Scripts>
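Besides the prompt prefix, you can confirm from inside Python that the interpreter belongs to the virtual environment. This is a small sketch using only the standard library; the helper name `in_virtualenv` is an illustrative assumption.

```python
import sys


def in_virtualenv() -> bool:
    """True when the running interpreter belongs to a venv-created environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix points at the base Python installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print(in_virtualenv(), sys.prefix)
```

Run with the environment activated, `sys.prefix` should point at the environment directory created above.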

Install TensorFlow 1.x from the local wheel tensorflow-1.15.0-cp36-cp36m-win_amd64.whl:

(2020_vms_tensorflow_1) D:\2020_vir_tensorflow1\install_whl>pip install tensorflow-1.15.0-cp36-cp36m-win_amd64.whl -i https://pypi.tuna.tsinghua.edu.cn/simple
Processing d:\2020_vir_tensorflow1\install_whl\tensorflow-1.15.0-cp36-cp36m-win_amd64.whl
Collecting wheel>=0.26 (from tensorflow==1.15.0)
  Using cached https://pypi.tuna.tsinghua.edu.cn/packages/a7/00/3df031b3ecd5444d572141321537080b40c1c25e1caa3d86cdd12e5e919c/wheel-0.35.1-py2.py3-none-any.whl
Collecting tensorflow-estimator==1.15.1 (from tensorflow==1.15.0)
  Using cached https://pypi.tuna.tsinghua.edu.cn/packages/de/62/2ee9cd74c9fa2fa450877847ba560b260f5d0fb70ee0595203082dafcc9d/tensorflow_estimator-1.15.1-py2.py3-none-any.whl
Collecting keras-applications>=1.0.8 (from tensorflow==1.15.0)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/71/e3/19762fdfc62877ae9102edf6342d71b28fbfd9dea3d2f96a882ce099b03f/Keras_Applications-1.0.8-py3-none-any.whl (50kB)
    100% |████████████████████████████████| 51kB 276kB/s
Collecting absl-py>=0.7.0 (from tensorflow==1.15.0)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/b9/07/f69dd3367368ad69f174bfe426a973651412ec11d48ec05c000f19fe0561/absl_py-0.10.0-py3-none-any.whl (127kB)
    100% |████████████████████████████████| 133kB 488kB/s
Collecting google-pasta>=0.1.6 (from tensorflow==1.15.0)
  Using cached https://pypi.tuna.tsinghua.edu.cn/packages/a3/de/c648ef6835192e6e2cc03f40b19eeda4382c49b5bafb43d88b931c4c74ac/google_pasta-0.2.0-py3-none-any.whl
Collecting keras-preprocessing>=1.0.5 (from tensorflow==1.15.0)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/79/4c/7c3275a01e12ef9368a892926ab932b33bb13d55794881e3573482b378a7/Keras_Preprocessing-1.1.2-py2.py3-none-any.whl (42kB)
    100% |████████████████████████████████| 51kB 2.1MB/s
Collecting grpcio>=1.8.6 (from tensorflow==1.15.0)
  Using cached https://pypi.tuna.tsinghua.edu.cn/packages/15/3f/f311f382bb658387fe78a30e1ed55193fe94c5e78b37abd134c34bd256eb/grpcio-1.31.0-cp36-cp36m-win_amd64.whl
Collecting gast==0.2.2 (from tensorflow==1.15.0)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/4e/35/11749bf99b2d4e3cceb4d55ca22590b0d7c2c62b9de38ac4a4a7f4687421/gast-0.2.2.tar.gz
Collecting protobuf>=3.6.1 (from tensorflow==1.15.0)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/f6/fe/9d8e70a86add02cb1ef35540ec03fd5b210d76323fe4645d7121b13ae33e/protobuf-3.13.0-cp36-cp36m-win_amd64.whl (1.1MB)
    100% |████████████████████████████████| 1.1MB 99kB/s
Collecting astor>=0.6.0 (from tensorflow==1.15.0)
  Using cached https://pypi.tuna.tsinghua.edu.cn/packages/c3/88/97eef84f48fa04fbd6750e62dcceafba6c63c81b7ac1420856c8dcc0a3f9/astor-0.8.1-py2.py3-none-any.whl
Collecting numpy<2.0,>=1.16.0 (from tensorflow==1.15.0)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/05/1d/d7b100264346a8722325987f10061b66d3c560bfb292f2c0254736e7531e/numpy-1.19.1-cp36-cp36m-win_amd64.whl (12.9MB)
    100% |████████████████████████████████| 12.9MB 42kB/s
Collecting termcolor>=1.1.0 (from tensorflow==1.15.0)
  Using cached https://pypi.tuna.tsinghua.edu.cn/packages/8a/48/a76be51647d0eb9f10e2a4511bf3ffb8cc1e6b14e9e4fab46173aa79f981/termcolor-1.1.0.tar.gz
Collecting opt-einsum>=2.3.2 (from tensorflow==1.15.0)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/bc/19/404708a7e54ad2798907210462fd950c3442ea51acc8790f3da48d2bee8b/opt_einsum-3.3.0-py3-none-any.whl (65kB)
    100% |████████████████████████████████| 71kB 157kB/s
Collecting six>=1.10.0 (from tensorflow==1.15.0)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/ee/ff/48bde5c0f013094d729fe4b0316ba2a24774b3ff1c52d924a8a4cb04078a/six-1.15.0-py2.py3-none-any.whl
Collecting wrapt>=1.11.1 (from tensorflow==1.15.0)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/82/f7/e43cefbe88c5fd371f4cf0cf5eb3feccd07515af9fd6cf7dbf1d1793a797/wrapt-1.12.1.tar.gz
Collecting tensorboard<1.16.0,>=1.15.0 (from tensorflow==1.15.0)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/1e/e9/d3d747a97f7188f48aa5eda486907f3b345cd409f0a0850468ba867db246/tensorboard-1.15.0-py3-none-any.whl (3.8MB)
    100% |████████████████████████████████| 3.8MB 90kB/s
Collecting h5py (from keras-applications>=1.0.8->tensorflow==1.15.0)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/0b/fa/bee65d2dbdbd3611702aafd128139c53c90a1285f169ba5467aab252e27a/h5py-2.10.0-cp36-cp36m-win_amd64.whl (2.4MB)
    100% |████████████████████████████████| 2.4MB 89kB/s
Requirement already satisfied: setuptools in e:\2020_vms_tensorflow_1\lib\site-packages (from protobuf>=3.6.1->tensorflow==1.15.0)
Collecting markdown>=2.6.8 (from tensorboard<1.16.0,>=1.15.0->tensorflow==1.15.0)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/a4/63/eaec2bd025ab48c754b55e8819af0f6a69e2b1e187611dd40cbbe101ee7f/Markdown-3.2.2-py3-none-any.whl (88kB)
    100% |████████████████████████████████| 92kB 138kB/s
Collecting werkzeug>=0.11.15 (from tensorboard<1.16.0,>=1.15.0->tensorflow==1.15.0)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/cc/94/5f7079a0e00bd6863ef8f1da638721e9da21e5bacee597595b318f71d62e/Werkzeug-1.0.1-py2.py3-none-any.whl (298kB)
    100% |████████████████████████████████| 307kB 109kB/s
Collecting importlib-metadata; python_version < "3.8" (from markdown>=2.6.8->tensorboard<1.16.0,>=1.15.0->tensorflow==1.15.0)

The installation fails with an error:

Collecting zipp>=0.5 (from importlib-metadata; python_version < "3.8"->markdown>=2.6.8->tensorboard<1.16.0,>=1.15.0->tensorflow==1.15.0)
  Using cached https://pypi.tuna.tsinghua.edu.cn/packages/b2/34/bfcb43cc0ba81f527bc4f40ef41ba2ff4080e047acb0586b56b3d017ace4/zipp-3.1.0-py3-none-any.whl
Building wheels for collected packages: wrapt
  Running setup.py bdist_wheel for wrapt ... error
  Failed building wheel for wrapt
  Running setup.py clean for wrapt
Failed to build wrapt
Installing collected packages: wrapt, werkzeug, zipp, importlib-metadata, markdown, tensorboard, tensorflow
  Running setup.py install for wrapt ... error
Exception:
Traceback (most recent call last):
  File "e:\2020_vms_tensorflow_1\lib\site-packages\pip\compat\__init__.py", line 73, in console_to_str
    return s.decode(sys.__stdout__.encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa1 in position 44: invalid start byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "e:\2020_vms_tensorflow_1\lib\site-packages\pip\basecommand.py", line 215, in main
    status = self.run(options, args)
  File "e:\2020_vms_tensorflow_1\lib\site-packages\pip\commands\install.py", line 342, in run
    prefix=options.prefix_path,
  File "e:\2020_vms_tensorflow_1\lib\site-packages\pip\req\req_set.py", line 784, in install
    **kwargs
  File "e:\2020_vms_tensorflow_1\lib\site-packages\pip\req\req_install.py", line 878, in install
    spinner=spinner,
  File "e:\2020_vms_tensorflow_1\lib\site-packages\pip\utils\__init__.py", line 676, in call_subprocess
    line = console_to_str(proc.stdout.readline())
  File "e:\2020_vms_tensorflow_1\lib\site-packages\pip\compat\__init__.py", line 75, in console_to_str
    return s.decode('utf_8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa1 in position 44: invalid start byte
You are using pip version 9.0.1, however version 20.2.2 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.

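Edit the code around line 73 of pip\compat\__init__.py, the file named in the traceback. The original code is: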

if sys.version_info >= (3,):
    def console_to_str(s):
        try:
            return s.decode(sys.__stdout__.encoding)
        except UnicodeDecodeError:
            return s.decode('utf_8')

Change it to:

if sys.version_info >= (3,):
    def console_to_str(s):
        try:
            # return s.decode(sys.__stdout__.encoding)
            # Decode with cp936, the GBK code page used by Simplified Chinese Windows consoles.
            return s.decode('cp936')
        except UnicodeDecodeError:
            return s.decode('utf_8')
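The failure can be reproduced outside pip. The sketch below assumes the build subprocess wrote its messages in cp936 (GBK), which is what the byte 0xa1 in the traceback and the fix above suggest but the log does not state outright. The sample string "错误" ("error") is illustrative, and `console_to_str` here only mirrors the shape of the patched function.

```python
# Bytes as a Simplified Chinese Windows console would emit them for "错误".
raw = "错误".encode("cp936")


def console_to_str(data: bytes) -> str:
    """Decode console output as cp936 first, falling back to UTF-8."""
    try:
        return data.decode("cp936")
    except UnicodeDecodeError:
        return data.decode("utf_8")


# Decoding the same bytes as UTF-8 fails, as in the traceback.
try:
    raw.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

print(utf8_failed)
print(console_to_str(raw))
```

Patching a file inside site-packages works as a quick fix, but the change is lost whenever pip is reinstalled or upgraded in that environment.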
 

With this change, TensorFlow 1.x installs successfully:

(2020_vms_tensorflow_1) D:\2020_vir_tensorflow1\install_whl>pip install tensorflow-1.15.0-cp36-cp36m-win_amd64.whl -i https://pypi.tuna.tsinghua.edu.cn/simple
Processing d:\2020_vir_tensorflow1\install_whl\tensorflow-1.15.0-cp36-cp36m-win_amd64.whl
Requirement already satisfied: google-pasta>=0.1.6 in e:\2020_vms_tensorflow_1\lib\site-packages (from tensorflow==1.15.0)
Collecting tensorboard<1.16.0,>=1.15.0 (from tensorflow==1.15.0)
  Using cached https://pypi.tuna.tsinghua.edu.cn/packages/1e/e9/d3d747a97f7188f48aa5eda486907f3b345cd409f0a0850468ba867db246/tensorboard-1.15.0-py3-none-any.whl
Requirement already satisfied: protobuf>=3.6.1 in e:\2020_vms_tensorflow_1\lib\site-packages (from tensorflow==1.15.0)
Requirement already satisfied: wheel>=0.26 in e:\2020_vms_tensorflow_1\lib\site-packages (from tensorflow==1.15.0)
Requirement already satisfied: opt-einsum>=2.3.2 in e:\2020_vms_tensorflow_1\lib\site-packages (from tensorflow==1.15.0)
Requirement already satisfied: six>=1.10.0 in e:\2020_vms_tensorflow_1\lib\site-packages (from tensorflow==1.15.0)
Requirement already satisfied: astor>=0.6.0 in e:\2020_vms_tensorflow_1\lib\site-packages (from tensorflow==1.15.0)
Requirement already satisfied: keras-applications>=1.0.8 in e:\2020_vms_tensorflow_1\lib\site-packages (from tensorflow==1.15.0)
Collecting wrapt>=1.11.1 (from tensorflow==1.15.0)
  Using cached https://pypi.tuna.tsinghua.edu.cn/packages/82/f7/e43cefbe88c5fd371f4cf0cf5eb3feccd07515af9fd6cf7dbf1d1793a797/wrapt-1.12.1.tar.gz
Requirement already satisfied: grpcio>=1.8.6 in e:\2020_vms_tensorflow_1\lib\site-packages (from tensorflow==1.15.0)
Requirement already satisfied: numpy<2.0,>=1.16.0 in e:\2020_vms_tensorflow_1\lib\site-packages (from tensorflow==1.15.0)
Requirement already satisfied: absl-py>=0.7.0 in e:\2020_vms_tensorflow_1\lib\site-packages (from tensorflow==1.15.0)
Requirement already satisfied: keras-preprocessing>=1.0.5 in e:\2020_vms_tensorflow_1\lib\site-packages (from tensorflow==1.15.0)
Requirement already satisfied: gast==0.2.2 in e:\2020_vms_tensorflow_1\lib\site-packages (from tensorflow==1.15.0)
Requirement already satisfied: tensorflow-estimator==1.15.1 in e:\2020_vms_tensorflow_1\lib\site-packages (from tensorflow==1.15.0)
Requirement already satisfied: termcolor>=1.1.0 in e:\2020_vms_tensorflow_1\lib\site-packages (from tensorflow==1.15.0)
Collecting markdown>=2.6.8 (from tensorboard<1.16.0,>=1.15.0->tensorflow==1.15.0)
  Using cached https://pypi.tuna.tsinghua.edu.cn/packages/a4/63/eaec2bd025ab48c754b55e8819af0f6a69e2b1e187611dd40cbbe101ee7f/Markdown-3.2.2-py3-none-any.whl
Collecting werkzeug>=0.11.15 (from tensorboard<1.16.0,>=1.15.0->tensorflow==1.15.0)
  Using cached https://pypi.tuna.tsinghua.edu.cn/packages/cc/94/5f7079a0e00bd6863ef8f1da638721e9da21e5bacee597595b318f71d62e/Werkzeug-1.0.1-py2.py3-none-any.whl
Collecting setuptools>=41.0.0 (from tensorboard<1.16.0,>=1.15.0->tensorflow==1.15.0)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/b0/8b/379494d7dbd3854aa7b85b216cb0af54edcb7fce7d086ba3e35522a713cf/setuptools-50.0.0-py3-none-any.whl (783kB)
    100% |████████████████████████████████| 788kB 121kB/s
Requirement already satisfied: h5py in e:\2020_vms_tensorflow_1\lib\site-packages (from keras-applications>=1.0.8->tensorflow==1.15.0)
Collecting importlib-metadata; python_version < "3.8" (from markdown>=2.6.8->tensorboard<1.16.0,>=1.15.0->tensorflow==1.15.0)
  Using cached https://pypi.tuna.tsinghua.edu.cn/packages/8e/58/cdea07eb51fc2b906db0968a94700866fc46249bdc75cac23f9d13168929/importlib_metadata-1.7.0-py2.py3-none-any.whl
Collecting zipp>=0.5 (from importlib-metadata; python_version < "3.8"->markdown>=2.6.8->tensorboard<1.16.0,>=1.15.0->tensorflow==1.15.0)
  Using cached https://pypi.tuna.tsinghua.edu.cn/packages/b2/34/bfcb43cc0ba81f527bc4f40ef41ba2ff4080e047acb0586b56b3d017ace4/zipp-3.1.0-py3-none-any.whl
Building wheels for collected packages: wrapt
  Running setup.py bdist_wheel for wrapt ... done
  Stored in directory: C:\Users\lenovo\AppData\Local\pip\Cache\wheels\68\e3\d7\4b6eee6f5d547bdfd97ba406128db66c5654dfb831fda163a2
Successfully built wrapt
Installing collected packages: zipp, importlib-metadata, markdown, werkzeug, setuptools, tensorboard, wrapt, tensorflow
  Found existing installation: setuptools 28.8.0
    Uninstalling setuptools-28.8.0:
      Successfully uninstalled setuptools-28.8.0
Successfully installed importlib-metadata-1.7.0 markdown-3.2.2 setuptools-50.0.0 tensorboard-1.15.0 tensorflow-1.15.0 werkzeug-1.0.1 wrapt-1.12.1 zipp-3.1.0
You are using pip version 9.0.1, however version 20.2.2 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.

(2020_vms_tensorflow_1) D:\2020_vir_tensorflow1\install_whl>
(2020_vms_tensorflow_1) D:\2020_vir_tensorflow1\install_whl>python
Python 3.6.4 (v3.6.4:d48eceb, Dec 19 2017, 06:54:40) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
...
>>>
>>> print(tf.__version__)
1.15.0
>>>

Data collection and cleaning

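This article works through a case study on analyzing electronic medical records (EMRs) in the healthcare industry; the data and code come from publicly available material on the internet. Natural language processing research on EMR text focuses on processing the record text itself, including sentence boundary detection, part-of-speech tagging, and syntactic parsing. Information extraction builds on that NLP work and focuses on recognizing the named entities or medical concepts that express medical knowledge in the record text, and on extracting the relations between them.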

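  • Manually annotated entity data, 0.ann: the first column is the annotation ID, the second is the entity category (for example, Disease), the third and fourth are the start and end offsets of the entity in the corresponding 0.txt file, and the fifth is the annotated entity text. This file was labeled by hand.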
......
T1	Disease 1845 1850	1型糖尿病
T2	Disease 1983 1988	1型糖尿病
T4	Disease 30 35	2型糖尿病
T5	Disease 1822 1827	2型糖尿病
T6	Disease 2055 2060	2型糖尿病
T7	Disease 2324 2329	2型糖尿病
T8	Disease 4325 4330	2型糖尿病
T9	Disease 5223 5228	2型糖尿病 
.......
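The column layout described above can be read with a few lines of Python. This is a minimal sketch, not the author's code: it assumes the ID, the "category start end" group, and the entity text are separated by tabs (as in the excerpt), and that every annotation is a single contiguous span. The names `Annotation` and `parse_ann` are illustrative.

```python
from typing import List, NamedTuple


class Annotation(NamedTuple):
    ann_id: str   # e.g. "T1"
    label: str    # entity category, e.g. "Disease"
    start: int    # start offset in the matching .txt file
    end: int      # end offset in the matching .txt file
    text: str     # annotated entity text


def parse_ann(lines: List[str]) -> List[Annotation]:
    """Parse entity lines of the form 'T1<TAB>Disease 1845 1850<TAB>1型糖尿病'."""
    annotations = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.startswith("T"):
            continue  # skip blank lines and anything that is not an entity row
        ann_id, span, text = line.split("\t", 2)
        label, start, end = span.split()
        annotations.append(Annotation(ann_id, label, int(start), int(end), text.strip()))
    return annotations


# Two rows copied from the excerpt above.
sample = [
    "T1\tDisease 1845 1850\t1型糖尿病",
    "T4\tDisease 30 35\t2型糖尿病",
]
anns = parse_ann(sample)
for ann in anns:
    print(ann)
```

In these rows, `end - start` equals the length of the entity text (5 characters), so the end offset appears to be exclusive.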

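A doctor's diagnosis and treatment of a patient can be summarized as follows: from the patient's own account (self-reported symptoms) and test results (examination items), the doctor identifies the manifestations of the disease (symptoms), reaches a diagnosis (disease), and prescribes treatment based on that diagnosis (treatment plan). The information involved therefore covers symptoms, diseases, examinations, and treatments.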

  • The original text that 0.ann annotates, 0.txt:
......
1.一般将HBA1C  。控制于            
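Because the offsets in 0.ann index into 0.txt, each annotation can be turned into per-character labels for a sequence model. The sketch below uses the BIO scheme, a common convention for NER training data. It is an assumption here, not necessarily the format this series uses, and the sentence and the helper name `spans_to_bio` are made up for illustration.

```python
from typing import List, Tuple


def spans_to_bio(text: str, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Build character-level BIO tags from (start, end, label) spans, end exclusive."""
    tags = ["O"] * len(text)
    for start, end, label in spans:
        if not 0 <= start < end <= len(text):
            raise ValueError("span (%d, %d) is outside the text" % (start, end))
        tags[start] = "B-" + label          # first character of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + label          # remaining characters of the entity
    return tags


# Made-up sentence standing in for the contents of a .txt file.
text = "确诊2型糖尿病。"
spans = [(2, 7, "Disease")]

# Sanity check: the slice selected by the offsets is the annotated entity.
assert text[2:7] == "2型糖尿病"

tags = spans_to_bio(text, spans)
for char, tag in zip(text, tags):
    print(char, tag)
```

Slicing the source text with each annotation's offsets and comparing against the recorded entity text is a cheap way to catch misaligned annotations before training.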