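Exploring Named Entity Recognition (NER), Part 1
Named-entity recognition (NER), also known as entity identification, entity chunking, and entity extraction, is a subtask of information extraction. It aims to locate the named entities mentioned in unstructured text and classify them into predefined categories such as person names, organizations, locations, medical terms, time expressions, quantities, monetary values, and percentages.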
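Contents
- Setting up a TensorFlow 1.x virtual environment
- Data collection and cleaning
- Automatic annotation: converting text into a deep-learning format
Setting up a TensorFlow 1.x virtual environment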
Create the virtual environment
E:\>python -m venv 2020_vms_tensorflow_1
Activate the virtual environment
E:\>cd E:\2020_vms_tensorflow_1\Scripts
E:\2020_vms_tensorflow_1\Scripts>activate.bat
(2020_vms_tensorflow_1) E:\2020_vms_tensorflow_1\Scripts>
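Once the prompt shows the environment name, it can be worth confirming that the interpreter in use really belongs to the virtual environment before installing anything. The following is a minimal sketch using only the standard library; the helper name `in_virtualenv` is hypothetical and not part of the original setup.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a venv.

    Inside an environment created by `python -m venv`, sys.prefix points at
    the environment directory, while sys.base_prefix still points at the
    base Python installation.
    """
    base = getattr(sys, "base_prefix", sys.prefix)
    return sys.prefix != base


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("inside a virtual environment:", in_virtualenv())
```

Run with the environment activated, the second line should report `True`; run with the system interpreter, it should report `False`.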
Install TensorFlow 1.x from the wheel tensorflow-1.15.0-cp36-cp36m-win_amd64.whl
(2020_vms_tensorflow_1) D:\2020_vir_tensorflow1\install_whl>pip install tensorflow-1.15.0-cp36-cp36m-win_amd64.whl -i https://pypi.tuna.tsinghua.edu.cn/simple
Processing d:\2020_vir_tensorflow1\install_whl\tensorflow-1.15.0-cp36-cp36m-win_amd64.whl
Collecting wheel>=0.26 (from tensorflow==1.15.0)
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/a7/00/3df031b3ecd5444d572141321537080b40c1c25e1caa3d86cdd12e5e919c/wheel-0.35.1-py2.py3-none-any.whl
Collecting tensorflow-estimator==1.15.1 (from tensorflow==1.15.0)
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/de/62/2ee9cd74c9fa2fa450877847ba560b260f5d0fb70ee0595203082dafcc9d/tensorflow_estimator-1.15.1-py2.py3-none-any.whl
Collecting keras-applications>=1.0.8 (from tensorflow==1.15.0)
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/71/e3/19762fdfc62877ae9102edf6342d71b28fbfd9dea3d2f96a882ce099b03f/Keras_Applications-1.0.8-py3-none-any.whl (50kB)
100% |████████████████████████████████| 51kB 276kB/s
Collecting absl-py>=0.7.0 (from tensorflow==1.15.0)
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/b9/07/f69dd3367368ad69f174bfe426a973651412ec11d48ec05c000f19fe0561/absl_py-0.10.0-py3-none-any.whl (127kB)
100% |████████████████████████████████| 133kB 488kB/s
Collecting google-pasta>=0.1.6 (from tensorflow==1.15.0)
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/a3/de/c648ef6835192e6e2cc03f40b19eeda4382c49b5bafb43d88b931c4c74ac/google_pasta-0.2.0-py3-none-any.whl
Collecting keras-preprocessing>=1.0.5 (from tensorflow==1.15.0)
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/79/4c/7c3275a01e12ef9368a892926ab932b33bb13d55794881e3573482b378a7/Keras_Preprocessing-1.1.2-py2.py3-none-any.whl (42kB)
100% |████████████████████████████████| 51kB 2.1MB/s
Collecting grpcio>=1.8.6 (from tensorflow==1.15.0)
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/15/3f/f311f382bb658387fe78a30e1ed55193fe94c5e78b37abd134c34bd256eb/grpcio-1.31.0-cp36-cp36m-win_amd64.whl
Collecting gast==0.2.2 (from tensorflow==1.15.0)
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/4e/35/11749bf99b2d4e3cceb4d55ca22590b0d7c2c62b9de38ac4a4a7f4687421/gast-0.2.2.tar.gz
Collecting protobuf>=3.6.1 (from tensorflow==1.15.0)
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/f6/fe/9d8e70a86add02cb1ef35540ec03fd5b210d76323fe4645d7121b13ae33e/protobuf-3.13.0-cp36-cp36m-win_amd64.whl (1.1MB)
100% |████████████████████████████████| 1.1MB 99kB/s
Collecting astor>=0.6.0 (from tensorflow==1.15.0)
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/c3/88/97eef84f48fa04fbd6750e62dcceafba6c63c81b7ac1420856c8dcc0a3f9/astor-0.8.1-py2.py3-none-any.whl
Collecting numpy<2.0,>=1.16.0 (from tensorflow==1.15.0)
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/05/1d/d7b100264346a8722325987f10061b66d3c560bfb292f2c0254736e7531e/numpy-1.19.1-cp36-cp36m-win_amd64.whl (12.9MB)
100% |████████████████████████████████| 12.9MB 42kB/s
Collecting termcolor>=1.1.0 (from tensorflow==1.15.0)
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/8a/48/a76be51647d0eb9f10e2a4511bf3ffb8cc1e6b14e9e4fab46173aa79f981/termcolor-1.1.0.tar.gz
Collecting opt-einsum>=2.3.2 (from tensorflow==1.15.0)
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/bc/19/404708a7e54ad2798907210462fd950c3442ea51acc8790f3da48d2bee8b/opt_einsum-3.3.0-py3-none-any.whl (65kB)
100% |████████████████████████████████| 71kB 157kB/s
Collecting six>=1.10.0 (from tensorflow==1.15.0)
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/ee/ff/48bde5c0f013094d729fe4b0316ba2a24774b3ff1c52d924a8a4cb04078a/six-1.15.0-py2.py3-none-any.whl
Collecting wrapt>=1.11.1 (from tensorflow==1.15.0)
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/82/f7/e43cefbe88c5fd371f4cf0cf5eb3feccd07515af9fd6cf7dbf1d1793a797/wrapt-1.12.1.tar.gz
Collecting tensorboard<1.16.0,>=1.15.0 (from tensorflow==1.15.0)
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/1e/e9/d3d747a97f7188f48aa5eda486907f3b345cd409f0a0850468ba867db246/tensorboard-1.15.0-py3-none-any.whl (3.8MB)
100% |████████████████████████████████| 3.8MB 90kB/s
Collecting h5py (from keras-applications>=1.0.8->tensorflow==1.15.0)
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/0b/fa/bee65d2dbdbd3611702aafd128139c53c90a1285f169ba5467aab252e27a/h5py-2.10.0-cp36-cp36m-win_amd64.whl (2.4MB)
100% |████████████████████████████████| 2.4MB 89kB/s
Requirement already satisfied: setuptools in e:\2020_vms_tensorflow_1\lib\site-packages (from protobuf>=3.6.1->tensorflow==1.15.0)
Collecting markdown>=2.6.8 (from tensorboard<1.16.0,>=1.15.0->tensorflow==1.15.0)
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/a4/63/eaec2bd025ab48c754b55e8819af0f6a69e2b1e187611dd40cbbe101ee7f/Markdown-3.2.2-py3-none-any.whl (88kB)
100% |████████████████████████████████| 92kB 138kB/s
Collecting werkzeug>=0.11.15 (from tensorboard<1.16.0,>=1.15.0->tensorflow==1.15.0)
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/cc/94/5f7079a0e00bd6863ef8f1da638721e9da21e5bacee597595b318f71d62e/Werkzeug-1.0.1-py2.py3-none-any.whl (298kB)
100% |████████████████████████████████| 307kB 109kB/s
Collecting importlib-metadata; python_version < "3.8" (from markdown>=2.6.8->tensorboard<1.16.0,>=1.15.0->tensorflow==1.15.0)
The installation then reports an error:
Collecting zipp>=0.5 (from importlib-metadata; python_version < "3.8"->markdown>=2.6.8->tensorboard<1.16.0,>=1.15.0->tensorflow==1.15.0)
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/b2/34/bfcb43cc0ba81f527bc4f40ef41ba2ff4080e047acb0586b56b3d017ace4/zipp-3.1.0-py3-none-any.whl
Building wheels for collected packages: wrapt
Running setup.py bdist_wheel for wrapt ... error
Failed building wheel for wrapt
Running setup.py clean for wrapt
Failed to build wrapt
Installing collected packages: wrapt, werkzeug, zipp, importlib-metadata, markdown, tensorboard, tensorflow
Running setup.py install for wrapt ... error
Exception:
Traceback (most recent call last):
File "e:\2020_vms_tensorflow_1\lib\site-packages\pip\compat\__init__.py", line 73, in console_to_str
return s.decode(sys.__stdout__.encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa1 in position 44: invalid start byte
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "e:\2020_vms_tensorflow_1\lib\site-packages\pip\basecommand.py", line 215, in main
status = self.run(options, args)
File "e:\2020_vms_tensorflow_1\lib\site-packages\pip\commands\install.py", line 342, in run
prefix=options.prefix_path,
File "e:\2020_vms_tensorflow_1\lib\site-packages\pip\req\req_set.py", line 784, in install
**kwargs
File "e:\2020_vms_tensorflow_1\lib\site-packages\pip\req\req_install.py", line 878, in install
spinner=spinner,
File "e:\2020_vms_tensorflow_1\lib\site-packages\pip\utils\__init__.py", line 676, in call_subprocess
line = console_to_str(proc.stdout.readline())
File "e:\2020_vms_tensorflow_1\lib\site-packages\pip\compat\__init__.py", line 75, in console_to_str
return s.decode('utf_8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa1 in position 44: invalid start byte
You are using pip version 9.0.1, however version 20.2.2 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.
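To work around this, edit the code around line 73 of pip\compat\__init__.py, the file named in the traceback. The original code is: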
if sys.version_info >= (3,):
    def console_to_str(s):
        try:
            return s.decode(sys.__stdout__.encoding)
        except UnicodeDecodeError:
            return s.decode('utf_8')
Change it to:
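if sys.version_info >= (3,):
    def console_to_str(s):
        try:
            # Original line, which decoded with sys.__stdout__.encoding:
            # return s.decode(sys.__stdout__.encoding)
            # Decode as cp936 (GBK), the code page of a Simplified Chinese Windows console.
            return s.decode('cp936')
        except UnicodeDecodeError:
            return s.decode('utf_8')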
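The traceback shows a byte (0xa1) that is not valid UTF-8, which is what one would expect when a subprocess writes its messages in cp936 (GBK). The sketch below reproduces that situation outside pip. It is an illustration only: `decode_console` is a hypothetical helper, not pip's code, and the sample string is made up.

```python
def decode_console(raw: bytes, encodings=("utf-8", "cp936")) -> str:
    """Decode console bytes, trying each candidate encoding in order."""
    for name in encodings:
        try:
            return raw.decode(name)
        except UnicodeDecodeError:
            continue
    # Last resort: never raise, replace undecodable bytes instead.
    return raw.decode(encodings[0], errors="replace")


# Simulate output from a tool running on a Simplified Chinese Windows console.
raw = "错误".encode("cp936")

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print("decodes as UTF-8:", utf8_ok)
print("with fallback:", decode_console(raw))
```

Trying several encodings and falling back to `errors="replace"` avoids hard-coding a single code page, at the cost of possibly showing replacement characters in the log.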
After this change, re-running the same command installs TensorFlow 1.x successfully:
(2020_vms_tensorflow_1) D:\2020_vir_tensorflow1\install_whl>pip install tensorflow-1.15.0-cp36-cp36m-win_amd64.whl -i https://pypi.tuna.tsinghua.edu.cn/simple
Processing d:\2020_vir_tensorflow1\install_whl\tensorflow-1.15.0-cp36-cp36m-win_amd64.whl
Requirement already satisfied: google-pasta>=0.1.6 in e:\2020_vms_tensorflow_1\lib\site-packages (from tensorflow==1.15.0)
Collecting tensorboard<1.16.0,>=1.15.0 (from tensorflow==1.15.0)
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/1e/e9/d3d747a97f7188f48aa5eda486907f3b345cd409f0a0850468ba867db246/tensorboard-1.15.0-py3-none-any.whl
Requirement already satisfied: protobuf>=3.6.1 in e:\2020_vms_tensorflow_1\lib\site-packages (from tensorflow==1.15.0)
Requirement already satisfied: wheel>=0.26 in e:\2020_vms_tensorflow_1\lib\site-packages (from tensorflow==1.15.0)
Requirement already satisfied: opt-einsum>=2.3.2 in e:\2020_vms_tensorflow_1\lib\site-packages (from tensorflow==1.15.0)
Requirement already satisfied: six>=1.10.0 in e:\2020_vms_tensorflow_1\lib\site-packages (from tensorflow==1.15.0)
Requirement already satisfied: astor>=0.6.0 in e:\2020_vms_tensorflow_1\lib\site-packages (from tensorflow==1.15.0)
Requirement already satisfied: keras-applications>=1.0.8 in e:\2020_vms_tensorflow_1\lib\site-packages (from tensorflow==1.15.0)
Collecting wrapt>=1.11.1 (from tensorflow==1.15.0)
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/82/f7/e43cefbe88c5fd371f4cf0cf5eb3feccd07515af9fd6cf7dbf1d1793a797/wrapt-1.12.1.tar.gz
Requirement already satisfied: grpcio>=1.8.6 in e:\2020_vms_tensorflow_1\lib\site-packages (from tensorflow==1.15.0)
Requirement already satisfied: numpy<2.0,>=1.16.0 in e:\2020_vms_tensorflow_1\lib\site-packages (from tensorflow==1.15.0)
Requirement already satisfied: absl-py>=0.7.0 in e:\2020_vms_tensorflow_1\lib\site-packages (from tensorflow==1.15.0)
Requirement already satisfied: keras-preprocessing>=1.0.5 in e:\2020_vms_tensorflow_1\lib\site-packages (from tensorflow==1.15.0)
Requirement already satisfied: gast==0.2.2 in e:\2020_vms_tensorflow_1\lib\site-packages (from tensorflow==1.15.0)
Requirement already satisfied: tensorflow-estimator==1.15.1 in e:\2020_vms_tensorflow_1\lib\site-packages (from tensorflow==1.15.0)
Requirement already satisfied: termcolor>=1.1.0 in e:\2020_vms_tensorflow_1\lib\site-packages (from tensorflow==1.15.0)
Collecting markdown>=2.6.8 (from tensorboard<1.16.0,>=1.15.0->tensorflow==1.15.0)
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/a4/63/eaec2bd025ab48c754b55e8819af0f6a69e2b1e187611dd40cbbe101ee7f/Markdown-3.2.2-py3-none-any.whl
Collecting werkzeug>=0.11.15 (from tensorboard<1.16.0,>=1.15.0->tensorflow==1.15.0)
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/cc/94/5f7079a0e00bd6863ef8f1da638721e9da21e5bacee597595b318f71d62e/Werkzeug-1.0.1-py2.py3-none-any.whl
Collecting setuptools>=41.0.0 (from tensorboard<1.16.0,>=1.15.0->tensorflow==1.15.0)
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/b0/8b/379494d7dbd3854aa7b85b216cb0af54edcb7fce7d086ba3e35522a713cf/setuptools-50.0.0-py3-none-any.whl (783kB)
100% |████████████████████████████████| 788kB 121kB/s
Requirement already satisfied: h5py in e:\2020_vms_tensorflow_1\lib\site-packages (from keras-applications>=1.0.8->tensorflow==1.15.0)
Collecting importlib-metadata; python_version < "3.8" (from markdown>=2.6.8->tensorboard<1.16.0,>=1.15.0->tensorflow==1.15.0)
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/8e/58/cdea07eb51fc2b906db0968a94700866fc46249bdc75cac23f9d13168929/importlib_metadata-1.7.0-py2.py3-none-any.whl
Collecting zipp>=0.5 (from importlib-metadata; python_version < "3.8"->markdown>=2.6.8->tensorboard<1.16.0,>=1.15.0->tensorflow==1.15.0)
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/b2/34/bfcb43cc0ba81f527bc4f40ef41ba2ff4080e047acb0586b56b3d017ace4/zipp-3.1.0-py3-none-any.whl
Building wheels for collected packages: wrapt
Running setup.py bdist_wheel for wrapt ... done
Stored in directory: C:\Users\lenovo\AppData\Local\pip\Cache\wheels\68\e3\d7\4b6eee6f5d547bdfd97ba406128db66c5654dfb831fda163a2
Successfully built wrapt
Installing collected packages: zipp, importlib-metadata, markdown, werkzeug, setuptools, tensorboard, wrapt, tensorflow
Found existing installation: setuptools 28.8.0
Uninstalling setuptools-28.8.0:
Successfully uninstalled setuptools-28.8.0
Successfully installed importlib-metadata-1.7.0 markdown-3.2.2 setuptools-50.0.0 tensorboard-1.15.0 tensorflow-1.15.0 werkzeug-1.0.1 wrapt-1.12.1 zipp-3.1.0
You are using pip version 9.0.1, however version 20.2.2 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.
(2020_vms_tensorflow_1) D:\2020_vir_tensorflow1\install_whl>
(2020_vms_tensorflow_1) D:\2020_vir_tensorflow1\install_whl>python
Python 3.6.4 (v3.6.4:d48eceb, Dec 19 2017, 06:54:40) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
...
>>>
>>> print(tf.__version__)
1.15.0
>>>
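The interactive check above can also be scripted, which is handy when the same environment is rebuilt on another machine. This is a small sketch; the helper name `is_tf1` is hypothetical, and the script only imports TensorFlow if it is actually installed.

```python
import importlib.util


def is_tf1(version: str) -> bool:
    """Return True if a version string such as '1.15.0' belongs to the 1.x line."""
    return version.split(".")[0] == "1"


if __name__ == "__main__":
    if importlib.util.find_spec("tensorflow") is None:
        print("tensorflow is not installed in this interpreter")
    else:
        import tensorflow as tf
        print(tf.__version__, "-> TensorFlow 1.x:", is_tf1(tf.__version__))
```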
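Data collection and cleaning
This article uses an electronic medical record (EMR) analysis case from the healthcare industry; the data and code come from materials found online. Natural language processing research on EMR text focuses on processing the record text itself, including sentence boundary detection, part-of-speech tagging, and syntactic parsing. Information extraction builds on that work and focuses on recognizing the named entities or medical concepts that express medical knowledge in the record text, and on extracting the relations between them.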
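- 0.ann, the manually annotated entity file: the first column is the annotation ID, the second is the entity type (for example, Disease), the third and fourth are the start and end positions of the entity in the corresponding 0.txt file, and the fifth is the annotated entity text (for example, 1型糖尿病, "type 1 diabetes"). This file is produced by manual labeling.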
......
T1 Disease 1845 1850 1型糖尿病
T2 Disease 1983 1988 1型糖尿病
T4 Disease 30 35 2型糖尿病
T5 Disease 1822 1827 2型糖尿病
T6 Disease 2055 2060 2型糖尿病
T7 Disease 2324 2329 2型糖尿病
T8 Disease 4325 4330 2型糖尿病
T9 Disease 5223 5228 2型糖尿病
.......
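The five-column layout described above can be read with a few lines of Python. This is a minimal sketch, assuming the columns are separated by whitespace and that the end offset is exclusive (in the sample, end minus start equals the length of the entity text). The names `Entity` and `parse_ann_line` are hypothetical.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    ann_id: str   # annotation ID, e.g. "T1"
    label: str    # entity type, e.g. "Disease"
    start: int    # start offset in the .txt file (inclusive)
    end: int      # end offset in the .txt file (assumed exclusive)
    text: str     # annotated entity text


def parse_ann_line(line: str) -> Entity:
    """Split one annotation line into its five columns."""
    ann_id, label, start, end, text = line.strip().split(None, 4)
    return Entity(ann_id, label, int(start), int(end), text)


sample = [
    "T1 Disease 1845 1850 1型糖尿病",
    "T4 Disease 30 35 2型糖尿病",
]
entities = [parse_ann_line(line) for line in sample]
for e in entities:
    # The span length should equal the length of the entity text.
    print(e.ann_id, e.label, e.start, e.end, e.text, e.end - e.start == len(e.text))
```

Given the full text of 0.txt in a string `text`, the same assumption can be checked against the real data by comparing `text[e.start:e.end]` with `e.text`.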
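A physician's diagnostic and treatment activity for a patient can be summarized as follows: from the patient's own account (self-reported symptoms) and the test results (examination items), the physician identifies how the disease manifests (symptoms), reaches a diagnosis (disease), and, based on that diagnosis, prescribes treatment measures (treatment plan). The information involved therefore covers symptoms, diseases, tests, and treatments.
- 0.txt, the raw text that 0.ann annotates: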
......
1.一般将HBA1C 。控制于
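The contents list names a later step that converts the annotated text into a format suitable for deep learning. One common representation for that purpose is character-level BIO tagging, sketched below as an assumption about what such a step could look like rather than as this article's actual method. The function name `bio_tags` and the one-line sample text are hypothetical; the sample is not taken from the real 0.txt.

```python
def bio_tags(text, spans):
    """Return one tag per character: B-<label>, I-<label>, or O.

    `spans` is an iterable of (label, start, end) with `end` exclusive.
    """
    tags = ["O"] * len(text)
    for label, start, end in spans:
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return tags


# Hypothetical one-line document with one annotated disease mention.
text = "患者有2型糖尿病史"
spans = [("Disease", 3, 8)]
for ch, tag in zip(text, bio_tags(text, spans)):
    print(ch, tag)
```

Each character ends up with exactly one label, which is the shape a sequence-labeling model typically expects as training input.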