Problems Encountered When Training Tesseract OCR, and How to Solve Them

高精度计算机视觉, published 2022-09-22 11:50:27

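I won't repeat how to train with Tesseract-OCR here; you can refer directly to the following write-ups:

"使用tesseract训练自己的字库提高识别率" (Training your own Tesseract font library to improve recognition accuracy), SeventhBlue's blog on CSDN
"OCR 文字识别" (OCR text recognition), boyang987 on 博客园 (cnblogs)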

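As Tesseract has kept evolving, I found that every build compiled on or after March 18, 2021 has problems during training; the source contains quite a few bugs [https://github.com/tesseract-ocr/tesseract/issues/3925]. So I had to pull an earlier version from GitHub to verify: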

git clone --recursive -b 5.0.0-alpha-20201224 https://github.com/tesseract-ocr/tesseract tesseract500A2012

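Next, configure the project with cmake-gui and build it to get the final executables. I won't go into these basic steps, since I don't have time to write a full tutorial series.

The commands I used are roughly as follows: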

(1) Add the Tesseract directory to the PATH environment variable
E:\pkg_ocr\tesseract\tesseract520

(2) Prepare the images
cd  E:\pkg_ocr\tesstrain\jTessBoxEditor231
train.bat ----> jTessBoxEditor  ---> merge TIFF ---> save it as myfontlab.normal.exp0.tif

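(3) Run the following in a command window:
tesseract myfontlab.normal.exp0.tif myfontlab.normal.exp0 batch.nochop makebox
tesseract myfontlab.normal.exp0.tif myfontlab.normal.exp0 nobatch box.train
Note: if you see an error such as "empty" and the image fails the box check, adjust the contrast or brightness and then merge the TIFF again.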

(4)
unicharset_extractor myfontlab.normal.exp0.box

(5)
echo normal 0 0 0 0 0 > font_properties
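Note: the file is named font_properties, with no .txt extension (I also tried font_properties.txt and it worked without problems). The line normal 0 0 0 0 0 declares a plain default font: the five zeros are the italic, bold, fixed, serif, and fraktur flags, all switched off.
Note that normal here must match the normal in myfontlab.normal.exp0.tif.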
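The naming rule above can be checked mechanically. The following is a minimal Python sketch, not part of the original workflow: the helper names `font_name_from_image` and `font_properties_line` are hypothetical, and it assumes the training image follows the `<lang>.<font>.exp<N>.tif` pattern used in this post.

```python
from pathlib import Path


def font_name_from_image(image_name: str) -> str:
    """Return the font field of a <lang>.<font>.exp<N>.tif file name."""
    parts = Path(image_name).name.split(".")
    if len(parts) < 4 or not parts[-2].startswith("exp"):
        raise ValueError(f"unexpected training image name: {image_name}")
    return parts[-3]


def font_properties_line(font: str, italic=False, bold=False,
                         fixed=False, serif=False, fraktur=False) -> str:
    """Build one font_properties line: the font name followed by five 0/1 flags."""
    flags = (italic, bold, fixed, serif, fraktur)
    return " ".join([font, *(str(int(flag)) for flag in flags)])


def write_font_properties(image_names, path="font_properties") -> None:
    """Write one line per distinct font found in the given image names."""
    fonts = sorted({font_name_from_image(name) for name in image_names})
    lines = [font_properties_line(font) for font in fonts]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
```

Deriving the font name from the image name, rather than typing it twice, is one way to keep the two from drifting apart.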

(6)
shapeclustering -F font_properties -U unicharset myfontlab.normal.exp0.tr
or
shapeclustering -F font_properties.txt -U unicharset myfontlab.normal.exp0.tr

(7)
mftraining  -F font_properties -U unicharset -O train.unicharset myfontlab.normal.exp0.tr
This generates the inttemp and pffmtable files. If the command above does not work or reports an error, use the following command instead:
mftraining -F font_properties.txt -U unicharset -O train.unicharset myfontlab.normal.exp0.tr

(8)
cntraining myfontlab.normal.exp0.tr

(9)
combine_tessdata normal

(10) Test the result. On success, a file named t_7B-normal.txt is generated:
tesseract E:\test_images\ocr\t_7B.png  E:\test_images\ocr\t_7B-normal -l normal
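For repeated runs, steps (3) to (8) can be collected into a small script. This is only a sketch under assumptions: the function names are hypothetical, the command lines are copied from the steps above, and the Tesseract training tools are assumed to be on PATH. Creating font_properties, any manual box corrections in jTessBoxEditor after the first command, and the final combine_tessdata step are deliberately left out.

```python
import subprocess


def build_training_commands(lang="myfontlab", font="normal", exp=0,
                            font_properties="font_properties"):
    """Return the command lines of steps (3) to (8) as argument lists."""
    base = f"{lang}.{font}.exp{exp}"
    tif, tr = f"{base}.tif", f"{base}.tr"
    return [
        # Step (3): generate the box file, then the .tr training file.
        ["tesseract", tif, base, "batch.nochop", "makebox"],
        ["tesseract", tif, base, "nobatch", "box.train"],
        # Step (4): extract the character set from the box file.
        ["unicharset_extractor", f"{base}.box"],
        # Steps (6) to (8): these expect font_properties to exist already.
        ["shapeclustering", "-F", font_properties, "-U", "unicharset", tr],
        ["mftraining", "-F", font_properties, "-U", "unicharset",
         "-O", "train.unicharset", tr],
        ["cntraining", tr],
    ]


def run_commands(commands, workdir="."):
    """Run each command in order and stop at the first failure."""
    for cmd in commands:
        print("running:", " ".join(cmd))
        subprocess.run(cmd, cwd=workdir, check=True)
```

Keeping the command list separate from the runner makes it possible to print and inspect the commands before anything is executed.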

Problem 1:

mftraining.exe Warning no protos configs for -something- in CreateIntTemplates() when use command mftraining

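This is mainly caused by too few samples. For example, if a character you want to train has fewer than 5 samples, you will hit this warning; ideally, prepare about 10 samples of each such character.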
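One way to find under-sampled characters before running mftraining is to count the entries in the box file. The sketch below is an illustration, not part of the original workflow: the function names are hypothetical, and it assumes the usual box-file layout of one glyph per line in the form `<symbol> <left> <bottom> <right> <top> <page>`.

```python
from collections import Counter
from pathlib import Path


def count_glyphs(box_text: str) -> Counter:
    """Count how many boxes each symbol has in the text of a .box file."""
    counts = Counter()
    for line in box_text.splitlines():
        # Split from the right so the symbol stays intact in field 0.
        fields = line.rsplit(" ", 5)
        if len(fields) != 6:
            continue  # skip blank or malformed lines
        counts[fields[0]] += 1
    return counts


def undersampled(counts: Counter, minimum: int = 10) -> dict:
    """Return the symbols that have fewer than `minimum` samples."""
    return {glyph: n for glyph, n in sorted(counts.items()) if n < minimum}


def report_box_file(path: str, minimum: int = 10) -> dict:
    """Read a .box file and return its under-sampled symbols."""
    text = Path(path).read_text(encoding="utf-8")
    return undersampled(count_glyphs(text), minimum)
```

For the files in this post, `report_box_file("myfontlab.normal.exp0.box")` would list the characters that still need more samples; the threshold of 10 follows the suggestion above.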

Problem 2:

combine_tessdata.exe: Error: traineddata file must contain at least (a unicharset fileand inttemp) OR an lstm file.

I posted a reply about this problem on GitHub:

https://github.com/tesseract-ocr/tesstrain/issues/156

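In short, once the training commands have generated their output files, for example after

cntraining mytest.normal.exp0.tr

you will have these files:

inttemp normproto pffmtable shapetable unicharset

They need to be renamed with the language prefix:

normal.inttemp normal.normproto normal.pffmtable normal.shapetable normal.unicharset

Then run combine_tessdata normal again to get the final trained data. My output was as follows: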

Combining tessdata files
Output normal.traineddata created successfully.
Version string:5.0.0-alpha-20201224
1:unicharset:size=662, offset=192
3:inttemp:size=132152, offset=854
4:pffmtable:size=103, offset=133006
5:normproto:size=1262, offset=133109
13:shapetable:size=166, offset=134371
23:version:size=20, offset=134537
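The rename step described above is easy to forget, so it can be scripted. This is a minimal sketch: the function name `add_lang_prefix` is hypothetical, and it assumes the five files sit in the working directory under exactly the names listed above.

```python
from pathlib import Path

# The five component files named in the post.
COMPONENTS = ("inttemp", "normproto", "pffmtable", "shapetable", "unicharset")


def add_lang_prefix(workdir=".", lang="normal"):
    """Rename each component file to <lang>.<name>; return the new names."""
    workdir = Path(workdir)
    renamed = []
    for name in COMPONENTS:
        src = workdir / name
        if not src.exists():
            continue  # leave missing components to combine_tessdata to report
        dst = workdir / f"{lang}.{name}"
        src.replace(dst)  # overwrites an older <lang>.<name> if one exists
        renamed.append(dst.name)
    return renamed
```

Calling `add_lang_prefix(".", "normal")` right after cntraining, and before combine_tessdata normal, reproduces the manual renaming.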

---------------------------

That's all for this post; I'll add more if other problems come up.