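Introduction

Tesseract-OCR is an open-source optical character recognition engine originally developed at HP Labs and later maintained by Google. It supports more than 100 languages and can be trained on custom samples.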

Preparation

Dependencies

yum install autoconf automake libtool libjpeg-devel libpng-devel libtiff-devel zlib-devel
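Before building, it can help to confirm that the build tools are actually on the PATH. The following is a minimal sketch, not part of the original procedure: `missing_tools` is a hypothetical helper name, and the compiler entries (`gcc`, `g++`) in the usage comment are an assumption, since the yum command above does not install them.

```shell
# Print the name of every command in the argument list that is not on PATH.
# (missing_tools is a hypothetical helper, not a Tesseract or yum command.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example usage (the tool list is an assumption about what the build needs):
# missing_tools autoconf automake libtoolize make gcc g++ wget
```

If the function prints nothing, every listed tool was found; any name it prints should be installed before continuing.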

Download

wget https://github.com/tesseract-ocr/tesseract/archive/4.1.0.tar.gz
wget https://github.com/DanBloomberg/leptonica/releases/download/1.78.0/leptonica-1.78.0.tar.gz

Installation

Leptonica

1. Build and install

tar -xzvf leptonica-1.78.0.tar.gz
# Enter the extracted directory, not the archive itself
cd leptonica-1.78.0
./configure --prefix=/usr/local/leptonica
make && make install

2. Environment variables

# Open /etc/profile and append the following settings
vi /etc/profile

PKG_CONFIG_PATH=$PKG_CONFIG_PATH:/usr/local/leptonica/lib/pkgconfig
export PKG_CONFIG_PATH
CPLUS_INCLUDE_PATH=$CPLUS_INCLUDE_PATH:/usr/local/leptonica/include/leptonica
export CPLUS_INCLUDE_PATH
C_INCLUDE_PATH=$C_INCLUDE_PATH:/usr/local/leptonica/include/leptonica
export C_INCLUDE_PATH
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/leptonica/lib
export LD_LIBRARY_PATH
LIBRARY_PATH=$LIBRARY_PATH:/usr/local/leptonica/lib
export LIBRARY_PATH
LIBLEPT_HEADERSDIR=/usr/local/leptonica/include/leptonica
export LIBLEPT_HEADERSDIR

# Apply the configuration
source /etc/profile
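Tesseract's configure step looks Leptonica up through pkg-config, so it is worth checking that the new settings took effect before moving on. The sketch below assumes the install prefix used above. `path_contains` is a hypothetical helper, and `lept` is assumed to be the pkg-config module name that Leptonica installs.

```shell
# Return success if the colon-separated list $1 contains the entry $2.
# (path_contains is a hypothetical helper written for this check.)
path_contains() {
    case ":$1:" in
        *":$2:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example usage after "source /etc/profile":
# path_contains "$PKG_CONFIG_PATH" /usr/local/leptonica/lib/pkgconfig && echo "PKG_CONFIG_PATH OK"
# pkg-config --modversion lept   # assumed module name; should print the Leptonica version
```

If either check fails, the usual cause is that /etc/profile was edited but not re-sourced in the current shell.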

Tesseract

Build and install

tar -xzvf 4.1.0.tar.gz
cd tesseract-4.1.0
# The GitHub source archive ships without a configure script; generate it first
./autogen.sh
./configure --prefix=/usr/local/tesseract
make && make install

Environment variables

# Open /etc/profile and append the following settings
vi /etc/profile

PATH=$PATH:/usr/local/tesseract/bin
export PATH

# Apply the configuration
source /etc/profile

Languages

# All languages: https://github.com/tesseract-ocr/tessdata
# Download a language (English as an example)
cd /usr/local/tesseract/share/tessdata
wget https://raw.githubusercontent.com/tesseract-ocr/tessdata/master/eng.traineddata
# List the available languages
tesseract --list-langs
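When more than one language is needed, the download step can be wrapped in a small loop. This is a sketch under the same assumptions as the commands above (same base URL and tessdata directory). `traineddata_url` and `fetch_langs` are hypothetical helper names, and `chi_sim` in the usage comment is only an example language code.

```shell
# Base URL taken from the download command above.
TESSDATA_BASE_URL="https://raw.githubusercontent.com/tesseract-ocr/tessdata/master"

# Print the download URL for one language code, e.g. "eng".
traineddata_url() {
    printf '%s/%s.traineddata\n' "$TESSDATA_BASE_URL" "$1"
}

# Download each listed language into the directory given as the first argument.
fetch_langs() {
    dest=$1
    shift
    for lang in "$@"; do
        wget -P "$dest" "$(traineddata_url "$lang")" || return 1
    done
}

# Example usage:
# fetch_langs /usr/local/tesseract/share/tessdata eng chi_sim
```

Running `tesseract --list-langs` afterwards should show each downloaded language.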

Done

# Show the version
tesseract -v
# Test: recognize test.jpg and print the text to stdout
tesseract test.jpg stdout -l eng
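Once the single-image test works, the same command can be applied to a whole directory. The sketch below assumes JPEG input and writes one text file next to each image; `out_base` and `ocr_dir` are hypothetical helper names. It relies on the fact that `tesseract IMAGE OUTPUTBASE` writes its result to OUTPUTBASE.txt.

```shell
# Strip the file extension, e.g. scans/page1.jpg -> scans/page1.
out_base() {
    printf '%s\n' "${1%.*}"
}

# Run OCR on every .jpg in directory $1 using language $2 (default: eng).
ocr_dir() {
    dir=$1
    lang=${2:-eng}
    for img in "$dir"/*.jpg; do
        [ -e "$img" ] || continue   # skip the unexpanded pattern when nothing matches
        tesseract "$img" "$(out_base "$img")" -l "$lang" || return 1
    done
    return 0
}

# Example usage (./scans is a hypothetical directory):
# ocr_dir ./scans eng
```

Stopping at the first failure keeps a bad image or a missing language file from being silently skipped.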