[Ubuntu] tesseractをinstallして使いたい – ソフトウェアエンジニアの技術ブログ：Software engineer tech blog

$ sudo apt -y install tesseract-ocr tesseract-ocr-jpn libtesseract-dev libleptonica-dev tesseract-ocr-script-jpan tesseract-ocr-script-jpan-vert
$ tesseract -v
tesseract 4.1.3
leptonica-1.79.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.3) : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.1
Found AVX2
Found AVX
Found SSE
Found libarchive 3.4.0 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.8 liblz4/1.9.2 libzstd/1.4.4
$ tesseract –list-langs
List of available languages (5):
Japanese
Japanese_vert
eng
jpn
osd
※osdとは、言語の判定、文字角度の識別を行う

$ tesseract test1.png output
->
$ tesseract test1.png stdout
Hello world
$ tesseract test2.png stdout -l jpn
こんにちは

### Pythonで使えるようにする
$ pip3 install pyocr

from PIL import Image
import sys
import pyocr

tools = pyocr.get_available_tools()
langs = tools[0].get_available_languages()

img = Image.open('test1.png')
txt = tools[0].image_to_string(
	img,
	lang=langs[0],
	builder=pyocr.builders.TextBuilder(tesseract_layout=6)
)
print(txt)

※tesseract_layout=6 が精度に重要

$ python3 main.py
Hello world

lang=’jpn’に変更すると、日本語も出力できる
$ python3 main.py
とこんにちは
|
tesseract_layout=3で実行すると…
$ python3 main.py
こんにちは

#### tesseract_layoutの意味
0 = Orientation and script detection (OSD) only.
1 = Automatic page segmentation with OSD.
2 = Automatic page segmentation, but no OSD, or OCR
3 = Fully automatic page segmentation, but no OSD. (Default)
4 = Assume a single column of text of variable sizes.
5 = Assume a single uniform block of vertically aligned text.
6 = Assume a single uniform block of text.
7 = Treat the image as a single text line.
8 = Treat the image as a single word.
9 = Treat the image as a single word in a circle.
10 = Treat the image as a single character.

0 =方向とスクリプトの検出（OSD）のみ。
1 = OSDによる自動ページセグメンテーション。
2 =自動ページセグメンテーション、ただしOSDまたはOCRなし
3 =完全自動のページセグメンテーション。ただし、OSDはありません。（ディフォルト）
4 =可変サイズのテキストの単一列を想定します。
5 =垂直に配置されたテキストの単一の均一なブロックを想定します。
6 =単一の均一なテキストブロックを想定します。
7 =画像を単一のテキスト行として扱います。
8 =画像を1つの単語として扱います。
9 =画像を円の中の1つの単語として扱います。
10 =画像を1文字として扱います。

6で設定されているのが多いですね
文字角度を考慮しなければ3で、文字角度を考慮する場合は6でしょうか？
読み取りたい文字に併せて設定を変えながら試していくイメージか…