04.2 · OCR and Text Extraction

Languages and Preprocessing

Language Setting

Setting — Ocr.Language (default eng). Change it in Settings → OCR → Language.

Tex ships with 29 Tesseract data packs: 28 languages plus the auxiliary osd pack. The .traineddata files live in the tessdata\ folder next to Tex.Wpf.exe. Only one language is active per OCR call, so switch it in Settings when you need a different language.

Script	Codes
Latin	`eng` English, `fra` French, `deu` German, `spa` Spanish, `ita` Italian, `por` Portuguese, `nld` Dutch, `dan` Danish, `fin` Finnish, `hun` Hungarian, `nor` Norwegian, `swe` Swedish, `ces` Czech, `pol` Polish, `tur` Turkish, `vie` Vietnamese
Cyrillic	`rus` Russian, `ukr` Ukrainian
CJK	`jpn` Japanese, `chi_sim` Chinese (Simplified), `chi_tra` Chinese (Traditional), `kor` Korean
Arabic / RTL	`ara` Arabic, `heb` Hebrew
Indic / SE Asia	`hin` Hindi, `tha` Thai
Classical	`grc` Ancient Greek, `lat` Latin
Auxiliary	`osd` (orientation & script detection, not a language itself)
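
To confirm which packs are actually present, you can list the .traineddata files in the tessdata folder. The following is a minimal Python sketch, not part of Tex itself; the function names installed_languages and is_available are hypothetical, and the folder path is whatever location holds your Tex.Wpf.exe.

```python
from pathlib import Path


def installed_languages(tessdata_dir):
    """Return the sorted language codes of every *.traineddata file in tessdata_dir."""
    folder = Path(tessdata_dir)
    # "chi_sim.traineddata" -> "chi_sim"
    return sorted(p.stem for p in folder.glob("*.traineddata"))


def is_available(code, tessdata_dir):
    """Return True if a pack for `code` (for example "jpn") exists in tessdata_dir."""
    return (Path(tessdata_dir) / f"{code}.traineddata").is_file()
```

Calling installed_languages on the tessdata folder should return the codes from the table above, including osd. A code that is missing from the result cannot be selected as Ocr.Language until its pack is restored.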

Tip — Using the right language makes a bigger difference to accuracy than any preprocessing toggle. OCR run with eng on a Japanese screenshot will produce garbage regardless of image quality.

Preprocessing Toggle

Setting — Ocr.PreprocessImage (default true). Settings → OCR → Preprocess Image.

When enabled, Tex runs the captured image through a contrast / threshold / denoise pipeline before passing it to Tesseract. This is tuned for screenshots (crisp anti-aliased text on flat backgrounds) and usually improves confidence scores by 5–20 points.
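
To make the three stages concrete, here is a toy Python sketch of a contrast, threshold, and denoise pipeline on a grayscale image stored as a list of rows (values 0 to 255). It is an illustration only: the function names, the fixed cutoff of 128, and the 3x3 median filter are assumptions, and Tex's real pipeline and parameters are not documented here.

```python
from statistics import median


def stretch_contrast(img):
    """Linearly rescale so the darkest pixel becomes 0 and the brightest 255."""
    lo = min(min(row) for row in img)
    hi = max(max(row) for row in img)
    if hi == lo:  # flat image: nothing to stretch
        return [row[:] for row in img]
    return [[round((p - lo) * 255 / (hi - lo)) for p in row] for row in img]


def threshold(img, cutoff=128):
    """Binarize: pixels at or above cutoff become 255, all others become 0."""
    return [[255 if p >= cutoff else 0 for p in row] for row in img]


def median_denoise(img):
    """Replace each pixel with the median of its 3x3 neighbourhood (edges clamped)."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [
                img[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
            ]
            row.append(median(window))
        out.append(row)
    return out


def preprocess(img):
    """Run the three stages in order: contrast, threshold, denoise."""
    return median_denoise(threshold(stretch_contrast(img)))
```

In this sketch an isolated dark speck is removed while a stroke two pixels thick survives. A 3x3 median also erases one-pixel-thin strokes, which is one reason a real pipeline has to be tuned to its input.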

Situation	Recommendation
Normal app / web screenshots	Leave on.
High-resolution documents at 100% zoom	Leave on.
Photographs of paper / whiteboards	Leave on — still helps.
Already-cleaned binary images	Try off if confidence is oddly low.

Warning — Preprocessing adds ~50–200 ms per call. If you are scripting bulk OCR and every millisecond counts, benchmark both modes on your actual input.
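
One way to run that benchmark is a small timing harness like the Python sketch below. It assumes you can wrap your own OCR invocation in two zero-argument callables, one with preprocessing on and one with it off; those callables and the names time_call and compare_modes are hypothetical, since this page does not describe a scripting interface.

```python
import time
from statistics import median


def time_call(fn, repeats=5):
    """Return the median wall-clock time of fn() in milliseconds over `repeats` runs."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return median(samples)


def compare_modes(ocr_with_preprocess, ocr_without_preprocess, repeats=5):
    """Time both callables; a positive delta_ms means preprocessing is slower."""
    on_ms = time_call(ocr_with_preprocess, repeats)
    off_ms = time_call(ocr_without_preprocess, repeats)
    return {"on_ms": on_ms, "off_ms": off_ms, "delta_ms": on_ms - off_ms}
```

The median is used rather than the mean so that a single slow run, such as a cold start, does not skew the result. Compare recognition quality on the same inputs as well, because the faster mode is only worth using if accuracy holds up.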