User Guide
04.2 · OCR and Text Extraction

Languages and Preprocessing

Language Setting

Setting: Ocr.Language (default eng). Change it in Settings → OCR → Language.

Tex ships with 29 Tesseract language packs. The .traineddata files live in the tessdata\ folder next to Tex.Wpf.exe. Only one language is active per OCR call — switch in Settings when you need another script.
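To check which packs are actually present, you can scan the tessdata folder for .traineddata files. The following is a minimal sketch, not part of Tex itself: the function name installed_languages is hypothetical, and the folder path you pass in depends on where Tex is installed on your machine.

```python
from pathlib import Path


def installed_languages(tessdata_dir):
    """Return the sorted language codes found in a tessdata folder.

    Each pack is a file named <code>.traineddata, so the code is the
    file name without its extension (for example chi_sim).
    """
    folder = Path(tessdata_dir)
    return sorted(p.stem for p in folder.glob("*.traineddata") if p.is_file())


if __name__ == "__main__":
    # Assumption: run from the folder that contains Tex.Wpf.exe.
    for code in installed_languages("tessdata"):
        print(code)
```

Note that osd shows up in this listing as well, even though it is an auxiliary pack rather than a language.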

| Script | Codes |
| --- | --- |
| Latin / Western European | eng English, fra French, deu German, spa Spanish, ita Italian, por Portuguese, nld Dutch, dan Danish, fin Finnish, hun Hungarian, nor Norwegian, swe Swedish, ces Czech, pol Polish, tur Turkish, vie Vietnamese |
| Cyrillic | rus Russian, ukr Ukrainian |
| CJK | jpn Japanese, chi_sim Chinese (Simplified), chi_tra Chinese (Traditional), kor Korean |
| Arabic / RTL | ara Arabic, heb Hebrew |
| Indic / SE Asia | hin Hindi, tha Thai |
| Classical | grc Ancient Greek, lat Latin |
| Auxiliary | osd (orientation & script detection, not a language itself) |

Tip — Using the right language makes a bigger difference to accuracy than any preprocessing toggle. OCR run with eng on a Japanese screenshot will produce garbage regardless of image quality.
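If you script against Tex, it can help to validate a language code before starting a batch. The sketch below restates the table above as a lookup. The names LANGUAGE_GROUPS and script_for are hypothetical, the group keys are shortened from the table, and nothing here is part of Tex's own API.

```python
# Language codes grouped by script, copied from the table above.
LANGUAGE_GROUPS = {
    "Latin": ["eng", "fra", "deu", "spa", "ita", "por", "nld", "dan",
              "fin", "hun", "nor", "swe", "ces", "pol", "tur", "vie"],
    "Cyrillic": ["rus", "ukr"],
    "CJK": ["jpn", "chi_sim", "chi_tra", "kor"],
    "Arabic / RTL": ["ara", "heb"],
    "Indic / SE Asia": ["hin", "tha"],
    "Classical": ["grc", "lat"],
}

# osd is shipped alongside the languages but is not one itself.
AUXILIARY = {"osd"}


def script_for(code):
    """Return the script group for a language code, or raise if unusable."""
    if code in AUXILIARY:
        raise ValueError(f"{code!r} is an auxiliary pack, not an OCR language")
    for script, codes in LANGUAGE_GROUPS.items():
        if code in codes:
            return script
    raise KeyError(f"{code!r} is not one of the shipped language packs")
```

For example, script_for("jpn") returns "CJK", which tells you that eng is the wrong setting for that screenshot.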

Preprocessing Toggle

Setting: Ocr.PreprocessImage (default true). Change it in Settings → OCR → Preprocess Image.

When enabled, Tex runs the captured image through a contrast / threshold / denoise pipeline before passing it to Tesseract. This is tuned for screenshots (crisp anti-aliased text on flat backgrounds) and usually improves confidence scores by 5–20 points.
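To make the three stages concrete, here is a generic illustration of a contrast / threshold / denoise pass on a grayscale image stored as a list of rows (values 0 to 255). This is a sketch under stated assumptions, not Tex's actual pipeline: the function names, the fixed cutoff of 128, and the 3×3 median filter are all choices made for the example.

```python
from statistics import median_high


def stretch_contrast(img):
    """Linearly rescale pixel values so the darkest is 0 and the brightest 255."""
    lo = min(min(row) for row in img)
    hi = max(max(row) for row in img)
    if hi == lo:  # flat image: nothing to stretch
        return [row[:] for row in img]
    scale = 255 / (hi - lo)
    return [[round((p - lo) * scale) for p in row] for row in img]


def threshold(img, cutoff=128):
    """Binarize: pixels at or above cutoff become white, the rest black."""
    return [[255 if p >= cutoff else 0 for p in row] for row in img]


def median_denoise(img):
    """Replace each pixel by the median of its 3x3 neighbourhood (clipped at edges)."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [img[yy][xx]
                      for yy in range(max(0, y - 1), min(h, y + 2))
                      for xx in range(max(0, x - 1), min(w, x + 2))]
            # median_high always returns an actual pixel value, so the
            # output stays strictly black or white.
            row.append(median_high(window))
        out.append(row)
    return out


def preprocess(img):
    """Run the three stages in the order named above."""
    return median_denoise(threshold(stretch_contrast(img)))
```

A median filter removes isolated specks but also erases strokes only one pixel wide, which is one reason a pipeline tuned for screenshots can hurt on input that is already clean and binary.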

| Situation | Recommendation |
| --- | --- |
| Normal app / web screenshots | Leave on. |
| High-resolution documents at 100% zoom | Leave on. |
| Photographs of paper / whiteboards | Leave on; still helps. |
| Already-cleaned binary images | Try off if confidence is oddly low. |

Warning — Preprocessing adds ~50–200 ms per call. If you are scripting bulk OCR and every millisecond counts, benchmark both modes on your actual input.
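One way to run that comparison is a small timing harness. The sketch below assumes you already have some callable that performs one OCR call and accepts a preprocess flag. Here it is called run_ocr, which is a hypothetical wrapper you would write yourself, not a function Tex provides.

```python
import time
from statistics import median


def compare_modes(run_ocr, images, repeats=5):
    """Time run_ocr with preprocessing on and off.

    run_ocr: hypothetical callable taking (image, preprocess=bool).
    Returns {True: median_ms, False: median_ms} over all calls.
    """
    timings = {}
    for preprocess in (True, False):
        samples = []
        for _ in range(repeats):
            for image in images:
                start = time.perf_counter()
                run_ocr(image, preprocess=preprocess)
                samples.append((time.perf_counter() - start) * 1000.0)
        # The median is less sensitive to one-off stalls than the mean.
        timings[preprocess] = median(samples)
    return timings
```

Feed it a representative sample of your real input rather than a single test image, since the cost of preprocessing depends on image size and content.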