Install Tesseract on Windows with a GUI


I decided that, to achieve the best accuracy, I should train Tesseract with images preprocessed in exactly the same way as they would be in the final application. You need one or more files that together contain at least one (but preferably more) occurrence of each glyph of your font. The input files must be named according to the Tesseract convention.

In the box-generation command, the first two parameters are the input and output file names (remember to change them accordingly). They are followed by the config files ("batch.nochop" and "makebox"), which tell Tesseract what to do. You can find all available config files in $TESSERACT_INSTALATION_DIR/tessdata/configs/ and $TESSERACT_INSTALATION_DIR/tessdata/tessconfigs/ (there is also a list of the parameters you can use in the config files).

makebox – tells Tesseract to (only) generate box files.
batch.nochop – tells Tesseract not to use its fancy algorithms for segmenting the picture.

If your files contain letters in a grid, you should use batch.nochop; otherwise you may want to remove it from the command.

Open each file (the image file, not the *.box file that you generated) with qt-box-editor and correct Tesseract wherever it made mistakes (if it made none, you probably don't have to train it).

Time to train Tesseract to recognize letters properly

Now we are going to generate the *.traineddata file, which can later be loaded into Tesseract so that it recognizes characters the way we want. There is one more important thing to remember before you go further: if you are using Windows, make sure all the files you use have UNIX-style line endings! If you are editing them manually, you can convert them in Notepad++ with Edit -> EOL Conversion.

You will need to customize the script to meet your needs. Do not run it now; read it carefully.

N=13 # Change this according to the number of files that you want to feed to Tesseract, or export it as a script parameter.
# Uncomment this line if you're rerunning the script
#rm pol.pffmtable pol.shapetable pol.traineddata pol.unicharset unicharset font_properties pol.inttemp pol.normproto *.tr *.txt
unicharset_extractor `wrap $N "" ".box"`
echo "ocrb 0 0 1 0 0" > font_properties # tell Tesseract information about the font
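The script fragment above calls a helper named wrap whose definition is not shown, and the box-generation step is described only in prose. Below is a minimal sketch of both. It assumes that wrap simply expands to a space-separated list of numbered file names, and that the images are numbered 1..N. The function make_boxes, the TESSERACT_BIN variable, the default "tif" extension and the "myfont" prefix are all hypothetical names introduced for illustration; they do not come from the original script.

```shell
# Sketch only: `wrap` is not defined in the fragment shown, so this is a guess
# at its behavior. It prints PREFIX<i>SUFFIX for i = 1..N on a single line.
wrap() (
    n=$1; prefix=$2; suffix=$3
    i=1; out=""
    while [ "$i" -le "$n" ]; do
        # Append the next name, adding a separating space after the first one.
        out="$out${out:+ }$prefix$i$suffix"
        i=$((i + 1))
    done
    printf '%s\n' "$out"
)

# Generate box files for images PREFIX1.EXT .. PREFIXN.EXT.
# TESSERACT_BIN and the default "tif" extension are assumptions of this sketch.
make_boxes() (
    n=$1; prefix=$2; ext=${3:-tif}
    i=1
    while [ "$i" -le "$n" ]; do
        # Arguments: input file, output base name, then the two config files.
        "${TESSERACT_BIN:-tesseract}" "$prefix$i.$ext" "$prefix$i" batch.nochop makebox || exit 1
        i=$((i + 1))
    done
)

# Example (hypothetical prefix "myfont"):
#   make_boxes 13 myfont
#   unicharset_extractor $(wrap 13 myfont .box)
```

Running the functions in subshells keeps their loop variables from leaking into the calling script, which matters if the script also uses N or i for its own purposes.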
Cygwin – if you are using Windows (or you can rewrite the scripts from this article as Windows batch files).
Qt-box-editor – this is the only GUI program you're going to need. It is used to fix the boxes generated by Tesseract and to ensure we feed the right data into it. On Windows I used version 1.08; the newer ones are, for some reason, not packaged with all the needed libraries, which makes the installation more difficult.

First, you must prepare the data that you want to feed into Tesseract. In my case the font was OCR-B, a font that is used on ID cards in Poland.


Unfortunately, Cédric Verstraeten's article is a little outdated and leaves out some details. In this article I will try to explain the process step by step.

What do we need before we begin?

First, you need to install tesseract-ocr (this tutorial is based on version 3.02). Do not forget to add the installation directory to your system path (the installer may not do it).
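A quick way to confirm that the installation directory really is on the path is to ask the shell whether it can find the program. The sketch below is meant for a Cygwin (or other POSIX) shell. The function names on_path and check_tesseract are made up for this example, and the snippet only reports the result; it does not modify PATH.

```shell
# Return success if the named program can be found on PATH.
on_path() {
    command -v "$1" >/dev/null 2>&1
}

# Report whether Tesseract is reachable from this shell.
check_tesseract() {
    if on_path tesseract; then
        echo "tesseract found at: $(command -v tesseract)"
    else
        echo "tesseract not found; add its installation directory to PATH" >&2
        return 1
    fi
}

# Example:
#   check_tesseract
```

If the check fails, add the directory containing the tesseract executable to PATH and open a new shell before trying again.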


Tesseract is very good at recognizing multiple languages and fonts. It can be used as a command-line program or as an embedded library in a custom application. We used it to develop an application that automatically reads data from ID cards. It worked well and we did not spend much time on development, but we had problems recognizing specific letters (it mixed up W and H, and O and 0 (zero)). So we had to train Tesseract to read this font properly.

Looking for a way to do this, I came across a couple of articles suggesting third-party GUI applications, but I ran into many problems customizing them and still didn't meet my goals. Luckily, I found a great article by Cédric Verstraeten, which helped me do it the old-fashioned command-line way.