Tesseract
Description
The OCR: Tesseract plugin step detects text in an image and extracts it as readable text. Supported image types: BMP, PNG, JPG, JPEG. Compatibility: Tesseract version 4.0.0.
Prerequisites:
- Download tessdata (tesseract-ocr) version 4.0.0 from https://github.com/tesseract-ocr/tessdata
- After downloading, extract it to a folder on the processing machine. You will need to specify this folder in the ‘Data Folder Path’ field of the step.
- Install Microsoft Visual C++ Redistributable for Visual Studio 2015, 2017, and 2019 (32-bit and 64-bit).
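
The data folder prerequisite can be checked before the step runs. The sketch below is illustrative only: the function name `missing_traineddata` is hypothetical and not part of the plugin, and it assumes the extracted tessdata folder holds one `<code>.traineddata` file per language, as in the tessdata repository.

```python
from pathlib import Path


def missing_traineddata(data_folder, language_code="eng"):
    """Return the language codes whose .traineddata file is absent from data_folder.

    Hypothetical helper for illustration; not part of the plugin step.
    """
    folder = Path(data_folder)
    if not folder.is_dir():
        raise NotADirectoryError(f"Data folder not found: {folder}")
    # Multiple languages are joined with '+', for example "eng+hin".
    codes = [code.strip() for code in language_code.split("+") if code.strip()]
    return [code for code in codes if not (folder / f"{code}.traineddata").is_file()]
```

An empty list means every requested language has its data file in the folder that will be entered as the Data Folder Path.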
Configurations
No. | Field Name | Description |
---|---|---|
1 | Step Name | Name of the step. This name must be unique within a workflow. |
Input Fields | ||
1 | Data Folder Path | Specify the Tesseract data folder path, or click the Browse button to select it (this is the folder described in the prerequisites). The data type is String. This field is mandatory. |
2 | Button: Browse | Clicking this button opens a dialog to browse for the Tesseract data folder. |
3 | File Path | Specify the path of the input image file from which to extract text, or browse for the file. Note: Supported image types are BMP, PNG, JPG, and JPEG. The data type is String. This field is mandatory. |
4 | Button: Browse | Clicking this button opens a dialog to browse for the image file. |
5 | Language Code | Specify the language (for example, eng for English, hin for Hindi, urd for Urdu). To extract text in multiple languages, join the language codes with a ‘+’ sign (for example, eng+hin). For the list of language codes, refer to https://muthu.co/all-tesseract-ocr-options/ The default value is eng. The data type is String. |
6 | Page Segment Mode | Select the Page Segmentation Mode appropriate for the input file. Allowed values are 0-13. The data type is String. Refer to the table below for the list of Page Segmentation Modes with descriptions. |
Output Field | ||
1 | Output Text | Specify an output field to hold the extracted text on successful plugin execution. The default value is OutputText. |
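
As a rough illustration of the File Path and Language Code rules in the table above, the sketch below checks an image extension against the supported types and joins several language codes with ‘+’. The helper names `is_supported_image` and `build_language_code` are hypothetical and are not part of the plugin.

```python
from pathlib import Path

# Image types listed as supported by the step.
SUPPORTED_EXTENSIONS = {".bmp", ".png", ".jpg", ".jpeg"}


def is_supported_image(file_path):
    """Return True if the file extension is one of the supported image types."""
    return Path(file_path).suffix.lower() in SUPPORTED_EXTENSIONS


def build_language_code(languages):
    """Join language codes with '+'; fall back to the default 'eng' when none are given."""
    codes = [code.strip() for code in languages if code.strip()]
    return "+".join(codes) if codes else "eng"
```

For example, `build_language_code(["eng", "hin"])` yields the string `eng+hin`, which is the form expected in the Language Code field.
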
Sr. No. | Page Segment Mode | Description |
---|---|---|
1 | 0 | Orientation and script detection (OSD) only. |
2 | 1 | Automatic page segmentation with OSD. |
3 | 2 | Automatic page segmentation, but no OSD, or OCR. |
4 | 3 | Fully automatic page segmentation, but no OSD. (Default) |
5 | 4 | Assume a single column of text of variable sizes. |
6 | 5 | Assume a single uniform block of vertically aligned text. |
7 | 6 | Assume a single uniform block of text. |
8 | 7 | Treat the image as a single text line. |
9 | 8 | Treat the image as a single word. |
10 | 9 | Treat the image as a single word in a circle. |
11 | 10 | Treat the image as a single character. |
12 | 11 | Sparse text. Find as much text as possible in no particular order. |
13 | 12 | Sparse text with OSD. |
14 | 13 | Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific. |
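
For readers who want to try the same settings outside the workflow, the step's fields correspond roughly to options of the Tesseract command-line tool (`--tessdata-dir`, `-l`, `--psm`). The sketch below only builds such a command; it is an assumption about an equivalent invocation, not a description of how the plugin step calls Tesseract internally, and the function name `build_tesseract_command` is hypothetical.

```python
def build_tesseract_command(file_path, data_folder, language_code="eng", page_segment_mode="3"):
    """Build a Tesseract CLI argument list from the step's field values.

    Hypothetical helper; the plugin's own invocation is not documented here.
    """
    mode = int(page_segment_mode)  # the field is a String, so convert before checking
    if mode < 0 or mode > 13:
        raise ValueError(f"Page Segment Mode must be between 0 and 13, got {mode}")
    return [
        "tesseract", file_path, "stdout",  # "stdout" prints the text instead of writing a file
        "--tessdata-dir", data_folder,
        "-l", language_code,
        "--psm", str(mode),
    ]
```

The returned list can be passed to `subprocess.run` on a machine where the `tesseract` executable is installed and on the PATH.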