Tesseract Text Extraction
This task uses Tesseract OCR 5.x to extract text from images and PDF files and saves that text to a field on the repository document called simflofy_ai_texts. Supported formats are .png, .jpg, .pdf, .tiff, .gif, and .bmp. Text from PDFs is saved on a per-page basis to the configured field.
Note: PREREQUISITE
This task requires Tesseract OCR 5.X to be installed on the system that 3Sixty is running on.
Note: Tesseract OCR will be an optional dependency of 3Sixty.
- Windows Install: https://tesseract-ocr.github.io/tessdoc/Downloads.html
- Mac Install: https://formulae.brew.sh/formula/tesseract (command line: brew install tesseract)
- Ubuntu Install: https://ubuntuhandbook.org/index.php/2021/12/install-tesseract-ocr-5-ubuntu/
Once Tesseract is installed and the task is added to the job, you can check the box to attach text as metadata in the task and enter the field name for the extracted content. We recommend using "content" if it does not conflict with other metadata fields in your run.
For a walkthrough of how to install Tesseract on Windows, watch this video: How to Install and Use Tesseract OCR on Windows
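Before adding the task to a job, it can help to confirm that a Tesseract 5.x binary is reachable from the machine running 3Sixty. The following is a minimal sketch that assumes the executable is named `tesseract` and is on the PATH; the helper names `parse_major_version` and `installed_tesseract_major` are hypothetical and not part of 3Sixty.

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_major_version(version_output: str) -> Optional[int]:
    """Return the major version found in `tesseract --version` output, or None."""
    # Matches banners such as "tesseract 5.3.0" and "tesseract v5.3.0.20221214".
    match = re.search(r"tesseract\s+v?(\d+)\.", version_output)
    return int(match.group(1)) if match else None


def installed_tesseract_major() -> Optional[int]:
    """Return the installed Tesseract major version, or None if it is not on PATH."""
    exe = shutil.which("tesseract")  # assumption: the binary is named "tesseract"
    if exe is None:
        return None
    result = subprocess.run([exe, "--version"], capture_output=True, text=True)
    # Some releases print the banner to stderr, so check both streams.
    return parse_major_version(result.stdout + result.stderr)


if __name__ == "__main__":
    major = installed_tesseract_major()
    if major is None:
        print("Tesseract was not found on PATH")
    elif major < 5:
        print(f"Found Tesseract {major}.x; this task expects 5.x")
    else:
        print(f"Found Tesseract {major}.x")
```

If the script reports that Tesseract was not found, check that the install directory has been added to the PATH of the account that runs 3Sixty.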
Configuration
Engine Mode
Select which engine Tesseract should use, Legacy or LSTM. Ensure that the engine is installed before selecting it, or leave the setting on its default so that Tesseract detects the available engine.
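For reference, the Tesseract command line selects the engine with its `--oem` option. The sketch below lists the values that option accepts; how 3Sixty maps its Engine Mode setting onto them is not documented here, so treat the `EngineMode` enum and `oem_args` helper as hypothetical illustrations of the underlying CLI only.

```python
from enum import IntEnum
from typing import List


class EngineMode(IntEnum):
    """Values accepted by the Tesseract CLI's --oem option."""
    LEGACY = 0           # legacy engine only
    LSTM = 1             # neural-net LSTM engine only
    LEGACY_AND_LSTM = 2  # legacy and LSTM engines combined
    DEFAULT = 3          # chosen from what the trained data supports


def oem_args(mode: EngineMode) -> List[str]:
    """Build the command-line arguments that select an engine mode."""
    return ["--oem", str(int(mode))]


if __name__ == "__main__":
    print(oem_args(EngineMode.LSTM))
```

The legacy engine only works with trained data files that include legacy models, which is why the engine should be confirmed as installed before it is selected.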
Page Segmentation Mode
By default Tesseract expects a page of text. You can change the way it segments a page if your images differ from this.
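As an illustration of what the segmentation setting controls, the Tesseract command line exposes it as `--psm` with values 0 to 13. The sketch below lists a few commonly used modes and builds a CLI call; the `PAGE_SEG_MODES` table and `tesseract_command` helper are hypothetical, and the values the 3Sixty field itself accepts are an assumption to verify in the task UI.

```python
from typing import List

# A subset of the modes listed by `tesseract --help-psm`.
PAGE_SEG_MODES = {
    3: "Fully automatic page segmentation, no OSD (Tesseract default)",
    4: "Single column of text of variable sizes",
    6: "Single uniform block of text",
    7: "Single text line",
    8: "Single word",
    11: "Sparse text, in no particular order",
}


def tesseract_command(image_path: str, psm: int = 3) -> List[str]:
    """Build a CLI call that prints recognized text to stdout."""
    if not 0 <= psm <= 13:
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    return ["tesseract", image_path, "stdout", "--psm", str(psm)]


if __name__ == "__main__":
    # A cropped image holding one line of text is a typical case for mode 7.
    print(tesseract_command("label.png", psm=7))
```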
Tesseract Language Code
The language code for the installed trained data in your tessdata directory. This is in ISO 639-2/T (three-letter) format and is the part of the file name before the .traineddata extension, for example eng for eng.traineddata.
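To see which codes are available on a given machine, the trained data file names can be listed directly. This is a minimal sketch; the `installed_language_codes` helper is hypothetical and the tessdata path is a placeholder that depends on how Tesseract was installed.

```python
from pathlib import Path
from typing import List


def installed_language_codes(tessdata_dir: str) -> List[str]:
    """Return the codes of all *.traineddata files in a tessdata directory."""
    # Path.stem drops the final extension: "eng.traineddata" -> "eng".
    # Note that osd.traineddata (orientation and script detection) is also
    # listed if present, although it is not a language.
    return sorted(p.stem for p in Path(tessdata_dir).glob("*.traineddata"))


if __name__ == "__main__":
    # Placeholder path; substitute the tessdata directory of your install.
    print(installed_language_codes("/usr/share/tesseract-ocr/5/tessdata"))
```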
Use HOCR
Whether to use hOCR. When enabled, text is output in hOCR format (HTML markup that carries layout information) rather than as raw text.
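When hOCR is enabled, downstream consumers that only need the words have to strip the markup again. The sketch below shows one way to do that with the Python standard library, assuming the stored value is standard Tesseract hOCR in which each word is a `span` with class `ocrx_word` and no nested spans. The `HocrWordParser` and `hocr_to_text` names are hypothetical.

```python
from html.parser import HTMLParser
from typing import List


class HocrWordParser(HTMLParser):
    """Collect the text of every ocrx_word element in an hOCR document."""

    def __init__(self) -> None:
        super().__init__()
        self.words: List[str] = []
        self._in_word = False

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class") or ""
        if tag == "span" and "ocrx_word" in css_class:
            self._in_word = True
            self.words.append("")

    def handle_endtag(self, tag):
        # Assumption: word spans do not contain further nested spans.
        if tag == "span":
            self._in_word = False

    def handle_data(self, data):
        if self._in_word:
            self.words[-1] += data


def hocr_to_text(hocr: str) -> str:
    """Return the recognized words of an hOCR document, separated by spaces."""
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return " ".join(w.strip() for w in parser.words if w.strip())


if __name__ == "__main__":
    sample = (
        "<div class='ocr_page'><span class='ocr_line' title='bbox 0 0 10 10'>"
        "<span class='ocrx_word' title='bbox 1 1 5 5; x_wconf 96'>Hello</span> "
        "<span class='ocrx_word' title='bbox 6 1 10 5; x_wconf 93'>world</span>"
        "</span></div>"
    )
    print(hocr_to_text(sample))  # prints "Hello world"
```

This discards line and paragraph structure; the `title` attributes hold bounding boxes and confidence values if layout needs to be preserved.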
Examples
API Keys
Processor: tesseractTextExtraction
Key | Display Name | Type |
---|---|---|
use_condition | Check a condition before executing this task. | Boolean |
task_condition | Condition | String |
task_stop_proc | Stop Processing | Boolean |
tesseract_field | Metadata field for extracted text | String |
tesseract_engine_mode | Engine Mode | String |
tesseract_page_seg_mode | Page Segmentation Mode | String |
tesseract_lang | Tesseract Language Code (ISO 639-1/T) | String |
tesseract_use_hocr | Use HOCR? | Boolean |
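The keys and types above can be checked before a task definition is submitted. The sketch below is illustrative only: the key names and types come from the table, but every value in `example_config` is a placeholder (in particular, the strings that 3Sixty accepts for the engine and page segmentation modes are assumptions), and `validate_task_config` is a hypothetical helper rather than part of the 3Sixty API.

```python
from typing import Any, Dict, List

# Key names and types taken from the API Keys table.
EXPECTED_TYPES = {
    "use_condition": bool,
    "task_condition": str,
    "task_stop_proc": bool,
    "tesseract_field": str,
    "tesseract_engine_mode": str,
    "tesseract_page_seg_mode": str,
    "tesseract_lang": str,
    "tesseract_use_hocr": bool,
}


def validate_task_config(config: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the types all match."""
    problems = []
    for key, expected in EXPECTED_TYPES.items():
        if key not in config:
            problems.append(f"missing key: {key}")
        elif not isinstance(config[key], expected):
            problems.append(f"{key} should be {expected.__name__}")
    return problems


# Placeholder values for illustration; confirm the accepted strings in the UI.
example_config = {
    "use_condition": False,
    "task_condition": "",
    "task_stop_proc": False,
    "tesseract_field": "content",
    "tesseract_engine_mode": "3",    # assumption: value format not documented
    "tesseract_page_seg_mode": "3",  # assumption: value format not documented
    "tesseract_lang": "eng",
    "tesseract_use_hocr": False,
}

if __name__ == "__main__":
    print(validate_task_config(example_config))
```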