Tesseract Text Extraction

This task uses Tesseract OCR 5.X to extract text from images and PDF files and saves that text to a field in the repository document called simflofy_ai_texts. Supported formats are .png, .jpg, .pdf, .tiff, .gif, and .bmp. PDFs are saved on a per-page basis to the configured field.

Note:  PREREQUISITE
This task requires Tesseract OCR 5.X to be installed on the system that 3Sixty is running on.

Note:  Tesseract OCR will be an optional dependency of 3Sixty.

Once Tesseract is installed and the task is added to the job, you can check the box to attach the extracted text as metadata and enter the field name for the extracted content. We recommend using "content" if it does not conflict with other metadata fields in your run.

For a walkthrough of how to install Tesseract on Windows, watch this video: How to Install and Use Tesseract OCR on Windows
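
After installation, it can help to confirm that a 5.X build is visible on the machine running 3Sixty. The following is a minimal sketch, assuming the tesseract executable is on the system PATH; the function names are hypothetical and are not part of 3Sixty.

```python
import re
import shutil
import subprocess


def parse_major_version(version_output):
    """Return the major version found in `tesseract --version` output, or None."""
    # Some Windows builds print "tesseract v5.x", others "tesseract 5.x".
    match = re.search(r"tesseract\s+v?(\d+)", version_output)
    return int(match.group(1)) if match else None


def installed_tesseract_major():
    """Return the installed major version, or None if tesseract is not on PATH."""
    exe = shutil.which("tesseract")
    if exe is None:
        return None
    result = subprocess.run([exe, "--version"], capture_output=True, text=True)
    # Older builds print the version banner to stderr, so check both streams.
    return parse_major_version(result.stdout + result.stderr)
```

Calling installed_tesseract_major() returns None when no executable is found; a result below 5 suggests the installed build is older than this task requires.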


Configuration

Engine Mode

Select which engine Tesseract should use: legacy or LSTM. Ensure that the engine is installed before selecting it, or leave the setting on the default so that Tesseract detects the available engine.
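
For reference, the Tesseract command line exposes the same choice through its --oem option. The sketch below lists those values; how 3Sixty maps its Engine Mode setting onto them is an assumption, and the class and function names are hypothetical.

```python
from enum import IntEnum


class OcrEngineMode(IntEnum):
    """Values accepted by the Tesseract command-line option --oem."""
    LEGACY_ONLY = 0
    LSTM_ONLY = 1
    LEGACY_AND_LSTM = 2
    DEFAULT = 3  # Tesseract picks an engine based on what is available


def oem_args(mode):
    """Return the command-line arguments that select the given engine mode."""
    return ["--oem", str(int(mode))]
```

The legacy engine only works when the installed trained data file contains a legacy model, which is why leaving the default in place is the safer choice when you are unsure.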

Page Segmentation Mode

By default Tesseract expects a page of text. You can change the way it segments a page if your images differ from this.
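
As an illustration, the Tesseract command line selects the segmentation mode with --psm, which accepts values from 0 to 13. The sketch below lists a few commonly used values and builds a command for one of them; it is not 3Sixty's own invocation, and build_command is a hypothetical helper.

```python
# A subset of the modes listed by `tesseract --help-psm`.
PAGE_SEG_MODES = {
    3: "Fully automatic page segmentation, no OSD (Tesseract default)",
    4: "Single column of text of variable sizes",
    6: "Single uniform block of text",
    7: "Single text line",
    8: "Single word",
    11: "Sparse text, in no particular order",
}


def build_command(image_path, psm=3):
    """Build a tesseract command that writes recognized text to stdout."""
    if not 0 <= psm <= 13:
        raise ValueError("psm must be between 0 and 13")
    return ["tesseract", image_path, "stdout", "--psm", str(psm)]
```

For example, a cropped image containing one line of text would usually call for mode 7 rather than the default page mode.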

Tesseract Language Code

The language code for the installed trained data in your tessdata directory. These are generally three-letter codes (ISO 639-2/T), such as eng for English, and match the letters before the .traineddata extension of the trained data file.
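
The relationship between file names and language codes can be sketched as follows. This is a minimal example, assuming you know the path to your tessdata directory; the function names are hypothetical.

```python
from pathlib import Path

SUFFIX = ".traineddata"


def language_code(traineddata_file):
    """Return the language code for a trained data file name or path."""
    name = Path(traineddata_file).name
    if not name.endswith(SUFFIX):
        raise ValueError(f"not a trained data file: {name}")
    return name[: -len(SUFFIX)]


def installed_languages(tessdata_dir):
    """List the codes of all trained data files in a tessdata directory."""
    files = Path(tessdata_dir).glob("*" + SUFFIX)
    return sorted(language_code(f.name) for f in files)
```

A file named eng.traineddata therefore corresponds to the code eng. Running tesseract --list-langs on the command line reports the same information for the default tessdata directory.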

Use HOCR

Whether to use hOCR output. When enabled, text is output as hOCR (an HTML-based format) rather than as raw text.
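
If a downstream step needs plain text back from hOCR output, the words can be recovered from the markup. The sketch below assumes the usual hOCR convention in which each recognized word is wrapped in a span with the class ocrx_word; the sample string is illustrative and is not real task output.

```python
from html.parser import HTMLParser


class HocrWordParser(HTMLParser):
    """Collect the text of every span marked with the ocrx_word class."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._in_word = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            self._in_word = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_word:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._in_word:
            word = "".join(self._buffer).strip()
            if word:
                self.words.append(word)
            self._in_word = False


def hocr_to_text(hocr):
    """Return the recognized words from an hOCR string, joined by spaces."""
    parser = HocrWordParser()
    parser.feed(hocr)
    return " ".join(parser.words)


# Illustrative hOCR fragment (hypothetical, hand-written).
sample = (
    "<div class='ocr_page'><span class='ocr_line' title='bbox 0 0 100 20'>"
    "<span class='ocrx_word' title='bbox 0 0 40 20; x_wconf 96'>Hello</span> "
    "<span class='ocrx_word' title='bbox 50 0 100 20; x_wconf 93'>world</span>"
    "</span></div>"
)
```

The title attribute of each word carries its bounding box and confidence, which is the main reason to prefer hOCR over raw text when layout matters.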


Examples


API Keys

Processor: tesseractTextExtraction

Key                       Display Name                                    Type
use_condition             Check a condition before executing this task.   Boolean
task_condition            Condition                                       String
task_stop_proc            Stop Processing                                 Boolean
tesseract_field           Metadata field for extracted text               String
tesseract_engine_mode     Engine Mode                                     String
tesseract_page_seg_mode   Page Segmentation Mode                          String
tesseract_lang            Tesseract Language Code (ISO 639-1/T)           String
tesseract_use_hocr        Use HOCR?                                       Boolean
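
When building a task configuration programmatically, the keys and types above can be checked before the configuration is submitted. The following is a sketch only: the key names and types come from the table, while the example values (in particular the format of the engine mode and page segmentation mode strings) are assumptions, and check_task_config is a hypothetical helper rather than part of the 3Sixty API.

```python
# Expected Python type for each API key, taken from the table above.
EXPECTED_TYPES = {
    "use_condition": bool,
    "task_condition": str,
    "task_stop_proc": bool,
    "tesseract_field": str,
    "tesseract_engine_mode": str,
    "tesseract_page_seg_mode": str,
    "tesseract_lang": str,
    "tesseract_use_hocr": bool,
}


def check_task_config(config):
    """Return a list of problems found in a task configuration dict."""
    problems = []
    for key, value in config.items():
        expected = EXPECTED_TYPES.get(key)
        if expected is None:
            problems.append(f"unknown key: {key}")
        elif not isinstance(value, expected):
            problems.append(f"{key} should be {expected.__name__}")
    return problems


# Example values are illustrative assumptions, not documented defaults.
example_config = {
    "use_condition": False,
    "task_stop_proc": False,
    "tesseract_field": "content",
    "tesseract_engine_mode": "3",
    "tesseract_page_seg_mode": "3",
    "tesseract_lang": "eng",
    "tesseract_use_hocr": False,
}
```

An empty list from check_task_config(example_config) means every key is recognized and has the expected type; it does not confirm that the values themselves are accepted by the task.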