Tika Text Extraction

Apache Tika is an open-source tool used to extract text from documents. 3Sixty most commonly uses it to extract text during indexing for federated search.

The Tika Text Extractor requires Tesseract OCR 5.x.

Once installed, check the box to attach text as metadata and enter the field name for the extracted content. We recommend using "content" if it does not conflict with other metadata fields in your run.

Note:  For this feature to work on larger files, a memory pool setting of at least 4GB is required; 8GB is recommended. This can be updated in the Java tab of your Apache Tomcat Properties window.

Important:  The file size limit is 95MB.


Configuration

To use this task, go to the task tab in your job. Select the task from the drop-down and click the plus circle to configure it. After making any changes, click done to save them.

Condition check

The task executes when the condition's result is 'true', 't', 'on', '1', or 'yes' (case-insensitive). If the condition is left empty, the task runs for every document. The condition is evaluated for each document to determine whether the task should run on that document.

Example: To run this task only for PDF documents, use the expression: equals('#{rd.mimetype}',"application/pdf")
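
The truthy-value rule described above can be illustrated with a short Python sketch. This is not 3Sixty's implementation; the function name should_run_task and the use of None or an empty string to represent an empty condition are assumptions made for illustration only.

```python
# Values treated as "true" according to the description above.
TRUTHY_RESULTS = {"true", "t", "on", "1", "yes"}


def should_run_task(condition_result):
    """Return True when the task would run for a document.

    condition_result is the evaluated condition as a string. None or ""
    stands in for a condition that was left empty (an assumption of this
    sketch), in which case the task runs for every document.
    """
    if not condition_result:
        return True
    # The comparison is case-insensitive.
    return condition_result.lower() in TRUTHY_RESULTS
```

With this sketch, should_run_task("YES") and should_run_task("") both return True, while should_run_task("false") returns False.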

Tika Content Field

This is the field in which 3Sixty stores the content it extracts from the document. The default field is content.

Max Content Length (B)

Sets the maximum content length, in bytes, which is checked before processing. The job will not process documents over this size. Set to 0 to process documents of any length.

File Extensions to Extract

A comma-delimited list of file extensions to process. Leave blank to process all extensions. The extensions are checked at the same time as the content length.
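
The combined pre-check on content length and file extension might look like the following Python sketch. The function name passes_precheck and its parameters are hypothetical, and matching extensions case-insensitively while tolerating a leading dot is an assumption that the description above does not confirm.

```python
import os


def passes_precheck(file_name, size_bytes, max_length=0, extensions=""):
    """Return True when a document would go on to text extraction.

    max_length is in bytes; 0 means no size limit.
    extensions is a comma-delimited list; blank means all extensions.
    """
    # Documents over the maximum length are not processed.
    if max_length > 0 and size_bytes > max_length:
        return False

    # Normalize the list (assumption: case-insensitive, leading dot optional).
    allowed = {
        ext.strip().lower().lstrip(".")
        for ext in extensions.split(",")
        if ext.strip()
    }
    if not allowed:
        return True  # blank list: process every extension

    file_ext = os.path.splitext(file_name)[1].lstrip(".").lower()
    return file_ext in allowed
```

For example, passes_precheck("report.pdf", 2000, max_length=1000) returns False because the document is over the limit, while passes_precheck("report.pdf", 2000) returns True because both settings are left at their "process everything" values.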

Fail Document on Extraction Error

Fail the document if an extraction error occurs during processing.

Remove Content After Extraction

Remove the content from the documents. This will happen even if the document exceeds the maximum length.


Examples

The following example will extract all of the content in the documents while processing the integration job.


API Keys

Processor: tikaExtractorTask

Key                  Display Name                                   Type

use_condition        Check a condition before executing this task.  Boolean
task_condition       Condition                                      String
task_stop_proc       Stop Processing                                Boolean
tejt_field_to_mark   Tika Content Field                             String
tejt_max_length      Max Content Length (B)                         Long
tejt_etp             File Extensions to Extract                     String
tejt_fail_on_error   Fail Document on Extraction Error              Boolean
tejt_rm_bin          Remove Content After Extraction                Boolean
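
As a rough illustration of how these keys fit together, the Python sketch below collects one value per key and checks each against the type listed above. The dictionary layout, the validate helper, and the chosen values are assumptions for illustration; the format 3Sixty actually expects for task settings is not described here. The values correspond to extracting content from every document with no condition, size limit, or extension filter.

```python
PROCESSOR = "tikaExtractorTask"

# Python type expected for each key, following the Type column (Long as int).
EXPECTED_TYPES = {
    "use_condition": bool,
    "task_condition": str,
    "task_stop_proc": bool,
    "tejt_field_to_mark": str,
    "tejt_max_length": int,
    "tejt_etp": str,
    "tejt_fail_on_error": bool,
    "tejt_rm_bin": bool,
}

# Hypothetical settings: extract text from every document.
settings = {
    "use_condition": False,           # no condition check
    "task_condition": "",             # empty: run for all documents
    "task_stop_proc": False,
    "tejt_field_to_mark": "content",  # default Tika Content Field
    "tejt_max_length": 0,             # 0: any length
    "tejt_etp": "",                   # blank: all extensions
    "tejt_fail_on_error": False,
    "tejt_rm_bin": False,
}


def validate(candidate):
    """Return the keys that are missing or whose value has the wrong type."""
    return [
        key
        for key, expected in EXPECTED_TYPES.items()
        if type(candidate.get(key)) is not expected
    ]
```

Calling validate(settings) returns an empty list for the values shown; passing "0" as a string for tejt_max_length would be reported as a mismatch.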