Tika Text Extraction
Apache Tika is an open-source tool used to extract text from documents. 3Sixty most commonly uses it to extract text during indexing for federated search.
Tika Text Extractor requires Tesseract OCR 5.X
- Windows Install: https://tesseract-ocr.github.io/tessdoc/Downloads.html
- Mac Install: https://formulae.brew.sh/formula/tesseract (command line: brew install tesseract)
- Ubuntu: https://ubuntuhandbook.org/index.php/2021/12/install-tesseract-ocr-5-ubuntu/
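Before enabling the task, it can help to confirm that a 5.x build of Tesseract is actually on the PATH of the machine running 3Sixty. The following is a minimal sketch, not part of 3Sixty: the function names are hypothetical, and it assumes the first line of `tesseract --version` starts with `tesseract` followed by the version number.

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_tesseract_major(version_output: str) -> Optional[int]:
    """Return the major version found in `tesseract --version` output, or None."""
    # Assumed shape of the first line: "tesseract 5.3.0" or "tesseract v5.0.0-alpha...".
    match = re.search(r"tesseract\s+v?(\d+)\.", version_output, re.IGNORECASE)
    return int(match.group(1)) if match else None


def installed_tesseract_major() -> Optional[int]:
    """Run `tesseract --version` if the binary is on PATH and parse the result."""
    if shutil.which("tesseract") is None:
        return None
    result = subprocess.run(
        ["tesseract", "--version"], capture_output=True, text=True, check=False
    )
    # Some builds write the version banner to stderr, so both streams are checked.
    return parse_tesseract_major(result.stdout + result.stderr)


if __name__ == "__main__":
    major = installed_tesseract_major()
    if major is None:
        print("Tesseract was not found on PATH.")
    elif major >= 5:
        print(f"Tesseract {major}.x found.")
    else:
        print(f"Tesseract {major}.x found; 5.x is required.")
```

Run the script as the same user that runs Tomcat, since the PATH seen by the service may differ from an interactive shell.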
Once Tesseract is installed, check the box to attach text as metadata and enter the field name for the extracted content. We recommend using "content" if it does not conflict with other metadata fields in your run.
Note: For this feature to work on larger files, a memory pool setting of at least 4GB is required; 8GB is recommended. This can be updated in the Java tab of your Apache Tomcat Properties window.
Important: The file size limit is 95MB.
Configuration
To use this task, go to the task tab in your job. Select the task from the drop-down and click the plus circle to configure it. After making any changes, click Done to save them.
Condition check
The task executes when the condition's result is 'true', 't', 'on', '1', or 'yes' (case-insensitive). If the condition is left empty, the task runs for every document. The condition is evaluated for each document to determine whether the task should be executed for it.
Example: To run this task only for PDF documents, use the expression: equals('#{rd.mimetype}',"application/pdf")
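The rule for interpreting a condition result can be sketched as follows. This is an illustration only, not 3Sixty's implementation: `evaluate` is a hypothetical stand-in for the expression engine, and trimming surrounding whitespace is an assumption.

```python
from typing import Callable

# Results that count as "true", compared case-insensitively.
TRUTHY_RESULTS = frozenset({"true", "t", "on", "1", "yes"})


def is_truthy(result: str) -> bool:
    """Return True if an evaluated condition result means 'run the task'."""
    # Assumption: surrounding whitespace is ignored.
    return result.strip().lower() in TRUTHY_RESULTS


def should_run_task(expression: str, evaluate: Callable[[str], str]) -> bool:
    """Decide per document whether the task runs.

    `evaluate` is a hypothetical callable that resolves the expression
    against the current document and returns its result as a string.
    """
    # An empty condition means the task runs for every document.
    if not expression.strip():
        return True
    return is_truthy(evaluate(expression))
```

For instance, a result of `"TRUE"` or `"Yes"` runs the task, while `"false"` or `"0"` skips it for that document.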
Tika Content Field
This is the field that 3Sixty will use to put the content it extracts from the document. The default field is content.
Max Content Length (B)
The maximum content length, in bytes, which is checked before processing. The job will not process documents over this size. Set to 0 to process documents of any length.
File Extensions to Extract
Comma-delimited list of file extensions to process. Leave blank to process all extensions. The extensions are checked at the same time as the content length.
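The length and extension checks described above can be pictured as a single pre-processing filter. The sketch below is illustrative and makes several assumptions that the product may not share: extensions are compared case-insensitively, a leading dot is ignored, and a document exactly at the maximum length is still processed. The function names are hypothetical.

```python
import os
from typing import Set


def parse_extensions(raw: str) -> Set[str]:
    """Split a comma-delimited extension list into a normalized set."""
    # Assumption: case and a leading dot are not significant.
    return {part.strip().lstrip(".").lower() for part in raw.split(",") if part.strip()}


def is_eligible(filename: str, size_bytes: int, max_length: int, extensions: str) -> bool:
    """Return True if the document passes both pre-processing checks."""
    # A max length of 0 disables the size check.
    if max_length > 0 and size_bytes > max_length:
        return False
    allowed = parse_extensions(extensions)
    # A blank extension list means every extension is processed.
    if not allowed:
        return True
    extension = os.path.splitext(filename)[1].lstrip(".").lower()
    return extension in allowed
```

With `max_length=1000` and `extensions="pdf, docx"`, a 500-byte `report.PDF` passes, while a 500-byte `image.png` or a 2000-byte `report.pdf` does not.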
Fail Document on Extraction Error
Fail the document if an extraction error occurs during processing.
Remove Content After Extraction
Remove the content from the documents. This will happen even if the document exceeds the maximum length.
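One way to picture how the last two options interact with the earlier checks is the sketch below. It is a simplified model under stated assumptions, not 3Sixty's code: the `Document` class, `run_task`, and the `extract` callable are hypothetical, and it assumes that a document whose extraction fails simply continues without extracted text when the fail option is off.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class Document:
    """Hypothetical stand-in for a document moving through a job."""
    content: Optional[bytes]
    metadata: Dict[str, str] = field(default_factory=dict)
    failed: bool = False


def run_task(
    doc: Document,
    extract: Callable[[bytes], str],
    eligible: bool,
    content_field: str = "content",
    fail_on_error: bool = False,
    remove_content: bool = False,
) -> Document:
    """Apply extraction, then the fail and remove options, to one document."""
    if eligible and doc.content is not None:
        try:
            doc.metadata[content_field] = extract(doc.content)
        except Exception:
            # Assumption: without fail_on_error the document continues untouched.
            if fail_on_error:
                doc.failed = True
    # Removal is applied even when the document was skipped for its length.
    if remove_content:
        doc.content = None
    return doc
```

In this model, a document that was skipped because it exceeds the maximum length still has its content removed when the remove option is on, which matches the behavior described above.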
Examples
The following example will extract all of the content in the documents while processing the integration job.
API Keys
Processor: tikaExtractorTask
Key | Display Name | Type |
---|---|---|
use_condition | Check a condition before executing this task. | Boolean |
task_condition | Condition | String |
task_stop_proc | Stop Processing | Boolean |
tejt_field_to_mark | Tika Content Field | String |
tejt_max_length | Max Content length (B) | LONG |
tejt_etp | File Extensions to Extract | String |
tejt_fail_on_error | Fail Document on Extraction Error | Boolean |
tejt_rm_bin | Remove content after extraction | Boolean |
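As an illustration of how these keys fit together, the sketch below builds a settings map for a task that runs only on PDF documents and serializes it to JSON. The key names and processor name come from this page; the wrapper structure (`processor`, `settings`) and the chosen values are assumptions for illustration and may not match the payload shape the 3Sixty API expects.

```python
import json

# Keys are taken from the API Keys table; values are example choices.
task_settings = {
    "use_condition": True,
    "task_condition": "equals('#{rd.mimetype}',\"application/pdf\")",
    "task_stop_proc": False,
    "tejt_field_to_mark": "content",  # default content field
    "tejt_max_length": 0,             # 0 processes documents of any length
    "tejt_etp": "pdf",                # blank would process all extensions
    "tejt_fail_on_error": True,
    "tejt_rm_bin": False,
}

# Hypothetical wrapper: the real request shape is not documented on this page.
payload = {"processor": "tikaExtractorTask", "settings": task_settings}

print(json.dumps(payload, indent=2))
```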