Duplicate Detection
The Duplicate Detection task identifies duplicate documents during an integration job run by comparing a selected field. If files already exist in the output location, errors may occur; this task prevents 3Sixty from attempting to integrate duplicate documents, resulting in a cleaner and faster integration process. During the job run, the task checks the selected field to identify duplicates and takes the selected action against any document it flags. To see how each repository handles duplicates and versioning, check the individual connector page.
When the duplication check is run multiple times, the original file will not be marked as a duplicate.
Configuration
To use this task, go to the task tab in your job. Select the task from the drop-down and click the plus circle to configure it. Click done after making any changes to save them.
Condition check
This condition is evaluated for each document to determine whether the task should run. The task executes when the condition's result is 'true', 't', 'on', '1', or 'yes' (case-insensitive); if the condition is left empty, the task runs for all documents.
Example: To run this task only for PDF documents, use the expression: equals('#{rd.mimetype}',"application/pdf")
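As a rough illustration of how a condition result is interpreted, the Python sketch below treats the result as a truthy string; the helper name is hypothetical and this is not 3Sixty's internal code.

```python
# Illustrative sketch only: how a condition result might be interpreted.
# The accepted truthy strings come from the list above; should_run_task
# is a hypothetical helper, not part of 3Sixty's API.
TRUTHY = {"true", "t", "on", "1", "yes"}

def should_run_task(condition_result):
    # An empty condition means the task runs for every document.
    if condition_result is None or condition_result.strip() == "":
        return True
    # Otherwise the result must match a truthy string, case-insensitively.
    return condition_result.strip().lower() in TRUTHY

print(should_run_task("YES"))    # True
print(should_run_task("false"))  # False
print(should_run_task(""))       # True (runs for all documents)
```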
Field to Compare
This is the field 3Sixty uses to check for duplicates. If the field's value is found in any other document, the document is considered a duplicate. The default is the File Content Hash. Documents can be compared by the following properties:
- File hash: This option requires the Hash Value Generator task to be added to the job. When 3Sixty runs the job it generates a hash value for each file, and the duplication checker uses this hash value to compare documents and flag any duplicates; see the sketch after this list. Note: hash comparison will find duplicates of two files with the same content even if the file names are different.
Note: Tasks are run in the order they are listed. If you wish to compare file hashes (a sort of fingerprint for a document), you will need to precede this task with a Hash Value Generator task. If you edit the contents of a file, its hash value will change and the file will no longer be flagged as a duplicate.
- Document type: Records can be compared by document type, such as folder or document.
- Document source ID: This is usually the file path of the document. For example, a file in an Objective directory would be \\objnas.objective.com\Engineering\Simflofy\2156\testdocument1.docx
- Document URI: In software like Amazon S3, this can be the URL of the document.
- Version ID: These are used in cases like Amazon S3. Any time a PUT request is made to an S3 bucket with versioning enabled, the object becomes the latest version and is assigned a new Version ID.
- Version series ID: In certain software like SharePoint, this can be the file path or the URL of the file.
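As a rough illustration of why a content hash catches renamed copies, the Python sketch below hashes file contents and groups files whose hashes match. The SHA-256 algorithm and the helper names are assumptions for the example; the Hash Value Generator task may use a different algorithm.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def content_hash(path: Path) -> str:
    """Hash the file's bytes, so identical content produces the same value
    regardless of the file name."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(paths):
    """Group files by content hash; any group with more than one entry
    contains duplicates, even when the names differ."""
    groups = defaultdict(list)
    for p in paths:
        groups[content_hash(p)].append(p)
    return {h: ps for h, ps in groups.items() if len(ps) > 1}
```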
Duplication Check Scope
This option determines how widely 3Sixty looks for duplicates during the job run.
- Job Run Only: 3Sixty only checks the documents associated with this job run. This is useful for an initial run, to identify duplicates during the first integration from one source to another.
- Job: 3Sixty checks all documents from every run of this job. Use this when you want to capture any new documents added to the source repository while ensuring documents that have already been migrated are not added multiple times.
- Enterprise: 3Sixty checks all documents ever processed through 3Sixty, in case you have duplicate documents across multiple repositories and want to reduce the number of duplicates across multiple locations.
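One way to picture the three scopes is as successively larger sets of already-processed documents to compare against. The sketch below is purely illustrative; the document collections and scope strings are placeholders, not 3Sixty objects.

```python
def comparison_population(scope, current_run_docs, job_history_docs, all_docs):
    """Return the documents a new record is compared against for each scope.
    Illustrative only; the collections passed in are placeholders."""
    if scope == "Job Run Only":
        return current_run_docs      # documents in this run only
    if scope == "Job":
        return job_history_docs      # every run of this job
    if scope == "Enterprise":
        return all_docs              # everything 3Sixty has processed
    raise ValueError(f"Unknown scope: {scope}")
```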
Action
You can select from three actions for 3Sixty to take when it encounters a duplicate during the job run.
- Audit and continue: The scan continues and 3Sixty integrates the duplicate, but you can look at the processed record type in the job run history to easily identify the duplicates found. If duplicate files exist you may encounter an error, but the job will continue to run.
- Skip the document: Any found duplicates are skipped, preventing the "file already exists" error. 3Sixty will only migrate the original files.
- Fail the Job: The job stops running if a duplicate is found. You can review the flagged file and then re-run the job to identify more duplicates.
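The three actions amount to three different branches when a duplicate is encountered, roughly as in the sketch below; the helper and exception names are hypothetical, not 3Sixty code.

```python
class DuplicateFoundError(Exception):
    """Raised to stop the whole job when the action is 'Fail the Job' (illustrative)."""

def handle_duplicate(action, doc_id, audit_log):
    """Return True if the document should still be integrated (illustrative)."""
    if action == "Audit and continue":
        audit_log.append(f"duplicate found: {doc_id}")  # note it, then integrate anyway
        return True
    if action == "Skip the document":
        return False                                    # only the original is migrated
    if action == "Fail the Job":
        raise DuplicateFoundError(doc_id)               # stop the job run
    raise ValueError(f"Unknown action: {action}")
```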
Tagging Duplicate Documents
Metadata can be added through mapping to tag documents discovered as duplicates when the Action field is set to Audit and Continue.
The fields that can be added are:
- isDuplicate: true if a duplicate is found, false if not.
Important: When using the Duplication Detection task in a job, if you map the "isDuplicate" field, set its Target type to String rather than Boolean; otherwise you will receive an error stating that text cannot be changed to boolean. If this error occurs, drop the index and run the job again.
- baseParentID: The doc ID of the original document.
- duplicationParentID: A comma-separated list of the doc IDs of the documents it was found to duplicate; blank if no duplicate is detected.
- duplicationScope: The scope in which the duplicate was detected; blank if no duplicate is detected.
- duplicationCriteria: The field the duplicate was matched against, depending on the Field to Compare selection; blank if no duplicate is detected.
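To make the note about the isDuplicate target type concrete, a tagged duplicate might carry values like the following; the field names come from the list above and the values themselves are invented.

```python
# Hypothetical example of the duplicate-tagging fields for a flagged document.
# isDuplicate is deliberately the string "true", not the Boolean True,
# to match the requirement that the mapped Target type be String.
duplicate_tags = {
    "isDuplicate": "true",
    "baseParentID": "doc-1001",                   # doc ID of the original document
    "duplicationParentID": "doc-2002,doc-3003",   # comma-separated list of matching doc IDs
    "duplicationScope": "Job Run Only",           # scope the duplicate was detected in
    "duplicationCriteria": "File Content Hash",   # field the match was made on
}
```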
Creating duplicate mapping fields
Examples
In the images below you can see an example of how to configure your job to detect duplicate data.
Configuring the task as in the first image tells 3Sixty to compare the hash field against the other documents in the job run when the job runs. If a duplicate document is found, a note is made in the job's audit and the job continues to run.
In the second image, the following fields are added during the integration and will be available to filter by in reports: isDuplicate, duplicationParentID, duplicationScope, and duplicationCriteria.
Adding the Duplicate detection task
API Keys
Processor: duplicationCheckTask
Key | Display Name | Type
---|---|---
use_condition | Check a condition before executing this task. | Boolean
task_condition | Condition | String
task_stop_proc | Stop Processing | Boolean
field_to_compare | Field to Compare | String
scope | Duplication Check Scope | String
action | Action | String
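Pulling the key table together, a duplication check task configuration might look something like the sketch below; the values are examples only, and the exact payload shape and accepted values expected by your 3Sixty instance may differ.

```python
# Hypothetical payload assembled from the key table above. Treat this as a
# sketch of which keys exist, not as a verified API call or request format.
duplication_check_task = {
    "processor": "duplicationCheckTask",
    "use_condition": True,
    "task_condition": "equals('#{rd.mimetype}',\"application/pdf\")",
    "task_stop_proc": False,
    "field_to_compare": "File Content Hash",
    "scope": "Job Run Only",
    "action": "Audit and continue",
}
```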