Set OCR cache options

This section describes the Nuix OCR cache folder and explains how to:

Define or update cache options for an OCR Profile

Locate the most recently generated OCRed items

Nuix OCR cache folder

The OCR process maintains a cache that stores the results for items that have been OCRed. If the OCR Profile specifies a network folder accessible from multiple Nuix Workstation instances and cases, that folder acts as a central cache for OCR-processed items, which can reduce the amount of OCR processing required. This was one of the original intentions of the cache.

To expedite the OCR process, if the cache contains OCR results that match the MD5 of an item, the cached values are applied to that item instead of passing it through the OCR processor. The cache stores the OCR results as TXT and PDF files named by MD5 digest.

The Nuix OCR feature places two files for each OCRed item (.txt and .pdf) in a cache folder in the Nuix case folder. This acts as a deduplication method so that OCR does not run against the same document multiple times, which also helps performance. If you run OCR several times on the same document using different OCR settings or profiles, the results do not change, because the OCR cache still contains the copies from the first attempt. To apply new OCR settings, clear the cache or specify a new cache.

If another item in the case has the same MD5 value and you attempt OCR on that item, Nuix Workstation retrieves the cached PDF or TXT file instead of passing the item to be OCRed. The PDF and TXT files in the OCR cache directory are named by the MD5 value of each OCRed item.
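The lookup described above can be sketched outside Nuix as follows. This is only an illustration, not Nuix code: the function names are hypothetical, and it assumes the cache files sit directly in the cache folder and are named <md5>.txt and <md5>.pdf with lowercase hex digits, which you should verify against your own cache.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_cached_ocr(cache_dir: Path, md5_hex: str) -> dict[str, Path]:
    """Return the cached .txt/.pdf files for an MD5 value, if present.

    Assumption: files are stored flat in cache_dir as <md5>.txt and
    <md5>.pdf, using lowercase hex digits.
    """
    hits = {}
    for ext in ("txt", "pdf"):
        candidate = cache_dir / f"{md5_hex.lower()}.{ext}"
        if candidate.is_file():
            hits[ext] = candidate
    return hits
```

An empty result means no cached OCR output exists for that MD5 in the folder you checked, so under the behaviour described above the item would be passed to the OCR processor.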


Define or update cache options for an OCR Profile

Out-of-the-box and modified OCR Profiles also require you to configure cache settings on the Cache and Description tab of the Add or Edit OCR Profile dialog. If you specify a network folder accessible from multiple Nuix Workstation instances and cases, it provides a central cache for OCR-processed items and helps reduce the amount of OCR processing required.

To define or update the caching options for an OCR profile:

Open the Add OCR profile window (see the first steps of Create your OCR Profile).

Select the Cache and Description tab.

In Description, provide a high-level description of this OCR profile.

Under Cache Options, enable one of the following options:

Update duplicate items in case: To update all post-load OCRed items in the case. This option is not applicable to front-load OCR.

Use custom cache directory: To locate and set a special cache directory (for example, F:\tmp\ocr-cache); otherwise, this defaults to a folder in the OCR case directory.

Clear cache on completion: To automatically clear and delete the folder on completion of processing any OCR job.

Note: You can also delete the cache, or the items in it, manually. If you do, you must repeat this each time you run OCR with a new OCR Profile or change any OCR settings in the Add or Edit OCR Profile dialog, before you run OCR with the new settings.

Click OK.
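The manual clearing mentioned in the note above can be scripted. The following is a minimal sketch, not a Nuix feature: the function name and the dry_run parameter are hypothetical, and it assumes the cache holds only .txt and .pdf files directly inside the cache folder. It defaults to a dry run so that nothing is deleted until you have reviewed the list.

```python
from pathlib import Path


def clear_ocr_cache(cache_dir: Path, dry_run: bool = True) -> list[Path]:
    """Delete cached .txt and .pdf files directly inside cache_dir.

    With dry_run=True (the default) nothing is deleted; the function only
    returns the files it would remove. Assumption: cache files are stored
    flat in the folder, not in subfolders.
    """
    targets = sorted(
        p for p in cache_dir.iterdir()
        if p.is_file() and p.suffix.lower() in {".txt", ".pdf"}
    )
    if not dry_run:
        for path in targets:
            path.unlink()
    return targets
```

For example, clear_ocr_cache(Path(r"F:\tmp\ocr-cache")) lists the candidate files for the custom directory used as an example above, and passing dry_run=False removes them.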

Locate the most recently generated OCRed items

To find your most recently generated OCRed items:

Locate the default cache folder in your OCR case directory, or the custom folder you defined, where items are stored after OCR.
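Once you have located the folder, sorting its contents by modification time is one way to surface the newest OCR output. The sketch below is an illustration under assumptions, not part of Nuix: the function name is hypothetical, and it assumes the file modification time reflects when the item was OCRed.

```python
from pathlib import Path


def newest_ocr_outputs(cache_dir: Path, limit: int = 10) -> list[Path]:
    """Return up to `limit` cached .txt/.pdf files, newest first.

    Assumption: the file modification time reflects when the item was OCRed.
    """
    files = [
        p for p in cache_dir.iterdir()
        if p.is_file() and p.suffix.lower() in {".txt", ".pdf"}
    ]
    files.sort(key=lambda p: p.stat().st_mtime, reverse=True)
    return files[:limit]
```

Each returned file name, minus its extension, is the MD5 value of the OCRed item, which you can use to find the matching item in the case.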