Guide to OCR Processing
This guide provides best-practice advice on performing OCR, either while ingesting data into Nuix Workstation or afterward. This introduction covers:
What is Optical Character Recognition?
What are the capabilities and features of OCR?
What OCR processing speeds can one expect?
How can performance be sped up?
Do features differ in product versions?
Why provide a default Processing Profile with a default OCR Profile?
What is Optical Character Recognition?
Optical Character Recognition (OCR) is a technology that identifies letters, numbers, and other characters, converting images or scanned paper documents into searchable electronic text. Using OCR technology, Nuix Workstation can extract text from PDF documents (including text in email PDF attachments, text in scanned documents, and text from pictures in PDF documents) and from picture artifacts. This lets you view and search text that is normally locked inside images, which improves your investigative capability.
What are the capabilities and features of OCR?
Nuix Workstation uses the ABBYY FineReader Engine for OCR processing. The engine is provided at no additional cost with several license types, packaged as the Nuix OCR Addon. For details, see OCR license types and OCR Addon versions. To enable OCR processing in Nuix Workstation, you must download the Nuix OCR Addon from the Nuix Customer Portal and install it separately. For information about ABBYY FineReader, visit the ABBYY website.
ABBYY can OCR any item that can be converted to PDF, with the exception of encrypted or password-protected files. OCR works on images and non-searchable PDFs, and the OCR engine preserves the original text layout in the text it outputs. Once processed, the items become fully searchable (keyword, context, skin tone analysis, and so forth).
While Nuix Workstation is pre-configured for your license type with a Nuix Processing Profile that includes a default OCR Profile, you can also set up customized OCR Profiles to identify specific file types to be OCRed automatically during ingestion, and certain item types to be skipped. See Customize an OCR Profile for details. OCR processing automatically identifies non-searchable PDFs and displays them using the Non-Searchable PDF filter.
Additionally, the OCR feature allows:
Nuix Workstation to report on documents that fail during the OCR process and tag them as exceptions so that they can be resubmitted for processing. Resubmission can be automated through a script.
External OCR tools, including ABBYY, to be integrated and used with Nuix Workstation, if required.
What OCR processing speeds can one expect?
OCR processing speed depends on a number of factors, including the following:
The size of the native documents
The complexity of the native documents
The number of documents and pages
The number of Workers used
The configuration of the servers
Nuix Workstation’s OCR process was last benchmarked at approximately 27 pages per Worker per minute. In that exercise, four (4) Workers OCRed 58,285 pages across 7,500 documents in 8 hours and 58 minutes. The benchmark did not include PDF regeneration.
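The benchmark figure can be checked with simple arithmetic. The short Python sketch below uses only the numbers quoted above; the variable names are illustrative and are not part of any Nuix tool.

```python
# Figures quoted in the benchmark above.
pages = 58_285
documents = 7_500
workers = 4
elapsed_minutes = 8 * 60 + 58  # 8 hours 58 minutes = 538 minutes

pages_per_minute = pages / elapsed_minutes
pages_per_worker_per_minute = pages_per_minute / workers
pages_per_document = pages / documents

print(f"{pages_per_minute:.1f} pages/min overall")            # 108.3
print(f"{pages_per_worker_per_minute:.1f} pages/Worker/min")  # 27.1
print(f"{pages_per_document:.1f} pages/document on average")  # 7.8
```

The result of roughly 27 pages per Worker per minute matches the quoted rate. It describes this one data set and hardware configuration, so treat it as a reference point rather than a guarantee.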
How can performance be sped up?
Note: Performing OCR, like processing or exporting activities, is most effectively done in Nuix Workstation by using Worker servers in a distributed network with one or more master servers using the same programs. This architecture requires at least two licenses with Nuix Worker capabilities. The master server manages the separate Worker servers as they handle individual tasks and provide the results of their tasks back to the master server. See the Nuix Workstation Guide to Configuring Distributed Workers for how to configure and maintain such a network.
Nuix OCR scales to available hardware: adding RAM, cores, and Nuix Workers can increase performance. On-premises machines are fixed resources and do not scale the way remote Workers do. Remote Workers can also use less costly licenses. If you want the most Nuix processing power from your hardware at the best price, use remote Workers; they can also deliver much faster processing speeds.
The OCR settings can make a significant difference. For example, setting "Multiprocessing setting" to "Parallel" on a large multi-core machine can speed up OCR for large documents with many pages, because it parallelizes at the page level. Using the Text extraction – Accuracy setting in the OCR Profile’s Template (now the default setting) and disabling Use OCR printed image if you are not using the PDF also speed things up. See Customize an OCR Profile for details.
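For rough capacity planning, the benchmark rate of about 27 pages per Worker per minute can be turned into a duration estimate. The Python sketch below is a back-of-the-envelope calculation, not a Nuix feature: the function name is hypothetical, and it assumes throughput grows linearly with the number of Workers, which real jobs may not achieve because document size, complexity, and server configuration also matter.

```python
# Rate quoted in this guide's benchmark (pages per Worker per minute).
BENCHMARK_RATE = 27.0


def estimate_ocr_minutes(pages, workers, rate=BENCHMARK_RATE):
    """Estimate OCR wall-clock minutes.

    Assumption: throughput scales linearly with the Worker count.
    This is a simplification for planning, not a measured behavior.
    """
    if pages < 0 or workers < 1 or rate <= 0:
        raise ValueError("pages must be >= 0, workers >= 1, rate > 0")
    return pages / (workers * rate)


# Compare hypothetical Worker counts for a 100,000-page job.
for workers in (4, 8, 16):
    minutes = estimate_ocr_minutes(100_000, workers)
    print(f"{workers:>2} Workers: about {minutes / 60:.1f} hours")
```

Replace the rate with a figure measured on your own data and hardware before relying on the estimate.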
Do features differ in product versions?
See OCR license types and OCR Addon versions to find which version of ABBYY works with which Nuix Workstation version.
For Nuix Workstation v8.8 and later:
The OCR installer for Windows depends on the Microsoft Visual C++ 2010 Redistributable Package (x64), which must be installed for OCR to function properly.
The Nuix Processing Profile is automatically configured for the relevant license types with the default OCR Profile.
The license you select on starting up Nuix Workstation determines what features are available and what options are enabled or visible in the Processing Settings window.
Nuix has set defaults based on the most common use cases when ingesting data. You can then determine if these settings are appropriate for your use case. One of the Processing Profile’s options is to perform front-load OCR when ingesting data into a case.
If you have any issues or do not see the OCR feature, log a ticket on the Nuix Support Portal, located at https://www.nuix.com/support.
For Nuix Workstation v9.10 and later:
You have the option to perform OCR either during the ingestion process ("front-load OCR") or after ingestion from the Results view.
For post-ingestion OCR, you can only run Simple cases.
The post-ingestion OCR option provides a more cost-effective, one-step workflow for targeted OCR of items that have already been ingested: only the items you select are processed. The OCR dialog that you access with this method contains all the configuration you need to extract text through OCR. For details, see Perform OCR.
You can no longer run Simple cases as a background process, but you can still do so for Compound and Elastic cases.
Why provide a default Processing Profile with a default OCR Profile?
Nuix Workstation is distributed with a default Processing Profile. The license a user selects when they start up Nuix Workstation determines the features that are available. This also affects what options are enabled or visible in the Processing Settings dialog. One Processing Profile option is the ability to perform front-load OCR when ingesting data into a case.
If the license you select when you start up Nuix Workstation does not allow for OCR, this option is not visible in the Add/Edit Processing Profile dialog. Nuix has based the defaults here on the most common use cases when ingesting data and for retrieving text from an item. You can then determine if these settings are appropriate for your use case.
An OCR Profile provides control over what OCR occurs during data ingestion or processing. Each profile uses a customized template of OCR settings, which you may want to edit. You can configure and use multiple OCR Profiles per OCR job across different file types according to your rules. The rules match items with a specific OCR template.
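The idea of rules that match items to an OCR template can be pictured with a small Python sketch. Everything in it is hypothetical: the class, the template names, the use of MIME-type prefixes, and the first-match-wins ordering are assumptions made for illustration, and they do not describe how Nuix Workstation stores or evaluates OCR Profiles.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class OcrRule:
    """Hypothetical rule: items whose MIME type starts with `prefix`
    are OCRed with `template`; a template of None means "skip"."""
    prefix: str
    template: Optional[str]


def pick_template(mime_type: str, rules: Sequence[OcrRule]) -> Optional[str]:
    """Return the template of the first matching rule.

    Returns None when the matching rule says to skip the item, or
    when no rule matches (the item is not OCRed in either case).
    """
    for rule in rules:
        if mime_type.startswith(rule.prefix):
            return rule.template
    return None


# Illustrative rule set; the template names are made up.
rules = [
    OcrRule("application/pdf", "accurate-text"),
    OcrRule("image/gif", None),       # skip GIFs
    OcrRule("image/", "fast-text"),   # every other image type
]

for mime in ("application/pdf", "image/png", "image/gif", "text/plain"):
    print(mime, "->", pick_template(mime, rules))
```

Ordering the rules from most specific to most general lets a narrow skip rule take effect before a broad catch-all rule in this sketch.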