Import a Digest List

A Digest List is a list of hashes for a collection of files. You can use a Digest List as a filter when performing a search.

You can import Digest Lists from third-party sources or create a new list to assist with tasks such as duplicate identification. You can do so in these formats:

Nuix's own binary format

NSRL (NSRLFile.txt)

iLook (*.hsh)

HashKeeper (*.hke, *.hsh)

Plain text (which requires a single digest per line, with no trailing punctuation or whitespace)

(Currently, importing a Digest List directly from an SQLite database is only possible using code. See https://github.com/nuix for details.)
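The plain text requirement (a single digest per line, with no trailing punctuation or whitespace) can be checked before import. The following is a minimal sketch, not part of Nuix Workstation. It assumes the digests are 32-character hexadecimal MD5 values, and the function name clean_digest_lines is illustrative only.

```python
import re

# Assumption: each digest is a hexadecimal MD5 value (32 hex characters).
MD5_HEX = re.compile(r"[0-9a-fA-F]{32}")


def clean_digest_lines(lines):
    """Return (valid, rejected).

    valid    -- digests with whitespace and trailing punctuation removed
    rejected -- raw lines that do not look like a single MD5 digest
    """
    valid, rejected = [], []
    for raw in lines:
        # Remove the newline, surrounding whitespace and trailing punctuation.
        candidate = raw.strip().rstrip(",;.").strip()
        if not candidate:
            continue  # ignore blank lines
        if MD5_HEX.fullmatch(candidate):
            # Lowercasing is an arbitrary choice made here for consistency.
            valid.append(candidate.lower())
        else:
            rejected.append(raw.rstrip("\n"))
    return valid, rejected
```

Write the valid digests back out one per line and review the rejected lines by hand before you import the file.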

To import a Digest List:

Do either of the following:

From the Global Options window, select the Digest Lists option.

From the File menu, select Import then Import Word List.

Click the Import link and browse to find the relevant Digest List file.

Digest Lists are useful when you want to eliminate the following:

System files or other application files with known signatures and minimal value to your investigation.

This is known as De-NISTing because the file most commonly used for this operation is the list of digests published by NIST.

For more details, go to: https://www.nist.gov/itl/ssd/software-quality-group/national-software-reference-library-nsrl/nsrl-download

Previously produced content - do this by importing the top-level Digest List report included with a legal export.

Inappropriate content when detected - do this by importing or generating a hash list of such content, and adding it as part of the export process to suppress it in future exports.
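Conceptually, each of these uses removes items whose digest appears in the list. The following minimal sketch illustrates that idea outside Nuix Workstation. The function name exclude_known and the (name, MD5) pair layout are assumptions made for this example, not a Nuix API.

```python
def exclude_known(items, digest_list):
    """Keep only the items whose MD5 digest is not in the digest list.

    items       -- iterable of (name, md5_hex) pairs (illustrative layout)
    digest_list -- iterable of MD5 hex strings, for example a De-NIST list
    """
    # A set gives constant-time membership checks, which matters for
    # lists as large as the NSRL hash sets.
    known = {digest.lower() for digest in digest_list}
    return [(name, md5) for name, md5 in items if md5.lower() not in known]
```

The comparison ignores case because hash lists from different sources may use either uppercase or lowercase hexadecimal.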

For more details on basic functions related to Digest Lists, see "Create a Digest List" in "Configure List types for repetitive processes" under "Set Global Options".

Import an NSRL Digest List

You can directly import the National Software Reference Library (NSRL) Digest Lists (Hash Sets) from: https://www.nist.gov/itl/ssd/software-quality-group/national-software-reference-library-nsrl/nsrl-download

Note: Be aware that NSRL hash lists contain a significant amount of extraneous information and duplicates, which can cause Nuix to perform additional, unnecessary work and take additional time.

Performing the streamlining steps described below significantly decreases the time it takes to load and perform searches with the NSRL digest list.

To import an NSRL Digest List into Nuix Workstation:

Download all four NSRL hash list ISO images from the website mentioned above.

Extract the contents of each *.iso image.

Streamline the NSRLFile.txt (the ultimate target) from each image before you load it. To streamline the files on a Linux OS, enter the following command:

cat NSRLFile.txt | cut -d '"' -f 4 | sort | uniq | sed '/MD5/d' > NSRLFile.sorted.txt

The delimiter passed to cut is a double quote enclosed in single quotes: '"' (single quote, double quote, single quote).

If you are not running Linux, download Cygwin from http://www.cygwin.com to process these files. Locate the unzipped NSRLFile.txt files and execute the previous cat command.

To navigate Cygwin, enter the following commands. Assuming the files exist in a mapped drive, type the following sequence from the $ prompt:

Move to the root directory: cd /

Show all available folders: ls

Move into the root directory for all the mapped drives: cd cygdrive

Move into a specific folder, substituting the actual drive letter (for example, "d") for "drive letter":

cd drive letter/folder name

Open the file in a text editor and remove the last two lines in the NSRLFile.txt (a blank line and an "md5" string).

Combine the four NSRLFile.sorted files into a single file for use as a single Digest List filter. For example, from a folder that contains only the four sorted files, use this command to merge them into one sorted, deduplicated hash list:

$ cat * | sort | uniq > merged-file

Select File, then choose Global Options, select the Digest List tile and click Add.
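If neither Linux nor Cygwin is available, the streamlining and merging steps above can be approximated in Python. This is a minimal sketch under stated assumptions, not a Nuix utility. It assumes the MD5 value is the fourth double-quote-delimited field of each NSRLFile.txt line, as in the cut command, and the function names extract_md5s and merge_digest_lists are illustrative.

```python
def extract_md5s(lines):
    """Roughly mirror: cut -d '"' -f 4 | sort | uniq | sed '/MD5/d'."""
    digests = set()
    for line in lines:
        fields = line.rstrip("\n").split('"')
        # Field 4 for cut is index 3 here. Lines that are too short,
        # empty values, and the header row containing "MD5" are skipped.
        if len(fields) > 3 and fields[3] and "MD5" not in fields[3]:
            digests.add(fields[3])
    return sorted(digests)


def merge_digest_lists(*digest_lists):
    """Merge several digest lists into one sorted, deduplicated list."""
    merged = set()
    for digests in digest_lists:
        # Strip newlines and drop blank entries before merging.
        merged.update(d.strip() for d in digests if d.strip())
    return sorted(merged)
```

To use the sketch, pass an open NSRLFile.txt file object to extract_md5s for each image, merge the four results with merge_digest_lists, and write the merged list to a text file with one digest per line.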

How Digest Lists are computed

You can generate SHA-1, SHA-256 and MD5 digests in Nuix Workstation, but only MD5 digests are used for deduplication.

Duplication is detected if two documents have the same MD5 digest value. For documents, the MD5 digest value is computed over a document's binary stream.
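The idea of computing an MD5 digest over a binary stream and comparing the values can be sketched with Python's standard hashlib module. This illustrates the principle only and is not how Nuix Workstation is implemented. The function names md5_of_stream and group_duplicates are assumptions made for this example.

```python
import hashlib
from collections import defaultdict


def md5_of_stream(stream, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a binary stream, reading in chunks."""
    digest = hashlib.md5()
    # iter() with a sentinel keeps reading until read() returns b"".
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def group_duplicates(named_streams):
    """Group names by MD5 and return only digests shared by two or more names.

    named_streams -- iterable of (name, binary_stream) pairs
    """
    groups = defaultdict(list)
    for name, stream in named_streams:
        groups[md5_of_stream(stream)].append(name)
    return {md5: names for md5, names in groups.items() if len(names) > 1}
```

Reading in chunks keeps memory use flat for large files. Any object with a read method that returns bytes, such as a file opened in "rb" mode, can be passed as a stream.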

Email Digests

Because not all email types actually have a binary stream and two copies of the same message can have completely different header information, Nuix Workstation computes an email's MD5 digest by taking the following data encoded using UTF-8 as input:

Subject header

From header

To header

Cc header

Email body text tokenized so whitespace and irrelevant characters are removed

Binary streams of all attachments

Personal details are discarded in address headers and only the address part is used. The email body is tokenized to ignore whitespace differences, which can be a factor when comparing HTML and plain text messages.
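The approach described above can be illustrated with a short sketch. It will not reproduce the digests that Nuix Workstation computes, because the exact field order, separators and tokenization rules are not documented here. In this example, concatenating the parts without separators and removing all whitespace from the body are assumptions, and the function names are illustrative.

```python
import hashlib
import re
from email.utils import getaddresses


def address_parts(header_value):
    """Return only the address part of each entry, discarding display names."""
    return [addr for _name, addr in getaddresses([header_value]) if addr]


def tokenize_body(body):
    """Assumption: tokenization is approximated by removing all whitespace."""
    return re.sub(r"\s+", "", body)


def email_md5(subject, from_, to, cc, body, attachments=()):
    """Illustrative email digest over headers, body text and attachments."""
    digest = hashlib.md5()
    text_parts = [
        subject,
        *address_parts(from_),
        *address_parts(to),
        *address_parts(cc),
        tokenize_body(body),
    ]
    # Text inputs are encoded as UTF-8 before hashing.
    for part in text_parts:
        digest.update(part.encode("utf-8"))
    # Attachments contribute their raw binary streams.
    for blob in attachments:
        digest.update(blob)
    return digest.hexdigest()
```

With this sketch, two copies of a message that differ only in display names or in body whitespace produce the same digest, while a change to the subject, an address, or an attachment produces a different one.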

Note: If you want unsent emails to populate the FROM value instead of leaving it blank, use the following switch: nuix.mapi.unsentitem.generateFromAddr.

The "FROM" address for unsent emails is then generated from the mapi-last-modifier-name and mapi-creator-name fields, unless there is a specific rule in place that applies to those fields.