Deduplicate emails using EDRM MIH hashing

Nuix Workstation v100.2.0 can now generate EDRM MIH hashes to perform cross-platform email duplicate identification.

About EDRM MIH

Since 2005 the Electronic Discovery Reference Model (EDRM) has created practical resources to improve e-discovery, privacy, security and information governance. The EDRM is used in 145 countries by individuals, law firms, corporations and government organizations seeking to improve the practice and provision of data and legal discovery.

The EDRM now provides a Cross Platform Email Duplicate Identification Specification for identifying duplicates across multiple email platforms using the hash value of an email Message ID metadata field, the EDRM Message Identification Hash (“MIH”). The EDRM MIH is the MD5 hash value of the ASCII string comprised of the Message-ID header field of RFC-compliant email messages.
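
As a minimal sketch of the definition above, the following Python computes the MD5 hash of the ASCII bytes of a Message-ID string. The function name edrm_mih is hypothetical, and the Message-ID in the usage line is a made-up placeholder rather than a value taken from the specification.

```python
import hashlib


def edrm_mih(message_id: str) -> str:
    """Return the MD5 hex digest of the ASCII-encoded Message-ID string."""
    # The string is hashed exactly as given: angle brackets included,
    # character case preserved. Non-ASCII input raises UnicodeEncodeError.
    return hashlib.md5(message_id.encode("ascii")).hexdigest()


if __name__ == "__main__":
    # Placeholder Message-ID, for illustration only.
    print(edrm_mih("<example-id@example.com>"))
```

Because only the Message-ID string is hashed, two tools that read the same header value should arrive at the same 32-character hexadecimal result, regardless of how each tool deduplicates internally.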

The EDRM MIH does not replace current Nuix Workstation email deduplication methods. Instead, it enables cross-platform email duplicate identification so that organizations can identify duplicate emails efficiently in a defensible and cost-effective manner.

Before the MIH was developed, there was no economical way to deduplicate across platforms, so parties ended up paying for duplicative review and analysis of the same documents. Each vendor had its own method of email hashing and deduplication, which made it impossible to exchange email hash results between vendor applications. Leading platforms that now use the MIH include Relativity, Reveal-Brainspace, EDT and Nuix.

The EDRM MIH is an open-source, collaborative effort of leading companies and technologists, backed by the EDRM, that integrates seamlessly with existing tools and workflows. It is anticipated that litigants will exchange EDRM MIH values in load files as routinely as parties now exchange Bates numbers and file names.

Nuix Workstation v100.2.0 can now generate the MIH for cross-platform email duplicate identification.

Note: The ability to export load files using EDRM MIH is only available for Nuix Workstation licenses that allow Legal Exports.

Benefits

The new standard MIH hash value provides the following benefits:

A faster alternative way of deduplicating emails, especially previously seen emails, even when the form of production has changed

Better exchangeability of data between platforms and vendors

Wide vendor support (by Relativity, EDT, Reveal, and our Nuix API)

Vendors do not need to change the way they deduplicate email messages internally

Requesting parties can identify duplicates of email messages across production sets:

From the same party

From different producing parties

Across different matters

The ability, for example, to reliably deduplicate a TIFF production set from one party against a PDF or native set from another party

Smaller review populations, shorter review times and smaller hosting fees

Greater consistency in coding, leading to improved predictive coding scoring and less need to reconcile disparate coding for duplicates

Greater insight into other parties' productions with less time and effort

Ability to repurpose and leverage previous work product for future matters

Increased flexibility in allowing data to remain in multiple databases and locations while applying data minimization techniques before consolidating the data into a central location

Greater ease in moving matters across platforms or service providers
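
To illustrate how MIH values could be used to identify duplicates across production sets, the sketch below groups load-file records by their MIH. The record layout (production set name, document ID, MIH) and the function name group_by_mih are assumptions made for illustration; they do not describe a Nuix load file format or API.

```python
from collections import defaultdict


def group_by_mih(records):
    """Group (production_set, doc_id, mih) tuples by MIH value.

    Returns a dict containing only the MIH values that occur more than once.
    """
    groups = defaultdict(list)
    for production_set, doc_id, mih in records:
        if mih:  # items without an MIH cannot be matched this way
            # Hex digests are compared case-insensitively.
            groups[mih.lower()].append((production_set, doc_id))
    return {mih: docs for mih, docs in groups.items() if len(docs) > 1}


if __name__ == "__main__":
    # Placeholder values, not real hashes.
    records = [
        ("PartyA_TIFF", "A-0001", "0" * 32),
        ("PartyB_PDF", "B-0417", "0" * 32),
        ("PartyB_PDF", "B-0418", "1" * 32),
        ("PartyA_TIFF", "A-0002", None),
    ]
    for mih, docs in group_by_mih(records).items():
        print(mih, docs)
```

The grouping relies only on the exchanged hash value, so the TIFF and PDF records in the example are matched without comparing the produced files themselves.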

How to deduplicate emails using EDRM MIH hashing

To deduplicate emails using EDRM MIH hashing:

On the Data Processing Settings tab of the Edit Processing Profile window, under Digest Settings > Email Digest Settings, to maximize the uniqueness of the MD5 values returned, enable the following:

Use EDRM MIH check box.

Include Communication Date check box (optional).

This is similar to enabling the MD5 check box under Digest to compute, which calculates new hash values during ingestion.

Note: Enabling or disabling the Include Communication Date option has no effect on the EDRM MIH hash values produced, as these are always based only on the Message-ID. However, the EDRM MIH and MD5 hash values differ when this option is enabled, because the MD5 values are then based on the Message-ID together with the date. When Include Communication Date is off, the EDRM MIH and MD5 hash values are identical.

Then search using text-custom-metadata:"edrm-mih:*" or "edrm-mih:12345" to find targeted results in the Results view. You can see more details for each result on the Metadata tab of the Preview pane.

Prerequisites for Message-ID values

Message-ID values must:

Have the following format: "<"id-left"@"id-right">", where the id-left and id-right sections must each contain 1 or more alphanumeric characters.

Not contain spaces, angle brackets, or @ (at symbol) within the id-left and id-right sections.

Examples:

Message-ID header line from email: Message-ID: <C>

Value passed to MIH generator: <CALckR-a8UDkRjO4xJyjd_s0GPxQWw@mail.gmail.com>

Generated MIH: 1de319c276884bd0c9e2f1621ada26cc
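
The format prerequisites above can be sketched as a regular expression check. This is an illustration under assumptions, not the validation logic used by Nuix Workstation: the function name is_valid_message_id is hypothetical, and the pattern accepts any character other than whitespace, angle brackets, or @ in the id-left and id-right sections, because the example value above contains hyphens, underscores and periods.

```python
import re

# "<" id-left "@" id-right ">", where each side has one or more characters
# that are not whitespace, angle brackets, or "@" (an assumed reading of the rules).
_MESSAGE_ID = re.compile(r"<[^\s<>@]+@[^\s<>@]+>")


def is_valid_message_id(value: str) -> bool:
    """Return True if the whole string matches the assumed Message-ID format."""
    return _MESSAGE_ID.fullmatch(value) is not None


if __name__ == "__main__":
    candidates = [
        "<CALckR-a8UDkRjO4xJyjd_s0GPxQWw@mail.gmail.com>",
        "CALckR-a8UDkRjO4xJyjd_s0GPxQWw@mail.gmail.com",  # no angle brackets
        "<id with space@example.com>",
        "<@example.com>",  # empty id-left
    ]
    for candidate in candidates:
        print(candidate, is_valid_message_id(candidate))
```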

Limitations

The MIH does not generate if an email has no Message-ID or an invalid Message-ID value.

(To generate the MIH, the complete Message-ID value, including the flanking angle brackets, MUST be used.)

Changing the character case of the Message-ID value before MIH generation changes the hash value of the string.

If the email contains more than one Message-ID value, then only the first Message-ID value in the parent email message headers generates the MIH.

Non-email messages do not generate MIHs.

Currently, calendar or contact items which may have Message-IDs will not generate MIHs.
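
The limitations above can be sketched with Python's standard email parser. The function name mih_from_headers and the validation pattern are assumptions for illustration and do not describe how Nuix Workstation is implemented. The sketch returns no MIH when the Message-ID is missing or invalid, uses only the first Message-ID found in the top-level headers, and leaves the character case untouched.

```python
import hashlib
import re
from email import message_from_string
from typing import Optional

# Assumed reading of the Message-ID format rules.
_MESSAGE_ID = re.compile(r"<[^\s<>@]+@[^\s<>@]+>")


def mih_from_headers(raw_email: str) -> Optional[str]:
    """Return the MIH for the first top-level Message-ID, or None."""
    msg = message_from_string(raw_email)
    values = msg.get_all("Message-ID")  # None when the header is absent
    if not values:
        return None
    first = str(values[0]).strip()  # only the first value is considered
    if not _MESSAGE_ID.fullmatch(first):
        return None  # invalid Message-ID: no MIH
    try:
        data = first.encode("ascii")  # case is preserved, brackets included
    except UnicodeEncodeError:
        return None
    return hashlib.md5(data).hexdigest()


if __name__ == "__main__":
    raw = (
        "Message-ID: <first@example.com>\n"
        "Message-ID: <second@example.com>\n"
        "Subject: placeholder\n"
        "\n"
        "Body text\n"
    )
    print(mih_from_headers(raw))
    print(mih_from_headers("Subject: no id\n\nBody text\n"))
```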

Scenarios where the MIH on its own may be inadequate to perform deduplication

Although Message-IDs are required to be “guaranteed” globally unique, the EDRM Committee identified the following scenarios where Message-IDs were absent or were the “same” when the messages in which they appeared were “different”:

Combining the MIH and the email Date (Sent Date & Time)*

Draft messages without Message IDs

SPAM and Fraudulent Messages

System Generated Emails

Malformed or Corrupted Message IDs

Messages with Prepended or Appended Headers, Footers and Signatures

Messages with BCCs

Messages with Stripped or Corrupted Attachments

Messages with Time Anomalies

Items that are Not Email Messages

Note: In these scenarios the Message-ID alone is not "unique" enough. That is why concatenating the Message-ID and the Communication Date, then deriving the MD5 hash of the new string, is most effective.
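
The effect of combining the Message-ID with the date can be sketched as follows. The exact way the date is formatted and concatenated is not stated here, so the UTC timestamp format and the function names mih and mih_with_date are purely hypothetical; the sketch only shows that two messages sharing a Message-ID but sent at different times receive the same MIH and different combined hashes.

```python
import hashlib
from datetime import datetime, timezone


def mih(message_id: str) -> str:
    """MD5 of the Message-ID alone."""
    return hashlib.md5(message_id.encode("ascii")).hexdigest()


def mih_with_date(message_id: str, sent: datetime) -> str:
    """MD5 of the Message-ID concatenated with a sent timestamp.

    ASSUMPTION: the timestamp is normalized to UTC and appended in
    ISO 8601 form; the real concatenation format may differ.
    """
    stamp = sent.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return hashlib.md5((message_id + stamp).encode("ascii")).hexdigest()


if __name__ == "__main__":
    message_id = "<reused-id@example.com>"  # placeholder value
    first = datetime(2024, 1, 5, 9, 30, tzinfo=timezone.utc)
    second = datetime(2024, 3, 2, 14, 0, tzinfo=timezone.utc)
    print(mih(message_id))
    print(mih_with_date(message_id, first))
    print(mih_with_date(message_id, second))
```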