Deduplicate emails using EDRM MIH hashing
Nuix Workstation v100.2.0 can now generate EDRM MIH hashes to perform cross-platform email duplicate identification.
About EDRM MIH
Since 2005 the Electronic Discovery Reference Model (EDRM) has created practical resources to improve e-discovery, privacy, security and information governance. The EDRM is used in 145 countries by individuals, law firms, corporations and government organizations seeking to improve the practice and provision of data and legal discovery.
The EDRM now provides a Cross Platform Email Duplicate Identification Specification for identifying duplicates across multiple email platforms using the hash value of an email's Message-ID metadata field, known as the EDRM Message Identification Hash (“MIH”). The EDRM MIH is the MD5 hash value of the ASCII string consisting of the Message-ID header field value of an RFC-compliant email message.
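As a rough illustration of the definition above, the following Python sketch computes an MD5 digest over the ASCII bytes of a Message-ID value. The function name edrm_mih is hypothetical and is not part of Nuix Workstation or any EDRM tooling. The sample value is the Message-ID used in the example later in this topic.

```python
import hashlib


def edrm_mih(message_id: str) -> str:
    """Return the MD5 hex digest of a Message-ID value.

    Hypothetical helper for illustration only. The value is hashed exactly
    as given, including the flanking angle brackets, and is encoded as ASCII.
    """
    return hashlib.md5(message_id.encode("ascii")).hexdigest()


# Message-ID taken from the example later in this topic.
mih = edrm_mih("<CALckR-a8UDkRjO4xJyjd_s0GPxQWw@mail.gmail.com>")
print(mih)  # a 32-character lowercase hexadecimal string
```

Because the hash depends only on the Message-ID string, any platform that applies the same steps to the same value should arrive at the same MIH.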
The EDRM MIH does not replace current Nuix Workstation email deduplication methods. Instead, it enables cross-platform email duplicate identification so that organizations can identify duplicate emails in an efficient, defensible and cost-effective manner.
Before the MIH was developed, there was no economical way to deduplicate across platforms, so parties ended up paying for duplicative review and analysis of the same documents. Each vendor had its own method of email hashing and deduplication, which made it impossible to exchange email hash results between vendor applications. Leading platforms that now use the MIH include Relativity, Reveal-Brainspace, EDT and Nuix.
The EDRM MIH is an open-source, collaborative effort of leading companies and technologists, backed by the EDRM, and it integrates with existing tools and workflows. It is anticipated that litigants will exchange EDRM MIH values in load files as routinely as parties now exchange Bates numbers and file names.
Nuix Workstation v100.2.0 can now generate the MIH for cross-platform email duplicate identification.
Note: The ability to export load files using EDRM MIH is only available for Nuix Workstation licenses that allow Legal Exports.
Benefits
The new standard MIH hash value provides the following benefits:
A faster alternative way of deduplicating emails, especially emails that have been seen before, even when the form of production has changed
Better exchangeability of data between platforms and vendors
Wide vendor support (by Relativity, EDT, Reveal, and our Nuix API)
Vendors do not need to change the way they deduplicate email messages internally
Requesting parties can identify duplicates of email messages across production sets:
From the same party
From different producing parties
Across different matters
The ability, for example, to reliably deduplicate a TIFF production set from one party against a PDF or native set from another party
Smaller review populations, shorter review times and smaller hosting fees
Greater consistency in coding, leading to improved predictive coding scoring and less need to reconcile disparate coding for duplicates
Greater insight into other parties' productions with less time and effort
Ability to repurpose and leverage previous work product for future matters
Increased flexibility in allowing data to remain in multiple databases and locations while applying data minimization techniques before consolidating the data into a central location
Greater ease in moving matters across platforms or service providers
How to deduplicate emails using EDRM MIH hashing
To deduplicate emails using EDRM MIH hashing:
To maximize the uniqueness of the MD5 values returned, on the Data Processing Settings tab of the Edit Processing Profile window, under Digest Settings > Email Digest Settings, select the following:
Use EDRM MIH check box
Include Communication Date check box (optional)
This is similar to enabling the MD5 check box under Digest to compute, which calculates new hash values during ingestion.
Note: Enabling or disabling the Include Communication Date option has no effect on the EDRM MIH hash values produced, as these are always based only on the Message-ID. However, when this option is enabled, the EDRM MIH and MD5 hash values differ, because the MD5 values are based on the Message-ID combined with the date. (When Include Communication Date is off, the EDRM MIH and MD5 hash values are exactly the same.)
Then search using text-custom-metadata:"edrm-mih:*" or "edrm-mih:12345" to find targeted results in the Results view. You can then see more details for each result on the Metadata tab of the Preview pane.
Prerequisites for Message-ID values
Message-ID values must:
Have the following format: "<"id-left"@"id-right">", where the id-left and id-right sections must each contain 1 or more alphanumeric characters.
Not contain spaces, angle brackets, or @ (at symbol) within the id-left and id-right sections.
Examples:
Message-ID header line from email: Message-ID: <C>
Value passed to MIH generator: <CALckR-a8UDkRjO4xJyjd_s0GPxQWw@mail.gmail.com>
Generated MIH: 1de319c276884bd0c9e2f1621ada26cc
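The format prerequisites can be approximated with a regular expression. The following Python sketch is an assumption about how such a check might look and is not the validation that Nuix Workstation performs. The function name is_valid_message_id is hypothetical. The pattern enforces only the exclusion rule (no spaces, angle brackets, or @ inside id-left and id-right), because the sample Message-ID above also contains hyphens, underscores, and periods.

```python
import re

# Assumption: id-left and id-right may contain any character except
# whitespace, angle brackets, and "@". Each side needs at least one character.
_MESSAGE_ID_PATTERN = re.compile(r"<[^\s<>@]+@[^\s<>@]+>")


def is_valid_message_id(value: str) -> bool:
    """Return True when the whole value has the form <id-left@id-right>."""
    return _MESSAGE_ID_PATTERN.fullmatch(value) is not None


print(is_valid_message_id("<CALckR-a8UDkRjO4xJyjd_s0GPxQWw@mail.gmail.com>"))  # True
print(is_valid_message_id("CALckR@mail.gmail.com"))  # False: angle brackets missing
print(is_valid_message_id("<left @right>"))  # False: contains a space
```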
Limitations
The MIH does not generate if an email has no Message-ID or an invalid Message-ID value.
(To generate the MIH, the complete Message-ID value, including the flanking angle brackets, MUST be used.)
Changing the character case of the Message-ID value before MIH generation changes the hash value of the string.
If the email contains more than one Message-ID value, then only the first Message-ID value in the parent email message headers is used to generate the MIH.
Non-email messages do not generate MIHs.
Currently, calendar or contact items which may have Message-IDs will not generate MIHs.
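Several of the limitations above can be illustrated with Python's standard email package. The sketch below is a simplified assumption of the behavior, not the Nuix Workstation implementation. The function name mih_from_raw_email is hypothetical, and the validity check is deliberately minimal. It returns no MIH when the Message-ID is missing or malformed, uses only the first Message-ID header, and hashes the value without changing its case.

```python
import hashlib
from email import message_from_string
from typing import Optional


def mih_from_raw_email(raw: str) -> Optional[str]:
    """Return the MIH for a raw message, or None when none can be generated."""
    msg = message_from_string(raw)
    values = msg.get_all("Message-ID")  # None when the header is absent
    if not values:
        return None
    first = values[0].strip()  # only the first Message-ID is considered
    # Minimal validity check (assumption): angle brackets must be present.
    if not (first.startswith("<") and first.endswith(">")):
        return None
    try:
        data = first.encode("ascii")
    except UnicodeEncodeError:
        return None  # treat a non-ASCII value as invalid
    return hashlib.md5(data).hexdigest()


raw = (
    "Message-ID: <first@example.com>\n"
    "Message-ID: <second@example.com>\n"
    "Subject: Hello\n"
    "\n"
    "Body text\n"
)
print(mih_from_raw_email(raw))  # hash of "<first@example.com>" only
print(mih_from_raw_email("Subject: No identifier\n\nBody text\n"))  # None
```

Changing the case of the value, for example to "<FIRST@example.com>", yields a different digest, which is why the Message-ID must be hashed exactly as it appears in the header.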
Scenarios where the MIH on its own may be inadequate to perform deduplication
Although Message-IDs are required to be “guaranteed” as globally unique, the EDRM Committee identified the following scenarios where Message-IDs were absent, or were the “same” even though the messages in which they appeared were “different”:
Combining the MIH and the email Date (Sent Date & Time)*
Draft messages without Message IDs
SPAM and Fraudulent Messages
System Generated Emails
Malformed or Corrupted Message IDs
Messages with Prepended or Appended Headers, Footers and Signatures
Messages with BCCs
Messages with Stripped or Corrupted Attachments
Messages with Time Anomalies
Items that are Not Email Messages
Note: This combination produces a Message-ID that is not "unique" enough. That is why concatenating the Message-ID and Communication Date to derive the MD5 hash of the new string is most effective.
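The concatenation described in the note above can be sketched as follows. This topic does not state how the date is formatted or joined to the Message-ID, so the UTC timestamp layout and the direct string concatenation used here are assumptions made purely for illustration. The function name digest_with_date is hypothetical, and its output is not expected to match the MD5 values that Nuix Workstation produces.

```python
import hashlib
from datetime import datetime, timezone


def digest_with_date(message_id: str, sent: datetime) -> str:
    """Return the MD5 hex digest of the Message-ID joined with the sent date.

    Assumption: the timezone-aware sent date is normalized to UTC, formatted
    as YYYY-MM-DDTHH:MM:SSZ, and appended directly to the Message-ID.
    """
    stamp = sent.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return hashlib.md5((message_id + stamp).encode("ascii")).hexdigest()


message_id = "<id-left@example.com>"  # hypothetical Message-ID
first = digest_with_date(message_id, datetime(2024, 1, 5, 9, 30, tzinfo=timezone.utc))
second = digest_with_date(message_id, datetime(2024, 1, 5, 9, 31, tzinfo=timezone.utc))
print(first != second)  # True: same Message-ID, different sent times
```

Two messages that share a Message-ID but were sent at different times therefore receive different digests, while the MIH itself stays the same for both.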