Review and search for irregular file types

During the ingestion process, Nuix Workstation flags irregular files and presents them as part of the Statistics View. You can access each row from that view or through a query. You must review irregular files every time you process data to ensure that all the data is processed correctly. Nuix Workstation records an item failure for a *.txt file in the same way as for a PST file. Review any questionable items and reprocess them if necessary.

This section describes how to search for these questionable items by irregular file type:

Items with bad extensions

Corrupted items

Deleted items

Empty items

Encrypted items

Non-searchable PDFs

Text-stripped items

Unrecognized items

Unsupported items

Note: Nuix Workstation only presents the types of irregular files present in the current case. The types can vary in each case.

Search for items with bad extensions

Bad Extension indicates items whose MIME-type is inconsistent with their file extension.

Image 2380

In the preceding example image, the Family.jpeg file is not an image but is actually a Microsoft Word document.

To search for files with improper extensions, use the following search syntax: flag:irregular_file_extension
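The idea behind this flag can be illustrated outside Nuix Workstation with a minimal sketch that compares the type implied by a file's extension with the type implied by its leading bytes. The signature table, the extension table, and the function names below are illustrative assumptions, not Nuix Workstation's detection logic.

```python
import os

# Hypothetical, deliberately tiny signature table (leading bytes -> MIME-type).
MAGIC_SIGNATURES = [
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"%PDF-", "application/pdf"),
    # OLE2 container, simplified here to mean a legacy Microsoft Word document.
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "application/msword"),
]

# Hypothetical extension table (extension -> expected MIME-type).
EXTENSION_TYPES = {
    ".jpeg": "image/jpeg",
    ".jpg": "image/jpeg",
    ".pdf": "application/pdf",
    ".doc": "application/msword",
}


def detect_type(header: bytes):
    """Return the MIME-type whose signature starts the header, or None."""
    for signature, mime_type in MAGIC_SIGNATURES:
        if header.startswith(signature):
            return mime_type
    return None


def has_bad_extension(file_name: str, header: bytes) -> bool:
    """True when the extension and the detected type are both known and disagree."""
    extension = os.path.splitext(file_name)[1].lower()
    expected = EXTENSION_TYPES.get(extension)
    detected = detect_type(header)
    if expected is None or detected is None:
        return False
    return expected != detected
```

With these tables, a file named Family.jpeg whose first bytes are the OLE2 signature is reported as a mismatch, while a .jpg file that starts with the JPEG signature is not.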

Note: Nuix Workstation sets a native file's extension to File Extension (Corrected) during an export. It records the exported item's definitive metadata in the Export Item Summary, per-item XHTML report files, or load file.

Search for corrupted items

Corrupted items are those that Nuix Workstation has been unable to process. These items may also be referred to as evidence containers, evidence file containers, or evidence repositories. Nuix Workstation marks a document corrupt if:

It is unable to open the file.

Opening the file results in some type of failure.

It is otherwise unable to process the file.

For items listed as Corrupted, the File Type property displays the type of corruption. Additionally, two pieces of metadata may be recorded: FailureDetail and FailureMessage. By reviewing these items or optionally building a specific Metadata Profile that contains these fields, you can gain insight into the nature of the failures. A reason could be something as simple as a file being locked by an external process. Hovering over the FailureDetail value displays a message with full details for you to review.

To search for corrupted items, use the following search syntax: properties:FailureDetail
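When many items are corrupted, grouping them by failure message can show whether a single cause accounts for most of them. The sketch below assumes you already have the items' metadata as a list of dictionaries, for example from an export built with a Metadata Profile that includes FailureMessage. The row layout, the function name, and the sample values in the usage note are hypothetical.

```python
from collections import Counter


def summarize_failures(rows):
    """Count items per FailureMessage value, most frequent first.

    `rows` is assumed to be a list of dicts with an optional
    "FailureMessage" key (a hypothetical export layout).
    """
    counts = Counter(
        (row.get("FailureMessage") or "").strip() or "(no message)"
        for row in rows
    )
    return counts.most_common()
```

For example, three rows where two share the message "File is locked" produce a list whose first entry is ("File is locked", 2), which suggests checking for an external process holding the files before reprocessing.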

Search for deleted items

Deleted items are those items that Nuix Workstation extracted from the slack space of Microsoft email boxes.

Deleted email messages are not items listed in the Deleted Items folder. Instead, they are items that have been "permanently deleted" from within Outlook or Outlook Express. While processing them, Nuix Workstation attempts to extract as many fragments as possible, and reconstitute complete messages. If only a portion of the message still exists, Nuix Workstation extracts the available portion.

To search for deleted items, use the following search syntax: flag:deleted

Search for empty items

Empty items are items that are zero (0) bytes in size.

To search for empty items, use the following search syntax: mime-type:application/x-empty

Note: The classification of exceptions is based on our knowledge of file types. It is recommended that you save the diagnostics information to a file, which allows you to review the exceptions later.

Search for encrypted items

Encrypted items are those that Nuix Workstation has determined contain encrypted content. Nuix Workstation still extracts metadata and as much other information as possible from an encrypted file, but it cannot index all of the content.

To search for encrypted files, use the following search syntax: flag:encrypted

Identify encrypted files in a decrypted zip file

After you use a password to decrypt an encrypted zip file, most of the files it contains are decrypted, but one or more may remain encrypted. To generate a list of the encrypted files that belong to a decrypted parent file:

Run a search using the following flags: flag:encrypted AND NOT flag:decrypted AND NOT content:*

This identifies all encrypted items that were not decrypted and have no text. You can then find them in the Document Navigator’s No text folder.
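If you assemble such queries in a script before pasting them into the search field, the required and excluded terms can be kept separate. This is a minimal sketch: the helper name build_query is hypothetical, and it only joins terms that appear in this section with AND and AND NOT.

```python
def build_query(required, excluded):
    """Join required terms with AND and prefix each excluded term with NOT."""
    parts = list(required) + [f"NOT {term}" for term in excluded]
    return " AND ".join(parts)


# Encrypted items that were not decrypted and have no text.
still_encrypted = build_query(["flag:encrypted"], ["flag:decrypted", "content:*"])
print(still_encrypted)
```

The same helper reproduces the non-searchable PDF query when called with mime-type:application/pdf as the required term and content:* as the excluded term.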

Search for non-searchable PDFs

Non-searchable PDFs are items that are identified as PDFs through header recognition but do not contain indexable text. These items are most frequently image-only PDFs and warrant further investigation, because their content is not text-indexed and is therefore unsearchable by Nuix Workstation.

To search for non-searchable PDFs, use the following search syntax: mime-type:application/pdf AND NOT content:*

You can export these items, use a third-party tool to OCR the images (for example, PDF, TIFF, and PNG), and then import the searchable text and PDFs back into Nuix Workstation.

Search for text-stripped items

Text-stripped items are items where Nuix Workstation is able to identify the file type but does not have a routine to cleanly extract all text and metadata in accordance with the file type's API. The result is an item that is searchable, but the text may be garbled or not properly formatted.

Note: Nuix Workstation only strips out US-ASCII characters (punctuation, 0-9, A-Z, a-z). Nuix Workstation uses UTF-16LE encoding (a Unicode encoding used by Microsoft) to potentially extract more textual data.

To search for text-stripped files, use the following search syntax: flag:text_stripped
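Text stripping is conceptually similar to pulling printable character runs out of a binary file. The following sketch shows that general technique for US-ASCII runs and for the same characters stored as UTF-16LE. It is not Nuix Workstation's implementation; the minimum run length of four characters and the function name are assumptions made for the illustration.

```python
import re

# Runs of at least four printable US-ASCII bytes (space through tilde).
ASCII_RUN = re.compile(rb"[\x20-\x7e]{4,}")
# The same characters encoded as UTF-16LE: each is followed by a zero byte.
UTF16LE_RUN = re.compile(rb"(?:[\x20-\x7e]\x00){4,}")


def strip_text(data: bytes):
    """Return (ascii_runs, utf16le_runs) found in raw binary data."""
    ascii_runs = [m.group().decode("ascii") for m in ASCII_RUN.finditer(data)]
    utf16_runs = [m.group().decode("utf-16-le") for m in UTF16LE_RUN.finditer(data)]
    return ascii_runs, utf16_runs
```

Because the runs are extracted without knowledge of the file's internal structure, the recovered text can appear out of order or mixed with fragments that are not real content, which matches the garbled results described above.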

Types of text-stripped items

Text-stripped file types include the following (list is subject to change):

image/vnd.corel-draw

image/vnd.micrografx-designer

image/x-pict

application/vnd.adobe-photoshop

application/vnd.ms-shortcut

application/vnd.lotus-freelance

application/vnd.lotus-wordpro

application/vnd.borland-paradox

image/vnd.autocad-dwg

image/cgm

application/vnd.myob

application/x-js-taro

application/vnd.lotus-123

application/vnd.ms-works-ss

application/vnd.ms-works-wp

application/vnd.corel-slideshow

application/vnd.ms-visio

application/vnd.corel-quattro

application/vnd.corel-wordperfect

application/vnd.stardivision.calc

application/vnd.stardivision.draw

application/vnd.stardivision.impress

application/vnd.stardivision.math

application/vnd.stardivision.writer

application/x-hwp

application/octet-stream

Search for unrecognized items

Unrecognized items are items where Nuix Workstation did not recognize the header and was unable to assign a MIME-type. When Nuix Workstation cannot recognize the header in an item, the item is tagged as application/octet-stream and its text is stripped. In addition to extracting the ASCII text, Nuix Workstation extracts all recognizable system metadata.

Note: Nuix Workstation only strips out US-ASCII characters (punctuation, 0-9, A-Z, a-z). Nuix Workstation uses the UTF-16LE encoding (a Unicode encoding used by Microsoft) to potentially extract more textual data.

Unrecognized MIME-types

There are four potential unrecognized MIME-types:

XML

OLE2

TXT

Unknown binary

To search for unrecognized files, use the following search syntax: kind:unrecognized

Search for unsupported items

Unsupported items are those from which Nuix Workstation was unable to extract any content or text. To search for unsupported items, use the following search syntax:

( has-embedded-data:0 AND has-text:0 AND has-image:0 AND NOT kind:multimedia ) OR ( mime-type:application/vnd.lotus-notes AND has-embedded-data:0 )

See the Nuix Supported File Types document for the most current list of supported file types in Nuix Workstation.
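For repeatable reviews, it can help to keep the search syntax from this section in one place. The sketch below collects the queries listed above into a lookup table and combines selected categories with OR, using the same parenthesized grouping as the unsupported items query. The table name, the category labels, and the helper any_of are hypothetical; only the query strings come from this section.

```python
# Query strings copied from this section; the dictionary keys are labels only.
IRREGULAR_QUERIES = {
    "Bad extension": "flag:irregular_file_extension",
    "Corrupted": "properties:FailureDetail",
    "Deleted": "flag:deleted",
    "Empty": "mime-type:application/x-empty",
    "Encrypted": "flag:encrypted",
    "Non-searchable PDF": "mime-type:application/pdf AND NOT content:*",
    "Text-stripped": "flag:text_stripped",
    "Unrecognized": "kind:unrecognized",
    "Unsupported": (
        "( has-embedded-data:0 AND has-text:0 AND has-image:0 "
        "AND NOT kind:multimedia ) OR "
        "( mime-type:application/vnd.lotus-notes AND has-embedded-data:0 )"
    ),
}


def any_of(categories):
    """Build one query that matches items in any of the named categories."""
    return " OR ".join(f"( {IRREGULAR_QUERIES[name]} )" for name in categories)


print(any_of(["Deleted", "Encrypted"]))
```

Wrapping each category in parentheses keeps multi-term queries, such as the non-searchable PDF query, from mixing with their neighbors when they are combined.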