Use Regex Queries to Find Named Entities
This section covers how to:
Add named entities to the Named Entities library
Perform a regular expression search to find named entities
About Regex Queries for Named Entities
A regular expression describes a search pattern that matches text strings, such as particular characters, numbers, or words. Named entities are stored as regular expression files in the Named Entities library in C:\Program Files\Nuix\Nuix Application version\user-data\Named Entities.
Add Named Entities to the Named Entities Library
You can add named entities to the Named Entities library by adding a regular expression file (regexp.list) and updating the regex.properties file. Adding a named entity requires the following files:
A regexp file, which is a plain text .list file that holds your regular expression
An icon, which is a .png file, ideally 48 x 48 pixels or smaller
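The icon size guideline above can be checked before you copy the file. The following is a minimal Python sketch, not part of Nuix Workstation: it reads the width and height from the header of a PNG file and compares them with the 48-pixel guideline. The function names png_size and icon_fits are hypothetical helpers introduced for illustration.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(data):
    """Return (width, height) from the first 24 bytes of a PNG file."""
    # A PNG starts with an 8-byte signature, then the IHDR chunk:
    # 4-byte length, 4-byte type "IHDR", 4-byte width, 4-byte height.
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG file")
    return struct.unpack(">II", data[16:24])


def icon_fits(data, limit=48):
    """Return True if both dimensions are at most `limit` pixels."""
    width, height = png_size(data)
    return width <= limit and height <= limit
```

To use it, open the icon in binary mode, read the first 24 bytes, and pass them to icon_fits. A larger icon is not rejected by this check for any reason other than the "ideally 48 x 48" guideline stated above.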
To add named entities to the Named Entities library:
Copy the regexp .list file to the default directory in: C:\Program Files\Nuix\Nuix Application version\user-data\Named Entities.
Make a backup of the file.
Access the same directory and re-open the regexp .list file.
Add the following required information to this file so that the new named entity is displayed:
On Line 1, start with a # (to indicate a comment) followed by a description of the group the entity is part of, such as a location or named entities, which are the default groups in Nuix Workstation.
On Line 2, name the entity.
On Line 3, provide the filename of the icon that you want the entity to point to.
For example, if you create a Social Media entity, enter:
#NamedEntities.social.group=Custom
NamedEntities.social.title=Social Media
<PNG file name> (which must be identical to the way it is named in the regex files)
Update the regex.properties file with the group and title information.
Relaunch Nuix Workstation.
Right-click all items, and select Reload Items from Source data.
In the Data Processing Settings window, ensure you enable the Extract named entities from text and Extract named entities from properties options.
After processing completes, in the Results pane, change the Results view to Entities.
Select the Entities list to view the new entity you added.
Select the new entity to see the items identified from reprocessing and added to the entity.
Perform a Regular Expression Search to Find Named Entities
To perform a regular expression search to find named entities:
Add the forward-slash character (/) to the start and end of the regular expression you want to use for a search.
Enter literal characters in lowercase only (matching is not case-sensitive).
Although you cannot use expressions such as the caret (^) to find the start of a line in the text, you can form complex phrase queries using spaces and “slop”. See the Use Slop in a Regular Expression section below for details.
Available patterns include:
Syntax | Results |
\d | A digit (0-9) |
\D | A non-digit |
| | Matches either the left or right side |
[] | One of the characters within the brackets |
. | Any character |
.* | The same as a multiple character wildcard search |
\b | A word boundary. Hyphenated words are broken up by word boundaries. This matches hyphen boundaries and the end of a word. |
^ | The start of a word. Will not match hyphen boundaries. |
$ | The end of a word. Will not match hyphen boundaries. |
The pattern is matched against each word.
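Per-word matching can be illustrated with a short Python sketch. This is an approximation under stated assumptions, not the Nuix implementation: it assumes text is lowercased and split into words on non-word characters, and it uses Python's re module, whose syntax differs in places from the engine Nuix uses. The helper names words and matching_words are hypothetical.

```python
import re


def words(text):
    """Lowercase the text and split it into words on non-word characters,
    so a hyphenated term such as 'e-mail' yields 'e' and 'mail'."""
    return re.findall(r"\w+", text.lower())


def matching_words(pattern, text):
    """Return the words that the pattern matches in full."""
    compiled = re.compile(pattern)
    return [w for w in words(text) if compiled.fullmatch(w)]


# 'greyhound' is not returned because the whole word must match.
print(matching_words(r"gr[eao]y", "A grey cat, a GRAY dog and a greyhound"))
```

Because each word is tested on its own, anchors such as ^ and $ can only ever refer to the start and end of a word in this model, never to the start or end of a line.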
Use Slop in a Regular Expression
"Slop" in a phrase query allows you to search for words within a certain distance of each other, using the tilde (~) symbol at the end of a phrase query along with a numerical value that indicates the number of unrelated words that can occur in between.
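The slop rule described above can be sketched in Python as follows. This is a simplified model, not the Nuix implementation: it assumes the phrase terms must appear in order, that each term is matched against a whole lowercased word, and that the slop value is a budget of unrelated words shared across the whole phrase. The function name phrase_matches is hypothetical.

```python
import re


def phrase_matches(term_patterns, text, slop=0):
    """Return True if words matching each pattern occur in order with at
    most `slop` unrelated words in between (counted over the whole phrase)."""
    tokens = re.findall(r"\w+", text.lower())
    compiled = [re.compile(p) for p in term_patterns]

    def search(term_index, start, gaps_left):
        if term_index == len(compiled):
            return True
        if term_index == 0:
            # The first term may start at any word.
            candidates = range(len(tokens))
        else:
            # Later terms must fall within the remaining gap budget.
            candidates = range(start, min(len(tokens), start + gaps_left + 1))
        for pos in candidates:
            if compiled[term_index].fullmatch(tokens[pos]):
                used = 0 if term_index == 0 else pos - start
                if search(term_index + 1, pos + 1, gaps_left - used):
                    return True
        return False

    return search(0, 0, slop)


# Two unrelated words ('a', 'red') sit between 'ate' and 'apple'.
print(phrase_matches(["eat|ate", "apple|orange"], "I ate a red apple", slop=2))
```

With slop=1 the same text no longer matches, because two unrelated words separate the terms.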
Examples of regular expression queries:
Query Syntax | Action |
/apple|orange/ | Matches all items containing either 'apple' or 'orange'. |
/eat|ate apple|orange/ ~2 | Matches all items containing either 'eat' or 'ate' and then 'apple' or 'orange', with up to two unrelated terms separating them. Note: Only whole words are matched. To match partial words, use wildcard characters. |
/gr[eao]y/ | Matches all items containing either 'grey', 'gray', or 'groy'. |
/gr[^eao]y/ | Matches all items containing at least one word starting with 'gr' followed by a character that is not 'e', 'a' or 'o', followed by 'y'. This query would match 'griy' and 'gr3y'. |
/.oe.* not/ | An example of a phrase query. Matches all items with a word starting with any character followed by 'oe', optionally followed by any other characters, then the word 'not'. This query would match 'does not', 'joe not' and 'ioexception not'. |
/0\d{1,3}/ | Matches all items containing a word made up of '0' followed by 1 to 3 digits. This query would match '02', '0404', '00' and '080'. |
/0\d{1,3} \d{3,4} \d{3,4}/ OR /0\d{1,3} \d{6,8}/ | Matches all items that may contain local phone number patterns. The first part of this query would match '02 2328 1929', '043 232 192' and '0404 0233 2333'. The second part would match '02 23281929', '043 23221923' and '0404 023323'. There are different conventions for how phone numbers are grouped, so adjust this query for those cases. |
/[\u0400-\u052f]*/ | Matches all items containing words in the Unicode Cyrillic and Cyrillic Supplement alphabet families. Note: Adding an asterisk (*) highlights whole words for languages such as Russian and Serbian. |
/[\p{InCyrillic}\p{InCyrillic_Supplementary}]*/ | Matches all items containing words in the Unicode Cyrillic and Cyrillic Supplement alphabet families, using the Unicode block names. |
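Before running a query across a case, it can help to try the pattern on a few sample strings. The following Python sketch does this for the phone number and Cyrillic examples above. It is an approximation only: it assumes whole-word matching on lowercased words and uses Python's re module, which does not support the \p{InCyrillic} block syntax, so the code point range form is used instead. The function name find_phrases and the sample text are hypothetical.

```python
import re


def find_phrases(term_patterns, text):
    """Return runs of consecutive words where word i fully matches pattern i."""
    tokens = re.findall(r"\w+", text.lower())
    compiled = [re.compile(p) for p in term_patterns]
    hits = []
    for start in range(len(tokens) - len(compiled) + 1):
        window = tokens[start:start + len(compiled)]
        if all(c.fullmatch(w) for c, w in zip(compiled, window)):
            hits.append(" ".join(window))
    return hits


# Each list holds one pattern per word of the phrase query.
three_groups = [r"0\d{1,3}", r"\d{3,4}", r"\d{3,4}"]
two_groups = [r"0\d{1,3}", r"\d{6,8}"]
sample = "Call 02 2328 1929 or 0404 023323 after 5pm"

print(find_phrases(three_groups, sample))
print(find_phrases(two_groups, sample))
print(find_phrases([r"[\u0400-\u052f]*"], "Привет world, здраво"))
```

Trying both phone number forms on the same text shows why the original query joins them with OR: each form catches a grouping convention that the other misses.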
Note: For information on formulating character-based regex patterns for searching on named entities, visit: http://regexlib.com/CheatSheet.aspx.