Glossary

A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P | Q | R | S | T | U | V | W | X | Y | Z

A

abbreviations: Custom user-created categories that group common item abbreviations under the same overarching category.

Example categories: Calendar, States, Common

Example item abbreviations: Jan, Feb, Mar | MA, NY, VT | appt, Mr, Mrs, MD, corp

accuracy: The metric that measures how often a model correctly predicts the outcome.  
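
As a rough illustration (standard definition, not specific to Nuix NLP), accuracy can be computed from confusion-matrix counts; the function below is a minimal sketch with assumed variable names.

    def accuracy(tp, tn, fp, fn):
        # Correct predictions (true positives + true negatives) over all predictions.
        return (tp + tn) / (tp + tn + fp + fn)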

aliases: Synonyms and conjugations of a lexeme. Nuix NLP considers all of these terms as belonging to the same entity.

average negative proximity: The average proximity of negative testing text. This should be considerably lower than the average positive proximity.

average positive proximity: The average proximity of positive training text. This should be considerably higher than the average negative proximity.
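
As an illustration only (the values and variable names below are assumed, not part of Nuix NLP), both averages are simple means over the proximity scores of each document set; a well-trained model shows a wide gap between them.

    positive_proximities = [0.91, 0.84, 0.78]   # scores from positive training text (example values)
    negative_proximities = [0.12, 0.20, 0.05]   # scores from negative testing text (example values)

    avg_positive = sum(positive_proximities) / len(positive_proximities)
    avg_negative = sum(negative_proximities) / len(negative_proximities)
    assert avg_positive > avg_negative          # expected for a well-trained model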

B

C

campaigns: Projects created to process texts. A campaign specifies the pipeline to process with, the source of the data, and custom email alerts that are sent when classification criteria are met. Campaigns are used in standalone NLP environments.

classifier: Interchangeable term for any text analytics object. The individual dictionaries, skills, and compound lexemes built by users can all be referred to as "models", "classifications", or "classifiers".

clustering: Automatically group documents into logically named clusters and sub-clusters based on language patterns. 

compound lexeme: Provides the user with a ‘building block’ interface for creating complex search criteria. Wraps regex, named entities, geolocations, keywords, lexemes, and other system parts with simple logic to extract anything of interest within the text. Examples: social security numbers, persons and phone numbers, and personal health information. Also called a cognitive expression or cogex.
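
Compound lexemes are built in the Nuix NLP interface rather than in code, but as a loose analogy, the Social Security number example corresponds to pattern matching along the lines of the generic regular expression below (illustrative only, not the product's syntax).

    import re

    # Generic SSN-shaped pattern (three digits, two digits, four digits); illustrative only.
    ssn_pattern = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
    print(ssn_pattern.findall("Call John at 555-0100 about SSN 123-45-6789."))
    # ['123-45-6789']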

D

definition: The definition of a lexeme. This is what Nuix NLP uses to determine the relevance of a given term within text. The definitions establish context for each of the terms, providing disambiguation across the terms.

dictionaries: Identify the subject matter of a given document, or what the document is about. Dictionaries have a topic (referenced as 'Topic-Type' in the output). Dictionaries work best if the classification is noun-heavy or contains common phrases that require disambiguation to perform accurately. See also topic.

E

F

F1 score: An evaluation metric that combines precision and recall to provide a balanced measure of a model’s performance. 
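
The standard formula (not specific to Nuix NLP) combines the two metrics as their harmonic mean; a minimal sketch:

    def f1_score(precision, recall):
        # Harmonic mean of precision and recall; defined as 0 when both are 0.
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)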

feeds engine: Ingestion service used to process large datasets quickly. When enabled, the feeds constantly pull text from a given source and process it using the models users have pushed. Holds the 'production' versions of the models available.

G

geolocation: Physical location that is associated with a name found in text. Geolocations have their own set of aliases, definitions, and extra geo-specific fields (county, longitude, and latitude). These share a similar hierarchy to defined terms.

global validation: Test a collection of skills (or all skills) against each other to ensure accuracy, and auto-update models to increase accuracy based on results.

I

ignore list: A list of words intentionally ignored for a document classification. If the classifier is rebuilt, these words and phrases will be intentionally left out of the weighted terms. See also match list.

L

language models: A mathematical representation of a language, created by Nuix NLP by feeding in a large quantity of text in that language.

lexeme: A term or phrase paired with aliases and a brief text, enabling Nuix NLP to determine if a certain term is relevant within a body of text. 

M

matches: An automatically generated set of words. These words are paired with weight grades of importance based on the training text documents. You can increase or decrease the weight of words based on their importance and topic within the text model. You can also add new matches or delete existing matches.

match list: A list of terms and their respective weights that are important to a subclass, generated using Nuix NLP's Intelligent TF-IDF. Unlike lexemes, these whitelist terms do not have brief text associated with them. 
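
Nuix NLP's Intelligent TF-IDF is its own implementation, but classic TF-IDF weighting, the general technique it builds on, looks roughly like the sketch below (illustrative only; the function and variable names are assumed).

    import math

    def tf_idf(term, doc_tokens, corpus):
        # Term frequency in this document times inverse document frequency across the corpus.
        tf = doc_tokens.count(term) / len(doc_tokens)
        docs_with_term = sum(1 for doc in corpus if term in doc)
        idf = math.log(len(corpus) / (1 + docs_with_term))
        return tf * idf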

model: Term for any text analytics object. The individual dictionaries, skills, and compound lexemes built by users can all be referred to as "models", "classifications", or "classifiers".

model validation: Test a set of text against a model to ensure high accuracy for that model.

N

normalization rules: Custom user-added rules that tell NLP how to associate and map certain words, such as prefixes, suffixes, or aliases, to specific pre-existing NLP values.

O

ontology: The complete set of dictionaries and skills created and trained on a server. It can be thought of as the entire knowledge base Nuix NLP has to pull from when classifying text.

optimal threshold: The proximity cutoff that yields the best F1 score. Anything at or above this value should be treated as a positive match, while anything below it should be treated as a negative.
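
Conceptually (this is a generic sketch, not the product's implementation, and the data structure is assumed), the optimal threshold can be found by sweeping candidate cutoffs and keeping the one with the best F1 score:

    def optimal_threshold(scored_docs, thresholds):
        # scored_docs: list of (proximity, is_actually_positive) pairs -- assumed structure.
        best_cutoff, best_f1 = None, -1.0
        for t in thresholds:
            tp = sum(1 for score, pos in scored_docs if score >= t and pos)
            fp = sum(1 for score, pos in scored_docs if score >= t and not pos)
            fn = sum(1 for score, pos in scored_docs if score < t and pos)
            precision = tp / (tp + fp) if (tp + fp) else 0.0
            recall = tp / (tp + fn) if (tp + fn) else 0.0
            f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
            if f1 > best_f1:
                best_cutoff, best_f1 = t, f1
        return best_cutoff  # the cutoff that maximizes F1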

P

pipelines: Exposed text processing rules that users can tweak based on needs.

precision: The metric that measures how many of the documents the model flags as positive are actually positive. Low precision indicates many false positives.
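
In standard terms (not Nuix-specific), precision is the share of predicted positives that are actually positive; a minimal sketch:

    def precision(tp, fp):
        # True positives over everything the model flagged as positive.
        return tp / (tp + fp) if (tp + fp) else 0.0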

R

recall: The metric that measures how many of the actual positives the model successfully identifies. Low recall indicates many false negatives.
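
In standard terms, recall is the share of actual positives that the model finds; a minimal sketch:

    def recall(tp, fn):
        # True positives over everything that is actually positive.
        return tp / (tp + fn) if (tp + fn) else 0.0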

relevance: How similar a document is to a classification the user has created. The higher the value (on a scale of 0.0 to 1.0), the more the document resembles that classification. Also referred to as proximity.

risk rules: Customizable weighted prioritization tags that allow users to determine what documents are brought to the top of the review. 

S

skills: This feature is best used to classify a ‘type’ of document. Skills identify content at the document level and are trained on example documents of what users want identified.

skillset: Named buckets that contain skills that classify texts. For example, a skillset of human resource forms might contain the skills resume, W-2, and employee evaluation; a skillset of source code might contain Python, Ruby, Swift, and C++.

stop words: The words that Nuix NLP intentionally ignores when processing text. Stop words cannot be added as lexemes.

T

tags (icon used in UI): These power the risk value functionality and allow the user to access additional features like parts of speech suppression and proximity suppression. The tagging system enables users to add one of four types of static values to skills and topics.

testing sets: Sets of documents used to test the performance of skill and topic models. We recommend 50-150 additional documents, separate from the original training document sets.

text editor: Process text to view results of all models found on the system and assess accuracy on a doc-by-doc basis. Also known as model editor. 

topic: The second tier in a dictionary. Topics target specific subject matter within their parent dictionary. For example: dictionary: nature, topic: fauna; dictionary: economics and finance, topic: cryptocurrency.

training text: The text that is used to teach Nuix NLP about a given subject. Training texts are required for skill models but are optional for topic models and dictionaries.

U

update model: You can build a model by clicking “update model”; the process should only take a few seconds. When you update the model, you also have the option of processing the training texts against it by selecting the “recalculate training text proximities” checkbox. Selecting this option gives feedback on which types of documents are under-represented and where the model is underperforming.

V

validation: Service that is used to test the accuracy of a model. Stores validation documents, F1 score, precision, and recall for the model the user wants to test. Users can create any number of positive and negative buckets for validation documents.

visualization: The model's validation results, which include the F1 score, the confusion matrix results, two histograms for positive and negative validation, and the proximity bins and counts of documents in each bin.

W