Perform predictive coding using Continuous Active Learning (CAL)

Predictive coding facilitates document review by using a human-reviewed set of training documents to make predictions on the entire population. On a population's Predictive Coding page, administrators can access the documents in a population that are most likely to be responsive. After users begin reviewing these documents, predictive coding uses an algorithm called Continuous Active Learning (CAL) to check for newly reviewed documents at regular intervals, and uses those documents' scores to refine the predictive model. The model is then used to re-prioritize documents for review. When you are satisfied with the model's performance, you can stop the review.

To create and train predictive models to use for coding, use the Predictive Models functionality, available under Analysis on the Case Home page. For more information about the differences between CAL and Predictive Models, see Compare standard predictive coding and Continuous Active Learning.

You must perform several preliminary steps before you can use CAL for predictions. For more information, see Preliminary steps for predictive coding.

Typical CAL workflow

The typical workflow to perform predictive coding using CAL includes the following steps:

Prepare for predictive coding:

Create a binder of documents.

Create a population from the binder, and create a random sample from the population.

Perform a traditional human review of the random sample. CAL will use this sample to train your predictive model.

Create a predictive coding template (optional).

Continuously train and predict:

Configure training.

Create and train the model.

CAL automatically trains the model and updates its predictions at a specified time interval. All coded documents are used to train the model.

Review the highest ranked documents.

Optionally, you can configure CAL to automatically rebuild assignments with the highest ranked documents.

The newly reviewed documents automatically update the model's training.

Assess the review progress by measuring the recall to date.

Decide whether you are satisfied with the model's performance:

If no, continue reviewing the highest ranked documents, and then reassess the review progress.

If yes, stop the review.
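The loop above can be sketched as a small, self-contained simulation. This is purely illustrative: the toy "model" below is a simple threshold learner standing in for the application's actual CAL algorithm, and all names and numbers are hypothetical.

```python
import random

random.seed(0)

def make_doc(i):
    # Toy document: hidden relevance plus a noisy feature the toy "model" learns from.
    relevant = random.random() < 0.2
    feature = (0.8 if relevant else 0.2) + random.uniform(-0.15, 0.15)
    return {"id": i, "relevant": relevant, "feature": feature, "mark": None}

docs = [make_doc(i) for i in range(2000)]

def train(reviewed):
    # "Train" by placing a threshold midway between positive and negative means.
    pos = [d["feature"] for d in reviewed if d["mark"] == "positive"]
    neg = [d["feature"] for d in reviewed if d["mark"] == "negative"]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

# Step 1: a traditional human review of a random sample seeds the model.
for d in random.sample(docs, 100):
    d["mark"] = "positive" if d["relevant"] else "negative"

# Steps 2-4: at each training interval, retrain on all coded documents,
# re-score the population, and review the highest-ranked documents.
for _ in range(5):
    threshold = train([d for d in docs if d["mark"]])
    for d in docs:
        d["score"] = d["feature"] - threshold      # higher score = more likely relevant
    unreviewed = sorted((d for d in docs if d["mark"] is None),
                        key=lambda d: d["score"], reverse=True)
    for d in unreviewed[:100]:                     # review the highest-ranked batch
        d["mark"] = "positive" if d["relevant"] else "negative"

found = sum(d["mark"] == "positive" for d in docs)
total = sum(d["relevant"] for d in docs)
print(f"recall to date: {found / total:.2f}")
```

Because the highest-ranked documents are reviewed first, the positive rate is front-loaded and recall climbs quickly, which is the behavior the workflow relies on when deciding to stop.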

Configure training for predictive coding

To configure training for predictive coding and generate a predictive model:

On the Case Home page, under Analysis, click Populations.

Click the name of a population.

In the navigation pane, click Predictive Coding.

Click Configure training.

In the Predictive coding template list, select a predictive coding template. You can select Standard or Standard + people, or you can select a custom template. 

Note: The Standard + people template gives more weight to individuals associated with a document (for example, the To and From fields and addresses found in email messages). Selecting this template means that documents that have people in common are considered more similar for the purposes of the training set, even if the documents do not share many concepts. For information on creating a custom predictive coding template, see Create a custom predictive coding template.

In the Training field [Pick List] list, select the pick list field that the application will use to compare with human reviewers' marks on the documents in the training set.

In the Positive list, select one or more values that the model should consider a positive mark made by a human reviewer. For example, you can configure the values responsive or privileged as positive marks.

In the Negative list, select one or more values that the model should consider a negative mark made by a human reviewer. For example, you can configure the values nonresponsive or not privileged as negative marks.

Note: You cannot designate the same values as both positive and negative.

To specify how frequently CAL training occurs, in the Time interval list, select one of the following options:

Hourly

4 Hours

12 Hours (default)

Daily

In the Start time list, select a date and time for the first training job to run.

Optionally, you can automatically re-prioritize assignments for review based on the updated CAL score after a training job finishes. Do the following:

Select the Auto-rebuild assignment check box, and then click Next.

Select from the following:

To rebuild only the assignments that are not assigned to a user, select Only unassigned.

To rebuild all assignments, including assignments that are assigned to a user or suspended, select All (assigned, unassigned, suspended).

In the Workflow list, click a workflow.

The phases associated with the workflow appear.

Click the name of a phase.

Click Save.
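Taken together, the choices in this procedure amount to a small set of parameters. The sketch below shows them as a plain data structure; the keys and values are illustrative only, not the application's API, and it also demonstrates the rule that the same value cannot be both positive and negative.

```python
# Hypothetical representation of a CAL training configuration.
training_config = {
    "template": "Standard + people",        # or "Standard", or a custom template
    "training_field": "Responsiveness",     # pick list field holding reviewers' marks
    "positive_values": ["Responsive"],
    "negative_values": ["Nonresponsive"],
    "time_interval": "12 Hours",            # Hourly, 4 Hours, 12 Hours, or Daily
    "start_time": "2024-01-15T08:00",       # first training job run
    "auto_rebuild_assignments": {
        "scope": "Only unassigned",         # or "All (assigned, unassigned, suspended)"
        "workflow": "Linear Review",        # hypothetical workflow name
        "phase": "First Pass",              # hypothetical phase name
    },
}

# A value cannot be designated as both positive and negative.
overlap = set(training_config["positive_values"]) & set(training_config["negative_values"])
assert not overlap, "positive and negative values must not overlap"
```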

Create and train the model

After you configure training, the application creates a new CAL model automatically and trains it using the reviewed documents in the population. The new model scores the entire population as soon as training is completed, giving the highest scores to the documents that are most likely to be relevant. At this point, if CAL is configured to automatically rebuild assignments based on the updated score, reviewers can get their updated assignments and review the documents that are most likely to be relevant.

CAL checks for newly reviewed documents automatically and uses them to refine the model. It then re-ranks the documents using the scores from the updated model. After initial training is complete, you can choose to disable this feature by clearing the Active (enable Continuous Active Learning) check box on the population's Predictive Coding page. You can manually check for newly reviewed documents and update the model's prediction by clicking Run training at the top of the Predictive Coding page.
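Conceptually, re-ranking is a descending sort on the score field: scores run from -1 (most likely negative) to +1 (most likely positive), and reviewers see the highest scores first. A minimal sketch with hypothetical document IDs and scores:

```python
# Hypothetical documents with CAL scores in the range -1 to +1.
docs = [
    {"id": "DOC-001", "cal_score": -0.45},
    {"id": "DOC-002", "cal_score": 0.82},
    {"id": "DOC-003", "cal_score": 0.10},
]

# Re-rank for review: descending score, most likely responsive first.
ranked = sorted(docs, key=lambda d: d["cal_score"], reverse=True)
print([d["id"] for d in ranked])  # ['DOC-002', 'DOC-003', 'DOC-001']
```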

Review the population

After creating and training the model, users can begin reviewing the documents that are most likely to be responsive.

Locate predicted responsive documents

Administrators can identify documents that are most likely to be responsive in the following ways.

Tip: You can configure CAL to automatically rebuild assignments after a training job finishes. This allows you to automatically re-prioritize documents for review based on the updated CAL score.

To find documents using the links on the Predictive Coding page:

To access the Predictive Coding page for a population, do the following:

On the Case Home page, under Analysis, click Populations.

Click the name of a population.

In the navigation pane, click Predictive Coding.

Under the Population heading on the left or the Sample heading on the right, click any of the links in the Positives, Negatives, or Unreviewed columns.

Important: By default, documents appear in descending CAL score order, from most positive to most negative.

To find documents using advanced search:

Access the Search page. For more information, see Perform an advanced search.

In the Select a field box, select [RT] CAL - PopulationName_Score, where PopulationName is the name of a population.

In the Select a value box, select has a value.

Click Search.

The list of documents appears in the List pane.

Sort the results in descending order of CAL score, from most positive to most negative. For information about how to add the CAL score as a column in the List pane, see Configure columns in the List pane.

Tip: On the Documents page, administrators can add unreviewed documents to a phase of a workflow, or start reviewing documents in the Map. For more information about these methods, see Create assignments: add and remove documents in a phase and Review documents in the Map pane.

Interpret the data on the Predictive Coding page

The predictive coding results that appear on the Predictive Coding page, including recall and precision achieved to date, are continually updated based on the reviewed sample. The review may be considered complete when recall is sufficient.

The ratio of positives to negatives may decline over one or more review batches. This may mean that most of the positives that can be found have been found. It may also indicate that the CAL model has found most of a certain type of responsive document; adding another random sample may help the model identify unrelated types of responsive documents. Before you consider the review complete, draw a sample from the remaining unreviewed documents. Doing so provides a reasonable estimate of the number of positive documents that have not yet been identified, which supports a defensible decision about whether the review is complete.
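A sample drawn from the remaining unreviewed documents supports a simple point estimate of how many positives have not yet been found. The numbers below are hypothetical, and the calculation is a basic projection rather than the application's exact statistical method:

```python
# Hypothetical counts at the point you are considering stopping the review.
unreviewed_remaining = 40_000   # documents not yet reviewed in the population
sample_size = 500               # random sample drawn from the unreviewed documents
sample_positives = 5            # positives human reviewers found in that sample

# Project the sample's positive rate onto all remaining unreviewed documents.
positive_rate = sample_positives / sample_size                    # 0.01
estimated_missed_positives = positive_rate * unreviewed_remaining

print(f"estimated unfound positives: {estimated_missed_positives:.0f}")  # 400
```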

The following table describes the information that is available on the Predictive Coding page.

Element

Description

Continually prioritize and train

The Active (enable Continuous Active Learning) option is selected by default. Continuous Active Learning (CAL) checks for newly reviewed documents at the specified time interval, and uses those documents' scores to refine the model.

Note: One training job runs at a time. If a previous job is already running when a job is scheduled to start, the job is postponed until the next training interval.

After initial training is complete, you can choose to disable the Continuous Active Learning feature by clearing the check box.

Last processed

The date and time that the application collected the most recent review data and used it to refine the model. If the Active (enable Continuous Active Learning) option is selected, the application also lists the date and time of the next review data collection.

Document score field

The field that contains the predicted scores for the documents.

To set security for the field, click the field name. Click Security in the navigation pane, and then select an option. For more information, see Set security for fields.

(Graph)

A visual representation of the distribution of scores for the population. The scale at the bottom of the graph displays the range of predicted document scores, from -1 to +1. Scores near -1 or +1 are stronger predictions. Scores near 0 are weaker, less certain predictions.

Training field [Pick List]

The pick list field and its associated positive and negative values that the application will use to compare with human reviewers' marks on the documents in the training set.

Population

For each training field value, human-reviewed documents in the population fall into one of the following categories:

Positive: The number of documents that a human reviewer marked with a positive code.

Negative: The number of documents that a human reviewer marked with a negative code.

Unreviewed: The number of documents that have not yet been reviewed.

Note: To open the documents on the Documents page, click a number. Documents open in descending order of CAL scores.

Sample

Select a sample: Select a sample from this box to view the following information about the sample:

For each training field value, human-reviewed documents in the selected sample fall into one of the following categories:

Positive: The number of documents that a human reviewer marked with a positive code.

Negative: The number of documents that a human reviewer marked with a negative code.

Unreviewed: The number of documents that have not yet been reviewed.

Note: To open the documents on the Documents page, click a number. Documents open in descending order of CAL scores.

Confidence level: The probability that the true value of recall falls within the estimated range. For example, a 95% confidence level means that if you drew 100 independent, random samples from the population and calculated the expected range of recall for each sample, about 95 of those 100 ranges would contain the true value of recall.

Note: Changing this percentage will impact the rest of the data in this area.

Projected positives in population: The estimated range of true positives in the whole population based on the sample.

Recall to date: An estimate of the percentage of relevant documents found so far by the reviewers, by any means. This estimate is based on the known number of relevant documents found so far, and the estimate from the sample of the total number of relevant documents in the population.

Recall worst case: A worst-case scenario for the Recall to date estimate, taking into account the potential impact of any unreviewed documents in the sample.

Precision to date: The percentage of documents reviewed that were marked positive.
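The estimates in the table follow from a few standard formulas. A hypothetical worked example (the review counts are invented, and the application's confidence-interval math is not reproduced here):

```python
# Hypothetical review state.
population_size = 100_000
reviewed = 20_000
reviewed_positives = 6_000   # positives found so far, by any means

sample_size = 1_000          # random sample from the population
sample_positives = 80        # positives human reviewers marked in the sample

# Projected positives in population: scale the sample's positive rate up.
projected_positives = (sample_positives / sample_size) * population_size   # 8,000

# Recall to date: known positives found / estimated total positives.
recall_to_date = reviewed_positives / projected_positives                  # 0.75

# Precision to date: percentage of reviewed documents marked positive.
precision_to_date = reviewed_positives / reviewed                          # 0.30

print(f"projected={projected_positives:.0f} recall={recall_to_date:.2%} "
      f"precision={precision_to_date:.2%}")
```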

Review conflicts: Find false positives and false negatives

The review conflicts feature allows you to review documents with a predicted score that differs significantly from the human reviewer's mark. These documents are called conflict documents. For example, if the predicted score of a document is strongly negative, but the human reviewer marked the document as positive, it is considered a strong conflict document.

Conflict documents include documents that are false negatives and false positives. A false negative is a document that the model predicted with a negative code, but that the human reviewer marked with a positive code. A false positive is a document that the model predicted with a positive code, but that the human reviewer marked with a negative code.

By reviewing conflict documents, you can confirm the human reviewers' marks, and then allow the model to retrain itself and improve its predictions.
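In terms of scores and marks, the conflict check compares each coded document's predicted score against the reviewer's mark. A minimal sketch with hypothetical documents and slider thresholds:

```python
# Hypothetical coded documents: CAL score in [-1, +1] plus the human reviewer's mark.
docs = [
    {"id": "DOC-101", "score": -0.70, "mark": "positive"},  # strong conflict
    {"id": "DOC-102", "score": 0.65,  "mark": "negative"},  # strong conflict
    {"id": "DOC-103", "score": 0.20,  "mark": "positive"},  # agreement
    {"id": "DOC-104", "score": -0.10, "mark": "negative"},  # agreement
]

false_negatives_below = -0.5   # "False negatives below" slider value
false_positives_above = 0.5    # "False positives above" slider value

# False negative: predicted negative (low score), but marked positive by a human.
false_negatives = [d["id"] for d in docs
                   if d["mark"] == "positive" and d["score"] < false_negatives_below]
# False positive: predicted positive (high score), but marked negative by a human.
false_positives = [d["id"] for d in docs
                   if d["mark"] == "negative" and d["score"] > false_positives_above]

print(false_negatives, false_positives)  # ['DOC-101'] ['DOC-102']
```

Tightening the sliders (moving them toward -1 and +1) surfaces only the strongest conflicts; loosening them widens the set of documents to recheck.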

To review conflicts within a population:

To access the Predictive Coding page for a population, do the following:

On the Case Home page, under Analysis, click Populations.

Click the name of a population.

In the navigation pane, click Predictive Coding.

Click Review Conflicts.

Do any of the following:

In the False negatives below area, adjust the slider to find false negative documents that are below the specified score. The number of documents to be reviewed again appears in parentheses.

In the False positives above area, adjust the slider to find false positive documents that are above the specified score. The number of documents to be reviewed again appears in parentheses.

Note: The application evaluates all coded documents in the population for false negatives and false positives.

Click OK.

The Documents page opens. You can now review the conflict documents.

The application uses the newly reviewed documents to refine the model the next time that CAL training occurs. For more information, see Create and train the model.

View and download a report for Continuous Active Learning (CAL)

Each time that CAL runs on a population, the system stores specific data points for all samples and confidence levels. You can view a report of these data points on the Predictive Coding page at any time.

Note: The report is only available if you select a sample that is associated with a population.

To view a visualization of positive coding rates per CAL run:

To access the Predictive Coding page for a population, do the following:

On the Case Home page, under Analysis, click Populations.

Click the name of a population.

In the navigation pane, click Predictive Coding.

Click Report. All results are based on the sample and confidence level selected on the page.

The purple line represents the positive rate of the sample selected at each run.

The red line represents the overall precision of the population at each run.

The green line represents the precision per CAL run. This value is based on the number of coding changes to positive documents since the last run, divided by the total number of coding changes to coded documents since the last run.

To download a report in .csv format with the data from each CAL run, click Download report at the top right of the visualization page. The .csv file includes all of the data depicted in the graph as well as recall rates for each CAL run and worst-case recall rates (if the sample selected is not 100% coded).