Big Model Testing

This space outlines the proposed testing methodology for validating future Big Models.

The test should address 2 scenarios:

Mutually exclusive classes: Models tested against each other should have little to no semantic overlap between them. This should be an extremely easy measurement to pass.
ex:

Source code, W-2, Resume, Contract, Obituary.

Semantically similar classes: Models tested against each other should have considerable overlap. This is a much harder measurement of accuracy. The exact accuracy value depends on the testing set and on how difficult it is for even a human to tell the documents apart.
ex:

Python 2.7, Python 3.9

contract amendment, contract addendum

affidavit of marriage, affidavit of support

Steps 2 - 5 of the testing process below should be carried out for each scenario.

Testing process:

1. Topos generates German and Spanish Big Models using a corpus gathered by Topos engineers. The build process is identical across supported languages.

2. Dictioneers of the respective languages will build 10 models per language (5 Topics and 5 Skills), each chosen by reference to an English counterpart (to be chosen at a later date).

3. Each model will be validated the same way the English models are validated in-house today: positive and negative testing documents are used to determine the accuracy of the model. A model requires a minimum F1 score of 0.85 to proceed to Global Validation.

4. Each Skill (Topics are excluded in this step) must then receive a minimum F1 score of 0.85 in Global Validation, a testing process where models compete against each other to determine whether the correct model is the highest-scoring one.

Note: Global Validation is not applicable to Topic accuracy. It treats one of the top N results as the single correct answer, which does not hold for subject matter: a document can be about multiple things, meaning the top N results could all be correct.

5. If all 10 models (5 Topics, 5 Skills) pass these criteria, the accuracy of that language’s Big Model (Spanish or German) can be considered comparable to that of the English Big Model.
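
The two F1 gates in the process above can be sketched roughly as follows. This is a minimal illustration, not the in-house validation tooling: the function names, the boolean prediction format, and the per-document score dictionaries are all assumptions made for the example. Only the 0.85 threshold and the "highest-scoring model wins" rule come from the text.

```python
F1_THRESHOLD = 0.85  # minimum F1 stated in the testing process


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when the denominator is zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def individual_f1(predictions: list[bool], labels: list[bool]) -> float:
    """Per-model validation (hypothetical format).

    labels[i] is True for a positive test document and False for a negative
    one; predictions[i] is True when the model matched that document.
    """
    pairs = list(zip(predictions, labels))
    tp = sum(p and y for p, y in pairs)
    fp = sum(p and not y for p, y in pairs)
    fn = sum(not p and y for p, y in pairs)
    return f1_score(tp, fp, fn)


def global_f1(scores: list[dict[str, float]], truth: list[str]) -> dict[str, float]:
    """Global Validation sketch (hypothetical format).

    scores[i] maps each competing Skill name to its score for document i.
    The highest-scoring Skill is taken as the prediction, and F1 is then
    computed per Skill.
    """
    predicted = [max(doc, key=doc.get) for doc in scores]
    pairs = list(zip(predicted, truth))
    result = {}
    for skill in set(truth) | set(predicted):
        tp = sum(p == skill and t == skill for p, t in pairs)
        fp = sum(p == skill and t != skill for p, t in pairs)
        fn = sum(p != skill and t == skill for p, t in pairs)
        result[skill] = f1_score(tp, fp, fn)
    return result


def passes(f1: float) -> bool:
    """A model passes a gate when its F1 reaches the stated minimum."""
    return f1 >= F1_THRESHOLD
```

Under these assumptions, a Skill would move on only if `passes(individual_f1(...))` holds, and would count toward the final result only if its entry in `global_f1(...)` also passes. Topics would use the first gate only.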

Selected Models for German/Spanish:

First 5 Skills to Build:

Income Tax Return Form (Form 1040 US equivalent for target language)

Non-Disclosure Agreement

Curriculum Vitae

Articles of Incorporation

Obituary

Additional 5 Skills to Build:

Articles of Organization

Residential Lease

Job Description

Non-Compete Agreement

Tax Withheld Form (Form W-2 US equivalent for target language)

First 5 Topics to Build:

Business – Marketing & Sales

Family & Parenting – Life Insurance

Government & Politics – Climate Change

Computers & Electronics – Cybersecurity

Food & Drink – Coffee & Tea

Additional 5 Topics to Build:

Business – Financing

Business – Mergers & Acquisitions

Economics & Finance – Accounting

Government & Politics – Infrastructure

Government & Politics – Healthcare

Selected Models for UK/Aus English:

Skills

Curriculum Vitae

Obituary

Job Description

Restaurant Menu

Staff Directory

Topics

Business – Marketing & Sales

Family & Parenting – Life Insurance

Government & Politics – Climate Change

Computers & Electronics – Cybersecurity

Food & Drink – Coffee & Tea

For the UK/Aus English test, it is hard to tell whether a document is Australian or from the UK when it contains no dialect-specific words, because on paper the two are close to identical. The Skills and Topics chosen should contain enough differences (such as phone numbers or addresses in a Staff Directory) to show that the documents are not American.

Requirement for the UK/Australian dialect tests: “achieve better than 90% accuracy (coarse granularity) in the first pass and better than 80% accuracy where there is competition between documents in subsequent passes”.
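
The dialect requirement above could be checked along these lines. This is a hedged sketch: the function names and the way results are passed in are hypothetical, and what exactly counts as a "first pass" versus a "subsequent pass" is not defined in this document. "Better than" is read here as a strict inequality.

```python
FIRST_PASS_MIN = 0.90    # "better than 90% accuracy ... in the first pass"
COMPETITION_MIN = 0.80   # "better than 80% accuracy where there is competition"


def accuracy(predicted: list[str], expected: list[str]) -> float:
    """Fraction of documents whose predicted label equals the expected label."""
    if not expected or len(predicted) != len(expected):
        raise ValueError("predicted and expected must be non-empty and equal in length")
    return sum(p == e for p, e in zip(predicted, expected)) / len(expected)


def meets_dialect_requirement(first_pass_acc: float, later_pass_accs: list[float]) -> bool:
    """True when the first pass and every subsequent pass clear their thresholds.

    Thresholds are strict: exactly 90% or exactly 80% does not pass.
    """
    return first_pass_acc > FIRST_PASS_MIN and all(
        acc > COMPETITION_MIN for acc in later_pass_accs
    )
```

For example, a first pass at 92% followed by competitive passes at 85% and 81% would satisfy the requirement under this reading, while a first pass at exactly 90% would not.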