Big Model Testing
This space aims to outline the proposed testing methodology for validating future Big Models.
The test should address 2 scenarios:
Mutually exclusive classes: Models tested against each other should have little to no semantic overlap between them. This should be an extremely easy measurement to pass.
ex:
Source code, W2, Resume, Contract, Obituary.
Semantically similar classes: Models tested against each other should have considerable overlap. This is a much harder measurement of accuracy. The exact accuracy value depends on the testing set and on how difficult it is for even a human to tell the documents apart.
ex:
Python 2.7, Python 3.9
contract amendment, contract addendum
affidavit of marriage, affidavit of support
Steps 2 - 5 of the testing process below should be carried out for each scenario.
Testing process:
1. Topos generates German and Spanish Big Models using a corpus gathered by Topos engineers. The build process is identical for every supported language.
2. Dictioneers of the respective languages will build 10 models per language (5 Topics and 5 Skills), each chosen by reference to an English counterpart (to be selected at a later date).
3. Each model will be validated the same way the English models are validated in-house today: positive and negative test documents are used to determine the accuracy of the model. A model requires a minimum F1 score of 0.85 to proceed to Global Validation.
4. Each Skill (Topics are excluded from this step) must then achieve a minimum F1 score of 0.85 in Global Validation, a testing process in which models compete against each other to determine whether the correct model is the highest-ranked result.
Note: Global Validation is not applicable to Topic accuracy. It scores whether the correct answer appears in the top N results, which does not hold for subject matter: a document can be about multiple things, so the top N results could all be correct.
5. If all 10 models (5 Topics, 5 Skills) pass these criteria, the accuracy of that language’s Big Model (Spanish or German) can be considered comparable to that of the English Big Model.
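The two F1 gates in the process above can be sketched as follows. This is a minimal illustration, not the in-house tooling: the function names, the input shapes, and the reading of Global Validation as "assign each document to the highest-scoring model, then compute one-vs-rest F1 per Skill" are all assumptions made for the example.

```python
from collections import Counter

F1_THRESHOLD = 0.85  # minimum F1 score stated in the testing process


def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when the denominator is zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def in_house_f1(results):
    """F1 for one model from (is_positive_doc, model_matched) boolean pairs.

    Hypothetical input shape: one pair per positive or negative test document.
    """
    results = list(results)
    tp = sum(1 for pos, hit in results if pos and hit)
    fp = sum(1 for pos, hit in results if not pos and hit)
    fn = sum(1 for pos, hit in results if pos and not hit)
    return f1_score(tp, fp, fn)


def global_validation_f1(docs):
    """Per-Skill F1 when models compete (assumed interpretation).

    docs: iterable of (true_skill, {skill_name: score}) pairs.
    Each document is assigned to the highest-scoring model.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    skills = set()
    for true_skill, scores in docs:
        skills.add(true_skill)
        skills.update(scores)
        predicted = max(scores, key=scores.get)
        if predicted == true_skill:
            tp[true_skill] += 1
        else:
            fp[predicted] += 1   # wrong model won the competition
            fn[true_skill] += 1  # correct model was not ranked highest
    return {s: f1_score(tp[s], fp[s], fn[s]) for s in skills}


def language_passes(in_house_scores, global_skill_scores):
    """True if every model and every Skill meets the 0.85 minimum."""
    return (all(f >= F1_THRESHOLD for f in in_house_scores.values())
            and all(f >= F1_THRESHOLD for f in global_skill_scores.values()))
```

In this sketch, Topics would appear only in in_house_scores, since they are excluded from Global Validation.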
Selected Models for German/Spanish:
First 5 Skills to Build:
Income Tax Return Form (Form 1040 US equivalent for target language)
Non-Disclosure Agreement
Curriculum Vitae
Articles of Incorporation
Obituary
Additional 5 Skills to Build:
Articles of Organization
Residential Lease
Job Description
Non-Compete Agreement
Tax Withheld Form (Form W-2 US equivalent for target language)
First 5 Topics to Build:
Business – Marketing & Sales
Family & Parenting – Life Insurance
Government & Politics – Climate Change
Computers & Electronics – Cybersecurity
Food & Drink – Coffee & Tea
Additional 5 Topics to Build:
Business – Financing
Business – Mergers & Acquisitions
Economics & Finance – Accounting
Government & Politics – Infrastructure
Government & Politics – Healthcare
Selected Models for UK/Aus English:
Skills
Curriculum Vitae
Obituary
Job Description
Restaurant Menu
Staff Directory
Topics
Business – Marketing & Sales
Family & Parenting – Life Insurance
Government & Politics – Climate Change
Computers & Electronics – Cybersecurity
Food & Drink – Coffee & Tea
For the UK/Aus English test, it is hard to tell whether a document is Australian or from the UK when it contains no dialect-specific words, because on paper the two are close to identical. The Skills and Topics chosen should contain enough differences (such as phone numbers or addresses in a Staff Directory) to show that the documents are not American.
Requirements for the UK/Australian dialect tests: “achieve better than 90% accuracy (coarse granularity) in the first pass and better than 80% accuracy where there is competition between documents in subsequent passes”.
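The dialect requirement can be expressed as a simple threshold check. This is a sketch under assumptions: "better than" is read as a strict inequality, and the per-pass accuracies are supplied as an ordered list with the first pass first. The function and constant names are hypothetical.

```python
FIRST_PASS_MIN = 0.90  # "better than 90% accuracy" in the first pass
LATER_PASS_MIN = 0.80  # "better than 80% accuracy" in subsequent passes


def accuracy(correct, total):
    """Fraction of documents classified correctly in one pass."""
    if total <= 0:
        raise ValueError("total must be positive")
    return correct / total


def dialect_test_passes(pass_accuracies):
    """Check an ordered list of per-pass accuracies against both thresholds."""
    if not pass_accuracies:
        return False
    first, *later = pass_accuracies
    # Strict comparisons: the requirement says "better than", not "at least".
    return first > FIRST_PASS_MIN and all(a > LATER_PASS_MIN for a in later)
```

For example, dialect_test_passes([accuracy(93, 100), 0.85]) would pass, while a first pass of exactly 90% would not under the strict reading.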