Creating Custom Test Sets with Ground Truth

This guide walks through the end-to-end workflow for creating a custom test set with ground truth (evaluation baseline) data from scratch. Once created, the test set can be used for:

Benchmarking — Compare accuracy across different models and configurations
Cost optimization — Find the cheapest model that meets your accuracy requirements
Prompt engineering — Measure the impact of prompt and schema changes
Custom model training — Provide labeled training data for fine-tuning (see Custom Model Fine-Tuning)

Pre-deployed test sets: The accelerator ships with four ready-to-use benchmark datasets. If you just want to run tests against those, see Test Studio — Pre-Deployed Test Sets. This guide is for creating your own test set from your own documents.

Workflow Overview

┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│ 1. Configure │───▶│ 2. Discover  │───▶│ 3. Process   │───▶│ 4. Review &  │───▶│ 5. Create    │───▶│ 6. Run Test  │
│    Models    │    │    Schema    │    │    Documents │    │    Correct   │    │    Test Set  │    │    Executions│
└──────────────┘    └──────────────┘    └──────────────┘    └──────────────┘    └──────────────┘    └──────────────┘
  Use the best        Bootstrap           Process sample      Edit predictions    Save as eval        Compare models,
  model for high      document classes    docs with your      and fix errors      baseline &          prompts, and
  accuracy            from samples        configuration       in the UI editor    register set        configurations

Step 1: Configure for Maximum Accuracy

The goal of this initial run is to produce predictions that are as accurate as possible, minimizing the amount of manual editing you’ll need to do later. Use the best available model for both classification and extraction.

1. Go to Configuration in the web UI
2. Create a new configuration version (or edit an existing one)
3. Set both the classification model and extraction model to a high-accuracy model (e.g., Claude Opus)
4. Save the configuration version

Tip: You can always create a cheaper configuration later for production use. The expensive model is only used here to bootstrap high-quality ground truth.

For details on configuration management, see Configuration and Configuration Versions.

Step 2: Discover the Document Schema

If you don’t already have document classes defined for your document type, use Discovery to bootstrap the schema automatically.

1. Go to Discovery in the web UI
2. Select your high-accuracy configuration version
3. Upload a representative sample document
4. Run discovery, which analyzes the document and populates document classes and attributes

After discovery completes, verify the schema in your configuration under Document Schema. You should see the discovered document class with its attributes populated.

For details on discovery modes and options, see Discovery.

Step 3: Process Your Sample Documents

Now process a set of sample documents that will become your test set.

1. Go to Upload Documents in the web UI
2. Select your high-accuracy configuration version
3. Upload your sample documents
4. Wait for all documents to finish processing

How many documents? For illustration, a handful of documents is fine. For a meaningful benchmark test set, aim for a larger representative sample. For custom model training, you’ll need a significant number of labeled documents — see Custom Model Fine-Tuning for guidance on training data requirements.

Step 4: Review, Edit, and Save Ground Truth

This is the most important step. You’ll review each document’s predictions, correct any errors, and save the corrected version as evaluation baseline (ground truth).

Review and Edit Predictions

For each processed document:

1. Open the document from the document list
2. Click View Data to see the extracted information
3. Click Edit Data to enter edit mode
4. Review each extracted field:
   - Click on a field to highlight it in the document viewer
   - Compare the extracted value against the source document
   - Correct any errors by editing the field value directly
5. Save your changes; the system keeps a revision history of all edits

Tip: The solution generates a confidence score for each field. To save time, you can review lower-confidence fields first. For the highest-quality ground truth, though, review every field.
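
The confidence-first review order from the tip can be sketched in a few lines of Python. The field layout (`ExtractedField`), the sample values, and the 0.8 threshold are all assumptions made for illustration; they are not the accelerator's actual output format or a recommended cutoff.

```python
from typing import NamedTuple


class ExtractedField(NamedTuple):
    """Hypothetical shape of one extracted field (not the accelerator's schema)."""
    name: str
    value: str
    confidence: float  # assumed to lie in [0.0, 1.0]


def review_order(fields: list[ExtractedField]) -> list[ExtractedField]:
    """Return all fields, lowest confidence first."""
    return sorted(fields, key=lambda f: f.confidence)


def needs_attention(fields: list[ExtractedField], threshold: float = 0.8) -> list[str]:
    """Names of fields scoring below the (assumed) threshold, lowest first."""
    return [f.name for f in review_order(fields) if f.confidence < threshold]


# Made-up sample values, for illustration only.
fields = [
    ExtractedField("invoice_number", "INV-1042", 0.99),
    ExtractedField("total_amount", "1,250.00", 0.62),
    ExtractedField("due_date", "2024-03-01", 0.85),
    ExtractedField("vendor_name", "Acme Corp", 0.41),
]

ordered = [f.name for f in review_order(fields)]
flagged = needs_attention(fields)
print(ordered)  # every field, least confident first
print(flagged)  # only the fields below the threshold
```

Sorting rather than filtering keeps every field in the review queue, which matches the advice to review all fields eventually while starting with the riskiest ones.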

Save as Evaluation Baseline

Once you’re confident the predictions are correct for a document:

Click the Use as Evaluation Baseline button. The system then copies the corrected predictions to the evaluation baseline bucket.

Repeat this for every document you want to include in your test set.

For details on the editing interface, see Web UI — Edit Data. For details on the evaluation baseline concept, see Evaluation Framework.

Step 5: Create the Test Set

Now register a test set that references your documents and their ground truth.

1. Go to Test Studio → Test Sets tab
2. Click Add Test Set
3. Give the test set a name
4. Specify the input bucket path containing your processed files
5. Verify the file count matches your expectations
6. Click Add Test Set again to register the test set

For details on test set management, see Test Studio.

Step 6: Run Test Executions and Compare

With your test set created, you can now run test executions to compare different configurations.

Run a Baseline Test

1. Go to Test Studio → Test Executions tab
2. Select your test set
3. Choose the high-accuracy configuration version you used to create the ground truth
4. Run the test

This establishes your baseline. It should show near-perfect accuracy because the ground truth was derived from this same configuration's predictions; any remaining gap typically comes from the fields you corrected in Step 4.

Compare with Alternative Configurations

Create and test alternative configurations to find the best cost/accuracy balance:

1. Create a new configuration version with a cheaper model (e.g., Nova Lite)
2. Run a test execution against the same test set using the new configuration
3. Use the comparison view to analyze the results side-by-side

Analyzing Results

The comparison view shows:

Overall accuracy — How each configuration performed against the ground truth
Cost comparison — Total processing cost for each configuration
Field-level metrics — Which specific fields lost accuracy with the cheaper model
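
To make the idea of field-level metrics concrete, here is a minimal sketch that computes per-field accuracy against ground truth and lists the fields where a cheaper configuration scores lower. Exact string matching, the `{document id: {field: value}}` layout, and the sample values are simplifying assumptions; the accelerator's evaluation may compare fields differently.

```python
from collections import defaultdict

Docs = dict[str, dict[str, str]]  # hypothetical layout: doc id -> {field -> value}


def field_accuracy(predictions: Docs, ground_truth: Docs) -> dict[str, float]:
    """Exact-match accuracy per field, averaged over ground-truth documents."""
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for doc_id, expected in ground_truth.items():
        predicted = predictions.get(doc_id, {})
        for field, expected_value in expected.items():
            total[field] += 1
            if predicted.get(field) == expected_value:
                correct[field] += 1
    return {field: correct[field] / total[field] for field in total}


def regressions(baseline: dict[str, float], candidate: dict[str, float]) -> list[tuple[str, float]]:
    """Fields where the candidate scores below the baseline, largest drop first."""
    drops = [(f, baseline[f] - candidate.get(f, 0.0)) for f in baseline]
    return sorted((d for d in drops if d[1] > 0), key=lambda d: d[1], reverse=True)


# Made-up sample data, for illustration only.
ground_truth = {
    "doc1": {"total": "100.00", "date": "2024-01-05"},
    "doc2": {"total": "250.50", "date": "2024-02-10"},
}
high_accuracy_run = {doc: dict(fields) for doc, fields in ground_truth.items()}
cheaper_run = {
    "doc1": {"total": "100.00", "date": "2024-01-05"},
    "doc2": {"total": "250.50", "date": "2024-10-02"},  # date extracted incorrectly
}

baseline_acc = field_accuracy(high_accuracy_run, ground_truth)
cheaper_acc = field_accuracy(cheaper_run, ground_truth)
lost = regressions(baseline_acc, cheaper_acc)
print(lost)  # fields that lost accuracy with the cheaper run
```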

This data helps you identify:

Whether a cheaper model meets your accuracy requirements
Which fields need attention (e.g., improved prompts, better attribute descriptions)
The cost/accuracy tradeoff for your specific document type
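
The first question above, whether a cheaper model meets your accuracy requirements, reduces to a simple selection once you have one accuracy and one cost figure per configuration. The sketch below assumes you have copied those figures out of the comparison view by hand; `RunSummary`, the configuration names, and all numbers are hypothetical placeholders, not measured results.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RunSummary:
    """Hypothetical summary of one test execution."""
    config_name: str
    accuracy: float  # overall accuracy in [0.0, 1.0]
    cost_usd: float  # total processing cost for the test set


def cheapest_meeting_target(runs: list[RunSummary], min_accuracy: float) -> Optional[RunSummary]:
    """Return the lowest-cost run whose accuracy meets the target, or None."""
    eligible = [r for r in runs if r.accuracy >= min_accuracy]
    return min(eligible, key=lambda r: r.cost_usd, default=None)


# Placeholder numbers chosen only to exercise the function.
runs = [
    RunSummary("high-accuracy", accuracy=0.98, cost_usd=4.20),
    RunSummary("mid-tier", accuracy=0.95, cost_usd=1.10),
    RunSummary("low-cost", accuracy=0.88, cost_usd=0.30),
]

choice = cheapest_meeting_target(runs, min_accuracy=0.93)
print(choice.config_name if choice else "no configuration meets the target")
```

Returning `None` when nothing qualifies makes the "no cheaper model is good enough" outcome explicit, which is the signal to work on prompts or attribute descriptions instead.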

For details on evaluation metrics and reporting, see Evaluation Framework and Enhanced Reporting.

Incrementally Growing Your Test Set

You don’t have to create your entire test set in one go. As you process and review more documents over time, you can add them to an existing test set:

1. Process new documents and save their evaluation baselines (Steps 3-4 above)
2. Go to Test Studio → Test Sets tab
3. Select your existing test set and click Add Documents → From Existing Files
4. Select the Input Bucket and enter a file pattern matching your new documents. The pattern is pre-filled from the original test set; adjust it if needed
5. Optionally use the Modified after filter (e.g., “Last 24 hours” or a custom date/time) to easily find recently reviewed documents
6. Click Check Files to preview matches, then Add Documents

Files without matching baseline data are automatically excluded, so you can use a broad pattern — only documents you’ve reviewed and saved as evaluation baselines will be added. The test set’s file count is updated automatically.
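
The filtering behavior described above (pattern match, baseline required, optional modified-after cutoff) can be illustrated with a small sketch. It models the logic only: the function name, the shell-style glob semantics of `fnmatchcase`, and the in-memory file listing are assumptions, and the accelerator's actual matching rules may differ.

```python
from datetime import datetime, timezone
from fnmatch import fnmatchcase
from typing import Optional


def select_new_documents(
    input_files: dict[str, datetime],  # hypothetical listing: file key -> last modified
    baseline_keys: set[str],           # keys that have a saved evaluation baseline
    pattern: str,
    modified_after: Optional[datetime] = None,
) -> list[str]:
    """Keep files that match the pattern, have a baseline, and are recent enough."""
    selected = []
    for key, modified in sorted(input_files.items()):
        if not fnmatchcase(key, pattern):
            continue  # outside the file pattern
        if key not in baseline_keys:
            continue  # no baseline saved, so it is excluded
        if modified_after is not None and modified <= modified_after:
            continue  # older than the cutoff
        selected.append(key)
    return selected


def utc(day: int) -> datetime:
    """Helper for the made-up timestamps below (May 2024, UTC)."""
    return datetime(2024, 5, day, tzinfo=timezone.utc)


# Made-up listing, for illustration only.
input_files = {
    "invoices/a.pdf": utc(1),
    "invoices/b.pdf": utc(3),
    "invoices/c.pdf": utc(3),  # processed but never saved as a baseline
    "receipts/d.pdf": utc(3),
}
baseline_keys = {"invoices/a.pdf", "invoices/b.pdf", "receipts/d.pdf"}

all_matches = select_new_documents(input_files, baseline_keys, "invoices/*")
recent_matches = select_new_documents(input_files, baseline_keys, "invoices/*", modified_after=utc(2))
print(all_matches)
print(recent_matches)
```

Because unreviewed files drop out at the baseline check, a broad pattern such as `invoices/*` is safe to use, as the paragraph above notes.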

Next Steps

Improve accuracy: Use field-level metrics to refine your document class descriptions, attribute prompts, and few-shot examples. See IDP Configuration Best Practices and Few-Shot Examples.
Train a custom model: If your test set is large enough, use it to fine-tune a custom model. See Custom Model Fine-Tuning.
Automate with CLI/SDK: Create and run test sets programmatically. See IDP CLI and IDP SDK.
