Creating Custom Test Sets with Ground Truth
Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. SPDX-License-Identifier: MIT-0
This guide walks through the end-to-end workflow for creating a custom test set with ground truth (evaluation baseline) data from scratch. Once created, the test set can be used for:
- Benchmarking — Compare accuracy across different models and configurations
- Cost optimization — Find the cheapest model that meets your accuracy requirements
- Prompt engineering — Measure the impact of prompt and schema changes
- Custom model training — Provide labeled training data for fine-tuning (see Custom Model Fine-Tuning)
Pre-deployed test sets: The accelerator ships with four ready-to-use benchmark datasets. If you just want to run tests against those, see Test Studio — Pre-Deployed Test Sets. This guide is for creating your own test set from your own documents.
Workflow Overview
The workflow has six steps, each feeding the next:
- Step 1, Configure Models: use the best model for high accuracy
- Step 2, Discover Schema: bootstrap document classes from samples
- Step 3, Process Documents: process sample docs with your configuration
- Step 4, Review & Correct: edit predictions and fix errors in the UI editor
- Step 5, Create Test Set: save as eval baseline and register the set
- Step 6, Run Test Executions: compare models, prompts, and configurations
Step 1: Configure for Maximum Accuracy
The goal of this initial run is to produce predictions that are as accurate as possible, minimizing the amount of manual editing you’ll need to do later. Use the best available model for both classification and extraction.
- Go to Configuration in the web UI
- Create a new configuration version (or edit an existing one)
- Set both the classification model and extraction model to a high-accuracy model (e.g., Claude Opus)
- Save the configuration version
Tip: You can always create a cheaper configuration later for production use. The expensive model is only used here to bootstrap high-quality ground truth.
For details on configuration management, see Configuration and Configuration Versions.
Step 2: Discover the Document Schema
If you don’t already have document classes defined for your document type, use Discovery to bootstrap the schema automatically.
- Go to Discovery in the web UI
- Select your high-accuracy configuration version
- Upload a representative sample document
- Run discovery — it will analyze the document and populate document classes and attributes
After discovery completes, verify the schema in your configuration under Document Schema. You should see the discovered document class with its attributes populated.
For details on discovery modes and options, see Discovery.
Step 3: Process Your Sample Documents
Now process a set of sample documents that will become your test set.
- Go to Upload Documents in the web UI
- Select your high-accuracy configuration version
- Upload your sample documents
- Wait for all documents to finish processing
How many documents? For illustration, a handful of documents is fine. For a meaningful benchmark test set, aim for a larger representative sample. For custom model training, you’ll need a significant number of labeled documents — see Custom Model Fine-Tuning for guidance on training data requirements.
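To get a feel for why sample size matters, the sketch below computes a 95% Wilson score interval for a measured accuracy. It is a general statistics illustration, not part of the accelerator: the function name and the example numbers are made up, and it assumes each evaluated item is an independent pass/fail trial.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Return the Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    # Same observed accuracy (90%) on test sets of different sizes (illustrative numbers).
    for total in (10, 100, 1000):
        low, high = wilson_interval(round(0.9 * total), total)
        print(f"n={total:4d}: observed 0.90, plausible range {low:.2f} to {high:.2f}")
```

The interval narrows as the test set grows, which is why a handful of documents can demonstrate the workflow but is unlikely to separate two configurations whose accuracy differs by only a few points.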
Step 4: Review, Edit, and Save Ground Truth
This is the most important step. You’ll review each document’s predictions, correct any errors, and save the corrected version as the evaluation baseline (ground truth).
Review and Edit Predictions
For each processed document:
- Open the document from the document list
- Click View Data to see the extracted information
- Click Edit Data to enter edit mode
- Review each extracted field:
- Click on a field to highlight it in the document viewer
- Compare the extracted value against the source document
- Correct any errors by editing the field value directly
- Save your changes — the system creates a revision history of all edits
Tip: The solution generates a confidence score for each field. To save time, you could focus on reviewing lower-confidence fields first. However, for the highest quality ground truth, review all fields.
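As a rough illustration of the confidence-first review order, the sketch below splits a document's fields into a priority group and a remainder. The `FieldPrediction` structure, the field names, and the 0.8 threshold are hypothetical; the accelerator's actual data format is not shown in this guide, and confidence is assumed to lie between 0.0 and 1.0.

```python
from typing import NamedTuple


class FieldPrediction(NamedTuple):
    """Hypothetical shape for one extracted field."""
    name: str
    value: str
    confidence: float  # assumed range: 0.0 to 1.0


def split_for_review(
    fields: list[FieldPrediction], threshold: float = 0.8
) -> tuple[list[FieldPrediction], list[FieldPrediction]]:
    """Return (priority, remainder), each sorted from least to most confident.

    Nothing is dropped: high-confidence fields can still be wrong, so the
    remainder is returned for a second pass rather than discarded.
    """
    ordered = sorted(fields, key=lambda f: f.confidence)
    priority = [f for f in ordered if f.confidence < threshold]
    remainder = [f for f in ordered if f.confidence >= threshold]
    return priority, remainder


if __name__ == "__main__":
    # Made-up predictions for a single document.
    fields = [
        FieldPrediction("invoice_number", "INV-1001", 0.99),
        FieldPrediction("total", "1,250.00", 0.42),
        FieldPrediction("due_date", "2024-03-01", 0.75),
    ]
    priority, remainder = split_for_review(fields)
    print("review first:", [f.name for f in priority])
    print("review after:", [f.name for f in remainder])
```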
Save as Evaluation Baseline
Once you’re confident the predictions are correct for a document:
- Click the Use as Evaluation Baseline button
- The system copies the corrected predictions to the evaluation baseline bucket
Repeat this for every document you want to include in your test set.
For details on the editing interface, see Web UI — Edit Data. For details on the evaluation baseline concept, see Evaluation Framework.
Step 5: Create the Test Set
Now register a test set that references your documents and their ground truth.
- Go to Test Studio → Test Sets tab
- Click Add Test Set
- Give the test set a name
- Specify the input bucket path containing your processed files
- Verify the file count matches your expectations
- Click Add Test Set
For details on test set management, see Test Studio.
Step 6: Run Test Executions and Compare
With your test set created, you can now run test executions to compare different configurations.
Run a Baseline Test
- Go to Test Studio → Test Executions tab
- Select your test set
- Choose the high-accuracy configuration version you used to create the ground truth
- Run the test
This establishes your baseline. It should show near-perfect accuracy, since the ground truth was generated from this same configuration's predictions. Any gap from a perfect score mostly reflects the corrections you made in Step 4.
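A simplified way to see why the baseline run scores so high: if accuracy is the share of fields whose predicted value equals the ground truth, then only the fields you corrected by hand can count as misses. The sketch below uses strict string equality and hypothetical field names; the Evaluation Framework may compare values differently, so treat this as an illustration rather than the accelerator's scoring logic.

```python
def field_accuracy(predicted: dict[str, str], ground_truth: dict[str, str]) -> float:
    """Fraction of ground-truth fields whose predicted value matches exactly."""
    if not ground_truth:
        raise ValueError("ground truth has no fields")
    matches = sum(
        1
        for name, expected in ground_truth.items()
        if predicted.get(name, "").strip() == expected.strip()
    )
    return matches / len(ground_truth)


if __name__ == "__main__":
    # Hypothetical model output for one document.
    predicted = {
        "invoice_number": "INV-1001",
        "total": "1,250.00",
        "due_date": "2024-03-01",
    }
    # Ground truth is the same output after one manual correction.
    ground_truth = dict(predicted, total="1,280.00")
    print(f"accuracy vs. ground truth: {field_accuracy(predicted, ground_truth):.2f}")
```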
Compare with Alternative Configurations
Create and test alternative configurations to find the best cost/accuracy balance:
- Create a new configuration version with a cheaper model (e.g., Nova Lite)
- Run a test execution against the same test set using the new configuration
- Use the comparison view to analyze the results side-by-side
Analyzing Results
The comparison view shows:
- Overall accuracy — How each configuration performed against the ground truth
- Cost comparison — Total processing cost for each configuration
- Field-level metrics — Which specific fields lost accuracy with the cheaper model
This data helps you identify:
- Whether a cheaper model meets your accuracy requirements
- Which fields need attention (e.g., improved prompts, better attribute descriptions)
- The cost/accuracy tradeoff for your specific document type
For details on evaluation metrics and reporting, see Evaluation Framework and Enhanced Reporting.
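The decision described above can be sketched in a few lines: pick the cheapest run that still meets your accuracy target, then list the fields that regressed against the baseline. The `RunResult` structure, the configuration names, and all numbers are hypothetical; in practice you would read the values off the comparison view.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RunResult:
    """Hypothetical summary of one test execution."""
    config: str
    accuracy: float   # overall accuracy, 0.0 to 1.0
    cost_usd: float   # total processing cost for the run
    field_accuracy: dict[str, float] = field(default_factory=dict)


def cheapest_meeting_target(runs: list[RunResult], min_accuracy: float) -> Optional[RunResult]:
    """Return the lowest-cost run at or above the target, or None if none qualifies."""
    eligible = [r for r in runs if r.accuracy >= min_accuracy]
    return min(eligible, key=lambda r: r.cost_usd) if eligible else None


def regressed_fields(
    baseline: RunResult, candidate: RunResult, tolerance: float = 0.02
) -> list[tuple[str, float]]:
    """Fields whose accuracy dropped by more than `tolerance`, largest drop first."""
    drops = {
        name: score - candidate.field_accuracy.get(name, 0.0)
        for name, score in baseline.field_accuracy.items()
    }
    flagged = [(name, drop) for name, drop in drops.items() if drop > tolerance]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Made-up results for two configurations run against the same test set.
    baseline = RunResult("high-accuracy", 0.98, 12.00, {"total": 0.99, "due_date": 0.97})
    candidate = RunResult("low-cost", 0.93, 1.50, {"total": 0.80, "due_date": 0.96})

    choice = cheapest_meeting_target([baseline, candidate], min_accuracy=0.90)
    print("selected:", choice.config if choice else "none meets the target")
    for name, drop in regressed_fields(baseline, candidate):
        print(f"needs attention: {name} (accuracy down {drop:.2f})")
```

Fields flagged this way are natural candidates for better attribute descriptions or prompt changes before you settle on the cheaper configuration.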
Incrementally Growing Your Test Set
You don’t have to create your entire test set in one go. As you process and review more documents over time, you can add them to an existing test set:
- Process new documents and save their evaluation baselines (Steps 3-4 above)
- Go to Test Studio → Test Sets tab
- Select your existing test set and click Add Documents → From Existing Files
- Select the Input Bucket and enter a file pattern matching your new documents
- The file pattern is pre-filled from the original test set — adjust if needed
- Optionally use the Modified after filter (e.g., “Last 24 hours” or a custom date/time) to easily find recently reviewed documents
- Click Check Files to preview matches, then Add Documents
Files without matching baseline data are automatically excluded, so you can use a broad pattern — only documents you’ve reviewed and saved as evaluation baselines will be added. The test set’s file count is updated automatically.
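The selection rule described above (pattern match, baseline present, optionally modified after a cutoff) can be modeled as follows. This is only a mental model: the glob-style matching via `fnmatchcase`, the in-memory inputs, and the function name are assumptions, and the pattern syntax and timestamp used by Test Studio may differ.

```python
from datetime import datetime, timedelta, timezone
from fnmatch import fnmatchcase
from typing import Optional


def select_new_documents(
    files: dict[str, datetime],
    baselines: set[str],
    pattern: str,
    modified_after: Optional[datetime] = None,
) -> list[str]:
    """Return keys that match the pattern, have baseline data, and pass the date filter.

    files: object key -> last-modified time (hypothetical input).
    baselines: keys that have a saved evaluation baseline (hypothetical input).
    """
    selected = []
    for key, modified in sorted(files.items()):
        if not fnmatchcase(key, pattern):
            continue  # outside the file pattern
        if key not in baselines:
            continue  # never reviewed and saved, so excluded
        if modified_after is not None and modified <= modified_after:
            continue  # older than the "Modified after" cutoff
        selected.append(key)
    return selected


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    files = {
        "invoices/a.pdf": now - timedelta(days=30),
        "invoices/b.pdf": now - timedelta(hours=2),
        "invoices/c.pdf": now - timedelta(hours=1),  # processed but not yet reviewed
    }
    baselines = {"invoices/a.pdf", "invoices/b.pdf"}
    cutoff = now - timedelta(hours=24)
    print(select_new_documents(files, baselines, "invoices/*", cutoff))
```

Because the baseline check removes unreviewed files, a broad pattern such as the hypothetical `invoices/*` stays safe to use as the folder fills up.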
Next Steps
- Improve accuracy: Use field-level metrics to refine your document class descriptions, attribute prompts, and few-shot examples. See IDP Configuration Best Practices and Few-Shot Examples.
- Train a custom model: If your test set is large enough, use it to fine-tune a custom model. See Custom Model Fine-Tuning.
- Automate with CLI/SDK: Create and run test sets programmatically. See IDP CLI and IDP SDK.