# Enhanced Evaluation Reporting (sticker-eval v0.1.4+)

This document describes the enhanced evaluation reporting features available in IDP v0.4.9+ using sticker-eval v0.1.4.
## Overview

The evaluation module now leverages sticker-eval v0.1.4's fine-grain field comparison feature (from GitHub Issue #48 and PR #51) to provide:
- Detailed nested object match information alongside aggregate scores
- Interactive controls to filter and explore evaluation results
- Field-by-field comparison details for arrays and complex objects
## Key Features

### 1. Nested Field Comparison Details

For complex attributes (nested objects, arrays), the evaluation now captures detailed field-by-field comparison information:
```json
{
  "name": "LineItems",
  "score": 0.88,  // Aggregate score
  "matched": false,
  "field_comparison_details": [
    {
      "expected_key": "LineItems[0].Description",
      "expected_value": "Service A",
      "actual_key": "LineItems[0].Description",
      "actual_value": "Service A",
      "match": true,
      "score": 1.0,
      "weighted_score": 2.0
    },
    {
      "expected_key": "LineItems[1].Description",
      "expected_value": "Service B",
      "actual_key": "LineItems[1].Description",
      "actual_value": "Service C",
      "match": false,
      "score": 0.75,
      "weighted_score": 1.5
    }
    // ... more comparisons
  ]
}
```

### 2. Interactive Markdown Reports

The markdown reports now include interactive HTML controls:
#### 🔍 Show Only Unmatched

Filter the attribute table to show only rows where matches failed, providing a compact view that highlights problematic fields.
```html
<button onclick="toggleUnmatchedOnly()">🔍 Show Only Unmatched</button>
```

#### ➕➖ Expand/Collapse All Details

Expand or collapse all nested field comparison details at once.
```html
<button onclick="expandAllDetails()">➕ Expand All Details</button>
<button onclick="collapseAllDetails()">➖ Collapse All Details</button>
```

#### 📋 Expandable Nested Details

Each attribute with nested comparisons has an expandable section:
```html
<details>
  <summary>🔍 View 6 Nested Field Comparisons</summary>
  <!-- Detailed comparison table -->
</details>
```

### 3. Aggregate Score Annotations

Aggregate scores for complex objects are clearly marked:
- **Visual indicator**: `<span class="aggregate-score">0.88</span>`
- **Text annotation**: `(aggregate)` appears next to the score
- **Color coding**: blue styling distinguishes aggregate scores from simple field scores
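How sticker-eval computes the aggregate is internal to the library; one plausible interpretation, consistent with the `score` and `weighted_score` fields shown in the nested-comparison example earlier, is a weight-normalized average. The `weight` key below is a hypothetical input, not part of the report schema:

```python
def aggregate_score(comparisons):
    """Weight-normalized average of per-field similarity scores.

    Mirrors the weighted_score = score * field_weight relationship
    visible in the report; the explicit "weight" key is assumed here.
    """
    total_weight = sum(c["weight"] for c in comparisons)
    if total_weight == 0:
        return 0.0
    return sum(c["score"] * c["weight"] for c in comparisons) / total_weight


# The two LineItems comparisons from the earlier example, each with a
# weight of 2.0 (recovered as weighted_score / score):
details = [
    {"score": 1.0, "weight": 2.0},   # Service A vs Service A
    {"score": 0.75, "weight": 2.0},  # Service B vs Service C
]
print(aggregate_score(details))  # 0.875, which displays as the 0.88 aggregate
```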
## Report Structure

### JSON Report

The JSON report (`results.json`) includes:
```json
{
  "document_id": "doc-123",
  "overall_metrics": { ... },
  "section_results": [
    {
      "section_id": "section-001",
      "document_class": "Invoice",
      "metrics": { ... },
      "attributes": [
        {
          "name": "AttributeName",
          "expected": "...",
          "actual": "...",
          "matched": true,
          "score": 0.95,
          "field_comparison_details": [  // NEW in v0.1.4
            { /* detailed comparison */ }
          ]
        }
      ]
    }
  ]
}
```

### Markdown Report

The markdown report (`report.md`) includes:
- **Interactive Controls** - filter and navigation buttons
- **Summary Section** - high-level metrics with visual indicators
- **Section Details** - per-section metrics and attributes
- **Attribute Table** - enhanced with:
  - Row classes for filtering (`matched-row`, `unmatched-row`)
  - Aggregate score annotations
  - Expandable nested details for complex fields
- **Evaluation Methods** - documentation of comparison methods
## Usage Example

```python
from idp_common.evaluation.service import EvaluationService

# Initialize service
eval_service = EvaluationService(region="us-east-1", config=config)

# Evaluate document (field_comparisons automatically enabled)
result_doc = eval_service.evaluate_document(
    actual_document=actual_doc,
    expected_document=expected_doc,
    store_results=True  # Generates both JSON and Markdown
)

# Access detailed comparisons programmatically
for section in result_doc.evaluation_result.section_results:
    for attr in section.attributes:
        if attr.field_comparison_details:
            print(f"Attribute: {attr.name}")
            print(f"Aggregate Score: {attr.score}")
            print(f"Nested Comparisons: {len(attr.field_comparison_details)}")

            for detail in attr.field_comparison_details:
                if not detail['match']:
                    print(f"  Mismatch: {detail['expected_key']}")
                    print(f"    Expected: {detail['expected_value']}")
                    print(f"    Actual: {detail['actual_value']}")
                    print(f"    Score: {detail['score']}")
```

## Viewing Interactive Reports

### GitHub

GitHub's markdown renderer supports HTML, so the interactive controls will work when viewing the report in:
- Pull requests
- Issue comments
- Repository files
### VS Code

Install a markdown extension that supports HTML:
- Markdown Preview Enhanced (recommended)
- Markdown All in One
### Web Browser

Open the `.md` file directly in a browser:

```shell
open test_evaluation_report.md
```

### Jupyter Notebooks

Use `IPython.display.Markdown`:
```python
from IPython.display import Markdown, display

with open('evaluation/report.md', 'r') as f:
    display(Markdown(f.read()))
```

## Configuration

No additional configuration is required. The enhancement activates automatically when using sticker-eval v0.1.4+.
The feature is enabled in `lib/idp_common_pkg/idp_common/evaluation/service.py`:

```python
# Compare using Stickler with field_comparisons enabled
stickler_result = expected_instance.compare_with(
    actual_instance,
    document_field_comparisons=True,  # Enables detailed comparison
)
```

## Benefits

### 1. Better Debugging
Section titled “1. Better Debugging”- Quickly identify which specific nested fields are causing mismatches
- See exact values that differ within complex objects
- Understand Hungarian matching results for arrays
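sticker-eval's array matcher is not reproduced here, but the idea behind Hungarian matching can be sketched in plain Python: pair expected and actual array items so that total similarity is maximized. This brute-force version is only feasible for small arrays (the Hungarian algorithm finds the same optimum in polynomial time), and `similarity` is a toy stand-in for the library's real scorer:

```python
from itertools import permutations


def similarity(a: str, b: str) -> float:
    """Toy similarity: fraction of aligned characters that agree."""
    if not a or not b:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))


def best_assignment(expected, actual):
    """Return (expected_index, actual_index) pairs maximizing total similarity.

    Assumes equal-length lists; brute force over all permutations, which
    the Hungarian algorithm would do in O(n^3) instead of O(n!).
    """
    best_total, best_pairs = -1.0, []
    for perm in permutations(range(len(actual))):
        pairs = list(zip(range(len(expected)), perm))
        total = sum(similarity(expected[i], actual[j]) for i, j in pairs)
        if total > best_total:
            best_total, best_pairs = total, pairs
    return best_pairs


# Items appear in a different order; the optimal pairing crosses them over:
print(best_assignment(["Service A", "Service B"], ["Service B", "Service A"]))
# [(0, 1), (1, 0)]
```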
### 2. Compact Problem View
Section titled “2. Compact Problem View”- Filter to show only unmatched rows
- Focus attention on fields requiring investigation
- Reduce cognitive load when reviewing large reports
### 3. Complete Context
Section titled “3. Complete Context”- Aggregate scores provide high-level overview
- Nested details provide granular diagnostics
- Both perspectives available in single report
### 4. Production Ready
Section titled “4. Production Ready”- JSON structure fully captures all comparison data
- Can be consumed by analytics tools
- Markdown provides human-readable interface
## Technical Details

### Data Model Changes

`AttributeEvaluationResult` now includes:
```python
@dataclass
class AttributeEvaluationResult:
    # ... existing fields ...
    field_comparison_details: Optional[List[Dict[str, Any]]] = None
```

### Field Comparison Structure

Each comparison in `field_comparison_details`:
```json
{
  "expected_key": "path.to.field",   // Dot/bracket notation
  "expected_value": "actual value",
  "actual_key": "path.to.field",
  "actual_value": "actual value",
  "match": true,                     // Boolean match result
  "score": 0.95,                     // Similarity score (0.0-1.0)
  "weighted_score": 1.9,             // score * field_weight
  "reason": "explanation"            // Human-readable reason
}
```

### Grouping Logic

Field comparisons are grouped by root field name:

- `LineItems[0].Description` → grouped under `LineItems`
- `Address.City` → grouped under `Address`
- Simple fields have no grouping (single comparison or none)
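The grouping rule above amounts to: take everything before the first `.` or `[` as the root field. A minimal stdlib illustration (the dict shape follows the field-comparison structure above; this is not the library's actual implementation):

```python
import re
from collections import defaultdict


def root_field(key: str) -> str:
    """Everything before the first '.' or '[',
    e.g. 'LineItems[0].Description' -> 'LineItems'."""
    match = re.match(r"[^.\[]+", key)
    return match.group(0) if match else key


def group_comparisons(details):
    """Bucket comparison dicts by the root of their expected_key."""
    groups = defaultdict(list)
    for detail in details:
        groups[root_field(detail["expected_key"])].append(detail)
    return dict(groups)


details = [
    {"expected_key": "LineItems[0].Description"},
    {"expected_key": "LineItems[1].Description"},
    {"expected_key": "Address.City"},
]
print(sorted(group_comparisons(details)))  # ['Address', 'LineItems']
```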
## Backward Compatibility

The enhancement is fully backward compatible:
- ✅ Existing API unchanged
- ✅ JSON reports remain consumable by old code (new field is optional)
- ✅ Markdown reports viewable in any viewer (controls degrade gracefully)
- ✅ No configuration changes required
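In practice, backward compatibility means a consumer can read both old and new JSON reports by treating the new field as optional. A defensive-reader sketch (the report shape follows the JSON structure shown earlier; the function and sample data are illustrative):

```python
def failed_attributes(report: dict) -> list[str]:
    """List failed attribute names, tolerating reports produced
    before field_comparison_details existed."""
    failures = []
    for section in report.get("section_results", []):
        for attr in section.get("attributes", []):
            if not attr.get("matched", True):
                # .get() yields None for pre-v0.1.4 reports lacking the field
                details = attr.get("field_comparison_details") or []
                failures.append(f"{attr['name']} ({len(details)} nested comparisons)")
    return failures


# An old-style report with no field_comparison_details still works:
old_report = {"section_results": [{"attributes": [{"name": "Total", "matched": False}]}]}
print(failed_attributes(old_report))  # ['Total (0 nested comparisons)']
```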
## Examples

See `test_evaluation_enhancements.py` for complete working examples demonstrating:
- Nested object comparisons
- Array item comparisons
- Aggregate score calculations
- Interactive report generation
Run the test:
```shell
python test_evaluation_enhancements.py
```

This generates `test_evaluation_report.md`, demonstrating all features.
## Future Enhancements

Potential future improvements:
- Export to CSV with nested details flattened
- Comparison history tracking across runs
- Threshold recommendations based on field mismatch patterns
- Visual diff viewer for nested structures
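As a sense of what the CSV export mentioned above might look like, each nested comparison could become one row keyed by attribute name. A sketch using the stdlib `csv` module (the column choice is an assumption, not a committed format):

```python
import csv
import io


def details_to_csv(attributes) -> str:
    """Flatten field_comparison_details into one CSV row per nested field."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["attribute", "expected_key", "expected_value",
                     "actual_value", "match", "score"])
    for attr in attributes:
        for d in attr.get("field_comparison_details") or []:
            writer.writerow([attr["name"], d["expected_key"], d["expected_value"],
                             d["actual_value"], d["match"], d["score"]])
    return buf.getvalue()


attrs = [{
    "name": "LineItems",
    "field_comparison_details": [
        {"expected_key": "LineItems[0].Description", "expected_value": "Service A",
         "actual_value": "Service A", "match": True, "score": 1.0},
    ],
}]
print(details_to_csv(attrs))
```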