
idp_common API Reference

The idp_common library (lib/idp_common_pkg/) is the core shared Python package that powers all document processing in the IDP Accelerator. It provides modular services for each step of the pipeline, a central Document data model, and utility functions for S3, Bedrock, DynamoDB, and more.

```bash
# Minimal installation
pip install "idp_common[core]"

# Install specific modules
pip install "idp_common[ocr]"
pip install "idp_common[classification]"
pip install "idp_common[extraction]"
pip install "idp_common[evaluation]"

# Install multi-document discovery (includes scikit-learn, scipy, numpy, strands-agents)
pip install "idp_common[multi_document_discovery]"

# Install everything
pip install "idp_common[all]"
```
```python
from idp_common import get_config, Document, Page, Section, Status

# Load configuration from DynamoDB
config = get_config()

# Create a document
document = Document(
    id="doc-123",
    input_bucket="my-bucket",
    input_key="documents/sample.pdf",
    output_bucket="output-bucket",
)
```

Document

The Document dataclass is the central data structure passed through the entire processing pipeline. Each processing step enriches it with additional data.

Key fields:

| Field | Type | Description |
| --- | --- | --- |
| `id` | `str` | Document identifier (typically the S3 key) |
| `input_bucket` / `input_key` | `str` | Source document location in S3 |
| `output_bucket` | `str` | S3 bucket for processing outputs |
| `status` | `Status` | Current processing stage (see Status enum below) |
| `num_pages` | `int` | Number of pages in the document |
| `pages` | `Dict[str, Page]` | Page data keyed by page ID |
| `sections` | `List[Section]` | Logical document sections (grouped by classification) |
| `summary_report_uri` | `str` | S3 URI to the summarization output |
| `config_version` | `str` | Configuration version used for processing |
| `metering` | `Dict` | Token usage and cost tracking data |
| `evaluation_results_uri` | `str` | S3 URI to evaluation results |
| `rule_validation_result` | `RuleValidationResult` | Rule validation output |
| `hitl_status` | `str` | Human-in-the-loop review status |
| `confidence_alert_count` | `int` | Number of low-confidence fields |
| `errors` | `List[str]` | Processing error messages |

Key methods:

| Method | Description |
| --- | --- |
| `Document.from_dict(data)` | Create from a dictionary |
| `Document.from_json(json_str)` | Create from a JSON string |
| `Document.from_s3(bucket, key)` | Create from baseline files in S3 |
| `Document.from_s3_event(event, bucket)` | Create from an S3 EventBridge event |
| `Document.load_document(event_data, bucket)` | Handle compressed or uncompressed Lambda input |
| `document.serialize_document(bucket, step)` | Prepare output with automatic compression |
| `document.to_dict()` / `document.to_json()` | Serialize to dict or JSON |
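
To picture the dict/JSON round trip, here is a minimal sketch using a simplified stand-in for the Document dataclass (only a few of the documented fields; the real class lives in `idp_common` and has more behavior):

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Dict, List

@dataclass
class MiniDocument:
    """Simplified stand-in mirroring a few documented Document fields."""
    id: str
    input_bucket: str
    input_key: str
    output_bucket: str
    num_pages: int = 0
    errors: List[str] = field(default_factory=list)

    def to_dict(self) -> Dict:
        return asdict(self)

    def to_json(self) -> str:
        return json.dumps(self.to_dict())

    @classmethod
    def from_dict(cls, data: Dict) -> "MiniDocument":
        return cls(**data)

doc = MiniDocument(id="doc-123", input_bucket="in", input_key="a.pdf", output_bucket="out")
restored = MiniDocument.from_dict(json.loads(doc.to_json()))
```

Because every field is JSON-serializable, the same object shape can pass through Lambda payloads and S3 intermediates unchanged.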

Page

Represents a single page in a document.

| Field | Type | Description |
| --- | --- | --- |
| `page_id` | `str` | Page identifier |
| `image_uri` | `str` | S3 URI to page image (JPG) |
| `raw_text_uri` | `str` | S3 URI to raw Textract JSON response |
| `parsed_text_uri` | `str` | S3 URI to parsed text (Markdown) |
| `text_confidence_uri` | `str` | S3 URI to confidence data for assessment |
| `classification` | `str` | Classified document type for this page |
| `confidence` | `float` | Classification confidence score |

Section

Represents a logical group of pages with the same document class.

| Field | Type | Description |
| --- | --- | --- |
| `section_id` | `str` | Section identifier |
| `classification` | `str` | Document type for this section |
| `page_ids` | `List[str]` | List of page IDs in this section |
| `extraction_result_uri` | `str` | S3 URI to extraction results |
| `attributes` | `Dict` | Extracted attribute values |
| `confidence_threshold_alerts` | `List[Dict]` | Low-confidence field alerts |
Status

The Status enum tracks a document's progress through the pipeline:

QUEUED → RUNNING → OCR → CLASSIFYING → EXTRACTING → ASSESSING →
HITL_IN_PROGRESS → SUMMARIZING → RULE_VALIDATION →
RULE_VALIDATION_ORCHESTRATOR → EVALUATING → COMPLETED

Additional states: POSTPROCESSING, FAILED, ABORTED
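
The stage names above map naturally onto a Python Enum. This toy version (not the library's actual definition) shows how a terminal-state check might look:

```python
from enum import Enum

class Status(str, Enum):
    """Toy stand-in listing the documented pipeline stages."""
    QUEUED = "QUEUED"
    RUNNING = "RUNNING"
    OCR = "OCR"
    CLASSIFYING = "CLASSIFYING"
    EXTRACTING = "EXTRACTING"
    ASSESSING = "ASSESSING"
    HITL_IN_PROGRESS = "HITL_IN_PROGRESS"
    SUMMARIZING = "SUMMARIZING"
    RULE_VALIDATION = "RULE_VALIDATION"
    RULE_VALIDATION_ORCHESTRATOR = "RULE_VALIDATION_ORCHESTRATOR"
    EVALUATING = "EVALUATING"
    POSTPROCESSING = "POSTPROCESSING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"
    ABORTED = "ABORTED"

# States after which no further processing happens
TERMINAL = {Status.COMPLETED, Status.FAILED, Status.ABORTED}

def is_done(status: Status) -> bool:
    return status in TERMINAL
```

Deriving from `str` keeps the members interchangeable with the plain strings stored in serialized documents.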

OCR (idp_common.ocr)

Converts PDF documents to machine-readable text using Amazon Textract.

```python
from idp_common.ocr.service import OcrService

service = OcrService(region="us-east-1", config=config)
document = service.process_document(document)
```

Features: Layout analysis, table extraction, signature detection, configurable DPI.

Classification (idp_common.classification)


Identifies document types and creates logical section boundaries.

```python
from idp_common.classification import ClassificationService

service = ClassificationService(region="us-east-1", config=config)
document = service.classify_document(document)
```

Features: Multimodal page-level classification, text-based holistic classification, BIO-like sequence segmentation, regex-based classification, DynamoDB caching, few-shot examples, section splitting strategies (disabled, page, llm_determined).

Extraction (idp_common.extraction)

Extracts structured data fields from document sections using LLMs.

```python
from idp_common.extraction.service import ExtractionService

service = ExtractionService(region="us-east-1", config=config)
document = service.process_document_section(document, section_id="1")
```

Features: Simple/group/list attribute types, multimodal extraction (text + images), few-shot examples, class-specific filtering, JSON Schema validation.

Assessment (idp_common.assessment)

Evaluates confidence of extraction results with optional bounding box localization.

```python
from idp_common.assessment.service import AssessmentService

service = AssessmentService(region="us-east-1", config=config)
document = service.process_document_section(document, section_id="1")
```

Features: Per-attribute confidence scores (0.0-1.0), confidence reasoning, optional bounding box coordinates, configurable confidence thresholds.
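
A consumer might scan sections for low-confidence fields along these lines. This sketch uses plain dicts standing in for Section objects; the alert dict keys (`attribute`, `confidence`) are illustrative assumptions, not the library's documented shape:

```python
# Plain dicts stand in for Section objects; the alert keys
# ("attribute", "confidence") are illustrative assumptions.
sections = [
    {"section_id": "1", "classification": "invoice",
     "confidence_threshold_alerts": [{"attribute": "total", "confidence": 0.42}]},
    {"section_id": "2", "classification": "w2",
     "confidence_threshold_alerts": []},
]

def low_confidence_fields(sections, threshold=0.8):
    """Collect (section_id, attribute) pairs whose confidence fell below threshold."""
    flagged = []
    for sec in sections:
        for alert in sec.get("confidence_threshold_alerts", []):
            if alert["confidence"] < threshold:
                flagged.append((sec["section_id"], alert["attribute"]))
    return flagged

flagged = low_confidence_fields(sections)
```

A routing step like this is typically what decides whether a document enters human-in-the-loop review.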

Summarization (idp_common.summarization)

Creates human-readable document summaries with citations.

```python
from idp_common.summarization.service import SummarizationService

service = SummarizationService(region="us-east-1", config=config)
document = service.process_document(document)
```

Features: Markdown formatting, page citations, structured references section.

Evaluation (idp_common.evaluation)

Compares processing results against ground truth for accuracy assessment.

```python
from idp_common.evaluation.service import EvaluationService

service = EvaluationService(region="us-east-1", config=config)
document = service.evaluate_document(actual_document, expected_document)
```

Features: Multiple comparison methods (EXACT, FUZZY, SEMANTIC, NUMERIC_EXACT, LLM), per-attribute and per-document metrics, visual evaluation reports.
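
The EXACT and FUZZY methods can be pictured with a small stdlib sketch. Here `difflib` stands in for whatever matcher the library actually uses, and the 0.8 threshold is an arbitrary assumption:

```python
from difflib import SequenceMatcher

def exact_match(actual: str, expected: str) -> bool:
    # Case- and whitespace-insensitive equality
    return actual.strip().lower() == expected.strip().lower()

def fuzzy_match(actual: str, expected: str, threshold: float = 0.8) -> bool:
    # Character-level similarity via difflib; the threshold is illustrative
    ratio = SequenceMatcher(None, actual.lower(), expected.lower()).ratio()
    return ratio >= threshold

ok_exact = exact_match("ACME Corp", "acme corp")      # equal after normalization
ok_fuzzy = fuzzy_match("Jonh Smith", "John Smith")    # tolerates the transposition
```

SEMANTIC and LLM comparisons would instead call an embedding model or an LLM judge, which is why they appear as separate methods in the list above.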

Rule Validation (idp_common.rule_validation)


Validates extracted data against business rules using a two-step LLM approach.

```python
from idp_common.rule_validation import RuleValidationService, RuleValidationOrchestratorService

# Step 1: Fact extraction per section
service = RuleValidationService(region="us-east-1", config=config)
document = service.validate_document(document)

# Step 2: Orchestrated consolidation
orchestrator = RuleValidationOrchestratorService(config=config)
document = orchestrator.consolidate_and_save(document, config=config, multiple_sections=True)
```

Features: Concurrent rule processing, page-aware chunking, customizable recommendation options, JSON and Markdown output.

BDA (idp_common.bda)

Integration with Amazon Bedrock Data Automation for end-to-end document processing.

Key classes: `BdaInvocationService` (invoke BDA projects), `BdaBlueprintService` (manage BDA blueprints and schema conversion).

Discovery (idp_common.discovery)

Automatic document class and schema discovery using LLMs. Includes single-document discovery (`ClassesDiscovery`) and multi-document collection discovery (`MultiDocumentDiscovery`).

ClassesDiscovery — Single-Document Discovery


Analyzes a single document to identify its type and generate a JSON Schema.

```python
from idp_common.discovery.classes_discovery import ClassesDiscovery

discovery = ClassesDiscovery(input_bucket="bucket", input_prefix="doc.pdf", region="us-east-1")
result = discovery.discovery_classes_with_document(
    input_bucket="bucket", input_prefix="doc.pdf", save_to_config=False
)
```

Features: JSON Schema generation, auto-detect section boundaries, page range selection, ground truth comparison.

MultiDocumentDiscovery — Multi-Document Collection Discovery


Discovers document classes from a collection of documents using an embedding-based clustering pipeline: embed → cluster → analyze → reflect. Supports both S3-based processing (for Lambda/Step Functions) and local file processing (for CLI/SDK).

Requires extra dependencies: `pip install "idp_common[multi_document_discovery]"` or `make setup` from the project root. This installs scikit-learn, scipy, numpy, strands-agents, and pypdfium2.

Minimum 2 documents per class: Clusters with fewer than 2 documents are filtered as noise. Ensure you provide at least 2 documents for each expected document type.

Supported file types: .pdf, .png, .jpg, .jpeg, .tiff, .tif, .webp

```python
from idp_common.discovery.multi_document_discovery import MultiDocumentDiscovery

discovery = MultiDocumentDiscovery(
    region="us-east-1",
    config={
        "embedding_model_id": "us.cohere.embed-v4:0",          # Bedrock embedding model
        "analysis_model_id": "us.anthropic.claude-sonnet-4-6", # Strands agent model
        "max_documents": 500,            # Safety limit
        "min_cluster_size": 2,           # Minimum docs per cluster
        "num_sample_documents": 3,       # Samples per cluster for analysis
        "max_concurrent_embeddings": 5,  # Parallel embedding calls
        "max_concurrent_clusters": 3,    # Parallel cluster analysis
        "max_sample_size": 5,            # Max images sent to agent
    },
)
```
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `region` | `str` | `"us-east-1"` | AWS region for Bedrock calls |
| `config` | `Dict` | `{}` | Discovery configuration (from `IDPConfig.discovery.multi_document`) |
| `bedrock_client` | `BedrockClient` | `None` | Optional pre-configured Bedrock client |

MultiDocumentDiscovery composes three specialized services:

| Service | Class | Purpose |
| --- | --- | --- |
| Embedding | `EmbeddingService` | Generates vector embeddings for document images via Bedrock |
| Clustering | `ClusteringService` | KMeans clustering with silhouette analysis (scikit-learn) |
| Analysis | `DiscoveryAgent` | Strands agent with Claude for cluster analysis and JSON Schema generation |

Local Pipeline

The local pipeline processes documents from the local filesystem — no AWS infrastructure required beyond Bedrock model access.

run_local_pipeline() — Main entry point for local discovery

```python
result = discovery.run_local_pipeline(
    document_dir="/path/to/documents/",              # Scan directory recursively
    # document_paths=["/path/a.pdf", "/path/b.png"], # OR explicit file list
    config_version="v1",                             # Optional: save results to DynamoDB config
    progress_callback=my_callback,                   # Optional: callable(step_name, step_data)
)

print(f"Found {result.num_clusters} clusters from {result.total_documents} documents")
print(f"Successful schemas: {result.num_successful_schemas}")
print(result.reflection_report)

for cls in result.discovered_classes:
    print(f"  {cls['classification']}: {cls['document_count']} docs")
    print(f"  Schema keys: {list(cls['json_schema']['properties'].keys())}")
```
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `document_dir` | `str` | One of `document_dir` / `document_paths` | Directory to scan recursively |
| `document_paths` | `List[str]` | One of `document_dir` / `document_paths` | Explicit list of file paths |
| `config_version` | `str` | No | Config version to save discovered classes to |
| `progress_callback` | `Callable[[str, Any], None]` | No | Progress updates callback |

Pipeline steps:

  1. List — Scan directory or validate explicit paths
  2. Embed — Render PDFs to images (pypdfium2), generate embeddings via Bedrock
  3. Cluster — KMeans with automatic K selection via silhouette analysis
  4. Analyze — Strands agent examines sample images per cluster, generates classification + JSON Schema
  5. Reflect — Agent produces a Markdown report reviewing all discovered classes
  6. Save — (Optional) Merge schemas into a DynamoDB configuration version
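
Step 3's automatic K selection can be pictured with a toy, dependency-free sketch. The silhouette computation below is standard, but a largest-gap split on 1-D points stands in for the real KMeans (which the library runs via scikit-learn):

```python
# Toy K selection via silhouette analysis. A largest-gap split on 1-D
# points stands in for KMeans purely to keep this dependency-free.

def split_by_gaps(points, k):
    """Partition sorted 1-D points into k contiguous clusters at the k-1 largest gaps."""
    pts = sorted(points)
    gaps = sorted(range(len(pts) - 1), key=lambda i: pts[i + 1] - pts[i], reverse=True)
    cuts = sorted(gaps[: k - 1])
    clusters, start = [], 0
    for c in cuts:
        clusters.append(pts[start : c + 1])
        start = c + 1
    clusters.append(pts[start:])
    return clusters

def silhouette(clusters):
    """Mean silhouette score; singleton clusters contribute 0 by convention."""
    scores = []
    for ci, cluster in enumerate(clusters):
        for p in cluster:
            if len(cluster) == 1:
                scores.append(0.0)
                continue
            # a: mean distance to own cluster; b: mean distance to nearest other cluster
            a = sum(abs(p - q) for q in cluster if q != p) / (len(cluster) - 1)
            b = min(
                sum(abs(p - q) for q in other) / len(other)
                for cj, other in enumerate(clusters) if cj != ci
            )
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [0.10, 0.15, 0.20, 4.90, 5.00, 5.10]  # two well-separated groups
best_k = max(range(2, 5), key=lambda k: silhouette(split_by_gaps(points, k)))
```

The real pipeline does the same argmax over candidate K values, just with KMeans on high-dimensional embedding vectors instead of gap splits on scalars.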

list_local_documents() — Scan for supported files

```python
paths = discovery.list_local_documents(
    document_dir="/path/to/documents/",  # Recursive scan
    max_documents=500,                   # Safety limit
)
# Returns: ["/abs/path/invoice1.pdf", "/abs/path/w2.png", ...]
```
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `document_dir` | `str` | One of `document_dir` / `document_paths` | Directory to scan recursively |
| `document_paths` | `List[str]` | One of `document_dir` / `document_paths` | Explicit file paths to validate |
| `max_documents` | `int` | No | Override safety limit (default: 500) |

generate_embeddings_local() — Generate embeddings from local files

```python
embedding_result = discovery.generate_embeddings_local(
    file_paths=paths,
    progress_callback=lambda done, total: print(f"{done}/{total}"),
)
# embedding_result.embeddings  — numpy array (N × embedding_dim)
# embedding_result.valid_keys  — file paths that succeeded
# embedding_result.failed_keys — file paths that failed
```
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `file_paths` | `List[str]` | Yes | Local file paths |
| `progress_callback` | `Callable[[int, int], None]` | No | Progress callback `(done, total)` |

S3 Pipeline

The S3 pipeline processes documents stored in Amazon S3. Designed for use with Step Functions orchestration (Lambda handlers call individual steps) or as a single high-level call.

run_full_pipeline() — End-to-end S3 pipeline

```python
result = discovery.run_full_pipeline(
    bucket="my-bucket",
    prefix="documents/batch-001/",
    config_version="v1",            # Optional: save to DynamoDB config
    progress_callback=my_callback,  # Optional
)
```
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `bucket` | `str` | Yes | S3 bucket containing documents |
| `prefix` | `str` | Yes | S3 key prefix to scan |
| `config_version` | `str` | No | Config version to save discovered classes to |
| `progress_callback` | `Callable[[str, Any], None]` | No | Progress updates callback |

Step-by-step methods (for Step Functions Map state integration):

```python
# Step 1: List documents in S3
s3_keys = discovery.list_documents(bucket="my-bucket", prefix="docs/", max_documents=500)

# Step 2: Generate embeddings
embedding_result = discovery.generate_embeddings(
    bucket="my-bucket", s3_keys=s3_keys, progress_callback=cb
)

# Step 3: Cluster
cluster_result = discovery.cluster_documents(embedding_result)

# Step 4: Load images for analysis
images = discovery._load_images_for_analysis(bucket="my-bucket", s3_keys=embedding_result.valid_keys)

# Step 5: Analyze each cluster (suitable for Step Functions Map iteration)
discovered_classes = []
for cluster_id in range(cluster_result.num_clusters):
    discovered_classes.append(discovery.analyze_cluster(cluster_id, cluster_result, images))

# Step 6: Generate reflection report
report = discovery.reflect(discovered_classes)

# Step 7: Save to config (optional)
saved = discovery.save_to_config(
    discovered_classes, config_version="v1",
    input_bucket="my-bucket", input_prefix="docs/"
)
```
| Method | Description |
| --- | --- |
| `list_documents(bucket, prefix, max_documents)` | List supported files in S3 |
| `generate_embeddings(bucket, s3_keys, progress_callback)` | Generate embeddings for S3 documents |
| `cluster_documents(embedding_result)` | Cluster documents based on embeddings |
| `analyze_cluster(cluster_id, cluster_result, images)` | Analyze a single cluster (returns `DiscoveredClass`) |
| `reflect(discovered_classes)` | Generate Markdown reflection report |
| `save_to_config(discovered_classes, config_version, input_bucket, input_prefix)` | Save to DynamoDB config |

MultiDocDiscoveryResult (dataclass)

| Field | Type | Description |
| --- | --- | --- |
| `discovered_classes` | `List[Dict]` | List of discovered classes as serializable dicts |
| `reflection_report` | `str` | Markdown reflection report |
| `total_documents` | `int` | Total documents processed |
| `num_clusters` | `int` | Number of clusters found |
| `num_failed_embeddings` | `int` | Documents that failed embedding |
| `num_successful_schemas` | `int` | Clusters with successful schema generation |
| `num_failed_schemas` | `int` | Clusters where schema generation failed |

Each entry in discovered_classes contains:

| Key | Type | Description |
| --- | --- | --- |
| `cluster_id` | `int` | Cluster identifier |
| `classification` | `str` | Discovered document type name |
| `json_schema` | `Dict` | Generated JSON Schema for extraction |
| `document_count` | `int` | Number of documents in the cluster |
| `sample_doc_ids` | `List[str]` | Sample document identifiers (file paths or S3 keys) |
| `error` | `str \| None` | Error message if analysis failed for this cluster |
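
Downstream code might fold these entries into a class-to-schema map, skipping clusters whose analysis failed. A sketch on plain dicts (the sample data is invented for illustration):

```python
# Sample entries shaped like the table above (contents invented for illustration)
discovered_classes = [
    {"cluster_id": 0, "classification": "invoice", "document_count": 12,
     "json_schema": {"type": "object", "properties": {"total": {"type": "number"}}},
     "sample_doc_ids": ["a.pdf"], "error": None},
    {"cluster_id": 1, "classification": None, "document_count": 3,
     "json_schema": None, "sample_doc_ids": ["b.pdf"],
     "error": "schema generation failed"},
]

def schemas_by_class(entries):
    """Map classification name -> JSON Schema, dropping failed clusters."""
    return {
        e["classification"]: e["json_schema"]
        for e in entries
        if e["error"] is None and e["json_schema"] is not None
    }

schemas = schemas_by_class(discovered_classes)
```

Because the entries are plain serializable dicts, the same transformation works whether the result came from the local or the S3 pipeline.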

Both run_local_pipeline() and run_full_pipeline() accept a progress_callback(step_name, step_data) that receives updates at each pipeline stage:

| Step Name | Data | Description |
| --- | --- | --- |
| `listing_documents` | `{dir, paths}` or `{bucket, prefix}` | Starting document scan |
| `documents_found` | `{count}` | Number of documents found |
| `generating_embeddings` | `{total}` | Starting embedding generation |
| `embedding_progress` | `{done, total}` | Per-document embedding progress |
| `embeddings_complete` | Serialized `EmbeddingResult` | All embeddings done |
| `clustering` | `{num_documents}` | Starting clustering |
| `clustering_complete` | Serialized `ClusterResult` | Clustering done |
| `analyzing_clusters` | `{total}` | Starting cluster analysis |
| `cluster_analysis_progress` | `{done, total, classification}` | Per-cluster progress |
| `reflecting` | (none) | Starting reflection |
| `saving_to_config` | `{version}` | Saving to DynamoDB (if requested) |
| `pipeline_complete` | Full result dict | Pipeline finished |
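
A callback only needs the `(step_name, step_data)` signature. This sketch records coarse milestones and renders counters for the per-item progress steps (the step names come from the table above; the formatting is illustrative):

```python
def format_progress(step_name, step_data):
    """Render a progress event as a one-line string."""
    if step_name.endswith("_progress"):
        # embedding_progress / cluster_analysis_progress carry done/total counters
        return f"{step_name}: {step_data['done']}/{step_data['total']}"
    return f"[{step_name}] {step_data!r}"

events = []

def my_callback(step_name, step_data):
    events.append(format_progress(step_name, step_data))

# Simulated invocations, as the pipeline would make them:
my_callback("documents_found", {"count": 6})
my_callback("embedding_progress", {"done": 3, "total": 6})
```

In a real run you would pass `my_callback` as the `progress_callback` argument and, for example, push each line to a job-status table instead of a local list.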

The multi-document discovery pipeline is also accessible through higher-level interfaces:

```python
# Via IDP SDK
from idp_sdk import IDPClient

client = IDPClient()
result = client.discovery.run_multi_doc(
    document_dir="/path/to/documents/",
    progress_callback=my_callback,
)
```

```bash
# Via IDP CLI
idp-cli discover-multidoc --dir /path/to/documents/
```

See IDP SDK Reference and IDP CLI Reference for full details.

Schema (idp_common.schema)

Dynamic Pydantic v2 model generation from JSON Schema definitions.

```python
from idp_common.schema import create_pydantic_model_from_json_schema

schema_dict = {  # any JSON Schema, e.g. hand-written or produced by discovery
    "type": "object",
    "properties": {"invoice_number": {"type": "string"}, "total": {"type": "number"}},
}
Model = create_pydantic_model_from_json_schema(schema=schema_dict, class_label="Invoice")
validated = Model(**extracted_data)
```

Configuration management with system defaults and user overrides.

```python
from idp_common import get_config, IDPConfig

config = get_config()  # Load from DynamoDB or system defaults
```

Key functions: `get_config()`, `load_system_defaults(pattern)`, `merge_config_with_defaults()`, `create_config_template()`.
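
Merging user overrides onto system defaults is typically a recursive dict merge. This sketch shows the general technique (it is not the library's `merge_config_with_defaults` implementation; the sample keys are invented):

```python
def deep_merge(defaults, overrides):
    """Recursively overlay `overrides` onto `defaults`, returning a new dict."""
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

defaults = {"ocr": {"dpi": 150, "features": ["TABLES"]}, "region": "us-east-1"}
overrides = {"ocr": {"dpi": 300}}
config = deep_merge(defaults, overrides)
```

Note that sibling default keys survive a partial override: changing `dpi` does not discard `features`.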

Agents (idp_common.agents)

Conversational AI agent framework with specialized agents for analytics, error analysis, and code intelligence.

Key components: Agent factory/registry, Analytics agent, Error Analyzer agent, Code Intelligence agent, External MCP agent, Conversational orchestrator.

Utilities for invoking Amazon Bedrock LLMs with retry logic, prompt caching, and token tracking.

Document state persistence through the AppSync GraphQL API.

Document tracking, HITL state management, and configuration storage.

Analytics data storage for AWS Glue/Athena reporting pipelines.

S3 read/write utilities: `get_text_content()`, `get_json_content()`, `write_content()`, `find_matching_files()`.

Image resizing, format conversion, and Bedrock attachment preparation: `resize_image()`, `prepare_image()`, `prepare_bedrock_image_attachment()`.

Common helpers: `build_s3_uri()`, `parse_s3_uri()`, `merge_metering_data()`, `extract_structured_data_from_text()`.
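
As an illustration of what the URI helpers likely do (a sketch of the general pattern, not the library source):

```python
def build_s3_uri(bucket, key):
    """Join a bucket and key into an s3:// URI."""
    return f"s3://{bucket}/{key.lstrip('/')}"

def parse_s3_uri(uri):
    """Split an s3:// URI back into (bucket, key)."""
    if not uri.startswith("s3://"):
        raise ValueError(f"Not an S3 URI: {uri}")
    bucket, _, key = uri[len("s3://"):].partition("/")
    return bucket, key

uri = build_s3_uri("my-bucket", "documents/sample.pdf")
bucket, key = parse_s3_uri(uri)
```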

CloudWatch metrics publishing: `publish_metric()`, `record_duration()`.

Each module has its own detailed README with comprehensive usage examples:

| Module | Location |
| --- | --- |
| Core Models | `lib/idp_common_pkg/idp_common/README.md` |
| Classification | `lib/idp_common_pkg/idp_common/classification/README.md` |
| Extraction | `lib/idp_common_pkg/idp_common/extraction/README.md` |
| Assessment | `lib/idp_common_pkg/idp_common/assessment/README.md` |
| Rule Validation | `lib/idp_common_pkg/idp_common/rule_validation/README.md` |
| Discovery | `lib/idp_common_pkg/idp_common/discovery/README.md` |
| Agents | `lib/idp_common_pkg/idp_common/agents/README.md` |
| BDA | `lib/idp_common_pkg/idp_common/bda/README.md` |
| OCR | `lib/idp_common_pkg/idp_common/ocr/README.md` |
| Schema | `lib/idp_common_pkg/idp_common/schema/README.md` |