Using Notebooks with IDP Common Library
Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. SPDX-License-Identifier: MIT-0
This guide provides detailed instructions on how to use the existing notebooks and how to create new notebooks for experimentation with the IDP Common Library.
The /notebooks/examples directory contains a complete set of modular Jupyter notebooks that demonstrate the Intelligent Document Processing (IDP) pipeline using the idp_common library. Each notebook represents a distinct step in the IDP workflow and can be run independently or sequentially.
🏗️ Architecture Overview
The modular approach breaks down the IDP pipeline into discrete, manageable steps:

Step 0: Setup → Step 1: OCR → Step 2: Classification → Step 3: Extraction → Step 4: Assessment → Step 5: Summarization → Step 6: Evaluation

Key Benefits
- Independent Execution: Each step can be run and tested independently
- Modular Configuration: Separate YAML configuration files for different components
- Data Persistence: Each step saves results for the next step to consume
- Easy Experimentation: Modify configurations without changing code
- Comprehensive Evaluation: Professional-grade evaluation with the EvaluationService
- Debugging Friendly: Isolate issues to specific processing steps
📁 Directory Structure
notebooks/examples/
├── README.md                    # This file
├── step0_setup.ipynb            # Environment setup and document initialization
├── step1_ocr.ipynb              # OCR processing using Amazon Textract
├── step2_classification.ipynb   # Document classification
├── step3_extraction.ipynb       # Structured data extraction
├── step4_assessment.ipynb       # Confidence assessment and explainability
├── step5_summarization.ipynb    # Content summarization
├── step6_evaluation.ipynb       # Final evaluation and reporting
├── config/                      # Modular configuration files
│   ├── main.yaml                # Main pipeline configuration
│   ├── classes.yaml             # Document classification definitions
│   ├── ocr.yaml                 # OCR service configuration
│   ├── classification.yaml      # Classification method configuration
│   ├── extraction.yaml          # Extraction method configuration
│   ├── assessment.yaml          # Assessment method configuration
│   ├── summarization.yaml       # Summarization method configuration
│   └── evaluation.yaml          # Evaluation method configuration
└── data/                        # Step-by-step processing results
    ├── step0_setup/             # Setup outputs
    ├── step1_ocr/               # OCR results
    ├── step2_classification/    # Classification results
    ├── step3_extraction/        # Extraction results
    ├── step4_assessment/        # Assessment results
    ├── step5_summarization/     # Summarization results
    └── step6_evaluation/        # Final evaluation results

🚀 Quick Start
Prerequisites
- AWS Credentials: Ensure your AWS credentials are configured
- Required Libraries: Install the idp_common package
- Sample Document: Place a PDF file in the project samples directory
Running the Complete Pipeline
Execute the notebooks in sequence:

# 1. Setup environment and document
jupyter notebook step0_setup.ipynb

# 2. Process OCR
jupyter notebook step1_ocr.ipynb

# 3. Classify document sections
jupyter notebook step2_classification.ipynb

# 4. Extract structured data
jupyter notebook step3_extraction.ipynb

# 5. Assess confidence and explainability
jupyter notebook step4_assessment.ipynb

# 6. Generate summaries
jupyter notebook step5_summarization.ipynb

# 7. Evaluate results and generate reports
jupyter notebook step6_evaluation.ipynb

Running Individual Steps
Each notebook can be run independently by ensuring the required input data exists:

# Each notebook loads its inputs from the previous step's data directory
previous_step_dir = Path("data/step{n-1}_{previous_step_name}")

⚙️ Configuration Management
Section titled “⚙️ Configuration Management”Modular Configuration Files
Configuration is split across multiple YAML files for better organization:

- config/main.yaml: Overall pipeline settings and AWS configuration
- config/classes.yaml: Document type definitions and attributes to extract
- config/ocr.yaml: Textract features and OCR-specific settings
- config/classification.yaml: Classification model and method configuration
- config/extraction.yaml: Extraction model and prompting configuration
- config/assessment.yaml: Assessment model and confidence thresholds
- config/summarization.yaml: Summarization models and output formats
- config/evaluation.yaml: Evaluation metrics and reporting settings
Configuration Loading
Each notebook automatically merges all configuration files:
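A plausible sketch of what such a merge does is a recursive dictionary merge in which later files override earlier ones. The deep_merge helper and the sample dicts below are illustrative (they stand in for parsed YAML, which in practice would come from yaml.safe_load); they are not the library's actual implementation.

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge override into base, returning a new dict."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# Stand-ins for parsed main.yaml and extraction.yaml contents
main_cfg = {"region": "us-east-1", "llm_method": {"temperature": 0.0}}
extraction_cfg = {
    "llm_method": {
        "model": "anthropic.claude-3-5-sonnet-20241022-v2:0",
        "temperature": 0.1,
    }
}

CONFIG = deep_merge(main_cfg, extraction_cfg)
print(CONFIG["llm_method"]["temperature"])  # 0.1 — extraction.yaml wins
```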
# Automatic configuration loading in each notebook
CONFIG = load_and_merge_configs("config/")

Experimentation with Configurations
To experiment with different settings:
- Backup Current Config: Copy the config directory
- Modify Settings: Edit the relevant YAML files
- Run Specific Steps: Execute only the affected notebooks
- Compare Results: Review outputs in the data directories
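The backup step can be scripted with the standard library alone. The helper below is illustrative (backup_config is not part of idp_common); the directory names follow the layout above.

```python
import shutil
from pathlib import Path

def backup_config(config_dir: str = "config",
                  backup_dir: str = "config_backup") -> Path:
    """Copy the config directory aside so an experiment can be rolled back."""
    src, dst = Path(config_dir), Path(backup_dir)
    if dst.exists():
        shutil.rmtree(dst)  # replace any stale backup
    shutil.copytree(src, dst)
    return dst

# After editing YAML files and re-running the affected notebooks,
# restore the original settings with:
# shutil.rmtree("config"); shutil.copytree("config_backup", "config")
```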
📊 Data Flow
Input/Output Structure
Each step follows a consistent pattern:
# Input (from previous step)
input_data_dir = Path("data/step{n-1}_{previous_name}")
document = Document.from_json((input_data_dir / "document.json").read_text())
config = json.load(open(input_data_dir / "config.json"))

# Processing
# ... step-specific processing ...

# Output (for next step)
output_data_dir = Path("data/step{n}_{current_name}")
output_data_dir.mkdir(parents=True, exist_ok=True)
(output_data_dir / "document.json").write_text(document.to_json())
json.dump(config, open(output_data_dir / "config.json", "w"))

Serialized Artifacts
Each step produces:

- document.json: Updated Document object with step results
- config.json: Complete merged configuration
- environment.json: Environment settings and metadata
- Step-specific result files: Detailed processing outputs
🔬 Detailed Step Descriptions
Step 0: Setup (step0_setup.ipynb)
- Purpose: Initialize the Document object and prepare the processing environment
- Inputs: PDF file path, configuration files
- Outputs: Document object with pages and metadata
- Key Features: Multi-page PDF support, metadata extraction
Step 1: OCR (step1_ocr.ipynb)
- Purpose: Extract text and analyze document structure using Amazon Textract
- Inputs: Document object with PDF pages
- Outputs: OCR results with text blocks, tables, and forms
- Key Features: Textract API integration, feature selection, result caching
Step 2: Classification (step2_classification.ipynb)
- Purpose: Identify document types and create logical sections
- Inputs: Document with OCR results
- Outputs: Classified sections with confidence scores
- Key Features: Multi-modal classification, few-shot prompting, custom classes
Step 3: Extraction (step3_extraction.ipynb)
- Purpose: Extract structured data from each classified section
- Inputs: Document with classified sections
- Outputs: Structured data for each section based on class definitions
- Key Features: Class-specific extraction, JSON schema validation
Step 4: Assessment (step4_assessment.ipynb)
- Purpose: Evaluate extraction confidence and provide explainability
- Inputs: Document with extraction results
- Outputs: Confidence scores and reasoning for each extracted attribute
- Key Features: Confidence assessment, hallucination detection, explainability
Step 5: Summarization (step5_summarization.ipynb)
- Purpose: Generate human-readable summaries of processing results
- Inputs: Document with assessed extractions
- Outputs: Section and document-level summaries in multiple formats
- Key Features: Multi-format output (JSON, Markdown, HTML), customizable templates
Step 6: Evaluation (step6_evaluation.ipynb)
- Purpose: Comprehensive evaluation of pipeline performance and accuracy
- Inputs: Document with complete processing results
- Outputs: Evaluation reports, accuracy metrics, performance analysis
- Key Features: EvaluationService integration, ground truth comparison, detailed reporting
🧪 Experimentation Guide
Modifying Document Classes
To add new document types or modify existing ones:

- Edit config/classes.yaml:

classes:
  new_document_type:
    description: "Description of the new document type"
    attributes:
      - name: "attribute_name"
        description: "What this attribute represents"
        type: "string"  # or "number", "date", etc.

- Run from Step 2: Classification onwards to process with the new classes
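Before re-running classification, it can help to sanity-check a new class definition. The validator below is a hypothetical helper, not part of the library; the dict mirrors the YAML shown above.

```python
# A parsed classes.yaml entry, mirroring the YAML above
classes = {
    "new_document_type": {
        "description": "Description of the new document type",
        "attributes": [
            {"name": "attribute_name",
             "description": "What this attribute represents",
             "type": "string"},
        ],
    }
}

def validate_classes(classes: dict) -> list:
    """Return a list of problems found in the class definitions."""
    problems = []
    for cls_name, cls in classes.items():
        if not cls.get("description"):
            problems.append(f"{cls_name}: missing description")
        for attr in cls.get("attributes", []):
            for field in ("name", "description", "type"):
                if not attr.get(field):
                    problems.append(f"{cls_name}: attribute missing '{field}'")
    return problems

print(validate_classes(classes))  # [] — this definition is well-formed
```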
Changing Models
To experiment with different AI models:

- Edit relevant config files:

# In config/extraction.yaml
llm_method:
  model: "anthropic.claude-3-5-sonnet-20241022-v2:0"  # Change model
  temperature: 0.1  # Adjust parameters

- Run affected steps: Only the steps that use the changed configuration
Adjusting Confidence Thresholds
To experiment with confidence thresholds:

- Edit config/assessment.yaml:

assessment:
  confidence_threshold: 0.7  # Lower threshold = more permissive

- Run Steps 4-6: Assessment, Summarization, and Evaluation
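How such a threshold gets applied can be sketched as a simple filter over assessed attributes. The result structure below is illustrative, not the library's actual assessment schema.

```python
CONFIDENCE_THRESHOLD = 0.7

# Illustrative assessment results for two extracted attributes
assessed = {
    "invoice_number": {"value": "INV-1042", "confidence": 0.93},
    "due_date": {"value": "2024-07-01", "confidence": 0.55},
}

# Attributes below the threshold get flagged for human review
needs_review = {k: v for k, v in assessed.items()
                if v["confidence"] < CONFIDENCE_THRESHOLD}
print(sorted(needs_review))  # ['due_date']
```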
Performance Optimization
- Parallel Processing: Modify extraction/assessment to process sections in parallel
- Caching: Results are automatically cached between steps
- Batch Processing: Process multiple documents by running the pipeline multiple times
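The parallel-processing idea can be sketched with Python's concurrent.futures; process_section below is a stand-in for a real per-section extraction or assessment call, not a library function.

```python
from concurrent.futures import ThreadPoolExecutor

def process_section(section_id: str) -> dict:
    # In a notebook, this would invoke the extraction/assessment service
    return {"section": section_id, "status": "done"}

sections = ["section_1", "section_2", "section_3"]

# Threads suit this workload: each task mostly waits on network I/O
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process_section, sections))

print([r["section"] for r in results])
```

pool.map preserves input order, so results line up with the original section list even when tasks finish out of order.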
🐛 Troubleshooting
Common Issues
- AWS Credentials: Ensure proper AWS configuration

  aws configure list

- Missing Dependencies: Install required packages

  pip install boto3 jupyter ipython

- Memory Issues: For large documents, consider processing sections individually
- Configuration Errors: Validate YAML syntax

  python -c "import yaml; yaml.safe_load(open('config/main.yaml'))"

Debug Mode
Enable detailed logging in any notebook:

import logging
logging.basicConfig(level=logging.DEBUG)

Data Inspection
Each step saves detailed results that can be inspected:

# Inspect intermediate results
import json

with open("data/step3_extraction/extraction_summary.json") as f:
    results = json.load(f)
    print(json.dumps(results, indent=2))

📈 Performance Monitoring
Metrics Tracked
Each step automatically tracks:
- Processing Time: Total time for the step
- Throughput: Pages per second
- Memory Usage: Peak memory consumption
- API Calls: Number of service calls made
- Error Rates: Failed operations
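One way metrics like processing time and throughput could be captured is a small timing wrapper; the helper and metrics dict below are illustrative, not the notebooks' exact instrumentation.

```python
import time

def timed_step(step_name: str, page_count: int, fn):
    """Run fn(), returning its result plus simple timing metrics."""
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    metrics = {
        "step": step_name,
        "processing_time_s": round(elapsed, 3),
        "pages_per_second": round(page_count / elapsed, 2) if elapsed else None,
    }
    return result, metrics

# Illustrative usage with a stand-in for an OCR call
result, metrics = timed_step("step1_ocr", page_count=10, fn=lambda: "ocr-results")
print(metrics["step"])  # step1_ocr
```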
Performance Analysis
The evaluation step provides comprehensive performance analysis:
- Step-by-step timing breakdown
- Bottleneck identification
- Resource utilization metrics
- Cost analysis (for AWS services)
🔒 Security and Best Practices
AWS Security
- Use IAM roles with minimal required permissions
- Enable CloudTrail for API logging
- Store sensitive data in S3 with appropriate encryption
Data Privacy
- Documents are processed in your AWS account
- No data is sent to external services (except configured AI models)
- Temporary files are cleaned up automatically
Configuration Management
- Version control your configuration files
- Use environment-specific configurations for different deployments
- Document any custom modifications
🤝 Contributing
To extend or modify the notebooks:
- Follow the Pattern: Maintain the input/output structure for compatibility
- Update Configurations: Add new configuration options to appropriate YAML files
- Document Changes: Update this README and add inline documentation
- Test Thoroughly: Verify that changes work across the entire pipeline
📚 Additional Resources
- idp_common API Reference
- Configuration Guide
- Evaluation Methods
- AWS Textract Documentation
- Amazon Bedrock Documentation
Happy Document Processing! 🚀
For questions or support, refer to the main project documentation or create an issue in the project repository.