InsuranceLake Schema Mapping Documentation

Schema mapping in InsuranceLake is configured by using a comma-delimited text file to describe the source and destination field names.

The filename of the schema mapping configuration file follows the convention of <database name>-<table name>.csv and is stored in the /etl/transformation-spec folder in the etl-scripts bucket. When using AWS CDK for deployment, the contents of the /lib/glue_scripts/lib/transformation-spec directory will be automatically deployed to this location.

Your schema mapping file should map field names to Apache Parquet- and Athena-friendly names. Specifically, this means removing or replacing the invalid characters , ; { } ( ) \n \r \t =. We also recommend removing or replacing characters that are valid but require escaping: space, period (.), colon (:), and square brackets ([ ]).
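
For illustration, a cleanup along these lines could look like the following Python sketch (clean_column_name is a hypothetical helper, not the ETL's actual code):

import re

# Invalid Parquet/Athena characters plus the valid-but-escape-requiring
# characters (space, period, colon, square brackets)
PROBLEM_CHARS = r"[ ,;{}()\n\r\t=.:\[\]]+"

def clean_column_name(name: str) -> str:
    """Lowercase the name and replace runs of problem characters with underscores."""
    cleaned = re.sub(PROBLEM_CHARS, "_", name.strip().lower())
    return cleaned.strip("_")

print(clean_column_name(" Loss Paid"))   # loss_paid
print(clean_column_name("city, state"))  # city_state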

Field names in the schema mapping file are automatically escaped in Spark using surrounding backtick (`) characters so that all field name characters are considered part of the name. See Mapping Nested Data for cases where you want to override the escape characters with your own escaping.

The schema mapping operation is accomplished using Apache Spark DataFrame’s alias operation.
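
Conceptually, the mapping amounts to a select with aliases, as in this minimal PySpark sketch (the DataFrame and column names are illustrative):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "new")], ["partyid", "party type"])

# Backticks escape the space in the source column name
mapped_df = df.select(
    col("`partyid`").alias("customerno"),
    col("`party type`").alias("partytype"),
)
mapped_df.show()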

When making schema mapping changes to an existing data table, confirm that you understand the ETL’s schema evolution capabilities.

Behavior When There is No Schema Mapping

When there is no schema mapping file, or the schema mapping file in the etl-scripts bucket for the workflow is empty, the ETL will clean all column names so that they can be saved successfully in Parquet format. The ETL will also save a recommended mapping file to the AWS Glue Temp bucket (<environment>-insurancelake-<account>-<region>-glue-temp) in the folder /etl/collect_to_cleanse, following the naming convention <database>-<table>.csv.

When this behavior occurs, you will see the following log message in the AWS Glue job Output Logs:

No mapping found, generated recommended mapping to: s3://...

This recommended mapping file can be used as a starting point to build your own mapping file and will help ensure that you have valid column names.

Basic Mapping File Layout (CSV)

All schema mapping configuration files must contain a header row. The order of columns in the mapping file does not matter, but the name of each column must match the specification below.

| Field | Type | Description |
| --- | --- | --- |
| SourceName | required | The exact name of the column (spaces and special characters included) in the source file |
| DestName | required | The desired Spark/Parquet-conforming name of the column; use Null to explicitly drop the field |
| Description | optional | Example extra field to use for describing the column or mapping; this value will be ignored, as will any other field not specifically identified in this documentation |

If commas are included in the field names, you must surround the field name with double quotes (") following the CSV standard. Similarly, if quotes appear in the field names, you must escape them by doubling them (in other words, ""). Other special characters such as spaces can simply be included in the SourceName field as they appear in the source data.

Example:

SourceName,DestName
partyid,customerno
party type,partytype
firstname,firstname
lastname,lastname
""nickname"",Null
"city, state",city_state
 Loss Paid,loss_paid
 Exp Paid,exp_paid
 Fee Paid,fee_paid
Original Claim Currency ,original_claim_currency
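
To see how the quoting rules resolve, you can parse the quoted rows with Python's standard csv module (a quick check, not part of the ETL):

import csv
import io

sample = '"""nickname""",Null\n"city, state",city_state\n'
for source_name, dest_name in csv.reader(io.StringIO(sample)):
    print(repr(source_name), dest_name)
# '"nickname"' Null
# 'city, state' city_state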

Dropping Columns

Fields can be dropped using the schema mapping configuration in two ways:

  1. Indicating Null (case insensitive) as the DestName (explicitly dropping the field).
  2. Omitting the field from the schema mapping (implicitly dropping the field).

How the ETL behaves in each scenario depends on the strict_schema_mapping setting in the transformation-spec. Refer to the following table to understand how the ETL handles each combination of dropping a field and strict schema mapping:

| Source file | Mapping file | Strict Schema Mapping | Behavior |
| --- | --- | --- | --- |
| Field exists | Field mapped normally | false | Field is included in Cleanse bucket, data pipeline succeeds |
| Field missing | Field mapped normally | false | Field is not included in Cleanse bucket, data pipeline succeeds |
| Field exists | Field mapped normally | true | Field is included in Cleanse bucket, data pipeline succeeds |
| Field missing | Field mapped normally | true | Data pipeline fails with schema mapping error |
| Field exists | Field mapped to Null | false | Field is not included in Cleanse bucket, data pipeline succeeds |
| Field missing | Field mapped to Null | false | Field is not included in Cleanse bucket, data pipeline succeeds |
| Field exists | Field mapped to Null | true | Field is not included in Cleanse bucket, data pipeline succeeds |
| Field missing | Field mapped to Null | true | Data pipeline fails with schema mapping error |
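
The table's logic can be summarized in a few lines of Python (a hypothetical sketch of the decision, not the ETL's actual code):

def map_field(field_in_source: bool, dest_name: str, strict: bool) -> str:
    """Return the outcome for a single field in the schema mapping."""
    if strict and not field_in_source:
        raise ValueError("Schema mapping error: mapped field missing from source")
    if not field_in_source or dest_name.lower() == "null":
        return "dropped"   # field is not included in the Cleanse bucket
    return "included"      # field is included in the Cleanse bucket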

Files with No Header

If a file format is configured with header: false, Apache Spark will enumerate the column names using the prefix _c followed by a number representing the column position in the schema, starting with 0.

For files with no header, the ETL will not be aware if columns change order in the source file schema. We recommend using data quality rules to check field types to catch this situation.

Example:

SourceName,DestName
_c0,line_number
_c1,description
_c2,amount
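
For reference, this is the naming you would see when reading such a file with PySpark (the bucket path is illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Spark names headerless CSV columns _c0, _c1, _c2, ...
df = spark.read.csv("s3://collect-bucket/no-header.csv", header=False)
print(df.columns)  # ['_c0', '_c1', '_c2']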

Duplicate Columns

If a file contains columns that have the same name, Apache Spark will automatically rename them to unique names by appending each duplicate column's position in the schema, starting with 0, to its field name. All occurrences, including the first, will be renamed.

For files with duplicate columns, the ETL will not be aware if those duplicate columns change location in the source file schema. We recommend using data quality rules to check field types to catch this situation.

Example:

SourceName,DestName
Claim Note,claim_note
Claim AMT1,claim_amt
Country2,country2
In Suit,in_suit
Trial Date,trial_date
Country5,country5
Claim AMT6,Null
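
For example, reading a file whose header row repeats Claim AMT (positions 1 and 6) and Country (positions 2 and 5) would yield deduplicated names along these lines (illustrative path; the exact renamed forms can vary by Spark version):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("s3://collect-bucket/claims.csv", header=True)
print(df.columns)
# ['Claim Note', 'Claim AMT1', 'Country2', 'In Suit', 'Trial Date', 'Country5', 'Claim AMT6']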

Fuzzy Matching Column Names

Fuzzy matching of column names for schema mapping is implemented in InsuranceLake using the RapidFuzz Python library.

To use fuzzy matching in your schema mapping file, add the columns Threshold and Scorer. These columns do not need to be defined for every column in the schema mapping, only for the columns that you want to map using fuzzy matching.

Fuzzy matching is always performed after direct schema mapping of fields. Only the fields from the source file that were not directly mapped are used for fuzzy matching. When a field is successfully fuzzy matched, it is removed from the available fields for subsequent fuzzy matching. This design increases the chance that you will get the matches you are expecting.

Full documentation of the RapidFuzz library is available on RapidFuzz Github Pages.

| Field | Type | Description |
| --- | --- | --- |
| SourceName | required | Text that will be used to fuzzy match columns in the source file |
| DestName | required | Desired Spark/Parquet-conforming name of the column |
| Threshold | required | Fuzzy matching score that must be met in order to constitute a match for schema mapping |
| Scorer | required | Fuzzy matching scorer algorithm to use for matching, exactly as it is defined in the RapidFuzz library (case sensitive). Options include ratio, partial_ratio, token_sort_ratio, token_set_ratio, WRatio, and QRatio. A description of scorers with examples is available in the RapidFuzz usage documentation. |

Example schema mapping file with fuzzy matching for the NewOrRenewal column:

SourceName,DestName,Threshold,Scorer
PolicyNumber,PolicyNumber
EffectiveDate,EffectiveDate
ExpirationDate,ExpirationDate
NewRenewal,NewOrRenewal,90,ratio
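
In essence, each mapping entry's SourceName is scored against the source columns left over after direct mapping, as in this RapidFuzz sketch (the source column names are illustrative; the ETL's internals may differ):

from rapidfuzz import fuzz, process

unmapped_source_columns = ["New Renewal", "Policy State"]

# score_cutoff plays the role of the Threshold column
match = process.extractOne(
    "NewRenewal",
    unmapped_source_columns,
    scorer=fuzz.ratio,
    score_cutoff=90,
)
print(match)  # ('New Renewal', ~95.2, 0): matched column, score, index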

Handling Line Breaks in Column Headers

The InsuranceLake ETL can handle mapping columns with line breaks using exact matching or fuzzy matching.

Consider a data source with the following column headers:

Date
Closed
Vehicle
Make
Vehicle
Model
Claim/Incident/
Report

Exact Matching

Using double quotes around SourceName values allows you to specify line breaks in the CSV mapping file. This method performs an exact match: your line break characters in the mapping configuration must match the source data exactly. If the data source's columns have varying line breaks, this can be a challenge; consider using fuzzy matching for complex or changing column headers.

Example exact matching to handle line breaks:

SourceName,DestName
"Date
Closed",dateclosed
"Vehicle
Make",vehiclemake
"Vehicle
Model",vehiclemodel
"Claim/Incident/
Report",claim_incident_report

Fuzzy Matching

Fuzzy matching can be used to match mixed line breaks (Unix, Windows, legacy Mac OS) with a high degree of accuracy. Use the ratio scorer with a threshold of 90 to 95, and treat each line break as a space (or other clear separator) in the SourceName.

Example fuzzy matching to handle line breaks:

SourceName,DestName,Threshold,Scorer
Date Closed,dateclosed,95,ratio
Vehicle Make,vehiclemake,95,ratio
Vehicle Model,vehiclemodel,95,ratio
Claim/Incident/Report,claim_incident_report,95,ratio
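
A quick RapidFuzz check shows why this works: a single line-break difference leaves most of the string intact, so the raw ratio score stays high (scores shown are approximate):

from rapidfuzz import fuzz

print(fuzz.ratio("Date Closed", "Date\nClosed"))                       # ~90.9
print(fuzz.ratio("Claim/Incident/Report", "Claim/Incident/\nReport"))  # ~97.7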

Fixed Width File Format

Fixed width data sources require a fixed JSON object in the input_schema section of the transformation configuration file. See the file formats documentation for more information.

| Field | Type | Description |
| --- | --- | --- |
| SourceName | required | Placeholder value that is ignored |
| DestName | required | The desired Spark/Parquet-conforming name of the column |
| Description | optional | Example extra field to use for describing the column or mapping; this value will be ignored, as will any other field not specifically identified in this documentation |
| Width | required | Width of the field in characters; positive integer only |

  • No start and end position is needed for fields, because the fields are defined in the order they appear in the file, and the width of each field is used to calculate the start position of the next field.

  • No fuzzy matching of columns is available for fixed width files, because the fixed width format does not use column headers; there is no data source field name to match.

Example:

SourceName,DestName,Width
record_type,record_type,3
filler1,null,5
claim_number,claim_number,18
client_id,client_id,4
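
To make the width arithmetic concrete, here is a small Python sketch that derives each field's start position from the widths above and slices a record (the record content is hypothetical):

# (name, width) pairs from the mapping file; None marks a filler dropped with null
fields = [("record_type", 3), (None, 5), ("claim_number", 18), ("client_id", 4)]

record = "CLM     ABC-2024-0000000019001"

parsed, start = {}, 0
for name, width in fields:
    value = record[start:start + width]
    start += width          # the next field begins where this one ends
    if name is not None:    # skip filler fields mapped to null
        parsed[name] = value.strip()

print(parsed)
# {'record_type': 'CLM', 'claim_number': 'ABC-2024-000000001', 'client_id': '9001'}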

Mapping Nested Data

The schema mapping configuration can be used to flatten nested data structures.

Fuzzy matching is available for nested data and will match on the fully expressed nested entity names (for example, contacts.role.name).

The schema mapping file cannot be used to rename or map fields in place within nested data structures. When no mapping file is present, the recommended schema mapping file will only contain the first level of the nested data structure. This behavior reflects a limitation of Spark's alias function: it can only rename columns at the first level.

See changetype documentation for a different method to rename fields in place within a nested data structure.

When referencing elements within an array structure, it is not necessary to specify the element index. Simply use dot notation to reference any fields within the array.

Example schema mapping that flattens and renames a nested data structure:

SourceName,DestName
contacts,null
`contacts`.`cc:303`.`roles`.`role`.`name`,contacts_303_main_role
`contacts`.`cc:302`,contacts_302
`contacts`.`cc:303`,contacts_303

Works with the following structure:

"contacts": {
    "cc:303": {
        "name": "Sara Smith",
        "roles": [
            {
                "role": {
                    "name": "Main Contact"
                }
            }
        ]
    },
    "cc:302": {
        "name": "John Smith",
        "roles": [
            {
                "role": {
                    "name": "Secondary Contact"
                }
            }
        ]
    }
}
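
In PySpark terms, the mapping above corresponds to a select like this minimal sketch (the source path is hypothetical; the backticks around each path segment come from the mapping file itself):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.read.json("s3://collect-bucket/contacts.json")

# cc:303 contains a colon, so the user-supplied backticks keep each
# path segment intact instead of relying on automatic escaping
flattened_df = df.select(
    col("`contacts`.`cc:303`.`roles`.`role`.`name`").alias("contacts_303_main_role"),
    col("`contacts`.`cc:302`").alias("contacts_302"),
    col("`contacts`.`cc:303`").alias("contacts_303"),
)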

Copyright Amazon.com and its affiliates; all rights reserved. This file is Amazon Web Services Content and may not be duplicated or distributed without permission.
