InsuranceLake Schema Mapping Documentation

Schema mapping in InsuranceLake is configured by using a comma-delimited text file to describe the source and destination field names.

The filename of the schema mapping configuration file follows the convention of <database name>-<table name>.csv and is stored in the /etl/transformation-spec folder in the etl-scripts bucket. When using AWS CDK for deployment, the contents of the /lib/glue_scripts/lib/transformation-spec directory will be automatically deployed to this location.

Your schema mapping file should map field names to Apache Parquet- and Athena-friendly names. Specifically, this means removing or replacing the invalid characters , ; { } ( ) \n \r \t =. We also recommend removing or replacing characters that are valid but require escaping: space, period (.), colon (:), and square brackets ([ ]).
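
For illustration, a cleanup along these lines could look like the following Python sketch (clean_column_name is a hypothetical helper, not the ETL's actual code):

import re

# Invalid Parquet/Athena characters plus the valid-but-escape-requiring
# characters (space, period, colon, square brackets)
PROBLEM_CHARS = r"[ ,;{}()\n\r\t=.:\[\]]+"

def clean_column_name(name: str) -> str:
    """Lowercase the name and replace runs of problem characters with underscores."""
    cleaned = re.sub(PROBLEM_CHARS, "_", name.strip().lower())
    return cleaned.strip("_")

print(clean_column_name(" Loss Paid"))   # loss_paid
print(clean_column_name("city, state"))  # city_state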

Field names in the schema mapping file are automatically escaped in Spark using surrounding backtick (`) characters so that all field name characters are considered part of the name. See Mapping Nested Data for cases where you want to override the escape characters with your own escaping.

The schema mapping operation is accomplished using Apache Spark DataFrame’s alias operation.
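
Conceptually, the mapping amounts to a select with aliases, as in this minimal PySpark sketch (the DataFrame and column names are illustrative):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "new")], ["partyid", "party type"])

# Backticks escape the space in the source column name
mapped_df = df.select(
    col("`partyid`").alias("customerno"),
    col("`party type`").alias("partytype"),
)
mapped_df.show()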

When making schema mapping changes to an existing data table, confirm that you understand the ETL’s schema evolution capabilities.

Behavior When There is No Schema Mapping

When there is no schema mapping file, or the schema mapping file in the etl-scripts bucket for the workflow is empty, the ETL will clean all column names so that they can be saved successfully in Parquet format. The ETL will also save a recommended mapping file to the AWS Glue Temp bucket (<environment>-insurancelake-<account>-<region>-glue-temp) in the folder /etl/collect_to_cleanse, following the naming convention <database>-<table>.csv.

When this behavior occurs, you will see the following log message in the AWS Glue job Output Logs:

No mapping found, generated recommended mapping to: s3://...

This recommended mapping file can be used as a starting point to build your own mapping file and will help ensure that you have valid column names.

Basic Mapping File Layout (CSV)

All schema mapping configuration files must contain a header row. The order of columns in the mapping file does not matter, but the name of each column must match the specification below.

| Field | Type | Description |
| --- | --- | --- |
| SourceName | required | The exact name of the column (spaces and special characters included) in the source file |
| DestName | required | The desired Spark/Parquet-conforming name of the column; use Null to explicitly drop the field |
| Description | optional | Example extra field to use for describing the column or mapping; this value will be ignored, as will any other field not specifically identified in this documentation |

If commas are included in the field names, you must surround the field name with double quotes (") following the CSV standard. Similarly, if quotes appear in the field names, you must escape them by doubling them (in other words, ""). Other special characters such as spaces can simply be included in the SourceName field as they appear in the source data.

Example:

SourceName,DestName
partyid,customerno
party type,partytype
firstname,firstname
lastname,lastname
""nickname"",Null
"city, state",city_state
 Loss Paid,loss_paid
 Exp Paid,exp_paid
 Fee Paid,fee_paid
Original Claim Currency ,original_claim_currency
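
To see how the quoting rules resolve, you can parse the quoted rows with Python's standard csv module (a quick check, not part of the ETL):

import csv
import io

sample = '"""nickname""",Null\n"city, state",city_state\n'
for source_name, dest_name in csv.reader(io.StringIO(sample)):
    print(repr(source_name), dest_name)
# '"nickname"' Null
# 'city, state' city_state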

Dropping Columns

Fields can be dropped using the schema mapping configuration in two ways:

  1. Indicating Null (case insensitive) as the DestName (explicitly dropping the field).
  2. Omitting the field from the schema mapping (implicitly dropping the field).

How the ETL behaves in each scenario depends on the strict_schema_mapping setting in the transformation-spec. Refer to the following table to understand how the ETL handles each combination of dropping a field and strict schema mapping:

| Source file | Mapping file | Strict Schema Mapping | Behavior |
| --- | --- | --- | --- |
| Field exists | Field mapped normally | false | Field is included in Cleanse bucket, data pipeline succeeds |
| Field missing | Field mapped normally | false | Field is not included in Cleanse bucket, data pipeline succeeds |
| Field exists | Field mapped normally | true | Field is included in Cleanse bucket, data pipeline succeeds |
| Field missing | Field mapped normally | true | Data pipeline fails with schema mapping error |
| Field exists | Field mapped to Null | false | Field is not included in Cleanse bucket, data pipeline succeeds |
| Field missing | Field mapped to Null | false | Field is not included in Cleanse bucket, data pipeline succeeds |
| Field exists | Field mapped to Null | true | Field is not included in Cleanse bucket, data pipeline succeeds |
| Field missing | Field mapped to Null | true | Data pipeline fails with schema mapping error |
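
The table's logic can be summarized in a few lines of Python (a hypothetical sketch of the decision, not the ETL's actual code):

def map_field(field_in_source: bool, dest_name: str, strict: bool) -> str:
    """Return the outcome for a single field in the schema mapping."""
    if strict and not field_in_source:
        raise ValueError("Schema mapping error: mapped field missing from source")
    if not field_in_source or dest_name.lower() == "null":
        return "dropped"   # field is not included in the Cleanse bucket
    return "included"      # field is included in the Cleanse bucket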

Files with No Header

If a file format is configured with header: false, Apache Spark will enumerate the column names using the prefix _c followed by a number representing the column position in the schema, starting with 0.

For files with no header, the ETL will not be aware if columns change order in the source file schema. We recommend using data quality rules to check field types to catch this situation.

Example:

SourceName,DestName
_c0,line_number
_c1,description
_c2,amount
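
For reference, this is the naming you would see when reading such a file with PySpark (the bucket path is illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Spark names headerless CSV columns _c0, _c1, _c2, ...
df = spark.read.csv("s3://collect-bucket/no-header.csv", header=False)
print(df.columns)  # ['_c0', '_c1', '_c2']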

Duplicate Columns

If a file contains columns that have the same name, Apache Spark will automatically rename them to unique names by appending each duplicate column's position in the schema, starting with 0, to its field name. All occurrences, including the first, will be renamed.

For files with duplicate columns, the ETL will not be aware if those duplicate columns change location in the source file schema. We recommend using data quality rules to check field types to catch this situation.

Example:

SourceName,DestName
Claim Note,claim_note
Claim AMT1,claim_amt
Country2,country2
In Suit,in_suit
Trial Date,trial_date
Country5,country5
Claim AMT6,Null
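
For example, reading a file whose header row repeats Claim AMT (positions 1 and 6) and Country (positions 2 and 5) would yield deduplicated names along these lines (illustrative path; the exact renamed forms can vary by Spark version):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("s3://collect-bucket/claims.csv", header=True)
print(df.columns)
# ['Claim Note', 'Claim AMT1', 'Country2', 'In Suit', 'Trial Date', 'Country5', 'Claim AMT6']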

Fuzzy Matching Column Names

Fuzzy matching of column names for schema mapping is implemented in InsuranceLake using the RapidFuzz Python library.

To use fuzzy matching in your schema mapping file, add the columns Threshold and Scorer. These columns do not need to be defined for every column in the schema mapping, only for the columns that you want to map using fuzzy matching.

Fuzzy matching is always performed after direct schema mapping of fields. Only the fields from the source file that were not directly mapped are used for fuzzy matching. When a field is successfully fuzzy matched, it is removed from the available fields for subsequent fuzzy matching. This design increases the chance that you will get the matches you are expecting.

Full documentation of the RapidFuzz library is available on RapidFuzz Github Pages.

| Field | Type | Description |
| --- | --- | --- |
| SourceName | required | Text that will be used to fuzzy match columns in the source file |
| DestName | required | Desired Spark/Parquet-conforming name of the column |
| Threshold | required | Fuzzy matching score that must be met in order to constitute a match for schema mapping |
| Scorer | required | Fuzzy matching scorer algorithm to use for matching, exactly as it is defined in the RapidFuzz library (case sensitive). Options include ratio, partial_ratio, token_sort_ratio, token_set_ratio, WRatio, and QRatio. A description of scorers with examples is available in the RapidFuzz usage documentation. |

Example schema mapping file with fuzzy matching for the NewOrRenewal column:

SourceName,DestName,Threshold,Scorer
PolicyNumber,PolicyNumber
EffectiveDate,EffectiveDate
ExpirationDate,ExpirationDate
NewRenewal,NewOrRenewal,90,ratio
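
In essence, each mapping entry's SourceName is scored against the source columns left over after direct mapping, as in this RapidFuzz sketch (the source column names are illustrative; the ETL's internals may differ):

from rapidfuzz import fuzz, process

unmapped_source_columns = ["New Renewal", "Policy State"]

# score_cutoff plays the role of the Threshold column
match = process.extractOne(
    "NewRenewal",
    unmapped_source_columns,
    scorer=fuzz.ratio,
    score_cutoff=90,
)
print(match)  # ('New Renewal', ~95.2, 0): matched column, score, index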

Handling Line Breaks in Column Headers

The InsuranceLake ETL can handle mapping columns with line breaks using exact matching or fuzzy matching.

Consider a data source with the following column headers:

Date
Closed
Vehicle
Make
Vehicle
Model
Claim/Incident/
Report

Exact Matching

Using double quotes around SourceName values allows you to specify line breaks in the CSV mapping file. This method performs an exact match: your line break characters in the mapping configuration must match the source data exactly. If the data source's columns have varying line breaks, this can be a challenge; consider using fuzzy matching for complex or changing column headers.

Example exact matching to handle line breaks:

SourceName,DestName
"Date
Closed",dateclosed
"Vehicle
Make",vehiclemake
"Vehicle
Model",vehiclemodel
"Claim/Incident/
Report",claim_incident_report

Fuzzy Matching

Fuzzy matching can be used to match mixed line breaks (Unix, Windows, legacy Mac OS) with a high degree of accuracy. Use the ratio scorer with a threshold of 90 to 95, and treat each line break as a space (or other clear separator) in the SourceName.

Example fuzzy matching to handle line breaks:

SourceName,DestName,Threshold,Scorer
Date Closed,dateclosed,95,ratio
Vehicle Make,vehiclemake,95,ratio
Vehicle Model,vehiclemodel,95,ratio
Claim/Incident/Report,claim_incident_report,95,ratio
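
A quick RapidFuzz check shows why this works: a single line-break difference leaves most of the string intact, so the raw ratio score stays high (scores shown are approximate):

from rapidfuzz import fuzz

print(fuzz.ratio("Date Closed", "Date\nClosed"))                       # ~90.9
print(fuzz.ratio("Claim/Incident/Report", "Claim/Incident/\nReport"))  # ~97.7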

Fixed Width File Format

Fixed width data sources require a fixed JSON object in the input_schema section of the transformation configuration file. See the file formats documentation for more information.

| Field | Type | Description |
| --- | --- | --- |
| SourceName | required | Placeholder value that is ignored |
| DestName | required | The desired Spark/Parquet-conforming name of the column |
| Description | optional | Example extra field to use for describing the column or mapping; this value will be ignored, as will any other field not specifically identified in this documentation |
| Width | required | Width of the field in characters; positive integer only |

  • No start and end position is needed for fields, because the fields are defined in the order they appear in the file, and the width of each field is used to calculate the start position of the next field.

  • No fuzzy matching of columns is available for fixed width files, because the fixed width format does not use column headers; there is no data source field name to match.

Example:

SourceName,DestName,Width
record_type,record_type,3
filler1,null,5
claim_number,claim_number,18
client_id,client_id,4
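
To make the width arithmetic concrete, here is a small Python sketch that derives each field's start position from the widths above and slices a record (the record content is hypothetical):

# (name, width) pairs from the mapping file; None marks a filler dropped with null
fields = [("record_type", 3), (None, 5), ("claim_number", 18), ("client_id", 4)]

record = "CLM     ABC-2024-0000000019001"

parsed, start = {}, 0
for name, width in fields:
    value = record[start:start + width]
    start += width          # the next field begins where this one ends
    if name is not None:    # skip filler fields mapped to null
        parsed[name] = value.strip()

print(parsed)
# {'record_type': 'CLM', 'claim_number': 'ABC-2024-000000001', 'client_id': '9001'}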

Mapping Nested Data

The schema mapping configuration can be used to flatten nested data structures.

Fuzzy matching is available for nested data and will match on the fully expressed nested entity names (for example, contacts.role.name).

The schema mapping file cannot be used to rename or map fields in place within nested data structures. When no mapping file is present, the recommended schema mapping file will only contain the first level of the nested data structure. This behavior reflects a limitation of Spark's alias function: it can only rename columns at the first level.

See changetype documentation for a different method to rename fields in place within a nested data structure.

When referencing elements within an array structure, it is not necessary to specify the element index. Simply use dot notation to reference any fields within the array.

Example schema mapping that flattens and renames a nested data structure:

SourceName,DestName
contacts,null
`contacts`.`cc:303`.`roles`.`role`.`name`,contacts_303_main_role
`contacts`.`cc:302`,contacts_302
`contacts`.`cc:303`,contacts_303

Works with the following structure:

"contacts": {
    "cc:303": {
        "name": "Sara Smith",
        "roles": [
            {
                "role": {
                    "name": "Main Contact"
                }
            }
        ]
    },
    "cc:302": {
        "name": "John Smith",
        "roles": [
            {
                "role": {
                    "name": "Secondary Contact"
                }
            }
        ]
    }
}
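
In PySpark terms, the mapping above corresponds to a select like this minimal sketch (the source path is hypothetical; the backticks around each path segment come from the mapping file itself):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.read.json("s3://collect-bucket/contacts.json")

# cc:303 contains a colon, so the user-supplied backticks keep each
# path segment intact instead of relying on automatic escaping
flattened_df = df.select(
    col("`contacts`.`cc:303`.`roles`.`role`.`name`").alias("contacts_303_main_role"),
    col("`contacts`.`cc:302`").alias("contacts_302"),
    col("`contacts`.`cc:303`").alias("contacts_303"),
)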

Copyright Amazon.com and its affiliates; all rights reserved. This file is Amazon Web Services Content and may not be duplicated or distributed without permission.
