InsuranceLake Schema Evolution Documentation

This section describes the capabilities and considerations for managing schema evolution.

Contents

  • Overview
  • Schema Change Setting
  • Evolve Setting Data Type Change Details
  • Cleanse Layer
  • Consume Layer
  • Parquet/Hadoop
  • Spark SQL
  • Athena SQL
  • Apache Iceberg

Overview

The InsuranceLake ETL provides capabilities to detect, control, and configure schema evolution using Data Catalog, Athena, Spark, Parquet, and Iceberg.

Behavior and capabilities for allowing schema evolution vary depending on the data lake layer (Cleanse and Consume) and the table storage format.

Schema Change Setting

The allow_schema_change setting for your workflow is defined in the input-spec section of the workflow’s JSON configuration file. Details on this configuration file can be found in the File Formats and Input Specification Documentation.

  • strict: No schema change is permitted, including reordering of columns. Case changes to field names are permitted because Data Catalog is case-insensitive.
  • reorder: Only reordering of fields in the schema is permitted.
  • evolve: Reordering columns, adding columns, and changing data types are permitted; deleting columns is not permitted.
  • permissive: All schema changes are permitted, even changes that could break schema merges in some table formats.
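
For example, a minimal configuration sketch (the key spelling and surrounding layout here are assumptions; refer to the File Formats and Input Specification Documentation for the authoritative format):

    {
        "input_spec": {
            "allow_schema_change": "evolve"
        }
    }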

If the allow_schema_change setting is not specified, the ETL defaults to a value based on the deployment environment:

  • Dev: permissive
  • Test: reorder
  • Prod: strict

To change the environment-based default behavior, modify the conditional expression starting on Line 85 of the Collect-to-Cleanse AWS Glue job.

To extend or change the specific schema change detection and control behavior, modify the check_schema_change function starting on Line 88 of the glue_catalog_helpers module.
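
As a rough sketch of the kind of comparison involved (illustrative only; this is not the actual check_schema_change implementation, and all names are assumptions):

    # Illustrative sketch only; not the actual glue_catalog_helpers code
    def schema_change_allowed(old_fields, new_fields, setting):
        """old_fields/new_fields are ordered lists of (name, type) tuples;
        Data Catalog field names are case-insensitive, so compare lowercased."""
        old = [(name.lower(), dtype) for name, dtype in old_fields]
        new = [(name.lower(), dtype) for name, dtype in new_fields]
        if setting == 'permissive':
            return True                        # all changes allowed
        if setting == 'strict':
            return old == new                  # same fields, same order
        if setting == 'reorder':
            return sorted(old) == sorted(new)  # same fields, any order
        if setting == 'evolve':
            # Additions and widening type changes are allowed (type checks
            # omitted here); deleting columns is not permitted
            return {name for name, _ in old} <= {name for name, _ in new}
        raise ValueError(f'Unknown allow_schema_change setting: {setting}')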

Evolve Setting Data Type Change Details

With the evolve setting, each current data type may be changed to the following new data types:

  • string: byte, tinyint, smallint, int, bigint
  • byte: tinyint, smallint, int, bigint
  • tinyint: smallint, int, bigint
  • smallint: int, bigint
  • int: bigint
  • float: double
  • date: timestamp
  • decimal: decimal with larger precision or scale
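
Expressed as a Python lookup table (a sketch for illustration only; the decimal case requires parsing precision and scale and is omitted):

    # Widening conversions from the table above (illustrative sketch)
    ALLOWED_TYPE_CHANGES = {
        'string':   {'byte', 'tinyint', 'smallint', 'int', 'bigint'},
        'byte':     {'tinyint', 'smallint', 'int', 'bigint'},
        'tinyint':  {'smallint', 'int', 'bigint'},
        'smallint': {'int', 'bigint'},
        'int':      {'bigint'},
        'float':    {'double'},
        'date':     {'timestamp'},
    }

    def type_change_allowed(old_type, new_type):
        # decimal -> decimal with larger precision or scale requires
        # comparing the (precision, scale) pairs and is not shown here
        return new_type in ALLOWED_TYPE_CHANGES.get(old_type, set())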

Cleanse Layer

By default, the Collect-to-Cleanse AWS Glue job writes data incrementally by partition to the Cleanse layer table in the Cleanse S3 bucket. The default partition strategy uses the Collect S3 Bucket object creation year, month, and day. When re-loading data for an existing partition, the Collect-to-Cleanse job clears the specific partition from the Cleanse layer table in the Cleanse S3 bucket, and re-writes all the data for each ETL pipeline execution.
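
In Spark terms, the partition-scoped overwrite behaves like the following sketch (the DataFrame, database, and table names are hypothetical, and the actual job also manages the Data Catalog and S3 paths):

    # Assumes an existing SparkSession `spark`, an existing partitioned
    # table, and a DataFrame `cleanse_df` holding the incoming data.
    # With dynamic overwrite, only the partitions present in cleanse_df
    # are replaced; all other partitions remain untouched.
    spark.conf.set('spark.sql.sources.partitionOverwriteMode', 'dynamic')
    cleanse_df.write.mode('overwrite').insertInto('cleanse_db.policy_data')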

When data is read from the Cleanse layer table by Spark or Athena, any schema differences between partitions must be merged, and the Parquet/Hadoop schema change support described below applies.

Consume Layer

By default, the Cleanse-to-Consume AWS Glue job clears the Consume layer table in the Consume S3 bucket and re-writes all data in each ETL pipeline execution. This makes it easy to change the schema, but has the disadvantage of the table being unavailable to consumers while it is being re-written. For details on how this impacts Spark SQL and how to modify this behavior to address performance at scale, refer to the Spark SQL section of the Using SQL documentation.

Because the entire Consume layer table is re-written, all types of schema changes are supported, including changing the partition layout. If you change this behavior to load data incrementally as described in the above link, Parquet/Hadoop schema change support will apply.
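
The default full rewrite is equivalent in Spark terms to this sketch (all names are hypothetical):

    # Drop-and-rewrite: the entire table definition is replaced each run,
    # so any schema or partition-layout change is accepted
    (consume_df.write
        .mode('overwrite')
        .partitionBy('year', 'month')
        .format('parquet')
        .saveAsTable('consume_db.policy_data'))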

Parquet/Hadoop

By default, the Collect-to-Cleanse and Cleanse-to-Consume InsuranceLake AWS Glue jobs create Parquet/Hadoop data tables in S3 and the Data Catalog. Athena and Spark support many types of schema changes for a Parquet/Hadoop table.

If an unsupported change is published to the data lake, queries that span partitions with those schema differences will fail. You can work around this issue by selectively querying specific partitions, but any partition whose schema conflicts with the Data Catalog definition will always trigger the schema merge failure, because Athena and Spark always use the schema in the Data Catalog as the target for schema merges.
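
For example, pinning a query to partitions that match the Data Catalog schema avoids the merge (using the table and partition layout from the Athena error example below; assumes an existing SparkSession named spark):

    # Restrict the scan to partitions whose schema matches the catalog
    spark.sql("""
        SELECT startdate
        FROM syntheticgeneraldata.claimdata
        WHERE year = '2024' AND month = '04'
    """).show()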

Tables created by the ETL use read-by-name by default. This means that reordering columns, adding columns in any location, and removing columns are supported; however, renaming a column in place is not supported.

When columns are missing from a partition, rows from that partition will have null values in the missing columns.
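
A minimal, self-contained PySpark demonstration of this behavior (local paths are hypothetical):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Two "partitions" with different columns, written as Parquet
    spark.createDataFrame([(1, 'a')], ['id', 'col_a']) \
        .write.mode('overwrite').parquet('/tmp/schema_demo/day=1')
    spark.createDataFrame([(2, 'b')], ['id', 'col_b']) \
        .write.mode('overwrite').parquet('/tmp/schema_demo/day=2')

    # Read-by-name merge: each row shows null for the column that its
    # partition does not contain
    spark.read.option('mergeSchema', 'true').parquet('/tmp/schema_demo').show()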

Spark SQL

General information on Spark schema merging can be found in the Apache Spark Parquet Files documentation.

When data types change, Spark will attempt to coerce the partition data type to the target data type. These data type coercions sometimes generate an error and other times create null values in the data.

We recommend testing all schema changes before publishing data and using AWS Glue Data Quality rules to check for expected values and completeness in columns.

For example, suppose you have a partition with a field of decimal(10,6), and a Data Catalog table definition of decimal(9,6). Spark will attempt to coerce all values, and if some data for the field does not fit in decimal(9,6) (in other words, a value of the field has 10 significant digits), no error will be raised and the field values that do not fit will be null. In contrast, Athena will give an error when trying to merge the partitions if there are values that do not fit in the new precision and scale.

Details on Spark’s data type coercion can be found in the Apache Spark ANSI Compliance documentation.
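
The decimal example above can be reproduced with a cast. With ANSI mode disabled (the default in many Spark versions), the overflowing value becomes null instead of raising an error (assumes an existing SparkSession named spark):

    # With ANSI mode off, an overflowing decimal cast yields null
    spark.conf.set('spark.sql.ansi.enabled', 'false')
    spark.sql(
        'SELECT CAST(CAST(1234.567890 AS DECIMAL(10,6)) AS DECIMAL(9,6)) AS v'
    ).show()  # v is null: the value needs 4 integer digits, DECIMAL(9,6) allows 3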

An unsupported data type change will generate an error similar to the following example:

org.apache.hadoop.hive.serde2.io.TimestampWritable cannot be cast to org.apache.hadoop.hive.serde2.io.DateWritable

Athena SQL

Athena behaves similarly to Spark when merging partitions with different schemas. For details, refer to the AWS documentation on handling schema updates in Athena.

An unsupported data type change will generate an error similar to the following examples:

INVALID_CAST_ARGUMENT: Cannot cast DECIMAL(10, 2) '1494725.62' to DECIMAL(9, 3)
HIVE_PARTITION_SCHEMA_MISMATCH: There is a mismatch between the table and partition schemas. The types are incompatible and cannot be coerced. The column 'startdate' in table 'syntheticgeneraldata.claimdata' is declared as type 'date', but partition 'year=2024/month=05/day=10' declared column 'startdate' as type 'timestamp'.

Apache Iceberg

Apache Iceberg tables created by the ETL entity-match AWS Glue job support the following types of schema change:

  • add
  • drop
  • rename
  • update
  • reorder
  • changing the partition layout

Iceberg schema changes are independent and free of side effects, and do not require rewriting data files. For complete details, refer to the Iceberg Schema Evolution documentation.
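
For example, Iceberg's Spark SQL extensions support statements like the following sketch (table and column names are hypothetical, and the Iceberg runtime and SQL extensions must be configured on the AWS Glue job):

    # Metadata-only schema changes on an Iceberg table (assumes an
    # existing SparkSession named spark with Iceberg configured)
    spark.sql('ALTER TABLE glue_catalog.db.policy_data ADD COLUMN risk_score double')
    spark.sql('ALTER TABLE glue_catalog.db.policy_data RENAME COLUMN premium TO premium_amount')
    spark.sql('ALTER TABLE glue_catalog.db.policy_data ALTER COLUMN claim_count TYPE bigint')
    spark.sql('ALTER TABLE glue_catalog.db.policy_data DROP COLUMN obsolete_flag')
    # Partition evolution: no rewrite of existing data files
    spark.sql('ALTER TABLE glue_catalog.db.policy_data ADD PARTITION FIELD months(policy_date)')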



Copyright Amazon.com and its affiliates; all rights reserved. This file is Amazon Web Services Content and may not be duplicated or distributed without permission.
