Slowly Changing Dimensions (SCD) Implementation in Snowflake

Spread the love

1. What are Slowly Changing Dimensions?

Slowly Changing Dimensions (SCDs) are a data warehousing concept that describes how changes in dimension attributes are managed over time.

There are three commonly used types of Slowly Changing Dimensions:

  • SCD Type 1: Overwrites existing values with new data. No history maintained.
  • SCD Type 2: Preserves full historical changes by creating records for each version
  • SCD Type 3: Maintains a limited history within the same record using additional columns

In this article, let us discuss about the implementation of all three SCD types in Snowflake.

2. SCD Type 1 Implementation in Snowflake

A Slowly Changing Dimension (SCD) Type 1 is applied when it is not required to maintain a history of a record in the dimension table. In this approach, the existing record is overwritten with the new values whenever one or more attributes change.

Let us understand the implementation of Slowly Changing Dimension (SCD) Type-1 in Snowflake.

2.1. Setting up the Stage table

A Stage table STG_EMPLOYEES is set up to store the raw incoming source data.

-- Setting up Source Data
CREATE OR REPLACE TABLE STG_EMPLOYEES(
    EMPLOYEE_ID    NUMBER(6,0),
    EMPLOYEE_NAME  VARCHAR(256),
    SALARY         NUMBER(8,2),
    DEPARTMENT_ID  NUMBER(4,0)
);

INSERT INTO STG_EMPLOYEES (EMPLOYEE_ID, EMPLOYEE_NAME, SALARY, DEPARTMENT_ID) VALUES
    (101, 'Tony Stark',       150000.00, 10),   -- Engineering
    (102, 'Bruce Banner',      98000.00, 20),   -- Research
    (103, 'Peter Parker',      65000.00, 30),   -- Photography
    (104, 'Natasha Romanoff', 120000.00, 40);   -- Security
;

SELECT * FROM STG_EMPLOYEES;
Raw Source Data before Initial Load
Raw Source Data before Initial Load

2.2. Creating the Dimension Table

The target dimension table DIM_EMPLOYEES is created, which will hold the transformed data along with a checksum to detect changes.

-- Creating SCD Type-1 Dimension table
CREATE OR REPLACE TABLE DIM_EMPLOYEES(
    EMPLOYEE_ID    NUMBER(6,0),
    EMPLOYEE_NAME  VARCHAR(256),
    SALARY         NUMBER(8,2),
    DEPARTMENT_ID  NUMBER(4,0),
    CHECKSUM       VARCHAR(256),
    CREATED_TIMESTAMP TIMESTAMP_NTZ,
    UPDATED_TIMESTAMP TIMESTAMP_NTZ
);

2.3. Generating Checksum for Change Detection

A checksum value acts as a unique identifier for each record and helps detect changes.

  • Instead of comparing multiple columns individually to identify modifications, comparing checksum values between the tables makes change detection simpler and faster.
  • The checksum value is generated by applying the MD5 function to all the refined source column values.
  • Any change in one or more column values results in a different checksum value, making it easier to identify any changes in the data.
-- Generating Cheksum for change detection
SELECT
    EMPLOYEE_ID,  
    UPPER(TRIM(EMPLOYEE_NAME)) AS EMPLOYEE_NAME,
    SALARY,
    DEPARTMENT_ID,
    MD5(CONCAT(
        EMPLOYEE_ID::STRING, '|',
        UPPER(TRIM(EMPLOYEE_NAME)), '|',
        SALARY::STRING, '|',
        DEPARTMENT_ID::STRING
    )) AS CHECKSUM
FROM STG_EMPLOYEES;
Generating Checksum for each record for change detection
Generating Checksum for each record for change detection

2.4. Implementing Merge Logic

A Merge statement is used to perform inserts and updates into the dimension table from the stage table.

-- Merge Logic for SCD Type-1 
MERGE INTO DIM_EMPLOYEES tgt
USING (
    SELECT
        EMPLOYEE_ID,  
        UPPER(TRIM(EMPLOYEE_NAME)) AS EMPLOYEE_NAME,
        SALARY,
        DEPARTMENT_ID,
        MD5(CONCAT(
            EMPLOYEE_ID::STRING, '|',
            UPPER(TRIM(EMPLOYEE_NAME)), '|',
            SALARY::STRING, '|',
            DEPARTMENT_ID::STRING
        )) AS CHECKSUM
    FROM STG_EMPLOYEES
) src
ON tgt.EMPLOYEE_ID = src.EMPLOYEE_ID

-- Update record if already existing and change is detected
WHEN MATCHED AND tgt.CHECKSUM != src.CHECKSUM THEN
    UPDATE SET
        tgt.EMPLOYEE_NAME     = src.EMPLOYEE_NAME,
        tgt.SALARY            = src.SALARY,
        tgt.DEPARTMENT_ID     = src.DEPARTMENT_ID,
        tgt.CHECKSUM          = src.CHECKSUM,
        tgt.UPDATED_TIMESTAMP = CURRENT_TIMESTAMP()

-- Insert record if no match found
WHEN NOT MATCHED THEN
    INSERT (EMPLOYEE_ID, EMPLOYEE_NAME, SALARY, DEPARTMENT_ID, CHECKSUM, CREATED_TIMESTAMP, UPDATED_TIMESTAMP)
    VALUES (src.EMPLOYEE_ID, src.EMPLOYEE_NAME, src.SALARY, src.DEPARTMENT_ID, src.CHECKSUM, CURRENT_TIMESTAMP(), CURRENT_TIMESTAMP())
;

2.5. Initial Data Load with Merge

In the initial load, as no data is present in the dimension table, all records from the stage table are inserted into the dimension table.

Dimension table after initial load
Dimension table after initial load

2.6. Demonstrating Updates (SCD Type-1 Behaviour)

To simulate changes, the stage table is reloaded with updated salary values and additional employees.

-- Updated source data to demonstarte SCD Type-1 behaviour
TRUNCATE TABLE STG_EMPLOYEES;

INSERT INTO STG_EMPLOYEES (EMPLOYEE_ID, EMPLOYEE_NAME, SALARY, DEPARTMENT_ID) VALUES
    (101, 'Tony Stark',       150000.00, 10),   -- No change
    (102, 'Bruce Banner',     100000.00, 20),   -- Salary updated
    (103, 'Peter Parker',      65000.00, 30),   -- No change
    (104, 'Natasha Romanoff', 125000.00, 40),   -- Salary updated
    (105, 'Steve Rogers',     110000.00, 50),   -- New employee
    (106, 'Thor Odinson',     130000.00, 60);   -- New employee
;

Re-running the MERGE statement updates salaries for Bruce Banner and Natasha Romanoff and inserts new employees Steve Rogers and Thor Odinson.

SELECT * FROM DIM_EMPLOYEES;
SCD Type-1 changes in the Dimension table
SCD Type-1 changes in the Dimension table

The result shows only the latest values for each employee. No historical records are retained since SCD Type-1 overwrites changes.

3. SCD Type 2 Implementation in Snowflake

A Slowly Changing Dimension Type-2 is used to maintain a complete history of a record in the target. When the value of chosen attribute(s) of the record changes, the record is made inactive and a new record is created with modified data values and is considered as active record.

Let us understand the implementation of Slowly Changing Dimension (SCD) Type-2 in Snowflake.

SCD Type-2 can also be implemented in Snowflake using Streams and Tasks to automate change capture and loading. However, in this demonstration the focus is on implementing SCD Type-2 without relying on Streams or Tasks, keeping the logic purely SQL-driven for better clarity.

3.1. Setting up the Stage table

The same stage table STG_EMPLOYEES is used as the source for loading data into the dimension table.

-- Setting up Source Data
CREATE OR REPLACE TABLE STG_EMPLOYEES(
    EMPLOYEE_ID    NUMBER(6,0),
    EMPLOYEE_NAME  VARCHAR(256),
    SALARY         NUMBER(8,2),
    DEPARTMENT_ID  NUMBER(4,0)
);

INSERT INTO STG_EMPLOYEES (EMPLOYEE_ID, EMPLOYEE_NAME, SALARY, DEPARTMENT_ID) VALUES
    (101, 'Tony Stark',       150000.00, 10),   -- Engineering
    (102, 'Bruce Banner',      98000.00, 20),   -- Research
    (103, 'Peter Parker',      65000.00, 30),   -- Photography
    (104, 'Natasha Romanoff', 120000.00, 40);   -- Security
;

SELECT * FROM STG_EMPLOYEES;
Raw Source Data before Initial Load
Raw Source Data before Initial Load

3.2. Creating the Dimension

The dimension table DIM_EMPLOYEES is created to capture historical changes. Along with the checksum for change detection, two additional columns START_DATE and END_DATE are introduced to track the validity period of each record.

START_DATE: Represents the date when the record version became active.

END_DATE: Represents the date when the record version became inactive.

  • For current active records, it is set to a high future date, commonly ‘9999-12-31’.
  • For inactive records, it is set to the current date value.
  • An active record is identified by checking where END_DATE = ‘9999-12-31’.
  • Any record with a different END_DATE is considered a historical version.

Together, START_DATE and END_DATE allow maintaining the full history of changes for a record.

-- Creating SCD Type-2 Dimension table
CREATE OR REPLACE TABLE DIM_EMPLOYEES(
    EMPLOYEE_ID    NUMBER(6,0),
    EMPLOYEE_NAME  VARCHAR(256),
    SALARY         NUMBER(8,2),
    DEPARTMENT_ID  NUMBER(4,0),
    CHECKSUM       VARCHAR(256),
    START_DATE     DATE,
    END_DATE       DATE,
    CREATED_TIMESTAMP TIMESTAMP_NTZ,
    UPDATED_TIMESTAMP TIMESTAMP_NTZ
);

3.3. Generating Checksum for Change Detection

The same approach as SCD-1 is followed to generate checksum values using the MD5 function. This helps detect whether the incoming record is new, changed, or unchanged.

-- Generating Cheksum for change detection
SELECT
    EMPLOYEE_ID,  
    UPPER(TRIM(EMPLOYEE_NAME)) AS EMPLOYEE_NAME,
    SALARY,
    DEPARTMENT_ID,
    MD5(CONCAT(
        EMPLOYEE_ID::STRING, '|',
        UPPER(TRIM(EMPLOYEE_NAME)), '|',
        SALARY::STRING, '|',
        DEPARTMENT_ID::STRING
    )) AS CHECKSUM
FROM STG_EMPLOYEES;
Generating Checksum for each record for change detection
Generating Checksum for each record for change detection

3.4. Handling Updates and Inserts with MERGE Statements

SCD-2 differs from SCD-1 by preserving historical records by closing old versions of a record and inserting a new version with the updated values.

Step 1: Mark Updated Records as Inactive

If the checksum between source and target does not match, indicating the record already exists but there is a change in the row data, the existing record in the dimension table is closed by updating its END_DATE to the current date.

-- Update record as Inactive if already existing and change is detected
MERGE INTO DIM_EMPLOYEES tgt
USING (
    SELECT
        EMPLOYEE_ID,  
        UPPER(TRIM(EMPLOYEE_NAME)) AS EMPLOYEE_NAME,
        SALARY,
        DEPARTMENT_ID,
        MD5(CONCAT(
            EMPLOYEE_ID::STRING, '|',
            UPPER(TRIM(EMPLOYEE_NAME)), '|',
            SALARY::STRING, '|',
            DEPARTMENT_ID::STRING
        )) AS CHECKSUM
    FROM STG_EMPLOYEES
) src
ON tgt.EMPLOYEE_ID = src.EMPLOYEE_ID AND tgt.END_DATE = '9999-12-31'

WHEN MATCHED AND tgt.CHECKSUM != src.CHECKSUM THEN
    UPDATE SET
        tgt.END_DATE          = CURRENT_DATE,
        tgt.UPDATED_TIMESTAMP = CURRENT_TIMESTAMP
;

 Step 2: Insert New Records (for updated and new employees)

A new row is inserted with the updated values and a new validity range (START_DATE = CURRENT_DATE, END_DATE = ‘9999-12-31’).

-- Insert new records that are not already existing or updated
MERGE INTO DIM_EMPLOYEES tgt
USING (
    SELECT
        EMPLOYEE_ID,  
        UPPER(TRIM(EMPLOYEE_NAME)) AS EMPLOYEE_NAME,
        SALARY,
        DEPARTMENT_ID,
        MD5(CONCAT(
            EMPLOYEE_ID::STRING, '|',
            UPPER(TRIM(EMPLOYEE_NAME)), '|',
            SALARY::STRING, '|',
            DEPARTMENT_ID::STRING
        )) AS CHECKSUM
    FROM STG_EMPLOYEES
) src
ON tgt.EMPLOYEE_ID = src.EMPLOYEE_ID AND tgt.END_DATE = '9999-12-31'
        
WHEN NOT MATCHED THEN 
    INSERT (EMPLOYEE_ID, EMPLOYEE_NAME, SALARY, DEPARTMENT_ID, CHECKSUM, START_DATE, END_DATE, CREATED_TIMESTAMP, UPDATED_TIMESTAMP) 
    VALUES (src.EMPLOYEE_ID, src.EMPLOYEE_NAME, src.SALARY, src.DEPARTMENT_ID, src.CHECKSUM, CURRENT_DATE, DATE '9999-12-31', CURRENT_TIMESTAMP, CURRENT_TIMESTAMP)
;

Step 3: Mark Records Missing in Source as Inactive

If a record exists in the dimension table but is missing from the source, indicating the row is deleted from the source, it is also made inactive by setting its END_DATE to the current date.

-- Update Records Missing from Source as Inactive
UPDATE DIM_EMPLOYEES
SET
    END_DATE = CURRENT_DATE,
    UPDATED_TIMESTAMP = CURRENT_TIMESTAMP
WHERE
EMPLOYEE_ID IN (
    SELECT 
        a.EMPLOYEE_ID
    FROM DIM_EMPLOYEES a
    LEFT JOIN STG_EMPLOYEES b
         ON a.EMPLOYEE_ID = b.EMPLOYEE_ID
    WHERE 
        b.EMPLOYEE_ID IS NULL -- No match in the source
        AND a.END_DATE = '9999-12-31' -- And it's an active record
);

3.5. Initial Data Load with Merge

In the initial load, as no data is present in the dimension table, all records from the stage table are inserted into the dimension table.

Dimension table after initial load
Dimension table after initial load

3.6. Demonstrating Updates (SCD Type-2 Behaviour)

To simulate changes, the stage table is reloaded with updated salary values and additional employees.

-- Updated source data to demonstarte SCD Type-2 behaviour
TRUNCATE TABLE STG_EMPLOYEES;

INSERT INTO STG_EMPLOYEES (EMPLOYEE_ID, EMPLOYEE_NAME, SALARY, DEPARTMENT_ID) VALUES
    (101, 'Tony Stark',       150000.00, 10),   -- No change
    (102, 'Bruce Banner',     100000.00, 20),   -- Salary updated
    (103, 'Peter Parker',      65000.00, 30),   -- No change
    (104, 'Natasha Romanoff', 125000.00, 40),   -- Salary updated
    (105, 'Steve Rogers',     110000.00, 50),   -- New employee
    (106, 'Thor Odinson',     130000.00, 60);   -- New employee
;

Running the MERGE and UPDATE statements:

  • Marks old versions of Bruce Banner and Natasha Romanoff as inactive (with an END_DATE of today).
  • Inserts new versions of Bruce Banner and Natasha Romanoff with updated salaries.
  • Inserts Steve Rogers and Thor Odinson as new employees.
SELECT * FROM DIM_EMPLOYEES;
SCD Type-2 changes in Dimension table
SCD Type-2 changes in Dimension table

The result shows historical records for employees with validity ranges defined by START_DATE and END_DATE.

4. SCD Type 3 Implementation in Snowflake

A Slowly Changing Dimension (SCD) Type 3 is used to maintain a partial history of a record in the dimension table. The record stores the previous value of a chosen attribute along with the current attribute value in SCD Type-3 table.

  • Instead of overwriting data (Type-1) or storing multiple historical records (Type-2), SCD Type-3 maintains a limited history by storing both the current value and the immediately previous value within the same record.
  • This is useful in scenarios where only recent changes are relevant for analysis, such as tracking a department change or the latest two salary revisions.

Let us understand the implementation of Slowly Changing Dimension (SCD) Type-3 in Snowflake.

4.1. Setting up the Stage table

The same stage table STG_EMPLOYEES is used as the source for loading data into the dimension table.

-- Setting up Source Data
CREATE OR REPLACE TABLE STG_EMPLOYEES(
    EMPLOYEE_ID    NUMBER(6,0),
    EMPLOYEE_NAME  VARCHAR(256),
    SALARY         NUMBER(8,2),
    DEPARTMENT_ID  NUMBER(4,0)
);

INSERT INTO STG_EMPLOYEES (EMPLOYEE_ID, EMPLOYEE_NAME, SALARY, DEPARTMENT_ID) VALUES
    (101, 'Tony Stark',       150000.00, 10),   -- Engineering
    (102, 'Bruce Banner',      98000.00, 20),   -- Research
    (103, 'Peter Parker',      65000.00, 30),   -- Photography
    (104, 'Natasha Romanoff', 120000.00, 40);   -- Security
;

SELECT * FROM STG_EMPLOYEES;
Raw Source Data before Initial Load
Raw Source Data before Initial Load

4.2. Creating the Dimension Table

The dimension table DIM_EMPLOYEES_SCD3 is designed to hold both the current and previous values of selected attributes (Salary and Department in this case).

-- Creating SCD Type-3 Dimension table
CREATE OR REPLACE TABLE DIM_EMPLOYEES_SCD3 (
    EMPLOYEE_ID        NUMBER(6,0),
    EMPLOYEE_NAME      VARCHAR(256),
    SALARY             NUMBER(8,2),
    PREV_SALARY        NUMBER(8,2), --SCD Type-3 column
    DEPARTMENT_ID      NUMBER(4,0),
    PREV_DEPARTMENT_ID NUMBER(4,0), --SCD Type-3 column
    CHECKSUM           VARCHAR(256),
    CREATED_TIMESTAMP  TIMESTAMP_NTZ,
    UPDATED_TIMESTAMP  TIMESTAMP_NTZ
);

4.3. Generating Checksum for Change Detection

The same approach as SCD-1 and SCD-2 is followed to generate checksum values using the MD5 function. This helps detect whether the incoming record is new, changed, or unchanged.

-- Generating Cheksum for change detection
SELECT
    EMPLOYEE_ID,  
    UPPER(TRIM(EMPLOYEE_NAME)) AS EMPLOYEE_NAME,
    SALARY,
    DEPARTMENT_ID,
    MD5(CONCAT(
        EMPLOYEE_ID::STRING, '|',
        UPPER(TRIM(EMPLOYEE_NAME)), '|',
        SALARY::STRING, '|',
        DEPARTMENT_ID::STRING
    )) AS CHECKSUM
FROM STG_EMPLOYEES;
Generating Checksum for each record for change detection
Generating Checksum for each record for change detection

4.4. Implementing Merge Logic

A Merge statement is used to perform inserts and updates into the dimension table from the stage table based on the changes identified in the row data.

  • If a change is detected, the existing value is shifted to the PREV_* column, and the new value replaces the current column.
  • If the record is new, it is inserted with NULL values in the PREV_* columns.
-- Merge Logic for SCD Type-3
MERGE INTO DIM_EMPLOYEES_SCD3 tgt
USING (
    SELECT
        EMPLOYEE_ID,  
        UPPER(TRIM(EMPLOYEE_NAME)) AS EMPLOYEE_NAME,
        SALARY,
        DEPARTMENT_ID,
        MD5(CONCAT(
            EMPLOYEE_ID::STRING, '|',
            UPPER(TRIM(EMPLOYEE_NAME)), '|',
            SALARY::STRING, '|',
            DEPARTMENT_ID::STRING
        )) AS CHECKSUM
    FROM STG_EMPLOYEES
) src
ON tgt.EMPLOYEE_ID = src.EMPLOYEE_ID

-- If record exists and has changed → move old values into PREV_* columns
WHEN MATCHED AND tgt.CHECKSUM != src.CHECKSUM THEN
    UPDATE SET
        tgt.PREV_SALARY        = tgt.SALARY,
        tgt.SALARY             = src.SALARY,
        tgt.PREV_DEPARTMENT_ID = tgt.DEPARTMENT_ID,
        tgt.DEPARTMENT_ID      = src.DEPARTMENT_ID,
        tgt.CHECKSUM           = src.CHECKSUM,
        tgt.UPDATED_TIMESTAMP  = CURRENT_TIMESTAMP

-- If new record → insert as a new employee
WHEN NOT MATCHED THEN
    INSERT (
        EMPLOYEE_ID, EMPLOYEE_NAME,
        SALARY, PREV_SALARY,
        DEPARTMENT_ID, PREV_DEPARTMENT_ID,
        CHECKSUM,
        CREATED_TIMESTAMP, UPDATED_TIMESTAMP
    )
    VALUES (
        src.EMPLOYEE_ID, src.EMPLOYEE_NAME,
        src.SALARY, NULL,
        src.DEPARTMENT_ID, NULL,
        src.CHECKSUM,
        CURRENT_TIMESTAMP, CURRENT_TIMESTAMP
    );

4.5. Initial Data Load with Merge

In the initial load, as no data is present in the dimension table, all records from the stage table are inserted into the dimension table.

Dimension table after initial load
Dimension table after initial load

4.6. Demonstrating Updates (SCD Type-3 Behaviour)

To simulate changes, the stage table is refreshed with modified salary and department values.

-- Updated source data to demonstarte SCD Type-3 behaviour
TRUNCATE TABLE STG_EMPLOYEES;

INSERT INTO STG_EMPLOYEES (EMPLOYEE_ID, EMPLOYEE_NAME, SALARY, DEPARTMENT_ID) VALUES
    (101, 'Tony Stark',       150000.00, 10),   -- No change
    (102, 'Bruce Banner',     100000.00, 70),   -- Salary and department updated
    (103, 'Peter Parker',      65000.00, 30),   -- No change
    (104, 'Natasha Romanoff', 125000.00, 80);   -- Salary and department updated

Re-running the MERGE statement:

  • Bruce Banner’s previous department (20) is stored in PREV_DEPARTMENT_ID, and his new department (70) is reflected in DEPARTMENT_ID. Similarly, previous salary (98000) is stored in PREV_SALARY, and new salary(10000) is reflected in the SALARY column.
  • Natasha Romanoff’s previous salary (120000) and department (40) are stored in PREV_SALARY and PREV_DEPARTMENT_ID, while new values replace the current columns.
SELECT * FROM DIM_EMPLOYEES_SCD3;
SCD Type-3 changes in the Dimension table
SCD Type-3 changes in the Dimension table

The output shows both the current and previous values, providing a limited history for changed attributes.

Subscribe to our Newsletter !!

Related Articles:

Leave a Comment

Related Posts