HOW TO: Remove Duplicates in Snowflake?

Spread the love

Introduction

There are several ways in which we can remove duplicates from a Snowflake table. Consider below sample Employee data for the demonstration purpose.

CREATE OR REPLACE TABLE EMPLOYEE (
  EMPLOYEE_ID NUMBER(6,0),
  NAME VARCHAR2(20),
  SALARY NUMBER(8,2)
);

INSERT INTO EMPLOYEE(EMPLOYEE_ID,NAME,SALARY) VALUES
(100,'Jennifer',4400),
(100,'Jennifer',4400),
(101,'Michael',13000),
(101,'Michael',13000),
(101,'Michael',13000),
(102,'Pat',6000),
(102,'Pat',6000),
(103,'Den',11000)
;

SELECT * FROM EMPLOYEE;

RESULT :

EMPLOYEE_IDNAMESALARY
100Jennifer4400
100Jennifer4400
101Michael13000
101Michael13000
101Michael13000
102Pat6000
102Pat6000
103Den11000

Let us first understand different ways of extracting unique records from a table.

Extracting Unique records from a Snowflake table

The following methods can be used to extract unique records from a Snowflake table.

  • Using DISTINCT Keyword
  • Using GROUP BY Clause
  • Using ROW_NUMBER Analytic function

Using DISTINCT Keyword

The DISTINCT keyword is used in conjunction with SELECT is used to return only distinct (unique) values from a dataset.

The below SQL query returns unique records from the EMPLOYEE table using DISTINCT keyword.

SELECT DISTINCT * FROM EMPLOYEE;

RESULT :

EMPLOYEE_IDNAMESALARY
100Jennifer4400
101Michael13000
102Pat6000
103Den11000

Using GROUP BY Clause

The GROUP BY clause is used with SELECT statement to collect data from multiple records and group the results by one or more columns. By applying GROUP BY function on all the source columns, unique records can be extracted from dataset.

The below SQL query returns unique records from the EMPLOYEE table using GROUP BY clause.

SELECT 
  EMPLOYEE_ID,
  NAME,
  SALARY
FROM EMPLOYEE
GROUP BY EMPLOYEE_ID, NAME, SALARY;

RESULT :

EMPLOYEE_IDNAMESALARY
100Jennifer4400
101Michael13000
102Pat6000
103Den11000

The GROUP BY ALL clause is similar to GROUP BY clause except that it automatically include all non-aggregate items in the SELECT clause for grouping, eliminating the need to specifying them again for data grouping.

The below query is equivalent to the above written query using GROUP BY clause.

SELECT 
  EMPLOYEE_ID,
  NAME,
  SALARY
FROM EMPLOYEE
GROUP BY ALL;

Using ROW_NUMBER Analytic function

The ROW_NUMBER analytic function is used to assign sequential numbering to the rows within a window partition of the result set.

The below SQL query assigns row numbers to each unique set of records using ROW_NUMBER analytic function.

SELECT
  EMPLOYEE_ID,
  NAME,
  SALARY,
  ROW_NUMBER() OVER(PARTITION BY EMPLOYEE_ID, NAME, SALARY ORDER BY EMPLOYEE_ID) AS ROW_NUMBER
FROM EMPLOYEE;

RESULT :

EMPLOYEE_IDNAMESALARYROW_NUMBER
100Jennifer44001
100Jennifer44002
101Michael130001
101Michael130002
101Michael130003
102Pat60001
102Pat60002
103Den110001

Once row numbers are assigned, the unique records from the table can be extracted by querying the rows with row number 1.

The below SQL query extracts unique records from EMPLOYEE table using ROW_NUMBER analytic function.

SELECT EMPLOYEE_ID, NAME, SALARY
FROM( 
  SELECT
    EMPLOYEE_ID,
    NAME,
    SALARY,
    ROW_NUMBER() OVER(PARTITION BY EMPLOYEE_ID, NAME, SALARY ORDER BY EMPLOYEE_ID) AS ROW_NUMBER
  FROM EMPLOYEE)
WHERE ROW_NUMBER = 1;

RESULT :

EMPLOYEE_IDNAMESALARY
101Michael13000
100Jennifer4400
102Pat6000
103Den11000

This is a two-step process to filter the results of a query which uses a window function. To avoid this additional step, Snowflake introduced QUALIFY clause.

QUALIFY clause in a SELECT statement allows you to filter query results of Window functions within the same query. What QUALIFY does with window functions is analogous to what HAVING does with GROUP BY clause.

The below SQL query extracts unique records from EMPLOYEE table using ROW_NUMBER analytic function in conjunction with QUALIFY clause.

SELECT
  EMPLOYEE_ID,
  NAME,
  SALARY,
  ROW_NUMBER() OVER(PARTITION BY EMPLOYEE_ID, NAME, SALARY ORDER BY EMPLOYEE_ID) AS ROW_NUMBER
FROM EMPLOYEE;
QUALIFY ROW_NUMBER = 1;

Removing Duplicate records from a Snowflake table

The following methods can be used to remove exact row duplicates from a Snowflake table.

  • Using SWAP WITH command
  • Using OVERWRITE command

Using SWAP WITH command

The SWAP WITH command swaps all content and metadata between two specified tables, including any integrity constraints defined for the tables. The two tables are essentially renamed in a single transaction.

Removing duplicate using this method involves four steps.

  1. Create a new table with the structure of source table without any data.
  2. Insert unique records from the source table into the newly created table.
  3. Swap the data between two tables.
  4. Delete the table created in first step.

STEP-1:

The Snowflake CREATE TABLE LIKE statement creates a new table with just the structure of the existing table without copying the data with exact same column names, data types, default values and constraints.

The below SQL statement creates EMPLOYEE_DUP table with same structure as the EMPLOYEE table without any data.

CREATE OR REPLACE TABLE EMPLOYEE_DUP LIKE EMPLOYEE;

STEP-2:

The below SQL statement inserts unique records into the intermediate table created in earlier step.

INSERT INTO EMPLOYEE_DUP SELECT DISTINCT * FROM EMPLOYEE;

Note that you could use any of the three methods discussed in the earlier section in the select clause to insert unique records into the intermediate table.

STEP-3:

The below SQL statement swaps the data between two tables EMPLOYEE and EMPLOYEE_DUP.

ALTER TABLE EMPLOYEE_DUP SWAP WITH EMPLOYEE;

This results in the actual source table EMPLOYEE having only unique records.

STEP-4:

The below statement drops the intermediate table used which now contains the entire data from source table.

DROP TABLE EMPLOYEE_DUP;

Insert using OVERWRITE command

The INSERT statement with OVERWRITE command deletes the existing records in the table before inserting the data into the table.

The below SQL statement overrides EMPLOYEE data with unique records truncating existing data with duplicates.

INSERT OVERWRITE INTO EMPLOYEE SELECT DISTINCT * FROM EMPLOYEE;

This method avoids steps of creating an intermediate table to hold the unique records from the table and swapping the data between them. All these actions are handled in a single SQL statement using OVERWRITE command.

Note that you could use any of the three methods discussed in the earlier section in the select clause to override data in the table with unique records.

Removing Duplicate records based on a Key field from a Snowflake table

In most data warehouse tables, a surrogate key column is used. The surrogate keys are simple, system generated, incremental unique values. This column is used as an identifier for each row rather than relying on pre-existing attributes.

Let us make a small change to the EMPLOYEE table used as an example for demonstration in earlier sections by adding a surrogate key column EMPLOYEE_KEY to understand working with tables containing surrogate keys.

CREATE OR REPLACE TABLE EMPLOYEE (
  EMPLOYEE_KEY NUMBER(6,0),
  EMPLOYEE_ID NUMBER(6,0),
  NAME VARCHAR2(20),
  SALARY NUMBER(8,2)
);

INSERT INTO EMPLOYEE(EMPLOYEE_KEY,EMPLOYEE_ID,NAME,SALARY) VALUES
(1,100,'Jennifer',4400),
(2,100,'Jennifer',4400),
(3,101,'Michael',13000),
(4,101,'Michael',13000),
(5,101,'Michael',13000),
(6,102,'Pat',6000),
(7,102,'Pat',6000),
(8,103,'Den',11000)
;

SELECT * FROM EMPLOYEE;

RESULT :

EMPLOYEE_KEYEMPLOYEE_IDNAMESALARY
1100Jennifer4400
2100Jennifer4400
3101Michael13000
4101Michael13000
5101Michael13000
6102Pat6000
7102Pat6000
8103Den11000

Note that if you use any of the three methods discussed in the initial section of the article as is to extract unique records, it will not remove duplicates as the EMPLOYEE_KEY is unique for each record.

If we are considering the record with highest EMPLOYEE_KEY value as the latest record out of records with same EMPLOYEE_ID, then we can make use of ROW_NUMBER analytic function to assign row numbers to records with same EMPLOYEE_ID.

The below SQL query assigns row numbers to the records in EMPLOYEE table partitioned by EMPLOYEE_ID starting with highest EMPLOYEE_KEY value.

SELECT
    EMPLOYEE_KEY,
    EMPLOYEE_ID,
    NAME,
    SALARY,
    ROW_NUMBER() OVER(PARTITION BY EMPLOYEE_ID ORDER BY EMPLOYEE_KEY DESC) AS ROW_NUMBER
FROM EMPLOYEE
;

RESULT :

EMPLOYEE_KEYEMPLOYEE_IDNAMESALARYROW_NUMBER
2100Jennifer44001
1100Jennifer44002
5101Michael130001
4101Michael130002
3101Michael130003
7102Pat60001
6102Pat60002
8103Den110001

The records with row number 1 represent the unique records in the table. The below SQL query extracts unique records from EMPLOYEE table using ROW_NUMBER analytic function.

SELECT EMPLOYEE_KEY, EMPLOYEE_ID, NAME, SALARY
FROM( 
  SELECT
    EMPLOYEE_KEY,
    EMPLOYEE_ID,
    NAME,
    SALARY,
    ROW_NUMBER() OVER(PARTITION BY EMPLOYEE_ID ORDER BY EMPLOYEE_KEY DESC) AS ROW_NUMBER
  FROM EMPLOYEE
)
WHERE ROW_NUMBER = 1
;

RESULT :

EMPLOYEE_KEYEMPLOYEE_IDNAMESALARY
2100Jennifer4400
5101Michael13000
7102Pat6000
8103Den11000

Once we have the query which gets the unique records from the table, we can use any of the two methods to replace records with unique records in the table.

The below SQL query overrides the data in the EMPLOYEE table with unique records using OVERWRITE command.

INSERT OVERWRITE INTO EMPLOYEE
SELECT EMPLOYEE_KEY, EMPLOYEE_ID, NAME, SALARY
FROM( 
  SELECT
    EMPLOYEE_KEY,
    EMPLOYEE_ID,
    NAME,
    SALARY,
    ROW_NUMBER() OVER(PARTITION BY EMPLOYEE_ID ORDER BY EMPLOYEE_KEY DESC) AS ROW_NUMBER
  FROM EMPLOYEE
)
WHERE ROW_NUMBER = 1
;

Subscribe to our Newsletter !!

Related Articles:

  • What is Snowflake?

    Understand what Snowflake is and how is it different from other data cloud platforms, its key features and benefits.

    READ MORE

  • Snowflake Architecture

    Snowflake’s unique multi-cluster Shared Data architecture combines benefits of both Shared-Disk and Shared-Nothing architectures.

    READ MORE

  • HOW TO: Create Snowflake Free Trial Account?

    Understand how to create a Snowflake free trail account which lasts for 30 days and provides $400 of free usage credits.

    READ MORE

Leave a Comment

Related Posts