HOW TO: Remove Duplicates in a Snowpark DataFrame?

Spread the love

Contents hide

dropDuplicates() Method in Snowflake Snowpark

1. Removing Duplicates across all columns

2. Removing Duplicates Based on a Subset of columns

dropDuplicates() vs distinct() in Snowflake Snowpark

dropDuplicates() Method in Snowflake Snowpark

The dropDuplicates() method in Snowpark returns a new DataFrame with all duplicate rows removed. This method removes all rows that have identical values across all columns. It also allows you to remove duplicates based on a subset of columns in the DataFrame.

Below are the two possible approaches to remove the duplicate rows using the dropDuplicates() method.

1. Removing Duplicates across all columns

Consider the following DataFrame as a source for the demonstration of removing duplicates in Snowpark.

data = [[1,1,1],
        [1,1,1],
        [1,1,0],
        [1,0,0],
        [1,0,0],
        [0,0,0]]

df = session.createDataFrame(data, schema=["a","b","c"])
df.show()

-------------------
|"A"  |"B"  |"C"  |
-------------------
|1    |1    |1    |
|1    |1    |1    |
|1    |1    |0    |
|1    |0    |0    |
|1    |0    |0    |
|0    |0    |0    |
-------------------

The following code demonstrates removing duplicate rows from a DataFrame using the dropDuplicates() method.

df_unique = df.dropDuplicates()
df_unique.show()

-------------------
|"A"  |"B"  |"C"  |
-------------------
|1    |1    |1    |
|1    |1    |0    |
|1    |0    |0    |
|0    |0    |0    |
-------------------

The duplicate rows (1,1,1) and (1,0,0) are removed in the new DataFrame created using the dropDuplicates() method.

2. Removing Duplicates Based on a Subset of columns

The following code demonstrates removing duplicate rows based on a subset of columns from the DataFrame using the dropDuplicates() method.

#// Removing Duplicates Based on a Column "A"  //

df_unique = df.dropDuplicates("a")
df_unique.show()

-------------------
|"A"  |"B"  |"C"  |
-------------------
|0    |0    |0    |
|1    |1    |1    |
-------------------

In the above example, duplicates are removed based on column ‘A’, where the unique values are 1 and 0 (highlighted in red in the image below). Based on these two values, two unique records (highlighted in blue in the image below) are returned as output.

Removing Duplicates Based on a Subset of columns

Below is another example where duplicates are removed from a Snowpark DataFrame based on columns “A” and “B”.

# Removing Duplicates Based on a Column "A" and "B"

df_unique = df.dropDuplicates("a","b")
df_unique.show()

-------------------
|"A"  |"B"  |"C"  |
-------------------
|0    |0    |0    |
|1    |0    |0    |
|1    |1    |1    |
-------------------

dropDuplicates() vs distinct() in Snowflake Snowpark

The `dropDuplicates()` method in Snowpark behaves exactly like the `distinct()` method when no subset of columns is specified for removing duplicates.

The following example demonstrates that the output of the `dropDuplicates()` and `distinct()` methods is identical.

df.show()
-------------------
|"A"  |"B"  |"C"  |
-------------------
|1    |1    |1    |
|1    |1    |1    |
|1    |1    |0    |
|1    |0    |0    |
|1    |0    |0    |
|0    |0    |0    |
-------------------

df.dropDuplicates().show()
-------------------
|"A"  |"B"  |"C"  |
-------------------
|1    |1    |1    |
|1    |1    |0    |
|1    |0    |0    |
|0    |0    |0    |
-------------------

df.distinct().show()
-------------------
|"A"  |"B"  |"C"  |
-------------------
|1    |1    |1    |
|1    |1    |0    |
|1    |0    |0    |
|0    |0    |0    |
-------------------

In both cases, the DataFrame is translated and executed as SQL using the “GROUP BY” clause in Snowflake by the Snowpark API, as shown below.

SELECT "A", "B", "C"
FROM(
       SELECT
              $1 AS "A",
              $2 AS "B",
              $3 AS "C"
       FROM
       VALUES
              (1::INT, 1::INT, 1::INT),
              (1::INT, 1::INT, 1::INT),
              (1::INT, 1::INT, 0::INT),
              (1::INT, 0::INT, 0::INT),
              (1::INT, 0::INT, 0::INT),
              (0::INT, 0::INT, 0::INT)
       )
GROUP BY "A", "B", "C"
;

When a subset of columns is specified in the `dropDuplicates()` method, the DataFrame is translated and executed as SQL using the “ROW_NUMBER” analytical function in Snowflake by the Snowpark API, as shown below.

df.dropDuplicates("a").show()
-------------------
|"A"  |"B"  |"C"  |
-------------------
|0    |0    |0    |
|1    |1    |1    |
-------------------

SELECT "A", "B", "C"
FROM
    (
    SELECT "A", "B", "C",
        row_number() OVER (PARTITION BY "A" ORDER BY "A" ASC NULLS FIRST) AS "TXT3PTQCE5"
    FROM
        (
        SELECT
            $1 AS "A",
            $2 AS "B",
            $3 AS "C"
        FROM
        VALUES
            (1::INT, 1::INT, 1::INT),
            (1::INT, 1::INT, 1::INT),
            (1::INT, 1::INT, 0::INT),
            (1::INT, 0::INT, 0::INT),
            (1::INT, 0::INT, 0::INT),
            (0::INT, 0::INT, 0::INT)
        )
    )
WHERE
    ("TXT3PTQCE5" = 1::INT)
;