IN Operator in Snowflake Snowpark

Spread the love

Contents hide

1. Introduction

2. IN Operator in Snowflake Snowpark

3. Demonstration of IN operator in Snowpark

3.1. Filtering Data from a Snowpark DataFrame using a Single Value

3.2. Filtering Data from a Snowpark DataFrame using Multiple Values

4. Implementing SubQueries in Snowpark using Column.in_() method

STEP-1: Extract the list of values to be passed to the IN operator into a DataFrame

STEP-2: Pass the DataFrame representing a Subquery as Input Parameter to the Column.in_() Method

5. Implementing the IN operator in the Snowpark SELECT clause

6. NOT IN Operator in Snowflake Snowpark

STEP-1: Use the IN Operator in the SELECT Clause to Identify Values Not Present in the Specified List

STEP-2: Filter Values Not Present in the Specified List

STEP-3: Filter DataFrame by Passing a List of Values Not Present in the Specified List

1. Introduction

The IN operator in SQL allows you to specify multiple values in a WHERE clause to filter the data. It serves as a shorthand for employing multiple OR conditions. Additionally, the IN operator can be utilized with a subquery within the WHERE clause.

In this article, let us explore how the IN operator can be implemented with DataFrames in Snowflake Snowpark.

2. IN Operator in Snowflake Snowpark

The Column.in_() method in Snowpark returns a list of values that can be passed to the DataFrame.filter() method (equivalent to WHERE in SQL) to perform the equivalent of an IN operator in SQL.

The supported values of Column.in_() method are a sequence of values or a DataFrame that represents a subquery.

3. Demonstration of IN operator in Snowpark

Consider the EMPLOYEES and DEPARTMENTS data below for the demonstration of the IN operator in Snowpark.

#// create dataframe with employees data
employee_data = [
    [1,'TONY',101],
    [2,'STEVE',101],
    [3,'BRUCE',102],
    [4,'WANDA',102],
    [5,'VICTOR',103],
    [6,'HANK',105],
]

employee_schema = ["ID", "NAME", "DEPT_ID"]

df_emp =session.createDataFrame(employee_data, schema=employee_schema)
df_emp.show()

-----------------------------
|"ID"  |"NAME"  |"DEPT_ID"  |
-----------------------------
|1     |TONY    |101        |
|2     |STEVE   |101        |
|3     |BRUCE   |102        |
|4     |WANDA   |102        |
|5     |VICTOR  |103        |
|6     |HANK    |105        |
-----------------------------

#// create dataframe with departments data
department_data = [
    [101,'HR'],
    [102,'SALES'],
    [103,'IT'],
    [104,'FINANCE'],
]

department_schema = ["DEPT_ID", "NAME"]

df_dept =session.createDataFrame(department_data, schema=department_schema)
df_dept.show()

-----------------------
|"DEPT_ID"  |"NAME"   |
-----------------------
|101        |HR       |
|102        |SALES    |
|103        |IT       |
|104        |FINANCE  |
-----------------------

3.1. Filtering Data from a Snowpark DataFrame using a Single Value

The following example extracts details of employees with ID=1 from the EMPLOYEES DataFrame.

df_emp.filter(col("ID")==1).show()

-----------------------------
|"ID"  |"NAME"  |"DEPT_ID"  |
-----------------------------
|1     |TONY    |101        |
-----------------------------

The above Snowpark code is equivalent to the following SQL query.

SELECT * FROM EMPLOYEES WHERE ID = 1;

3.2. Filtering Data from a Snowpark DataFrame using Multiple Values

The following example extracts details of employees with ID 1, 2, and 3 from the EMPLOYEES DataFrame using Column.in_() method.

from snowflake.snowpark.functions import col

df_emp.filter(col("ID").in_(1,2,3)).show()

#//         (or)         //
 
df_emp.filter(df_emp.col("ID").in_(1,2,3)).show()

-----------------------------
|"ID"  |"NAME"  |"DEPT_ID"  |
-----------------------------
|1     |TONY    |101        |
|2     |STEVE   |101        |
|3     |BRUCE   |102        |
-----------------------------

The above Snowpark code is equivalent to the following SQL query.

SELECT * FROM EMPLOYEES WHERE ID IN (1,2,3);

4. Implementing SubQueries in Snowpark using Column.in_() method

A Subquery also known as an inner query or nested query is a query nested within another SQL statement. The inner query is executed first and its results are used by the outer query to further filter, join, or manipulate data.

Consider a scenario where we need to extract the details of all employees belonging to the SALES department. The same can be achieved using the below SQL query using a filter condition on the SALES department in a subquery.

SELECT * FROM EMPLOYEES WHERE DEPT_ID IN (
SELECT DEPT_ID FROM DEPARTMENTS WHERE NAME = 'SALES');

Let us understand how the same can be implemented in Snowpark.

STEP-1: Extract the list of values to be passed to the IN operator into a DataFrame

In this scenario, we need to extract the DEPT_ID value of the SALES department from the DEPARTMENTS DataFrame into a new DataFrame.

The following code applies a filter on the NAME field with values as ‘SALES’ and selects only the DEPT_ID field from the DEPARTMENTS DataFrame into a new DataFrame.

df_dept_SALES = df_dept.filter(col("NAME")=="SALES").select("DEPT_ID")
df_dept_SALES.show()

-------------
|"DEPT_ID"  |
-------------
|102        |
-------------

STEP-2: Pass the DataFrame representing a Subquery as Input Parameter to the Column.in_() Method

The following code returns the details of all employees belonging to the SALES department by passing the DataFrame that holds the SALES department ID value as input to the Column.in_() method.

df_emp_SALES = df_emp.filter(col("DEPT_ID").in_(df_dept_SALES))
df_emp_SALES.show()

-----------------------------
|"ID"  |"NAME"  |"DEPT_ID"  |
-----------------------------
|3     |BRUCE   |102        |
|4     |WANDA   |102        |
-----------------------------

5. Implementing the IN operator in the Snowpark SELECT clause

The Column.in_() method in Snowpark can also be passed to a DataFrame.select() call. The expression returns a Boolean value and evaluates to true if the value in the column is one of the values in the specified sequence.

The following code returns the ID column from the EMPLOYEES DataFrame along with a new column that returns true if the ID value is present in one of the values passed to the IN operator.

#// IN operator in SELECT clause
df_emp.select(col("ID"), col("ID").in_(1,2,3).alias("IS_EXISTS")).show()

----------------------
|"ID"  |"IS_EXISTS"  |
----------------------
|1     |True         |
|2     |True         |
|3     |True         |
|4     |False        |
|5     |False        |
|6     |False        |
----------------------

The above Snowpark code is equivalent to the following SQL query.

SELECT ID, ID IN (1,2,3) AS IS_EXISTS FROM EMPLOYEES;

6. NOT IN Operator in Snowflake Snowpark

In SQL, using the NOT keyword preceding the IN operator retrieves all records that do not match any of the values in the list. For example, the following SQL statement returns all employee records whose ID is not 1, 2, or 3.

SELECT * FROM EMPLOYEES WHERE ID NOT IN (1,2,3);

There is no in-built method available in Snowpark that can perform the same actions as NOT IN operator in SQL.

To implement the NOT IN operator in Snowpark, we still utilize the Column.in_() method. However, it’s essential to ensure that the DataFrame passed as an input parameter to the method contains a list of values other than those in the specified list.

GET all employee records whose ID is not 1, 2, or 3.

STEP-1: Use the IN Operator in the SELECT Clause to Identify Values Not Present in the Specified List

The following code returns ‘True’ for IDs that are present in the list passed to the Column.in_() method, and ‘False’ if they are not present.

df1 = df_emp.select(col("ID"), col("ID").in_(1,2,3).alias("IS_EXISTS"))
df1.show()

----------------------
|"ID"  |"IS_EXISTS"  |
----------------------
|1     |True         |
|2     |True         |
|3     |True         |
|4     |False        |
|5     |False        |
|6     |False        |
----------------------

STEP-2: Filter Values Not Present in the Specified List

The following code retrieves all the IDs from the EMPLOYEES DataFrame, excluding those that are present in the specified list, by filtering out the records that returned ‘False’ in the previous step.

df2 = df1.filter(col("IS_EXISTS")=='False').select("ID")
df2.show()

--------
|"ID"  |
--------
|4     |
|5     |
|6     |
--------

STEP-3: Filter DataFrame by Passing a List of Values Not Present in the Specified List

The following code retrieves all employee records whose ID is not 1, 2 or 3 from EMPLOYEES DataFrame by passing a DataFrame that holds all the employee IDs except for 1, 2, and 3.

df3 = df_emp.filter(col("ID").in_(df2))
df3.show()

-----------------------------
|"ID"  |"NAME"  |"DEPT_ID"  |
-----------------------------
|4     |WANDA   |102        |
|5     |VICTOR  |103        |
|6     |HANK    |105        |
-----------------------------

All the above mentioned steps are equivalent to the following SQL query.

SELECT * FROM EMPLOYEES WHERE ID IN (
       SELECT ID FROM(
              SELECT ID, ID IN (1,2,3) IS_EXISTS FROM EMPLOYEES)
       WHERE IS_EXISTS = 'False');

The process to identify the values that are not present in the required list may vary depending on the specific scenario. However, the overall approach is to identify the values absent from the list specified in the NOT IN condition and leverage them to filter the records.

Subscribe to our Newsletter !!

Join our WhatsApp Channel

Connect with us on LinkedIn

Related Articles: