Joins in Snowflake Snowpark

Spread the love

Contents hide

1. Introduction

2. Joins in Snowpark

3. Demonstration of JOINS in Snowpark

3.1. Join DataFrames in Snowpark

3.2. Join DataFrames referring to a Single Column Name in Snowpark

3.3. Rename Ambiguous Columns of Join operation Output in Snowpark

3.4. Rename Ambiguous Columns of Join operation Output using lsuffix and rsuffix

3.5. Join DataFrames based on Multiple Conditions in Snowpark.

3.6. Join Types in Snowpark

1. Introduction

Joins are used to combine rows from two or more tables, based on a related column between them. Joins allow for the creation of a comprehensive result set that incorporates relevant information from multiple tables.

In this article, let us explore how to join data between two DataFrames in Snowflake Snowpark.

2. Joins in Snowpark

The DataFrame.join method in Snowpark helps in performing a join of specified type on the data of the current DataFrame with another DataFrame based on a list of columns.

Syntax:

DataFrame.join(right DataFrame, <join_condition>, join_type=<join_type>)

Parameters:

right DataFrame – The other DataFrame to be joined with.

<join_condition> – The condition using which data in both the DataFrames are joined. The valid values for a join condition are

A column or a list of column names. When a column or list of column names are specified, this method assumes the same named columns are present in both the DataFrames.
Column names from both the DataFrames specifying the join condition.

<join_type> – The type of join to be applied to join data between two DataFrames. The Snowpark API for Python supports the following join types found in SQL.

SQL Join Type	Supported Value
Inner join	“inner” (the default value)
Left outer join	“left”, “leftouter”
Right outer join	“right”, “rightouter”
Full outer join	“full” “outer”, “fullouter”
Left semi join	“semi”, “leftsemi”
Left anti join	“anti”, “leftanti”
Cross join	“cross”

3. Demonstration of JOINS in Snowpark

Consider the EMPLOYEES and DEPARTMENTS data below for the demonstration of the implementation of the Joins in Snowpark.

#// create dataframe with employees data
employee_data = [
    [1,'TONY',101],
    [2,'STEVE',101],
    [3,'BRUCE',102],
    [4,'WANDA',102],
    [5,'VICTOR',103],
    [6,'HANK',105],
]

employee_schema = ["ID", "NAME", "DEPT_ID"]

df_emp =session.createDataFrame(employee_data, schema=employee_schema)
df_emp.show()

-----------------------------
|"ID"  |"NAME"  |"DEPT_ID"  |
-----------------------------
|1     |TONY    |101        |
|2     |STEVE   |101        |
|3     |BRUCE   |102        |
|4     |WANDA   |102        |
|5     |VICTOR  |103        |
|6     |HANK    |105        |
-----------------------------

#// create dataframe with departments data
department_data = [
    [101,'HR'],
    [102,'SALES'],
    [103,'IT'],
    [104,'FINANCE'],
]

department_schema = ["DEPT_ID", "NAME"]

df_dept =session.createDataFrame(department_data, schema=department_schema)
df_dept.show()

-----------------------
|"DEPT_ID"  |"NAME"   |
-----------------------
|101        |HR       |
|102        |SALES    |
|103        |IT       |
|104        |FINANCE  |
-----------------------

3.1. Join DataFrames in Snowpark

The EMPLOYEES and DEPARTMENTS DataFrames can be joined using the DataFrame.join method in Snowpark as shown below.

#// Joining two DataFrames

#// Method-1
df_emp.join(df_dept, df_emp.DEPT_ID == df_dept.DEPT_ID).show()

#// Method-2
df_emp.join(df_dept, df_emp["DEPT_ID"] == df_dept["DEPT_ID"]).show()

------------------------------------------------------------------------------
|"ID"  |"l_vrun_NAME"  |"l_vrun_DEPT_ID"  |"r_jc4z_DEPT_ID"  |"r_jc4z_NAME"  |
------------------------------------------------------------------------------
|1     |TONY           |101               |101               |HR             |
|2     |STEVE          |101               |101               |HR             |
|3     |BRUCE          |102               |102               |SALES          |
|4     |WANDA          |102               |102               |SALES          |
|5     |VICTOR         |103               |103               |IT             |
------------------------------------------------------------------------------

3.2. Join DataFrames referring to a Single Column Name in Snowpark

The DataFrames can be joined by referring to a single column name if the name of the column is same in both the DataFrames.

The EMPLOYEES and DEPARTMENTS DataFrames can be joined by referring to a single column DEPT_ID as shown below.

#// Joining two DataFrames referring to a single column
df_emp.join(df_dept, "DEPT_ID").show()

----------------------------------------------------
|"DEPT_ID"  |"ID"  |"l_9ml8_NAME"  |"r_8dfz_NAME"  |
----------------------------------------------------
|101        |1     |TONY           |HR             |
|101        |2     |STEVE          |HR             |
|102        |3     |BRUCE          |SALES          |
|102        |4     |WANDA          |SALES          |
|103        |5     |VICTOR         |IT             |
----------------------------------------------------

3.3. Rename Ambiguous Columns of Join operation Output in Snowpark

When two DataFrames are joined, the overlapping columns will have random column names in the resulting DataFrame as seen in the above examples.

The randomly named columns can be renamed using Column.alias as shown below.

#// Renaming the ambiguous columns
df_emp.join(df_dept, df_emp.DEPT_ID == df_dept.DEPT_ID).\
    select(df_emp.ID, df_emp.NAME.alias("EMP_NAME"), df_emp.DEPT_ID.alias("DEPT_ID"), df_dept.NAME.alias("DEPT_NAME")).show()

-----------------------------------------------
|"ID"  |"EMP_NAME"  |"DEPT_ID"  |"DEPT_NAME"  |
-----------------------------------------------
|1     |TONY        |101        |HR           |
|2     |STEVE       |101        |HR           |
|3     |BRUCE       |102        |SALES        |
|4     |WANDA       |102        |SALES        |
|5     |VICTOR      |103        |IT           |
-----------------------------------------------

3.4. Rename Ambiguous Columns of Join operation Output using lsuffix and rsuffix

The randomly named overlapping columns can be renamed using lsuffix and rsuffix parameters in DataFrame.join method.

lsuffix – Suffix to add to the overlapping columns of the left DataFrame.
rsuffix – Suffix to add to the overlapping columns of the right DataFrame.

#// Renaming the ambiguous columns using lsuffix and rsuffix
df_emp.join(df_dept, df_emp.DEPT_ID == df_dept.DEPT_ID, lsuffix="_EMP", rsuffix="_DEPT").show()

--------------------------------------------------------------------
|"ID"  |"NAME_EMP"  |"DEPT_ID_EMP"  |"DEPT_ID_DEPT"  |"NAME_DEPT"  |
--------------------------------------------------------------------
|1     |TONY        |101            |101             |HR           |
|2     |STEVE       |101            |101             |HR           |
|3     |BRUCE       |102            |102             |SALES        |
|4     |WANDA       |102            |102             |SALES        |
|5     |VICTOR      |103            |103             |IT           |
--------------------------------------------------------------------

It is recommended to use lsuffix and rsuffix parameters within DataFrame.join method when there are overlapping columns between the DataFrames.

3.5. Join DataFrames based on Multiple Conditions in Snowpark.

DataFrames can be joined based on multiple conditions separated by the “&” symbol as shown below.

DataFrame.join(right DataFrame, (<join_condition_1>) & (< join_condition_2>))

When the names of the columns are the same between both the DataFrames, they can be joined by passing a list of column names as shown below.

DataFrame.join(right DataFrame, ["col_1", "col_2", ..]

The following is an example of joining EMPLOYEES and DEPARTMENTS based on multiple conditions.

#// Joining two DataFrames based on Multiple conditions
df_emp.join(df_dept, (df_emp.DEPT_ID == df_dept.DEPT_ID) & (df_emp.ID < df_dept.DEPT_ID), \
            lsuffix="_EMP", rsuffix="_DEPT").show()

3.6. Join Types in Snowpark

By default, the DataFrame.join method applies an inner join to join the data between the two DataFrames. The other supported join types can be specified to join the data between two DataFrames as shown below.

#// Left Outer Join
df_emp.join(df_dept, df_emp.DEPT_ID == df_dept.DEPT_ID, lsuffix="_EMP", rsuffix="_DEPT", join_type="left").show()
--------------------------------------------------------------------
|"ID"  |"NAME_EMP"  |"DEPT_ID_EMP"  |"DEPT_ID_DEPT"  |"NAME_DEPT"  |
--------------------------------------------------------------------
|1     |TONY        |101            |101             |HR           |
|2     |STEVE       |101            |101             |HR           |
|3     |BRUCE       |102            |102             |SALES        |
|4     |WANDA       |102            |102             |SALES        |
|5     |VICTOR      |103            |103             |IT           |
|6     |HANK        |105            |NULL            |NULL         |
--------------------------------------------------------------------

Instead of the “join_type” parameter, we can also use the “how” parameter to specify the join condition.

#// Right Outer Join
df_emp.join(df_dept, df_emp.DEPT_ID == df_dept.DEPT_ID, lsuffix="_EMP", rsuffix="_DEPT", how="right").show()

--------------------------------------------------------------------
|"ID"  |"NAME_EMP"  |"DEPT_ID_EMP"  |"DEPT_ID_DEPT"  |"NAME_DEPT"  |
--------------------------------------------------------------------
|1     |TONY        |101            |101             |HR           |
|2     |STEVE       |101            |101             |HR           |
|3     |BRUCE       |102            |102             |SALES        |
|4     |WANDA       |102            |102             |SALES        |
|5     |VICTOR      |103            |103             |IT           |
|NULL  |NULL        |NULL           |104             |FINANCE      |
--------------------------------------------------------------------