1. Introduction
Joins are used to combine rows from two or more tables, based on a related column between them. Joins allow for the creation of a comprehensive result set that incorporates relevant information from multiple tables.
In this article, let us explore how to join data between two DataFrames in Snowflake Snowpark.
2. Joins in Snowpark
The DataFrame.join method in Snowpark helps in performing a join of specified type on the data of the current DataFrame with another DataFrame based on a list of columns.
Syntax:
DataFrame.join(right DataFrame, <join_condition>, join_type=<join_type>)
Parameters:
right DataFrame – The other DataFrame to be joined with.
<join_condition> – The condition using which data in both the DataFrames are joined. The valid values for a join condition are
- A column or a list of column names. When a column or list of column names are specified, this method assumes the same named columns are present in both the DataFrames.
- Column names from both the DataFrames specifying the join condition.
<join_type> – The type of join to be applied to join data between two DataFrames. The Snowpark API for Python supports the following join types found in SQL.
SQL Join Type | Supported Value |
Inner join | “inner” (the default value) |
Left outer join | “left”, “leftouter” |
Right outer join | “right”, “rightouter” |
Full outer join | “full” “outer”, “fullouter” |
Left semi join | “semi”, “leftsemi” |
Left anti join | “anti”, “leftanti” |
Cross join | “cross” |
3. Demonstration of JOINS in Snowpark
Consider the EMPLOYEES and DEPARTMENTS data below for the demonstration of the implementation of the Joins in Snowpark.
#// create dataframe with employees data
employee_data = [
[1,'TONY',101],
[2,'STEVE',101],
[3,'BRUCE',102],
[4,'WANDA',102],
[5,'VICTOR',103],
[6,'HANK',105],
]
employee_schema = ["ID", "NAME", "DEPT_ID"]
df_emp =session.createDataFrame(employee_data, schema=employee_schema)
df_emp.show()
-----------------------------
|"ID" |"NAME" |"DEPT_ID" |
-----------------------------
|1 |TONY |101 |
|2 |STEVE |101 |
|3 |BRUCE |102 |
|4 |WANDA |102 |
|5 |VICTOR |103 |
|6 |HANK |105 |
-----------------------------
#// create dataframe with departments data
department_data = [
[101,'HR'],
[102,'SALES'],
[103,'IT'],
[104,'FINANCE'],
]
department_schema = ["DEPT_ID", "NAME"]
df_dept =session.createDataFrame(department_data, schema=department_schema)
df_dept.show()
-----------------------
|"DEPT_ID" |"NAME" |
-----------------------
|101 |HR |
|102 |SALES |
|103 |IT |
|104 |FINANCE |
-----------------------
3.1. Join DataFrames in Snowpark
The EMPLOYEES and DEPARTMENTS DataFrames can be joined using the DataFrame.join method in Snowpark as shown below.
#// Joining two DataFrames
#// Method-1
df_emp.join(df_dept, df_emp.DEPT_ID == df_dept.DEPT_ID).show()
#// Method-2
df_emp.join(df_dept, df_emp["DEPT_ID"] == df_dept["DEPT_ID"]).show()
------------------------------------------------------------------------------
|"ID" |"l_vrun_NAME" |"l_vrun_DEPT_ID" |"r_jc4z_DEPT_ID" |"r_jc4z_NAME" |
------------------------------------------------------------------------------
|1 |TONY |101 |101 |HR |
|2 |STEVE |101 |101 |HR |
|3 |BRUCE |102 |102 |SALES |
|4 |WANDA |102 |102 |SALES |
|5 |VICTOR |103 |103 |IT |
------------------------------------------------------------------------------
3.2. Join DataFrames referring to a Single Column Name in Snowpark
The DataFrames can be joined by referring to a single column name if the name of the column is same in both the DataFrames.
The EMPLOYEES and DEPARTMENTS DataFrames can be joined by referring to a single column DEPT_ID as shown below.
#// Joining two DataFrames referring to a single column
df_emp.join(df_dept, "DEPT_ID").show()
----------------------------------------------------
|"DEPT_ID" |"ID" |"l_9ml8_NAME" |"r_8dfz_NAME" |
----------------------------------------------------
|101 |1 |TONY |HR |
|101 |2 |STEVE |HR |
|102 |3 |BRUCE |SALES |
|102 |4 |WANDA |SALES |
|103 |5 |VICTOR |IT |
----------------------------------------------------
3.3. Rename Ambiguous Columns of Join operation Output in Snowpark
When two DataFrames are joined, the overlapping columns will have random column names in the resulting DataFrame as seen in the above examples.
The randomly named columns can be renamed using Column.alias as shown below.
#// Renaming the ambiguous columns
df_emp.join(df_dept, df_emp.DEPT_ID == df_dept.DEPT_ID).\
select(df_emp.ID, df_emp.NAME.alias("EMP_NAME"), df_emp.DEPT_ID.alias("DEPT_ID"), df_dept.NAME.alias("DEPT_NAME")).show()
-----------------------------------------------
|"ID" |"EMP_NAME" |"DEPT_ID" |"DEPT_NAME" |
-----------------------------------------------
|1 |TONY |101 |HR |
|2 |STEVE |101 |HR |
|3 |BRUCE |102 |SALES |
|4 |WANDA |102 |SALES |
|5 |VICTOR |103 |IT |
-----------------------------------------------
3.4. Rename Ambiguous Columns of Join operation Output using lsuffix and rsuffix
The randomly named overlapping columns can be renamed using lsuffix and rsuffix parameters in DataFrame.join method.
- lsuffix – Suffix to add to the overlapping columns of the left DataFrame.
- rsuffix – Suffix to add to the overlapping columns of the right DataFrame.
#// Renaming the ambiguous columns using lsuffix and rsuffix
df_emp.join(df_dept, df_emp.DEPT_ID == df_dept.DEPT_ID, lsuffix="_EMP", rsuffix="_DEPT").show()
--------------------------------------------------------------------
|"ID" |"NAME_EMP" |"DEPT_ID_EMP" |"DEPT_ID_DEPT" |"NAME_DEPT" |
--------------------------------------------------------------------
|1 |TONY |101 |101 |HR |
|2 |STEVE |101 |101 |HR |
|3 |BRUCE |102 |102 |SALES |
|4 |WANDA |102 |102 |SALES |
|5 |VICTOR |103 |103 |IT |
--------------------------------------------------------------------
It is recommended to use lsuffix and rsuffix parameters within DataFrame.join method when there are overlapping columns between the DataFrames.
3.5. Join DataFrames based on Multiple Conditions in Snowpark.
DataFrames can be joined based on multiple conditions separated by the “&” symbol as shown below.
DataFrame.join(right DataFrame, (<join_condition_1>) & (< join_condition_2>))
When the names of the columns are the same between both the DataFrames, they can be joined by passing a list of column names as shown below.
DataFrame.join(right DataFrame, ["col_1", "col_2", ..]
The following is an example of joining EMPLOYEES and DEPARTMENTS based on multiple conditions.
#// Joining two DataFrames based on Multiple conditions
df_emp.join(df_dept, (df_emp.DEPT_ID == df_dept.DEPT_ID) & (df_emp.ID < df_dept.DEPT_ID), \
lsuffix="_EMP", rsuffix="_DEPT").show()
3.6. Join Types in Snowpark
By default, the DataFrame.join method applies an inner join to join the data between the two DataFrames. The other supported join types can be specified to join the data between two DataFrames as shown below.
#// Left Outer Join
df_emp.join(df_dept, df_emp.DEPT_ID == df_dept.DEPT_ID, lsuffix="_EMP", rsuffix="_DEPT", join_type="left").show()
--------------------------------------------------------------------
|"ID" |"NAME_EMP" |"DEPT_ID_EMP" |"DEPT_ID_DEPT" |"NAME_DEPT" |
--------------------------------------------------------------------
|1 |TONY |101 |101 |HR |
|2 |STEVE |101 |101 |HR |
|3 |BRUCE |102 |102 |SALES |
|4 |WANDA |102 |102 |SALES |
|5 |VICTOR |103 |103 |IT |
|6 |HANK |105 |NULL |NULL |
--------------------------------------------------------------------
Instead of the “join_type” parameter, we can also use the “how” parameter to specify the join condition.
#// Right Outer Join
df_emp.join(df_dept, df_emp.DEPT_ID == df_dept.DEPT_ID, lsuffix="_EMP", rsuffix="_DEPT", how="right").show()
--------------------------------------------------------------------
|"ID" |"NAME_EMP" |"DEPT_ID_EMP" |"DEPT_ID_DEPT" |"NAME_DEPT" |
--------------------------------------------------------------------
|1 |TONY |101 |101 |HR |
|2 |STEVE |101 |101 |HR |
|3 |BRUCE |102 |102 |SALES |
|4 |WANDA |102 |102 |SALES |
|5 |VICTOR |103 |103 |IT |
|NULL |NULL |NULL |104 |FINANCE |
--------------------------------------------------------------------
Subscribe to our Newsletter !!
Related Articles:
- Introduction to Snowflake Snowpark for Python
- HOW TO: Create and Read Data from Snowflake Snowpark DataFrames?
- HOW TO: Write data into Snowflake from a Snowpark DataFrame?
- HOW TO: COPY Data from CSV Files INTO Snowflake Table using Snowpark?
- HOW TO: Add a New Column to a Snowpark DataFrame?
- HOW TO: Drop a Column from a Snowpark DataFrame?
- HOW TO: Remove Duplicates in a Snowflake Snowpark DataFrame?
- HOW TO: Update a DataFrame in Snowflake Snowpark?
- Aggregate Functions in Snowflake Snowpark
- GROUP BY in Snowflake Snowpark
- Joins in Snowflake Snowpark
- IN Operator in Snowflake Snowpark
- Window Functions in Snowflake Snowpark
- CASE Statement in Snowflake Snowpark
- UDFs in Snowflake Snowpark