GROUP BY in Snowflake Snowpark

Spread the love

1. Introduction

The GROUP BY clause in SQL is utilized in conjunction with the SELECT statement to aggregate data from multiple records and organize the results based on one or more columns. The GROUP BY clause returns a single row for each group.

In this article, let us explore how to implement the GROUP BY clause on rows of a DataFrame in Snowflake Snowpark.

2. GROUP BY clause in Snowpark

The DataFrame.group_by method in Snowpark is similar to the GROUP BY clause in Snowflake that helps in grouping of rows based on specified columns.

Syntax:

One or multiple columns can be passed as inputs to the group_by method as shown below.

DataFrame.group_by("col_1", "col_2",…)

A List of column names can be passed as inputs to the group_by method as shown below.

DataFrame.group_by(["col_1", "col_2",…])

Return Value:

The DataFrame.group_by method returns a RelationalGroupedDataFrame as an output. A RelationalGroupedDataFrame is a representation of an underlying DataFrame where rows are organized into groups based on common values. Aggregations can then be defined on top of this grouped data.

>>> df = df_employee.group_by("DEPT_ID")
>>> type(df)

<class 'snowflake.snowpark.relational_grouped_dataframe.RelationalGroupedDataFrame'>

Unlike a regular DataFrame, only a limited set of methods are supported on a RelationalGroupedDataFrame. To know the full list of methods supported on a RelationalGroupedDataFrame, refer to the Snowflake Documentation.

3. Demonstration of GROUP BY clause in Snowpark

Consider the EMPLOYEE data below for the demonstration of the implementation of the GROUP BY in Snowpark.

#// create dataframe with employee data
employee_data = [
    [1,'TONY',24000,101],
    [2,'STEVE',17000,101],
    [3,'BRUCE',9000,101],
    [4,'WANDA',20000,102],
    [5,'VICTOR',12000,102],
    [6,'STEPHEN',10000,103],
    [7,'HANK',15000,103],
    [8,'THOR',21000,103]
]

employee_schema = ["EMP_ID", "EMP_NAME", "SALARY", "DEPT_ID"]

df_employee =session.createDataFrame(employee_data, schema=employee_schema)
df_employee.show()

------------------------------------------------
|"EMP_ID"  |"EMP_NAME"  |"SALARY"  |"DEPT_ID"  |
------------------------------------------------
|1         |TONY        |24000     |101        |
|2         |STEVE       |17000     |101        |
|3         |BRUCE       |9000      |101        |
|4         |WANDA       |20000     |102        |
|5         |VICTOR      |12000     |102        |
|6         |STEPHEN     |10000     |103        |
|7         |HANK        |15000     |103        |
|8         |THOR        |21000     |103        |
------------------------------------------------

3.1. Find the Number of Employees in each Department

The following is the SQL Query that calculates the number of employees in each department.

SELECT DEPT_ID, COUNT(EMP_ID)
FROM EMPLOYEES
GROUP BY DEPT_ID;

The same can be achieved in Snowpark using the DataFrame.group_by method as shown below

>>> from snowflake.snowpark.functions import count
>>> df_employee.group_by("DEPT_ID").agg(count("EMP_ID")).show()

-------------------------------
|"DEPT_ID"  |"COUNT(EMP_ID)"  |
-------------------------------
|101        |3                |
|102        |2                |
|103        |3                |
-------------------------------

3.2. Find the MAX and MIN Salary of employees in each Department

The following is the SQL Query that calculates the MAX and MIN salary of employees in each department.

SELECT DEPT_ID,
       MAX(SALARY) MAX_SALARY, MIN(SALARY) MIN_SALARY
FROM EMPLOYEES
GROUP BY DEPT_ID
;

The same can be achieved in Snowpark using the DataFrame.group_by method as shown below.

>>> from snowflake.snowpark.functions import max, min
>>> df_employee.group_by("DEPT_ID").agg(max("SALARY"), min("SALARY")).show()

---------------------------------------------
|"DEPT_ID"  |"MAX(SALARY)"  |"MIN(SALARY)"  |
---------------------------------------------
|101        |24000          |9000           |
|102        |20000          |12000          |
|103        |21000          |10000          |
---------------------------------------------

Note that we are employing Aggregate Functions using the DataFrame.agg method in conjunction with the DataFrame.group_by method to achieve the solution.

Adding aliases to the aggregate fields using Column.alias method for returning a renamed column name.

>>> df_employee.group_by("DEPT_ID").agg(max("SALARY").alias("MAX_SALARY"), min("SALARY").alias("MIN_SALARY")).show()

-------------------------------------------
|"DEPT_ID"  |"MAX_SALARY"  |"MIN_SALARY"  |
-------------------------------------------
|101        |24000         |9000          |
|102        |20000         |12000         |
|103        |21000         |10000         |
-------------------------------------------

4. HAVING Clause in Snowflake Snowpark

The HAVING clause in SQL is used in conjunction with the GROUP BY clause to filter the results of a query based on aggregated values. Unlike the WHERE clause, which filters individual rows before they are grouped, the HAVING clause filters the result set after the grouping and aggregation process.

In Snowpark, there is no equivalent method that provides the functionality of the HAVING clause in SQL. Instead, we can use DataFrame.filter method which filters rows of a DataFrame based on the specified conditional expression (similar to WHERE in SQL).

Let us understand with an example.

4.1. Find the Departments with more than two employees

The following is the SQL Query that returns the departments with more than two employees.

SELECT DEPT_ID
FROM EMPLOYEES
GROUP BY DEPT_ID
HAVING COUNT(EMP_ID)>2;

Follow the below steps to return the departments with more than two employees in Snowpark.

STEP-1: Find the Number of Employees in each Department using DataFrame.group_by and DataFrame.agg methods as shown below.

>>> df1 = df_employee.group_by("DEPT_ID").agg(count("EMP_ID").alias("EMP_COUNT"))
>>> df1.show()
---------------------------
|"DEPT_ID"  |"EMP_COUNT"  |
---------------------------
|101        |3            |
|102        |2            |
|103        |3            |
---------------------------

STEP-2: Filter the records with employee count >2 using the DataFrame.filter method as shown below.

>>> df2 = df1.filter(col("EMP_COUNT") > 2)
>>> df2.show()

---------------------------
|"DEPT_ID"  |"EMP_COUNT"  |
---------------------------
|101        |3            |
|103        |3            |
---------------------------

STEP-3: Select only the Department ID field using the DataFrame.select method as shown below.

>>> df3 = df2.select("DEPT_ID")
>>> df3.show()

-------------
|"DEPT_ID"  |
-------------
|101        |
|103        |
-------------

All these steps can be combined into a single command, as shown below.

>>> df_employee.group_by("DEPT_ID").agg(count("EMP_ID").alias("EMP_COUNT")).\
    filter(col("EMP_COUNT")>2).select("DEPT_ID").show()

-------------
|"DEPT_ID"  |
-------------
|101        |
|103        |
-------------

This is equivalent to the SQL query below, where an outer query is employed to filter the departments using the WHERE clause as shown below.

SELECT DEPT_ID FROM(
   SELECT DEPT_ID, COUNT(EMP_ID) EMP_COUNT
   FROM EMPLOYEES
   GROUP BY DEPT_ID)
WHERE EMP_COUNT>2;

Subscribe to our Newsletter !!

Related Articles:

Leave a Comment

Related Posts