1. Introduction
Snowpark is a developer framework from Snowflake that lets you interact with Snowflake directly and build complex data pipelines. In our previous article, we discussed what Snowflake Snowpark is and how to set up a Python development environment for Snowpark.
In Snowpark, the primary method for querying and processing data is through a DataFrame. In this article, we will explore what DataFrames are and guide you through the process of creating them in Snowpark.
2. What is a DataFrame?
A DataFrame in Snowpark acts like a virtual table that organizes data in a structured manner. Think of it as a way to express a SQL query, but in a different language. It operates lazily, meaning it doesn’t process the data until you instruct it to perform a specific task, such as retrieving or analyzing information.
The Snowpark API ultimately converts these DataFrame operations into SQL and executes that SQL in Snowflake.
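As a plain-Python analogy (not Snowpark code), lazy evaluation behaves like a generator: building the pipeline does no work, and computation happens only when results are requested. A minimal sketch:

```python
# Plain-Python analogy for lazy evaluation (not Snowpark code):
# describing the work computes nothing until results are requested.
def squares(nums):
    # Returns a generator; no squares are computed yet.
    return (n * n for n in nums)

pipeline = squares([1, 2, 3])   # pipeline described, nothing executed
result = list(pipeline)         # computation happens only here
print(result)                   # [1, 4, 9]
```

A Snowpark DataFrame behaves analogously: defining it builds a query plan, and only an action such as show() or collect() sends SQL to Snowflake.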
3. Prerequisites to create a DataFrame in Snowpark
To construct a DataFrame, you use the Session class in Snowpark, which establishes a connection with a Snowflake database and provides methods for creating DataFrames and accessing objects.
When you create a Session object, you provide connection parameters to establish the connection with a Snowflake database, as shown below.
import snowflake.snowpark as snowpark
from snowflake.snowpark import Session
connection_parameters = {
"account": "snowflake account",
"user": "snowflake username",
"password": "snowflake password",
"role": "snowflake role", # optional
"warehouse": "snowflake warehouse", # optional
"database": "snowflake database", # optional
"schema": "snowflake schema" # optional
}
session = Session.builder.configs(connection_parameters).create()
To create DataFrames in a Snowsight Python worksheet, construct them within the handler function (main) and utilize the Session object (session) passed into the function.
def main(session: snowpark.Session):
# your code goes here
4. How to create a DataFrame in Snowpark?
The createDataFrame method of the Session class in Snowpark creates a new DataFrame containing the specified values from local data.
Syntax:
The following is the syntax to create a DataFrame using the createDataFrame method.
session.createDataFrame(data[, schema])
The accepted values for data in the createDataFrame method are a List, a Tuple, or a Pandas DataFrame.
- Lists are used to store multiple items in a single variable and are created using square brackets.
  ex: myList = ["one", "two", "three"]
- Tuples are used to store multiple items in a single variable and are created using round brackets. The contents of a tuple cannot be changed once it has been created in Python.
  ex: myTuple = ("one", "two", "three")
- Pandas is a Python library used for working with data sets. Pandas allows the creation of DataFrames natively in Python.
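As a quick plain-Python reference, the three accepted data forms can be constructed as follows (the variable names here are illustrative):

```python
import pandas as pd

# List: created with square brackets, mutable
my_list = ["one", "two", "three"]

# Tuple: created with round brackets, immutable once built
my_tuple = ("one", "two", "three")

# Pandas DataFrame: tabular data with named columns
my_pdf = pd.DataFrame({"a": [1, 2, 3]})
```

Any of these could then be passed as the data argument to createDataFrame.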
The schema in the createDataFrame method can be a StructType containing names and data types of columns, or just a list of column names, or None.
5. How to Read data from a Snowpark DataFrame?
Data from a Snowpark DataFrame can be retrieved using the show method.
Syntax:
The following is the syntax to read data from a Snowpark DataFrame using the show method.
DataFrame.show([n, max_width])
Parameters:
- n – The number of rows to print. The default value is 10.
- max_width – The maximum number of characters to print out for each column.
6. How to create a DataFrame in Snowpark with a List of Specified Values?
Example-1:
The following is an example of creating a DataFrame with a list of values and assigning the column name as “a”.
df1 = session.createDataFrame([1,2,3,4], schema=["a"])
df1.show()
------
|"A" |
------
|1 |
|2 |
|3 |
|4 |
------
When df1 is executed, the Snowpark API translates it into the following SQL and runs it in Snowflake.
SELECT "A" FROM (
SELECT $1 AS "A"
FROM VALUES (1::INT), (2::INT), (3::INT), (4::INT)
) LIMIT 10
Example-2:
The following is an example of creating a DataFrame with multiple lists of values and assigning the column names "a", "b", "c", and "d".
df2 = session.createDataFrame([[1,2,3,4],[5,6,7,8]], schema=["a","b","c","d"])
df2.show()
--------------------------
|"A" |"B" |"C" |"D" |
--------------------------
|1 |2 |3 |4 |
|5 |6 |7 |8 |
--------------------------
When df2 is executed, the Snowpark API translates it into the following SQL and runs it in Snowflake.
SELECT "A", "B", "C", "D" FROM (
SELECT $1 AS "A", $2 AS "B", $3 AS "C", $4 AS "D"
FROM VALUES
(1::INT, 2::INT, 3::INT, 4::INT),
(5::INT, 6::INT, 7::INT, 8::INT)
) LIMIT 10
7. How to create a DataFrame in Snowpark with a List of Specified Values and Schema?
When the schema parameter in the createDataFrame method is passed as a list of column names or as None, the schema of the DataFrame is inferred from the data across all rows.
Example-3:
The following is an example of creating a DataFrame with multiple lists of values of different data types and assigning the column names "a", "b", "c", and "d".
df3 = session.createDataFrame([[1, 2, 'Snow', '2024-01-01'],[3, 4, 'Park', '2024-01-02']], schema=["a","b","c","d"])
df3.show()
----------------------------------
|"A" |"B" |"C" |"D" |
----------------------------------
|1 |2 |Snow |2024-01-01 |
|3 |4 |Park |2024-01-02 |
----------------------------------
When df3 is executed, the Snowpark API translates it into the following SQL and runs it in Snowflake.
SELECT "A", "B", "C", "D" FROM (
SELECT $1 AS "A", $2 AS "B", $3 AS "C", $4 AS "D"
FROM VALUES
(1::INT, 2::INT, 'Snow'::STRING, '2024-01-01'::STRING),
(3::INT, 4::INT, 'Park'::STRING, '2024-01-02'::STRING)
) LIMIT 10
Note that in the above query, because we did not explicitly specify the column data types, the values '2024-01-01' and '2024-01-02', although they represent dates, are treated as the String data type.
Example-4:
Create a custom schema parameter of StructType containing names and data types of columns and pass it to the createDataFrame method as shown below.
# create a DataFrame with a custom schema
from snowflake.snowpark.types import IntegerType, StringType, StructField, StructType, DateType
my_schema = StructType(
[StructField("a", IntegerType()),
StructField("b", IntegerType()),
StructField("c", StringType()),
StructField("d", DateType())]
)
df4 = session.createDataFrame([[1, 2, 'Snow', '2024-01-01'],[3, 4, 'Park', '2024-01-02']], schema=my_schema)
df4.show()
------------------------------------------
|"A" |"B" |"C" |"D" |
------------------------------------------
|1 |2 |Snow |2024-01-01 00:00:00 |
|3 |4 |Park |2024-01-02 00:00:00 |
------------------------------------------
When df4 is executed, the Snowpark API translates it into the following SQL, referencing the columns with the defined data types.
SELECT
"A", "B", "C",
to_date("D") AS "D"
FROM (
SELECT $1 AS "A", $2 AS "B", $3 AS "C", $4 AS "D"
FROM VALUES
(1::INT, 2::INT, 'Snow'::STRING, '2024-01-01'::STRING),
(3::INT, 4::INT, 'Park'::STRING, '2024-01-02'::STRING)
) LIMIT 10
Note that in the above query, the column "D" is cast to the Date data type in Snowflake.
8. How to create a DataFrame in Snowpark using Pandas?
A Pandas DataFrame can be passed as “data” to create a DataFrame in Snowpark.
Example-5:
The following is an example of creating a Snowpark DataFrame from a Pandas DataFrame.
import pandas as pd
df_pandas = session.createDataFrame(pd.DataFrame([1,2,3],columns=["a"]))
df_pandas.show()
------
|"a" |
------
|1 |
|2 |
|3 |
------
Unlike DataFrames created from Lists or Tuples with the createDataFrame method, when a DataFrame is created from a Pandas DataFrame, the Snowpark API creates a temporary table and imports the data from the Pandas DataFrame into it. When data is then extracted from the Snowpark DataFrame, it is retrieved by querying that temporary table.
When df_pandas is executed, the Snowpark API translates it into the following SQL and runs it in Snowflake.
SELECT * FROM "SNOWPARK_DEMO_DB"."SNOWPARK_DEMO_SCHEMA"."SNOWPARK_TEMP_TABLE_9RSV8KITUO" LIMIT 10
9. How to create a DataFrame in Snowpark from a range of numbers?
A DataFrame from a range of numbers can be created using the range method of the Session class in Snowpark. The resulting DataFrame has a single column named "ID" containing elements in a range from start to end.
Syntax:
The following is the syntax to create a DataFrame using the range method.
session.range(start[, end, step])
Parameters:
- start : The start value of the range. If end is not specified, start will be used as the value of end.
- end : The end value of the range.
- step : The step or interval between numbers.
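The start, end, and step semantics mirror Python's built-in range, with end exclusive. As a plain-Python sketch of the values the resulting ID column would contain:

```python
# session.range(1, 10) produces IDs 1 through 9 (end is exclusive)
ids = list(range(1, 10))
assert ids == [1, 2, 3, 4, 5, 6, 7, 8, 9]

# session.range(1, 10, 2) produces only the odd values 1, 3, 5, 7, 9
ids_stepped = list(range(1, 10, 2))
assert ids_stepped == [1, 3, 5, 7, 9]
```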
Example-6:
The following is an example of creating a DataFrame with a range of numbers from 1 to 9.
df_range = session.range(1,10).to_df("a")
df_range.show()
-------
|"A" |
-------
|1 |
|2 |
|3 |
|4 |
|5 |
|6 |
|7 |
|8 |
|9 |
-------
When df_range is executed, the Snowpark API translates it into the following SQL and runs it in Snowflake.
SELECT * FROM (
SELECT ( ROW_NUMBER() OVER ( ORDER BY SEQ8() ) - 1 ) * (1) + (1) AS id
FROM ( TABLE (GENERATOR(ROWCOUNT => 9)))
) LIMIT 10
Example-7:
The following is an example of creating a DataFrame with a range of numbers from 1 to 9 with a step value of 2, with the output column renamed to "A".
df_range2 = session.range(1,10,2).to_df("a")
df_range2.show()
-------
|"A" |
-------
|1 |
|3 |
|5 |
|7 |
|9 |
-------
When df_range2 is executed, the Snowpark API translates it into the following SQL and runs it in Snowflake.
SELECT "ID" AS "A" FROM (
SELECT ( ROW_NUMBER() OVER ( ORDER BY SEQ8() ) - 1 ) * (2) + (1) AS id
FROM ( TABLE (GENERATOR(ROWCOUNT => 5)))
) LIMIT 10
10. How to create a DataFrame in Snowpark from a Database Table?
The sql and table methods of the Session class in Snowpark can be used to create a DataFrame from a database table.
Example-8:
The following is an example of creating a DataFrame from a database table by executing a SQL query using the sql method of the Session class in Snowpark.
df_sql = session.sql("SELECT * FROM SNOWPARK_DEMO_DB.SNOWPARK_DEMO_SCHEMA.MONTHLY_REVENUE")
df_sql.show(5)
----------------------------------
|"YEAR" |"MONTH" |"REVENUE" |
----------------------------------
|2012 |5 |3264300.11 |
|2012 |6 |3208482.33 |
|2012 |7 |3311966.98 |
|2012 |8 |3311752.81 |
|2012 |9 |3208563.06 |
----------------------------------
When df_sql is executed, the Snowpark API translates it into the following SQL and runs it in Snowflake.
SELECT * FROM (SELECT * FROM SNOWPARK_DEMO_DB.SNOWPARK_DEMO_SCHEMA.MONTHLY_REVENUE) LIMIT 5
Example-9:
The following is an example of creating a DataFrame from a database table using the table method of the Session class in Snowpark.
df_table = session.table("MONTHLY_REVENUE")
df_table.show(5)
----------------------------------
|"YEAR" |"MONTH" |"REVENUE" |
----------------------------------
|2012 |5 |3264300.11 |
|2012 |6 |3208482.33 |
|2012 |7 |3311966.98 |
|2012 |8 |3311752.81 |
|2012 |9 |3208563.06 |
----------------------------------
When df_table is executed, the Snowpark API translates it into the following SQL and runs it in Snowflake.
SELECT * FROM MONTHLY_REVENUE LIMIT 5
11. How to create a DataFrame in Snowpark by reading files from a stage?
DataFrameReader class in Snowpark provides methods for loading data from a Snowflake stage to a DataFrame with format-specific options. To use it:
- Create a DataFrameReader object through the Session.read method.
- For CSV file format, create a custom schema parameter of StructType containing names and data types of columns.
- Set the file format specific properties such as delimiter using options() method.
- Specify the file path and stage details by calling the method corresponding to the CSV format, csv().
Example-10:
The following is an example of creating a DataFrame in Snowpark by reading CSV files from S3 stage.
from snowflake.snowpark.types import IntegerType, StringType, StructField, StructType
schema = StructType(
[StructField("EMPLOYEE_ID", IntegerType()),
StructField("FIRST_NAME", StringType()),
StructField("LAST_NAME", StringType()),
StructField("EMAIL", StringType())
])
df_s3_employee = session.read.schema(schema).options({"field_delimiter": ",", "skip_header": 1}).csv('@my_s3_stage/Inbox/')
df_s3_employee.show(5)
--------------------------------------------------------------
| EMPLOYEE_ID | FIRST_NAME | LAST_NAME | EMAIL |
--------------------------------------------------------------
| 204384 | Steven | King | SKING@test.com |
| 204388 | Neena | Kochhar | NKOCHHAR@test.com |
| 204392 | Lex | De Haan | LDEHAAN@test.com |
| 204393 | Alexander | Hunold | AHUNOLD@test.com |
| 204394 | Bruce | Ernst | BERNST@test.com |
--------------------------------------------------------------
When df_s3_employee is executed, the Snowpark API translates it into SQL and runs it in Snowflake as follows.
- A temporary file format is created using the properties specified in the options() method.
- The stage files are queried using the file format created in the first step, and the columns are cast to the data types specified in the defined schema.
- The file format created in the first step is dropped.
--create a temporary file format
CREATE SCOPED TEMPORARY FILE FORMAT If NOT EXISTS "SNOWPARK_DEMO_DB"."SNOWPARK_DEMO_SCHEMA".SNOWPARK_TEMP_FILE_FORMAT_Y00K7HK598
TYPE = CSV FIELD_DELIMITER = ',' SKIP_HEADER = 1
--select data from stage files using the temporary file format
SELECT * FROM (
SELECT
$1::INT AS "EMPLOYEE_ID",
$2::STRING AS "FIRST_NAME",
$3::STRING AS "LAST_NAME",
$4::STRING AS "EMAIL"
FROM @my_s3_stage/Inbox/( FILE_FORMAT => '"SNOWPARK_DEMO_DB"."SNOWPARK_DEMO_SCHEMA".SNOWPARK_TEMP_FILE_FORMAT_Y00K7HK598')
) LIMIT 5
--drop the temporary file format
DROP FILE FORMAT If EXISTS "SNOWPARK_DEMO_DB"."SNOWPARK_DEMO_SCHEMA".SNOWPARK_TEMP_FILE_FORMAT_Y00K7HK598