HOW TO: Load and Query JSON data in Snowflake?

Spread the love

Introduction

Snowflake supports loading semi structured data like JSON in database tables. Once the data is loaded from stage into a database table, Snowflake has an excellent functionality for directly querying semi-structured data along with flattening it into a columnar structure.

In this article, let us discuss the entire process of loading and parsing JSON data in Snowflake.

Local Machine → Snowflake Internal Stage → Database table → Querying and Flattening the JSON data

For the demonstration, we will consider below JSON data containing novel’s information of three authors.

[
  {
    "AuthorName": "Author-1",
    "Category": [{
        "CategoryName": "Fiction",
        "Genre": [{
            "GenreName": "Action and Adventure",
            "Novel": [{
                "Novel": "Novel-1",
                "Sales": "200"
              },
              {
                "Novel": "Novel-2",
                "Sales": "850"
              }
            ]
          },
          {
            "GenreName": "Romance",
            "Novel": [{
              "Novel": "Novel-3",
              "Sales": "400"
            }]
          }
        ]
      },
      {
        "CategoryName": "NonFiction",
        "Genre": [{
          "GenreName": "Autobiography",
          "Novel": [{
            "Novel": "Novel-4",
            "Sales": "900"
          }]
        }]
      }
    ]
  },
  {
    "AuthorName": "Author-2",
    "Category": [{
        "CategoryName": "Fiction",
        "Genre": [{
            "GenreName": "Action and Adventure",
            "Novel": [{
              "Novel": "Novel-5",
              "Sales": "340"
            }]
          },
          {
            "GenreName": "Crime",
            "Novel": [{
                "Novel": "Novel-6",
                "Sales": "940"
              },
              {
                "Novel": "Novel-7",
                "Sales": "540"

              }
            ]
          }
        ]
      }
    ]
  },
  {
    "AuthorName": "Author-3",
    "Category": [{
        "CategoryName": "Fiction",
        "Genre": [{
            "GenreName": "Romance",
            "Novel": [{
                "Novel": "Novel-8",
                "Sales": "820"
              },
              {
                "Novel": "Novel-9",
                "Sales": "620"
              }
            ]
          },
          {
            "GenreName": "Thriller",
            "Novel": [{
              "Novel": "Novel-10",
              "Sales": "770"
            }]
          }
        ]
      },
      {
        "CategoryName": "NonFiction",
        "Genre": [{
            "GenreName": "History",
            "Novel": [{
              "Novel": "Novel-11",
              "Sales": "450"
            }]
          },
          {
            "GenreName": "Travel",
            "Novel": [{
              "Novel": "Novel-12",
              "Sales": "150"
            }]
          }
        ]
      }
    ]
  }
]

Our goal is to load the above JSON data into Snowflake and flatten it to achieve below columnar format of the data.

AuthorCategoryGenreNovelSales
Author1FictionAction and AdventureNovel-1200
Author1FictionAction and AdventureNovel-2850
Author1FictionRomanceNovel-3400
Author1Non-FictionAutobiographyNovel-4900
Author2FictionAction and AdventureNovel-5340
Author2FictionCrimeNovel-6940
Author2FictionCrimeNovel-7540
Author3FictionRomanceNovel-8820
Author3FictionRomanceNovel-9620
Author3FictionThrillerNovel-10770
Author3Non-FictionHistoryNovel-11450
Author3Non-FictionTravelNovel-12150

Loading JSON data from local machine into Snowflake Internal Stage

Before we load the data, we must choose the database and schema where the data is staged and later loaded into table. For the demonstration we will be using MY_DB database and MY_DB.MY_SCHEMA schema.

USE DATABASE my_db;
USE SCHEMA my_schema;

Step-1: Create a Snowflake Internal named Stage

Create an Internal Named Stage in Snowflake to hold the data loaded from local machine as shown below.

CREATE OR REPLACE STAGE my_internal_stage;

Step-2: Load data from local machine into Internal Stage using SnowSQL

To load data from our local machine into the Snowflake Internal stage, we have to use the Snowflake’s CLI tool which is SnowSQL.

To know more about this, read our detailed article on Snowflake SnowSQL.

1. Login to Snowflake SnowSQL

2. Using PUT command, copy the files from the local folder into snowflake internal stage created in earlier step.

put file://C:\SourceFiles\authors.json @my_internal_stage;
Loading JSON file from local machine into Snowflake Internal Stage
Loading JSON file from local machine into Snowflake Internal Stage

Step-3: Verify the file in Internal Stage using List command

Use List command in Snowflake worksheet to verify the file in internal stage loaded from SnowSQL as shown below.

List @my_internal_stage;
Listing files in Snowflake Internal Stage
Listing files in Snowflake Internal Stage

Loading JSON data from Snowflake internal Stage into database table

Step-4: Create a JSON file format in Snowflake

Create a Snowflake File format of type JSON which encapsulates information of data files which helps in processing the files from stage.

CREATE OR REPLACE FILE FORMAT my_json_format
    type = json
    strip_outer_array = true
;

We have used an option called STRIP_OUTER_ARRAY for this load. It helps in removing the outer set of square brackets [ ] when loading the data, separating the initial array into multiple lines. Else the entire JSON data gets loaded into single record instead of multiple records.

Step-5: Create database table to load JSON data

Create a table to load JSON data from Snowflake internal stage as shown below. Snowflake stores semi-structured data using the VARIANT field type in tables.

CREATE OR REPLACE TABLE Authors (
  JSON_DATA VARIANT
);

Note that the data of each author in the JSON file will be loaded into a single column JSON_DATA of type VARIANT in the table named Authors.

Step-6: Load data from Internal Stage into database table using COPY command

COPY command in Snowflake helps in loading data from staged files to an existing table. Load the data from a JSON file in internal stage to Authors database table using the MY_JSON_FORMAT file format as shown below

COPY INTO Authors
FROM @my_internal_stage/authors.json
FILE_FORMAT = (format_name = MY_JSON_FORMAT);
Copying files from Snowflake internal stage into database table
Copying files from Snowflake internal stage into database table

Querying JSON data from Snowflake database table

The data from the Authors table can be queried directly as shown below. By using the STRIP_OUTER_ARRAY option, we were able remove this initial array [] and treat each object in the array as a row in Snowflake. Hence each author object loaded as a separate row.

SELECT * FROM Authors;
Querying data from database table with JSON data
Querying data from database table with JSON data

The individual elements in the column JSON_DATA can be queried using standard : notation as shown below.

SELECT 
    JSON_DATA:AuthorName,
    JSON_DATA:Category
FROM Authors;
Querying individual JSON elements from database table
Querying individual JSON elements from database table

The data in the Category can be further drilled down and required elements information can be fetched as shown below.

SELECT 
    JSON_DATA:AuthorName,
    JSON_DATA:Category[0]:CategoryName,
    JSON_DATA:Category[1]:CategoryName
FROM Authors;
Querying individual JSON elements using index from database table
Querying individual JSON elements using index from database table

The novels of the authors are categorized into two different types in the JSON data. The details of each category can be accessed using the index [0], [1].. notation as shown above.

The outer quotes in the column data can be removed by using :: notation which lets you define the end data type of the values being retrieved. Notice how in this example, the outer quotes “ are removed.

SELECT 
    JSON_DATA:AuthorName::string,
    JSON_DATA:Category[0]:CategoryName::string,
    JSON_DATA:Category[1]:CategoryName::string
FROM Authors;
Casting the datatype of fields while querying JSON data
Casting the datatype of fields while querying JSON data

Further more details of author can be drilled down as shown below.

SELECT 
    JSON_DATA:AuthorName::string,
    JSON_DATA:Category[0]:CategoryName::string,
    JSON_DATA:Category[0]:Genre[0]:GenreName::string,
    JSON_DATA:Category[0]:Genre[1]:GenreName::string,
    JSON_DATA:Category[1]:CategoryName::string,
    JSON_DATA:Category[1]:Genre[0]:GenreName::string,
    JSON_DATA:Category[1]:Genre[1]:GenreName::string
FROM Authors;
Querying details of each category from JSON data
Querying details of each category from JSON data

Unfortunately, this approach is not ideal. As the data increases, you need to add additional levels of category and genre in the query statement specifying the index values. Using the : and [] notation alone is not sufficient to dynamically get every object in an array.

Flattening Arrays in JSON data

Flattening is a process of unpacking the semi-structured data into a columnar format by converting arrays into different rows of data.

Using the LATERAL FLATTEN function we can explode arrays into individual JSON objects. The input for the function is the array in the JSON structure that we want to flatten (In the example shown below, the array is Category). The flattened output is stored in a VALUE column. The individual elements from unpacked array can be accessed through the VALUE column as shown below.

SELECT 
    JSON_DATA:AuthorName::string AS Author,
    VALUE:CategoryName::string AS CategoryName
FROM Authors
,LATERAL FLATTEN (input => JSON_DATA:Category);
Flattening the Category array from JSON data
Flattening the Category array from JSON data

When there are multiple arrays which you need to flatten, it is mandatory to pass an alias to every input array. The VALUE column also should be used along with the alias you passed to the input array.

In our example, we need to flatten the Category, Genre and Novel arrays to get the desired output. Also note that the Novel array is present inside Genre array which is present inside Category array. So the flattened array output VALUE becomes input for the array present inside it.

The final query to get the desired output is as below.

SELECT 
    JSON_DATA:AuthorName::string AS  Author_Name,
    Flatten_Category.VALUE:CategoryName::string AS  Category_Name,
    Flatten_Genre.VALUE:GenreName::string AS Genre_Name,
    Flatten_Novel.VALUE:Novel::string AS Novel_Name,
    Flatten_Novel.VALUE:Sales:: number AS Sales_in_Millions
FROM Authors
,LATERAL FLATTEN (input => JSON_DATA:Category) AS Flatten_Category
,LATERAL FLATTEN (input => Flatten_Category.VALUE:Genre) AS Flatten_Genre
,LATERAL FLATTEN (input => Flatten_Genre.VALUE:Novel) AS Flatten_Novel
;
Final output after flattening the Category, Genre and Novel arrays in JSON data
Final output after flattening the Category, Genre and Novel arrays in JSON data

Summary

Snowflake supports loading semi structured data files from external and internal stages into database tables. Once the data is loaded into the table, it is important to understand the data structure and identify the arrays to flatten which provides the required output. The transformed data can then be loaded into another database tables with proper field names and data types easily. 

This is a good example of Snowflake’s ELT features which is extremely helpful as the semi structured data can be easily transformed once loaded without the help of external ETL tools.

Subscribe to our Newsletter !!

Related Articles:

  • Snowflake SnowSQL: Command Line Tool to access Snowflake

    A definitive guide on how to download, install, configure and use Snowflake SnowSQL Command Line tool

    READ MORE

  • Types of Snowflake Stages: Data Loading and Unloading Features

    A complete guide to types of Snowflake Stages and how to load data into and unload data from Snowflake tables using stages.

    READ MORE

  • HOW TO: Create External Stages in Snowflake

    Learn how to create external stages on AWS S3, Microsoft Azure and Google Cloud platforms in Snowflake.

    READ MORE

Leave a Comment

Related Posts