AI_PARSE_DOCUMENT: Read Documents using SQL in Snowflake

Spread the love

1. Introduction

Working with unstructured data like PDFs, Word documents, or images has always been a challenge for data teams. Traditionally, reading such files required complex preprocessing, external tools, or third-party services.

With Snowflake’s AI_PARSE_DOCUMENT function, you can now extract content directly from documents stored in your stage using nothing more than a simple SQL query.

In this article, let us explore AI_PARSE_DOCUMENT, which has made it easier than ever to work with documents natively within your Snowflake data cloud.

2. AI_PARSE_DOCUMENT in Snowflake

The AI_PARSE_DOCUMENT function allows you to read and extract content from files such as PDFs, Word documents, or images stored in your Snowflake stage. It uses Snowflake Cortex AI capabilities to interpret and return text in structured JSON format, including layout information, page details, and content extraction options.

Syntax:

AI_PARSE_DOCUMENT( FILE, [ <options> ] )

2.1. Creating FILE Object using TO_FILE

A FILE object represents metadata of a file stored in an internal or external stage. It does not store the actual file’s data but only a reference to it. A FILE object can be constructed using the TO_FILE function.

The inputs required to construct the FILE object using the TO_FILE function are:

  • Snowflake Stage Name
  • Relative path to the document in the Snowflake stage.

The following example creates a FILE object that references the specific file stored in your stage location.

-- Creating FILE object for a specific stage file 
SELECT TO_FILE('@STG_DOCS', 'The_Ultimate_Downhill_Bike.pdf') ;

Output:

{
  "CONTENT_TYPE": "application/pdf",
  "ETAG": "81a1b412ae3d3e0204db50d319f338f9",
  "LAST_MODIFIED": "Sat, 11 Oct 2025 16:01:16 GMT",
  "RELATIVE_PATH": "The_Ultimate_Downhill_Bike.pdf",
  "SIZE": 60380,
  "STAGE": "@CC_QUICKSTART_CORTEX_SEARCH_DOCS.DATA.STG_DOCS"
}

Using the Directory table, the FILE objects can be created on all the files in a stage location.

-- Creating FILE objects for all stage files 
SELECT TO_FILE('@STG_DOCS', RELATIVE_PATH) FROM DIRECTORY(@STG_DOCS);

Output:

Creating FILE objects for all stage files using RELATIVE_PATH
Creating FILE objects for all stage files using RELATIVE_PATH

Alternatively, we can also use the file URLs directly to generate the FILE object as shown below.

SELECT TO_FILE(FILE_URL) FROM DIRECTORY(@STG_DOCS);

Output:

Creating FILE objects for all stage files using FILE_URL
Creating FILE objects for all stage files using FILE_URL

2.2. Options for Parsing Documents using AI_PARSE_DOCUMENT

The options for parsing documents using AI_PARSE_DOCUMENT are entered as an OBJECT value, which is a semi-structured data type used to store key-value pairs.

The supported options are:

  • mode: Specifies the parsing mode. The supported values for mode are:
    • OCR : The function extracts text only and is the default mode.
    • LAYOUT : The function extracts layout as well as text, including structural content such as tables.
  • page_split: The function splits the document into pages and processes each page separately. The supported values are TRUE and FALSE. This feature supports only PDF, PowerPoint (.pptx), and Word (.docx) documents.
  • page_filter: Allows reading specified ranges of pages from a multi-page document. The value consists of an array with start and end fields that specify the first (inclusive) and last (exclusive) page in the range.

3. Extracting Content from a Document

The following SQL statement extracts the full document content using the AI_PARSE_DOCUMENT function. The ‘mode’: ‘LAYOUT’ option preserves the document’s structure and layout.

-- Extracting content from a document 
SELECT AI_PARSE_DOCUMENT (
    TO_FILE('@STG_DOCS', 'The_Ultimate_Downhill_Bike.pdf'), {'mode': 'LAYOUT'}
    ) AS DOCUMENT_TEXT;

Output:

{
  "content": "# The Ultimate Downhill Bike \"Rincon del Cielo\": \n\nDownhill biking, also known as downhill mountain biking (DH), is an exhilarating and challenging sport that requires the right equipment and skills.
  .... 
  ....
  Sometimes, they even enjoy creating quickstarts and running Virtual Hands On Lab!",
  "metadata": {
    "pageCount": 4
  }
}

4. Extracting Content from a Document Page-Wise

The following SQL statement extracts the full document content page-wise using the AI_PARSE_DOCUMENT function. The ‘page_split’: true option returns an array of JSON objects, each containing text and layout details for individual pages.

-- Extracting content from a document page-wise
SELECT AI_PARSE_DOCUMENT (
    TO_FILE('@STG_DOCS', 'The_Ultimate_Downhill_Bike.pdf'), {'mode': 'LAYOUT', 'page_split': true}
    ) AS DOCUMENT_TEXT;

Output:

{
  "metadata": {
    "pageCount": 4
  },
  "pages": [
    {
      "content": "# The Ultimate Downhill Bike \"Rincon del Cielo\": \n\nDownhill biking, also known as downhill mountain biking (DH), is an exhilarating and challenging sport that requires the right equipment and skills. ... Selecting the Right Downhill Bike:",
      "index": 0
    },
    {
      "content": "Choosing the right downhill bike involves considering several factors, including your riding style, budget, and terrain preferences. ... Rinse thoroughly and dry with a clean towel to prevent corrosion.",
      "index": 1
    },
    {
      "content": "- Suspension Setup: Regularly inspect and adjust the sag, rebound, and compression settings on your suspension forks and shocks according to your weight, riding style, and terrain. ... Walk challenging sections to assess the best line and determine your approach.",
      "index": 2
    },
    {
      "content": "- Ride Within Your Limits: Know your skill level and ride within your limits to avoid crashes and injuries. Progress gradually and practice essential techniques such as braking, cornering, and body positioning to improve your skills over time. ... Sometimes, they even enjoy creating quickstarts and running Virtual Hands On Lab!",
      "index": 3
    }
  ]
}

5. Extracting Content from Certain Pages of a Document

The following SQL statement extracts content from specific page ranges using the AI_PARSE_DOCUMENT function. The ‘page_filter’: [{‘start’: 0, ‘end’: 2}] option extracts the content from the first two pages of the document.

-- Extracting content from certain pages of a document 
SELECT AI_PARSE_DOCUMENT (
    TO_FILE('@STG_DOCS', 'The_Ultimate_Downhill_Bike.pdf'), {'mode': 'LAYOUT', 'page_filter': [{'start': 0, 'end': 2}]}
    ) AS DOCUMENT_TEXT;

Output:

{
  "metadata": {
    "pageCount": 4
  },
  "pages": [
    {
      "content": "# The Ultimate Downhill Bike \"Rincon del Cielo\": \n\nDownhill biking, also known as downhill mountain biking (DH), is an exhilarating and challenging sport that requires the right equipment and skills. ... Selecting the Right Downhill Bike:",
      "index": 0
    },
    {
      "content": "Choosing the right downhill bike involves considering several factors, including your riding style, budget, and terrain preferences. ... Rinse thoroughly and dry with a clean towel to prevent corrosion.",
      "index": 1
    }
  ]
}

6. Extracting Text from a Document

The following SQL statement extracts only the textual content (without layout or page-level metadata) from the document.

-- Extracting only text from a document
SELECT TO_VARCHAR(
    AI_PARSE_DOCUMENT( TO_FILE('@STG_DOCS', 'The_Ultimate_Downhill_Bike.pdf'), {'mode': 'LAYOUT'} ):content
    ) AS DOCUMENT_TEXT;

Output:

# The Ultimate Downhill Bike "Rincon del Cielo": 
Downhill biking, also known as downhill mountain biking (DH), is an exhilarating and challenging sport that requires the right equipment and skills. 
...
This guide provides specific information for Rincon del Cielo downhill bike.

## 1. Understanding Downhill Bike Components:
Before diving into selecting a downhill bike, it's essential to understand the various components that make up a downhill bike:
...

2. Selecting the Right Downhill Bike:
Choosing the right downhill bike involves considering several factors, including your riding style, budget, and terrain preferences. Here's what to keep in mind:
...

# 3. Maintaining Your Downhill Bike: 
Proper maintenance is essential for keeping your downhill bike in top condition and ensuring reliable performance on the trails. Here are some maintenance tips:
...

# 4. Safety Considerations: 
Downhill biking can be a physically demanding and potentially dangerous sport, so it's essential to prioritize safety at all times. Here are some safety tips to keep in mind:
...

Testing:
- The Ultimate Downhill Bike Rincon del Cielo bike has been tested by the specialized riders, Dash, Julian and Carlos. They are famous now only for his skills with snow sports but also his expertise riding bikes. Sometimes, they even enjoy creating quickstarts and running Virtual Hands On Lab! 

7. Extracting Content from All Documents in a Stage Location

The following SQL statement extracts the full document content from all documents in the stage location using the AI_PARSE_DOCUMENT function. This approach allows you to bulk extract text from multiple documents in a single query.

-- Extracting content from all documents in stage location
SELECT 
    RELATIVE_PATH,
    AI_PARSE_DOCUMENT (
        TO_FILE('@STG_DOCS', RELATIVE_PATH), {'mode': 'LAYOUT'}
    ) AS DOCUMENT_TEXT
FROM DIRECTORY(@STG_DOCS)
;

Output:

Extracting content from all documents in stage location
Extracting content from all documents in stage location

Alternatively, FILE_URLs can also be used to extract content from all documents in the stage location.

SELECT 
    RELATIVE_PATH,
    AI_PARSE_DOCUMENT (
        TO_FILE(FILE_URL), {'mode': 'LAYOUT'}
    ) AS DOCUMENT_TEXT
FROM DIRECTORY(@STG_DOCS)
;

8. Conclusion

Snowflake’s AI_PARSE_DOCUMENT eliminates the complexity of working with unstructured document data. With just a few lines of SQL, you can read, extract, and analyze text from documents with no external tools or integrations required.

Whether you’re building document search systems, automating data extraction, or creating knowledge bases, AI_PARSE_DOCUMENT makes document intelligence directly accessible within Snowflake.

Subscribe to our Newsletter !!

Related Articles:

Leave a Comment

Related Posts