Snowflake AI_TRANSCRIBE - Transform Audio to Insights with SQL in Seconds
Tsubasa Kanno


Publish Date: Aug 10

Introduction

Snowflake's unstructured data analytics has taken another leap forward! After expanding Cortex AI capabilities for images and documents throughout 2025, we can now work with audio data directly from SQL!

The new AI_TRANSCRIBE function, released in Public Preview as part of Snowflake Cortex AISQL, transforms how we handle audio data. Customer support calls, meeting recordings, interviews - all these previously hard-to-leverage audio assets can now be transcribed with a single SQL query and combined with other AISQL functions for advanced analytics.

With support for images, documents, and now audio - the three major unstructured data formats - Snowflake has dramatically expanded the possibilities for business data analytics. Let's explore how AI_TRANSCRIBE works, its practical applications, and I'll even share a voice-enabled AI chatbot built with Streamlit in Snowflake!

Note: AI_TRANSCRIBE is currently in Public Preview, so features may undergo significant updates in the future.

Note: This article represents my personal views and not those of Snowflake.

What is AI_TRANSCRIBE?

AI_TRANSCRIBE is Snowflake Cortex AISQL's audio-to-text transcription function. Previously, leveraging audio data required external services or third-party packages, but AI_TRANSCRIBE enables direct audio transcription within SQL queries.

Key Features

  • SQL Native: Call directly from SQL like other AISQL functions for simple integration
  • Multi-language Support: Supports numerous languages including English, Spanish, French, German, Chinese, and many more
  • Speaker Identification: Distinguishes and labels multiple speakers
  • Timestamp Generation: Provides timestamps at word or speaker level
  • Secure Processing: All data processing occurs within Snowflake's secure environment

Part of the Cortex AISQL Family

AI_TRANSCRIBE becomes even more powerful when combined with existing Cortex AISQL functions:

  • AI_SENTIMENT: Analyze sentiment in transcribed audio
  • AI_CLASSIFY: Automatically categorize audio content
  • AI_COMPLETE: Summarize or answer questions about audio content
  • AI_AGG: Extract insights from grouped audio data
  • AI_EMBED: Vectorize audio transcripts for similarity search

Basic Usage

The basic syntax for AI_TRANSCRIBE is straightforward:

AI_TRANSCRIBE( <audio_file> [ , <options> ] )

Parameters

  • audio_file: A FILE value referencing the audio file. Use the TO_FILE function to create a reference to a staged file
  • options: An optional OBJECT with the following field:
    • timestamp_granularity: Specifies the timestamp granularity:
      • "word": Timestamps for each word
      • "speaker": Timestamps and speaker labels for each speaker segment

Example 1: Simple Text Transcription

The simplest use case is converting audio to text:

-- Convert audio file to text
SELECT AI_TRANSCRIBE(
    TO_FILE('@audio_stage', 'customer_call_001.wav')
);
{
  "audio_duration": 19.08,
  "text": "Hi, I'd like to inquire about the product I purchased last week. The packaging was damaged when it arrived, and I'd like to request an exchange if possible. Could you help me with this? Thank you."
}

Processing time for this 19-second audio file was approximately 2 seconds - impressively fast for analytics scenarios!
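Because the result is a JSON object, downstream code usually just needs the text field. In SQL the field should be reachable with Snowflake's semi-structured path syntax; in Python it is a plain json.loads away. A minimal sketch using the sample output above (the raw string stands in for what a driver would return):

```python
import json

# Sample AI_TRANSCRIBE result (the JSON shown above), as a raw string
raw = '''{
  "audio_duration": 19.08,
  "text": "Hi, I'd like to inquire about the product I purchased last week."
}'''

result = json.loads(raw)
duration = result["audio_duration"]  # seconds of audio
text = result["text"]
print(f"{duration}s of audio -> {len(text.split())} words")
```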

Example 2: Word-Level Timestamps

For detailed analysis, add word-level timestamps:

-- Transcribe with word-level timestamps
SELECT AI_TRANSCRIBE(
    TO_FILE('@audio_stage', 'meeting_recording.wav'),
    {'timestamp_granularity': 'word'}
);
{
  "audio_duration": 19.08,
  "segments": [
    {
      "end": 1.254,
      "start": 0.993,
      "text": "Hi"
    },
    {
      "end": 1.434,
      "start": 1.254,
      "text": "I'd"
    },
    {
      "end": 1.514,
      "start": 1.434,
      "text": "like"
    }
    // ... more segments
  ],
  "text": "Hi I'd like to inquire about the product..."
}
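Word-level segments make time-based lookups straightforward. As a pure-Python sketch over the sample segments above, here is a hypothetical helper (not a Snowflake API) that finds every word spoken inside a given time window:

```python
# Word-level segments from the sample AI_TRANSCRIBE output above
segments = [
    {"start": 0.993, "end": 1.254, "text": "Hi"},
    {"start": 1.254, "end": 1.434, "text": "I'd"},
    {"start": 1.434, "end": 1.514, "text": "like"},
]

def words_between(segments, t0, t1):
    """Return the words whose timestamps fall entirely inside [t0, t1]."""
    return [s["text"] for s in segments if s["start"] >= t0 and s["end"] <= t1]

print(words_between(segments, 1.0, 1.5))  # ["I'd"]
```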

Example 3: Speaker Identification

For meetings or interviews with multiple speakers, use speaker identification:

-- Transcribe with speaker identification
SELECT AI_TRANSCRIBE(
    TO_FILE('@audio_stage', 'interview_2025.mp3'),
    {'timestamp_granularity': 'speaker'}
);
{
  "audio_duration": 16.2,
  "segments": [
    {
      "end": 8.461,
      "speaker_label": "SPEAKER_00",
      "start": 0.511,
      "text": "Good morning, thank you for joining us today. My name is Sarah."
    },
    {
      "end": 15.153,
      "speaker_label": "SPEAKER_01",
      "start": 9.048,
      "text": "Thank you for having me. I'm John, pleased to be here."
    }
  ],
  "text": "Good morning, thank you for joining us today. My name is Sarah. Thank you for having me. I'm John, pleased to be here."
}
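Speaker-labeled segments enable per-speaker analysis, such as talk-time ratios. A pure-Python sketch over the sample interview output above:

```python
from collections import defaultdict

# Speaker-level segments from the sample AI_TRANSCRIBE output above
segments = [
    {"start": 0.511, "end": 8.461, "speaker_label": "SPEAKER_00",
     "text": "Good morning, thank you for joining us today. My name is Sarah."},
    {"start": 9.048, "end": 15.153, "speaker_label": "SPEAKER_01",
     "text": "Thank you for having me. I'm John, pleased to be here."},
]

# Accumulate speaking time and utterances per speaker
talk_time = defaultdict(float)
lines = defaultdict(list)
for seg in segments:
    talk_time[seg["speaker_label"]] += seg["end"] - seg["start"]
    lines[seg["speaker_label"]].append(seg["text"])

for speaker in sorted(talk_time):
    print(f"{speaker}: {talk_time[speaker]:.1f}s - {' '.join(lines[speaker])}")
```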

Supported Languages and Formats

Supported Languages

AI_TRANSCRIBE supports an extensive list of languages:

  • English, Spanish, French, German
  • Mandarin Chinese, Cantonese
  • Japanese, Korean
  • Arabic, Bulgarian, Catalan
  • Czech, Dutch, Greek
  • Hungarian, Indonesian, Italian
  • Latvian, Polish, Portuguese
  • Romanian, Russian, Serbian
  • Slovenian, Swedish, Thai
  • Turkish, Ukrainian

Supported Audio Formats

Major audio formats are supported:

  • MP3: Most common audio format
  • WAV: Uncompressed high-quality audio
  • FLAC: Lossless compressed audio
  • Ogg: Open-source format
  • WebM: Web standard format

Limitations and Considerations

Technical Limitations

  • Maximum file size: 700MB
  • Maximum duration (without timestamps): 120 minutes
  • Maximum duration (with timestamps): 60 minutes
  • Concurrent processing: Depends on account compute resources
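These limits are worth checking before submitting a job. Here is an illustrative pre-flight helper (the function name is hypothetical, not a Snowflake API; the thresholds mirror the limits above):

```python
MAX_FILE_MB = 700
MAX_MINUTES_PLAIN = 120       # without timestamps
MAX_MINUTES_TIMESTAMPED = 60  # with timestamp_granularity set

def check_audio_limits(size_mb, duration_min, with_timestamps=False):
    """Check a file against AI_TRANSCRIBE's documented limits.

    Returns a list of violations (an empty list means OK to submit)."""
    problems = []
    if size_mb > MAX_FILE_MB:
        problems.append(f"file is {size_mb}MB, limit is {MAX_FILE_MB}MB")
    limit = MAX_MINUTES_TIMESTAMPED if with_timestamps else MAX_MINUTES_PLAIN
    if duration_min > limit:
        problems.append(f"duration is {duration_min} min, limit is {limit} min")
    return problems

print(check_audio_limits(50, 90))                        # OK without timestamps
print(check_audio_limits(50, 90, with_timestamps=True))  # exceeds 60-minute limit
```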

Usage Considerations

  • Audio Quality Impact: Background noise or poor audio quality may reduce transcription accuracy
  • Technical Terminology: Industry-specific terms or proper nouns may not be accurately transcribed
  • Language-Specific Behavior: Some languages may have unique behaviors with word-level timestamps
  • Real-time Processing: Currently supports batch processing only, not real-time streaming

Regional Availability

AI_TRANSCRIBE is natively available in:

  • AWS US West 2 (Oregon)
  • AWS US East 1 (N. Virginia)
  • AWS EU Central 1 (Frankfurt)
  • Azure East US 2 (Virginia)

For other regions: Use cross-region inference to access AI_TRANSCRIBE functionality, though with potentially slightly higher latency.

Business Use Cases

AI_TRANSCRIBE excels in various business scenarios:

1. Customer Service Quality Enhancement

Transform call center recordings into actionable insights:

  • Sentiment Analysis: Use AI_SENTIMENT to analyze professionalism, problem resolution, and wait time perspectives
  • Call Classification: Automatically categorize calls as complaints, inquiries, or praise with AI_CLASSIFY
  • Speaker Separation: Analyze operator and customer speech separately for detailed insights
  • Real-time Dashboards: Visualize analysis results for immediate service quality improvements

2. Meeting Automation and Action Item Extraction

Transform meeting recordings into productivity tools:

  • Automatic Meeting Minutes: Instantly obtain full text from lengthy meetings
  • Summary Generation: Use AI_COMPLETE to create concise meeting summaries
  • Action Item Extraction: Automatically identify decisions and to-dos for efficient follow-up
  • Participant Analysis: Track who said what using speaker identification

3. Legal and Compliance Automation

Strengthen risk management with transcribed legal conversations:

  • Complete Documentation: Preserve all contract negotiations and legal discussions as text
  • Compliance Risk Detection: Classify conversation content by risk level using AI_CLASSIFY
  • Evidence Preservation: Accurately record who said what and when with speaker identification and timestamps
  • Automated Audit Reports: Extract key points and generate audit-ready documentation

4. Education and Training Enhancement

Maximize learning effectiveness with transcribed educational content:

  • Lecture Archives: Save course content as searchable text
  • Subtitle Creation: Add captions to video materials using word-level timestamps
  • Training Feedback Analysis: Identify improvement areas in training methodologies
  • Multilingual Support: Transcribe foreign language courses for easier review
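The subtitle idea above can be sketched in a few lines: take word-level segments in the shape AI_TRANSCRIBE returns, group them into short captions, and emit SRT. This is an illustrative helper, not part of Snowflake:

```python
def to_srt_time(seconds):
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

def segments_to_srt(segments, words_per_caption=5):
    """Group word-level segments into numbered SRT caption entries."""
    entries = []
    for i in range(0, len(segments), words_per_caption):
        chunk = segments[i:i + words_per_caption]
        idx = len(entries) + 1
        entries.append(
            f"{idx}\n"
            f"{to_srt_time(chunk[0]['start'])} --> {to_srt_time(chunk[-1]['end'])}\n"
            f"{' '.join(s['text'] for s in chunk)}\n"
        )
    return "\n".join(entries)

# Word-level segments from the earlier sample output
sample = [
    {"start": 0.993, "end": 1.254, "text": "Hi"},
    {"start": 1.254, "end": 1.434, "text": "I'd"},
    {"start": 1.434, "end": 1.514, "text": "like"},
]
print(segments_to_srt(sample, words_per_caption=2))
```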

5. Healthcare Documentation (with proper privacy controls)

Streamline medical documentation workflows:

  • Automated Clinical Notes: Generate structured medical records from doctor-patient conversations
  • EHR Integration: Extract relevant information for electronic health records
  • Multilingual Patient Care: Support international patients with transcription and translation
  • Quality Assurance: Analyze consultation content for healthcare improvement

Building a Voice-Enabled AI Chatbot with Streamlit in Snowflake

Let's build a simple voice-enabled AI chatbot using AI_TRANSCRIBE in Streamlit in Snowflake. Users can ask questions via voice, which gets transcribed and answered by AI (including the newly added OpenAI GPT-5!).

Application Overview

This application provides:

  1. Voice Recording: Record audio directly from browser and save to stage
  2. Audio Transcription: Convert to text using AI_TRANSCRIBE
  3. AI Response Generation: Generate answers using AI_COMPLETE

Prerequisites

Environment Requirements

  • Python Version: 3.11 or higher
  • Additional Packages: None required (works with standard packages only)
  • Streamlit in Snowflake: Environment to create and run applications

Regional Verification

Ensure your region supports AI_TRANSCRIBE and AI_COMPLETE functions, or enable cross-region inference.

Implementation Steps

1. Create a New Streamlit in Snowflake App

Navigate to 'Streamlit' in Snowsight's left pane and click '+ Streamlit' to create a new app.

2. Paste the Sample Code

Copy and paste the sample code below directly into the app editor. No modifications needed - stage names are automatically configured.

3. Run the Application

Click the "Run" button to launch the app. The stage will be created automatically on first run.

4. Use the Application

  1. Voice Input: Click the microphone button to speak
  2. Model Selection: Choose your preferred AI model from the sidebar
  3. Text Input: Regular chat input is also available

Sample Code

import streamlit as st
import io
import uuid
import json
from datetime import datetime
from snowflake.snowpark.context import get_active_session
from snowflake.snowpark.functions import ai_complete

# Get Snowflake session
session = get_active_session()

# Constants
STAGE_NAME = "AUDIO_TRANSCRIBE_STAGE"

# Page configuration
st.set_page_config(layout="wide")
st.title("AI Voice Chatbot")

# Sidebar: Model selection
st.sidebar.title("⚙️ Settings")

# Model options
model_options = [
    "━━━ 🟢 OpenAI ━━━",
    "openai-gpt-oss-120b",
    "openai-gpt-oss-20b",
    "openai-gpt-5",
    "openai-gpt-5-mini",
    "openai-gpt-5-nano",
    "openai-gpt-5-chat",
    "openai-gpt-4.1",
    "openai-o4-mini",
    "━━━ 🔵 Claude ━━━",
    "claude-4-opus",
    "claude-4-sonnet",
    "claude-3-7-sonnet",
    "claude-3-5-sonnet",
    "━━━ 🦙 Llama ━━━",
    "llama4-maverick",
    "llama4-scout",
    "llama3.3-70b",
    "llama3.2-3b",
    "llama3.2-1b",
    "llama3.1-405b",
    "llama3.1-70b",
    "llama3.1-8b",
    "llama3-70b",
    "llama3-8b",
    "━━━ 🟣 Mistral ━━━",
    "mistral-large2",
    "mistral-large",
    "mixtral-8x7b",
    "mistral-7b",
    "━━━ ❄️ Snowflake ━━━",
    "snowflake-arctic",
    "snowflake-llama-3.3-70b",
    "snowflake-llama-3.1-405b",
    "━━━ 🔴 Others ━━━",
    "deepseek-r1",
    "reka-core",
    "reka-flash",
    "jamba-1.5-large",
    "jamba-1.5-mini",
    "jamba-instruct",
    "gemma-7b"
]

# Default model setting
default_model = "llama4-maverick"
default_index = model_options.index(default_model) if default_model in model_options else 1

llm_model = st.sidebar.radio(
    "Select AI Model",
    options=model_options,
    index=default_index
)

# Fall back to the default model if a separator line is selected
if "━━━" in llm_model:
    llm_model = default_model

# Stage setup
@st.cache_resource
def setup_stage():
    """Setup stage for audio file storage"""
    try:
        session.sql(f"DESC STAGE {STAGE_NAME}").collect()
    except Exception:
        session.sql(f"""
            CREATE STAGE IF NOT EXISTS {STAGE_NAME}
            ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE')
            DIRECTORY = (ENABLE = TRUE)
        """).collect()

setup_stage()

# Initialize session state
if 'messages' not in st.session_state:
    st.session_state.messages = []
    st.session_state.chat_history = ""

def extract_text_from_transcript(transcript_result):
    """Extract text from AI_TRANSCRIBE result"""
    if isinstance(transcript_result, str) and transcript_result.startswith('{'):
        try:
            return json.loads(transcript_result).get('text', '')
        except json.JSONDecodeError:
            return transcript_result
    return transcript_result

def clean_ai_response(response):
    """Clean up AI response"""
    if isinstance(response, str):
        response = response.strip('"')
        response = response.replace('\\n', '\n')
    return response

def generate_ai_response(prompt, model):
    """Generate AI response"""
    df = session.range(1).select(
        ai_complete(model=model, prompt=prompt).alias("response")
    )
    return clean_ai_response(df.collect()[0]['RESPONSE'])

# Display chat history
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])

# Voice input section
st.subheader("Voice Input")
audio_value = st.audio_input("Click the microphone button to speak")

if st.button("📤 Send Voice", disabled=(audio_value is None), use_container_width=True):
    if audio_value:
        try:
            # Upload audio file
            with st.spinner("🎤 Uploading audio..."):
                audio_filename = f"audio_{datetime.now().strftime('%Y%m%d_%H%M%S')}_{uuid.uuid4().hex[:8]}.wav"
                audio_stream = io.BytesIO(audio_value.getvalue())
                session.file.put_stream(
                    audio_stream,
                    f"@{STAGE_NAME}/{audio_filename}",
                    auto_compress=False,
                    overwrite=True
                )

            # Transcribe
            with st.spinner("📝 Transcribing audio..."):
                query = f"""
                    SELECT AI_TRANSCRIBE(
                        TO_FILE('@{STAGE_NAME}/{audio_filename}')
                    ) as transcript
                """
                result = session.sql(query).collect()

            if result and len(result) > 0:
                transcribed_text = extract_text_from_transcript(result[0]['TRANSCRIPT'])

                if transcribed_text:
                    # Add user message
                    st.session_state.messages.append({"role": "user", "content": transcribed_text})
                    st.session_state.chat_history += f"User: {transcribed_text}\n"

                    # Generate AI response
                    with st.spinner("🤖 AI is generating response..."):
                        full_prompt = st.session_state.chat_history + "AI: "
                        response = generate_ai_response(full_prompt, llm_model)

                    st.session_state.messages.append({"role": "assistant", "content": response})
                    st.session_state.chat_history += f"AI: {response}\n"

                    st.rerun()
                else:
                    st.warning("Transcription failed. Please try again.")
        except Exception as e:
            st.error(f"Error occurred: {str(e)}")

# Text input
if prompt := st.chat_input("Enter your message..."):
    # Add and display user message
    st.session_state.messages.append({"role": "user", "content": prompt})
    st.session_state.chat_history += f"User: {prompt}\n"
    with st.chat_message("user"):
        st.markdown(prompt)

    # Generate and display AI response
    try:
        with st.spinner("🤖 AI is generating response..."):
            full_prompt = st.session_state.chat_history + "AI: "
            response = generate_ai_response(full_prompt, llm_model)

        st.session_state.messages.append({"role": "assistant", "content": response})
        st.session_state.chat_history += f"AI: {response}\n"
        with st.chat_message("assistant"):
            st.markdown(response)
    except Exception as e:
        st.error(f"Error occurred: {str(e)}")

# Clear chat history
if st.button("🗑️ Clear Chat History"):
    st.session_state.messages = []
    st.session_state.chat_history = ""
    st.rerun()

Application Screenshots

(Screenshots: the voice-enabled AI chatbot interface, and a chat interaction using voice input.)

Implementation Highlights

  • Simple Implementation: No additional packages required, works with standard libraries only
  • Audio Management: Store audio data in stages and process with AI_TRANSCRIBE
  • Multimodal Support: Supports both voice and text input
  • Rich Model Selection: Choose from latest models including OpenAI GPT-5

Cost Considerations

AI_TRANSCRIBE pricing follows the same token-based model as other AISQL functions:

Token Consumption and Pricing

  • 50 tokens per second of audio: Consistent across languages and timestamp granularities
  • 1 hour of audio = 3,600 seconds × 50 tokens = 180,000 tokens
  • Estimated cost: At 1.3 credits per million tokens and assuming $3 per credit, 1 hour of audio works out to about 0.234 credits, or roughly $0.70

For example, a 60-second audio file:

  • 60 seconds × 50 tokens = 3,000 tokens

Note: Audio files under 1 minute are billed as 1 minute (3,000 tokens) minimum. For processing many short audio files, consider batching them together for cost optimization.
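The billing rules above can be folded into a small estimator. The rates here are the assumptions stated in this article (50 tokens per second, a 1-minute minimum, 1.3 credits per million tokens, $3 per credit); check the Snowflake Service Consumption Table for current numbers:

```python
TOKENS_PER_SECOND = 50
MIN_BILLED_SECONDS = 60           # files under 1 minute are billed as 1 minute
CREDITS_PER_MILLION_TOKENS = 1.3  # example rate; see the consumption table
DOLLARS_PER_CREDIT = 3.0          # example contract price

def estimate_cost(duration_seconds):
    """Estimate AI_TRANSCRIBE token usage and dollar cost for one audio file."""
    billed = max(duration_seconds, MIN_BILLED_SECONDS)
    tokens = billed * TOKENS_PER_SECOND
    credits = tokens / 1_000_000 * CREDITS_PER_MILLION_TOKENS
    return tokens, credits * DOLLARS_PER_CREDIT

print(estimate_cost(19.08))  # billed as 1 minute -> 3,000 tokens
print(estimate_cost(3600))   # 1 hour -> 180,000 tokens, ~$0.70
```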

For latest pricing information, refer to the Snowflake Service Consumption Table.

Summary

AI_TRANSCRIBE represents a breakthrough function that opens the door to audio data analytics. Combined with Snowflake's enhanced support for images and documents in 2025, the addition of audio - the third major unstructured data format - truly positions Snowflake as a comprehensive multimodal data platform.

Key Benefits

  1. Unified Data Processing: Process all data types including audio within Snowflake
  2. AISQL Function Integration: Combine with sentiment analysis, classification, summarization, and vectorization
  3. Secure Environment: No external data movement required, maintaining governance
  4. Development Efficiency: Build audio analytics pipelines with just SQL, no third-party packages needed

From customer service and meeting transcription to healthcare documentation and legal compliance, AI_TRANSCRIBE unlocks valuable insights from previously untapped audio data. Start exploring how this function can transform your business analytics today!


Have you tried audio analytics in your data workflows? What use cases are you most excited about? Share your experiences in the comments below!


Promotion

Snowflake What's New Updates on X

I share Snowflake What's New updates on X. Follow for the latest insights:

English Version

Snowflake What's New Bot (English Version)

Japanese Version

Snowflake's What's New Bot (Japanese Version)

Change Log

(20250810) Initial post

Original Japanese Article

https://zenn.dev/tsubasa_tech/articles/65e96e2bd257ec
