Snowflake AI_TRANSCRIBE - Transform Audio to Insights with SQL in Seconds
Tsubasa Kanno


Publish Date: Aug 10

Introduction

Snowflake's unstructured data analytics has taken another leap forward! After expanding Cortex AI capabilities for images and documents throughout 2025, we can now work with audio data directly from SQL!

The new AI_TRANSCRIBE function, released in Public Preview as part of Snowflake Cortex AISQL, transforms how we handle audio data. Customer support calls, meeting recordings, interviews - all these previously hard-to-leverage audio assets can now be transcribed with a single SQL query and combined with other AISQL functions for advanced analytics.

With support for images, documents, and now audio - the three major unstructured data formats - Snowflake has dramatically expanded the possibilities for business data analytics. Let's explore how AI_TRANSCRIBE works, its practical applications, and I'll even share a voice-enabled AI chatbot built with Streamlit in Snowflake!

Note: AI_TRANSCRIBE is currently in Public Preview, so features may undergo significant updates in the future.

Note: This article represents my personal views and not those of Snowflake.

What is AI_TRANSCRIBE?

AI_TRANSCRIBE is Snowflake Cortex AISQL's audio-to-text transcription function. Previously, leveraging audio data required external services or third-party packages, but AI_TRANSCRIBE enables direct audio transcription within SQL queries.

Key Features

  • SQL Native: Call directly from SQL like other AISQL functions for simple integration
  • Multi-language Support: Supports numerous languages including English, Spanish, French, German, Chinese, and many more
  • Speaker Identification: Distinguishes and labels multiple speakers
  • Timestamp Generation: Provides timestamps at word or speaker level
  • Secure Processing: All data processing occurs within Snowflake's secure environment

Part of the Cortex AISQL Family

AI_TRANSCRIBE becomes even more powerful when combined with existing Cortex AISQL functions:

  • AI_SENTIMENT: Analyze sentiment in transcribed audio
  • AI_CLASSIFY: Automatically categorize audio content
  • AI_COMPLETE: Summarize or answer questions about audio content
  • AI_AGG: Extract insights from grouped audio data
  • AI_EMBED: Vectorize audio transcripts for similarity search

Basic Usage

The basic syntax for AI_TRANSCRIBE is straightforward:

AI_TRANSCRIBE( <audio_file> [ , <options> ] )

Parameters

  • audio_file: A FILE value referencing the audio file. Use the TO_FILE function to create a reference to a staged file
  • options: An optional OBJECT with the following field:
    • timestamp_granularity: Specifies the timestamp granularity:
      • "word": Timestamps for each word
      • "speaker": Timestamps and speaker labels for each speaker segment

Example 1: Simple Text Transcription

The simplest use case is converting audio to text:

-- Convert audio file to text
SELECT AI_TRANSCRIBE(
    TO_FILE('@audio_stage', 'customer_call_001.wav')
);
{
  "audio_duration": 19.08,
  "text": "Hi, I'd like to inquire about the product I purchased last week. The packaging was damaged when it arrived, and I'd like to request an exchange if possible. Could you help me with this? Thank you."
}

Processing time for this 19-second audio file was approximately 2 seconds - impressively fast for analytics scenarios!
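Because the result is a JSON object, downstream code usually just needs the text field. In SQL the field should be reachable with Snowflake's semi-structured path syntax; in Python it is a plain json.loads away. A minimal sketch using the sample output above (the raw string stands in for what a driver would return):

```python
import json

# Sample AI_TRANSCRIBE result (the JSON shown above), as a raw string
raw = '''{
  "audio_duration": 19.08,
  "text": "Hi, I'd like to inquire about the product I purchased last week."
}'''

result = json.loads(raw)
duration = result["audio_duration"]  # seconds of audio
text = result["text"]
print(f"{duration}s of audio -> {len(text.split())} words")
```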

Example 2: Word-Level Timestamps

For detailed analysis, add word-level timestamps:

-- Transcribe with word-level timestamps
SELECT AI_TRANSCRIBE(
    TO_FILE('@audio_stage', 'meeting_recording.wav'),
    {'timestamp_granularity': 'word'}
);
{
  "audio_duration": 19.08,
  "segments": [
    {
      "end": 1.254,
      "start": 0.993,
      "text": "Hi"
    },
    {
      "end": 1.434,
      "start": 1.254,
      "text": "I'd"
    },
    {
      "end": 1.514,
      "start": 1.434,
      "text": "like"
    }
    // ... more segments
  ],
  "text": "Hi I'd like to inquire about the product..."
}
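Word-level segments make time-based lookups straightforward. As a pure-Python sketch over the sample segments above, here is a hypothetical helper (not a Snowflake API) that finds every word spoken inside a given time window:

```python
# Word-level segments from the sample AI_TRANSCRIBE output above
segments = [
    {"start": 0.993, "end": 1.254, "text": "Hi"},
    {"start": 1.254, "end": 1.434, "text": "I'd"},
    {"start": 1.434, "end": 1.514, "text": "like"},
]

def words_between(segments, t0, t1):
    """Return the words whose timestamps fall entirely inside [t0, t1]."""
    return [s["text"] for s in segments if s["start"] >= t0 and s["end"] <= t1]

print(words_between(segments, 1.0, 1.5))  # ["I'd"]
```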

Example 3: Speaker Identification

For meetings or interviews with multiple speakers, use speaker identification:

-- Transcribe with speaker identification
SELECT AI_TRANSCRIBE(
    TO_FILE('@audio_stage', 'interview_2025.mp3'),
    {'timestamp_granularity': 'speaker'}
);
{
  "audio_duration": 16.2,
  "segments": [
    {
      "end": 8.461,
      "speaker_label": "SPEAKER_00",
      "start": 0.511,
      "text": "Good morning, thank you for joining us today. My name is Sarah."
    },
    {
      "end": 15.153,
      "speaker_label": "SPEAKER_01",
      "start": 9.048,
      "text": "Thank you for having me. I'm John, pleased to be here."
    }
  ],
  "text": "Good morning, thank you for joining us today. My name is Sarah. Thank you for having me. I'm John, pleased to be here."
}
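Speaker-labeled segments enable per-speaker analysis, such as talk-time ratios. A pure-Python sketch over the sample interview output above:

```python
from collections import defaultdict

# Speaker-level segments from the sample AI_TRANSCRIBE output above
segments = [
    {"start": 0.511, "end": 8.461, "speaker_label": "SPEAKER_00",
     "text": "Good morning, thank you for joining us today. My name is Sarah."},
    {"start": 9.048, "end": 15.153, "speaker_label": "SPEAKER_01",
     "text": "Thank you for having me. I'm John, pleased to be here."},
]

# Accumulate speaking time and utterances per speaker
talk_time = defaultdict(float)
lines = defaultdict(list)
for seg in segments:
    talk_time[seg["speaker_label"]] += seg["end"] - seg["start"]
    lines[seg["speaker_label"]].append(seg["text"])

for speaker in sorted(talk_time):
    print(f"{speaker}: {talk_time[speaker]:.1f}s - {' '.join(lines[speaker])}")
```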

Supported Languages and Formats

Supported Languages

AI_TRANSCRIBE supports an extensive list of languages:

  • English, Spanish, French, German
  • Mandarin Chinese, Cantonese
  • Japanese, Korean
  • Arabic, Bulgarian, Catalan
  • Czech, Dutch, Greek
  • Hungarian, Indonesian, Italian
  • Latvian, Polish, Portuguese
  • Romanian, Russian, Serbian
  • Slovenian, Swedish, Thai
  • Turkish, Ukrainian

Supported Audio Formats

Major audio formats are supported:

  • MP3: Most common audio format
  • WAV: Uncompressed high-quality audio
  • FLAC: Lossless compressed audio
  • Ogg: Open-source format
  • WebM: Web standard format

Limitations and Considerations

Technical Limitations

  • Maximum file size: 700MB
  • Maximum duration (without timestamps): 120 minutes
  • Maximum duration (with timestamps): 60 minutes
  • Concurrent processing: Depends on account compute resources
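These limits are worth checking before submitting a job. Here is an illustrative pre-flight helper (the function name is hypothetical, not a Snowflake API; the thresholds mirror the limits above):

```python
MAX_FILE_MB = 700
MAX_MINUTES_PLAIN = 120       # without timestamps
MAX_MINUTES_TIMESTAMPED = 60  # with timestamp_granularity set

def check_audio_limits(size_mb, duration_min, with_timestamps=False):
    """Check a file against AI_TRANSCRIBE's documented limits.

    Returns a list of violations (an empty list means OK to submit)."""
    problems = []
    if size_mb > MAX_FILE_MB:
        problems.append(f"file is {size_mb}MB, limit is {MAX_FILE_MB}MB")
    limit = MAX_MINUTES_TIMESTAMPED if with_timestamps else MAX_MINUTES_PLAIN
    if duration_min > limit:
        problems.append(f"duration is {duration_min} min, limit is {limit} min")
    return problems

print(check_audio_limits(50, 90))                        # OK without timestamps
print(check_audio_limits(50, 90, with_timestamps=True))  # exceeds 60-minute limit
```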

Usage Considerations

  • Audio Quality Impact: Background noise or poor audio quality may reduce transcription accuracy
  • Technical Terminology: Industry-specific terms or proper nouns may not be accurately transcribed
  • Language-Specific Behavior: Some languages may have unique behaviors with word-level timestamps
  • Real-time Processing: Currently supports batch processing only, not real-time streaming

Regional Availability

AI_TRANSCRIBE is natively available in:

  • AWS US West 2 (Oregon)
  • AWS US East 1 (N. Virginia)
  • AWS EU Central 1 (Frankfurt)
  • Azure East US 2 (Virginia)

For other regions: Use cross-region inference to access AI_TRANSCRIBE functionality, though with potentially slightly higher latency.

Business Use Cases

AI_TRANSCRIBE excels in various business scenarios:

1. Customer Service Quality Enhancement

Transform call center recordings into actionable insights:

  • Sentiment Analysis: Use AI_SENTIMENT to analyze professionalism, problem resolution, and wait time perspectives
  • Call Classification: Automatically categorize calls as complaints, inquiries, or praise with AI_CLASSIFY
  • Speaker Separation: Analyze operator and customer speech separately for detailed insights
  • Real-time Dashboards: Visualize analysis results for immediate service quality improvements

2. Meeting Automation and Action Item Extraction

Transform meeting recordings into productivity tools:

  • Automatic Meeting Minutes: Instantly obtain full text from lengthy meetings
  • Summary Generation: Use AI_COMPLETE to create concise meeting summaries
  • Action Item Extraction: Automatically identify decisions and to-dos for efficient follow-up
  • Participant Analysis: Track who said what using speaker identification

3. Legal and Compliance Automation

Strengthen risk management with transcribed legal conversations:

  • Complete Documentation: Preserve all contract negotiations and legal discussions as text
  • Compliance Risk Detection: Classify conversation content by risk level using AI_CLASSIFY
  • Evidence Preservation: Accurately record who said what and when with speaker identification and timestamps
  • Automated Audit Reports: Extract key points and generate audit-ready documentation

4. Education and Training Enhancement

Maximize learning effectiveness with transcribed educational content:

  • Lecture Archives: Save course content as searchable text
  • Subtitle Creation: Add captions to video materials using word-level timestamps
  • Training Feedback Analysis: Identify improvement areas in training methodologies
  • Multilingual Support: Transcribe foreign language courses for easier review
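The subtitle idea above can be sketched in a few lines: take word-level segments in the shape AI_TRANSCRIBE returns, group them into short captions, and emit SRT. This is an illustrative helper, not part of Snowflake:

```python
def to_srt_time(seconds):
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

def segments_to_srt(segments, words_per_caption=5):
    """Group word-level segments into numbered SRT caption entries."""
    entries = []
    for i in range(0, len(segments), words_per_caption):
        chunk = segments[i:i + words_per_caption]
        idx = len(entries) + 1
        entries.append(
            f"{idx}\n"
            f"{to_srt_time(chunk[0]['start'])} --> {to_srt_time(chunk[-1]['end'])}\n"
            f"{' '.join(s['text'] for s in chunk)}\n"
        )
    return "\n".join(entries)

# Word-level segments from the earlier sample output
sample = [
    {"start": 0.993, "end": 1.254, "text": "Hi"},
    {"start": 1.254, "end": 1.434, "text": "I'd"},
    {"start": 1.434, "end": 1.514, "text": "like"},
]
print(segments_to_srt(sample, words_per_caption=2))
```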

5. Healthcare Documentation (with proper privacy controls)

Streamline medical documentation workflows:

  • Automated Clinical Notes: Generate structured medical records from doctor-patient conversations
  • EHR Integration: Extract relevant information for electronic health records
  • Multilingual Patient Care: Support international patients with transcription and translation
  • Quality Assurance: Analyze consultation content for healthcare improvement

Building a Voice-Enabled AI Chatbot with Streamlit in Snowflake

Let's build a simple voice-enabled AI chatbot using AI_TRANSCRIBE in Streamlit in Snowflake. Users can ask questions via voice, which gets transcribed and answered by AI (including the newly added OpenAI GPT-5!).

Application Overview

This application provides:

  1. Voice Recording: Record audio directly from browser and save to stage
  2. Audio Transcription: Convert to text using AI_TRANSCRIBE
  3. AI Response Generation: Generate answers using AI_COMPLETE

Prerequisites

Environment Requirements

  • Python Version: 3.11 or higher
  • Additional Packages: None required (works with standard packages only)
  • Streamlit in Snowflake: Environment to create and run applications

Regional Verification

Ensure your region supports AI_TRANSCRIBE and AI_COMPLETE functions, or enable cross-region inference.

Implementation Steps

1. Create a New Streamlit in Snowflake App

Navigate to 'Streamlit' in Snowsight's left pane and click '+ Streamlit' to create a new app.

2. Paste the Sample Code

Copy and paste the sample code below directly into the app editor. No modifications needed - stage names are automatically configured.

3. Run the Application

Click the "Run" button to launch the app. The stage will be created automatically on first run.

4. Use the Application

  1. Voice Input: Click the microphone button to speak
  2. Model Selection: Choose your preferred AI model from the sidebar
  3. Text Input: Regular chat input is also available

Sample Code

import streamlit as st
import io
import uuid
import json
from datetime import datetime
from snowflake.snowpark.context import get_active_session
from snowflake.snowpark.functions import ai_complete

# Get Snowflake session
session = get_active_session()

# Constants
STAGE_NAME = "AUDIO_TRANSCRIBE_STAGE"

# Page configuration
st.set_page_config(layout="wide")
st.title("AI Voice Chatbot")

# Sidebar: Model selection
st.sidebar.title("⚙️ Settings")

# Model options
model_options = [
    "━━━ 🟢 OpenAI ━━━",
    "openai-gpt-oss-120b",
    "openai-gpt-oss-20b",
    "openai-gpt-5",
    "openai-gpt-5-mini",
    "openai-gpt-5-nano",
    "openai-gpt-5-chat",
    "openai-gpt-4.1",
    "openai-o4-mini",
    "━━━ 🔵 Claude ━━━",
    "claude-4-opus",
    "claude-4-sonnet",
    "claude-3-7-sonnet",
    "claude-3-5-sonnet",
    "━━━ 🦙 Llama ━━━",
    "llama4-maverick",
    "llama4-scout",
    "llama3.3-70b",
    "llama3.2-3b",
    "llama3.2-1b",
    "llama3.1-405b",
    "llama3.1-70b",
    "llama3.1-8b",
    "llama3-70b",
    "llama3-8b",
    "━━━ 🟣 Mistral ━━━",
    "mistral-large2",
    "mistral-large",
    "mixtral-8x7b",
    "mistral-7b",
    "━━━ ❄️ Snowflake ━━━",
    "snowflake-arctic",
    "snowflake-llama-3.3-70b",
    "snowflake-llama-3.1-405b",
    "━━━ 🔴 Others ━━━",
    "deepseek-r1",
    "reka-core",
    "reka-flash",
    "jamba-1.5-large",
    "jamba-1.5-mini",
    "jamba-instruct",
    "gemma-7b"
]

# Default model setting
default_model = "llama4-maverick"
default_index = model_options.index(default_model) if default_model in model_options else 1

llm_model = st.sidebar.radio(
    "Select AI Model",
    options=model_options,
    index=default_index
)

# Fall back to the default model if a separator line is selected
if "━━━" in llm_model:
    llm_model = default_model

# Stage setup
@st.cache_resource
def setup_stage():
    """Setup stage for audio file storage"""
    try:
        session.sql(f"DESC STAGE {STAGE_NAME}").collect()
    except Exception:
        session.sql(f"""
            CREATE STAGE IF NOT EXISTS {STAGE_NAME}
            ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE')
            DIRECTORY = (ENABLE = TRUE)
        """).collect()

setup_stage()

# Initialize session state
if 'messages' not in st.session_state:
    st.session_state.messages = []
    st.session_state.chat_history = ""

def extract_text_from_transcript(transcript_result):
    """Extract text from AI_TRANSCRIBE result"""
    if isinstance(transcript_result, str) and transcript_result.startswith('{'):
        try:
            return json.loads(transcript_result).get('text', '')
        except json.JSONDecodeError:
            return transcript_result
    return transcript_result

def clean_ai_response(response):
    """Clean up AI response"""
    if isinstance(response, str):
        response = response.strip('"')
        response = response.replace('\\n', '\n')
    return response

def generate_ai_response(prompt, model):
    """Generate AI response"""
    df = session.range(1).select(
        ai_complete(model=model, prompt=prompt).alias("response")
    )
    return clean_ai_response(df.collect()[0]['RESPONSE'])

# Display chat history
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])

# Voice input section
st.subheader("Voice Input")
audio_value = st.audio_input("Click the microphone button to speak")

if st.button("📤 Send Voice", disabled=(audio_value is None), use_container_width=True):
    if audio_value:
        try:
            # Upload audio file
            with st.spinner("🎤 Uploading audio..."):
                audio_filename = f"audio_{datetime.now().strftime('%Y%m%d_%H%M%S')}_{uuid.uuid4().hex[:8]}.wav"
                audio_stream = io.BytesIO(audio_value.getvalue())
                session.file.put_stream(
                    audio_stream,
                    f"@{STAGE_NAME}/{audio_filename}",
                    auto_compress=False,
                    overwrite=True
                )

            # Transcribe
            with st.spinner("📝 Transcribing audio..."):
                query = f"""
                    SELECT AI_TRANSCRIBE(
                        TO_FILE('@{STAGE_NAME}/{audio_filename}')
                    ) as transcript
                """
                result = session.sql(query).collect()

            if result and len(result) > 0:
                transcribed_text = extract_text_from_transcript(result[0]['TRANSCRIPT'])

                if transcribed_text:
                    # Add user message
                    st.session_state.messages.append({"role": "user", "content": transcribed_text})
                    st.session_state.chat_history += f"User: {transcribed_text}\n"

                    # Generate AI response
                    with st.spinner("🤖 AI is generating response..."):
                        full_prompt = st.session_state.chat_history + "AI: "
                        response = generate_ai_response(full_prompt, llm_model)

                    st.session_state.messages.append({"role": "assistant", "content": response})
                    st.session_state.chat_history += f"AI: {response}\n"

                    st.rerun()
                else:
                    st.warning("Transcription failed. Please try again.")
        except Exception as e:
            st.error(f"Error occurred: {str(e)}")

# Text input
if prompt := st.chat_input("Enter your message..."):
    # Add and display user message
    st.session_state.messages.append({"role": "user", "content": prompt})
    st.session_state.chat_history += f"User: {prompt}\n"
    with st.chat_message("user"):
        st.markdown(prompt)

    # Generate and display AI response
    try:
        with st.spinner("🤖 AI is generating response..."):
            full_prompt = st.session_state.chat_history + "AI: "
            response = generate_ai_response(full_prompt, llm_model)

        st.session_state.messages.append({"role": "assistant", "content": response})
        st.session_state.chat_history += f"AI: {response}\n"
        with st.chat_message("assistant"):
            st.markdown(response)
    except Exception as e:
        st.error(f"Error occurred: {str(e)}")

# Clear chat history
if st.button("🗑️ Clear Chat History"):
    st.session_state.messages = []
    st.session_state.chat_history = ""
    st.rerun()

Application Screenshots

(Screenshots: the voice-enabled AI chatbot interface, and a chat interaction using voice input.)

Implementation Highlights

  • Simple Implementation: No additional packages required, works with standard libraries only
  • Audio Management: Store audio data in stages and process with AI_TRANSCRIBE
  • Multimodal Support: Supports both voice and text input
  • Rich Model Selection: Choose from latest models including OpenAI GPT-5

Cost Considerations

AI_TRANSCRIBE pricing follows the same token-based model as other AISQL functions:

Token Consumption and Pricing

  • 50 tokens per second of audio: Consistent across languages and timestamp granularities
  • 1 hour of audio = 3,600 seconds × 50 tokens = 180,000 tokens
  • Estimated cost: At 1.3 credits per million tokens and assuming $3 per credit, 1 hour of audio works out to about 0.234 credits, or roughly $0.70

For example, a 60-second audio file:

  • 60 seconds × 50 tokens = 3,000 tokens

Note: Audio files under 1 minute are billed as 1 minute (3,000 tokens) minimum. For processing many short audio files, consider batching them together for cost optimization.
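The billing rules above can be folded into a small estimator. The rates here are the assumptions stated in this article (50 tokens per second, a 1-minute minimum, 1.3 credits per million tokens, $3 per credit); check the Snowflake Service Consumption Table for current numbers:

```python
TOKENS_PER_SECOND = 50
MIN_BILLED_SECONDS = 60           # files under 1 minute are billed as 1 minute
CREDITS_PER_MILLION_TOKENS = 1.3  # example rate; see the consumption table
DOLLARS_PER_CREDIT = 3.0          # example contract price

def estimate_cost(duration_seconds):
    """Estimate AI_TRANSCRIBE token usage and dollar cost for one audio file."""
    billed = max(duration_seconds, MIN_BILLED_SECONDS)
    tokens = billed * TOKENS_PER_SECOND
    credits = tokens / 1_000_000 * CREDITS_PER_MILLION_TOKENS
    return tokens, credits * DOLLARS_PER_CREDIT

print(estimate_cost(19.08))  # billed as 1 minute -> 3,000 tokens
print(estimate_cost(3600))   # 1 hour -> 180,000 tokens, ~$0.70
```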

For latest pricing information, refer to the Snowflake Service Consumption Table.

Summary

AI_TRANSCRIBE represents a breakthrough function that opens the door to audio data analytics. Combined with Snowflake's enhanced support for images and documents in 2025, the addition of audio - the third major unstructured data format - truly positions Snowflake as a comprehensive multimodal data platform.

Key Benefits

  1. Unified Data Processing: Process all data types including audio within Snowflake
  2. AISQL Function Integration: Combine with sentiment analysis, classification, summarization, and vectorization
  3. Secure Environment: No external data movement required, maintaining governance
  4. Development Efficiency: Build audio analytics pipelines with just SQL, no third-party packages needed

From customer service and meeting transcription to healthcare documentation and legal compliance, AI_TRANSCRIBE unlocks valuable insights from previously untapped audio data. Start exploring how this function can transform your business analytics today!


Have you tried audio analytics in your data workflows? What use cases are you most excited about? Share your experiences in the comments below!


Promotion

Snowflake What's New Updates on X

I share Snowflake What's New updates on X. Follow for the latest insights:

English Version

Snowflake What's New Bot (English Version)

Japanese Version

Snowflake's What's New Bot (Japanese Version)

Change Log

(20250810) Initial post

Original Japanese Article

https://zenn.dev/tsubasa_tech/articles/65e96e2bd257ec
