Introduction
Snowflake's unstructured data analytics has taken another leap forward! After expanding Cortex AI capabilities for images and documents throughout 2025, we can now work with audio data directly from SQL!
The new AI_TRANSCRIBE function, released in Public Preview as part of Snowflake Cortex AISQL, transforms how we handle audio data. Customer support calls, meeting recordings, interviews - all these previously hard-to-leverage audio assets can now be transcribed with a single SQL query and combined with other AISQL functions for advanced analytics.
With support for images, documents, and now audio - the three major unstructured data formats - Snowflake has dramatically expanded the possibilities for business data analytics. Let's explore how AI_TRANSCRIBE works, its practical applications, and I'll even share a voice-enabled AI chatbot built with Streamlit in Snowflake!
Note: AI_TRANSCRIBE is currently in Public Preview, so features may undergo significant updates in the future.
Note: This article represents my personal views and not those of Snowflake.
What is AI_TRANSCRIBE?
AI_TRANSCRIBE is Snowflake Cortex AISQL's audio-to-text transcription function. Previously, leveraging audio data required external services or third-party packages, but AI_TRANSCRIBE enables direct audio transcription within SQL queries.
Key Features
- SQL Native: Call directly from SQL like other AISQL functions for simple integration
- Multi-language Support: Supports numerous languages including English, Spanish, French, German, Chinese, and many more
- Speaker Identification: Distinguishes and labels multiple speakers
- Timestamp Generation: Provides timestamps at word or speaker level
- Secure Processing: All data processing occurs within Snowflake's secure environment
Part of the Cortex AISQL Family
AI_TRANSCRIBE becomes even more powerful when combined with existing Cortex AISQL functions:
- AI_SENTIMENT: Analyze sentiment in transcribed audio
- AI_CLASSIFY: Automatically categorize audio content
- AI_COMPLETE: Summarize or answer questions about audio content
- AI_AGG: Extract insights from grouped audio data
- AI_EMBED: Vectorize audio transcripts for similarity search
Basic Usage
The basic syntax for AI_TRANSCRIBE is straightforward:
AI_TRANSCRIBE( <audio_file> [ , <options> ] )
Parameters
- audio_file: A FILE type object representing the audio file. Use the TO_FILE function to create a reference to a staged file.
- options: An optional OBJECT with the following field:
  - timestamp_granularity: Specifies the timestamp granularity
    - "word": Timestamps for each word
    - "speaker": Timestamps and labels for each speaker
Example 1: Simple Text Transcription
The simplest use case is converting audio to text:
-- Convert audio file to text
SELECT AI_TRANSCRIBE(
TO_FILE('@audio_stage', 'customer_call_001.wav')
);
{
"audio_duration": 19.08,
"text": "Hi, I'd like to inquire about the product I purchased last week. The packaging was damaged when it arrived, and I'd like to request an exchange if possible. Could you help me with this? Thank you."
}
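In client code you often want just the `text` field from this JSON result. A minimal Python sketch, assuming the result arrives either as a JSON string or an already-parsed dict (the `extract_transcript_text` helper is illustrative, not part of the API):

```python
import json

def extract_transcript_text(result):
    """Return the plain transcript text from an AI_TRANSCRIBE result,
    which may arrive as a JSON string or an already-parsed dict."""
    if isinstance(result, str):
        result = json.loads(result)
    return result.get("text", "")

# Example using a result shaped like the output above
raw = '{"audio_duration": 19.08, "text": "Hi, I would like to inquire about the product."}'
print(extract_transcript_text(raw))  # → Hi, I would like to inquire about the product.
```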
Processing time for this 19-second audio file was approximately 2 seconds - impressively fast for analytics scenarios!
Example 2: Word-Level Timestamps
For detailed analysis, add word-level timestamps:
-- Transcribe with word-level timestamps
SELECT AI_TRANSCRIBE(
TO_FILE('@audio_stage', 'meeting_recording.wav'),
{'timestamp_granularity': 'word'}
);
{
"audio_duration": 19.08,
"segments": [
{
"end": 1.254,
"start": 0.993,
"text": "Hi"
},
{
"end": 1.434,
"start": 1.254,
"text": "I'd"
},
{
"end": 1.514,
"start": 1.434,
"text": "like"
}
// ... more segments
],
"text": "Hi I'd like to inquire about the product..."
}
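Word-level segments like these map naturally onto subtitle formats. A rough Python sketch that groups segments into SRT caption blocks (the `to_srt` helper and the three-words-per-caption grouping are illustrative choices, not part of the API):

```python
def to_srt(segments, words_per_caption=3):
    """Group word-level AI_TRANSCRIBE segments into SRT caption blocks."""
    def fmt(t):
        # SRT timestamps look like HH:MM:SS,mmm
        ms = int(round(t * 1000))
        h, rem = divmod(ms, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02}:{m:02}:{s:02},{ms:03}"

    blocks = []
    for i in range(0, len(segments), words_per_caption):
        chunk = segments[i:i + words_per_caption]
        text = " ".join(w["text"] for w in chunk)
        blocks.append(
            f"{len(blocks) + 1}\n{fmt(chunk[0]['start'])} --> {fmt(chunk[-1]['end'])}\n{text}"
        )
    return "\n\n".join(blocks)

# The first three segments from the output above
segments = [
    {"start": 0.993, "end": 1.254, "text": "Hi"},
    {"start": 1.254, "end": 1.434, "text": "I'd"},
    {"start": 1.434, "end": 1.514, "text": "like"},
]
print(to_srt(segments))
```

This prints one caption block spanning `00:00:00,993 --> 00:00:01,514`.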
Example 3: Speaker Identification
For meetings or interviews with multiple speakers, use speaker identification:
-- Transcribe with speaker identification
SELECT AI_TRANSCRIBE(
TO_FILE('@audio_stage', 'interview_2025.mp3'),
{'timestamp_granularity': 'speaker'}
);
{
"audio_duration": 16.2,
"segments": [
{
"end": 8.461,
"speaker_label": "SPEAKER_00",
"start": 0.511,
"text": "Good morning, thank you for joining us today. My name is Sarah."
},
{
"end": 15.153,
"speaker_label": "SPEAKER_01",
"start": 9.048,
"text": "Thank you for having me. I'm John, pleased to be here."
}
],
"text": "Good morning, thank you for joining us today. My name is Sarah. Thank you for having me. I'm John, pleased to be here."
}
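Speaker-level segments also enable simple conversation analytics, such as how long each participant spoke. A small Python sketch (the `speaker_talk_time` helper is illustrative):

```python
def speaker_talk_time(segments):
    """Total spoken seconds per speaker label from speaker-level segments."""
    totals = {}
    for seg in segments:
        label = seg["speaker_label"]
        totals[label] = round(totals.get(label, 0.0) + (seg["end"] - seg["start"]), 3)
    return totals

# The two segments from the output above
segments = [
    {"start": 0.511, "end": 8.461, "speaker_label": "SPEAKER_00"},
    {"start": 9.048, "end": 15.153, "speaker_label": "SPEAKER_01"},
]
print(speaker_talk_time(segments))
```

Here SPEAKER_00 spoke for about 7.95 seconds and SPEAKER_01 for about 6.1 seconds.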
Supported Languages and Formats
Supported Languages
AI_TRANSCRIBE supports an extensive list of languages:
- English, Spanish, French, German
- Mandarin Chinese, Cantonese
- Japanese, Korean
- Arabic, Bulgarian, Catalan
- Czech, Dutch, Greek
- Hungarian, Indonesian, Italian
- Latvian, Polish, Portuguese
- Romanian, Russian, Serbian
- Slovenian, Swedish, Thai
- Turkish, Ukrainian
Supported Audio Formats
Major audio formats are supported:
- MP3: Most common audio format
- WAV: Uncompressed high-quality audio
- FLAC: Lossless compressed audio
- Ogg: Open-source format
- WebM: Web standard format
Limitations and Considerations
Technical Limitations
| Limitation | Details |
|---|---|
| Maximum File Size | 700MB |
| Maximum Duration (without timestamps) | 120 minutes |
| Maximum Duration (with timestamps) | 60 minutes |
| Concurrent Processing | Depends on account compute resources |
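If you ingest arbitrary audio, it can help to validate files against these limits before calling AI_TRANSCRIBE. A minimal Python sketch, where the helper and constants simply mirror the table above and are hypothetical, not a Snowflake API:

```python
# Limits from the table above (assumed constants, not a Snowflake API)
MAX_FILE_MB = 700
MAX_MINUTES_NO_TIMESTAMPS = 120
MAX_MINUTES_WITH_TIMESTAMPS = 60

def check_audio_limits(size_mb, duration_min, with_timestamps=False):
    """Return a list of limit violations for a candidate audio file."""
    problems = []
    if size_mb > MAX_FILE_MB:
        problems.append(f"file exceeds {MAX_FILE_MB}MB")
    limit = MAX_MINUTES_WITH_TIMESTAMPS if with_timestamps else MAX_MINUTES_NO_TIMESTAMPS
    if duration_min > limit:
        problems.append(f"duration exceeds {limit} minutes")
    return problems

print(check_audio_limits(800, 90, with_timestamps=True))
# → ['file exceeds 700MB', 'duration exceeds 60 minutes']
```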
Usage Considerations
- Audio Quality Impact: Background noise or poor audio quality may reduce transcription accuracy
- Technical Terminology: Industry-specific terms or proper nouns may not be accurately transcribed
- Language-Specific Behavior: Some languages may have unique behaviors with word-level timestamps
- Real-time Processing: Currently supports batch processing only, not real-time streaming
Regional Availability
AI_TRANSCRIBE is natively available in:
- AWS US West 2 (Oregon)
- AWS US East 1 (N. Virginia)
- AWS EU Central 1 (Frankfurt)
- Azure East US 2 (Virginia)
For other regions: Use cross-region inference to access AI_TRANSCRIBE functionality, with potentially slightly higher latency.
Business Use Cases
AI_TRANSCRIBE excels in various business scenarios:
1. Customer Service Quality Enhancement
Transform call center recordings into actionable insights:
- Sentiment Analysis: Use AI_SENTIMENT to analyze professionalism, problem resolution, and wait time perspectives
- Call Classification: Automatically categorize calls as complaints, inquiries, or praise with AI_CLASSIFY
- Speaker Separation: Analyze operator and customer speech separately for detailed insights
- Real-time Dashboards: Visualize analysis results for immediate service quality improvements
2. Meeting Automation and Action Item Extraction
Transform meeting recordings into productivity tools:
- Automatic Meeting Minutes: Instantly obtain full text from lengthy meetings
- Summary Generation: Use AI_COMPLETE to create concise meeting summaries
- Action Item Extraction: Automatically identify decisions and to-dos for efficient follow-up
- Participant Analysis: Track who said what using speaker identification
3. Legal and Compliance Automation
Strengthen risk management with transcribed legal conversations:
- Complete Documentation: Preserve all contract negotiations and legal discussions as text
- Compliance Risk Detection: Classify conversation content by risk level using AI_CLASSIFY
- Evidence Preservation: Accurately record who said what and when with speaker identification and timestamps
- Automated Audit Reports: Extract key points and generate audit-ready documentation
4. Education and Training Enhancement
Maximize learning effectiveness with transcribed educational content:
- Lecture Archives: Save course content as searchable text
- Subtitle Creation: Add captions to video materials using word-level timestamps
- Training Feedback Analysis: Identify improvement areas in training methodologies
- Multilingual Support: Transcribe foreign language courses for easier review
5. Healthcare Documentation (with proper privacy controls)
Streamline medical documentation workflows:
- Automated Clinical Notes: Generate structured medical records from doctor-patient conversations
- EHR Integration: Extract relevant information for electronic health records
- Multilingual Patient Care: Support international patients with transcription and translation
- Quality Assurance: Analyze consultation content for healthcare improvement
Building a Voice-Enabled AI Chatbot with Streamlit in Snowflake
Let's build a simple voice-enabled AI chatbot using AI_TRANSCRIBE in Streamlit in Snowflake. Users can ask questions via voice, which gets transcribed and answered by AI (including the newly added OpenAI GPT-5!).
Application Overview
This application provides:
- Voice Recording: Record audio directly from browser and save to stage
- Audio Transcription: Convert to text using AI_TRANSCRIBE
- AI Response Generation: Generate answers using AI_COMPLETE
Prerequisites
Environment Requirements
- Python Version: 3.11 or higher
- Additional Packages: None required (works with standard packages only)
- Streamlit in Snowflake: Environment to create and run applications
Regional Verification
Ensure your region supports AI_TRANSCRIBE and AI_COMPLETE functions, or enable cross-region inference.
Implementation Steps
1. Create a New Streamlit in Snowflake App
Navigate to 'Streamlit' in Snowsight's left pane and click '+ Streamlit' to create a new app.
2. Paste the Sample Code
Copy and paste the sample code below directly into the app editor. No modifications needed - stage names are automatically configured.
3. Run the Application
Click the "Run" button to launch the app. The stage will be created automatically on first run.
4. Use the Application
- Voice Input: Click the microphone button to speak
- Model Selection: Choose your preferred AI model from the sidebar
- Text Input: Regular chat input is also available
Sample Code
import streamlit as st
import io
import uuid
import json
from datetime import datetime
from snowflake.snowpark.context import get_active_session
from snowflake.snowpark.functions import ai_complete, to_file

# Get Snowflake session
session = get_active_session()

# Constants
STAGE_NAME = "AUDIO_TRANSCRIBE_STAGE"

# Page configuration
st.set_page_config(layout="wide")
st.title("AI Voice Chatbot")

# Sidebar: Model selection
st.sidebar.title("⚙️ Settings")

# Model options
model_options = [
    "━━━ 🟢 OpenAI ━━━",
    "openai-gpt-oss-120b",
    "openai-gpt-oss-20b",
    "openai-gpt-5",
    "openai-gpt-5-mini",
    "openai-gpt-5-nano",
    "openai-gpt-5-chat",
    "openai-gpt-4.1",
    "openai-o4-mini",
    "━━━ 🔵 Claude ━━━",
    "claude-4-opus",
    "claude-4-sonnet",
    "claude-3-7-sonnet",
    "claude-3-5-sonnet",
    "━━━ 🦙 Llama ━━━",
    "llama4-maverick",
    "llama4-scout",
    "llama3.3-70b",
    "llama3.2-3b",
    "llama3.2-1b",
    "llama3.1-405b",
    "llama3.1-70b",
    "llama3.1-8b",
    "llama3-70b",
    "llama3-8b",
    "━━━ 🟣 Mistral ━━━",
    "mistral-large2",
    "mistral-large",
    "mixtral-8x7b",
    "mistral-7b",
    "━━━ ❄️ Snowflake ━━━",
    "snowflake-arctic",
    "snowflake-llama-3.3-70b",
    "snowflake-llama-3.1-405b",
    "━━━ 🔴 Others ━━━",
    "deepseek-r1",
    "reka-core",
    "reka-flash",
    "jamba-1.5-large",
    "jamba-1.5-mini",
    "jamba-instruct",
    "gemma-7b"
]

# Default model setting
default_model = "llama4-maverick"
default_index = model_options.index(default_model) if default_model in model_options else 1

llm_model = st.sidebar.radio(
    "Select AI Model",
    options=model_options,
    index=default_index,
    format_func=lambda x: x if "━━━" in x else f" • {x}"
)

# Fall back to the default model if a separator is selected
if "━━━" in llm_model:
    llm_model = "llama4-maverick"

# Stage setup
@st.cache_resource
def setup_stage():
    """Set up the stage for audio file storage"""
    try:
        session.sql(f"DESC STAGE {STAGE_NAME}").collect()
    except Exception:
        session.sql(f"""
            CREATE STAGE IF NOT EXISTS {STAGE_NAME}
            ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE')
            DIRECTORY = (ENABLE = TRUE)
        """).collect()

setup_stage()

# Initialize session state
if 'messages' not in st.session_state:
    st.session_state.messages = []
    st.session_state.chat_history = ""

def extract_text_from_transcript(transcript_result):
    """Extract text from an AI_TRANSCRIBE result"""
    if isinstance(transcript_result, str) and transcript_result.startswith('{'):
        try:
            return json.loads(transcript_result).get('text', '')
        except json.JSONDecodeError:
            return transcript_result
    return transcript_result

def clean_ai_response(response):
    """Clean up the AI response"""
    if isinstance(response, str):
        response = response.strip('"')
        response = response.replace('\\n', '\n')
    return response

def generate_ai_response(prompt, model):
    """Generate an AI response with AI_COMPLETE"""
    df = session.range(1).select(
        ai_complete(model=model, prompt=prompt).alias("response")
    )
    return clean_ai_response(df.collect()[0]['RESPONSE'])

# Display chat history
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])

# Voice input section
st.subheader("Voice Input")
audio_value = st.audio_input("Click the microphone button to speak")

if st.button("📤 Send Voice", disabled=(audio_value is None), use_container_width=True):
    if audio_value:
        try:
            # Upload the audio file to the stage
            with st.spinner("🎤 Uploading audio..."):
                audio_filename = f"audio_{datetime.now().strftime('%Y%m%d_%H%M%S')}_{uuid.uuid4().hex[:8]}.wav"
                audio_stream = io.BytesIO(audio_value.getvalue())
                session.file.put_stream(
                    audio_stream,
                    f"@{STAGE_NAME}/{audio_filename}",
                    auto_compress=False,
                    overwrite=True
                )

            # Transcribe the uploaded audio
            with st.spinner("📝 Transcribing audio..."):
                query = f"""
                SELECT AI_TRANSCRIBE(
                    TO_FILE('@{STAGE_NAME}/{audio_filename}')
                ) AS transcript
                """
                result = session.sql(query).collect()

            if result and len(result) > 0:
                transcribed_text = extract_text_from_transcript(result[0]['TRANSCRIPT'])
                if transcribed_text:
                    # Add the user message
                    st.session_state.messages.append({"role": "user", "content": transcribed_text})
                    st.session_state.chat_history += f"User: {transcribed_text}\n"

                    # Generate the AI response
                    with st.spinner("🤖 AI is generating response..."):
                        full_prompt = st.session_state.chat_history + "AI: "
                        response = generate_ai_response(full_prompt, llm_model)
                        st.session_state.messages.append({"role": "assistant", "content": response})
                        st.session_state.chat_history += f"AI: {response}\n"

                    st.rerun()
                else:
                    st.warning("Transcription failed. Please try again.")
        except Exception as e:
            st.error(f"Error occurred: {str(e)}")

# Text input
if prompt := st.chat_input("Enter your message..."):
    # Add and display the user message
    st.session_state.messages.append({"role": "user", "content": prompt})
    st.session_state.chat_history += f"User: {prompt}\n"
    with st.chat_message("user"):
        st.markdown(prompt)

    # Generate and display the AI response
    try:
        with st.spinner("🤖 AI is generating response..."):
            full_prompt = st.session_state.chat_history + "AI: "
            response = generate_ai_response(full_prompt, llm_model)
            st.session_state.messages.append({"role": "assistant", "content": response})
            st.session_state.chat_history += f"AI: {response}\n"
        with st.chat_message("assistant"):
            st.markdown(response)
    except Exception as e:
        st.error(f"Error occurred: {str(e)}")

# Clear chat history
if st.button("🗑️ Clear Chat History"):
    st.session_state.messages = []
    st.session_state.chat_history = ""
    st.rerun()
Implementation Highlights
- Simple Implementation: No additional packages required, works with standard libraries only
- Audio Management: Store audio data in stages and process with AI_TRANSCRIBE
- Multimodal Support: Supports both voice and text input
- Rich Model Selection: Choose from latest models including OpenAI GPT-5
Cost Considerations
AI_TRANSCRIBE pricing follows the same token-based model as other AISQL functions:
Token Consumption and Pricing
- 50 tokens per second of audio: Consistent across languages and timestamp granularities
- 1 hour of audio = 180,000 tokens
- Estimated cost: At 1.3 credits per million tokens and assuming $3 per credit, 1 hour of audio (180,000 tokens) consumes about 0.234 credits, or roughly $0.70
For example, a 60-second audio file:
- 60 seconds × 50 tokens = 3,000 tokens
Note: Audio files under 1 minute are billed as 1 minute (3,000 tokens) minimum. For processing many short audio files, consider batching them together for cost optimization.
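These rules are easy to encode in a small estimator. A Python sketch using the per-second token rate, the 1-minute billing minimum, and the assumed credit price from the estimate above (the helper is illustrative; verify current rates against the Snowflake Service Consumption Table):

```python
# Assumed rates from the estimate above — check your actual contract pricing
TOKENS_PER_SECOND = 50
MIN_BILLED_SECONDS = 60  # files under 1 minute are billed as 1 minute
CREDITS_PER_MILLION_TOKENS = 1.3
USD_PER_CREDIT = 3.0

def estimate_cost_usd(duration_seconds):
    """Rough AI_TRANSCRIBE cost estimate for one audio file."""
    billed_seconds = max(duration_seconds, MIN_BILLED_SECONDS)
    tokens = billed_seconds * TOKENS_PER_SECOND
    credits = tokens * CREDITS_PER_MILLION_TOKENS / 1_000_000
    return credits * USD_PER_CREDIT

print(round(estimate_cost_usd(3600), 3))   # 1 hour of audio → 0.702
print(round(estimate_cost_usd(19.08), 4))  # short file, billed as 1 minute → 0.0117
```

Note how the 19-second file from Example 1 still costs the 1-minute minimum, which is why batching many short clips can reduce cost.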
For latest pricing information, refer to the Snowflake Service Consumption Table.
Summary
AI_TRANSCRIBE represents a breakthrough function that opens the door to audio data analytics. Combined with Snowflake's enhanced support for images and documents in 2025, the addition of audio - the third major unstructured data format - truly positions Snowflake as a comprehensive multimodal data platform.
Key Benefits
- Unified Data Processing: Process all data types including audio within Snowflake
- AISQL Function Integration: Combine with sentiment analysis, classification, summarization, and vectorization
- Secure Environment: No external data movement required, maintaining governance
- Development Efficiency: Build audio analytics pipelines with just SQL, no third-party packages needed
From customer service and meeting transcription to healthcare documentation and legal compliance, AI_TRANSCRIBE unlocks valuable insights from previously untapped audio data. Start exploring how this function can transform your business analytics today!
Have you tried audio analytics in your data workflows? What use cases are you most excited about? Share your experiences in the comments below!
Promotion
Snowflake What's New Updates on X
I share Snowflake What's New updates on X. Follow for the latest insights:
English Version
Snowflake What's New Bot (English Version)
Japanese Version
Snowflake's What's New Bot (Japanese Version)
Change Log
(20250810) Initial post