Jerry Shao, CTO & Co-founder of Datastrato, explained the vision and capabilities of unified metadata management at Scaling Iceberg Adoption at Pinterest with Gravitino.
Session 1: The Modern Data Challenge - Why "Catalog of Catalogs" Matters
The Enterprise Data Silos Problem
Modern data-driven companies face an inevitable challenge: data silos everywhere. A typical enterprise runs a Hadoop-based data lake for ETL and batch processing, a data warehouse for ad hoc analytics, stream processing stacks for real-time requirements, and machine learning platforms for AI workloads.
The result? Each system serves its purpose well, but data becomes fragmented across isolated islands.
The Traditional "Unified Data" Approach Falls Short
The conventional wisdom says: "Unify all your data in one storage layer." Lakehouse technologies attempt this, trying to force everything into a single system.
But here's the problem: Current lakehouse technologies cannot adequately support streaming analytics AND machine learning AND traditional analytics. Each workload has unique requirements that no single system can perfectly address.
Apache Gravitino's Revolutionary Insight
Instead of asking "How do we unify the data?", we asked: "Can we unify the metadata?"
This shift in thinking is profound. Every data system needs a catalog to manage its metadata. So rather than moving massive amounts of data, why not create a unified layer that manages the catalogs themselves?
Apache Gravitino is a "catalog of catalogs" - a metadata management platform that provides unified governance and access across diverse data systems without forcing you to abandon existing investments.
Session 2: Gravitino's Architecture - Unification Without Migration
The Generic Metadata Object Model
At Gravitino's core is a universal metadata framework that represents different types of data through consistent interfaces:
Unified Table Management
Tables from Hive, Iceberg, PostgreSQL, and other systems are represented through the same metadata model, enabling consistent operations across platforms.
Comprehensive Data Type Support
- Tables: Traditional structured data across any system
- Filesets: Direct file and directory management on HDFS, S3, etc.
- Models: Machine learning model metadata and versioning
- Topics: Streaming data topics and configurations
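To make the idea of one model for several object types concrete, here is a minimal sketch in Python. The names `ObjectType`, `MetadataObject`, and `list_by_type`, as well as the sample catalog names, are illustrative assumptions and not Gravitino's actual classes.

```python
from dataclasses import dataclass, field
from enum import Enum


class ObjectType(Enum):
    TABLE = "table"
    FILESET = "fileset"
    MODEL = "model"
    TOPIC = "topic"


@dataclass
class MetadataObject:
    """One entry in a hypothetical unified metadata model."""
    catalog: str              # which backing system the object lives in
    schema: str
    name: str
    object_type: ObjectType
    properties: dict = field(default_factory=dict)

    @property
    def full_name(self) -> str:
        return f"{self.catalog}.{self.schema}.{self.name}"


def list_by_type(items, object_type):
    """Return the full names of all objects of one type, in input order."""
    return [o.full_name for o in items if o.object_type is object_type]


objects = [
    MetadataObject("hive_prod", "sales", "orders", ObjectType.TABLE),
    MetadataObject("pg_crm", "public", "customers", ObjectType.TABLE),
    MetadataObject("s3_raw", "landing", "clickstream", ObjectType.FILESET),
    MetadataObject("kafka_main", "default", "page_views", ObjectType.TOPIC),
]
tables = list_by_type(objects, ObjectType.TABLE)
```

The point of the sketch is that a Hive table and a PostgreSQL table end up in the same list because both are described by the same structure.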
The Connection Layer: Universal Data System Integration
Gravitino connects to diverse data systems through specialized connectors:
- Hive Connector: Integrates with Hive Metastore ecosystems
- JDBC Connector: Connects to relational databases
- Iceberg Connector: Native Apache Iceberg support
- Custom Connectors: Extensible framework for new systems
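The connector layer can be pictured as a common interface with one implementation per backing system. The sketch below is an assumption-laden illustration in Python: `CatalogConnector`, `InMemoryConnector`, and `ConnectorRegistry` are hypothetical names, and the in-memory class stands in for real Hive, JDBC, or Iceberg connectors.

```python
from abc import ABC, abstractmethod


class CatalogConnector(ABC):
    """Hypothetical interface every connector would implement."""

    @abstractmethod
    def list_tables(self, schema):
        """Return the table names in one schema."""


class InMemoryConnector(CatalogConnector):
    """Stand-in for a real connector; serves tables from a dict."""

    def __init__(self, tables_by_schema):
        self._tables = tables_by_schema

    def list_tables(self, schema):
        return sorted(self._tables.get(schema, []))


class ConnectorRegistry:
    """Maps catalog names to connectors so callers use one entry point."""

    def __init__(self):
        self._connectors = {}

    def register(self, catalog_name, connector):
        self._connectors[catalog_name] = connector

    def list_tables(self, catalog_name, schema):
        return self._connectors[catalog_name].list_tables(schema)


registry = ConnectorRegistry()
registry.register("hive_prod", InMemoryConnector({"sales": ["orders", "items"]}))
registry.register("pg_crm", InMemoryConnector({"public": ["customers"]}))
```

Supporting a new system then means adding one more `CatalogConnector` subclass and registering it; callers of the registry do not change.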
Dual API Strategy: Standards Compliance + Innovation
Gravitino Unified REST APIs
Generic operations across all data types and systems through consistent interfaces.
Native Iceberg REST APIs
Full compliance with the Iceberg REST specification at /v1/namespaces and /v1/tables, supporting standard Iceberg clients while adding enterprise capabilities.
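As a rough illustration of what a client addresses, the Iceberg REST specification exposes routes of the form `/v1/{prefix}/namespaces` and `/v1/{prefix}/namespaces/{namespace}/tables`, where the prefix is optional and multi-level namespaces are conventionally joined with the unit separator byte (0x1F). The Python sketch below only builds such URLs; the host `catalog.example.com` and the helper names are assumptions, not part of Gravitino or Iceberg.

```python
from urllib.parse import quote

# Separator conventionally used by the Iceberg REST spec for
# multi-level namespaces inside a single path segment.
UNIT_SEPARATOR = "\x1f"


def namespaces_url(base_url, prefix=""):
    """Build the 'list namespaces' URL under an optional prefix."""
    parts = [base_url.rstrip("/"), "v1"]
    if prefix:
        parts.append(quote(prefix, safe=""))
    parts.append("namespaces")
    return "/".join(parts)


def tables_url(base_url, namespace_levels, prefix=""):
    """Build the 'list tables' URL for a (possibly nested) namespace."""
    encoded = quote(UNIT_SEPARATOR.join(namespace_levels), safe="")
    return f"{namespaces_url(base_url, prefix)}/{encoded}/tables"


base = "http://catalog.example.com/iceberg/"  # hypothetical endpoint
ns_url = namespaces_url(base)
tbl_url = tables_url(base, ["analytics", "daily"])
```

In practice a standard Iceberg client builds these paths itself once it is pointed at the REST endpoint; the sketch is only meant to show the shape of the routes.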
The Power of Interoperability
Both API sets operate on the same underlying data, so operations through one interface are immediately reflected in the other.
Session 3: Gravitino IRC - Production-Ready Iceberg REST Catalog
Beyond Basic REST Catalog Implementation
While Apache Iceberg provides a reference REST catalog implementation, Gravitino IRC is built for enterprise production requirements with enhanced capabilities that standard implementations lack.
Federated Catalog Architecture
The Challenge: Organizations use different catalog backends (Hive Metastore, JDBC databases, cloud-native solutions) and don't want to abandon existing investments.
Gravitino's Solution: Build REST endpoints on top of existing catalogs rather than replacing them.
User Applications → Iceberg REST Interface → Gravitino IRC → Existing Catalogs
    ├── Hive Metastore
    ├── JDBC Catalog
    └── Future: S3, Polaris
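The flow in the diagram can be sketched as a thin REST layer that parses a request path and hands the call to whichever existing catalog owns the table. This Python sketch is hypothetical throughout: `StaticBackend` and `route` are invented names, and using the path prefix to select a backend is an assumption made for illustration rather than a description of how Gravitino IRC dispatches requests.

```python
class StaticBackend:
    """Stand-in for an existing catalog (Hive Metastore, JDBC, ...)."""

    def __init__(self, kind, tables):
        self.kind = kind
        self._tables = tables  # {(namespace, table): storage location}

    def load_table(self, namespace, table):
        return {"backend": self.kind, "location": self._tables[(namespace, table)]}


def route(path, backends):
    """Dispatch /v1/<prefix>/namespaces/<ns>/tables/<table> to a backend."""
    segments = [s for s in path.split("/") if s]
    well_formed = (
        len(segments) == 6
        and segments[0] == "v1"
        and segments[2] == "namespaces"
        and segments[4] == "tables"
    )
    if not well_formed:
        raise ValueError(f"unsupported path: {path}")
    prefix, namespace, table = segments[1], segments[3], segments[5]
    if prefix not in backends:
        raise KeyError(f"no catalog registered for prefix: {prefix}")
    return backends[prefix].load_table(namespace, table)


backends = {
    "hive": StaticBackend("hive-metastore", {("sales", "orders"): "hdfs://nn/warehouse/sales/orders"}),
    "jdbc": StaticBackend("jdbc", {("crm", "customers"): "s3://bucket/crm/customers"}),
}
result = route("/v1/hive/namespaces/sales/tables/orders", backends)
```

The existing catalogs stay where they are; only the front door changes.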
Dual API Interoperability
Both the Gravitino unified APIs and the Iceberg REST APIs operate on the same underlying metadata, enabling seamless interoperability.
Production Enhancements Over Standard Implementation
Enterprise Serviceability
- Integrated metrics systems: Native Prometheus and Grafana support for comprehensive monitoring
- Audit logging: Complete operation tracking for governance and compliance
- Event framework: Pre/post-event hooks for custom business logic
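Pre/post-event hooks and audit logging fit together naturally: a pre-hook can veto an operation, and a post-hook can record what happened. The following Python sketch shows one possible shape; `EventBus`, the event dictionary layout, and the "prod." naming rule are all assumptions for illustration, not Gravitino's event API.

```python
class EventBus:
    """Runs registered hooks before and after an operation."""

    def __init__(self):
        self._pre_hooks = []
        self._post_hooks = []

    def on_pre(self, hook):
        self._pre_hooks.append(hook)

    def on_post(self, hook):
        self._post_hooks.append(hook)

    def run(self, operation, user, target, action):
        event = {"operation": operation, "user": user, "target": target}
        for hook in self._pre_hooks:
            hook(event)            # a pre-hook may raise to veto the operation
        result = action()
        for hook in self._post_hooks:
            hook(event, result)    # post-hooks see the outcome
        return result


audit_log = []


def protect_prod(event):
    # Hypothetical business rule: production tables cannot be dropped.
    if event["operation"] == "drop_table" and event["target"].startswith("prod."):
        raise PermissionError("dropping production tables is not allowed")


def audit(event, result):
    audit_log.append((event["user"], event["operation"], event["target"]))


bus = EventBus()
bus.on_pre(protect_prod)
bus.on_post(audit)
created = bus.run("create_table", "alice", "dev.events", lambda: "created")
```

Because the veto happens before the action runs, a rejected operation leaves no audit entry in this sketch; a real system would probably log rejections as well.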
Flexible Deployment Options
Organizations can deploy Gravitino IRC as a unified service alongside other Gravitino APIs, or as a standalone Iceberg-focused service, depending on their architectural preferences.
Enhanced Performance and Reliability
Unlike basic implementations, Gravitino IRC includes intelligent caching, connection pooling, and failover mechanisms designed for enterprise-scale workloads.
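The talk does not describe how the caching works, so the following is only a generic sketch of one common approach: a time-to-live cache in front of a slow metadata lookup. `TtlCache` and its parameters are hypothetical, and the injectable clock exists purely to make the behaviour easy to demonstrate.

```python
import time


class TtlCache:
    """Caches loader results per key for a fixed number of seconds."""

    def __init__(self, ttl_seconds, loader, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._loader = loader      # called on a miss or after expiry
        self._clock = clock
        self._entries = {}         # key -> (expires_at, value)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]        # still fresh: skip the backend call
        value = self._loader(key)
        self._entries[key] = (now + self._ttl, value)
        return value


backend_calls = []
fake_time = [0.0]


def load_table_metadata(name):
    backend_calls.append(name)     # stands in for a slow catalog round trip
    return {"table": name}


cache = TtlCache(30, load_table_metadata, clock=lambda: fake_time[0])
cache.get("sales.orders")
cache.get("sales.orders")          # served from cache
fake_time[0] = 31.0
cache.get("sales.orders")          # expired, reloaded from the backend
```

A short TTL trades a little staleness for far fewer round trips to the backing catalog.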
Session 4: Enterprise Security and Governance at Scale
End-to-End Authentication Architecture
Client Authentication
- OAuth2: Modern web-based authentication for applications and users
- Kerberos: Enterprise directory integration for secure environments
- Pluggable Framework: Custom authentication methods for unique organizational requirements
Backend Catalog Authentication
Different catalog systems require different authentication approaches:
- Hive Catalogs: Kerberos and delegation token support with impersonation
- JDBC Catalogs: Secure username/password management
- Cloud Catalogs: Native cloud identity integration
Data Layer Security
- HDFS Integration: Kerberos and delegation token support
- Cloud Storage: Secure credential vending that provides temporary, scoped access tokens
- Cross-System: Consistent security model regardless of backend storage
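Credential vending means the catalog hands out a short-lived token limited to one storage path instead of sharing long-lived keys. The Python sketch below imitates that idea with an HMAC-signed token; it is a toy model under stated assumptions (the `SECRET`, the token layout, and both function names are invented), whereas real deployments would rely on the cloud provider's own temporary-credential service.

```python
import base64
import hashlib
import hmac
import json

SECRET = b"demo-signing-key"  # hypothetical; never hard-code real keys


def vend_credential(path_prefix, now, ttl_seconds=900):
    """Issue a token valid for one path prefix until now + ttl_seconds."""
    claims = {"prefix": path_prefix, "expires_at": now + ttl_seconds}
    payload = json.dumps(claims, sort_keys=True).encode()
    signature = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return base64.urlsafe_b64encode(payload).decode() + "." + signature


def is_allowed(token, object_path, now):
    """Check signature, expiry, and that the path is inside the scope."""
    encoded, _, signature = token.partition(".")
    payload = base64.urlsafe_b64decode(encoded.encode())
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature, expected):
        return False
    claims = json.loads(payload)
    return now < claims["expires_at"] and object_path.startswith(claims["prefix"])


token = vend_credential("s3://bucket/warehouse/sales/", now=1000)
in_scope = is_allowed(token, "s3://bucket/warehouse/sales/orders/part-0.parquet", now=1100)
out_of_scope = is_allowed(token, "s3://bucket/warehouse/hr/salaries.parquet", now=1100)
expired = is_allowed(token, "s3://bucket/warehouse/sales/orders/part-0.parquet", now=2000)
```

The two properties that matter are both visible here: the token stops working after its expiry time, and it never grants access outside its prefix.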
Role-Based Access Control (RBAC)
Comprehensive Identity Management
Through Gravitino's unified REST APIs, administrators can:
- Add users and groups to the system
- Create roles with specific privileges on different entities
- Bind roles to users with fine-grained control
- Enforce policies consistently across all connected systems
Unified Policy Enforcement
When users query tables through any connected engine, Gravitino enforces access policies in real time, checking permissions before allowing read or write operations.
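The administrative steps and the enforcement check can be sketched together as a small in-memory model. Everything here is illustrative: `AccessControl`, `read_table`, and the privilege strings are assumed names chosen for the example, and a real deployment would perform these steps through the REST APIs rather than in process.

```python
class AccessControl:
    """Toy RBAC store: users hold roles, roles hold privileges on entities."""

    def __init__(self):
        self._role_privileges = {}   # role -> {(privilege, entity), ...}
        self._user_roles = {}        # user -> {role, ...}

    def add_user(self, user):
        self._user_roles.setdefault(user, set())

    def create_role(self, role, privileges):
        self._role_privileges[role] = set(privileges)

    def grant_role(self, user, role):
        if role not in self._role_privileges:
            raise KeyError(f"unknown role: {role}")
        self._user_roles.setdefault(user, set()).add(role)

    def is_allowed(self, user, privilege, entity):
        return any(
            (privilege, entity) in self._role_privileges[role]
            for role in self._user_roles.get(user, ())
        )


def read_table(acl, user, table):
    """Check the permission first, then perform the (simulated) read."""
    if not acl.is_allowed(user, "SELECT_TABLE", table):
        raise PermissionError(f"{user} may not read {table}")
    return f"rows from {table}"


acl = AccessControl()
acl.add_user("alice")
acl.add_user("bob")
acl.create_role("sales_reader", [("SELECT_TABLE", "hive_prod.sales.orders")])
acl.grant_role("alice", "sales_reader")
rows = read_table(acl, "alice", "hive_prod.sales.orders")
```

Because the check sits in front of the read, the same rule applies no matter which engine issued the query.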
Advanced Data Governance Features
Intelligent Data Discovery
- Tagging System: Classify and organize data assets across systems
- Search Integration: OpenSearch integration for keyword-based data discovery
- Metadata Enrichment: Automatic data profiling and documentation
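Tag-based discovery amounts to an index from tags to the assets that carry them. The sketch below is a minimal Python illustration with hypothetical names (`TagIndex`, `tag`, `find`); it does not model the OpenSearch integration or any Gravitino API.

```python
from collections import defaultdict


class TagIndex:
    """Maps tags to the data assets that carry them."""

    def __init__(self):
        self._objects_by_tag = defaultdict(set)

    def tag(self, object_name, *tags):
        for t in tags:
            self._objects_by_tag[t].add(object_name)

    def find(self, *tags):
        """Return the assets carrying all of the given tags, sorted by name."""
        if not tags:
            return []
        matches = [self._objects_by_tag.get(t, set()) for t in tags]
        return sorted(set.intersection(*matches))


index = TagIndex()
index.tag("hive_prod.sales.orders", "pii", "finance")
index.tag("pg_crm.public.customers", "pii")
index.tag("kafka_main.default.page_views", "clickstream")
pii_assets = index.find("pii")
pii_finance = index.find("pii", "finance")
```

Since the asset names span Hive, PostgreSQL, and Kafka, a single tag query crosses system boundaries.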
Data Lineage Tracking
Gravitino captures and exposes lineage information, showing how data flows between systems and transforms through different processing stages.
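Lineage can be modelled as a directed graph whose edges point from a source asset to the assets derived from it. The Python sketch below walks such a graph breadth-first to answer "what is affected if this source changes?". The edge data and the `downstream` helper are invented for illustration and say nothing about how Gravitino stores lineage.

```python
from collections import deque


def downstream(edges, start):
    """Return every asset reachable from start, in breadth-first order."""
    visited = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for successor in edges.get(node, []):
            if successor not in visited:
                visited.add(successor)
                order.append(successor)
                queue.append(successor)
    return order


# Hypothetical lineage: a topic feeds a raw table, which feeds an
# aggregate table and a model; the aggregate feeds a reporting table.
edges = {
    "kafka.page_views": ["iceberg.raw.page_views"],
    "iceberg.raw.page_views": ["iceberg.analytics.daily_views", "model.ctr_predictor"],
    "iceberg.analytics.daily_views": ["pg.reports.dashboard"],
}
impacted = downstream(edges, "kafka.page_views")
```

Running the same traversal on reversed edges would answer the opposite question, namely where a given table's data came from.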
Session 5: Expanding Ecosystem Support
Growing Catalog Backend Support
Current Production Support
- Hive Metastore: Full integration with existing Hadoop ecosystems
- JDBC Catalogs: PostgreSQL, MySQL, and other relational database catalogs
Planned Integrations
We plan to support more catalog backends, such as S3 catalogs and Polaris, giving organizations even more flexibility in their catalog choices.
Conclusion: The Unified Metadata Future
Apache Gravitino represents a fundamental shift in how enterprises approach data architecture. Rather than forcing organizations to migrate data or abandon existing investments, Gravitino enables transformation through unification.
The three core principles driving this evolution:
- Metadata-First Architecture: Unifying metadata management enables data interoperability without data movement
- Federation Over Migration: Preserve existing investments while gaining modern capabilities
- Standards-Based Innovation: Extensible platforms that maintain ecosystem compatibility
Organizations choosing Gravitino gain:
- Immediate integration benefits without migration risk
- Enterprise-grade security and governance capabilities
- Future-proof architecture that evolves with industry standards
- Active community support and continuous innovation
The future of enterprise data architecture isn't about choosing the right system; it's about choosing the right approach to unified metadata management.