Kimball Star Schema vs Palantir’s Ontology
TC Ricks

TC Ricks @tc87


Publish Date: Aug 6

Here’s a joke I thought of:

Q: What does the father of data warehousing have in common with a $300B analytics powerhouse named after a Lord of the Rings spying device?

A: Both of them defy the star schema paradigm and encourage you to model your data top-down, in a highly normalized format.

…maybe you didn’t think that was funny, but I did. Not funny in a “ha-ha” way, but in a glitch-in-the-matrix kind of way. Like “huh, that’s funny”.

Pretty much everyone in data today uses some variation of the star schema paradigm popularized by Ralph Kimball. If you’re familiar with data warehousing or business intelligence, you’ve probably seen “fact” and “dimension” tables floating around in your tools. Those are Kimball’s work, and they really are the industry standard for modeling data for BI purposes. They’re everywhere.

Palantir, a $300B data analytics company, makes a ton of money selling services that model enterprise data in a way that runs directly counter to the Kimball industry standard. They center their offering around the ontology, a highly normalized data model that represents the “objects” a business user would think about. These objects have links between them characterized as 1-to-1, 1-to-many, or many-to-many. It’s a highly intuitive structure.
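To make the object-and-link idea concrete, here is a minimal Python sketch. It is not Palantir’s API or data format; the class names (ObjectType, LinkType, Ontology) and the example objects (Customer, Order, Product, LoyaltyAccount) are all made up for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class Cardinality(Enum):
    ONE_TO_ONE = "1-to-1"
    ONE_TO_MANY = "1-to-many"
    MANY_TO_MANY = "many-to-many"


@dataclass(frozen=True)
class ObjectType:
    """A business concept plus the properties a user would expect on it."""
    name: str
    properties: tuple[str, ...]


@dataclass(frozen=True)
class LinkType:
    """A named relationship between two object types."""
    source: str
    target: str
    cardinality: Cardinality


@dataclass
class Ontology:
    object_types: dict[str, ObjectType] = field(default_factory=dict)
    link_types: list[LinkType] = field(default_factory=list)

    def add_object(self, name: str, *properties: str) -> None:
        self.object_types[name] = ObjectType(name, tuple(properties))

    def link(self, source: str, target: str, cardinality: Cardinality) -> None:
        # Refuse links that point at object types nobody has defined.
        for end in (source, target):
            if end not in self.object_types:
                raise KeyError(f"unknown object type: {end}")
        self.link_types.append(LinkType(source, target, cardinality))

    def neighbors(self, name: str) -> list[str]:
        """Object types reachable from `name` in one hop, in either direction."""
        found = []
        for lt in self.link_types:
            if lt.source == name:
                found.append(lt.target)
            elif lt.target == name:
                found.append(lt.source)
        return sorted(found)


# Hypothetical retail example.
ontology = Ontology()
ontology.add_object("Customer", "name", "email")
ontology.add_object("Order", "placed_at", "total")
ontology.add_object("Product", "sku", "price")
ontology.add_object("LoyaltyAccount", "points")
ontology.link("Customer", "LoyaltyAccount", Cardinality.ONE_TO_ONE)
ontology.link("Customer", "Order", Cardinality.ONE_TO_MANY)
ontology.link("Order", "Product", Cardinality.MANY_TO_MANY)

print(ontology.neighbors("Order"))
```

The point of the sketch is that a business user can navigate from an Order to its Customer or its Products without knowing which tables sit underneath.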

The weird thing is that most of the industry doesn’t do that. Palantir’s roaring success in deploying teams of forward deployed engineers to build out ontologies in their platform is a testament to the power of this modeling technique. But usually there’s nothing new under the sun. So why isn’t anyone else doing this?

To me, as a former Palantir forward deployed engineer, it felt like a glitch in the matrix. It didn’t quite make sense.

I would come across customer data warehouses that were highly denormalized and structured as “facts” and “dimensions”. I asked myself why on earth they had adopted this strategy instead of the more intuitive, straightforward object-property approach. Following this train of thought, I uncovered a fascinating story of a passionate but respectful technological debate between two titans of the data industry, Bill Inmon and Ralph Kimball.

Bill Inmon, from the beginning, advocated for a top-down approach to data modeling. He felt enterprises should first establish what they need from their data warehouse as a single source of truth, and then do the integration work upfront to map raw data from source systems into that source of truth in a highly consistent way. He advocated for a normalized layer (many tables, not pre-joined), deferring joins until the very end of the analytics process. This maximizes data availability, ensuring information is not aggregated away or dropped in “join” steps before it reaches the end user.
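Here is a small sketch of what “normalized, with joins deferred to the end” can look like, using Python’s built-in sqlite3 module. The table names (customer, product, purchase) and the rows are invented for illustration and are not taken from Inmon’s writing or any real warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Each entity lives in its own table; nothing is pre-joined or pre-aggregated.
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE product  (product_id  INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE purchase (
    purchase_id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    product_id  INTEGER NOT NULL REFERENCES product(product_id),
    amount      REAL NOT NULL
);
INSERT INTO customer VALUES (1, 'Ada'), (2, 'Grace');
INSERT INTO product  VALUES (10, 'Widget'), (20, 'Gadget');
INSERT INTO purchase VALUES (100, 1, 10, 5.0), (101, 1, 20, 7.0), (102, 2, 10, 5.0);
""")


def customers_who_bought(product_name):
    """Join at query time; row-level detail is still available."""
    rows = conn.execute(
        """
        SELECT DISTINCT c.name
        FROM purchase p
        JOIN customer c ON c.customer_id = p.customer_id
        JOIN product pr ON pr.product_id = p.product_id
        WHERE pr.name = ?
        ORDER BY c.name
        """,
        (product_name,),
    ).fetchall()
    return [name for (name,) in rows]


def revenue_by_product():
    """An aggregate derived from the same base tables, only when asked for."""
    return conn.execute(
        """
        SELECT pr.name, SUM(p.amount)
        FROM purchase p
        JOIN product pr ON pr.product_id = p.product_id
        GROUP BY pr.name
        ORDER BY pr.name
        """
    ).fetchall()


print(customers_who_bought("Widget"))
print(revenue_by_product())
```

If only the output of revenue_by_product() had been stored, the question “who bought a Widget?” could no longer be answered. Keeping the normalized tables and joining late leaves both questions open.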

Ralph Kimball, on the other hand, is largely responsible for the “dimensional modeling” approach that most people use today. You may know this as the “star schema”. The approach has a lot of great features that I love: it’s incremental, decentralized, simpler up-front, and returns user-facing results quickly. It’s a “bottom-up” methodology; you can incrementally deliver analytics from each source system in your organization before you create a centralized, consistent source of truth. As a result of these really great properties, Kimball’s approach has achieved full ideological capture of the data warehousing community.
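For comparison, a star schema for a single sales source might look like the sketch below, again using Python’s built-in sqlite3 module. The table and column names (fact_sales, dim_product, dim_date) follow common naming habits but are assumptions made for this example, as is the data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimensions: wide, denormalized descriptive tables (category sits on the product row).
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT,
    category     TEXT
);
CREATE TABLE dim_date (
    date_key INTEGER PRIMARY KEY,
    iso_date TEXT,
    month    TEXT
);
-- Fact: one row per sale, holding measures plus a key into each dimension.
CREATE TABLE fact_sales (
    date_key    INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    quantity    INTEGER,
    revenue     REAL
);
INSERT INTO dim_product VALUES
    (1, 'Widget', 'Hardware'), (2, 'Gadget', 'Hardware'), (3, 'Manual', 'Books');
INSERT INTO dim_date VALUES
    (20240105, '2024-01-05', '2024-01'), (20240210, '2024-02-10', '2024-02');
INSERT INTO fact_sales VALUES
    (20240105, 1, 2, 10.0),
    (20240105, 3, 1, 4.0),
    (20240210, 2, 1, 7.0),
    (20240210, 1, 1, 5.0);
""")


def revenue_by_month_and_category():
    """The typical BI query shape: the fact table joined out to its dimensions."""
    return conn.execute(
        """
        SELECT d.month, p.category, SUM(f.revenue)
        FROM fact_sales f
        JOIN dim_date d    ON d.date_key = f.date_key
        JOIN dim_product p ON p.product_key = f.product_key
        GROUP BY d.month, p.category
        ORDER BY d.month, p.category
        """
    ).fetchall()


for row in revenue_by_month_and_category():
    print(row)
```

A team can ship this for one source system and start answering questions right away, which is the incremental quality described above. A second source system would get its own fact table, ideally sharing the same dimensions.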

So what’s going on with Palantir? Why did they stray so far from the consensus?

I don’t think they’re contrarian for the sake of being contrarian. Palantir sells contracts on business results, and the price tag is high. High enough to deploy specialized, highly skilled people with the sole goal of developing a data model that achieves seven- or eight-figure business outcomes. In this environment, Palantir is able to pay the heavy up-front human cost of the Inmon, top-down approach: integrate all the data you can, put in the effort to achieve strong consistency across disparate data sources, and deliver a high-level model that fully delivers on data availability in a human-understandable way (the ontology).

Most data teams don’t have the luxury of the budget, focus, and firepower of Palantir to build and maintain an ontology. Kimball’s approach is far easier, but comes with the downside of a lack of data visibility and coverage for the business user. Usually people close the gap with fancy dashboards and on-call schedules for analysts who respond to ad-hoc requests from business users.

AI is rapidly changing what people can do with a limited amount of budget and human firepower. In my opinion, now is the time to invest in technology that cuts the expense of ontology curation by 90%. That’s what we’re working on at AstroBee. We’re using LLM-based agents to do the upfront integration and data consistency work required to bring a full-coverage ontology to your team at an affordable price point. You could say that AstroBee brings the firepower of Palantir to the little guy. We’re gearing up to launch soon, but if you’d like to take part in early pre-launch user testing, please reach out to support@astrobee.ai!
