What is it?
A data warehouse is a central store used for managing large volumes of historical and current data for an organization. Unlike operational and transactional databases, it is optimized for analysis and business intelligence.
What are its components?
1. Database
This is the core storage component in a data warehouse built upon a data model. Dimensional modelling is the preferred method of coming up with the blueprint/data model for this database because it is both query optimizing and easy to grasp i.e. Fact table for quantitative measurable metrics and dimension tables for descriptive/attribute content adding meaning to fact tables.
Two schema designs are used in this modelling:
Star schema: simple and intuitive. It is denormalized, query optimizing, compatible with reporting and BI tools but storage inefficient.
Snowflake schema: complex and extends star schema by normalizing tables. It is normalized, storage efficient, maintains data integrity by reducing redundancy but makes queries and ETL processes more complex due to multiple joins.
2. Data Sources
Since data warehouses are central, they fetch data from multiple sources:
Transactional databases: handle real time, small scale, frequent, read-write operations (OLTP). Normalized to reduce redundancy. Example: A retail MySQL dB managing daily sales.
Customer Relationship Management (CRM): systems that store company interactions with customers and prospects. Example: A CRM containing customer profiles, sales leads, campaigns and purchase history.
Enterprise Resource Planning (ERP): integrate and manage core business processes; Finance, HR, supply chain and inventory. Highly normalized for operational efficiency. Example: ERP table for inventory containing ItemID, StoreID, Quantity, Cost, Date_of_purchase.
APIs: basically, how systems share data. In this context, a retail company’s website real time/ near real time traffic data (page views, user demographics) can be pulled using Google Analytics API and stored in a warehouse for analysis.
Flat Files: simple, non-relational files stored in formats such as CSV, JSON, XML. Stored in local systems/cloud before ingesting into data warehouse. Example: A CSV file of customer survey responses stored in S3 buckets then loaded into data warehouse to analyze customer satisfaction.
How are these data sources integrated into the warehouse?
3. ETL/ELT Processes
This is where ETL/ELT comes in handy.
Extraction: Pulling data from sources mentioned above to ensure all relevant data is collected.
Transformation: Cleaning and standardizing the data. Removing duplicates, handling missing values, ensuring data integrity i.e. standardize date formats. Different sources have different formats, conventions; An ERM may use Firstname, Lastname order while a CRM uses Lastname, Firstname order.
To optimize performance and reduce query complexity, aggregation of the data is necessary e.g. aggregating daily transactional data to weekly sales.
Loading: Organizing the data into an optimized structure i.e. in a star schema to improve query performance.
Take Note: Stages can be very useful to store Flat files temporarily before loading them into tables. A good example is Snowflakes AWS S3 stage which is managed by snowflake. It may also be external e.g. AWS S3, Azure Blob, Google Cloud Storage but will require configuration. In the case of retail: An AWS S3 stage storing customer survey responses in a CSV file.
4. Query and reporting tools
Allow users to interact with stored data, generate insights and build reports. These includes business intelligence platforms: Power BI, Tableau, Looker as well as SQL tools.
Real World Applications
Amazon employs Amazon Redshift, a cloud-based data warehouse, to handle vast amounts of data from its e-commerce platform, including clicks, impressions, website visitors, and purchase histories. This supports marketing analytics, tracks key performance indicators (KPIs) like conversion rates and churn, and enables reverse ETL to target audiences effectively.
Target Corporation, one of the largest retailers in the United States, uses a sophisticated data warehouse to power its analytics and decision-making. Their system, known as the “Guest Data Platform,” integrates data from various sources to create a unified view of each customer. This has enabled target to:
- Implement highly successful personalized marketing campaigns
- Optimize store layouts based on customer behavior analysis
- Improve inventory management, reducing stockouts and overstocks
- Enhance their online and mobile shopping experiences