A data warehouse is essential for consolidating and transforming data from multiple sources into a single, reliable source for business intelligence, reporting, and analytical insights. It is a subject-oriented (focused on specific business areas like sales, customers, or products), integrated (combines data from multiple sources into a unified view), time-variant (stores historical data), non-volatile (data is usually not deleted from a warehouse) collection of data to support decision-making.
Why a Data Warehouse?
Implementing a data warehouse offers many benefits:
- Replaces manual data gathering, significantly reducing human error and increasing speed through ETL (Extract, Transform, Load) processes.
- Becomes the single source of truth for all analysis and reporting, ensuring consistency across the organization.
- Enables combining data from various systems.
- Preserves historical data for trend analysis.
- Ensures all reports reflect the same, consistent data.
- Handles large volumes of data, including "Big Data."
For this project, the objective was to develop a modern data warehouse using SQL Server to consolidate sales data, enabling analytical reporting and informed decision-making. The project aimed to cleanse and resolve data quality issues and to combine the ERP and CRM sources into a single, user-friendly data model; historization of the data was not required, so the analytical model focuses on the latest records.
Medallion Architecture
Given the need for both structured data processing and a desire for a modern, scalable approach, this project opted for the Medallion Architecture. This multi-layered architecture provides a clear and robust framework for data processing.
The Medallion Architecture consists of three distinct layers:
- Bronze Layer (Raw): Ingests raw, unprocessed data directly from source systems. Data is stored "as-is" to ensure traceability.
- Silver Layer (Cleaned & Transformed): This layer holds cleaned, transformed, and integrated data. This is where data quality issues are resolved, duplicates are removed, formats are standardized, and data from different sources is joined.
- Gold Layer (Aggregated & Business-Ready): Contains aggregated, business-level data optimized for business intelligence, analytics and reporting.
This layered approach ensures data quality and provides a structured path from raw input to analytical insights.
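In SQL Server, these layers can be represented as schemas within a single database. A minimal setup sketch, assuming the DataWarehouse database name that appears in the queries later in this post:

```sql
-- Create the warehouse database and one schema per Medallion layer
CREATE DATABASE DataWarehouse;
GO
USE DataWarehouse;
GO
CREATE SCHEMA bronze; -- raw, as-is source data
GO
CREATE SCHEMA silver; -- cleaned and integrated data
GO
CREATE SCHEMA gold;   -- business-ready views
GO
```

Keeping the layers as schemas in one database keeps cross-layer queries simple while still making each stage of the pipeline explicit.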
ETL:
The ETL process is the backbone of every data warehouse project. It is where data is moved from its raw state into a usable format.
- Extraction: The first step involves identifying and extracting the necessary subset of data from the source systems (in our case, CSV files from ERP and CRM systems). At this stage, the data remains unchanged. Extraction happens in the Bronze layer of the Medallion architecture.
- Transformation: This is the most critical phase, where the extracted data undergoes various manipulations to meet business requirements. This includes data cleansing (handling missing values, correcting errors), data integration (combining data from multiple sources), formatting, and normalization. Transformation takes place in the Silver layer.
- Loading: Finally, the transformed data is loaded into the target database tables within the data warehouse. Loading happens in the Gold layer.
Implementation Details:
All the source files for the project are in CSV format, from two source systems: CRM and ERP. CRM is treated as the primary data source whenever conflicts between the two are encountered. An example of a CSV file, crm_cust_info.csv, is shown in the image below.
The following six files are handled:
- CRM
  - crm_cust_info
  - prd_info
  - sales_details
- ERP
  - CUST_AZ12
  - LOC_A101
  - PX_CAT_G1V2
All the examples will primarily focus on the crm_cust_info table from CRM. Where other tables required important considerations or a substantially different approach, those implementations are provided as well.
Bronze Layer: Raw Data Ingestion
This is the landing zone for the raw data. Here the data from the source systems is stored in the database without changing it. Raw data ingestion is handled in two steps: creating a table in SQL Server inside the bronze schema that exactly mirrors the source CSV file, and loading all the data into these tables. An example of ingesting crm_cust_info.csv into our database:
IF OBJECT_ID('bronze.crm_cust_info', 'U') IS NOT NULL
DROP TABLE bronze.crm_cust_info;
CREATE TABLE bronze.crm_cust_info (
cst_id INT,
cst_key NVARCHAR(50),
cst_firstname NVARCHAR(50),
cst_lastname NVARCHAR(50),
cst_marital_status NVARCHAR(50),
cst_gndr NVARCHAR(50),
cst_create_date DATE
);
Then the data is bulk loaded into this table.
TRUNCATE TABLE bronze.crm_cust_info;
BULK INSERT bronze.crm_cust_info
FROM '<the_file_location>'
WITH (
FIRSTROW = 2, -- Skips the header row in the CSV file
FIELDTERMINATOR = ',' -- Specifies comma as the field delimiter
);
The same process is repeated for all the other files. These two steps are then consolidated into a stored procedure with logging, error handling, and some performance monitoring. The following is a snippet of the stored procedure that loads data into the bronze layer.
CREATE OR ALTER PROCEDURE bronze.load_bronze AS
BEGIN
    DECLARE @start_time DATETIME, @end_time DATETIME, @start_bronze_layer DATETIME, @end_bronze_layer DATETIME;
    -- These variables are used to check how long each table took to truncate and load,
    -- and the total time it took to load the batch.
    BEGIN TRY
        PRINT '---------------------';
        PRINT 'Loading Bronze Layer ';
        PRINT '---------------------';
        PRINT 'Loading CRM Tables...';
        SET @start_bronze_layer = GETDATE();
        SET @start_time = GETDATE();
        PRINT '>> Truncating Table: bronze.crm_cust_info';
        -- PRINT statements log to the console for debugging
        -- ... (the bulk-load queries for each of the 6 tables)
    END TRY
    BEGIN CATCH
        PRINT '--------------------------------------------';
        PRINT 'ERROR OCCURRED DURING LOADING BRONZE LAYER...';
        PRINT '--------------------------------------------';
        PRINT 'Error: ' + ERROR_MESSAGE();
        PRINT 'Error No.: ' + CAST(ERROR_NUMBER() AS NVARCHAR);
        PRINT 'Error State: ' + CAST(ERROR_STATE() AS NVARCHAR);
    END CATCH
END;
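Inside the TRY block, the timing variables can be put to work for per-table duration logging. A sketch of how one table's load might be wrapped, reusing the same '<the_file_location>' placeholder as above:

```sql
SET @start_time = GETDATE();
PRINT '>> Truncating Table: bronze.crm_cust_info';
TRUNCATE TABLE bronze.crm_cust_info;
PRINT '>> Inserting Data Into: bronze.crm_cust_info';
BULK INSERT bronze.crm_cust_info
FROM '<the_file_location>'
WITH (FIRSTROW = 2, FIELDTERMINATOR = ',');
SET @end_time = GETDATE();
-- DATEDIFF returns the elapsed time in the requested unit (seconds here)
PRINT '>> Load Duration: ' + CAST(DATEDIFF(SECOND, @start_time, @end_time) AS NVARCHAR) + ' seconds';
```

The same pattern, applied with @start_bronze_layer and @end_bronze_layer around the whole batch, gives the total load time for the layer.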
Silver Layer: Data Cleaning and Transformation
The Silver layer is where the core data transformation and integration take place. This is crucial for preparing data for analytical consumption. A data integration model was created before starting the transformations, mainly to map out how the tables would be joined later on.
Again, the transformation took place in two steps, handled by a single query per table. Each table presented unique problems requiring its own transformation techniques. As in the bronze layer, the initial step was to create the corresponding tables. The table structures initially mirrored the bronze layer and were adjusted as the transformation approach solidified while working on the bronze-layer tables.
Some Transformation Problems Handled
Let's look at the cleaning and transformation steps for key tables:
crm_cust_info Cleaning
- Handling Duplicate and NULL Primary Keys: We identified duplicate cst_id values in the bronze layer. To resolve this, we used ROW_NUMBER() with PARTITION BY cst_id ORDER BY cst_create_date DESC to select the latest record for each customer.
-- Query to identify duplicates and NULLs
SELECT cst_id, COUNT(*) AS Occurrence
FROM bronze.crm_cust_info
GROUP BY cst_id
HAVING COUNT(*) > 1 OR cst_id IS NULL
ORDER BY Occurrence DESC;
The `flag_last` column helps us pick the most recent record:
```sql
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY cst_id ORDER BY cst_create_date DESC) AS flag_last
FROM bronze.crm_cust_info;
```
- Removing Unwanted Spaces: For text columns like cst_firstname, we used TRIM() to remove leading/trailing spaces.
- Data Normalization (Low Cardinality Columns): For columns like cst_marital_status and cst_gndr, we standardized values using CASE statements to ensure consistency (e.g., 'M' to 'Male', 'F' to 'Female', 'n/a' for NULLs).
-- Check the distinct values present before normalizing
SELECT DISTINCT cst_gndr FROM bronze.crm_cust_info;
```sql
-- Example of gender normalization
…
CASE WHEN UPPER(TRIM(cst_gndr)) = 'M' THEN 'Male'
     WHEN UPPER(TRIM(cst_gndr)) = 'F' THEN 'Female'
     ELSE 'n/a'
END AS cst_gndr
…
```
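Before applying TRIM(), the affected rows can be located by comparing each value with its trimmed version. A simple check sketch for one column:

```sql
-- Rows where cst_firstname carries leading or trailing spaces
SELECT cst_firstname
FROM bronze.crm_cust_info
WHERE cst_firstname != TRIM(cst_firstname);
```

The same pattern works for any text column that should be free of stray whitespace.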
Final Transformed Query for silver.crm_cust_info:
INSERT INTO silver.crm_cust_info (
cst_id,
cst_key,
cst_firstname,
cst_lastname,
cst_marital_status,
cst_gndr,
cst_create_date)
SELECT
cst_id,
cst_key,
TRIM(cst_firstname) AS cst_firstname,
TRIM(cst_lastname) AS cst_lastname,
CASE WHEN UPPER(TRIM(cst_marital_status)) = 'S' THEN 'Single'
WHEN UPPER(TRIM(cst_marital_status)) = 'M' THEN 'Married'
ELSE 'n/a'
END AS cst_marital_status,
CASE WHEN UPPER(TRIM(cst_gndr)) = 'M' THEN 'Male'
WHEN UPPER(TRIM(cst_gndr)) = 'F' THEN 'Female'
ELSE 'n/a'
END AS cst_gndr,
cst_create_date
FROM (
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY cst_id ORDER BY cst_create_date DESC) AS flag_last
FROM bronze.crm_cust_info) t
WHERE flag_last = 1 AND cst_id IS NOT NULL;
crm_prd_info Cleaning
Similar checks for primary keys and unwanted spaces were performed. Key transformations for crm_prd_info included:
- Derived Columns: Extracting cat_id and prd_key from the prd_key column in the source to facilitate joins with other tables. We used SUBSTRING and REPLACE to format these keys consistently.
-- Example of derived columns
REPLACE(SUBSTRING([prd_key], 1, 5), '-', '_') AS [cat_id],
REPLACE(SUBSTRING([prd_key], 7, LEN([prd_key])), '-', '_') AS [prd_key]
- Handling NULL Costs: Using COALESCE(prd_cost, 0) to replace NULL product costs with 0.
- Start and End Date Logic: Business logic dictated that a NULL prd_end_dt signifies the current price. We derived prd_end_dt for historical records using LEAD() to set the end date as one day before the next record's start date, ensuring non-overlapping periods.
-- Example for prd_end_dt derivation
DATEADD(DAY, -1, LEAD(prd_start_dt, 1) OVER (PARTITION BY prd_key ORDER BY prd_start_dt)) AS prd_end_dt
Final Transformed Query for silver.crm_prd_info:
INSERT INTO [silver].[crm_prd_info](
[prd_id],
[cat_id],
[prd_key],
[prd_nm],
[prd_cost],
[prd_line],
[prd_start_dt],
[prd_end_dt])
SELECT
[prd_id],
REPLACE(SUBSTRING([prd_key], 1, 5), '-', '_') AS [cat_id],
REPLACE(SUBSTRING([prd_key], 7, LEN([prd_key])), '-', '_') AS [prd_key],
[prd_nm],
COALESCE(prd_cost, 0) AS prd_cost,
CASE WHEN UPPER(TRIM(prd_line)) = 'M' THEN 'Mounting'
WHEN UPPER(TRIM(prd_line)) = 'R' THEN 'Road'
WHEN UPPER(TRIM(prd_line)) = 'S' THEN 'Other Sales'
WHEN UPPER(TRIM(prd_line)) = 'T' THEN 'Touring'
ELSE 'n/a'
END AS prd_line,
CAST([prd_start_dt] AS DATE) AS [prd_start_dt],
DATEADD(DAY, -1, LEAD(prd_start_dt, 1) OVER (PARTITION BY prd_key ORDER BY prd_start_dt)) AS prd_end_dt
FROM [DataWarehouse].[bronze].[crm_prd_info];
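Since the end dates are derived rather than sourced, it is worth verifying that no record ends before it starts. A quick sanity check sketch against the loaded table:

```sql
-- Records where the derived end date precedes the start date (expect zero rows)
SELECT *
FROM silver.crm_prd_info
WHERE prd_end_dt < prd_start_dt;
```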
crm_sales_details Cleaning
The initial state of the crm_sales_details table:
The following transformations were applied to this table:
- Date Column Conversion and Validation: Sales dates were sometimes 0 or in an incorrect format. We converted these to NULL and then cast valid entries to the DATE format. We also implicitly validated that Order Date <= Shipping Date <= Due Date by ensuring proper casting.
-- Example for date conversion
CASE WHEN sls_order_dt = 0 OR LEN(sls_order_dt) != 8 THEN NULL
ELSE CAST(CAST(sls_order_dt AS VARCHAR) AS DATE)
END AS [sls_order_dt]
- Checking Derived Values: For sls_sales, sls_quantity, and sls_price, we ensured logical consistency (e.g., sls_sales = sls_quantity * sls_price) and handled NULL or negative values using COALESCE and CASE statements.
Final Transformed Query for silver.crm_sales_details:
INSERT INTO [silver].[crm_sales_details] (
[sls_ord_num]
,[sls_prd_key]
,[sls_cust_id]
,[sls_order_dt]
,[sls_ship_dt]
,[sls_due_dt]
,[sls_sales]
,[sls_quantity]
,[sls_price])
SELECT
[sls_ord_num],
[sls_prd_key],
[sls_cust_id],
CASE WHEN sls_order_dt = 0 OR LEN(sls_order_dt) != 8 THEN NULL
ELSE CAST(CAST(sls_order_dt AS VARCHAR) AS DATE)
END AS [sls_order_dt],
CASE WHEN sls_ship_dt = 0 OR LEN(sls_ship_dt) != 8 THEN NULL
ELSE CAST(CAST(sls_ship_dt AS VARCHAR) AS DATE)
END AS [sls_ship_dt],
CASE WHEN sls_due_dt
= 0 OR LEN(sls_due_dt) != 8 THEN NULL
ELSE CAST(CAST(sls_due_dt AS VARCHAR) AS DATE)
END AS [sls_due_dt],
COALESCE([sls_sales], [sls_quantity] * [sls_price]) AS [sls_sales],
COALESCE([sls_quantity], [sls_sales] / NULLIF([sls_price], 0)) AS [sls_quantity],
CASE WHEN [sls_price] < 0 THEN [sls_sales] / NULLIF([sls_quantity], 0)
ELSE COALESCE([sls_price], [sls_sales] / NULLIF([sls_quantity], 0))
END AS [sls_price]
FROM bronze.crm_sales_details;
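After loading, the business rule sls_sales = sls_quantity * sls_price can be re-checked on the silver table. One possible validation sketch:

```sql
-- Rows that still violate the consistency rule or contain bad values (expect zero rows)
SELECT sls_sales, sls_quantity, sls_price
FROM silver.crm_sales_details
WHERE sls_sales != sls_quantity * sls_price
   OR sls_sales IS NULL OR sls_quantity IS NULL OR sls_price IS NULL
   OR sls_sales <= 0 OR sls_quantity <= 0 OR sls_price <= 0;
```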
The remaining ERP tables (erp_cust_az12, erp_loc_a101, erp_px_cat_g1v2) underwent similar cleaning processes, including handling NULLs, duplicates, date formats, and data normalization. All these silver-layer insertions were then consolidated into another stored procedure, silver.load_silver, for automated execution.
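silver.load_silver follows the same pattern as bronze.load_bronze. A skeleton sketch, where the per-table INSERT ... SELECT statements are the transformation queries shown above:

```sql
CREATE OR ALTER PROCEDURE silver.load_silver AS
BEGIN
    BEGIN TRY
        PRINT 'Loading Silver Layer...';
        -- Each table is truncated, then reloaded from its bronze counterpart
        TRUNCATE TABLE silver.crm_cust_info;
        -- INSERT INTO silver.crm_cust_info ... (transformation query shown earlier)
        -- ... repeated for the remaining five tables
    END TRY
    BEGIN CATCH
        PRINT 'ERROR OCCURRED DURING LOADING SILVER LAYER...';
        PRINT 'Error: ' + ERROR_MESSAGE();
    END CATCH
END;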
Gold Layer: Analytical Data Model for Business Ready Data
The Gold layer is designed for optimal analytical performance and ease of understanding for business users. We adopted a Star Schema model, which is highly effective for BI and reporting.
In a Star Schema:
- A central fact table contains quantitative information (measures) about business events or transactions (e.g., sales amount, order quantity). It answers, "How Much?" or "How Many?".
- It is surrounded by dimension tables, which provide descriptive context to the data (e.g., customer names, product categories, dates, locations). They answer "Who?" "What?" "Where?".
We created views in the gold schema to represent our dimension and fact tables. Views are excellent for the Gold layer as they provide a logical abstraction without duplicating data, always reflecting the latest transformations from the Silver layer.
Customer Dimension (gold.dim_customers)
This dimension integrates customer information from CRM and ERP sources. A key challenge was reconciling two gender columns from different sources. We prioritized the CRM source and defaulted to 'n/a' if both were missing. We also added a surrogate key (customer_key) using ROW_NUMBER() for efficient joining with fact tables.
CREATE VIEW gold.dim_customers AS
SELECT
ROW_NUMBER() OVER (ORDER BY ci.cst_id) AS customer_key, -- Add surrogate key
ci.cst_id AS customer_id,
ci.cst_key AS customer_number,
ci.cst_firstname AS first_name,
ci.cst_lastname AS last_name,
cl.cntry AS country,
ci.cst_marital_status AS marital_status,
-- Data integration for gender: CRM is primary, then ERP, then 'n/a'
CASE WHEN ci.cst_gndr != 'n/a' THEN ci.cst_gndr
     ELSE COALESCE(ca.gen, 'n/a')
END AS gender,
ca.bdate AS birth_date,
ci.cst_create_date AS created_date
FROM silver.crm_cust_info ci
LEFT JOIN silver.erp_cust_az12 ca
ON ci.cst_key = ca.cid
LEFT JOIN silver.erp_loc_a101 cl
ON ci.cst_key = cl.cid;
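A quick sanity check on the integrated column, sketched below, should return only the three standardized values:

```sql
-- Expect only 'Male', 'Female', and 'n/a' after integration
SELECT DISTINCT gender FROM gold.dim_customers;
```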
Product Dimension (gold.dim_products)
This dimension provides descriptive information about products. Based on the project scope, we focused on the latest product information, filtering out historical records using WHERE pi.prd_end_dt IS NULL. A product_key surrogate key was also added.
CREATE VIEW gold.dim_products AS
SELECT
ROW_NUMBER() OVER (ORDER BY pi.prd_start_dt, pi.prd_key) AS product_key, -- Surrogate key
pi.prd_id AS product_id,
pi.prd_key AS product_number,
pi.prd_nm AS product_name,
pi.cat_id AS category_id,
ep.cat AS category,
ep.subcat AS subcategory,
ep.maintenance,
pi.prd_cost AS product_cost,
pi.prd_line AS product_line,
pi.prd_start_dt AS start_date
FROM silver.crm_prd_info pi
LEFT JOIN silver.erp_px_cat_g1v2 ep
ON pi.cat_id = ep.id
WHERE pi.prd_end_dt IS NULL;
-- Filters for current product information
Sales Fact (gold.fact_sales)
The fact table contains the core sales transaction data. It links to our dimension tables using their respective surrogate keys (product_key and customer_key), allowing for flexible and performant analytical queries.
CREATE VIEW gold.fact_sales AS
SELECT
sd.sls_ord_num AS order_number,
dp.product_key, -- Surrogate key from dim_products
dc.customer_key, -- Surrogate key from dim_customers
sd.sls_order_dt AS order_date,
sd.sls_ship_dt AS shipping_date,
sd.sls_due_dt AS due_date,
sd.sls_sales AS sales,
sd.sls_quantity AS quantity,
sd.sls_price AS price
FROM silver.crm_sales_details sd
-- JOIN with dimension tables using surrogate keys
LEFT JOIN gold.dim_products dp
ON sd.sls_prd_key = dp.product_number -- Join on business key, but select surrogate key
LEFT JOIN gold.dim_customers dc
ON sd.sls_cust_id = dc.customer_id; -- Join on business key, but select surrogate key
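Because these are LEFT JOINs, any sales row without a matching dimension record surfaces as a NULL surrogate key. A sketch of a check for such orphaned facts:

```sql
-- Fact rows that failed to match a dimension record (expect zero rows)
SELECT *
FROM gold.fact_sales f
WHERE f.customer_key IS NULL OR f.product_key IS NULL;
```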
Final Data Model and Flow
The resulting Star Schema provides a clear and intuitive structure for reporting:
And the complete data flow from source to consumption:
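To illustrate how this model is consumed, here is a hypothetical analytical query against the gold views; the grouping columns are just one possible choice:

```sql
-- Total sales and quantity by country and product category
SELECT
    dc.country,
    dp.category,
    SUM(fs.sales)    AS total_sales,
    SUM(fs.quantity) AS total_quantity
FROM gold.fact_sales fs
LEFT JOIN gold.dim_customers dc ON fs.customer_key = dc.customer_key
LEFT JOIN gold.dim_products dp  ON fs.product_key = dp.product_key
GROUP BY dc.country, dp.category
ORDER BY total_sales DESC;
```

Because the fact table carries only surrogate keys and measures, every such question reduces to the same simple join-and-aggregate shape.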
Conclusion:
This project successfully built a modern data warehouse in SQL Server using the Medallion Architecture, consolidating sales data from ERP and CRM systems into a unified, clean, and structured resource for analytics. I gained hands-on experience in raw data ingestion (Bronze layer), resolving data quality issues and transforming data (Silver layer), and designing an optimized analytical data model using a Star Schema in the Gold layer for business intelligence and reporting. This experience significantly enhanced my skills in SQL Server, ETL processes, and data warehousing best practices, demonstrating the Medallion Architecture's effectiveness in creating reliable data pipelines and a "single source of truth".
Feel free to connect with me on LinkedIn for any queries! You can find the full project code and resources on my GitHub repository.