Supercharging Databricks Asset Bundles
Serge Artishev

Publish Date: May 16

Multi-Environment Workflows with CI/CD

In my previous post, I introduced the Databricks Asset Bundle (DAB) template project that helps you get started quickly with a structured approach to Databricks development. Today, I want to dive deeper into how DAB handles variables, parameterization, and CI/CD automation across multiple environments.

The Power of Parameterization

One of the most powerful features of Databricks Asset Bundles is the ability to parameterize nearly everything using variables. This allows us to define workflows once and deploy them to multiple environments with different configurations.

Variable Structure in DAB

The template organizes variables in a clear, hierarchical structure:

variables/
├── common.yml                # Variables shared across all environments
├── {workflowName}.dev.yml    # Workflow-specific variables for development
├── {workflowName}.test.yml   # Workflow-specific variables for testing
├── {workflowName}.prod.yml   # Workflow-specific variables for production

This organization gives us several benefits:

  • Clear separation of concerns (common vs specific)
  • Environment-specific configurations (.dev, .test, .prod)
  • Logical grouping of related variables per workflow
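
For example, truly shared values such as timezone_id (referenced later in the workflow definition) and service_principal_id (passed in at deploy time) naturally belong in variables/common.yml. Here's a minimal sketch of what that file could look like; the declarations are assumptions based on how those variables are used later in this post, and the defaults are purely illustrative:

# In variables/common.yml (illustrative sketch)
variables:
  timezone_id:
    description: "Timezone applied to job schedules"
    default: "UTC"
  service_principal_id:
    description: "Service principal for CI/CD deployments, overridden with --var"
    default: ""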

Including Variables in Your Project

The main databricks.yml file includes these variable files based on the target environment:

include:
  - resources/**/*.yml
  - variables/common.yml
  - variables/*.dev.yml  # This changes based on environment

NOTE:
Unfortunately, the Databricks CLI does not yet support using the ${bundle.target} placeholder in the include paths for variable files. This is a bit of a pain, but it's a known limitation and I trust it will be fixed in a future release. For now, we need to update databricks.yml to include the correct variables/*.{environment}.yml file for each environment, and we can use the yq command to make that change automatically when the CI/CD pipeline runs.

When deploying to different environments, we simply swap out which environment-specific variable files to include. For example, in our CI/CD pipeline for test deployment:

# Update include path for test environment
yq -i 'with(.include[] | select(. == "variables/*.dev.yml"); . = "variables/*.test.yml")' databricks.yml
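
Once the include has been rewritten, it's worth running a quick validation so a bad swap fails fast rather than at deploy time (assuming test is defined under targets in databricks.yml):

# Confirm the configuration still resolves after the include swap
databricks bundle validate --target test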

Real-World Example: Parameterizing SharePoint Workflows

Let's look at a real example from the template. The SharePoint Excel refresh workflow connects to SharePoint, processes Excel files, and loads the data to Delta tables. Here's how we parameterize it:

  1. Define environment-specific variables:
# In variables/sharepoint.dev.yml
variables:
  sharepoint:
    type: complex
    default:
      secret_scope: "azure"
      tenant_id_key: "azure-tenant-id"
      client_id_key: "azure-app-client-id"
      client_secret_key: "azure-app-client-secret"
      site_id_key: "sharepoint-site-id"
      drive_id_key: "sharepoint-drive-id"
      modified_in_last_hours: 240
      target_catalog: "bronze"
      target_schema: "sharepoint_dev"
      sync_schedule: "0 0 0 * * ?"
      concurrency: 10

Take note of the target_schema variable: it lets us deploy the same workflow to different environments with different schema names. The same applies to target_catalog and any other variable used in the workflow definition.

  2. Reference variables in workflow definition:
# In resources/sharepoint/sharepoint_excel_refresh.yml
resources:
  jobs:
    sharepoint_excel_refresh:
      name: "${bundle.name} Sharepoint Excel Refresh"
      tasks:
        - task_key: sharepoint_excel_file_list
          notebook_task:
            notebook_path: "${workspace.file_path}/notebooks/sharepoint/excel_list_process"
            base_parameters:
              secret_scope: "${var.sharepoint.secret_scope}"
              tenant_id_key: "${var.sharepoint.tenant_id_key}"
              # More parameters...
              modified_in_last_hours: "${var.sharepoint.modified_in_last_hours}"
      schedule:
        quartz_cron_expression: "${var.sharepoint.sync_schedule}"
        timezone_id: "${var.timezone_id}"

By using this approach, we can deploy the same workflow to different environments with environment-specific configurations. For instance, in production we might have different:

  • Target schema names (sharepoint_prod vs sharepoint_dev)
  • Sync schedules (hourly in production, daily in dev)
  • Lookback periods (24 hours in production, 240 hours in dev)
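
To make that concrete, here's a sketch of what a matching variables/sharepoint.prod.yml could look like, mirroring the dev file above; the production values (hourly schedule, 24-hour lookback) are illustrative:

# In variables/sharepoint.prod.yml (illustrative values)
variables:
  sharepoint:
    type: complex
    default:
      secret_scope: "azure"
      tenant_id_key: "azure-tenant-id"
      client_id_key: "azure-app-client-id"
      client_secret_key: "azure-app-client-secret"
      site_id_key: "sharepoint-site-id"
      drive_id_key: "sharepoint-drive-id"
      modified_in_last_hours: 24          # shorter lookback in production
      target_catalog: "bronze"
      target_schema: "sharepoint_prod"    # environment-specific schema
      sync_schedule: "0 0 * * * ?"        # hourly in production
      concurrency: 10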

The complex variable type lets us define an object with multiple properties, which is perfect for the SharePoint example: all of the workflow-specific settings live in one place.

Automated CI/CD Pipeline

The real magic happens when we automate deployment across environments. The template includes GitHub Actions workflows that:

  1. Validate on PRs and feature branches:

    • Run unit tests
    • Validate DAB bundle configuration
    • Check code quality
  2. Auto-deploy to test environment:

    • Triggered on pushes to the develop branch
    • Updates variable includes for test environment
    • Authenticates with service principal
    • Deploys the DAB bundle
  3. Deploy to production:

    • Triggered on pushes to the main branch
    • Updates variable includes for production environment
    • Adds approval steps for production deployment
    • Deploys with production-specific settings
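
To give a feel for the structure, here's a condensed sketch of what the first stage (PR validation) might look like. The workflow file name and secret names are assumptions, and the real template also runs unit tests and code-quality checks in this stage:

# .github/workflows/validate.yml (condensed, illustrative sketch)
name: Validate bundle

on:
  pull_request:
    branches: [develop, main]

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install Databricks CLI
        uses: databricks/setup-cli@main

      - name: Validate DAB configuration
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
          DATABRICKS_CLIENT_ID: ${{ secrets.SERVICE_PRINCIPAL_APP_ID }}
          DATABRICKS_CLIENT_SECRET: ${{ secrets.SERVICE_PRINCIPAL_SECRET }}
        run: databricks bundle validate --target dev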

Authentication with Service Principals

A key part of the CI/CD automation is using service principals for authentication. In the GitHub workflows, we:

  1. Obtain an OAuth token using the service principal credentials
  2. Use that token for Databricks CLI authentication
  3. Pass the service principal ID as a variable during deployment
# Get OAuth token for service principal
response=$(curl -s -X POST \
  -u "${{ secrets.SERVICE_PRINCIPAL_APP_ID }}:${{ secrets.SERVICE_PRINCIPAL_SECRET }}" \
  "$DATABRICKS_HOST/oidc/v1/token" \
  -d "grant_type=client_credentials&scope=all-apis")

# Extract token and set environment variables
token=$(echo $response | jq -r '.access_token')
export DATABRICKS_TOKEN="$token"

# Deploy with service principal ID variable
databricks bundle deploy --target test --var service_principal_id=${{ secrets.SERVICE_PRINCIPAL_APP_ID }}
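
Where does service_principal_id actually get used? A common pattern, and an assumption about this template, is to reference it in the target's run_as block in databricks.yml so that deployed jobs execute as the service principal rather than as the deploying user:

# In databricks.yml (illustrative sketch)
targets:
  test:
    run_as:
      service_principal_name: ${var.service_principal_id}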

Advanced Techniques

Here are some advanced techniques you can use with this setup:

1. Dynamic Configuration Based on Branch

You can make your CI/CD pipeline smarter by adjusting configuration based on the Git branch:

- name: Set environment variables based on branch
  run: |
    if [[ "${{ github.ref }}" == "refs/heads/main" ]]; then
      echo "TARGET_ENV=prod" >> $GITHUB_ENV
    elif [[ "${{ github.ref }}" == "refs/heads/develop" ]]; then
      echo "TARGET_ENV=test" >> $GITHUB_ENV
    else
      echo "TARGET_ENV=dev" >> $GITHUB_ENV
    fi

- name: Update variable includes
  run: |
    yq -i 'with(.include[] | select(. == "variables/*.dev.yml"); . = "variables/*.${{ env.TARGET_ENV }}.yml")' databricks.yml

2. Feature Flags via Variables

You can implement simple feature flags using variables:

variables:
  features:
    type: complex
    default:
      enable_advanced_analytics: true
      enable_real_time_processing: false

Then in your workflows, gate the optional task on the flag. Job tasks don't have a direct if field, but a condition_task plus a depends_on outcome achieves the same effect:

tasks:
  - task_key: check_analytics_enabled
    condition_task:
      op: "EQUAL_TO"
      left: "${var.features.enable_advanced_analytics}"
      right: "true"
  - task_key: optional_analytics_step
    depends_on:
      - {task_key: check_analytics_enabled, outcome: "true"}
    notebook_task:
      notebook_path: "/path/to/analytics"

3. Template Workflows with Parameters

You can create reusable workflow templates by parameterizing common patterns:

# resources/templates/ingest_template.yml
resources:
  jobs:
    ${var.job_name}:  # Dynamic job name
      name: "Ingest ${var.source_name} Data"
      tasks:
        - task_key: ingest_data
          notebook_task:
            notebook_path: "/Shared/ingest/${var.source_type}"
            base_parameters:
              source_config: ${var.source_config}
              target_table: ${var.target_table}
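
Assuming the template's scalar variables (job_name, source_name, source_type, target_table) are declared with defaults, each concrete pipeline can then override them at deploy time with repeated --var flags, the same mechanism used for service_principal_id earlier. Complex variables like source_config would still come from a variables file, since they can't be set on the command line:

# Deploy one concrete instance of the ingest template (values are illustrative)
databricks bundle deploy --target test \
  --var="job_name=ingest_salesforce" \
  --var="source_name=Salesforce" \
  --var="source_type=salesforce" \
  --var="target_table=bronze.salesforce_accounts"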

Conclusion

By combining DAB's parameterization capabilities with automated CI/CD pipelines, you can create a robust, maintainable system for deploying Databricks resources across environments. This approach gives you:

  • Clear separation of configuration from implementation
  • Environment-specific settings without code duplication
  • Automated testing and deployment
  • Consistent deployment process across environments
  • Version-controlled infrastructure and configuration

What's your experience with Databricks Asset Bundles? Have you found other useful patterns for managing multi-environment deployments? Let me know in the comments!
