Multi-Environment Workflows with CI/CD
In my previous post, I introduced the Databricks Asset Bundle (DAB) template project that helps you get started quickly with a structured approach to Databricks development. Today, I want to dive deeper into how DAB handles variables, parameterization, and CI/CD automation across multiple environments.
The Power of Parameterization
One of the most powerful features of Databricks Asset Bundles is the ability to parameterize nearly everything using variables. This allows us to define workflows once and deploy them to multiple environments with different configurations.
Variable Structure in DAB
The template organizes variables in a clear, hierarchical structure:
variables/
├── common.yml # Variables shared across all environments
├── {workflowName}.dev.yml # Workflow-specific variables for development
├── {workflowName}.test.yml # Workflow-specific variables for testing
└── {workflowName}.prod.yml # Workflow-specific variables for production
This organization gives us several benefits:
- Clear separation of concerns (common vs specific)
- Environment-specific configurations (.dev, .test, .prod)
- Logical grouping of related variables per workflow
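For example, common.yml is a natural home for values that stay the same everywhere. The timezone_id referenced later in the workflow definition is a good candidate; whether it actually lives in common.yml is my assumption, and the notification address below is purely illustrative:

# In variables/common.yml (sketch; only timezone_id appears later in this post)
variables:
  timezone_id:
    description: "Timezone used for all job schedules"
    default: "UTC"
  notification_email:
    description: "Address for job failure alerts (illustrative)"
    default: "data-platform@example.com"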
Including Variables in Your Project
The main databricks.yml file includes these variable files based on the target environment:
include:
- resources/**/*.yml
- variables/common.yml
- variables/*.dev.yml # This changes based on environment
NOTE:
Unfortunately, the Databricks CLI does not yet support the ${bundle.target} placeholder in variable file includes. This is a bit of a pain, but it's a known issue and I trust it will be fixed in a future release. For now, we need to manually update the databricks.yml file to include the correct variables/*.{environment}.yml file for each environment. To do that, we can use the yq command to update databricks.yml when running the CI/CD pipeline.
When deploying to different environments, we simply swap out which environment-specific variable files to include. For example, in our CI/CD pipeline for test deployment:
# Update include path for test environment
yq -i 'with(.include[] | select(. == "variables/*.dev.yml"); . = "variables/*.test.yml")' databricks.yml
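After swapping the include, it's worth validating the bundle against the target before deploying:

# Confirm the bundle still resolves with the test variable files
databricks bundle validate --target test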
Real-World Example: Parameterizing SharePoint Workflows
Let's look at a real example from the template. The SharePoint Excel refresh workflow connects to SharePoint, processes Excel files, and loads the data to Delta tables. Here's how we parameterize it:
- Define environment-specific variables:
# In variables/sharepoint.dev.yml
variables:
  sharepoint:
    type: complex
    default:
      secret_scope: "azure"
      tenant_id_key: "azure-tenant-id"
      client_id_key: "azure-app-client-id"
      client_secret_key: "azure-app-client-secret"
      site_id_key: "sharepoint-site-id"
      drive_id_key: "sharepoint-drive-id"
      modified_in_last_hours: 240
      target_catalog: "bronze"
      target_schema: "sharepoint_dev"
      sync_schedule: "0 0 0 * * ?"
      concurrency: 10
Take note of the target_schema variable. We can use it to deploy the same workflow to different environments with different schema names. The same goes for the target_catalog variable and any other variables used in the workflow definition.
- Reference variables in workflow definition:
# In resources/sharepoint/sharepoint_excel_refresh.yml
resources:
  jobs:
    sharepoint_excel_refresh:
      name: "${bundle.name} Sharepoint Excel Refresh"
      tasks:
        - task_key: sharepoint_excel_file_list
          notebook_task:
            notebook_path: "${workspace.file_path}/notebooks/sharepoint/excel_list_process"
            base_parameters:
              secret_scope: "${var.sharepoint.secret_scope}"
              tenant_id_key: "${var.sharepoint.tenant_id_key}"
              # More parameters...
              modified_in_last_hours: "${var.sharepoint.modified_in_last_hours}"
      schedule:
        quartz_cron_expression: "${var.sharepoint.sync_schedule}"
        timezone_id: "${var.timezone_id}"
By using this approach, we can deploy the same workflow to different environments with environment-specific configurations. For instance, in production we might have different:
- Target schema names (sharepoint_prod vs sharepoint_dev)
- Sync schedules (hourly in production, daily in dev)
- Lookback periods (24 hours in production, 240 hours in dev)
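As a concrete illustration, a production counterpart to the dev file above could look like the sketch below; the exact production values are assumptions based on the differences just listed:

# In variables/sharepoint.prod.yml (sketch)
variables:
  sharepoint:
    type: complex
    default:
      secret_scope: "azure"
      tenant_id_key: "azure-tenant-id"
      client_id_key: "azure-app-client-id"
      client_secret_key: "azure-app-client-secret"
      site_id_key: "sharepoint-site-id"
      drive_id_key: "sharepoint-drive-id"
      modified_in_last_hours: 24       # shorter lookback in production
      target_catalog: "bronze"
      target_schema: "sharepoint_prod"
      sync_schedule: "0 0 * * * ?"     # hourly instead of daily at midnight
      concurrency: 10

Since each environment file defines the full complex object, values that don't change between environments are simply repeated.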
The complex type lets us define an object with multiple properties, which is perfect for our SharePoint example: all of the workflow-specific variables live in one place.
Automated CI/CD Pipeline
The real magic happens when we automate deployment across environments. The template includes GitHub Actions workflows that:
- Validate on PRs and feature branches:
  - Run unit tests
  - Validate DAB bundle configuration
  - Check code quality
- Auto-deploy to test environment:
  - Triggered on pushes to the develop branch
  - Updates variable includes for test environment
  - Authenticates with service principal
  - Deploys the DAB bundle
- Deploy to production:
  - Triggered on pushes to the main branch
  - Updates variable includes for production environment
  - Adds approval steps for production deployment
  - Deploys with production-specific settings
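To make this concrete, here is a rough sketch of what the test-deployment workflow could look like. The secret names match the ones used later in this post, but the workflow file name, the setup-cli action pin, and the use of the CLI's built-in OAuth environment variables (instead of the manual token exchange shown in the next section) are my assumptions:

# .github/workflows/deploy-test.yml (sketch)
name: Deploy to test

on:
  push:
    branches: [develop]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install Databricks CLI
        uses: databricks/setup-cli@main

      - name: Point includes at the test variable files
        run: |
          yq -i 'with(.include[] | select(. == "variables/*.dev.yml"); . = "variables/*.test.yml")' databricks.yml

      - name: Validate and deploy
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
          DATABRICKS_CLIENT_ID: ${{ secrets.SERVICE_PRINCIPAL_APP_ID }}
          DATABRICKS_CLIENT_SECRET: ${{ secrets.SERVICE_PRINCIPAL_SECRET }}
        run: |
          databricks bundle validate --target test
          databricks bundle deploy --target test --var service_principal_id=${{ secrets.SERVICE_PRINCIPAL_APP_ID }}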
Authentication with Service Principals
A key part of the CI/CD automation is using service principals for authentication. In the GitHub workflows, we:
- Obtain an OAuth token using the service principal credentials
- Use that token for Databricks CLI authentication
- Pass the service principal ID as a variable during deployment
# Get OAuth token for service principal
response=$(curl -s -X POST \
-u "${{ secrets.SERVICE_PRINCIPAL_APP_ID }}:${{ secrets.SERVICE_PRINCIPAL_SECRET }}" \
"$DATABRICKS_HOST/oidc/v1/token" \
-d "grant_type=client_credentials&scope=all-apis")
# Extract token and set environment variables
token=$(echo $response | jq -r '.access_token')
export DATABRICKS_TOKEN="$token"
# Deploy with service principal ID variable
databricks bundle deploy --target test --var service_principal_id=${{ secrets.SERVICE_PRINCIPAL_APP_ID }}
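On the bundle side, service_principal_id has to be declared as a variable before it can be passed with --var. A minimal sketch of how it could be wired up, assuming you also want the deployed jobs to run as the service principal (the run_as placement here is my choice, not necessarily the template's):

# In databricks.yml (sketch)
variables:
  service_principal_id:
    description: "Application ID of the deployment service principal"

targets:
  test:
    run_as:
      service_principal_name: ${var.service_principal_id}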
Advanced Techniques
Here are some advanced techniques you can use with this setup:
1. Dynamic Configuration Based on Branch
You can make your CI/CD pipeline smarter by adjusting configuration based on the Git branch:
- name: Set environment variables based on branch
  run: |
    if [[ "${{ github.ref }}" == "refs/heads/main" ]]; then
      echo "TARGET_ENV=prod" >> $GITHUB_ENV
    elif [[ "${{ github.ref }}" == "refs/heads/develop" ]]; then
      echo "TARGET_ENV=test" >> $GITHUB_ENV
    else
      echo "TARGET_ENV=dev" >> $GITHUB_ENV
    fi

- name: Update variable includes
  run: |
    yq -i 'with(.include[] | select(. == "variables/*.dev.yml"); . = "variables/*.${{ env.TARGET_ENV }}.yml")' databricks.yml
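The same TARGET_ENV value can then drive the rest of the pipeline, for example the deploy step itself (a sketch):

- name: Deploy bundle
  run: databricks bundle deploy --target ${{ env.TARGET_ENV }} --var service_principal_id=${{ secrets.SERVICE_PRINCIPAL_APP_ID }}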
2. Feature Flags via Variables
You can implement simple feature flags using variables:
variables:
  features:
    type: complex
    default:
      enable_advanced_analytics: true
      enable_real_time_processing: false
Then in your workflows:
tasks:
  - task_key: optional_analytics_step
    notebook_task:
      notebook_path: "/path/to/analytics"
    if: ${var.features.enable_advanced_analytics}
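If the jobs schema in your workspace doesn't accept an if field like this directly, the same gate can be expressed with a condition_task that the optional task depends on. A sketch, assuming the flag resolves to the string "true" or "false" at deploy time:

tasks:
  - task_key: check_advanced_analytics
    condition_task:
      op: "EQUAL_TO"
      left: "${var.features.enable_advanced_analytics}"
      right: "true"

  - task_key: optional_analytics_step
    depends_on:
      - task_key: check_advanced_analytics
        outcome: "true"
    notebook_task:
      notebook_path: "/path/to/analytics"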
3. Template Workflows with Parameters
You can create reusable workflow templates by parameterizing common patterns:
# resources/templates/ingest_template.yml
resources:
  jobs:
    ${var.job_name}: # Dynamic job name
      name: "Ingest ${var.source_name} Data"
      tasks:
        - task_key: ingest_data
          notebook_task:
            notebook_path: "/Shared/ingest/${var.source_type}"
            base_parameters:
              source_config: ${var.source_config}
              target_table: ${var.target_table}
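Each source then only needs its own small variable file to instantiate the template. A hypothetical example, with every name below being illustrative:

# In variables/ingest_sales.dev.yml (hypothetical)
variables:
  job_name:
    default: "ingest_sales"
  source_name:
    default: "Sales"
  source_type:
    default: "sql_server"
  source_config:
    default: "configs/sales_ingest.json"
  target_table:
    default: "bronze.sales_raw"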
Conclusion
By combining DAB's parameterization capabilities with automated CI/CD pipelines, you can create a robust, maintainable system for deploying Databricks resources across environments. This approach gives you:
- Clear separation of configuration from implementation
- Environment-specific settings without code duplication
- Automated testing and deployment
- Consistent deployment process across environments
- Version-controlled infrastructure and configuration
What's your experience with Databricks Asset Bundles? Have you found other useful patterns for managing multi-environment deployments? Let me know in the comments!