dbt in Production: The Patterns That Scale
Getting a dbt project to work is easy. Keeping it maintainable at 200 models, across multiple environments, with a team of engineers, is the harder problem. Here are the patterns that scale.
The first dbt project most engineers build is simple: a handful of staging models, a few marts, everything in one folder. This works fine at small scale. The problems appear around 50-100 models, when staging models start referencing each other, test coverage is inconsistent, naming conventions have drifted, and nobody is quite sure what the canonical source for a given metric is.
This post covers the architectural and operational patterns that keep dbt projects maintainable as they grow: project structure, environment configuration, macro libraries, CI/CD, custom schema generation, and the testing philosophy that actually gets followed rather than abandoned when the team gets busy.
Project Structure at Scale
The folder structure of a dbt project is the primary navigation mechanism for new engineers. A clear, consistent structure means anyone can find what they are looking for without asking. The three-layer structure (staging, intermediate, marts) is the right foundation; the question is how to organize within each layer as the project grows.
models/
  staging/
    salesforce/
      _salesforce__sources.yml        # source definitions
      _salesforce__models.yml         # model docs + tests
      stg_salesforce__accounts.sql
      stg_salesforce__contacts.sql
    stripe/
      _stripe__sources.yml
      stg_stripe__charges.sql
      stg_stripe__refunds.sql
  intermediate/
    int_orders_with_payments.sql
    int_customer_sessions.sql
  marts/
    core/
      _core__models.yml
      fct_orders.sql
      dim_customers.sql
    finance/
      fct_revenue.sql
      fct_refunds.sql
    marketing/
      fct_campaign_performance.sql
tests/
  generic/
    test_column_is_url.sql            # custom generic tests
  singular/
    test_revenue_reconciliation.sql
macros/
  generate_schema_name.sql
  surrogate_key.sql
  get_fiscal_period.sql

Source-scoped folders in staging (one subfolder per source system) make it clear which staging models correspond to which sources. The YAML convention (underscore prefix, double underscore separator) makes schema files visually distinct from SQL models and keeps source definitions co-located with their models.
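For illustration, a staging model in this layout might look like the following sketch. The column names are assumptions about the Stripe charges source, not taken from the post:

```sql
-- models/staging/stripe/stg_stripe__charges.sql (illustrative sketch)
with source as (

    select * from {{ source('stripe', 'charges') }}

),

renamed as (

    select
        id as charge_id,
        customer as customer_id,
        amount / 100.0 as amount_usd,   -- Stripe reports amounts in cents
        status,
        created as created_at
    from source

)

select * from renamed
```

Staging models do nothing but rename, cast, and lightly clean; all joins and business logic live in intermediate and mart models.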
Environment Configuration
dbt profiles and environment variables control how the project behaves across development, staging, and production. Key configurations to manage explicitly:
# dbt_project.yml — project-level configuration
name: 'analytics'
version: '1.0.0'
profile: 'analytics'

vars:
  # Override per environment (e.g. with --vars or env_var())
  start_date: '2020-01-01'
  payment_methods: ['card', 'bank_transfer']

models:
  analytics:
    staging:
      +materialized: view
      +schema: staging
    intermediate:
      +materialized: ephemeral
    marts:
      +materialized: table
      core:
        +schema: core
      finance:
        +schema: finance
# profiles.yml — environment-specific
analytics:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: "{{ env_var('SNOWFLAKE_ACCOUNT') }}"
      database: dev_db
      schema: "{{ env_var('DBT_USER', 'anonymous') }}_analytics"
      warehouse: transform_wh
    prod:
      type: snowflake
      account: "{{ env_var('SNOWFLAKE_ACCOUNT') }}"
      database: prod_db
      schema: analytics
      warehouse: transform_wh

Custom Schema Generation
By default, dbt appends the schema defined in the model config to the target schema, producing names like analytics_core. For production systems, you usually want schemas like core and finance without the environment prefix. The generate_schema_name macro controls this behavior.
-- macros/generate_schema_name.sql
{% macro generate_schema_name(custom_schema_name, node) -%}

    {%- set default_schema = target.schema -%}

    {%- if custom_schema_name is none -%}
        {{ default_schema }}

    {%- elif target.name == 'prod' -%}
        {# In production: use the custom schema name directly (no prefix) #}
        {{ custom_schema_name | trim }}

    {%- else -%}
        {# In dev/staging: prefix with the user schema to avoid conflicts #}
        {{ default_schema }}_{{ custom_schema_name | trim }}

    {%- endif -%}

{%- endmacro %}

Note the Jinja comments ({# ... #}) rather than SQL comments: this macro's rendered output becomes the schema name itself, so literal -- comment text would leak into it. This macro ensures that production deployments write to clean schema names (core, finance, marketing) while development environments use user-prefixed schemas (rkirsch_analytics_core) to avoid collisions.
Macro Libraries
Custom macros eliminate repeated logic and enforce consistency. Three categories of macros worth building early:
-- Surrogate key generation (consistent across all fact tables)
-- macros/surrogate_key.sql
{% macro surrogate_key(field_list) %}
    MD5(
        CONCAT_WS('|',
            {% for field in field_list %}
            COALESCE(CAST({{ field }} AS VARCHAR), 'NULL')
            {%- if not loop.last %}, {% endif %}
            {% endfor %}
        )
    )
{% endmacro %}

-- Usage
SELECT
    {{ surrogate_key(['order_id', 'line_item_id']) }} AS order_line_sk
-- Date spines for filling gaps in date ranges: use dbt_utils.date_spine
-- rather than hand-rolling one.

-- Fiscal period bucketing (calendar-aligned quarters)
-- macros/get_fiscal_period.sql
{% macro get_fiscal_period(date_column) %}
    CASE
        WHEN MONTH({{ date_column }}) IN (1, 2, 3) THEN 'Q1'
        WHEN MONTH({{ date_column }}) IN (4, 5, 6) THEN 'Q2'
        WHEN MONTH({{ date_column }}) IN (7, 8, 9) THEN 'Q3'
        WHEN MONTH({{ date_column }}) IN (10, 11, 12) THEN 'Q4'
    END
{% endmacro %}

CI/CD for dbt
A dbt CI pipeline should run on every pull request, test only changed models and their downstream dependencies, and block merges when tests fail. The slim CI pattern achieves this efficiently:
# .github/workflows/dbt-ci.yml
name: dbt CI

on:
  pull_request:
    branches: [main]

jobs:
  dbt-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
        with:
          fetch-depth: 0  # full history, so origin/main is available for the diff
      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Install dbt
        run: pip install dbt-snowflake
      - name: dbt deps
        run: dbt deps
        env:
          DBT_TARGET: ci
      - name: dbt build (changed models + downstream)
        run: |
          # Get the list of changed model files
          CHANGED=$(git diff --name-only origin/main...HEAD -- 'models/*')
          if [ -z "$CHANGED" ]; then
            echo "No model changes, skipping dbt run"
            exit 0
          fi
          # Build and test only changed models + their dependents
          dbt build --select "state:modified+" --defer --state ./prod_artifacts
        env:
          DBT_TARGET: ci
          SNOWFLAKE_ACCOUNT: ${{ secrets.SNOWFLAKE_ACCOUNT }}

The --defer flag and --state reference allow the CI run to use production artifacts for models that did not change, avoiding rebuilding the entire project on every PR.
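The state comparison requires a copy of the production manifest in ./prod_artifacts. One way to wire that up, sketched below with a hypothetical S3 bucket and paths (adapt the storage commands to whatever artifact store you use):

```shell
# In the production deployment job, after a successful build,
# publish the manifest so CI runs can defer to it:
dbt build --target prod
aws s3 cp target/manifest.json s3://my-dbt-artifacts/manifest.json   # hypothetical bucket

# In the CI job, before `dbt build --select "state:modified+"`:
mkdir -p prod_artifacts
aws s3 cp s3://my-dbt-artifacts/manifest.json prod_artifacts/manifest.json
```

If the manifest is stale (e.g. production has not deployed recently), state:modified may select more models than strictly necessary, which is safe but slower.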
Testing Philosophy That Gets Followed
The testing approaches that work in practice are the ones that are easy to add, fast to run, and generate actionable output when they fail. The approaches that get abandoned are the ones that require significant setup per model, take too long to run, or produce noise.
The minimum test set per model: not_null on every required column, unique on every primary key, and accepted_values on every categorical column with a bounded set. These three tests can be defined in 10 lines of YAML and catch the majority of common data quality issues.
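As a sketch, that minimum set for a staging model fits in a few lines of YAML. The column names and accepted values below are illustrative, not from the post:

```yaml
# models/staging/stripe/_stripe__models.yml (illustrative columns)
version: 2

models:
  - name: stg_stripe__charges
    columns:
      - name: charge_id
        tests: [not_null, unique]     # primary key
      - name: amount_usd
        tests: [not_null]
      - name: status                  # bounded categorical column
        tests:
          - accepted_values:
              values: ['succeeded', 'pending', 'failed']
```

Because the cost per model is this low, the convention actually gets followed; anything heavier tends to be skipped under deadline pressure.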
For business-critical models, add relationship tests (foreign key integrity) and custom singular tests for specific business rules (revenue cannot be negative, dates cannot be in the future). These take more time to write but provide stronger guarantees for the models stakeholders depend on most.
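A singular test is just a SELECT that returns the rows violating a rule; the test passes when it returns zero rows. A minimal sketch of the revenue rule above, assuming a fct_revenue model with these column names:

```sql
-- tests/singular/test_revenue_not_negative.sql (illustrative)
-- Passes when no rows are returned.
select
    order_id,
    revenue
from {{ ref('fct_revenue') }}
where revenue < 0
```

When the test fails, the query output is the list of offending rows, which makes the failure directly actionable.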
The meta field in model YAML is useful for tracking which models have reached test coverage standards, enabling reporting on test coverage across the project and identifying the models that need attention.
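One way to turn that meta field into a report is to scan dbt's compiled manifest. The sketch below assumes a team convention of `meta: {test_coverage: reviewed}` on models that meet the standard; the flag name is an assumption, not a dbt feature, and where meta appears in the manifest varies by dbt version:

```python
# Illustrative sketch: list models whose meta does not mark test coverage
# as reviewed, using dbt's compiled manifest (target/manifest.json).
import json


def models_missing_coverage(manifest: dict) -> list[str]:
    """Return names of models not flagged as meeting the coverage standard."""
    missing = []
    for node in manifest.get("nodes", {}).values():
        if node.get("resource_type") != "model":
            continue
        # meta may live at the node level or under config, depending on dbt version
        meta = node.get("meta") or node.get("config", {}).get("meta", {}) or {}
        if meta.get("test_coverage") != "reviewed":
            missing.append(node["name"])
    return sorted(missing)


# Usage: after `dbt compile`, load the manifest and print the gaps, e.g.
#   manifest = json.load(open("target/manifest.json"))
#   print(models_missing_coverage(manifest))
```

Running this in CI (or on a schedule) gives a standing list of models that need test attention, rather than relying on engineers to remember.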
A dbt project that is well-tested, well-structured, and well-documented reduces the cognitive load of onboarding new engineers, reduces the frequency of data quality incidents, and makes refactoring safer. The investment in these practices pays off in proportion to how long the project lives and how many engineers work on it.