dbt in Production: The Patterns That Scale
Getting a dbt project to work is easy. Keeping it maintainable at 200 models, across multiple environments, with a team of engineers, is the harder problem. Here are the patterns that scale.
The first dbt project most engineers build is simple: a handful of staging models, a few marts, everything in one folder. This works fine at small scale. The problems appear around 50-100 models, when staging models start referencing each other, test coverage is inconsistent, naming conventions have drifted, and nobody is quite sure what the canonical source for a given metric is.
This post covers the architectural and operational patterns that keep dbt projects maintainable as they grow: project structure, environment configuration, macro libraries, CI/CD, custom schema generation, and the testing philosophy that actually gets followed rather than abandoned when the team gets busy.
Project Structure at Scale
The folder structure of a dbt project is the primary navigation mechanism for new engineers. A clear, consistent structure means anyone can find what they are looking for without asking. The three-layer structure (staging, intermediate, marts) is the right foundation; the question is how to organize within each layer as the project grows.
models/
  staging/
    salesforce/
      _salesforce__sources.yml        # source definitions
      _salesforce__models.yml         # model docs + tests
      stg_salesforce__accounts.sql
      stg_salesforce__contacts.sql
    stripe/
      _stripe__sources.yml
      stg_stripe__charges.sql
      stg_stripe__refunds.sql
  intermediate/
    int_orders_with_payments.sql
    int_customer_sessions.sql
  marts/
    core/
      _core__models.yml
      fct_orders.sql
      dim_customers.sql
    finance/
      fct_revenue.sql
      fct_refunds.sql
    marketing/
      fct_campaign_performance.sql
tests/
  generic/
    test_column_is_url.sql            # custom generic tests
  singular/
    test_revenue_reconciliation.sql
macros/
  generate_schema_name.sql
  surrogate_key.sql
  get_fiscal_period.sql

Source-scoped folders in staging (one subfolder per source system) make it clear which staging models correspond to which sources. The YAML convention (underscore prefix, double underscore separator) makes schema files visually distinct from SQL models and keeps source definitions co-located with their models.
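For illustration, a staging model in this layout might look like the following sketch. The column names are assumptions about the Stripe charges source, not taken from the post:

```sql
-- models/staging/stripe/stg_stripe__charges.sql (illustrative sketch)
with source as (

    select * from {{ source('stripe', 'charges') }}

),

renamed as (

    select
        id as charge_id,
        customer as customer_id,
        amount / 100.0 as amount_usd,   -- Stripe reports amounts in cents
        status,
        created as created_at
    from source

)

select * from renamed
```

Staging models do nothing but rename, cast, and lightly clean; all joins and business logic live in intermediate and mart models.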
Environment Configuration
dbt profiles and environment variables control how the project behaves across development, staging, and production. Key configurations to manage explicitly:
# dbt_project.yml — project-level configuration
name: 'analytics'
version: '1.0.0'
profile: 'analytics'

vars:
  # Override per environment (e.g. with --vars or env_var())
  start_date: '2020-01-01'
  payment_methods: ['card', 'bank_transfer']

models:
  analytics:
    staging:
      +materialized: view
      +schema: staging
    intermediate:
      +materialized: ephemeral
    marts:
      +materialized: table
      core:
        +schema: core
      finance:
        +schema: finance
# profiles.yml — environment-specific
analytics:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: "{{ env_var('SNOWFLAKE_ACCOUNT') }}"
      database: dev_db
      schema: "{{ env_var('DBT_USER', 'anonymous') }}_analytics"
      warehouse: transform_wh
    prod:
      type: snowflake
      account: "{{ env_var('SNOWFLAKE_ACCOUNT') }}"
      database: prod_db
      schema: analytics
      warehouse: transform_wh

Custom Schema Generation
By default, dbt appends the schema defined in the model config to the target schema, producing names like analytics_core. For production systems, you usually want schemas like core and finance without the environment prefix. The generate_schema_name macro controls this behavior.
-- macros/generate_schema_name.sql
{% macro generate_schema_name(custom_schema_name, node) -%}

    {%- set default_schema = target.schema -%}

    {%- if custom_schema_name is none -%}
        {{ default_schema }}

    {%- elif target.name == 'prod' -%}
        {# In production: use the custom schema name directly (no prefix) #}
        {{ custom_schema_name | trim }}

    {%- else -%}
        {# In dev/staging: prefix with the user schema to avoid conflicts #}
        {{ default_schema }}_{{ custom_schema_name | trim }}

    {%- endif -%}

{%- endmacro %}

Note the Jinja comments ({# ... #}) rather than SQL comments: this macro's rendered output becomes the schema name itself, so literal -- comment text would leak into it. This macro ensures that production deployments write to clean schema names (core, finance, marketing) while development environments use user-prefixed schemas (rkirsch_analytics_core) to avoid collisions.
Macro Libraries
Custom macros eliminate repeated logic and enforce consistency. Three categories of macros worth building early:
-- Surrogate key generation (consistent across all fact tables)
-- macros/surrogate_key.sql
{% macro surrogate_key(field_list) %}
    MD5(
        CONCAT_WS('|',
            {% for field in field_list %}
            COALESCE(CAST({{ field }} AS VARCHAR), 'NULL')
            {%- if not loop.last %}, {% endif %}
            {% endfor %}
        )
    )
{% endmacro %}

-- Usage
SELECT
    {{ surrogate_key(['order_id', 'line_item_id']) }} AS order_line_sk
-- Date spines for filling gaps in date ranges: use dbt_utils.date_spine
-- rather than hand-rolling one.

-- Fiscal period bucketing (calendar-aligned quarters)
-- macros/get_fiscal_period.sql
{% macro get_fiscal_period(date_column) %}
    CASE
        WHEN MONTH({{ date_column }}) IN (1, 2, 3) THEN 'Q1'
        WHEN MONTH({{ date_column }}) IN (4, 5, 6) THEN 'Q2'
        WHEN MONTH({{ date_column }}) IN (7, 8, 9) THEN 'Q3'
        WHEN MONTH({{ date_column }}) IN (10, 11, 12) THEN 'Q4'
    END
{% endmacro %}

CI/CD for dbt
A dbt CI pipeline should run on every pull request, test only changed models and their downstream dependencies, and block merges when tests fail. The slim CI pattern achieves this efficiently:
# .github/workflows/dbt-ci.yml
name: dbt CI

on:
  pull_request:
    branches: [main]

jobs:
  dbt-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
        with:
          fetch-depth: 0  # full history, so origin/main is available for the diff
      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Install dbt
        run: pip install dbt-snowflake
      - name: dbt deps
        run: dbt deps
        env:
          DBT_TARGET: ci
      - name: dbt build (changed models + downstream)
        run: |
          # Get the list of changed model files
          CHANGED=$(git diff --name-only origin/main...HEAD -- 'models/*')
          if [ -z "$CHANGED" ]; then
            echo "No model changes, skipping dbt run"
            exit 0
          fi
          # Build and test only changed models + their dependents
          dbt build --select "state:modified+" --defer --state ./prod_artifacts
        env:
          DBT_TARGET: ci
          SNOWFLAKE_ACCOUNT: ${{ secrets.SNOWFLAKE_ACCOUNT }}

The --defer flag and --state reference allow the CI run to use production artifacts for models that did not change, avoiding rebuilding the entire project on every PR.
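The state comparison requires a copy of the production manifest in ./prod_artifacts. One way to wire that up, sketched below with a hypothetical S3 bucket and paths (adapt the storage commands to whatever artifact store you use):

```shell
# In the production deployment job, after a successful build,
# publish the manifest so CI runs can defer to it:
dbt build --target prod
aws s3 cp target/manifest.json s3://my-dbt-artifacts/manifest.json   # hypothetical bucket

# In the CI job, before `dbt build --select "state:modified+"`:
mkdir -p prod_artifacts
aws s3 cp s3://my-dbt-artifacts/manifest.json prod_artifacts/manifest.json
```

If the manifest is stale (e.g. production has not deployed recently), state:modified may select more models than strictly necessary, which is safe but slower.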
Testing Philosophy That Gets Followed
The testing approaches that work in practice are the ones that are easy to add, fast to run, and generate actionable output when they fail. The approaches that get abandoned are the ones that require significant setup per model, take too long to run, or produce noise.
The minimum test set per model: not_null on every required column, unique on every primary key, and accepted_values on every categorical column with a bounded set. These three tests can be defined in 10 lines of YAML and catch the majority of common data quality issues.
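As a sketch, that minimum set for a staging model fits in a few lines of YAML. The column names and accepted values below are illustrative, not from the post:

```yaml
# models/staging/stripe/_stripe__models.yml (illustrative columns)
version: 2

models:
  - name: stg_stripe__charges
    columns:
      - name: charge_id
        tests: [not_null, unique]     # primary key
      - name: amount_usd
        tests: [not_null]
      - name: status                  # bounded categorical column
        tests:
          - accepted_values:
              values: ['succeeded', 'pending', 'failed']
```

Because the cost per model is this low, the convention actually gets followed; anything heavier tends to be skipped under deadline pressure.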
For business-critical models, add relationship tests (foreign key integrity) and custom singular tests for specific business rules (revenue cannot be negative, dates cannot be in the future). These take more time to write but provide stronger guarantees for the models stakeholders depend on most.
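A singular test is just a SELECT that returns the rows violating a rule; the test passes when it returns zero rows. A minimal sketch of the revenue rule above, assuming a fct_revenue model with these column names:

```sql
-- tests/singular/test_revenue_not_negative.sql (illustrative)
-- Passes when no rows are returned.
select
    order_id,
    revenue
from {{ ref('fct_revenue') }}
where revenue < 0
```

When the test fails, the query output is the list of offending rows, which makes the failure directly actionable.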
The meta field in model YAML is useful for tracking which models have reached test coverage standards, enabling reporting on test coverage across the project and identifying the models that need attention.
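One way to turn that meta field into a report is to scan dbt's compiled manifest. The sketch below assumes a team convention of `meta: {test_coverage: reviewed}` on models that meet the standard; the flag name is an assumption, not a dbt feature, and where meta appears in the manifest varies by dbt version:

```python
# Illustrative sketch: list models whose meta does not mark test coverage
# as reviewed, using dbt's compiled manifest (target/manifest.json).
import json


def models_missing_coverage(manifest: dict) -> list[str]:
    """Return names of models not flagged as meeting the coverage standard."""
    missing = []
    for node in manifest.get("nodes", {}).values():
        if node.get("resource_type") != "model":
            continue
        # meta may live at the node level or under config, depending on dbt version
        meta = node.get("meta") or node.get("config", {}).get("meta", {}) or {}
        if meta.get("test_coverage") != "reviewed":
            missing.append(node["name"])
    return sorted(missing)


# Usage: after `dbt compile`, load the manifest and print the gaps, e.g.
#   manifest = json.load(open("target/manifest.json"))
#   print(models_missing_coverage(manifest))
```

Running this in CI (or on a schedule) gives a standing list of models that need test attention, rather than relying on engineers to remember.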
A dbt project that is well-tested, well-structured, and well-documented reduces the cognitive load of onboarding new engineers, reduces the frequency of data quality incidents, and makes refactoring safer. The investment in these practices pays off in proportion to how long the project lives and how many engineers work on it.