Infrastructure as Code for Data Engineers: Terraform Patterns for Data Platforms
Ryan Kirsch · January 2, 2026 · 8 min read
Data platform configuration -- warehouses, schemas, roles, S3 buckets, Kafka topics -- accumulates over time in ways that become impossible to track manually. Someone creates a warehouse for a project, forgets to set auto-suspend, and six months later it is still running at full cost. Someone grants a role manually, and six months later no one knows why it exists. Infrastructure as code is how you prevent this.
Why Data Engineers Should Care About Terraform
Terraform lets you declare infrastructure as code and apply changes idempotently. You describe what should exist; Terraform computes the diff between the current state and the desired state, then applies only the necessary changes.
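For example, changing a warehouse size in code produces a plan showing exactly that one change. The output below is illustrative (abridged; exact formatting varies by provider version):

```
$ terraform plan
  # snowflake_warehouse.analytics will be updated in-place
  ~ resource "snowflake_warehouse" "analytics" {
        name           = "ANALYTICS_MEDIUM"
      ~ warehouse_size = "small" -> "medium"
    }

Plan: 0 to add, 1 to change, 0 to destroy.
```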
For data engineers, the value is:
- Reproducibility. A new environment (staging, dev) is provisioned by running terraform apply, not by manually recreating warehouse configurations, schemas, and permissions.
- Change visibility. Every infrastructure change goes through a pull request. Someone can see that a new role was granted, a warehouse was resized, or a schema was created -- with the context of why.
- Drift prevention. Terraform detects and reports when someone has manually changed infrastructure outside of code. Configuration drift is caught before it causes failures.
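That drift check can run on a schedule. A minimal sketch as a GitHub Actions workflow -- the file name is hypothetical, and it assumes the `infrastructure/` directory layout used later in this post:

```yaml
# .github/workflows/drift-check.yml (hypothetical)
name: Drift Check

on:
  schedule:
    - cron: "0 6 * * *" # nightly

jobs:
  drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform -chdir=infrastructure init
      - name: Detect drift
        # -detailed-exitcode: exit 0 = in sync, 2 = drift detected,
        # so any manual change outside Terraform fails this job
        run: terraform -chdir=infrastructure plan -detailed-exitcode
        # provider credentials (env vars) omitted for brevity
```

A failing nightly run is the signal that someone changed infrastructure outside of code.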
Managing Snowflake with Terraform
The Snowflake Terraform provider manages databases, schemas, warehouses, roles, users, and grants. A basic Snowflake data platform:
# providers.tf
terraform {
  required_providers {
    snowflake = {
      source  = "Snowflake-Labs/snowflake"
      version = "~> 0.89"
    }
  }
}

provider "snowflake" {
  account  = var.snowflake_account
  username = var.snowflake_username
  password = var.snowflake_password
  role     = "SYSADMIN"
}
# warehouses.tf
resource "snowflake_warehouse" "analytics" {
  name                         = "ANALYTICS_MEDIUM"
  warehouse_size               = "medium"
  auto_suspend                 = 120 # 2 minutes
  auto_resume                  = true
  initially_suspended          = true
  max_concurrency_level        = 8
  statement_timeout_in_seconds = 300
  comment                      = "Analytics queries and dbt transformations"
}

resource "snowflake_warehouse" "etl" {
  name                = "ETL_SMALL"
  warehouse_size      = "small"
  auto_suspend        = 300 # 5 minutes
  auto_resume         = true
  initially_suspended = true
  comment             = "Ingestion and ETL pipelines"
}
# databases.tf
resource "snowflake_database" "analytics" {
  name    = "ANALYTICS"
  comment = "Production analytics warehouse"
}

resource "snowflake_schema" "bronze" {
  database = snowflake_database.analytics.name
  name     = "BRONZE"
  comment  = "Raw ingested data, append-only"
}

resource "snowflake_schema" "silver" {
  database = snowflake_database.analytics.name
  name     = "SILVER"
  comment  = "Cleansed and conformed data"
}

resource "snowflake_schema" "gold" {
  database = snowflake_database.analytics.name
  name     = "GOLD"
  comment  = "Business-facing analytics models"
}
Role-Based Access Control as Code
Access control is one of the most common drift sources. Manual grants accumulate, roles get created for specific projects and never cleaned up, and no one can answer “why does this user have this permission?”. Managing roles in Terraform makes every grant intentional and auditable.
# roles.tf
resource "snowflake_role" "analyst_read" {
  name    = "ANALYST_READ"
  comment = "Read access to gold schema for analysts"
}

resource "snowflake_role" "data_engineer" {
  name    = "DATA_ENGINEER"
  comment = "Read/write access for data engineering team"
}

# Grant SELECT on tables in the gold schema to the analyst role
resource "snowflake_table_grant" "gold_select_analyst" {
  database_name = snowflake_database.analytics.name
  schema_name   = snowflake_schema.gold.name
  privilege     = "SELECT"
  roles         = [snowflake_role.analyst_read.name]
  # Apply to future tables in the schema automatically
  on_future = true
}
# (Analysts also need USAGE on the database and schema to actually query.)

# Grant database usage to the data engineer role
resource "snowflake_database_grant" "analytics_usage_de" {
  database_name = snowflake_database.analytics.name
  privilege     = "USAGE"
  roles         = [snowflake_role.data_engineer.name]
}

# Grant warehouse usage to the analyst role
resource "snowflake_warehouse_grant" "analytics_usage_analyst" {
  warehouse_name = snowflake_warehouse.analytics.name
  privilege      = "USAGE"
  roles          = [snowflake_role.analyst_read.name]
}
S3 Bucket Management
Data lake S3 buckets have specific configuration requirements for data engineering workloads. Managing them in Terraform ensures every bucket has consistent lifecycle policies, encryption, and access controls:
# s3.tf
resource "aws_s3_bucket" "data_lake" {
  bucket = "company-data-lake-prod"

  tags = {
    Environment = "production"
    Team        = "data-platform"
    ManagedBy   = "terraform"
  }
}

resource "aws_s3_bucket_lifecycle_configuration" "data_lake" {
  bucket = aws_s3_bucket.data_lake.id

  rule {
    id     = "bronze-intelligent-tiering"
    status = "Enabled"

    filter {
      prefix = "bronze/"
    }

    transition {
      days          = 90
      storage_class = "INTELLIGENT_TIERING"
    }
  }

  rule {
    id     = "temp-expiration"
    status = "Enabled"

    filter {
      prefix = "tmp/"
    }

    expiration {
      days = 7 # Clean up tmp files automatically
    }
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "data_lake" {
  bucket = aws_s3_bucket.data_lake.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}
Terraform State Management for Teams
Terraform stores state about what it has created. For teams, this state must live in shared, versioned storage -- not on a developer's laptop. The standard pattern is S3 with DynamoDB for state locking:
# backend.tf
terraform {
  backend "s3" {
    bucket         = "company-terraform-state"
    key            = "data-platform/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock" # Prevents concurrent applies
  }
}

# Separate the state bucket from the data lake bucket
# (state bucket should have versioning enabled and strict access control)
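That comment can be made concrete. A hedged sketch of the state bucket and lock table themselves -- the file name is hypothetical, the names match the backend config above, and these resources are typically bootstrapped once in a separate root module before any backend uses them:

```hcl
# state-bootstrap.tf (hypothetical, applied once before the backend exists)
resource "aws_s3_bucket" "terraform_state" {
  bucket = "company-terraform-state"
}

# Versioning lets you recover a corrupted or accidentally deleted state file
resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id
  versioning_configuration {
    status = "Enabled"
  }
}

# Terraform's S3 backend requires the lock table's hash key to be "LockID"
resource "aws_dynamodb_table" "terraform_lock" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```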
# variables.tf
variable "environment" {
  type        = string
  description = "Environment: prod, staging, or dev"

  validation {
    condition     = contains(["prod", "staging", "dev"], var.environment)
    error_message = "Environment must be prod, staging, or dev."
  }
}

variable "snowflake_account" {
  type      = string
  sensitive = true # Prevents value from appearing in logs
}

variable "snowflake_password" {
  type      = string
  sensitive = true
}
The CI/CD Workflow for Infrastructure Changes
Infrastructure changes should follow the same review process as code changes. A standard CI/CD workflow:
# .github/workflows/terraform.yml
name: Terraform

on:
  pull_request:
    paths: ["infrastructure/**"]
  push:
    branches: [main]
    paths: ["infrastructure/**"]

jobs:
  plan:
    runs-on: ubuntu-latest
    if: github.event_name == 'pull_request'
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - name: Terraform Init
        run: terraform -chdir=infrastructure init
      - name: Terraform Plan
        run: terraform -chdir=infrastructure plan -out=tfplan
        env:
          TF_VAR_snowflake_account: ${{ secrets.SNOWFLAKE_ACCOUNT }}
          TF_VAR_snowflake_password: ${{ secrets.SNOWFLAKE_PASSWORD }}
      - name: Comment PR with Plan
        uses: actions/github-script@v7
        # ... post plan output as PR comment

  apply:
    runs-on: ubuntu-latest
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - name: Terraform Apply
        run: terraform -chdir=infrastructure apply -auto-approve
        env:
          TF_VAR_snowflake_account: ${{ secrets.SNOWFLAKE_ACCOUNT }}
          TF_VAR_snowflake_password: ${{ secrets.SNOWFLAKE_PASSWORD }}
The pattern: plan on every PR (shows reviewers exactly what will change), apply on merge to main. No manual terraform apply from developer machines in production. Infrastructure changes are reviewed, approved, and applied automatically -- the same discipline as application code.
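One way to put the environment variable from earlier to work is to parameterize names and sizes per environment. A hypothetical sketch, not part of the configuration above:

```hcl
# Hypothetical per-environment sizing driven by var.environment
resource "snowflake_warehouse" "analytics" {
  name           = "ANALYTICS_${upper(var.environment)}"
  warehouse_size = var.environment == "prod" ? "medium" : "xsmall"
  auto_suspend   = 120
  auto_resume    = true
}
```

Combined with a distinct backend key per environment (passed at init time via partial backend configuration, e.g. terraform init -backend-config="key=data-platform/prod.tfstate"), the same configuration provisions prod, staging, and dev with isolated state.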
Ryan Kirsch
Senior Data Engineer with experience building production pipelines at scale. Works with dbt, Snowflake, and Dagster, and writes about data engineering patterns from production experience. See his full portfolio.