Infrastructure as Code for Data Engineers: Terraform Patterns for Data Platforms
Ryan Kirsch · January 2, 2026 · 8 min read
Data platform configuration -- warehouses, schemas, roles, S3 buckets, Kafka topics -- accumulates over time in ways that become impossible to track manually. Someone creates a warehouse for a project, forgets to set auto-suspend, and six months later it is still running at full cost. Someone grants a role manually, and six months later no one knows why it exists. Infrastructure as code is how you prevent this.
Why Data Engineers Should Care About Terraform
Terraform lets you declare infrastructure as code and apply changes idempotently. You describe what should exist; Terraform computes the diff between the current state and the desired state, then applies only the necessary changes.
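For example, changing a warehouse size in code produces a plan showing exactly that one change. The output below is illustrative (abridged; exact formatting varies by provider version):

```
$ terraform plan
  # snowflake_warehouse.analytics will be updated in-place
  ~ resource "snowflake_warehouse" "analytics" {
        name           = "ANALYTICS_MEDIUM"
      ~ warehouse_size = "small" -> "medium"
    }

Plan: 0 to add, 1 to change, 0 to destroy.
```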
For data engineers, the value is:
- Reproducibility. A new environment (staging, dev) is provisioned by running terraform apply, not by manually recreating warehouse configurations, schemas, and permissions.
- Change visibility. Every infrastructure change goes through a pull request. Someone can see that a new role was granted, a warehouse was resized, or a schema was created -- with the context of why.
- Drift prevention. Terraform detects and reports when someone has manually changed infrastructure outside of code. Configuration drift is caught before it causes failures.
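That drift check can run on a schedule. A minimal sketch as a GitHub Actions workflow -- the file name is hypothetical, and it assumes the `infrastructure/` directory layout used later in this post:

```yaml
# .github/workflows/drift-check.yml (hypothetical)
name: Drift Check

on:
  schedule:
    - cron: "0 6 * * *" # nightly

jobs:
  drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform -chdir=infrastructure init
      - name: Detect drift
        # -detailed-exitcode: exit 0 = in sync, 2 = drift detected,
        # so any manual change outside Terraform fails this job
        run: terraform -chdir=infrastructure plan -detailed-exitcode
        # provider credentials (env vars) omitted for brevity
```

A failing nightly run is the signal that someone changed infrastructure outside of code.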
Managing Snowflake with Terraform
The Snowflake Terraform provider manages databases, schemas, warehouses, roles, users, and grants. A basic Snowflake data platform:
# providers.tf
terraform {
  required_providers {
    snowflake = {
      source  = "Snowflake-Labs/snowflake"
      version = "~> 0.89"
    }
  }
}

provider "snowflake" {
  account  = var.snowflake_account
  username = var.snowflake_username
  password = var.snowflake_password
  role     = "SYSADMIN"
}
# warehouses.tf
resource "snowflake_warehouse" "analytics" {
  name                         = "ANALYTICS_MEDIUM"
  warehouse_size               = "medium"
  auto_suspend                 = 120 # 2 minutes
  auto_resume                  = true
  initially_suspended          = true
  max_concurrency_level        = 8
  statement_timeout_in_seconds = 300
  comment                      = "Analytics queries and dbt transformations"
}

resource "snowflake_warehouse" "etl" {
  name                = "ETL_SMALL"
  warehouse_size      = "small"
  auto_suspend        = 300 # 5 minutes
  auto_resume         = true
  initially_suspended = true
  comment             = "Ingestion and ETL pipelines"
}
# databases.tf
resource "snowflake_database" "analytics" {
  name    = "ANALYTICS"
  comment = "Production analytics warehouse"
}

resource "snowflake_schema" "bronze" {
  database = snowflake_database.analytics.name
  name     = "BRONZE"
  comment  = "Raw ingested data, append-only"
}

resource "snowflake_schema" "silver" {
  database = snowflake_database.analytics.name
  name     = "SILVER"
  comment  = "Cleansed and conformed data"
}

resource "snowflake_schema" "gold" {
  database = snowflake_database.analytics.name
  name     = "GOLD"
  comment  = "Business-facing analytics models"
}
Role-Based Access Control as Code
Access control is one of the most common drift sources. Manual grants accumulate, roles get created for specific projects and never cleaned up, and no one can answer “why does this user have this permission?”. Managing roles in Terraform makes every grant intentional and auditable.
# roles.tf
resource "snowflake_role" "analyst_read" {
  name    = "ANALYST_READ"
  comment = "Read access to gold schema for analysts"
}

resource "snowflake_role" "data_engineer" {
  name    = "DATA_ENGINEER"
  comment = "Read/write access for data engineering team"
}

# Grant SELECT on tables in the gold schema to the analyst role
resource "snowflake_table_grant" "gold_select_analyst" {
  database_name = snowflake_database.analytics.name
  schema_name   = snowflake_schema.gold.name
  privilege     = "SELECT"
  roles         = [snowflake_role.analyst_read.name]
  # Apply to future tables in the schema automatically
  on_future = true
}
# (Analysts also need USAGE on the database and schema to actually query.)

# Grant database usage to the data engineer role
resource "snowflake_database_grant" "analytics_usage_de" {
  database_name = snowflake_database.analytics.name
  privilege     = "USAGE"
  roles         = [snowflake_role.data_engineer.name]
}

# Grant warehouse usage to the analyst role
resource "snowflake_warehouse_grant" "analytics_usage_analyst" {
  warehouse_name = snowflake_warehouse.analytics.name
  privilege      = "USAGE"
  roles          = [snowflake_role.analyst_read.name]
}
S3 Bucket Management
Data lake S3 buckets have specific configuration requirements for data engineering workloads. Managing them in Terraform ensures every bucket has consistent lifecycle policies, encryption, and access controls:
# s3.tf
resource "aws_s3_bucket" "data_lake" {
  bucket = "company-data-lake-prod"

  tags = {
    Environment = "production"
    Team        = "data-platform"
    ManagedBy   = "terraform"
  }
}

resource "aws_s3_bucket_lifecycle_configuration" "data_lake" {
  bucket = aws_s3_bucket.data_lake.id

  rule {
    id     = "bronze-intelligent-tiering"
    status = "Enabled"

    filter {
      prefix = "bronze/"
    }

    transition {
      days          = 90
      storage_class = "INTELLIGENT_TIERING"
    }
  }

  rule {
    id     = "temp-expiration"
    status = "Enabled"

    filter {
      prefix = "tmp/"
    }

    expiration {
      days = 7 # Clean up tmp files automatically
    }
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "data_lake" {
  bucket = aws_s3_bucket.data_lake.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}
Terraform State Management for Teams
Terraform stores state about what it has created. For teams, this state must live in shared, versioned storage -- not on a developer's laptop. The standard pattern is S3 with DynamoDB for state locking:
# backend.tf
terraform {
  backend "s3" {
    bucket         = "company-terraform-state"
    key            = "data-platform/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock" # Prevents concurrent applies
  }
}

# Separate the state bucket from the data lake bucket
# (state bucket should have versioning enabled and strict access control)
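That comment can be made concrete. A hedged sketch of the state bucket and lock table themselves -- the file name is hypothetical, the names match the backend config above, and these resources are typically bootstrapped once in a separate root module before any backend uses them:

```hcl
# state-bootstrap.tf (hypothetical, applied once before the backend exists)
resource "aws_s3_bucket" "terraform_state" {
  bucket = "company-terraform-state"
}

# Versioning lets you recover a corrupted or accidentally deleted state file
resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id
  versioning_configuration {
    status = "Enabled"
  }
}

# Terraform's S3 backend requires the lock table's hash key to be "LockID"
resource "aws_dynamodb_table" "terraform_lock" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```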
# variables.tf
variable "environment" {
  type        = string
  description = "Environment: prod, staging, or dev"

  validation {
    condition     = contains(["prod", "staging", "dev"], var.environment)
    error_message = "Environment must be prod, staging, or dev."
  }
}

variable "snowflake_account" {
  type      = string
  sensitive = true # Prevents value from appearing in logs
}

variable "snowflake_password" {
  type      = string
  sensitive = true
}
The CI/CD Workflow for Infrastructure Changes
Infrastructure changes should follow the same review process as code changes. A standard CI/CD workflow:
# .github/workflows/terraform.yml
name: Terraform

on:
  pull_request:
    paths: ["infrastructure/**"]
  push:
    branches: [main]
    paths: ["infrastructure/**"]

jobs:
  plan:
    runs-on: ubuntu-latest
    if: github.event_name == 'pull_request'
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - name: Terraform Init
        run: terraform -chdir=infrastructure init
      - name: Terraform Plan
        run: terraform -chdir=infrastructure plan -out=tfplan
        env:
          TF_VAR_snowflake_account: ${{ secrets.SNOWFLAKE_ACCOUNT }}
          TF_VAR_snowflake_password: ${{ secrets.SNOWFLAKE_PASSWORD }}
      - name: Comment PR with Plan
        uses: actions/github-script@v7
        # ... post plan output as PR comment

  apply:
    runs-on: ubuntu-latest
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - name: Terraform Apply
        run: terraform -chdir=infrastructure apply -auto-approve
        env:
          TF_VAR_snowflake_account: ${{ secrets.SNOWFLAKE_ACCOUNT }}
          TF_VAR_snowflake_password: ${{ secrets.SNOWFLAKE_PASSWORD }}
The pattern: plan on every PR (shows reviewers exactly what will change), apply on merge to main. No manual terraform apply from developer machines in production. Infrastructure changes are reviewed, approved, and applied automatically -- the same discipline as application code.
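One way to put the environment variable from earlier to work is to parameterize names and sizes per environment. A hypothetical sketch, not part of the configuration above:

```hcl
# Hypothetical per-environment sizing driven by var.environment
resource "snowflake_warehouse" "analytics" {
  name           = "ANALYTICS_${upper(var.environment)}"
  warehouse_size = var.environment == "prod" ? "medium" : "xsmall"
  auto_suspend   = 120
  auto_resume    = true
}
```

Combined with a distinct backend key per environment (passed at init time via partial backend configuration, e.g. terraform init -backend-config="key=data-platform/prod.tfstate"), the same configuration provisions prod, staging, and dev with isolated state.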
Ryan Kirsch
Senior Data Engineer with experience building production pipelines at scale. Works with dbt, Snowflake, and Dagster, and writes about data engineering patterns from production experience. See his full portfolio.