Building a Data Platform from Scratch: Decisions, Trade-offs, and the Order That Matters

Ryan Kirsch · December 8, 2025 · 11 min read

The most expensive decisions in data engineering are the ones made in the first 90 days. Storage architecture, ingestion patterns, and transformation conventions compound forward -- they are very hard to change once pipelines and analysts depend on them. This guide walks through the decision sequence in the order it actually matters.

Start With the Questions, Not the Stack

The most common mistake when building a data platform is starting with a technology preference rather than the business requirements: “We should use Databricks” or “Let's go with Snowflake” gets decided before anyone has asked what latency the business needs, how much data will land per day, or who the primary consumers will be.

The questions that should precede every architecture decision:

  • Who are the consumers? Analysts writing SQL, data scientists training models, engineers reverse-ETLing into operational systems, and BI tools all have different requirements from the same underlying data.
  • What is the acceptable latency? Hours, minutes, or seconds? The answer determines whether batch, microbatch, or streaming is appropriate -- and that choice cascades into every other decision.
  • What is the data volume and growth rate? A platform handling 10GB/day is architected differently from one handling 10TB/day, even if both start with the same tool choices.
  • What is the team's existing expertise? The best architecture on paper is worth nothing if no one on the team can operate it. Build for the team you have, not the team you imagine hiring.

Layer 1: Storage -- The Foundation You Cannot Easily Change

The storage decision is the most consequential and the one most teams make without enough deliberation. The main options in 2026:

Cloud data warehouse (Snowflake, BigQuery, Redshift): Managed storage with strong SQL support, automatic optimization, and tight ecosystem integration. The right choice for most teams under 50 engineers whose primary consumers are analysts. Higher per-query cost at extreme scale, but zero infrastructure management.

Data lakehouse (S3/GCS + Delta Lake or Iceberg): Object storage with ACID transactions and open table formats. More infrastructure to manage, but lower storage costs at scale and better support for ML workloads and multi-engine access. Right for teams with Spark or Flink workloads alongside SQL analytics.

Hybrid: Land raw data in object storage (cheap, durable, format-agnostic) and copy curated tables to a warehouse for analytics. This pattern separates long-term archival from analytical performance, which is often the right trade-off for large organizations.

For most greenfield platforms in 2026: start with Snowflake or BigQuery. The operational overhead saved is worth more than the cost premium for teams under 20 engineers. Migrate to a lakehouse if and when volume economics demand it, not before.

Layer 2: Ingestion -- The Multiplier on Everything Downstream

Ingestion quality directly determines transformation complexity. Data that arrives clean, complete, and on schedule makes every downstream model simpler. Data that arrives late, duplicated, or with schema drift forces defensive logic into every transformation.

The ingestion decision matrix:

Source Type                               | Recommended Tool
------------------------------------------|---------------------------------------------------
SaaS APIs (Salesforce, HubSpot, Stripe)   | Fivetran or Airbyte (managed connectors)
Operational databases (Postgres, MySQL)   | Debezium (CDC via Kafka) or Airbyte CDC connectors
Custom REST APIs                          | Python scripts in Dagster assets
Event streams (clickstream, IoT)          | Kafka / Redpanda with Kafka Connect
Files (CSV, Parquet, JSON from S3/GCS)    | COPY INTO (Snowflake) or dbt seeds for small reference files

The schema-on-write vs schema-on-read decision: raw data should be stored as-received (schema-on-read) so you can replay it if your parsing logic changes. Apply types and constraints at the silver layer, not at ingestion. This is the bronze-silver-gold pattern in practice.
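The bronze-to-silver handoff can be sketched in a few lines. This is a minimal illustration, not a production loader; the payload fields and helper names are illustrative assumptions, not from any specific source system.

```python
import json
from datetime import datetime, timezone

# Hypothetical payload as it might arrive from a webhook (field names
# are illustrative assumptions).
raw_event = '{"order_id": "1042", "amount": "19.99", "ts": "2026-01-15T23:45:00Z"}'

# Bronze: store the payload exactly as received, plus load metadata.
bronze_row = {
    "raw_payload": raw_event,                 # untouched, replayable
    "loaded_at": datetime.now(timezone.utc),
}

# Silver: apply types and constraints here. This logic can change and be
# replayed against bronze without re-ingesting from the source.
def to_silver(row):
    payload = json.loads(row["raw_payload"])
    return {
        "order_id": int(payload["order_id"]),
        "amount_usd": float(payload["amount"]),
        "occurred_at": datetime.fromisoformat(payload["ts"].replace("Z", "+00:00")),
    }

silver_row = to_silver(bronze_row)
```

If the parsing logic turns out to be wrong, `to_silver` is rerun over the stored bronze payloads; nothing has to be re-fetched from the source.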

The most underestimated ingestion problem is late-arriving data. Source systems deliver events out of order. A webhook that fires at 11:58 PM for a transaction that occurred at 11:45 PM lands in the wrong daily batch. Plan your incremental load windows to accommodate late arrival from day one -- it is much harder to retrofit.
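One common way to handle this is a lookback buffer on the incremental window. A minimal sketch, assuming a fixed `LOOKBACK_DAYS` sized to the sources' observed lateness (the constant and function names are illustrative):

```python
from datetime import date, timedelta

# Assumption: three days covers the observed lateness of the sources.
LOOKBACK_DAYS = 3

def incremental_window(last_loaded: date, today: date):
    """Reprocess a trailing buffer so late-arriving events are picked up."""
    start = min(last_loaded, today - timedelta(days=LOOKBACK_DAYS))
    return start, today

# The window re-covers the previous few days rather than starting exactly
# where the last run stopped, so a late-landing event still falls inside it.
start, end = incremental_window(date(2026, 1, 14), date(2026, 1, 15))
```

The cost is reprocessing a few days of data on every run; the benefit is that lateness within the buffer is handled automatically instead of requiring backfills.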

Layer 3: Transformation -- Where the Platform's Value Lives

Transformation is where raw data becomes business value. The decision is almost always dbt for SQL-based transformations in 2026, with PySpark or pandas for complex Python transformations that require procedural logic or ML feature engineering.

The conventions that matter most in the first 90 days:

  • Naming conventions. Define stg_, int_, fct_, dim_ prefixes before the first model is written. Retrofitting naming conventions onto a large project is painful.
  • Grain documentation. Every fact table should have a documented grain (one row per what?). Undocumented grain is the source of most downstream confusion and incorrect joins.
  • Test coverage policy. Decide what minimum tests are required for every model (not_null on all keys, unique on surrogate keys) before the codebase grows past the point where enforcement is practical.
  • PR review process. Treat dbt models as code with real code review. Schema changes to widely-used models should require explicit sign-off from downstream consumers.
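The two baseline checks in the test coverage policy are simple enough to state in code. A minimal sketch of the logic (real enforcement belongs in dbt's `not_null` and `unique` tests; the column names here are illustrative):

```python
# Baseline checks: not_null on all keys, unique on surrogate keys.
def not_null(rows, column):
    return all(r.get(column) is not None for r in rows)

def unique(rows, column):
    values = [r[column] for r in rows]
    return len(values) == len(set(values))

rows = [
    {"order_key": "a1", "customer_id": 7},
    {"order_key": "a2", "customer_id": 7},  # duplicate customer_id is fine
]
assert not_null(rows, "customer_id")   # every key populated
assert unique(rows, "order_key")       # surrogate key never repeats
```

The point of deciding this early is that the checks cost seconds to add to a new model and hours to retrofit across a hundred existing ones.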

Layer 4: Orchestration -- The Nervous System

Orchestration ties all the layers together. The choice in 2026 is primarily between Airflow (task-based, battle-tested, large ecosystem) and Dagster (asset-based, modern observability, better developer experience).

The practical decision criteria:

  • If the team already knows Airflow and the primary use case is simple ETL scheduling, stay with Airflow. The switching cost is high and the operational benefit of Dagster is most visible at larger scale.
  • If starting fresh and the team values observability, lineage, and the asset model, Dagster is the better long-term choice. The learning curve is real but the developer experience pays dividends.
  • For dbt-primary teams, both integrate well. Dagster's native dbt integration (auto-generating assets from manifest) is a notable advantage for teams that want unified lineage.

Regardless of tool choice: design for idempotency from day one. Every pipeline run should be safe to re-run. This means no append-only loads without deduplication, no non-idempotent operations in the critical path, and explicit handling of partial failures.
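The append-without-deduplication failure mode is easiest to see next to its fix. A minimal sketch of an idempotent load, merging on a natural key so re-running the same batch changes nothing (table and key names are illustrative assumptions):

```python
# Idempotent load: merge keyed on a natural key. Running the same batch
# twice leaves the target in the same state as running it once.
def merge(target: dict, batch: list, key: str) -> dict:
    for row in batch:
        target[row[key]] = row  # insert or overwrite -- safe to re-run
    return target

target = {}
batch = [
    {"order_id": 1, "status": "paid"},
    {"order_id": 2, "status": "open"},
]

merge(target, batch, "order_id")
merge(target, batch, "order_id")  # retry after a partial failure: no duplicates
```

In a warehouse this is a MERGE (or dbt incremental model with a `unique_key`) rather than an in-memory dict, but the property being enforced is the same.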

Layer 5: Serving -- Matching Data to Consumer Needs

The serving layer is often treated as an afterthought, but it determines whether the platform delivers business value or becomes a beautiful internal engineering project that no one uses.

The main serving patterns and their right use cases:

  • BI tools (Looker, Metabase, Tableau): Best for structured reporting, dashboards, and self-serve SQL for analysts. Connect directly to gold mart tables. Define a semantic layer to prevent metric inconsistency across reports.
  • Reverse ETL (Census, Hightouch): Sync curated data from the warehouse back into operational systems (Salesforce, HubSpot, Zendesk). Closes the loop between analytics and action.
  • ML feature store: If the platform serves ML use cases, a dedicated feature store (Feast, Tecton, or Databricks Feature Store) prevents duplicate feature engineering and ensures training-serving consistency.
  • Data API: For product teams that need to embed analytics into user-facing features, a thin REST API over warehouse queries (or Materialize for real-time) is more appropriate than giving product engineers direct warehouse access.
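The "thin API over warehouse queries" pattern amounts to exposing one parameterized query per endpoint instead of raw warehouse access. A minimal sketch, with sqlite3 standing in for the warehouse connection (table, column, and function names are illustrative assumptions):

```python
import json
import sqlite3

# sqlite3 stands in for the warehouse; schema is illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_orders (day TEXT, customer_id INTEGER, orders INTEGER)")
conn.execute("INSERT INTO daily_orders VALUES ('2026-01-15', 7, 3)")

def orders_for_customer(customer_id: int) -> str:
    """One endpoint, one parameterized query -- callers never write SQL."""
    rows = conn.execute(
        "SELECT day, orders FROM daily_orders WHERE customer_id = ?",
        (customer_id,),
    ).fetchall()
    return json.dumps([{"day": d, "orders": n} for d, n in rows])

response = orders_for_customer(7)
```

Product engineers get a stable contract; the data team keeps the freedom to change warehouse schemas behind it.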

The 90-Day Build Sequence

Given the above, here is the pragmatic build sequence that ships value quickly without creating technical debt:

Days 1-14: Foundation. Choose and provision the warehouse. Set up source connections for the 3-5 most important source systems. Establish naming conventions and project structure in dbt. Get a basic staging layer running.

Days 15-30: First value. Build silver models for the top 2-3 business entities (customers, orders, events). Connect a BI tool to the warehouse. Ship one dashboard that replaces an existing spreadsheet-based report.

Days 31-60: Reliability. Add tests to all staging and silver models. Set up orchestration with scheduled runs and alerting. Add data freshness monitoring. Document the grain and ownership of every production model.

Days 61-90: Scale. Add remaining source systems. Build gold mart tables for primary business domains. Establish data contracts for models that other teams depend on. Introduce lineage tracking and incident response runbooks.

The sequence prioritizes delivering one working end-to-end slice (source to dashboard) before building out breadth. A platform with one fully reliable data product is more valuable than a platform with ten half-built pipelines.

The Decisions That Are Hardest to Change

Build these right the first time, because retrofitting them is expensive:

  • Primary key strategy. Surrogate keys vs. natural keys vs. composite keys. Decide early and enforce consistently. Mixed key strategies across a warehouse are a maintenance problem that compounds with every new join.
  • Date/time convention. UTC everywhere in storage, convert at the serving layer. Teams that mix timezones in the warehouse create subtle, hard-to-debug aggregation errors that show up in production at the worst moments.
  • Null semantics. What does a null mean in your platform -- unknown, not applicable, or missing? Define this explicitly and encode it in your staging transforms. Ambiguous nulls become incorrect aggregations downstream.
  • Access control model. Row-level security, column masking, and schema-level permissions are much easier to design in from the start than to add later. For platforms handling sensitive data, build the access model before the first external consumer gets credentials.
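The UTC convention above is mechanical to apply with the standard library. A minimal sketch, assuming the target timezone comes from the consumer at serving time (the timezone chosen here is illustrative):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Storage: always UTC, always timezone-aware.
stored = datetime(2026, 1, 15, 23, 45, tzinfo=timezone.utc)

def serve_local(ts_utc: datetime, tz_name: str) -> datetime:
    """Convert at the serving layer, never in the warehouse."""
    return ts_utc.astimezone(ZoneInfo(tz_name))

# 23:45 UTC on Jan 15 is 18:45 the same day in New York (UTC-5 in winter).
local = serve_local(stored, "America/New_York")
```

Aggregations (daily rollups, cohort windows) run once over the UTC column; only the presentation changes per consumer, which is what keeps the numbers consistent across reports.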

Every platform has a moment where the team wishes they had made different foundational decisions. The teams that end up with maintainable platforms are the ones who spent the extra days in weeks 1-4 getting these foundations right, rather than racing to ship the first dashboard.

Ryan Kirsch

Senior Data Engineer with experience building production pipelines at scale. Works with dbt, Snowflake, and Dagster, and writes about data engineering patterns from production experience.