The Local-First Data Stack: Practical Lessons from Dagster, dbt, and DuckDB
Ryan Kirsch · October 8, 2025 · 8 min read
I had three APIs, a laptop, and a genuine question: can you build a production-quality data pipeline without a cloud bill? Not a toy. Not a tutorial. Something with real orchestration, tested transformations, and a proper medallion architecture, running entirely on local infrastructure.
Turns out you can. Here is what I built, why I made each architectural decision, and what surprised me along the way.
The Challenge
The project started as a portfolio piece but quickly became something I actually cared about: pulling health data from my Oura Ring, development activity from GitHub, and weather data from OpenWeatherMap, then correlating them to answer questions like "do I write more code on days with good sleep scores?" That required a real pipeline, not a cron job duct-taped to a CSV.
The constraints were real too. No cloud budget, no managed warehouse, and no Kubernetes cluster to hide behind. Whatever I built had to run on a MacBook, be reproducible, and look like the kind of architecture a senior engineer would ship at work, not just in a weekend hackathon. That ruled out stitching together raw Python scripts and calling it a day. I needed orchestration, tested SQL transformations, and a storage layer fast enough to query on the fly.
The Architecture
Dagster over Airflow: the asset-centric model wins
The first decision was orchestration. Airflow is the obvious default, and I have used it professionally. But Airflow is task-centric: it models your pipeline as a graph of operations. Dagster models it as a graph of assets. That distinction matters more than it sounds.
With Dagster's software-defined assets (SDAs), each dataset is a first-class object. You declare what each asset produces, what it depends on, and Dagster handles the rest, including freshness policies, lineage tracking, and a built-in data catalog in the UI. When I ingest raw Oura Ring data into my bronze layer, Dagster knows that asset exists, when it was last materialized, and what downstream silver-layer assets depend on it. Airflow's task graph doesn't give you that natively.
The local development experience is also significantly better than Airflow's. Running dagster dev spins up the full UI at localhost with hot reload. No Docker Compose, no separate scheduler process, no fighting with environment variables. For a local-first project, that frictionlessness compounded fast.
dbt Core over raw SQL: tests and structure as first-class concerns
Once data lands in the bronze layer, it needs to be transformed. I could have written raw Python with SQL strings. I chose dbt Core instead, and it was the right call.
The staging/marts pattern dbt enforces pushed me to think about data in layers: stg_oura__sleep cleans and standardizes the raw API response; mart_health_activity_correlations joins across sources and exposes the final analytical view. That separation is boring on a small project and invaluable on a large one. More concretely: dbt's built-in testing framework let me define 17 data quality tests across my models, covering not-null constraints, uniqueness on primary keys, and accepted-value validations on categorical fields. Those tests run automatically after each transformation. I know immediately when an API response has changed shape.
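For readers who haven't used dbt's generic tests: they are declared in YAML alongside the model, not written as code. A sketch of what those declarations look like, with illustrative file, column, and value names (the project's actual schema files will differ):

```yaml
# models/staging/oura/_oura__models.yml (illustrative path and names)
version: 2

models:
  - name: stg_oura__sleep
    columns:
      - name: sleep_id
        tests:
          - not_null
          - unique          # primary-key uniqueness
      - name: sleep_phase
        tests:
          - accepted_values:
              values: ['deep', 'light', 'rem', 'awake']
```

Running dbt test compiles each declaration into a SQL query that returns failing rows, so a schema change in the upstream API surfaces as a failed test rather than a silently wrong dashboard.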
Jinja templating and the ref() function also give you dependency management for free. When I refactor stg_oura__sleep, dbt knows every downstream model that needs to rerun. That kind of lineage is something you'd otherwise spend weeks wiring up manually.
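The ref() mechanism is easiest to see in a mart model. A hedged sketch of the shape such a model takes; the second staging model name (stg_github__commits) and the column names are assumptions for illustration, not the repo's actual identifiers:

```sql
-- models/marts/mart_health_activity_correlations.sql (illustrative)
select
    s.sleep_date,
    s.sleep_score,
    g.commit_count
from {{ ref('stg_oura__sleep') }} as s
left join {{ ref('stg_github__commits') }} as g
    on s.sleep_date = g.commit_date
```

Because the dependencies are expressed through ref() rather than hardcoded table names, dbt builds the DAG from the models themselves, which is exactly the lineage the dagster-dbt integration later picks up.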
DuckDB over Postgres or Snowflake: the right tool for laptop-scale analytics
The storage choice was DuckDB, and this is where I had to push back on my own instincts. Postgres is familiar. Snowflake is what the job descriptions mention. But neither is the right fit here.
DuckDB is an in-process OLAP database. It runs inside your Python process, reads directly from Parquet files, and executes columnar queries over millions of rows in milliseconds on a laptop. There is no server to manage, no connection pooling to configure, and no per-query billing to worry about. For analytical workloads at this scale, it is genuinely faster than a hosted warehouse on queries against local data, because you eliminate the network entirely.
The practical detail that sealed it: dbt Core has a first-party DuckDB adapter (dbt-duckdb). My entire transformation layer runs locally with zero configuration changes. And if I eventually want to move this to the cloud, MotherDuck provides managed DuckDB hosting with near-zero migration cost. The exit ramp exists. I just don't need it yet.
What I Learned
Dagster's asset model required a mental shift early on. I kept wanting to think in tasks ("run the ingestion job") instead of assets ("materialize the bronze Oura dataset"). Once that clicked, the rest of the design followed naturally. Freshness policies in particular are underrated: I configured the bronze assets to flag as stale after 24 hours, which gives me a dashboard-level health check without writing any custom monitoring logic.
The surprise was how well dbt and Dagster compose. Dagster's native dbt integration (dagster-dbt) introspects your dbt project and surfaces each dbt model as a Dagster asset automatically. That means my full lineage graph, from raw API call to Streamlit dashboard, lives in one place. I expected to glue these tools together. I did not expect the glue to already exist and actually work.
If I were starting over, I would define the DuckDB schema more formally up front. I evolved it organically and paid for that with a couple of messy migration scripts mid-project.
The Results
The pipeline ingests from three APIs, processes data through bronze, silver, and gold medallion layers, passes 17 automated dbt data quality tests, and surfaces findings in a Streamlit dashboard. The most interesting finding: sleep score and commit volume have a moderate positive correlation on weekdays, but weather is essentially noise for my coding output. The infrastructure to find that out cost nothing beyond my laptop and a few evenings.
Takeaway
This stack (Dagster + dbt Core + DuckDB) is genuinely good for local-first analytics projects and probably right for small-to-medium team projects that do not need a managed cloud warehouse yet. It is not the right call if you need multi-user concurrency, petabyte-scale storage, or enterprise IAM out of the box. But for building something rigorous, testable, and portfolio-worthy without a cloud bill, it is hard to beat.
The full repo is at github.com/agalloch88/data-pipeline. If you are building something similar or have opinions on where this stack breaks down at scale, I want to hear it.
Ryan Kirsch
Data Engineer at the Philadelphia Inquirer. Writing about practical data engineering, local-first stacks, and systems that scale without a cloud bill.