Data Engineering With DuckDB: Fast Local Analytics Without the Cloud
DuckDB does in seconds on your laptop what used to require a cloud warehouse. Here is what it actually is, when it belongs in your stack, and what you are giving up when you reach for it.
The first time I ran a 500MB CSV through DuckDB, I was expecting to wait. I had been working with Pandas for that class of problem for years, and the mental model was: CSV this size, expect some lag, grab coffee. DuckDB returned in under two seconds. The query was a multi-join aggregation with a window function. No configuration, no spinning up infrastructure, no cluster to wait for.
That experience is what DuckDB keeps delivering, and why it has become a standard tool in my local data engineering setup. This post covers what DuckDB is, where it fits in a modern data engineering workflow, how it compares to the tools you are already using, and where it breaks down.
What DuckDB Actually Is
DuckDB is an in-process analytical database. It runs embedded inside your application or script, with no separate server process, no network connection, and no cluster to manage. You install it as a library, import it, and query data directly. The whole engine lives in your process.
The "analytical" part is important. DuckDB is an OLAP database, optimized for column-oriented analytical queries rather than OLTP row-level transactions. It uses a vectorized query execution engine, which processes data in columnar batches rather than row by row. This is why it is so fast on aggregations, joins, and window functions against large flat files.
It is free, open source (MIT license), and actively maintained by a team that releases frequently. The Python, R, Node.js, Java, and CLI interfaces are all well-developed.
A minimal example of why it is compelling:
import duckdb
# Query a CSV directly, no loading step
result = duckdb.sql("""
SELECT
region,
SUM(revenue) AS total_revenue,
COUNT(DISTINCT customer_id) AS unique_customers,
-- window functions may reference aggregates after GROUP BY
RANK() OVER (ORDER BY SUM(revenue) DESC) AS revenue_rank
FROM 'orders_2025.csv'
WHERE order_date >= '2025-01-01'
GROUP BY region
ORDER BY total_revenue DESC
""").df()
print(result.head(10))

That query runs directly against the CSV file. DuckDB reads only the columns it needs (projection pushdown), filters during scan (predicate pushdown), and returns a Pandas DataFrame. No intermediate load step. No schema definition required.
When to Use DuckDB
DuckDB is not a replacement for a cloud data warehouse. It is the right tool for a specific set of problems, and knowing that set precisely is what makes it valuable rather than overused.
Local development and prototyping. This is DuckDB's strongest use case. When you are developing a dbt model, a data pipeline, or an analytical query, you need a fast feedback loop. Loading data into BigQuery or Snowflake for iterative development is slow, wasteful, and occasionally expensive. DuckDB gives you a local query engine that accepts the same SQL you will run in production. I use it as the local dbt adapter on every new project, running against sampled production data or synthetic test data.
Small-to-medium dataset analytics. "Small-to-medium" is relative to your hardware, but DuckDB handles datasets up to tens of gigabytes comfortably on a modern laptop with 16GB of RAM. For datasets that fit within that range, DuckDB is typically faster than spinning up cloud infrastructure, and the operational overhead is zero.
Embedded analytics in applications. DuckDB is designed for embedding. If you are building a data application, a CLI tool that does analytics, or a reporting system that needs to query structured data without a separate database server, DuckDB fits cleanly. The in-process architecture means no dependency on an external service being available.
File format conversions and exploration. DuckDB reads Parquet, CSV, JSON, and Arrow natively. For the common task of "I have a collection of Parquet files from S3, I need to understand the schema and run some exploratory queries," DuckDB is the fastest path available. It can also write to Parquet directly, making it useful for format conversion pipelines.
CI/CD pipeline testing. Running data quality tests, schema validations, and transformation logic checks in a CI pipeline benefits from a lightweight, fast database engine that does not require external service dependencies. DuckDB starts in milliseconds and does not need credentials or network access.
Core Query Patterns Worth Knowing
A few DuckDB-specific patterns that come up repeatedly in data engineering work:
Reading multiple files as a single table:
-- Read all Parquet files in a directory
SELECT * FROM read_parquet('data/events/**/*.parquet');
-- Read with filename column for partition awareness
SELECT
filename,
*
FROM read_parquet('data/events/**/*.parquet', filename=true);

Querying S3 directly (with the httpfs extension):
INSTALL httpfs;
LOAD httpfs;
SET s3_region='us-east-1';
SET s3_access_key_id='YOUR_KEY';
SET s3_secret_access_key='YOUR_SECRET';
SELECT COUNT(*), MIN(event_time), MAX(event_time)
FROM read_parquet('s3://your-bucket/events/2025/**/*.parquet');

Fast aggregation with window functions:
-- DuckDB handles large window function queries efficiently
SELECT
customer_id,
order_date,
revenue,
SUM(revenue) OVER (
PARTITION BY customer_id
ORDER BY order_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS cumulative_revenue,
ROW_NUMBER() OVER (
PARTITION BY customer_id
ORDER BY order_date
) AS order_sequence
FROM orders
ORDER BY customer_id, order_date;

Sampling large files for exploration:
-- Get a reproducible 1% sample for initial exploration
SELECT * FROM 'large_events.parquet' USING SAMPLE 1% (bernoulli, 42);
Writing Parquet for downstream use:
COPY (
SELECT customer_id, SUM(revenue) AS ltv
FROM orders
GROUP BY customer_id
) TO 'customer_ltv.parquet' (FORMAT PARQUET);
DuckDB vs. Pandas: The Practical Comparison
Pandas is the default Python tool for DataFrame manipulation, and DuckDB does not replace it entirely. The comparison is more nuanced than "DuckDB is faster."
DuckDB wins on: multi-table join performance, aggregation against large flat files, window functions, and anything requiring SQL semantics. For datasets above a few hundred megabytes, DuckDB's columnar execution and out-of-core processing (it can handle datasets larger than RAM) give it a substantial performance advantage over Pandas, which loads everything into memory.
Pandas wins on: row-level iteration (when you genuinely need to process row by row), complex custom Python transformations that do not map to SQL, integration with the scientific Python ecosystem (scikit-learn, matplotlib, etc.), and workflows where the input is already a DataFrame and the transformations are complex Python logic rather than SQL expressions.
The most practical pattern is using both: DuckDB for the heavy SQL aggregations and joins, returning results to Pandas for Python-side processing or visualization.
import duckdb
import pandas as pd
# Use DuckDB for the heavy lifting
aggregated = duckdb.sql("""
SELECT
product_category,
YEAR(order_date) AS year,
SUM(revenue) AS total_revenue
FROM 'orders_large.parquet'
GROUP BY 1, 2
""").df()
# Use Pandas for the Python-side work
pivot = aggregated.pivot(
index='product_category',
columns='year',
values='total_revenue'
)
pivot.plot(kind='bar', figsize=(12, 6))

DuckDB vs. Spark: When Scale Actually Matters
Spark is the distributed compute engine of choice for data at scale. The comparison with DuckDB is a question of when distribution is actually necessary.
Spark is appropriate when: your dataset is too large to fit on a single node (terabytes and above), your pipeline needs to run on a cluster for throughput reasons, or your organization already has Spark infrastructure and operational expertise.
DuckDB is appropriate when: your dataset fits on a single machine, you want fast iteration without cluster startup time (Spark cluster initialization alone can take minutes), or you are in a development and testing context where Spark overhead is disproportionate to the work being done.
The most important comparison point: Spark is operationally complex. It requires cluster management, serialization awareness, understanding of shuffle operations and broadcast joins, and distributed debugging skills. DuckDB requires none of that. For the wide class of problems that do not actually require distribution, choosing DuckDB is choosing fewer moving parts, faster development cycles, and substantially lower operational burden.
The practical test: if your data fits comfortably on a single VM with 32GB of RAM, DuckDB will almost certainly outperform a Spark cluster on that same data, at lower cost and with less operational overhead. Reach for Spark when single-node processing genuinely cannot keep up.
DuckDB vs. BigQuery: The Cloud Trade-Off
BigQuery is a managed, serverless, column-oriented analytical database at massive scale. The comparison with DuckDB is primarily about scale, cost, and collaboration.
BigQuery wins on: petabyte-scale datasets, multi-user concurrent query environments, built-in security and governance, ML integration, and anything requiring a persistent shared data layer across a team or organization.
DuckDB wins on: development speed (no loading, no credentials, no API calls), cost (free), offline capability, and scenarios where the data does not need to live in a managed cloud service.
The pattern I use in production: dbt with DuckDB locally for model development and unit testing, BigQuery as the production target. Models that work in DuckDB and pass local tests are promoted to BigQuery through CI. The local DuckDB dev loop is dramatically faster than the round-trip to BigQuery for iterative SQL development.
DuckDB in a dbt Project
dbt has first-class DuckDB support through the dbt-duckdb adapter. Setting up a local development profile is straightforward:
# profiles.yml
my_project:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: /tmp/dev.duckdb
      threads: 4
    prod:
      type: bigquery
      method: oauth
      project: my-gcp-project
      dataset: analytics
      threads: 8
      timeout_seconds: 300

With this setup, dbt run executes against the local DuckDB instance in development and against BigQuery in production. The SQL compiled by dbt is the same either way, with adapter-specific dialect differences handled by dbt's macro system.
Seed files, snapshots, and tests all work with the DuckDB adapter. The dev cycle becomes: write model, run dbt build --select my_model, inspect results in DuckDB, iterate. No waiting for warehouse slots, no query cost, no network latency.
Where DuckDB Falls Short
DuckDB is not the right tool in every situation. The limitations worth knowing:
Concurrent writes. DuckDB supports one writer at a time. If you have multiple processes attempting concurrent inserts or updates to a DuckDB database file, you will hit locking errors. This is by design: DuckDB is not built for OLTP or multi-writer workloads.
Dataset size limits. DuckDB can process datasets larger than RAM through its out-of-core execution, but it is still a single-node engine. For genuinely large datasets (multiple terabytes) that need to fit within a reasonable query time budget, a distributed system remains necessary.
Persistent shared storage. A DuckDB database is a file. It does not have a network interface, connection pooling, or built-in replication. If you need a shared persistent data store that multiple services or users query concurrently, a managed database service is the right choice.
Operational maturity for production workloads. DuckDB is excellent in pipelines and development environments, but deploying it as the primary query layer for a production analytics product with many concurrent users requires careful architecture. It can work, but it is not what it was designed for.
Getting Started
Install:
pip install duckdb

# For dbt integration
pip install dbt-duckdb
The quickest way to verify DuckDB belongs in your workflow: take a CSV or Parquet file you regularly work with in Pandas, run the same aggregation in DuckDB, and compare the time. For most data engineers doing that experiment for the first time, the result is conclusive enough to add DuckDB to the default local development setup immediately.
The tool earns its place not by replacing your cloud warehouse but by filling the gap between notebook-level Pandas and full warehouse infrastructure. For everything in that gap, which is most local data engineering work, DuckDB is currently the best option available.