DuckDB for Data Engineers: The In-Process Analytics Engine

DuckDB occupies a position that did not really exist before: a fully featured analytical SQL engine that runs in-process, requires no server, and handles analytical workloads that would have required a Spark cluster five years ago. The tagline "SQLite for analytics" undersells it. DuckDB is fast enough and capable enough to be a primary tool, not just a development convenience.

This guide covers what DuckDB is good at, how to use it in Python data pipelines, how to query Parquet and S3 files, how it integrates with dbt, and the honest limitations that tell you when to reach for something else.

What DuckDB Actually Is

DuckDB is a columnar, vectorized SQL database that runs embedded inside a host process (Python, R, Node, or as a standalone CLI). It stores data in its own binary format or queries external files directly without importing them first. There is no server to start, no port to connect to, and no configuration file to manage.

pip install duckdb

import duckdb

# Query a CSV directly — no import needed
result = duckdb.sql("SELECT COUNT(*) FROM 'events.csv'").fetchone()

# Query a Parquet file
df = duckdb.sql("""
    SELECT
        user_id,
        DATE_TRUNC('day', event_timestamp) AS event_date,
        COUNT(*) AS event_count
    FROM 'events.parquet'
    WHERE event_type = 'purchase'
    GROUP BY 1, 2
    ORDER BY 1, 2
""").df()  # Returns a pandas DataFrame

The performance on analytical queries is genuinely impressive. DuckDB uses vectorized execution (processing data in batches rather than row by row) and parallel execution across CPU cores by default. On a modern laptop, it can aggregate hundreds of millions of rows in seconds.

Querying Parquet and S3

DuckDB's ability to query Parquet files directly without loading them into memory first is one of its most powerful features. It uses predicate pushdown and column pruning to read only the data it needs.

# Query multiple Parquet files with glob pattern
result = duckdb.sql("""
    SELECT
        year,
        month,
        SUM(revenue) AS total_revenue
    FROM 'data/orders/year=*/month=*/*.parquet'
    GROUP BY year, month
    ORDER BY year, month
""").df()

# Query S3 directly (requires httpfs extension)
duckdb.sql("INSTALL httpfs; LOAD httpfs;")
duckdb.sql("""
    SET s3_region='us-east-1';
    SET s3_access_key_id='your_key';
    SET s3_secret_access_key='your_secret';
""")

result = duckdb.sql("""
    SELECT COUNT(*) 
    FROM 's3://my-data-bucket/events/2026/03/*.parquet'
""").fetchone()

# Use IAM role instead of credentials (preferred in production)
duckdb.sql("SET s3_use_credential_chain=true;")

For Hive-partitioned Parquet on S3, DuckDB reads the partition keys from the directory structure and uses them in predicate pushdown. A query filtering on year=2026 will skip all other year partitions without reading them.

DuckDB as a Pandas Replacement

For datasets that fit in memory or slightly exceed it, DuckDB often outperforms Pandas and produces cleaner, more readable code. The SQL interface is more expressive than method chaining for complex transformations.

import pandas as pd
import duckdb

# Load a pandas DataFrame
orders = pd.read_parquet('orders.parquet')
customers = pd.read_parquet('customers.parquet')

# DuckDB can query pandas DataFrames directly by name
result = duckdb.sql("""
    WITH customer_ltv AS (
        SELECT
            customer_id,
            SUM(amount) AS lifetime_value,
            COUNT(*) AS order_count,
            MIN(order_date) AS first_order,
            MAX(order_date) AS last_order
        FROM orders
        GROUP BY customer_id
    )
    SELECT
        c.segment,
        COUNT(*) AS customer_count,
        AVG(ltv.lifetime_value) AS avg_ltv,
        PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY ltv.lifetime_value) AS median_ltv
    FROM customers c
    JOIN customer_ltv ltv ON c.customer_id = ltv.customer_id
    GROUP BY c.segment
    ORDER BY avg_ltv DESC
""").df()

# Benchmark: for this type of multi-table aggregation,
# DuckDB is typically 3-10x faster than pandas on datasets > 10M rows

DuckDB in Data Pipelines

DuckDB fits naturally into lightweight pipeline patterns where you want SQL-based transformations without a warehouse connection.

from dagster import asset
import duckdb
import boto3
from pathlib import Path

@asset
def daily_revenue_parquet(context) -> None:
    """
    Read raw events from S3 Parquet, compute daily revenue,
    write result back to S3 as Parquet.
    """
    partition_date = context.asset_partition_key_for_output()
    
    con = duckdb.connect()
    con.execute("INSTALL httpfs; LOAD httpfs;")
    con.execute("SET s3_use_credential_chain=true;")
    
    con.execute(f"""
        COPY (
            SELECT
                DATE_TRUNC('day', event_timestamp)::DATE AS event_date,
                user_id,
                SUM(amount / 100.0) AS daily_revenue,
                COUNT(*) AS transaction_count
            FROM 's3://raw-data/events/date={partition_date}/*.parquet'
            WHERE event_type = 'purchase'
            GROUP BY 1, 2
        )
        TO 's3://processed-data/daily-revenue/date={partition_date}/data.parquet'
        (FORMAT PARQUET, COMPRESSION SNAPPY)
    """)
    
    row_count = con.execute(f"""
        SELECT COUNT(*) FROM 
        's3://processed-data/daily-revenue/date={partition_date}/data.parquet'
    """).fetchone()[0]
    
    context.log.info(f"Wrote {row_count} rows for {partition_date}")

dbt-DuckDB Integration

DuckDB has a first-class dbt adapter (dbt-duckdb), making it a fully supported warehouse target. This means you can run an entire dbt project locally against DuckDB without a cloud warehouse account, which is useful for development, testing, and CI.

# profiles.yml
my_project:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: /tmp/dev.duckdb   # local file
    prod:
      type: duckdb
      path: s3://my-bucket/warehouse.duckdb  # S3-backed
      extensions:
        - httpfs

The dbt-duckdb adapter supports external sources, allowing dbt to read directly from Parquet files on S3 without materializing them into the database first. This makes it possible to build a full transformation layer on top of a data lake without a managed warehouse service.

Honest Limitations

DuckDB is single-node. It scales to the resources of a single machine, which is substantial on modern hardware (a high-memory EC2 instance can handle terabytes) but has a ceiling that distributed systems do not. For petabyte-scale workloads, you need Spark, BigQuery, or Snowflake.

DuckDB is not a transactional database. It handles concurrent reads well but concurrent writes poorly. Do not use it as an operational database. It is an analytical tool.

The ecosystem is younger than Postgres or Spark. Some integrations that are mature for other systems are still rough around the edges for DuckDB. Check the specific connector or integration you need before committing to it in production.

Within those constraints, DuckDB is one of the most valuable additions to the data engineering toolkit in recent years. The ability to run analytical SQL locally, against files, without infrastructure, changes how you prototype and test pipelines. Once you start using it, reaching for a warehouse connection to run a quick exploratory query starts to feel unnecessarily heavy.