Real-Time Analytics: Building a Streaming Data Warehouse with Redpanda and Materialize

The Real-Time Analytics Problem

Traditional data warehouses are optimized for batch. You run a pipeline every hour or every day, load the results, and analysts query the warehouse. This works well for historical analysis -- what happened last quarter, what cohorts churned in the past year -- but falls apart when the question is “what is happening right now.”

The naive fix is to run batches more frequently. Five-minute microbatches. One-minute microbatches. Eventually you are running Spark jobs every 30 seconds and discovering that your warehouse was never designed for that write frequency. You have table lock contention, duplicates in incremental models, and latency that keeps creeping up as the data volume grows.

The real fix is to treat streaming as a first-class primitive. That means a messaging layer that handles high-throughput writes with durable replay (Kafka or a Kafka-compatible system), and a compute layer that can execute SQL continuously over changing data (a streaming database or materialized view engine).

Redpanda: Kafka Without the Operational Overhead

Apache Kafka is the standard for high-throughput event streaming, but running Kafka in production is operationally intensive. Zookeeper (or KRaft), broker tuning, JVM heap management, partition rebalancing -- there is a reason Kafka expertise is its own career path.

Redpanda is a Kafka-compatible message broker written in C++ instead of Java/Scala. It is fully API-compatible with Kafka, which means any Kafka client, connector, or tool works with Redpanda without code changes. The operational difference is significant: no Zookeeper, no JVM, a single binary deployment, and meaningfully lower tail latency.

For data engineers who know Kafka but are tired of JVM tuning, Redpanda is a drop-in that trades some ecosystem maturity for much easier operations. You produce and consume exactly as you would with Kafka:

from confluent_kafka import Producer, Consumer
import json

# Producer -- identical to Kafka
producer = Producer({
    "bootstrap.servers": "redpanda-broker:9092",
    "acks": "all",
    "retries": 3,
})

def on_delivery(err, msg):
    if err:
        print(f"Delivery failed: {err}")

producer.produce(
    topic="order-events",
    key="order-12345",
    value=json.dumps({
        "order_id": "order-12345",
        "customer_id": "cust-789",
        "amount_cents": 4999,
        "status": "created",
        "timestamp": "2026-03-27T14:30:00Z",
    }),
    callback=on_delivery,
)
producer.flush()

# Consumer -- also identical to Kafka
consumer = Consumer({
    "bootstrap.servers": "redpanda-broker:9092",
    "group.id": "analytics-consumer",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["order-events"])

The Kafka-compatible API is why Redpanda is worth knowing even if you work primarily with managed Kafka (Confluent, MSK). The architecture knowledge transfers directly, and for local development or test environments, Redpanda spins up in a single Docker container in under 10 seconds.

Materialize: Streaming SQL That Feels Like a Database

Materialize is a streaming database built on Timely Dataflow and Differential Dataflow. From the user perspective, it looks like Postgres -- you connect with a Postgres client, write SQL, and query views. The difference is that those views are maintained incrementally in real time as new data arrives.

Instead of writing a Flink job or a Spark Streaming application, you write a SQL CREATE MATERIALIZED VIEW. Materialize subscribes to the upstream source (a Kafka topic, a Postgres CDC feed) and keeps the view result fresh as events arrive. Query latency is in the milliseconds because the computation runs continuously -- you are reading a precomputed result, not re-scanning data.

Setting up a source from Redpanda:

-- Connect Materialize to Redpanda
CREATE CONNECTION redpanda_conn
TO KAFKA (
    BROKER 'redpanda-broker:9092'
);

-- Declare the schema
CREATE SOURCE order_events
FROM KAFKA CONNECTION redpanda_conn (TOPIC 'order-events')
FORMAT JSON
(
    order_id     TEXT,
    customer_id  TEXT,
    amount_cents BIGINT,
    status       TEXT,
    "timestamp"  TEXT
)
ENVELOPE NONE;

-- Create a continuously updated materialized view
CREATE MATERIALIZED VIEW active_orders_by_customer AS
SELECT
    customer_id,
    COUNT(*)              AS total_orders,
    SUM(amount_cents)     AS total_spent_cents,
    MAX("timestamp")      AS last_order_at
FROM order_events
WHERE status != 'cancelled'
GROUP BY customer_id;

-- This query returns fresh results in milliseconds
SELECT * FROM active_orders_by_customer
WHERE customer_id = 'cust-789';

The view is incrementally maintained. When a new order event arrives for cust-789, Materialize updates only that row -- it does not re-scan the entire table. This is the core property that makes streaming SQL practical: sub-second latency without full recomputation on every query.

Connecting the Streaming Layer to dbt

One of the underappreciated properties of Materialize is its Postgres-compatible interface, which means dbt works with it almost out of the box. The dbt-materialize adapter lets you define streaming transformations as dbt models and run them through standard dbt commands.

# profiles.yml
materialize:
  target: dev
  outputs:
    dev:
      type: materialize
      host: localhost
      port: 6875
      user: materialize
      database: materialize
      schema: analytics

# models/streaming/customer_revenue_realtime.sql
# config block at top
# {{ config(materialized='materializedview') }}

SELECT
    customer_id,
    DATE_TRUNC('hour', "timestamp"::timestamptz) AS hour_bucket,
    COUNT(*)                                       AS orders,
    SUM(amount_cents) / 100.0                      AS revenue_usd
FROM {{ source('redpanda', 'order_events') }}
WHERE status = 'delivered'
GROUP BY 1, 2

The materialized view model type tells dbt to create a Materialize-maintained view rather than a standard table or view. dbt handles the lineage, documentation, and testing -- the same workflow you use for batch dbt models -- but the underlying computation is streaming.

This is the architectural win: your team does not need to learn Flink or Spark Streaming. Data engineers who know dbt can write streaming transformations. The streaming complexity is abstracted into the adapter layer.

A Full Real-Time Analytics Stack

Combining the pieces into a production-grade real-time analytics architecture:

Event producers -- application services, microservices, webhooks -- publish to Redpanda topics. Kafka-compatible clients, zero code changes from Kafka.
Materialize subscribes to Redpanda topics and maintains materialized views. SQL-defined transformations, millisecond query latency.
dbt-materialize manages the transformation layer. Streaming models live in the same dbt project as batch models, unified lineage and documentation.
Downstream BI tools connect to Materialize via Postgres driver. Metabase, Superset, Grafana -- any tool that speaks Postgres works without modification.
Batch warehouse (Snowflake, BigQuery) runs in parallel for historical analysis and complex aggregations that do not need real-time latency.

The parallel operation is intentional. Real-time does not replace batch for all use cases. Operational dashboards and fraud signals belong in the streaming layer. Historical cohort analysis and month-end revenue attribution belong in the batch warehouse. The right architecture uses both.

When Real-Time Is Worth It

Real-time infrastructure adds operational complexity. Before reaching for Redpanda and Materialize, verify the business case clearly.

Real-time is worth the investment when:

Decisions depend on freshness. Fraud detection, inventory management, live bidding systems, and operational SLA tracking all degrade meaningfully with batch latency.
User-facing features need it. If a product feature shows a live metric to users (active users online, real-time leaderboards, live transaction counts), the data needs to be seconds-fresh, not hours-fresh.
The volume justifies streaming. High-frequency event streams -- clickstreams, IoT sensors, payment transactions -- generate data continuously. Batch pipelines on these sources often create backpressure problems that streaming solves naturally.

Real-time is likely overkill when stakeholders only look at dashboards once a day, when the business cadence is weekly or monthly, or when your current batch pipeline already runs fast enough. Adding streaming for its own sake creates maintenance burden without business value.

Getting Started Without Rebuilding Everything

The practical starting point is to identify one high-value real-time use case -- typically an operational dashboard or a fraud/anomaly detection signal -- and build the streaming path only for that use case. Leave everything else in the batch warehouse.

Redpanda runs locally in Docker and Materialize has a developer tier. You can prototype a working end-to-end streaming pipeline in an afternoon without any cloud infrastructure. Once the pattern is proven, moving to managed Redpanda Cloud and Materialize Cloud is straightforward.

The combination of Kafka-compatible streaming (Redpanda), streaming SQL (Materialize), and familiar dbt workflows lowers the barrier to real-time analytics further than it has ever been. You do not need a dedicated streaming team or deep Flink expertise to deliver millisecond query latency. You need a clear use case, the right tooling, and a dbt model.