Event-Driven Architecture for Data Engineers: When and How to Build Event Pipelines

What Makes Something an Event

An event represents something that happened at a specific point in time. It is immutable -- it cannot be changed after it occurs. A user clicked a button. An order was placed. A payment failed. These are facts about the world at a moment in time.

This is distinct from a state update. “The customer's email address is currently X” is state. “The customer changed their email address at 3:47 PM on March 27” is an event. Both are useful for different purposes, and the right data platform often needs both.

The data engineering value of events:

Complete history. A stream of events lets you reconstruct the state of any entity at any point in time. Current state (what a table row says today) cannot do this -- it only reflects the most recent update.
Decoupling. Event producers do not need to know about event consumers. A new analytics use case can subscribe to an existing event stream without changing the producer.
Replay. When your pipeline logic changes, you can replay historical events through the new logic without asking source systems to re-export data.

Event Schema Design: Decisions That Matter

Event schema design is one of the highest-impact decisions in an event-driven system. Events are immutable once emitted, which means schema mistakes are permanent. The schema you define today will be in your event store for years.

# Good event schema design - CloudEvents-inspired
{
  # Standard envelope fields (required on ALL events)
  "id": "01HZGT...",           # UUID, globally unique
  "specversion": "1.0",
  "type": "com.company.orders.created",  # Reverse-DNS namespaced
  "source": "/orders-service/v2",
  "time": "2026-03-27T14:30:00Z",        # RFC 3339, always UTC

  # Event-specific data
  "data": {
    "order_id": "ord_abc123",
    "customer_id": "cust_xyz789",
    "amount_cents": 4999,
    "currency": "USD",
    "line_items": [
      {"product_id": "prod_001", "quantity": 2, "unit_price_cents": 2499}
    ],
    "status": "created"
  },
  
  # Schema version for evolution
  "dataschema": "https://schemas.company.com/orders/created/v2"
}

The design decisions that cause the most downstream pain:

Embedding state instead of change. An event named order.status_changed should include both the old status and the new status, not just the new status. A consumer that needs to count transitions (pending to processing, processing to shipped) needs both values. If you only emit the new status, you cannot reconstruct the transition without joining to a separate state table.

No schema registry. Without a schema registry (Confluent Schema Registry, AWS Glue Schema Registry), producers and consumers have no formal contract. Schema changes break consumers silently. Register every event schema before publishing.

Using internal IDs only. If your events only reference internal database IDs, consumers that do not have access to that database cannot interpret them. Include enough context in the event that a consumer can act on it without requiring a separate lookup -- or at least include the key identifiers that allow a single-query enrichment.

Consumer Patterns: At-Least-Once vs. Exactly-Once

Kafka and most event streaming systems guarantee at-least-once delivery by default. This means a consumer may receive the same event more than once (typically after a failure and retry). Your pipeline must handle this.

Idempotent consumers handle duplicate events safely. The pattern is to use a deduplication key (the event's unique ID) in your write logic:

# Idempotent Snowflake load from Kafka events
MERGE INTO silver.orders AS target
USING (
    SELECT
        event_id,
        data:order_id::string AS order_id,
        data:customer_id::string AS customer_id,
        data:amount_cents::integer AS amount_cents,
        time::timestamp AS event_time
    FROM staging.order_events_raw
    WHERE event_type = 'com.company.orders.created'
) AS source ON target.order_id = source.order_id
WHEN MATCHED THEN
    UPDATE SET updated_at = CURRENT_TIMESTAMP()
WHEN NOT MATCHED THEN
    INSERT (order_id, customer_id, amount_cents, event_time, created_at)
    VALUES (source.order_id, source.customer_id, source.amount_cents,
            source.event_time, CURRENT_TIMESTAMP());

Exactly-once processing is available in Kafka (via transactions and idempotent producers) but adds significant complexity. For most data engineering use cases, at-least-once delivery with idempotent consumers is the right trade-off: simpler to operate, correct in result, negligible performance overhead.

Consumer Groups and Parallelism

Kafka partitions allow parallel consumption. Each partition is consumed by exactly one consumer in a consumer group. The parallelism ceiling of a consumer group equals the number of partitions -- adding more consumers than partitions leaves some consumers idle.

from confluent_kafka import Consumer, KafkaError
import json

consumer = Consumer({
    "bootstrap.servers": "kafka-broker:9092",
    "group.id": "analytics-pipeline-v2",  # Version in group ID for clean restarts
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,  # Manual commit for at-least-once guarantee
    "max.poll.interval.ms": 300000,  # 5 min max processing time per batch
})

consumer.subscribe(["order-events"])

try:
    while True:
        msg = consumer.poll(timeout=1.0)
        
        if msg is None:
            continue
        if msg.error():
            if msg.error().code() == KafkaError._PARTITION_EOF:
                continue
            raise Exception(f"Consumer error: {msg.error()}")
        
        event = json.loads(msg.value().decode("utf-8"))
        process_event(event)
        
        # Commit only after successful processing
        consumer.commit(asynchronous=False)
        
finally:
    consumer.close()

Use separate consumer groups for separate use cases. An analytics pipeline and a notification service consuming the same topic should have different group IDs. Each group maintains its own offset, so they can proceed independently without one blocking the other.

Change Data Capture: Events from Existing Databases

Most operational data does not originate as events -- it lives in relational databases as rows. Change Data Capture (CDC) converts database changes (INSERT, UPDATE, DELETE) into event streams, without requiring application code changes.

Debezium is the standard open-source CDC tool. It reads from the database's transaction log (Postgres WAL, MySQL binlog, MongoDB oplog) and emits an event for each change:

# Debezium connector config (Postgres)
{
  "name": "postgres-orders-cdc",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres.internal",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "${POSTGRES_PASSWORD}",
    "database.dbname": "orders",
    "table.include.list": "public.orders,public.order_items",
    "topic.prefix": "cdc.orders",
    "plugin.name": "pgoutput",
    "transforms": "unwrap",
    "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
    "transforms.unwrap.add.fields": "op,ts_ms",
    "slot.name": "debezium_orders_slot"
  }
}

# Each change emits to topic: cdc.orders.public.orders
# Message structure:
# {
#   "before": {...old row...},
#   "after": {...new row...},
#   "__op": "u" (update), "c" (create), "d" (delete)
#   "__ts_ms": 1774617292049
# }

CDC is the right approach for: populating a data warehouse from an operational database without polling, building event-driven pipelines from legacy systems that were not built to emit events, and maintaining low-latency synchronized copies for analytics.

Event-Driven Alongside Batch: The Hybrid Architecture

Event-driven architecture does not replace batch processing -- it complements it. The right mental model is a two-speed platform:

Event stream layer (Kafka/Redpanda): low latency, operational use cases, fraud signals, live dashboards, real-time feature serving for ML. Seconds to minutes of latency.
Batch warehouse layer (Snowflake/BigQuery): complex historical analysis, multi-table joins, period-over-period comparisons, long-term retention. Hours of latency, full history.

Events flow into both layers. The warehouse ingests events in micro-batches (every 5-15 minutes) for analytical storage. The stream layer consumes events in real time for operational use. Consumers pick the layer that matches their latency requirement.

The operational mistake is building two competing systems that disagree on numbers because they process the same events differently. Establish a single source of truth for event schemas (the schema registry) and use the same event definitions in both layers. Discrepancies between real-time and batch numbers should be explainable by processing lag, not schema drift.