
Kafka in Production: Lessons from Running Real-Time Pipelines at Scale

Ryan Kirsch · October 15, 2025 · 9 min read

After three years running Kafka at a major news publisher — millions of pageviews, real-time editorial analytics, and CDC pipelines feeding a Snowflake warehouse — here is what I wish someone had told me before I started.

Kafka is one of those tools that looks simple until you run it in production. The concept is elegant: a distributed, append-only log. Events go in, consumers read them, everything is durable and replayable. The tutorials make it look like twenty lines of Python. Then you push your first million-message day and realize how many design decisions you made quietly, without noticing they were decisions at all.

What follows is not a Kafka introduction. It is a collection of hard-won lessons from running real-time streaming pipelines at a publication that cannot afford downtime during breaking news. Some of these cost us incidents. All of them made us better.

Lesson 1: Topic Design Is Schema Design. Treat It That Way.

The biggest architectural mistake I see with Kafka is treating topics as throw-away queues. You create a topic, start publishing, and figure out the structure later. This works fine for a while. Then you have six consumers downstream and a schema change, and you spend a week untangling compatibility.

Topics are contracts. Every producer is making a promise to every current and future consumer about what data looks like. Violating that promise silently is how you end up with broken dashboards at 9 AM on a Monday.

What actually works: enforce schemas via a Schema Registry from day one. We use Confluent Schema Registry with Avro. It adds a small operational overhead to set up, but it catches breaking changes at publish time, before they reach consumers. The rule is simple: if you can't describe your message contract in an Avro schema, you do not have a message contract yet. Go define one.
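To make "message contract" concrete, here is what a minimal Avro schema for a pageview event might look like. The field names are illustrative, not our production contract; the point is that every field, its type, and its default is written down where the Schema Registry can enforce it. Note that the nullable fields are union types with defaults, which is exactly what makes adding an optional field a backward-compatible change instead of a breaking one.

```json
{
  "type": "record",
  "name": "Pageview",
  "namespace": "analytics.events",
  "fields": [
    {"name": "content_id", "type": "string"},
    {"name": "user_id", "type": ["null", "string"], "default": null},
    {"name": "ts", "type": {"type": "long", "logicalType": "timestamp-millis"}},
    {"name": "referrer", "type": ["null", "string"], "default": null}
  ]
}
```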

Partition key design is the other piece of topic architecture people underinvest in. The key determines ordering guarantees and load distribution. For our pageview events, we partition by content ID. That means all events for a given article arrive in order to the same partition, which matters when we are computing session-level engagement in real time. If we had partitioned by user ID or randomized, we would have lost that ordering guarantee entirely.
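A broker-free sketch of why this works: a key-based partitioner is just a deterministic hash of the key modulo the partition count. Kafka's default partitioner actually uses murmur2 over the key bytes; SHA-1 stands in below purely to illustrate the property that matters, which is that the same key always maps to the same partition.

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    # Deterministic hash of the key -> fixed partition.
    # (Kafka's default partitioner uses murmur2; SHA-1 is a stand-in here.)
    digest = int.from_bytes(hashlib.sha1(key.encode("utf-8")).digest()[:4], "big")
    return digest % num_partitions

# Every event keyed by the same content ID lands in the same partition,
# so per-article ordering is preserved regardless of event volume.
assert partition_for("article-123", 12) == partition_for("article-123", 12)
```

This is also why repartitioning a topic is a breaking change for ordering: change the partition count and the modulo shifts, so the same key can land somewhere new.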

Lesson 2: Consumer Lag Is Your Most Important Metric

You will get paged about many things with Kafka. The one you want to watch constantly, before it becomes a page, is consumer group lag. Lag is the gap between the latest offset in a partition and the offset the consumer group has committed — how far behind the head of the log your consumers are running. Low lag means your consumers are keeping up. Rising lag means something is wrong: a slow consumer, a spike in event volume, a downstream dependency timing out.

We expose consumer lag as a Prometheus metric and alert at two thresholds. A warning fires at 10,000 messages behind. An incident-level alert fires at 100,000. The warning gives us time to investigate before real impact. The incident-level alert means the real-time dashboard is already showing stale data, and we are in active remediation.
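With lag exported as a Prometheus gauge, the two thresholds translate into alerting rules along these lines. This sketch assumes the `kafka_consumergroup_lag` metric exposed by the common kafka-exporter, and the `for:` durations are illustrative; your exporter's metric name and your tolerance for flapping will vary.

```yaml
groups:
  - name: kafka-consumer-lag
    rules:
      - alert: ConsumerLagWarning
        expr: sum(kafka_consumergroup_lag) by (consumergroup, topic) > 10000
        for: 5m
        labels:
          severity: warning
      - alert: ConsumerLagCritical
        expr: sum(kafka_consumergroup_lag) by (consumergroup, topic) > 100000
        for: 2m
        labels:
          severity: critical
```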

The most common cause of lag in our environment: a consumer that is doing too much work per message. The pattern that kills throughput is when a consumer processes one event, makes a database call, waits for a response, processes the next event. Each round-trip to the DB adds latency. At low volume, this is invisible. At ten thousand events per second, it builds a backlog within minutes.

The fix is batching. Kafka consumers pull messages in batches by default. Work with that, not against it. Read a batch, buffer your DB operations, flush them in a single query. Your throughput will jump by an order of magnitude.
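Here is the difference in miniature, with no broker required. The `FakeDB` below just counts round-trips; in real code that role is played by your DB cursor's bulk write (e.g. `executemany`). A 500-event batch costs one flush instead of 500 calls:

```python
class FakeDB:
    """Stand-in for a real DB cursor; counts round-trips instead of writing."""
    def __init__(self):
        self.round_trips = 0
        self.rows_written = 0

    def executemany(self, sql, rows):
        self.round_trips += 1
        self.rows_written += len(rows)

def handle_batch(records, db):
    # Buffer the whole polled batch, then flush in a single statement --
    # one round-trip per poll instead of one per message.
    rows = [(r["content_id"], r["ts"]) for r in records]
    db.executemany("INSERT INTO pageviews (content_id, ts) VALUES (%s, %s)", rows)

db = FakeDB()
batch = [{"content_id": "article-123", "ts": i} for i in range(500)]
handle_batch(batch, db)
assert db.round_trips == 1 and db.rows_written == 500
```

The same shape works for any sink with a bulk API: buffer per poll, flush once, then commit offsets only after the flush succeeds.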

Lesson 3: Exactly-Once Is a Lie (Unless You Design for It)

Kafka's exactly-once semantics (EOS) get a lot of attention in the documentation, and the concept is real. You can configure producers and consumers such that, within the Kafka system, each message is processed exactly once. But Kafka is rarely the entire system. Once a message leaves Kafka and touches an external database, API, or file system, you are back in at-least-once territory.

The practical lesson: build your consumers to be idempotent, regardless of what delivery semantics you configure at the Kafka level. Assume every message might arrive twice. Design your writes so that processing the same message twice produces the same state as processing it once.
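A minimal sketch of the principle, using a dict as a stand-in key-value store. The write is a keyed upsert, so replaying the same message is a no-op; an append-style write would not have this property:

```python
def apply_event(store: dict, event: dict) -> None:
    # Keyed upsert: writing the same event twice produces the same state.
    store[event["id"]] = event["payload"]

store = {}
event = {"id": "msg-42", "payload": {"content_id": "article-123", "views": 7}}
apply_event(store, event)
apply_event(store, event)  # at-least-once delivery: the duplicate is harmless
assert store == {"msg-42": {"content_id": "article-123", "views": 7}}
```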

For our CDC pipeline feeding into Snowflake, we implemented this with a deduplication step in dbt. Every source table has a kafka_offset field. The dbt model that reads from the staging layer keeps only the row with the highest kafka_offset per primary key before writing to the presentation layer. Duplicate messages from Kafka become invisible downstream.
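The production version of that rule lives in SQL inside the dbt model, but the logic is simple enough to sketch in Python: per primary key, the row carrying the highest kafka_offset wins, which collapses both duplicates and superseded versions of a record.

```python
def dedup_by_offset(rows):
    # Per primary key, keep the row with the highest kafka_offset.
    latest = {}
    for row in rows:
        pk = row["id"]
        if pk not in latest or row["kafka_offset"] > latest[pk]["kafka_offset"]:
            latest[pk] = row
    return sorted(latest.values(), key=lambda r: r["id"])

staged = [
    {"id": 1, "kafka_offset": 10, "status": "draft"},
    {"id": 1, "kafka_offset": 10, "status": "draft"},      # duplicate delivery
    {"id": 1, "kafka_offset": 12, "status": "published"},  # later update wins
    {"id": 2, "kafka_offset": 11, "status": "draft"},
]
assert dedup_by_offset(staged) == [
    {"id": 1, "kafka_offset": 12, "status": "published"},
    {"id": 2, "kafka_offset": 11, "status": "draft"},
]
```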

Lesson 4: Dead Letter Queues Are Not Optional

Bad messages happen. A producer sends a malformed event. A schema change lands before all producers are updated. A downstream service returns an unexpected error. If your consumer does not know what to do with a bad message, it has two choices: crash, or skip. Both are bad.

Crashing stops processing for all messages, not just the bad one. Skipping means you silently lose data and will never know exactly how much.

Dead letter queues are the third option. When a message fails processing after N retries, publish it to a separate DLQ topic with enough metadata to understand what went wrong: the original message, the exception, a timestamp, the consumer group, the original topic and offset. Then build a process to review DLQ messages and decide whether they can be replayed once the root cause is fixed.
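In outline, the retry-then-park loop looks like this. The topic names, consumer group, and the injected `produce` callable are illustrative — `produce(topic, value)` stands in for your Kafka producer's send so the sketch stays broker-free:

```python
import json
import time

def process_with_dlq(record, handler, produce, max_retries=3):
    # `record` carries the consumed message plus its provenance.
    last_error = None
    for _ in range(max_retries):
        try:
            handler(record["value"])
            return True
        except Exception as exc:
            last_error = exc
    # Retries exhausted: park the message with enough metadata to replay it.
    produce("pageviews.dlq", json.dumps({
        "original_topic": record["topic"],
        "partition": record["partition"],
        "offset": record["offset"],
        "payload": record["value"],
        "error": repr(last_error),
        "consumer_group": "analytics-v2",  # illustrative group name
        "failed_at": time.time(),
    }))
    return False

sent = []
record = {"topic": "pageviews", "partition": 3, "offset": 1041, "value": "{not json"}

def handler(value):
    json.loads(value)  # a permanent error: this payload will never parse

ok = process_with_dlq(record, handler, lambda topic, value: sent.append((topic, value)))
assert ok is False and sent[0][0] == "pageviews.dlq"
```

The key property: the main consumer never blocks on a poison message, and nothing is silently dropped.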

In practice, DLQs catch two categories of problems: transient errors (a downstream service was briefly unavailable) and permanent errors (a malformed message that will never parse). For transient errors, replay from the DLQ once the service recovers. For permanent errors, the DLQ gives you a clear audit trail and time to investigate without blocking the main consumer.

Lesson 5: Kafka + Kafka Connect Changes What Is Possible

The most impactful thing we did with our Kafka setup was not the streaming analytics. It was using Kafka Connect with Debezium for change data capture (CDC) from our production Postgres databases.

Every insert, update, and delete in production flows into a Kafka topic as a structured event within milliseconds. Those events feed our Snowflake warehouse via a Confluent S3 sink connector, landing in staging tables that get transformed by dbt into clean analytical models. The result is a warehouse that reflects the operational database state with a typical latency of under five minutes.
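For context, a Debezium Postgres source connector is registered with Kafka Connect via a JSON config along these lines. The hostnames, table list, and topic prefix are placeholders, and exact property names vary slightly between Debezium versions — this is a sketch of the shape, not our production config:

```json
{
  "name": "cms-postgres-cdc",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "plugin.name": "pgoutput",
    "database.hostname": "cms-db.internal",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "********",
    "database.dbname": "cms",
    "topic.prefix": "cms",
    "table.include.list": "public.articles,public.pageviews"
  }
}
```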

Before this, we ran nightly ETL jobs that lagged by up to 24 hours. Now our editorial team can query what happened this morning in the analytics dashboard and get real answers. That is the kind of improvement that changes how people trust and use data.

Kafka Connect connectors are not magic, but they eliminate a lot of bespoke pipeline code. Each connector is a well-tested unit with known failure modes. When something breaks, you are debugging configuration, not custom Python. That is a meaningful operational advantage at scale.

When Not to Use Kafka

Kafka is a powerful tool. It is also a complex one. Running a Kafka cluster adds operational burden: topic management, retention policy, consumer group monitoring, Schema Registry, connector uptime. For a team without dedicated infrastructure engineering capacity, that overhead is real.

If your use case is a job that runs once an hour and processes a thousand rows, Kafka is probably overkill. A scheduled Airflow/Dagster job hitting a database directly will be simpler to build, simpler to debug, and simpler to hand to a new engineer. Complexity should be proportional to the problem it is solving.

Kafka earns its keep when you have high-volume event streams where ordering matters, when you need multiple independent consumers reading the same data at different rates, or when you need durable replayability across a complex system. If those criteria do not apply to your problem, reach for something simpler first.

The Actual Value

The real value of Kafka in our stack was not the technology. It was the organizational shift it enabled. When events are durable and replayable, you can add a new consumer without coordinating with every existing team. You can replay six months of events to backfill a new data model. You can debug production issues by replaying the exact sequence of events that caused them.

Kafka decoupled our producers from our consumers. Teams that produce data do not need to know who is consuming it or when. That decoupling reduced coordination overhead and let us build faster. That is the actual benefit. The streaming throughput is a nice property. The architectural decoupling is the thing that changed how we work.

Questions or pushback on any of this? Find me on LinkedIn.

Ryan Kirsch is a senior data engineer with 8+ years building data infrastructure at media, SaaS, and fintech companies. He specializes in Kafka, dbt, Snowflake, and Airflow, and writes about data engineering patterns from production experience. See his full portfolio.