What a Data Engineer Actually Builds for an LLM Application

Ryan Kirsch · October 10, 2025 · 6 min read

Most AI content focuses on the model. Here is the infrastructure that makes it work in production.

Every week there is another blog post about prompt engineering, fine-tuning, or which frontier model wins some benchmark. What you almost never read about is the data layer underneath. That layer is why the LLM application works. It is also why most LLM applications fail quietly: the retrieval is bad, the context is stale, and nobody can tell why.

I am a Senior Data Engineer at The Philadelphia Inquirer. We build with Dagster, dbt, and DuckDB. Over the last year I have spent a significant chunk of my time on LLM-adjacent infrastructure: ingestion pipelines, embedding workflows, vector store integrations. This is what that work actually looks like.

Not theory. Not a vendor tutorial. Here is what a data engineer owns, builds, and maintains when a team ships an LLM application.

The Data Layer No One Talks About

Here is the stack, top to bottom:

data source → ingestion pipeline → chunking + embedding → vector store → retrieval layer → LLM → application

The DE owns everything from "data source" through "retrieval layer." The model, the prompt templates, the frontend: those belong to other people. The infrastructure that gets the right content in front of the model at query time is mine.

What that means concretely:

  • Ingestion. Pulling from APIs, web sources, databases, file systems. Handling rate limits, auth failures, schema changes, and the fun situation where the upstream just... returns different fields now.
  • Cleaning and transformation. Stripping HTML artifacts, normalizing encoding, deduplicating, structuring unstructured text into something a chunker can work with.
  • Chunking and embedding. Splitting documents into chunks that will retrieve well. Running those chunks through an embedding model. Storing the vectors with their metadata.
  • Vector store integration. Writing to pgvector or Pinecone, maintaining indexes, and making sure the retrieval queries are actually returning relevant results.
  • Retrieval quality. This is ongoing. It is not a one-time setup. The pipeline needs monitoring the same way any production data pipeline does.

That is the scope. It is more than most job descriptions suggest.

The Three Hardest Engineering Problems

Chunking strategy

How you split a document determines what the retrieval layer can find. Get it wrong and the LLM gets irrelevant context or no coherent context at all.

Chunk too large: the retrieved chunk contains the answer buried in 2,000 tokens of noise. The model either misses it or hallucinates around it. Chunk too small: each chunk is semantically incomplete. A three-sentence chunk about a medication side effect means nothing without the two sentences before it that name the medication.

There is no universal right answer. The correct chunk size depends on the document type, the query patterns, and how the retrieval layer ranks results. I have shipped with 512-token chunks for article content and 128-token chunks for structured reference data. Both were right for their use case. The decision takes actual testing, not a tutorial default.
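As a rough sketch of what fixed-size chunking with overlap looks like, here is a minimal version. It splits on whitespace as a stand-in for real tokenization; a production pipeline would count tokens with the embedding model's actual tokenizer, and the sizes here are just the defaults discussed above, not recommendations.

```python
def chunk_tokens(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into fixed-size chunks of whitespace tokens with overlap.

    Whitespace splitting is a rough proxy for real tokenization; swap in
    the embedding model's tokenizer before trusting the chunk sizes.
    """
    tokens = text.split()
    if not tokens:
        return []
    step = chunk_size - overlap  # each chunk repeats the last `overlap` tokens
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

The overlap is the part worth tuning: it keeps a sentence that straddles a boundary retrievable from at least one chunk, at the cost of some duplicate storage.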

One thing that helps: storing chunk metadata alongside the vector. Section headers, document IDs, position in the original document. That context improves re-ranking and helps when you need to debug a bad retrieval result.
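Concretely, the record written to the vector store looks something like this. The field names are illustrative, not a schema from any particular store; the point is that the embedding never travels without its provenance.

```python
from dataclasses import dataclass

@dataclass
class ChunkRecord:
    # The embedding plus the metadata needed for re-ranking and for
    # tracing a bad retrieval back to its source document.
    doc_id: str
    chunk_index: int        # position within the original document
    section: str            # nearest section header
    text: str
    embedding: list[float]

def make_records(doc_id: str, sections: list[tuple[str, list[str]]]) -> list[ChunkRecord]:
    """Build records from (section_header, chunk_texts) pairs. Embeddings
    are left empty here; a real pipeline fills them from the embedding model."""
    records, idx = [], 0
    for header, chunks in sections:
        for text in chunks:
            records.append(ChunkRecord(doc_id, idx, header, text, []))
            idx += 1
    return records
```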

Embedding pipeline maintenance

Embeddings go stale. This is the part nobody mentions in the quickstart guides.

When a source document updates, you cannot just update the document in your warehouse. You have to re-embed the affected chunks and update the vector store. If you do not, the vector store holds embeddings for content that no longer exists as written. Retrieval returns outdated information. The LLM confidently cites something that changed three months ago.

At The Inquirer, content changes constantly. Articles get updated, corrections get appended, stories get retracted. Any embedding pipeline for news content has to handle incremental re-embedding as a first-class concern, not a later problem.
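One simple way to make incremental re-embedding tractable is to store a content hash with each embedded chunk and diff against it on every run. This is a generic sketch of that idea, not the Inquirer's pipeline; the dictionary shapes are assumptions for illustration.

```python
import hashlib

def stale_chunks(current_chunks: dict[str, str],
                 stored_hashes: dict[str, str]) -> tuple[list[str], list[str]]:
    """Compare current chunk text against hashes recorded at last embed time.

    Returns (to_embed, to_delete): chunk IDs whose text changed or is new,
    and stored IDs that no longer exist and should be removed from the
    vector store so retrieval cannot return retracted content.
    """
    def h(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    to_embed = [cid for cid, text in current_chunks.items()
                if stored_hashes.get(cid) != h(text)]
    to_delete = [cid for cid in stored_hashes if cid not in current_chunks]
    return to_embed, to_delete
```

The deletion half matters as much as the re-embed half: a retracted article that stays in the vector store is exactly the confident-but-wrong citation described above.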

This is where orchestration matters. In Dagster, I model the embedding pipeline as assets with dependencies. When the source article asset updates, downstream embedding assets know they are stale. The lineage is explicit. Without that, you are manually tracking what needs to be re-embedded and you will miss things.

Evaluation without ground truth

How do you test a retrieval pipeline when you have no labeled dataset of correct answers?

This is where data engineering and ML engineering start to overlap in uncomfortable ways. There is no clean "accuracy" metric to optimize. You cannot compute precision and recall if you do not have a labeled query set. Most teams building their first LLM app do not have one.

What I actually do: build a small set of known queries with known expected content, run them manually against the pipeline, and evaluate the top-k retrieved chunks by inspection. It is slow and not rigorous. It also catches the obvious failures: wrong document type being retrieved, chunking artifacts appearing in results, a metadata filter gone wrong.
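That manual process can be partially automated with a tiny harness: a hit rate over known queries, where "hit" means the expected document appears in the top-k results. The function names and the shape of `retrieve` are assumptions for this sketch, not a real evaluation framework.

```python
def hit_rate_at_k(queries: dict[str, str], retrieve, k: int = 5) -> float:
    """Fraction of known queries whose expected doc_id appears in the top-k.

    `queries` maps query text -> expected doc_id; `retrieve(query, k)`
    returns a ranked list of doc_ids. Not a substitute for labeled
    precision/recall, but enough to catch obvious regressions after a
    chunking or index change.
    """
    hits = sum(
        1 for query, expected in queries.items()
        if expected in retrieve(query, k)[:k]
    )
    return hits / len(queries)
```

Run it before and after any pipeline change; a drop in the number tells you where to start the manual inspection.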

Over time, user feedback signals become useful. If users are repeatedly rephrasing the same query, retrieval is probably failing them. That is a signal, not a metric, but it is something.

The honest answer is that DE evaluation for RAG pipelines is an unsolved problem in most shops. Anyone telling you otherwise is either running at massive scale with real labeled data or simplifying.

The Stack

Here is what I actually use, without the marketing copy.

Ingestion: Python with httpx for async API calls, requests when simplicity matters, BeautifulSoup for web sources. Most news content lives behind internal APIs. Some of it is HTML that has to be parsed carefully or you end up embedding navigation menus.


Transformation: dbt for structured source data: analytics tables, metadata, structured logs. Custom Python for unstructured text: cleaning, normalization, chunking logic. dbt does not handle arbitrary text transformation well and I do not try to make it.

Storage: DuckDB locally and in development pipelines where I need fast iteration without infrastructure overhead. BigQuery or Snowflake in production, depending on the client or the data volume. DuckDB's performance for development workflows has been genuinely surprising.

Embedding: OpenAI's text-embedding-3-small for most production work: fast, cheap, good quality. sentence-transformers locally when I need to iterate quickly without API costs or when a project has data sensitivity requirements that rule out sending content to a third party.

Vector store: pgvector for the majority of use cases. It runs inside Postgres, which means one less infrastructure dependency, and it handles the retrieval loads I have seen in practice. Pinecone when the scale or the query complexity justifies a dedicated vector database. Most projects do not need Pinecone. The ones that do know it early.
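For local iteration before pgvector is in the loop, brute-force cosine similarity over in-memory records is enough to test chunking and metadata decisions. This is a sketch of the ranking pgvector performs with its cosine distance operator plus an index, not a replacement for it; the record shape is assumed.

```python
import math

def cosine_sim(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec: list[float], records: list[dict], k: int = 5) -> list[dict]:
    """Brute-force nearest neighbors by cosine similarity.

    Fine for a few thousand chunks in development; in production the
    vector store does this ranking with an index.
    """
    scored = sorted(records,
                    key=lambda r: cosine_sim(query_vec, r["embedding"]),
                    reverse=True)
    return scored[:k]
```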

Orchestration: Dagster. Asset-based orchestration is the right mental model for embedding pipelines because the lineage matters. I need to know which embeddings depend on which source assets, and Dagster makes that explicit and trackable.

What Interviewers Are Actually Asking

Companies hiring DEs for LLM work are asking different questions than they were two years ago. Schema evolution and SCD Type 2 are still there. Now they are sitting next to questions like: "How would you design an embedding pipeline for a 500,000-document corpus?" and "What breaks first when your vector store goes out of sync with your source data?"

These are engineering questions, not data science questions. The skill set is pipeline design, infrastructure thinking, and operational discipline applied to a new class of storage: vector indexes.

If you are preparing for DE interviews in 2025 or 2026, especially at companies building AI products, this is the territory you need to cover. I put together 25 questions I have seen in real DE interview loops, including 10 specifically on LLM pipeline engineering. They are at drills.ryankirsch.dev.

DE work on LLM applications is infrastructure work. It is quieter than model development, less visible than the product surface, and more consequential than most teams realize until retrieval fails in production and the LLM starts confidently making things up. The data layer either works or everything downstream breaks. That is the job.

Preparing for data engineering interviews? The 2025-2026 Drill Pack at drills.ryankirsch.dev covers 25 questions across pipelines, modeling, and LLM infrastructure. Built from real interview loops.

Ryan Kirsch

Data Engineer at the Philadelphia Inquirer. Writing about practical data engineering, local-first stacks, and systems that scale without a cloud bill.

View portfolio →