The Concept
There's a new job title floating around. "Context engineer." LinkedIn is already full of them. The pitch is that building AI applications requires a new discipline — one focused on curating, structuring, and delivering the right information to a language model at the right time.
I've been watching this conversation for months and I keep having the same reaction: this is data engineering. Not metaphorically. Not "kind of like" data engineering. It is data engineering, with a different consumer at the end of the pipeline.
The entire history of data engineering is the history of getting the right data, in the right shape, to the right place, at the right time. That the consumer is now a language model instead of a dashboard or a data scientist doesn't change the discipline. It changes the interface.
If you've ever built an ETL pipeline, you already know how to do this. You just don't know that's what they're asking you to do.
If You Already Know Data Engineering, You Already Know Most of This
Let me make this concrete.
A data engineer's job is to take raw data from source systems — databases, APIs, event streams, flat files — and transform it into something useful for downstream consumers. The work is unglamorous. It's schema mapping. It's deduplication. It's figuring out that the created_at field in the payments table is UTC but the order_date in the orders table is Pacific time and nobody documented this. It's building pipelines that are reliable enough to run at 3am without paging anyone.
A "context engineer" building an AI application does the same thing. They take raw information from source systems — documents, databases, APIs, conversation history — and transform it into something useful for a downstream consumer. The consumer happens to be an LLM instead of a BI tool. The "transformation" is called "prompt construction" instead of "data modeling." But the work is the same work.
Here's how the concepts map:
| Data Engineering | Context Engineering |
|---|---|
| Source systems (databases, APIs, files) | Knowledge sources (docs, APIs, user data) |
| ETL pipeline | Context assembly pipeline |
| Data warehouse / lake | Vector store / retrieval index |
| Schema design | Prompt structure |
| Data quality checks | Context relevance filtering |
| Freshness SLAs | Retrieval recency requirements |
| Downstream consumers (dashboards, ML models) | Downstream consumer (LLM) |
| "Garbage in, garbage out" | "Garbage in, garbage out" |
That last row isn't a joke. It's the most important row in the table. The quality of what comes out of an LLM is bounded by the quality of what goes in. This is so obvious to anyone who's built a data pipeline that it barely seems worth stating. And yet an entire cottage industry has sprung up around the discovery that if you feed bad context to a language model, you get bad outputs.
We've known this for thirty years. We just called it data quality.
What's Actually New
The analogy maps cleanly in most places. But there are two genuine differences worth understanding, because they change how you design systems even if the principles are the same.
The consumer is non-deterministic. A dashboard renders the same data the same way every time. An LLM does not. Give it the same context twice and you might get different outputs. This means your pipeline needs to be more precise about what it includes, because you can't rely on the consumer to ignore irrelevant data. A BI tool will just not render a column you didn't ask for. An LLM might hallucinate based on it.
In data engineering terms: your downstream consumer has no schema enforcement. It will try to use everything you give it. This makes the transformation layer — what you include, what you exclude, how you order it — more consequential than it is in a traditional pipeline.
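A minimal sketch of what that means in practice, with a hypothetical customer record (the field names are made up): the whitelist has to live in your pipeline, because the consumer won't enforce one for you.

# Hypothetical record pulled from a CRM. A dashboard simply wouldn't render
# the columns it wasn't asked for; an LLM will happily reason about every
# field you hand it, relevant or not.
customer = {
    "name": "Acme Corp",
    "plan": "enterprise",
    "open_tickets": 3,
    "internal_churn_score": 0.82,          # not something to leak into an answer
    "billing_notes": "disputed invoice, legal involved",
}

# Explicit whitelist: this is the schema enforcement your consumer won't do.
ALLOWED_FIELDS = {"name", "plan", "open_tickets"}
context_record = {k: v for k, v in customer.items() if k in ALLOWED_FIELDS}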
The context window is a hard constraint with no equivalent. Data warehouses scale. You can throw more data into BigQuery and it handles it. Context windows don't work like that. You have a fixed budget — 128k tokens, 200k tokens, whatever the model supports — and everything you want the model to consider has to fit inside it. This is closer to designing for an embedded system than designing for a warehouse. You're optimizing for a constrained environment.
This changes the retrieval problem. In traditional data engineering, the hard part is usually getting data in — ingestion, transformation, loading. In context engineering, the hard part is keeping data out. You have a hundred documents that might be relevant. You can fit twelve. Choosing the right twelve is the entire game.
Under the Hood
Here's what a context assembly pipeline actually looks like. If you've built ETL, this will feel familiar.
Step 1: Ingestion. You pull from source systems. Internal docs in Confluence. Customer data from your CRM. Product specs from a wiki. Conversation history from the current session. This is your staging layer.
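As a sketch, ingestion from one source might look like the function below. The endpoint, space key, and response handling assume a Confluence-style REST API; auth is omitted, and the details will differ for your sources.

import requests

def ingest_confluence(space_key: str) -> list[dict]:
    """Pull raw pages from a Confluence space into the staging layer (auth omitted)."""
    resp = requests.get(
        "https://example.atlassian.net/wiki/rest/api/content",
        params={"spaceKey": space_key, "expand": "body.storage", "limit": 100},
        timeout=30,
    )
    resp.raise_for_status()
    # Staging, not transformation: keep the raw payload plus where it came from,
    # and resist the urge to clean anything at this stage.
    return [
        {"source": "confluence", "id": page["id"], "raw": page}
        for page in resp.json().get("results", [])
    ]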
Step 2: Transformation. You clean, chunk, and normalize. Strip HTML. Split documents into semantically coherent chunks. Normalize dates and formats. Deduplicate. This is your transformation layer. It runs on ingest, not at query time.
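Continuing the sketch, a deliberately naive transformation step. It assumes the page shape from the ingestion sketch above; a real pipeline would use a proper HTML parser and split on headings or paragraphs rather than fixed character counts.

import hashlib
import re

def transform(raw_pages: list[dict], chunk_size: int = 800) -> list[dict]:
    """Clean, chunk, and deduplicate. Runs on ingest, not at query time."""
    seen, chunks = set(), []
    for page in raw_pages:
        # Strip HTML tags and collapse whitespace.
        text = re.sub(r"<[^>]+>", " ", page["raw"]["body"]["storage"]["value"])
        text = re.sub(r"\s+", " ", text).strip()
        # Fixed-size chunking for brevity; semantic chunking works better.
        for i in range(0, len(text), chunk_size):
            chunk = text[i : i + chunk_size]
            digest = hashlib.sha256(chunk.encode()).hexdigest()
            if digest in seen:  # drop exact duplicates
                continue
            seen.add(digest)
            chunks.append({"source_id": page["id"], "text": chunk, "hash": digest})
    return chunks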
Step 3: Indexing. You embed the chunks and store them in a vector database. This is your serving layer — optimized for fast retrieval at query time. It's the equivalent of building materialized views or OLAP cubes.
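A sketch of that serving layer, using chromadb as a stand-in for whatever vector store you actually run. The metadata fields are assumptions, but you'll want something like them: they are the lineage you lean on later.

import time
import chromadb  # any vector store works; this is just a compact example

client = chromadb.Client()
collection = client.get_or_create_collection(name="docs")

def index(chunks: list[dict]) -> None:
    """Embed and load chunks into the serving layer, tagged with lineage metadata."""
    collection.add(
        ids=[c["hash"] for c in chunks],
        documents=[c["text"] for c in chunks],
        metadatas=[
            {"source_id": c["source_id"], "indexed_at": time.time()}
            for c in chunks
        ],
    )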
Step 4: Retrieval. A user asks a question. You embed the query, find the nearest chunks, and pull them. This is your query layer. Latency matters. Relevance matters more.
Step 5: Assembly. You take the retrieved chunks, the user's question, the conversation history, the system prompt, and any structured data (user profile, permissions, etc.) and you assemble them into a single prompt. This is your presentation layer — the equivalent of a dashboard rendering data for a human, except you're rendering context for a model.
def assemble_context(user_query, user_profile, conversation_history):
    # Step 4: retrieve candidate documents from the serving layer
    relevant_docs = vector_store.query(user_query, top_k=10)

    # Step 5: filter, then rank by relevance and recency
    filtered = [d for d in relevant_docs if d.score > 0.75]
    filtered.sort(key=lambda d: (d.score, d.freshness), reverse=True)

    # Budget: reserve tokens for system prompt, query, history, and response
    recent_history = conversation_history[-5:]  # last 5 turns
    budget = 100_000
    budget -= count_tokens(SYSTEM_PROMPT)
    budget -= count_tokens(user_query)
    budget -= sum(count_tokens(turn) for turn in recent_history)
    budget -= 4_000  # headroom for the model's response

    # Pack context greedily until the budget is exhausted
    context_chunks = []
    for doc in filtered:
        tokens = count_tokens(doc.text)
        if tokens > budget:
            break
        context_chunks.append(doc.text)
        budget -= tokens

    return {
        "system": SYSTEM_PROMPT,
        "context": "\n---\n".join(context_chunks),
        "history": recent_history,
        "profile": user_profile,  # structured data goes in alongside retrieved text
        "query": user_query,
    }
If that code reminds you of packing items into a knapsack with a weight limit — it should. It's the same problem. Data engineers have been solving knapsack-adjacent problems every time they decide what to materialize, what to cache, and what to compute on the fly.
There's a deeper lesson here that every data engineer learns the hard way and most AI tutorials ignore entirely: moving data is expensive, and copies rot.
The moment you extract data from a source system and copy it somewhere else — a vector store, a cache, a staging table — you've created a sync problem. The source changes. Your copy doesn't. Now you have two versions of the truth and a pipeline that needs to keep them aligned. This is the oldest problem in data engineering. It has never been fully solved. It has only ever been managed.
Context engineering inherits this problem wholesale. You chunk your Confluence docs and embed them into a vector store. Great. Someone updates a doc the next day. Your vector store is now stale. A user asks a question and gets an answer based on last week's architecture decision, not the one that got reversed on Monday. This isn't a hypothetical — it's the single most common failure mode in production RAG systems.
The principle is the same one data engineers have followed for decades: minimize copies, maximize freshness, and when you have to copy, know exactly how stale your copy is.
In practice this means:
- Query source systems directly when you can. If the data is structured and the source has an API, call it at retrieval time instead of pre-indexing a copy. Yes, it's slower. But it's always current. You wouldn't build a dashboard off a CSV export when you can query the database directly. Same logic applies.
- When you must copy, version everything. Track when each chunk was indexed. Track which source document it came from. Track whether that source document still exists. This is lineage. Data engineers build lineage because without it, you can't debug anything. Context engineers need it for the same reason.
- Set freshness SLAs and enforce them. "We re-index docs every 24 hours" is a freshness SLA. "We re-index when a doc changes" is a better one. "We have no idea when this was last indexed" is a production incident waiting to happen.
The instinct in AI tutorials is to index everything upfront and treat the vector store as the single source of truth. Don't. The vector store is a cache. Treat it like one. The source system is the source of truth. Always.
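A sketch of what "know exactly how stale your copy is" can look like, assuming the indexed_at metadata from the indexing sketch above and a last-modified timestamp you can get from the source system:

from datetime import datetime, timedelta, timezone

FRESHNESS_SLA = timedelta(hours=24)  # a policy choice, not a law of nature

def is_stale(chunk_metadata: dict, source_last_modified: datetime) -> bool:
    """Treat the vector store as a cache: flag chunks that have fallen behind the source."""
    indexed_at = datetime.fromtimestamp(chunk_metadata["indexed_at"], tz=timezone.utc)
    # Stale if the source changed after indexing, or if the SLA has been blown.
    return source_last_modified > indexed_at or (
        datetime.now(timezone.utc) - indexed_at > FRESHNESS_SLA
    )

Whether a stale chunk gets dropped, re-indexed, or served with a warning is a product decision. The point is that you can only make that decision if the lineage exists.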
Decision Framework
You should think of context engineering as a data engineering problem when:
- You're building RAG. You are literally building an ETL pipeline. Ingest documents, transform them into chunks, load them into a vector store, query at runtime. Use the same reliability patterns you'd use for any pipeline: monitoring, alerting, data quality checks, freshness SLAs.
- Your LLM outputs are unreliable. Before you blame the model, audit your pipeline. What context is it actually receiving? Is it stale? Is it irrelevant? Is it contradictory? Nine times out of ten, the problem is upstream — just like every data quality issue you've ever debugged.
- You're hitting context window limits. This is a capacity planning problem. Treat it like one. Profile your token usage. Understand where the budget goes. Optimize retrieval precision before you upgrade to a bigger model. A minimal token-profiling sketch follows this list.
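Here's that profiling sketch, assuming the dict returned by assemble_context above and using tiktoken for counting (cl100k_base is an approximation; use your model's encoding):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def profile_prompt(prompt: dict) -> None:
    """Print where the token budget actually goes, one line per prompt component."""
    total = 0
    for part, value in prompt.items():
        text = value if isinstance(value, str) else str(value)
        tokens = len(enc.encode(text))
        total += tokens
        print(f"{part:>10}: {tokens:>8,} tokens")
    print(f"{'total':>10}: {total:>8,} tokens")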
You should not blindly apply data engineering patterns when:
- The context is entirely conversational. A chatbot with no external knowledge sources isn't a data pipeline. It's a stateful application. Don't over-engineer it.
- You're doing single-shot generation. If you're passing a system prompt and a user query with no retrieval, there's no pipeline to build. That's just an API call.
What Your Manager Thinks It Does vs. What It Actually Does
Your manager thinks context engineering is a new discipline that requires hiring new specialists with AI-specific skills.
What it actually is: data engineering with a language model as the downstream consumer. The skills are the same. Schema design, pipeline reliability, data quality, freshness guarantees, capacity planning. The only new skill is understanding how LLMs consume context — which takes about a week to learn, not a career to build.
If your team already has data engineers, you already have context engineers. They just need to learn the interface of the new consumer. That's a Tuesday, not a reorg.
Ship This Weekend
Take an existing data pipeline in your organization — any pipeline that feeds a dashboard or report — and rewire the output to feed an LLM instead.
Concretely:
- Pick a pipeline that produces a daily or weekly summary (sales numbers, incident reports, deployment metrics — anything)
- Instead of rendering the output as a dashboard, format it as structured text
- Pass it to an LLM with a system prompt: "You are an analyst. Based on the following data, identify the three most important trends and explain why they matter."
- Email the result to yourself
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
daily_data = your_existing_pipeline.run()  # same pipeline, same data

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a data analyst. Identify the 3 most important trends in the data below. Be specific and cite numbers."},
        {"role": "user", "content": format_as_text(daily_data)},
    ],
)

send_email(to="you@company.com", body=response.choices[0].message.content)
You'll learn two things. First, that the quality of the LLM output is entirely dependent on the quality of the data you feed it — which you already knew but will now feel viscerally. Second, that "context engineering" is just deciding what to include in format_as_text(). That's data modeling. You've been doing it your entire career.
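For what it's worth, here is one possible shape for format_as_text, assuming the pipeline returns a dict with a date, current metrics, and prior-week values (all names illustrative):

def format_as_text(daily_data: dict) -> str:
    """Deciding what goes in here is the data modeling; everything else is plumbing."""
    lines = [f"Report date: {daily_data['date']}", ""]
    for metric, value in daily_data["metrics"].items():
        prior = daily_data.get("prior_week", {}).get(metric)
        delta = f" (prior week: {prior})" if prior is not None else ""
        lines.append(f"- {metric}: {value}{delta}")
    return "\n".join(lines)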
Further Reading
- What Is Context Engineering? (Simon Willison) — The best technical overview of how context engineering relates to prompt engineering. Willison is one of the few writers in this space who respects his reader's intelligence.
- The Rise of the AI Engineer (Swyx) — The essay that started the "AI engineer" conversation. Worth reading to understand the framing, even if you disagree with the conclusion that it's a new discipline.
- Fundamentals of Data Engineering (Reis & Housley) — If you're a software engineer who hasn't worked in data, this is the book. The mental models transfer directly to context engineering.
- Building RAG Applications (LangChain docs) — The canonical RAG tutorial. Read it with data engineering eyes and you'll see ETL everywhere.