How Claude Code Front-Loads Context Without Blowing Tokens

Q: How does compaction decide what to summarize?

Compaction targets the conversation history $H$ and dynamic reminders $R$ while preserving the stable prefix. It starts by dropping stale tool outputs, then converts long message pairs into summaries, and finally rewrites the entire history as a structured summary. The runtime re‑injects CLAUDE.md, memory, and MCP tool names after compaction so that project rules stay intact.

Q: Can I prevent auto‑compaction entirely?

You can run `/context` to monitor usage and manually call `/compact` on your own schedule. You can also set a higher utilization threshold through configuration, but if the context window is exceeded, the API will truncate generation. Preventing compaction is not recommended because it risks an incomplete model response.

Claude Code front-loads thousands of tokens, yet sessions stay cheap. Learn how it budgets, caches, and compacts context to keep usage low.

When you start Claude Code, it piles on system instructions, tool schemas, project rules, auto‑learned memory, and git status before your first keystroke. You might expect that firehose of tokens to drain your budget and eat half the context window in one gulp. Yet a typical enterprise developer using Claude Code costs around $13 a day, and the initial blast lands closer to 4 % of a 200k‑token window. The secret is not that the model is cheap. The secret is that Claude Code treats context like a limited compute resource. It budgets, caches, and ruthlessly compacts every piece so that front‑loaded knowledge helps the model without drowning it.

Think of a professional kitchen during a busy service. The chef pulls out knives, salt, stock, and recipe notes before the first order arrives. Those items stay on the counter all night and they cost nothing to re‑reference. But raw vegetables, dirty pans, and tasting spoons pile up fast. The chef clears what isn’t needed and only pulls fresh ingredients at the moment of use. Claude Code does something similar with your context window. It eagerly sets out the stable, high‑value items — CLAUDE.md rules, memory indexes, tool names — and then defers or aggressively compresses the bulky, volatile stuff: large files, long command outputs, full skill bodies. And because the AI platform caches repeated prefixes, the expensive setup work gets amortized across every turn instead of being repaid on every call. The result feels like an always‑on, deeply informed assistant that somehow never overruns its mental workspace.

The context window as working memory

Claude models see the entire conversation history plus all system and tool content on every call. On a 200k‑token model like Sonnet 4.5, that full sequence is the working memory. On newer Claude 5‑series models the window stretches to 1 M tokens, but bigger does not mean better if you fill it with noise Claude models. Every token in the input costs money and competes for the model’s attention. The official documentation says it plainly: context should be treated like curated working memory, not a dumping ground Managing context windows.

Claude Code internalizes that advice and formalizes the budget. Inside the runtime the context window is treated as a hard token limit (W) split across four parts: the system prompt (S), the tool definitions (T), dynamically injected system reminders (R), and the conversation history (H). The relationship is (W = |S| + |T| + |R| + |H|) Claude Code context compaction. The first two, (S) and (T), are mostly stable. (R) varies by turn but is kept small. That leaves (H) to grow with messages, file reads, and tool outputs, and it is (H) that eventually triggers compaction, not the upfront load.

What actually sits in that startup prefix? The system prompt encodes core behavior, safety rails, and the agent loop, coming in at roughly 12 k to 15 k tokens. The built‑in tool definitions add another 8 k tokens of JSON schemas. CLAUDE.md instructions, discovered by walking up the directory tree, are concatenated and capped at 12 k characters total across all files, with a 4 k‑character per‑file limit CLAUDE.md. Auto‑memory is loaded from the first 200 lines or 25 KB of MEMORY.md, which acts as a compact index of past lessons Memory. MCP servers contribute tool names but not full schemas. Skills show up as one‑line descriptions; their bodies stay out of context until invoked. Environment and git information — platform, branch, dirty tree, recent commits — is injected near the end of the system prompt as a brief status block Environment & git. Together these pieces can push 8 k to 9 k tokens on a 200k window, around 4 % to 5 %. It is a rich but deliberately bounded payload.

The magic that makes this affordable is prompt caching. Claude’s API recognizes repeated prefixes between requests and charges a fraction of the normal price for cached tokens Prompt caching. Claude Code arranges the startup material into a stable prefix that changes only when you update a CLAUDE.md or restart the session. The first API call in a session pays full price for that prefix. Every subsequent turn reuses the cached copy at a steep discount. The runtime further protects the prefix during compaction. When it summarizes the conversation history to reclaim space, it leaves (S) and (T) untouched so that the cache hit rate stays high Claude Code compaction logic. In practice, a long session might run hundreds of turns while the up‑front cost of front‑loading gets amortized to almost nothing.

How the runtime assembles context at startup

The startup sequence is a curated harvest, not a whole‑repo scan. When you launch Claude Code, it reads the working directory, platform, shell, and OS details. For git repos, it runs a lightweight git status and grabs recent commits. That snapshot is the only dynamic state that enters the prefix. In parallel, a directory walk collects CLAUDE.md files from the current directory upward, respecting the hard character limits CLAUDE.md. MEMORY.md is read and truncated to its first 200 lines or 25 KB. Skills are discovered from user and project directories; their full instructions stay on disk. MCP servers advertise their tool names MCP integration. Built‑in tool schemas are embedded into the system prompt. All of this is assembled into the initial prefix before the first API call ever fires.

Crucially, no file contents are injected at startup. The model knows the project’s directory structure and git status because the environment block tells it so, but it has not yet read a single line of code. That knowledge is front‑loaded without burning tokens on bulk file reads. When the conversation later needs a specific file, the model calls the read_file tool and the file’s content enters the context as part of (H). At that point the cost is paid, but only for what is actually needed.

The agentic loop that consumes context

Once the prefix is set, Claude Code drops into a single‑threaded loop that alternates between calling the model and executing tools. Each iteration sends the full current context window to the model. The model replies with either plain text, signaling it is done, or with one or more structured tool calls. If tools are requested, the runtime executes them, captures their output, and appends the results to (H). The loop repeats until the model stops issuing tools or the window fills up Agent loop.

This loop is where the burn rate is managed in real time. Hooks intercept every tool call. Before execution, PreToolUse hooks can inspect the request and restrict arguments. After execution, PostToolUse hooks can summarize, truncate, or redact the output before it reaches the model Hooks. Permissions can automatically allow or deny tool calls based on patterns, preventing dangerous or expensive commands from ever adding output to the window Permissions. System reminders are injected at specific moments to reinforce rules about the plan or the approaching compaction threshold. The effect is that the model’s working memory is fed only the information that has passed a series of gates.

Compaction: the multi‑tier defense against bloat

Context grows — inevitably, as the model reads files, runs tests, and writes code, (H) expands. Claude Code defines a utilization threshold, typically around 92 % to 95 % of the window, where it stops and compacts before the next model call would exceed the limit. This compaction is not a single erasure. It is a layered sequence of increasingly aggressive steps, each designed to free space while preserving as much signal as possible Context compaction.

The first tier clears raw tool outputs that are no longer referenced. The next tier compresses older message turns by replacing verbose exchanges with concise summaries, keeping recent context intact. If that is not enough, the runtime rewrites the entire history into a structured summary. Throughout every tier, the stable prefix — system prompt, tool definitions, CLAUDE.md, memory — is automatically re‑injected into the compacted state so that the model does not lose its project knowledge. MCP tool listings are re‑added as well. The runtime effectively freezes the cached prefix and only compacts (H) and (R), minimizing cache breakage.

This three‑tier approach means that front‑loaded context survives compaction. It persists across what looks to the user like a seamless conversation, even though the underlying history has been radically restructured. The model forgets the exact dialog that led to its current state, but it retains the stable instructions and environment awareness it needs to continue.

Controls that let you shape the budget

You are not a passenger watching the token budget evaporate. Claude Code exposes several knobs that let you actively manage what stays in context.

/context shows you a breakdown of current token usage across categories Context command.
/compact triggers an immediate compaction without waiting for the threshold.
/clear starts a fresh conversation with only the stable prefix, wiping (H) entirely.
Plan Mode restricts the model to high‑level thinking without executing tools, which keeps (H) small while you explore an approach Plan mode.
Subagents spin off isolated tasks with their own fresh context windows, so that a long side‑exploration does not pollute the main conversation Subagents.

These controls shift the trade‑off between continuity and cost. A developer who uses Plan Mode for architecture before heavy code reading can keep the main session lean. A subagent can search across many files, summarize the findings, and hand only the summary back to the parent, keeping the main window focused.

Property	Value
Typical startup token load (200k window)	8 k to 9 k tokens
System prompt size	~12 k to 15 k tokens
Built‑in tool definitions	~8 k tokens
CLAUDE.md global char limit	12 k characters across all files
Per‑file CLAUDE.md char limit	4 k characters
MEMORY.md load limit	First 200 lines or 25 KB
Default compaction threshold	92 % to 95 % of context window
Prompt cache discount	Significant reduction for repeated prefixes
Subagent context handling	Isolated fresh window, summary returned to parent

Frequently Asked Questions

Q: Does front‑loading all that startup context inflate the first turn’s token cost?
Yes, the first turn pays full price for the stable prefix because no part of it is cached yet. But prompt caching amortizes that cost across all subsequent turns in the session. On a session with dozens of interactions, the up‑front premium becomes negligible.

Q: How does compaction decide what to summarize?
Compaction targets the conversation history (H) and dynamic reminders (R) while preserving the stable prefix. It starts by dropping stale tool outputs, then converts long message pairs into summaries, and finally rewrites the entire history as a structured summary. The runtime re‑injects CLAUDE.md, memory, and MCP tool names after compaction so that project rules stay intact.

Q: Can I prevent auto‑compaction entirely?
You can run /context to monitor usage and manually call /compact on your own schedule. You can also set a higher utilization threshold through configuration, but if the context window is exceeded, the API will truncate generation. Preventing compaction is not recommended because it risks an incomplete model response.

Q: What happens if I accidentally blow past the context window?
Claude 5‑series models accept requests that exceed the window and signal stop_reason: "model_context_window_exceeded" when generation stops. Claude Code uses higher‑level compaction to avoid reaching that point, but if it happens, you can use /clear to restart with only the stable prefix.

Q: How much does prompt caching really save on a typical day?
Anthropic’s enterprise telemetry shows average daily cost around $13 per active developer, with a heavy contribution from cached prefixes. Without caching, front‑loaded context on every turn would balloon costs. The exact saving depends on session length, but for a developer who works in long sessions, the prefix cost falls to nearly zero after the first several turns.

If you want this kind of breakdown every week — how real developer tools actually manage resources under the hood — subscribe to Internals Decoded at internalsdecoded.com. We never waste your time, just like Claude Code doesn’t waste tokens.