Context Extension for Coding Agents & Large Documents¶

The Problem¶

Local LLMs have small context windows — typically 4k to 32k tokens depending on the model. This creates a hard ceiling for tasks that involve:

Entire codebases: A mid-size project may contain 100k–500k tokens of source code. No single LLM call can hold all of it.
Long documents: Technical specifications, legal contracts, research papers routinely exceed 20k tokens.
Ongoing conversations with coding agents: Tools like Claude Code or Continue.dev send the system prompt, file contents, tool results, and conversation history in every API call. This accumulates quickly.
Multi-step agentic workflows: Each tool call adds to the history; a long session can exhaust context before the task is done.

Increasing the context window is not a solution — larger windows require more VRAM, slow inference, and still have a ceiling. The real answer is architectural: restructure how information reaches LLMs so that each call receives only what it needs.

How MoE Sovereign Extends Effective Context¶

MoE Sovereign does not extend the physical context window of any individual LLM. Instead, it extends the effective context available to a query by distributing information across multiple specialized calls and structured retrieval layers.

flowchart TD
    Client["Claude Code / coding agent\n(sends: system_prompt + file_context\n+ conversation_history + query)"] --> API

    API["FastAPI :8002\nReceives full payload"] --> Normalize["Normalize + compress\nchat_history (max 4 turns, 3000 chars)"]
    Normalize --> Complex["complexity_estimator.py\nheuristic — no LLM call"]

    Complex -->|trivial| Direct["1 T1 expert\n(code_reviewer or technical_support)\nreceives: query + system_prompt + history"]

    Complex -->|moderate / complex| Planner["Planner LLM\nDecomposes into typed tasks\n(reads compressed history only)"]

    Planner --> PAR

    subgraph PAR ["Parallel specialist calls — each with focused context"]
        E1["code_reviewer expert\nreceives: task + system_prompt\n+ file_context slice + history"]
        E2["technical_support expert\nreceives: task + deployment context\n+ history"]
        GR["graph_rag_node\nStructured retrieval\n(entities, relations,\nwithout raw document text)"]
        MCP["mcp_node\nDeterministic tools\n(calculate, regex, file_hash, …)"]
    end

    PAR --> Merger["Merger LLM\nreceives: compressed summaries\n(MAX_EXPERT_OUTPUT_CHARS = 2400 each)\n+ graph_context\n+ mcp_result"]
    Merger --> Out["Final response"]

Mechanism 1 — Distributed Context (MoE Fan-Out)¶

Instead of one LLM receiving everything, each expert receives only what is relevant to its subtask:

Expert	Receives	Does NOT receive
`code_reviewer`	User query + code/file context + task	Legal reasoning, medical knowledge, math derivations
`technical_support`	User query + deployment/infra context + task	Creative writing, translation, irrelevant file sections
`math` (SymPy)	Symbolic expression only	Any natural language context
`mcp`	Tool name + arguments only	Full conversation history
`graph_rag`	Query terms + category filter	Raw document text (graph is pre-indexed)

Each expert call uses at most ~2400 characters of output (MAX_EXPERT_OUTPUT_CHARS). The merger synthesizes these focused outputs — it never needs to see the full raw documents itself.

Effective throughput: A 50k-token codebase is split by the planner into file-focused tasks. Each expert handles a relevant slice. The merger receives 2–4 × 2400 character summaries — well within any context window.

Mechanism 2 — History Compression¶

Every request from a coding agent carries conversation history. Without management this grows unbounded.

Strategy: - Maximum turns: Only the last 4 turns of conversation are included (HISTORY_MAX_TURNS=4) - Maximum chars: Total history is truncated at 3000 characters (HISTORY_MAX_CHARS=3000) - Compression: Turns that exceed the limit are replaced with […] markers — the LLM can infer that prior context was compressed - What is preserved: The most recent turns (most relevant for coding tasks) are always included; older turns are dropped first

This keeps history consumption bounded regardless of session length.

Per-template override (since April 2026):

Expert templates can now override the global limits via history_max_turns and history_max_chars in their config. Setting either to -1 disables compression entirely for that template — useful for benchmarking or long-context models.

Config value	Behaviour
`0` (default)	Use global `HISTORY_MAX_TURNS` / `HISTORY_MAX_CHARS`
`-1`	Unlimited — no compression, full history passed through
`N > 0`	Override with custom limit

Mechanism 3 — Structured Graph Retrieval (GraphRAG)¶

Rather than dumping raw documents into an LLM prompt, MoE Sovereign pre-indexes project knowledge into Neo4j:

Architecture decisions → entities + relations
Dependency graphs → DEPENDS_ON, USES, IMPLEMENTS triples
Procedural requirements → NECESSITATES_PRESENCE, ENABLES_ACTION (see Causal Learning)

At query time, graph_rag_node retrieves only the relevant slice as structured text:

[Knowledge Graph]
• LangGraph (Framework): DEPENDS_ON LangChain | USES Python
• FastAPI (Framework): USES Python | IS_A Framework

[Procedural Requirements]
• On-Premises Deployment NECESSITATES_PRESENCE Rechenzentrum (Location)

This is ~200–500 tokens of dense, structured information — far more efficient than including the full documentation or source files in the prompt.

Mechanism 4 — System Prompt as File Context Carrier¶

Claude Code and similar tools pass file context in the system_prompt field of the API request. MoE Sovereign passes this through the AgentState.system_prompt field and attaches it to expert calls in agent mode:

# In expert_worker, agent mode
if mode in ("agent", "agent_orchestrated"):
    expert_messages.insert(0, {"role": "system", "content": system_prompt})

This means:

The file context (active file, open tabs, tool results) travels with every expert call
The user query is separated from the file context and handled as a focused task
Experts can reference specific files without needing the full repo in context

For very large file contexts, the MAX_EXPERT_OUTPUT_CHARS limit ensures that even if an expert reads a large file, its output is bounded before reaching the merger.

Mechanism 5 — ChromaDB Semantic Cache¶

Repeated or similar queries hit the cache without any LLM call:

Cache distance < 0.15 → return stored response (< 50 ms, 0 tokens)
Cache distance 0.15–0.50 → inject few-shot example (~200 tokens) to guide expert

For coding agents that repeatedly ask similar questions in a session (e.g., variations of "how do I configure X"), the cache eliminates both context consumption and latency.

Mechanism 6 — Complexity-Based Context Pruning¶

The complexity_estimator.py module classifies queries without an LLM call:

Level	How determined	Context allocation
`trivial`	≤15 words, simple factual question	1 expert, no graph, no research, no thinking
`moderate`	16–79 words, code block or domain marker	Planner + 2–4 experts, graph allowed
`complex`	≥80 words or multi-step marker	Full pipeline including thinking node

Trivial queries (e.g., "What is a Docker volume?") never reach the planner, graph, or research nodes — they get a single focused expert call. This frees resources for complex queries that need the full pipeline.

Coding Agent Modes¶

Two operation modes are specifically designed for coding agent workflows:

`agent` (model: `moe-orchestrator-agent`)¶

Optimized for fast turnaround in IDEs (OpenCode, Continue.dev):

Forces code_reviewer + technical_support categories — skips unrelated experts
Skips <think> wrapper in SSE stream (rendered as raw text by IDE clients)
History and file context passed to both experts
Planner may be skipped entirely by semantic pre-router for common patterns

`agent_orchestrated` (model: `moe-orchestrator-agent-orchestrated`)¶

For Claude Code — full MoE pipeline with synthesis:

All expert categories available — planner decides freely based on query
force_think=True: thinking node runs to produce a coherent synthesis plan
skip_think=True: <think> tags are NOT emitted in SSE stream (Claude Code renders inline)
System prompt (file context) passed to all relevant experts
Graph context + MCP tools available for architecture queries

Token Budget Summary¶

Component	Token cost (typical)	Notes
Planner call	~400–800 tokens	Compressed history only; cached for 30 min
Expert call × 2	~600–1200 tokens each	Focused task + system_prompt slice
Expert output × 2	≤2400 chars (~600 tokens each)	Hard-limited
GraphRAG context	~100–500 tokens	Structured triples, not raw text
MCP tool result	~50–300 tokens	Deterministic, compact
Merger call	~2000–6000 tokens	Receives compressed summaries
Total typical	~6000–12000 tokens	vs. 50k+ if full codebase in one call

For a codebase that would naively require 100k tokens in a single call, the MoE approach brings the per-request token cost down to the 6k–12k range while covering the same semantic surface — through distribution, compression, and structured retrieval.

Configuration¶

Setting	Default	Effect
`HISTORY_MAX_TURNS`	`4`	Conversation turns included per request
`HISTORY_MAX_CHARS`	`3000`	Total history char limit
`history_max_turns` (template)	`0`	Per-template override (`0` = global, `-1` = unlimited)
`history_max_chars` (template)	`0`	Per-template override (`0` = global, `-1` = unlimited)
`MAX_EXPERT_OUTPUT_CHARS`	`2400`	Per-expert output cap before merger
`TOOL_MAX_TOKENS`	`8192`	Max tokens for MCP tool responses
`REASONING_MAX_TOKENS`	`16384`	Max tokens for thinking node output
`CACHE_HIT_THRESHOLD`	`0.15`	Cosine distance for hard cache bypass
`SOFT_CACHE_THRESHOLD`	`0.50`	Distance for few-shot injection

All of these are adjustable via Admin UI → Dashboard → Pipeline Settings.