Expert Templates & Profiles: Complete Guide¶

This is the definitive guide to MoE Sovereign's template and profile system. It covers the three processing modes, expert template design, Claude Code profiles, interactive chat integration, and every tuning lever available.

1. Processing Modes: Native vs Reasoning vs Orchestrated¶

Every request to MoE Sovereign is processed in one of three modes. The mode is selected by the CC Profile (for Claude Code) or by the expert template chosen as the model (for OpenAI-compatible agents).

Native Mode (`moe_mode: native`)¶

A single LLM handles the request directly. No planner, no experts, no merger.

The request is forwarded to one inference endpoint
The LLM processes tool calls (file edits, bash) natively
No pipeline overhead

When to use: Quick edits, simple bug fixes, interactive coding sessions, autocomplete, trivial questions.

When to avoid: Multi-domain synthesis, research tasks, anything requiring domain isolation or knowledge accumulation.

MoE Reasoning (`moe_mode: moe_reasoning`)¶

A single LLM with the Thinking Node enabled. The model performs chain-of-thought reasoning before generating its response.

Streaming <think> blocks show the reasoning process in real time
The thinking trace is visible in Open WebUI's thinking panel and in Claude Code's output
No parallel expert routing — one model does all the work, but with structured deliberation

When to use: Architecture decisions, complex debugging, code review where you want to see the reasoning process.

When to avoid: Simple edits (overkill), deep research (not enough domain coverage).

MoE Orchestrated (`moe_mode: moe_orchestrated`)¶

The full MoE pipeline activates:

Planner LLM decomposes the request into 1-4 subtasks
Expert LLMs process subtasks in parallel (domain-isolated)
MCP Precision Tools handle deterministic calculations (math, regex, subnet, etc.)
GraphRAG (Neo4j) injects accumulated knowledge from prior requests
SearXNG Web Research fetches current information when enabled
Merger/Judge LLM synthesizes all results into a single response
Critic Node fact-checks safety-critical domains (medical, legal)

When to use: Security audits, research tasks, multi-domain synthesis, knowledge-enriched analysis, any task where quality matters more than speed.

When to avoid: Interactive coding where feedback is expected within seconds.

Comparison Table¶

Aspect	Native	Reasoning	Orchestrated
Latency	5-30s	30-120s	2-10 min
Token Cost	1x	1.5x	3-5x
Output Quality	Good	Better	Best
Knowledge Accumulation	No	No	Yes (GraphRAG)
Domain Routing	No	No	Yes (15 categories)
MCP Precision Tools	No	No	Yes (23 tools)
Accumulation Effect	No	No	Yes (9.3x speedup over 5 epochs)
Thinking Blocks	No	Yes	Optional
Parallel Expert Execution	No	No	Yes
Critic / Fact-Check	No	No	Yes (medical, legal)

The accumulation effect

In orchestrated mode, GraphRAG stores synthesis results as SYNTHESIS_INSIGHT relations. Future requests in the same domain benefit from this accumulated knowledge. Over 5 epochs, benchmarks show a 9.3x latency reduction while quality improves by 56%. See Claude Code Profiles for the full benchmark data.

2. Expert Template Design¶

An expert template defines the complete configuration for an orchestrated MoE pipeline run: which LLMs to use, how to decompose tasks, which tools to enable, and how to synthesize results.

Template Structure¶

{
  "id": "tmpl-xxxxx",
  "name": "template-name",
  "description": "Human-readable description of the template's purpose",
  "planner_prompt": "MoE orchestrator. Decompose the request into 1-4 subtasks...",
  "planner_model": "phi4:14b@N07-GT",
  "judge_prompt": "Synthesize all inputs into one complete response...",
  "judge_model": "phi4:14b@N04-RTX",
  "experts": {
    "code_reviewer": {
      "system_prompt": "Senior SWE: correctness, security (OWASP Top 10)...",
      "models": [
        {"model": "gpt-oss:20b", "endpoint": "N09-M60", "role": "primary"},
        {"model": "devstral-small-2:24b", "endpoint": "N04-RTX", "role": "fallback"}
      ]
    },
    "technical_support": {
      "system_prompt": "Senior IT/DevOps. Immediately executable solutions...",
      "models": [
        {"model": "phi4:14b", "endpoint": "N07-GT", "role": "primary"},
        {"model": "qwen3:32b", "endpoint": "N04-RTX", "role": "fallback"}
      ]
    }
  },
  "enable_cache": true,
  "enable_graphrag": true,
  "enable_web_research": true,
  "cost_factor": 1.0
}

Field Reference¶

Field	Type	Description
`id`	string	Auto-generated unique template ID (`tmpl-xxxxxxxx`)
`name`	string	Human-readable name, appears in `/v1/models`
`description`	string	Shown in Admin UI and model listings
`planner_prompt`	string	System prompt for the Planner LLM
`planner_model`	string	Model and optional node for planning (`model@node` or `model`)
`judge_prompt`	string	System prompt for the Judge/Merger LLM
`judge_model`	string	Model and optional node for synthesis
`experts`	object	Map of category name to expert configuration
`experts.*.system_prompt`	string	Domain-specific system prompt for the expert
`experts.*.models`	array	Ordered list of model assignments (primary, fallback)
`enable_cache`	bool	Enable L1/L2 response and plan caching
`enable_graphrag`	bool	Enable Neo4j knowledge graph enrichment
`enable_web_research`	bool	Enable SearXNG web search for freshness
`cost_factor`	float	Multiplier for token budget accounting
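The field reference above can be turned into a quick pre-flight check before a template is saved. The following is a minimal sketch, not MoE Sovereign's own validation: the function name `validate_template` and the rules it enforces (required string fields, one `primary` model per expert, boolean toggles) are assumptions derived from the table.

```python
# Hypothetical pre-flight check; the rules are inferred from the field
# reference, not taken from MoE Sovereign's code.
REQUIRED_FIELDS = {
    "name": str,
    "planner_prompt": str,
    "planner_model": str,
    "judge_prompt": str,
    "judge_model": str,
    "experts": dict,
}
TOGGLES = ("enable_cache", "enable_graphrag", "enable_web_research")


def validate_template(template: dict) -> list:
    """Return human-readable problems; an empty list means none were found."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if not isinstance(template.get(field), expected):
            problems.append(f"{field}: expected {expected.__name__}")
    experts = template.get("experts")
    if isinstance(experts, dict):
        for category, expert in experts.items():
            roles = [m.get("role") for m in expert.get("models", [])]
            if "primary" not in roles:
                problems.append(f"experts.{category}: no primary model")
            if not expert.get("system_prompt"):
                problems.append(f"experts.{category}: empty system_prompt")
    for toggle in TOGGLES:
        if toggle in template and not isinstance(template[toggle], bool):
            problems.append(f"{toggle}: expected bool")
    return problems


example = {
    "name": "template-name",
    "planner_prompt": "MoE orchestrator. Decompose the request...",
    "planner_model": "phi4:14b@N07-GT",
    "judge_prompt": "Synthesize all inputs into one complete response...",
    "judge_model": "phi4:14b@N04-RTX",
    "experts": {
        "code_reviewer": {
            "system_prompt": "Senior SWE: correctness, security...",
            "models": [
                {"model": "gpt-oss:20b", "endpoint": "N09-M60", "role": "primary"},
            ],
        }
    },
    "enable_cache": True,
}
print(validate_template(example))
```

Returning a list of problems rather than raising on the first one lets an editor show every issue at once.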

T1/T2 Tier Strategy¶

Every expert in a template can have two model tiers:

T1 (Primary): Fast models at or below the tier boundary (default: 20B parameters). These respond in under 30 seconds and handle the majority of requests.
T2 (Fallback): Larger, more capable models above 20B. Engaged only when the T1 model reports CONFIDENCE: low (below the 0.65 threshold).

The tier boundary is configurable via EXPERT_TIER_BOUNDARY_B (in billions of parameters). The default of 20B means:

Models like phi4:14b, hermes3:8b, gpt-oss:20b are T1
Models like devstral-small-2:24b, qwen3-coder:30b, deepseek-r1:32b are T2

Design principle: T1 handles ~80% of requests fast. T2 activates only when depth is needed, keeping average latency low while preserving quality for hard problems.
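The escalation rule can be sketched as a small selection function. This is an illustration under assumptions: `pick_model` is a hypothetical helper, and how the pipeline converts a `CONFIDENCE` marker into a number is not specified in this guide, so the sketch simply takes a numeric confidence as input.

```python
CONFIDENCE_THRESHOLD = 0.65  # threshold quoted in the tier description


def pick_model(models, t1_confidence=None):
    """Return the primary entry, or the fallback once T1 confidence is too low.

    `models` is an expert's `models` list as shown in the template structure.
    `t1_confidence` is None on the first pass (T1 has not answered yet).
    """
    by_role = {entry["role"]: entry for entry in models}
    primary = by_role["primary"]
    if t1_confidence is None:
        return primary  # always try the fast T1 model first
    if t1_confidence < CONFIDENCE_THRESHOLD and "fallback" in by_role:
        return by_role["fallback"]  # escalate to T2
    return primary


code_reviewer_models = [
    {"model": "gpt-oss:20b", "endpoint": "N09-M60", "role": "primary"},
    {"model": "devstral-small-2:24b", "endpoint": "N04-RTX", "role": "fallback"},
]
print(pick_model(code_reviewer_models)["model"])
print(pick_model(code_reviewer_models, t1_confidence=0.4)["model"])
```

A confidence exactly at the threshold stays on T1 here, because the text describes escalation for values below 0.65.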

Pinned vs Floating Assignment¶

Models can be assigned in two ways:

Pinned (`model@node`, written as an explicit `endpoint` in an expert's model list):

{"model": "phi4:14b", "endpoint": "N07-GT", "role": "primary"}

The model always runs on the specified GPU node. This guarantees VRAM availability and eliminates cold-start latency (the model stays loaded). Use pinned assignments for production templates where predictable performance matters.

Floating (model only):

{"model": "devstral-small-2:24b", "role": "fallback"}

The system automatically selects the best available node with sufficient VRAM. This provides elasticity — the model can run on whichever node is least loaded. Use floating assignments for T2 fallback models and low-priority workloads.

Rule of thumb: Pin the Planner and Judge to fast nodes (RTX-class GPUs). Float T2 expert models for elasticity.
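To make the two assignment styles concrete, here is a minimal sketch that parses a `model@node` string and resolves a node for it. The helper names and the "most free VRAM wins" heuristic are assumptions standing in for whatever selection logic the scheduler actually uses.

```python
from typing import NamedTuple, Optional


class Assignment(NamedTuple):
    model: str
    node: Optional[str]  # None means the assignment is floating


def parse_assignment(spec: str) -> Assignment:
    """Split 'model@node' into its parts; a bare 'model' is floating."""
    model, sep, node = spec.partition("@")
    return Assignment(model, node if sep and node else None)


def choose_node(assignment: Assignment, free_vram_gb: dict, needed_gb: float):
    """Pinned: always the named node. Floating: the node with the most free
    VRAM that can still fit the model (an assumed heuristic), else None."""
    if assignment.node is not None:
        return assignment.node
    candidates = {n: free for n, free in free_vram_gb.items() if free >= needed_gb}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


free = {"N04-RTX": 10.0, "N09-M60": 24.0}  # illustrative numbers
print(choose_node(parse_assignment("phi4:14b@N07-GT"), free, 9.0))
print(choose_node(parse_assignment("devstral-small-2:24b"), free, 14.0))
```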

LLM Selection Guide¶

Based on empirical testing of 69 LLMs across 5 inference nodes:

Best Planner Models¶

Model	Latency	Why
`phi4:14b`	27-37s	Most reliable JSON output. Consistent, fast.
`hermes3:8b`	16s	Ultra-fast for simple decompositions
`gpt-oss:20b`	38s	Reliable, widely available
`devstral-small-2:24b`	45s	Strong on code-related planning

Best Judge/Merger Models¶

Model	Latency	Why
`phi4:14b`	1.7-4.2s	Extremely fast synthesis
`qwen3-coder:30b`	1.7s	Fast, code-aware, strong instruction following
`Qwen3-Coder-Next` (80B)	2.6s	Highest quality but needs large VRAM
`devstral-small-2:24b`	2.5s	Good for code-focused synthesis

Models to Avoid in Pipeline Roles¶

Model	Issue
`qwen3.5:35b`	Thinking mode produces `<think>` blocks instead of JSON — breaks planner and judge parsing
`deepseek-r1:32b`	Chain-of-thought interferes with JSON output — usable as expert only
`starcoder2:15b`	Code completion model, no instruction following capability
`gpt-oss:20b` (as judge)	Passes isolated tests but gets unloaded by Ollama TTL between expert calls in the pipeline

Thinking models break JSON

Models with "thinking" or "reasoning" modes (qwen3.5, deepseek-r1) tend to wrap output in <think> tags, breaking the JSON parsing that Planner and Judge roles require. Either use non-reasoning models for these roles or explicitly disable thinking in the system prompt.
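If a reasoning model has to be used in a JSON role anyway, a defensive parser can drop the thinking trace before decoding. This is a sketch of that general technique, not a description of how MoE Sovereign parses planner or judge output; `parse_pipeline_json` is a hypothetical name.

```python
import json
import re

# Non-greedy match so several <think> blocks are removed independently.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def parse_pipeline_json(raw: str):
    """Strip <think>...</think> blocks, then decode what remains as JSON.

    Raises json.JSONDecodeError if the remainder is still not valid JSON.
    """
    cleaned = THINK_BLOCK.sub("", raw).strip()
    return json.loads(cleaned)


raw_output = '<think>Two subtasks should be enough.\n</think>\n{"subtasks": []}'
print(parse_pipeline_json(raw_output))
```

This only helps when the model still emits valid JSON after the trace; it does not rescue output where the reasoning is interleaved with the JSON itself.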

System Prompt Best Practices¶

Planner Prompts¶

Do:

Demand JSON-only output explicitly
List all valid categories in the prompt
Provide a format example with the exact JSON schema
Include the PRECISION_TOOLS block for MCP routing

Do not:

Allow free-text explanations alongside JSON
Use thinking/reasoning instructions
Request markdown formatting

Judge/Merger Prompts¶

Do:

Instruct to preserve code blocks verbatim
Require provenance tags ([REF:entity]) citing which expert provided each insight
Demand verification steps for numeric values
Define explicit priority: MCP > Graph > high-confidence experts > Web > medium > low/cache

Do not:

Allow summarization of code blocks
Allow skipping security findings
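The source priority for the judge (MCP > Graph > high-confidence experts > Web > medium > low/cache) can also be applied when assembling the judge's input, so higher-trust material appears first. A minimal sketch follows; the source labels such as `expert_high` are invented for illustration.

```python
# Lower rank means higher priority. "low" experts and cache share the last rank.
RANK = {
    "mcp": 0,
    "graph": 1,
    "expert_high": 2,
    "web": 3,
    "expert_medium": 4,
    "expert_low": 5,
    "cache": 5,
}


def order_for_judge(inputs: list) -> list:
    """Sort judge inputs by source priority; ties keep their original order."""
    return sorted(inputs, key=lambda item: RANK[item["source"]])


inputs = [
    {"source": "cache", "text": "older answer"},
    {"source": "web", "text": "search snippet"},
    {"source": "mcp", "text": "subnet result"},
    {"source": "expert_high", "text": "review finding"},
    {"source": "graph", "text": "prior insight"},
]
print([item["source"] for item in order_for_judge(inputs)])
```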

Expert Prompts¶

Do:

Define the expert's domain boundary clearly
Require structured output markers: CONFIDENCE, GAPS, REFERRAL
Include domain-specific methodology (OWASP for security, S3/AWMF for medical, etc.)
End with language enforcement

Do not:

Mix domains (a security expert should not comment on code style)
Allow the expert to refuse with "I cannot help with that"
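Structured markers are only useful if they can be read back reliably. The sketch below assumes each marker appears on its own line as `NAME: value`, matching the `CONFIDENCE: low` form used earlier in this guide; the exact layout experts are asked to produce may differ.

```python
import re

# [ \t]* instead of \s* so an empty marker cannot swallow the next line.
MARKER = re.compile(r"^(CONFIDENCE|GAPS|REFERRAL):[ \t]*(.*)$", re.MULTILINE)


def extract_markers(expert_output: str) -> dict:
    """Collect CONFIDENCE / GAPS / REFERRAL lines into a lowercase-keyed dict."""
    return {name.lower(): value.strip() for name, value in MARKER.findall(expert_output)}


sample = (
    "Use parameterized queries for the login form.\n"
    "CONFIDENCE: low\n"
    "GAPS: no database schema provided\n"
    "REFERRAL: code_reviewer\n"
)
print(extract_markers(sample))
```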

Service Toggles¶

Each template can enable or disable pipeline components independently:

Toggle	Default	Effect When Disabled
`enable_cache`	`true`	Every request runs the full pipeline — no cache hits. Use for testing or when fresh responses are required.
`enable_graphrag`	`true`	No knowledge graph enrichment. Prior synthesis results are not injected. Use for privacy-sensitive queries where no knowledge persistence is desired.
`enable_web_research`	`true`	No SearXNG web search. Responses rely entirely on model knowledge and GraphRAG. Use in air-gapped environments or for speed-critical tasks.

Compliance Badges¶

Templates are automatically classified based on where their models run:

Badge	Meaning
Local Only (green)	All models run on local infrastructure. No data leaves the network.
Mixed (yellow)	Some models use external APIs. Data may leave the network for those calls.
External (red)	Primarily external API models. Most data leaves the network.

The compliance badge is visible in the Admin UI and helps CISOs assess data residency at a glance.
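The badge logic can be approximated in a few lines. This is a sketch under an assumption: the guide does not define "primarily external", so the cut-off used here (more than half of the template's models are external) is a guess, and `compliance_badge` is a hypothetical name.

```python
def compliance_badge(model_is_local: list) -> str:
    """Classify a template from one boolean per model (True = runs locally)."""
    external = model_is_local.count(False)
    if external == 0:
        return "Local Only"
    if external > len(model_is_local) / 2:  # assumed meaning of "primarily"
        return "External"
    return "Mixed"


print(compliance_badge([True, True, True]))
print(compliance_badge([True, True, False]))
print(compliance_badge([False, False, True]))
```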

3. Claude Code Profiles¶

Claude Code profiles (CC Profiles) control how MoE Sovereign handles requests from Claude Code CLI, the VS Code extension, and other Anthropic API clients. Each profile maps to a specific processing mode.

Profile Fields¶

Field	Description	Example
`name`	Display name shown in clients	`cc-ref-native`
`moe_mode`	Processing mode	`native`, `moe_reasoning`, `moe_orchestrated`
`tool_model`	LLM for tool execution	`gemma4:31b`
`tool_endpoint`	Inference server node	`N04-RTX`
`expert_template_id`	Expert template for orchestrated mode	`tmpl-d2300eb6`
`tool_max_tokens`	Max output tokens for tool calls	`4096`
`reasoning_max_tokens`	Max tokens for thinking blocks	`16384`
`tool_choice`	Tool selection behavior	`auto`, `required`, `any`
`stream_think`	Whether to stream thinking blocks	`true` / `false`
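As an illustration of how these fields fit together, the sketch below models a profile as a dataclass with a consistency check. The default values are simply the example values from the table, not documented defaults, and the rule that orchestrated mode needs a template is inferred from the field description.

```python
from dataclasses import dataclass
from typing import Optional

VALID_MODES = {"native", "moe_reasoning", "moe_orchestrated"}
VALID_TOOL_CHOICES = {"auto", "required", "any"}


@dataclass
class CCProfile:
    name: str
    moe_mode: str
    tool_model: str
    tool_endpoint: str
    expert_template_id: Optional[str] = None
    tool_max_tokens: int = 4096        # example value, not a documented default
    reasoning_max_tokens: int = 16384  # example value, not a documented default
    tool_choice: str = "auto"
    stream_think: bool = False

    def problems(self) -> list:
        """Return consistency problems; an empty list means none were found."""
        found = []
        if self.moe_mode not in VALID_MODES:
            found.append(f"unknown moe_mode: {self.moe_mode}")
        if self.tool_choice not in VALID_TOOL_CHOICES:
            found.append(f"unknown tool_choice: {self.tool_choice}")
        if self.moe_mode == "moe_orchestrated" and not self.expert_template_id:
            found.append("moe_orchestrated requires expert_template_id")
        return found


native = CCProfile("cc-ref-native", "native", "gemma4:31b", "N04-RTX")
print(native.problems())
```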

Three Reference Profile Families¶

Reference Profiles (Baseline)¶

Profile	Mode	Latency	Use Case
`cc-ref-native`	`native`	5-30s	Quick edits, simple fixes, interactive coding
`cc-ref-reasoning`	`moe_reasoning`	30-120s	Architecture decisions, complex debugging
`cc-ref-orchestrated`	`moe_orchestrated`	2-10 min	Deep research, security audits, multi-domain synthesis

Reference profiles use conservative settings and are suitable for most users out of the box.

Innovator Profiles (Power Users)¶

All three innovator profiles use moe_orchestrated mode with dedicated expert templates tuned for different quality/speed trade-offs:

Profile	Tool Model	Thinking	Max Tokens	Target Latency
`cc-innovator-fast`	`gemma4:31b`	Off	4,096	30-90s
`cc-innovator-balanced`	`Qwen3-Coder-Next`	On	8,192 / 16K reasoning	2-5 min
`cc-innovator-deep`	`Qwen3-Coder-Next`	On	8,192 / 32K reasoning	5-15 min

Key differences:

Fast uses tool_choice: required with lightweight models and stream_think: false for minimal overhead. Best for rapid iteration.
Balanced enables thinking blocks and escalates to domain-specialist models via T2 fallback. Good daily-driver default.
Deep uses the largest available models with a security-analyst expert category and an assertive system prompt demanding complete, production-grade code. Best for security audits and architecture reviews.

Creating Custom Profiles¶

Navigate to Admin UI > CC Profile
Click Create Profile
Fill in the profile fields (see table above)
Assign an expert template for orchestrated mode
Save the profile

Users can also create personal profiles in the User Portal under My Templates > CC Profiles. Personal profiles override admin-assigned profiles.

Binding Profiles to API Keys¶

Admin UI > Users > select user > API Keys
Set the CC Profile dropdown for the target key
All requests made with that key now use the bound profile

This allows a single user to have multiple API keys, each bound to a different profile, and switch workflows by changing which key is exported.

5-Epoch Benchmark Results¶

A controlled benchmark across 5 consecutive runs measures the compounding effect of the MoE knowledge pipeline:

Epoch	Avg Score	Avg Latency	Latency vs Epoch 1
1	5.2 / 10	280s	baseline
2	6.4 / 10	125s	0.45x
3	7.1 / 10	72s	0.26x
4	7.8 / 10	45s	0.16x
5	8.1 / 10	30s	0.11x

By Epoch 5, the system delivers 9.3x faster responses than Epoch 1 while improving answer quality by 56%. Three mechanisms drive this:

GraphRAG context enrichment: prior synthesis results stored as SYNTHESIS_INSIGHT relations are injected into future expert prompts
L2 plan cache: identical task decompositions hit the Valkey SHA-256 plan cache, skipping the planner LLM entirely
Model warmth: sticky sessions keep frequently used models loaded in VRAM, eliminating cold-start overhead
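The headline figures follow directly from the benchmark table. The short calculation below reproduces the latency ratios, the 9.3x speedup, and the 56% quality gain from the tabulated values alone.

```python
# (epoch, average score out of 10, average latency in seconds) from the table
EPOCHS = [(1, 5.2, 280), (2, 6.4, 125), (3, 7.1, 72), (4, 7.8, 45), (5, 8.1, 30)]

base_score, base_latency = EPOCHS[0][1], EPOCHS[0][2]
last_score, last_latency = EPOCHS[-1][1], EPOCHS[-1][2]

# Latency of each epoch relative to Epoch 1 (the "Latency vs Epoch 1" column)
ratios = [round(latency / base_latency, 2) for _, _, latency in EPOCHS]

speedup = base_latency / last_latency                # 280 / 30
quality_gain_pct = (last_score / base_score - 1) * 100  # 8.1 vs 5.2

print(ratios)
print(f"{speedup:.1f}x faster, +{quality_gain_pct:.0f}% quality")
```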

4. Interactive Chat (Open WebUI, AnythingLLM)¶

Expert templates are not limited to coding agents. They work equally well with interactive chat interfaces like Open WebUI and AnythingLLM.

How It Works¶

Every expert template configured in MoE Sovereign appears as a "model" in the /v1/models endpoint response. Chat interfaces that support OpenAI-compatible backends can select these models directly.

# List all available models (templates)
curl https://your-moe-instance.example.com/v1/models \
  -H "Authorization: Bearer moe-sk-xxxxxxxx..." | jq '.data[].id'
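The same listing can be done from a script. The sketch below builds the request with the standard library and extracts template IDs from a response body; it assumes the OpenAI-style `{"data": [{"id": ...}]}` shape that the `jq` filter above relies on, and the sample body is illustrative rather than captured output.

```python
import json
from urllib.request import Request


def build_models_request(base_url: str, api_key: str) -> Request:
    """Prepare (but do not send) a GET request for /v1/models."""
    return Request(
        base_url.rstrip("/") + "/v1/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )


def template_ids(body: str) -> list:
    """Return the `id` of every entry in an OpenAI-style model listing."""
    return [entry["id"] for entry in json.loads(body)["data"]]


request = build_models_request("https://your-moe-instance.example.com/", "moe-sk-test")
sample_body = '{"data": [{"id": "8b-fast"}, {"id": "30b-balanced"}]}'  # illustrative
print(request.full_url)
print(template_ids(sample_body))
```

To actually send the request, pass it to `urllib.request.urlopen` and feed the decoded response body to `template_ids`.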

Recommended Templates by Use Case¶

Use Case	Recommended Template	Why
Quick Q&A	`8b-fast` or native template	Minimal latency, good for simple questions
Code assistance	`30b-balanced`	Strong code models, reasonable latency
Deep research	`70b-deep` or `cc-expert-deep`	Full pipeline with largest models
Creative writing	Template with `creative_writer` expert	Stylistically tuned models
Data analysis	Template with `data_analyst` + MCP tools	MCP provides exact calculations

The Accumulation Effect in Chat Sessions¶

In interactive sessions using orchestrated templates, every exchange enriches the GraphRAG knowledge graph. This means:

First question in a new domain: full pipeline, highest latency
Follow-up questions in the same domain: GraphRAG injects prior context, reducing latency and improving relevance
Returning to a topic days later: accumulated knowledge is still available, providing continuity across sessions

This accumulation effect is most valuable for ongoing projects, research topics, or domains you query repeatedly.

Session Management Tips¶

Start broad, then narrow. The first question in a domain seeds the knowledge graph. Subsequent specific questions benefit from that context.
Use the same template for related questions within a project to ensure GraphRAG continuity.
Switch to native mode for quick follow-ups that do not need the full pipeline (e.g., "format that as a table").
Check the thinking panel in Open WebUI to see exactly which experts were consulted, which MCP tools ran, and what GraphRAG context was injected.

5. Levers and Trade-offs¶

Every configuration choice in MoE Sovereign is a trade-off. This section explains what each lever does and how to make informed decisions.

Cache Toggle (`enable_cache`)¶

Controls the L1 (ChromaDB) and L2 (Valkey) caching layers.

State	Behavior
Enabled (default)	L1 hit = instant response (no pipeline). L2 plan cache hit = skip planner, run experts only.
Disabled	Every request runs the full pipeline from scratch.

Impact: L1 cache hits reduce latency to near-zero for repeated or semantically similar queries. Disabling cache is useful for testing, debugging, or when you need guaranteed fresh responses.
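The L2 plan cache is described in this guide as a SHA-256 keyed cache in Valkey. A minimal sketch of that idea follows, with a plain dict standing in for Valkey. What exactly goes into the hash is not documented here, so the key material (template ID plus whitespace- and case-normalized request text) and the `plan:` prefix are assumptions.

```python
import hashlib
import json


def plan_cache_key(template_id: str, request_text: str) -> str:
    """Derive a deterministic cache key from the template and the request."""
    normalized = " ".join(request_text.lower().split())
    material = json.dumps(
        {"template": template_id, "request": normalized}, sort_keys=True
    )
    return "plan:" + hashlib.sha256(material.encode("utf-8")).hexdigest()


plan_cache = {}  # stand-in for Valkey


def get_or_plan(template_id: str, request_text: str, planner):
    """Return a cached plan if present; otherwise call the planner once."""
    key = plan_cache_key(template_id, request_text)
    if key not in plan_cache:
        plan_cache[key] = planner(request_text)
    return plan_cache[key]


calls = []
fake_planner = lambda text: calls.append(text) or {"subtasks": [text]}
get_or_plan("tmpl-demo", "Audit the login flow", fake_planner)
get_or_plan("tmpl-demo", "audit   the login flow", fake_planner)
print(len(calls))  # the second request reuses the cached plan
```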

GraphRAG Toggle (`enable_graphrag`)¶

Controls Neo4j knowledge graph enrichment.

State	Behavior
Enabled (default)	Prior synthesis results are stored and injected into future requests. Knowledge accumulates over time.
Disabled	No knowledge persistence. Each request is independent.

Impact: Enabling GraphRAG is what drives the accumulation effect (9.3x speedup over 5 epochs). Disabling it is appropriate for privacy-sensitive queries where you do not want knowledge to persist, or for isolated one-off questions.

Web Research Toggle (`enable_web_research`)¶

Controls SearXNG web search integration.

State	Behavior
Enabled (default)	The planner can flag subtasks with `"web": true`, triggering a SearXNG search. Results are injected into the merger.
Disabled	No web search. Responses rely on model training data and GraphRAG context only.

Impact: Enabling web research improves freshness for current events, recent documentation, and rapidly changing domains. Disabling it reduces latency (no search round-trip) and is required for air-gapped deployments.
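The interaction between the template toggle and the planner's per-subtask flag can be sketched as a simple filter. Only the `"web": true` flag comes from this guide; the surrounding plan shape (a `subtasks` list with `category` and `task` keys) is assumed for illustration.

```python
def subtasks_needing_web(plan: dict, enable_web_research: bool) -> list:
    """Return the subtasks that should trigger a web search.

    With the toggle off, nothing is searched even if the planner set the flag.
    """
    if not enable_web_research:
        return []
    return [task for task in plan.get("subtasks", []) if task.get("web") is True]


plan = {
    "subtasks": [
        {"category": "technical_support", "task": "Latest nginx release?", "web": True},
        {"category": "code_reviewer", "task": "Review the diff"},
    ]
}
print(subtasks_needing_web(plan, enable_web_research=True))
print(subtasks_needing_web(plan, enable_web_research=False))
```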

T1/T2 Tier Boundary (`EXPERT_TIER_BOUNDARY_B`)¶

The parameter size threshold (in billions) that separates T1 (fast) from T2 (deep) models. Default: 20B.

Boundary	T1 Models (fast)	T2 Models (deep)
20B (default)	`phi4:14b`, `hermes3:8b`, `gpt-oss:20b`	`devstral-small-2:24b`, `qwen3-coder:30b`, `deepseek-r1:32b`
14B (aggressive)	`hermes3:8b`, `phi4:14b`	Everything above 14B
30B (conservative)	Most models, including 24B and 30B	Only models above 30B (e.g. 32B)

Trade-off: Lowering the boundary means more requests escalate to T2 (higher quality, higher latency). Raising it keeps more requests on T1 (faster, but potentially lower quality for hard problems).
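The effect of moving the boundary can be checked with a small classifier. This sketch reads the parameter count from the model tag (for example `:14b`), which is an assumption about how sizes are determined; tags without a size suffix, such as `Qwen3-Coder-Next`, would need an explicit lookup. It follows the "at or below the boundary is T1" rule from the tier strategy.

```python
import re

SIZE_TAG = re.compile(r":(\d+(?:\.\d+)?)b$", re.IGNORECASE)


def params_b(model: str) -> float:
    """Parse the size in billions of parameters from a tag like 'phi4:14b'."""
    match = SIZE_TAG.search(model)
    if match is None:
        raise ValueError(f"no size suffix in model tag: {model}")
    return float(match.group(1))


def tier(model: str, boundary_b: float = 20.0) -> str:
    """T1 if the model is at or below the boundary, otherwise T2."""
    return "T1" if params_b(model) <= boundary_b else "T2"


models = ["hermes3:8b", "phi4:14b", "gpt-oss:20b", "devstral-small-2:24b",
          "qwen3-coder:30b", "deepseek-r1:32b"]
for boundary in (14, 20, 30):
    t1 = [m for m in models if tier(m, boundary) == "T1"]
    print(boundary, t1)
```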

Node Pinning vs Floating¶

Strategy	Pros	Cons
Pinned (`model@node`)	Predictable latency, no cold starts, guaranteed VRAM	No elasticity, node failure = expert unavailable
Floating (`model` only)	Elastic, auto-balances across nodes	Cold-start latency if model is not loaded, less predictable

Best practice: Pin the Planner and Judge (they run on every request and need consistent low latency). Float T2 fallback experts (they run infrequently and benefit from elasticity).

VRAM Planning per Template¶

To calculate total VRAM needed for a template, sum the VRAM requirements of all models that may be loaded simultaneously:

Total VRAM = Planner + Judge + (number of parallel experts x largest expert model)

Approximate VRAM per model size:

Model Size	VRAM (Q4 quantized)	VRAM (FP16)
7-8B	~5 GB	~16 GB
14B	~9 GB	~28 GB
20-24B	~14 GB	~48 GB
30-32B	~20 GB	~64 GB
70B	~40 GB	~140 GB

Example: A template with phi4:14b (planner, 9 GB) + phi4:14b (judge, same model on the same node as the planner, so 0 GB additional) + 3 parallel experts running gpt-oss:20b as separate instances on different nodes (3 x 14 GB = 42 GB) needs approximately 51 GB of VRAM in total across all nodes.

Shared models reduce VRAM

If the planner and judge use the same model on the same node, the model is loaded once. Similarly, if multiple experts use the same model on the same node, VRAM is shared. Design templates to reuse models where possible.
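The sizing formula is easy to wrap in a helper for comparing template designs. The sketch below implements the formula as stated, with one added flag for the planner/judge sharing case; the function name and flag are hypothetical, and the Q4 figures are the approximate values from the table.

```python
def template_vram_gb(planner_gb: float, judge_gb: float, parallel_experts: int,
                     largest_expert_gb: float,
                     judge_shares_planner: bool = False) -> float:
    """Total VRAM = Planner + Judge + (parallel experts x largest expert).

    If the judge reuses the planner's model on the same node, it adds nothing.
    """
    judge = 0 if judge_shares_planner else judge_gb
    return planner_gb + judge + parallel_experts * largest_expert_gb


# phi4:14b planner and judge (shared, ~9 GB at Q4), three gpt-oss:20b experts (~14 GB each)
print(template_vram_gb(9, 9, 3, 14, judge_shares_planner=True))  # 51
print(template_vram_gb(9, 9, 3, 14))                             # 60 without sharing
```

Because the formula multiplies by the largest expert, it is an upper bound: templates whose experts share a model on one node will need less.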

Decision Matrix¶

Priority	Recommended Settings
Minimum latency	Native mode, cache enabled, no GraphRAG, no web research
Maximum quality	Orchestrated mode, all toggles enabled, tier boundary (`EXPERT_TIER_BOUNDARY_B`) at 14B
Privacy / air-gap	Any mode, GraphRAG disabled, web research disabled
Cost efficiency	Native for simple tasks, orchestrated only for complex. Cache enabled.
Knowledge building	Orchestrated mode, GraphRAG enabled, cache enabled
Predictable performance	All models pinned, cache enabled
Maximum elasticity	All models floating, cache enabled
