Claude Code Profiles¶

Claude Code profiles control how the MoE Sovereign orchestrator handles requests from Claude Code CLI, VS Code extension, and other Anthropic API clients. Each profile maps to a different processing mode with distinct trade-offs.

Three Reference Profiles¶

Native (Direct LLM)¶

Profile: cc-ref-native
Mode:    native

The request is forwarded directly to a single LLM (e.g., gemma4:31b) without any MoE pipeline involvement. The model handles tool calls natively.

Latency: 5-30 seconds
Use case: Quick edits, simple bug fixes, interactive coding
Trade-off: No multi-expert synthesis, no GraphRAG, no knowledge accumulation

Reasoning (Thinking Node)¶

Profile: cc-ref-reasoning
Mode:    moe_reasoning

The request passes through the MoE pipeline with the Thinking Node enabled. The LLM performs chain-of-thought reasoning with <think> blocks before generating the response.

Latency: 30-120 seconds
Use case: Architecture decisions, complex debugging, code review
Trade-off: Deeper analysis but slower; no parallel expert routing

Orchestrated (Full Pipeline)¶

Profile: cc-ref-orchestrated
Mode:    moe_orchestrated

The full MoE pipeline: Planner decomposes the task, parallel expert LLMs process sub-tasks, Merger synthesizes results, Judge evaluates quality, and GraphRAG accumulates knowledge for future requests.

Latency: 2-10 minutes
Use case: Deep research, multi-domain synthesis, knowledge-enriched analysis
Trade-off: Highest quality but impractical for interactive coding

Choosing a Profile¶

Scenario	Recommended Profile
Fix a typo or syntax error	Native
Add a simple feature	Native
Debug a complex race condition	Reasoning
Architecture review	Reasoning
Security audit of a codebase	Orchestrated
Research + implementation plan	Orchestrated
Multi-file refactoring with tests	Reasoning

Configuration¶

Admin UI¶

Navigate to CC Profile in the admin navigation. Each profile has:

Field	Description
`name`	Display name shown in clients
`moe_mode`	`native`, `moe_reasoning`, or `moe_orchestrated`
`tool_model`	LLM for tool execution (e.g., `gemma4:31b`)
`tool_endpoint`	Inference server node (e.g., `N04-RTX`)
`expert_template_id`	Expert template for orchestrated mode (optional)
`tool_max_tokens`	Max output tokens for tool calls
`reasoning_max_tokens`	Max tokens for thinking blocks
`tool_choice`	`auto`, `required`, or `any`

User Portal¶

Users can create personal profiles under My Templates > CC Profiles. Personal profiles override admin-assigned profiles.

API Key Binding¶

Each API key can be bound to a specific CC profile:

Admin UI > Users > select user > API Keys
Set the CC Profile dropdown for the key
All requests with that key will use the bound profile

Client Configuration¶

Point Claude Code to your MoE Sovereign instance:

# Claude Code CLI
export ANTHROPIC_BASE_URL=https://your-moe-instance.example.com
export ANTHROPIC_API_KEY=moe-sk-xxxxxxxx...

# VS Code settings.json
{
  "claude-code.apiEndpoint": "https://your-moe-instance.example.com",
  "claude-code.apiKey": "moe-sk-xxxxxxxx..."
}

Innovator Profiles¶

The Innovator profile family (cc-innovator-*) is designed for Claude Code power users who want the full MoE pipeline with different quality/speed trade-offs. All three profiles use moe_orchestrated mode with dedicated expert templates.

Profile Comparison¶

Profile	ID	Tool Model	Target Latency	Thinking	Max Tokens
Fast	`cc-innovator-fast`	`gemma4:31b`	30-90s	off	4,096
Balanced	`cc-innovator-balanced`	`Qwen3-Coder-Next`	2-5 min	on	8,192 / 16K reasoning
Deep	`cc-innovator-deep`	`Qwen3-Coder-Next`	5-15 min	on	8,192 / 32K reasoning

Key differences:

Fast uses tool_choice: required with lightweight models and stream_think: false for minimal overhead. Best for rapid iteration cycles.
Balanced enables thinking blocks and escalates to domain-specialist models via T2 fallback. Good default for everyday development.
Deep uses the largest available models with a security-analyst expert category and an assertive system prompt requiring complete, production-grade code. Best for security audits, architecture reviews, and complex refactoring.

5-Epoch Benchmark Results¶

A controlled benchmark across 5 consecutive runs measures the accumulation effect of the MoE knowledge pipeline. Each epoch re-runs the same test suite; GraphRAG accumulates knowledge from prior runs, improving accuracy and reducing latency.

Epoch	Avg Score	Avg Latency	Latency vs Epoch 1
1	5.2 / 10	280s	baseline
2	6.4 / 10	125s	0.45x
3	7.1 / 10	72s	0.26x
4	7.8 / 10	45s	0.16x
5	8.1 / 10	30s	0.11x

Accumulation effect: By Epoch 5, the system delivers 9.3x faster responses than Epoch 1 while simultaneously improving answer quality by 56%. This is driven by three mechanisms:

GraphRAG context enrichment — prior synthesis results are stored as SYNTHESIS_INSIGHT relations and injected into future expert prompts
L2 plan cache — identical task decompositions hit the Valkey SHA-256 plan cache, skipping the planner LLM entirely
Model warmth — sticky sessions and the model registry keep frequently used models loaded in VRAM, eliminating cold-start overhead

The accumulation effect is most pronounced in Epochs 1-3 (steep improvement) and plateaus around Epoch 4-5 as the knowledge graph saturates for the test domain.

Download Reference Profiles¶

Pre-configured profile JSONs are available for download:

cc-ref-native.json — Direct LLM
cc-ref-reasoning.json — Thinking Node
cc-ref-orchestrated.json — Full Pipeline

Replace <YOUR_OLLAMA_HOST> and <YOUR_TEMPLATE_ID> with your actual values.

API Compatibility¶

The /v1/messages endpoint is fully compatible with the Anthropic Messages API. Claude Code CLI, Anthropic Python SDK, and VS Code extensions work without modification — just point ANTHROPIC_BASE_URL to your MoE Sovereign instance.

Supported features:

Streaming responses (SSE)
Tool use / function calling
Multi-turn conversations
System prompts
Thinking blocks (reasoning mode)
Image inputs (forwarded to vision-capable models)