Claude Code Profiles¶
Claude Code profiles control how the MoE Sovereign orchestrator handles requests from Claude Code CLI, VS Code extension, and other Anthropic API clients. Each profile maps to a different processing mode with distinct trade-offs.
Three Reference Profiles¶
Native (Direct LLM)¶
The request is forwarded directly to a single LLM (e.g., gemma4:31b)
without any MoE pipeline involvement. The model handles tool calls natively.
- Latency: 5-30 seconds
- Use case: Quick edits, simple bug fixes, interactive coding
- Trade-off: No multi-expert synthesis, no GraphRAG, no knowledge accumulation
Reasoning (Thinking Node)¶
The request passes through the MoE pipeline with the Thinking Node enabled.
The LLM performs chain-of-thought reasoning with <think> blocks before generating
the response.
- Latency: 30-120 seconds
- Use case: Architecture decisions, complex debugging, code review
- Trade-off: Deeper analysis but slower; no parallel expert routing
Orchestrated (Full Pipeline)¶
The full MoE pipeline: Planner decomposes the task, parallel expert LLMs process sub-tasks, Merger synthesizes results, Judge evaluates quality, and GraphRAG accumulates knowledge for future requests.
- Latency: 2-10 minutes
- Use case: Deep research, multi-domain synthesis, knowledge-enriched analysis
- Trade-off: Highest quality but impractical for interactive coding
Choosing a Profile¶
| Scenario | Recommended Profile |
|---|---|
| Fix a typo or syntax error | Native |
| Add a simple feature | Native |
| Debug a complex race condition | Reasoning |
| Architecture review | Reasoning |
| Security audit of a codebase | Orchestrated |
| Research + implementation plan | Orchestrated |
| Multi-file refactoring with tests | Reasoning |
Configuration¶
Admin UI¶
Navigate to CC Profile in the admin navigation. Each profile has:
| Field | Description |
|---|---|
name |
Display name shown in clients |
moe_mode |
native, moe_reasoning, or moe_orchestrated |
tool_model |
LLM for tool execution (e.g., gemma4:31b) |
tool_endpoint |
Inference server node (e.g., N04-RTX) |
expert_template_id |
Expert template for orchestrated mode (optional) |
tool_max_tokens |
Max output tokens for tool calls |
reasoning_max_tokens |
Max tokens for thinking blocks |
tool_choice |
auto, required, or any |
User Portal¶
Users can create personal profiles under My Templates > CC Profiles. Personal profiles override admin-assigned profiles.
API Key Binding¶
Each API key can be bound to a specific CC profile:
- Admin UI > Users > select user > API Keys
- Set the CC Profile dropdown for the key
- All requests with that key will use the bound profile
Client Configuration¶
Point Claude Code to your MoE Sovereign instance:
# Claude Code CLI
export ANTHROPIC_BASE_URL=https://your-moe-instance.example.com
export ANTHROPIC_API_KEY=moe-sk-xxxxxxxx...
# VS Code settings.json
{
"claude-code.apiEndpoint": "https://your-moe-instance.example.com",
"claude-code.apiKey": "moe-sk-xxxxxxxx..."
}
Innovator Profiles¶
The Innovator profile family (cc-innovator-*) is designed for Claude Code
power users who want the full MoE pipeline with different quality/speed trade-offs.
All three profiles use moe_orchestrated mode with dedicated expert templates.
Profile Comparison¶
| Profile | ID | Tool Model | Target Latency | Thinking | Max Tokens |
|---|---|---|---|---|---|
| Fast | cc-innovator-fast |
gemma4:31b |
30-90s | off | 4,096 |
| Balanced | cc-innovator-balanced |
Qwen3-Coder-Next |
2-5 min | on | 8,192 / 16K reasoning |
| Deep | cc-innovator-deep |
Qwen3-Coder-Next |
5-15 min | on | 8,192 / 32K reasoning |
Key differences:
- Fast uses
tool_choice: requiredwith lightweight models andstream_think: falsefor minimal overhead. Best for rapid iteration cycles. - Balanced enables thinking blocks and escalates to domain-specialist models via T2 fallback. Good default for everyday development.
- Deep uses the largest available models with a security-analyst expert category and an assertive system prompt requiring complete, production-grade code. Best for security audits, architecture reviews, and complex refactoring.
5-Epoch Benchmark Results¶
A controlled benchmark across 5 consecutive runs measures the accumulation effect of the MoE knowledge pipeline. Each epoch re-runs the same test suite; GraphRAG accumulates knowledge from prior runs, improving accuracy and reducing latency.
| Epoch | Avg Score | Avg Latency | Latency vs Epoch 1 |
|---|---|---|---|
| 1 | 5.2 / 10 | 280s | baseline |
| 2 | 6.4 / 10 | 125s | 0.45x |
| 3 | 7.1 / 10 | 72s | 0.26x |
| 4 | 7.8 / 10 | 45s | 0.16x |
| 5 | 8.1 / 10 | 30s | 0.11x |
Accumulation effect: By Epoch 5, the system delivers 9.3x faster responses than Epoch 1 while simultaneously improving answer quality by 56%. This is driven by three mechanisms:
- GraphRAG context enrichment — prior synthesis results are stored as
SYNTHESIS_INSIGHTrelations and injected into future expert prompts - L2 plan cache — identical task decompositions hit the Valkey SHA-256 plan cache, skipping the planner LLM entirely
- Model warmth — sticky sessions and the model registry keep frequently used models loaded in VRAM, eliminating cold-start overhead
The accumulation effect is most pronounced in Epochs 1-3 (steep improvement) and plateaus around Epoch 4-5 as the knowledge graph saturates for the test domain.
Download Reference Profiles¶
Pre-configured profile JSONs are available for download:
cc-ref-native.json— Direct LLMcc-ref-reasoning.json— Thinking Nodecc-ref-orchestrated.json— Full Pipeline
Replace <YOUR_OLLAMA_HOST> and <YOUR_TEMPLATE_ID> with your actual values.
API Compatibility¶
The /v1/messages endpoint is fully compatible with the Anthropic Messages API.
Claude Code CLI, Anthropic Python SDK, and VS Code extensions work without
modification — just point ANTHROPIC_BASE_URL to your MoE Sovereign instance.
Supported features:
- Streaming responses (SSE)
- Tool use / function calling
- Multi-turn conversations
- System prompts
- Thinking blocks (reasoning mode)
- Image inputs (forwarded to vision-capable models)