Skip to content

Claude Code Profiles

Claude Code profiles control how the MoE Sovereign orchestrator handles requests from Claude Code CLI, VS Code extension, and other Anthropic API clients. Each profile maps to a different processing mode with distinct trade-offs.

Three Reference Profiles

Native (Direct LLM)

Profile: cc-ref-native
Mode:    native

The request is forwarded directly to a single LLM (e.g., gemma4:31b) without any MoE pipeline involvement. The model handles tool calls natively.

  • Latency: 5-30 seconds
  • Use case: Quick edits, simple bug fixes, interactive coding
  • Trade-off: No multi-expert synthesis, no GraphRAG, no knowledge accumulation

Reasoning (Thinking Node)

Profile: cc-ref-reasoning
Mode:    moe_reasoning

The request passes through the MoE pipeline with the Thinking Node enabled. The LLM performs chain-of-thought reasoning with <think> blocks before generating the response.

  • Latency: 30-120 seconds
  • Use case: Architecture decisions, complex debugging, code review
  • Trade-off: Deeper analysis but slower; no parallel expert routing

Orchestrated (Full Pipeline)

Profile: cc-ref-orchestrated
Mode:    moe_orchestrated

The full MoE pipeline: Planner decomposes the task, parallel expert LLMs process sub-tasks, Merger synthesizes results, Judge evaluates quality, and GraphRAG accumulates knowledge for future requests.

  • Latency: 2-10 minutes
  • Use case: Deep research, multi-domain synthesis, knowledge-enriched analysis
  • Trade-off: Highest quality but impractical for interactive coding

Choosing a Profile

Scenario Recommended Profile
Fix a typo or syntax error Native
Add a simple feature Native
Debug a complex race condition Reasoning
Architecture review Reasoning
Security audit of a codebase Orchestrated
Research + implementation plan Orchestrated
Multi-file refactoring with tests Reasoning

Configuration

Admin UI

Navigate to CC Profile in the admin navigation. Each profile has:

Field Description
name Display name shown in clients
moe_mode native, moe_reasoning, or moe_orchestrated
tool_model LLM for tool execution (e.g., gemma4:31b)
tool_endpoint Inference server node (e.g., N04-RTX)
expert_template_id Expert template for orchestrated mode (optional)
tool_max_tokens Max output tokens for tool calls
reasoning_max_tokens Max tokens for thinking blocks
tool_choice auto, required, or any

User Portal

Users can create personal profiles under My Templates > CC Profiles. Personal profiles override admin-assigned profiles.

API Key Binding

Each API key can be bound to a specific CC profile:

  1. Admin UI > Users > select user > API Keys
  2. Set the CC Profile dropdown for the key
  3. All requests with that key will use the bound profile

Client Configuration

Point Claude Code to your MoE Sovereign instance:

# Claude Code CLI
export ANTHROPIC_BASE_URL=https://your-moe-instance.example.com
export ANTHROPIC_API_KEY=moe-sk-xxxxxxxx...

# VS Code settings.json
{
  "claude-code.apiEndpoint": "https://your-moe-instance.example.com",
  "claude-code.apiKey": "moe-sk-xxxxxxxx..."
}

Innovator Profiles

The Innovator profile family (cc-innovator-*) is designed for Claude Code power users who want the full MoE pipeline with different quality/speed trade-offs. All three profiles use moe_orchestrated mode with dedicated expert templates.

Profile Comparison

Profile ID Tool Model Target Latency Thinking Max Tokens
Fast cc-innovator-fast gemma4:31b 30-90s off 4,096
Balanced cc-innovator-balanced Qwen3-Coder-Next 2-5 min on 8,192 / 16K reasoning
Deep cc-innovator-deep Qwen3-Coder-Next 5-15 min on 8,192 / 32K reasoning

Key differences:

  • Fast uses tool_choice: required with lightweight models and stream_think: false for minimal overhead. Best for rapid iteration cycles.
  • Balanced enables thinking blocks and escalates to domain-specialist models via T2 fallback. Good default for everyday development.
  • Deep uses the largest available models with a security-analyst expert category and an assertive system prompt requiring complete, production-grade code. Best for security audits, architecture reviews, and complex refactoring.

5-Epoch Benchmark Results

A controlled benchmark across 5 consecutive runs measures the accumulation effect of the MoE knowledge pipeline. Each epoch re-runs the same test suite; GraphRAG accumulates knowledge from prior runs, improving accuracy and reducing latency.

Epoch Avg Score Avg Latency Latency vs Epoch 1
1 5.2 / 10 280s baseline
2 6.4 / 10 125s 0.45x
3 7.1 / 10 72s 0.26x
4 7.8 / 10 45s 0.16x
5 8.1 / 10 30s 0.11x

Accumulation effect: By Epoch 5, the system delivers 9.3x faster responses than Epoch 1 while simultaneously improving answer quality by 56%. This is driven by three mechanisms:

  1. GraphRAG context enrichment — prior synthesis results are stored as SYNTHESIS_INSIGHT relations and injected into future expert prompts
  2. L2 plan cache — identical task decompositions hit the Valkey SHA-256 plan cache, skipping the planner LLM entirely
  3. Model warmth — sticky sessions and the model registry keep frequently used models loaded in VRAM, eliminating cold-start overhead

The accumulation effect is most pronounced in Epochs 1-3 (steep improvement) and plateaus around Epoch 4-5 as the knowledge graph saturates for the test domain.


Download Reference Profiles

Pre-configured profile JSONs are available for download:

Replace <YOUR_OLLAMA_HOST> and <YOUR_TEMPLATE_ID> with your actual values.

API Compatibility

The /v1/messages endpoint is fully compatible with the Anthropic Messages API. Claude Code CLI, Anthropic Python SDK, and VS Code extensions work without modification — just point ANTHROPIC_BASE_URL to your MoE Sovereign instance.

Supported features:

  • Streaming responses (SSE)
  • Tool use / function calling
  • Multi-turn conversations
  • System prompts
  • Thinking blocks (reasoning mode)
  • Image inputs (forwarded to vision-capable models)