Sovereign MoE — System Documentation¶
Last updated: 2026-03-29 — Version 2.1.0
Project directory:/opt/deployment/moe-infra
1. Overview¶
Sovereign MoE is a fully self-hosted Multi-Model LLM-System. Incoming requests are analyzed by an orchestrator and distributed to specialized LLM experts, external search tools, and precision calculation tools. The results are synthesized by a Judge LLM.
The system is OpenAI API-compatible (including correctly terminating streaming) and can be embedded as a drop-in replacement in clients like Open WebUI.
Core Principles¶
- No cloud dependency — all LLMs run locally via Ollama on own GPU hardware
- Specialization over generalization — the best available models are used per category
- Exactness over LLM estimation — calculations, data, hashes are deterministic via MCP server
- Learning through use — feedback flows back into expert performance scores and the knowledge graph
- Persistent background processing — Kafka decouples the HTTP path from ingest and logging
2. Hardware & Infrastructure¶
The system supports multiple Ollama-based inference servers. Admins configure inference endpoints via Admin UI → Servers. Each server can host multiple models for different expert categories.
Inference server URLs and model assignments are configured via INFERENCE_SERVERS in .env or through the Admin UI dashboard.
VRAM Management¶
VRAM management is handled dynamically via the CLUSTER_HARDWARE JSON object in .env. Each GPU class receives a configurable priority. The orchestrator routes requests to inference servers based on availability and model assignments. Ollama queues requests internally per node. There are no static semaphores.
For details on cluster configuration: Ollama Cluster and LiteLLM Gateway.
3. System Architecture¶
flowchart TD
CLIENT["☁ Client\n(Open WebUI, curl, SDK, Claude Code)"]
NGINX["Nginx (host-native)\nTLS/Let's Encrypt\n→ :8002"]
subgraph ORCH["LangGraph Orchestrator · Port 8002"]
CACHE["🔍 cache_lookup\nChromaDB semantic search"]
SROUTER["⚡ semantic_router\nPrototype matching"]
PLANNER["🧠 planner\n(Judge-LLM)\nValkey plan cache TTL 30m"]
WORKERS["👥 workers\nT1 + T2 expert models\nGPU semaphore managed"]
RESEARCH["🌐 research\n(SearXNG)"]
MATH["∑ math\n(SymPy)"]
MCP_NODE["🔧 mcp\n(26 Precision Tools)"]
GRAPHRAG["🗃 graph_rag\n(Neo4j 2-hop)\nValkey cache TTL 1h"]
THINKING["💭 thinking\n(CoT, conditional)"]
MERGER["⚖ merger\n(Judge-LLM)\nFast-Path for single expert"]
CRITIC["🔎 critic\n(medical/legal only)"]
end
CLIENT -->|"HTTPS"| NGINX
NGINX -->|"POST /v1/chat/completions"| CACHE
CACHE -->|"Hit (< 0.15)"| CLIENT
CACHE -->|"Miss"| SROUTER
SROUTER -->|"Direct match"| WORKERS
SROUTER -->|"No match"| PLANNER
PLANNER --> WORKERS & RESEARCH & MATH & MCP_NODE & GRAPHRAG
WORKERS & RESEARCH & MATH & MCP_NODE & GRAPHRAG --> THINKING
THINKING --> MERGER
MERGER -->|"Fast-Path"| CRITIC
MERGER -->|"Full synthesis"| CRITIC
CRITIC --> CLIENT
MERGER --> CHROMADB[("ChromaDB :8001\nSemantic cache + prototypes")]
MERGER --> REDIS[("Valkey :6379\nPlan/Graph/Perf cache\nActive request registry")]
MERGER --> PG[("Postgres :5432\nterra_checkpoints\nLangGraph state")]
MERGER -->|"moe.ingest + moe.requests"| KAFKA[("Kafka :9092\nKRaft mode")]
KAFKA -->|"Consumer → extract_and_ingest"| NEO4J[("Neo4j :7687\nKnowledge graph")]
MCP_NODE --> MCPSERVER["MCP Precision Tools\n:8003"]
4. Docker Services¶
sudo docker compose up -d # start all services
sudo docker compose down # stop
sudo docker compose logs -f <name> # live logs
The full stack consists of 18 Docker services plus the host-level Nginx (not containerized). See Docker Services Reference for details on each service, and Webserver & Reverse Proxy for the external access architecture.
Service Overview¶
| Container | Image / Build | Ports (Host→Container) | Function |
|---|---|---|---|
langgraph-orchestrator |
./Dockerfile |
8002:8000 |
Core orchestrator, FastAPI, LangGraph |
mcp-precision |
./mcp_server/Dockerfile |
8003:8003 |
26 deterministic precision tools |
neo4j-knowledge |
neo4j (pinned SHA) |
7474:7474, 7687:7687 |
Knowledge graph (GraphRAG + ontology) |
terra_cache |
valkey/valkey (pinned SHA) |
6379:6379 |
Valkey — cache, scoring, session metadata |
terra_checkpoints |
postgres:17-alpine |
(internal only) | Postgres — LangGraph AsyncPostgresSaver |
chromadb-vector |
chromadb/chroma (pinned SHA) |
8001:8000 |
Vector database for semantic caching |
moe-kafka |
confluentinc/cp-kafka:7.7.0 |
9092:9092 |
Kafka event streaming (KRaft, no Zookeeper) |
docker-socket-proxy |
tecnavia/docker-socket-proxy:0.3.0 |
2375:2375 |
Read-only Docker API proxy for moe-admin |
moe-admin |
./admin_ui/Dockerfile |
8088:8088 |
Admin UI: config, users, monitoring |
moe-prometheus |
prom/prometheus (pinned SHA) |
9090:9090 |
Metrics scraper (90-day TSDB) |
moe-grafana |
grafana/grafana (pinned SHA) |
3001:3000 |
Dashboards & visualization |
node-exporter |
prom/node-exporter (pinned SHA) |
9100:9100 |
Host metrics (CPU, RAM, disk, net) |
cadvisor |
gcr.io/cadvisor (pinned SHA) |
9338:8080 |
Container resource metrics |
moe-docs |
squidfunk/mkdocs-material (pinned SHA) |
8098:8000 |
MkDocs documentation server |
moe-docs-sync |
custom Python | — | Periodic docs sync agent (every 15 min) |
moe-dozzle-init |
python:3.12-alpine |
— | One-shot init: Dozzle bcrypt users.yml |
moe-dozzle |
amir20/dozzle (pinned SHA) |
9999:8080 |
Container log viewer with auth |
moe-caddy |
caddy (pinned SHA) |
80:80, 443:443 |
Internal reverse proxy for docs/dozzle |
Host Volumes¶
| Path | Contents |
|---|---|
/opt/moe-infra/neo4j-data |
Neo4j data |
/opt/moe-infra/redis-data |
Valkey persistence (historical path name) |
/opt/moe-infra/langgraph-checkpoints |
Postgres data for terra_checkpoints |
/opt/moe-infra/chroma-data |
ChromaDB vector data |
/opt/moe-infra/kafka-data |
Kafka log segments |
/opt/moe-infra/agent-logs |
Orchestrator logs |
5. LangGraph Pipeline¶
Pipeline Flow¶
flowchart TD
START([START]) --> CACHE_LOOKUP
CACHE_LOOKUP["cache_lookup_node"] -->|"cache_hit=True"| MERGER_NODE
CACHE_LOOKUP -->|"cache_hit=False"| PLANNER_NODE["planner_node"]
PLANNER_NODE --> EW["expert_worker"]
PLANNER_NODE --> RN["research_node"]
PLANNER_NODE --> MN["math_node_wrapper"]
PLANNER_NODE --> MCPN["mcp_node"]
PLANNER_NODE --> GRN["graph_rag_node"]
EW --> RFN["research_fallback_node"]
RN --> RFN
MN --> RFN
MCPN --> RFN
GRN --> RFN
RFN --> THINKING_NODE["thinking_node\n(CoT, conditionally active)"]
THINKING_NODE --> MERGER_NODE["merger_node"]
MERGER_NODE --> CRITIC_NODE["critic_node\n(fact-check, only medical/legal)"]
CRITIC_NODE --> END([END])
Nodes in Detail¶
cache_lookup_node¶
- Queries ChromaDB for 3 semantically most similar entries
- Skips entries with
flagged=True(negative feedback) - Distance <
CACHE_HIT_THRESHOLD(0.15) →cache_hit=True→ entire pipeline bypassed
planner_node¶
- Judge LLM decomposes request into 1–4 subtasks as JSON array
_sanitize_plan()validates each entry: must be dict withtask+category; strings, empty dicts, and unknown categories are discarded (logged)- Fallback to
[{"task": input, "category": "general"}]on JSON parse error - Knows all expert categories, 16 MCP tools, and research option
Planner Rules (enforced in prompt):
- precision_tools always beats math — never both for the same task
- research only for genuinely external/current information needs
- Never keywords or questions as tasks — always task descriptions
expert_worker¶
- Filters tasks to LLM expert categories (all except
precision_tools,research) - Two-Tier logic: T1 (≤20B) first — if
CONFIDENCE: high→ T2 skipped; otherwise T2 escalates - Sorts experts by performance score (best first); score < 0.3 → skipped
- Injects
chat_history(last 4 turns, max 3000 chars) into all expert messages - Output cap:
MAX_EXPERT_OUTPUT_CHARS(2400 chars ≈ 600 tokens) - Returns
expert_models_used(["model::category", ...])
research_node¶
- Active only for
researchtask in plan - Uses
search_queryfrom plan (optimized by planner) - Returns structured source citations via
search.results()(title + URL numbered)
research_fallback_node¶
- Runs after fan-in of all parallel nodes
- For each
CONFIDENCE: lowexpert result: targeted web research with citation tracking - Aggregates new results in
web_research
thinking_node¶
- Conditionally active: for plans with >1 task OR if at least one expert reports
CONFIDENCE: low - Magistral:24b runs 4-step Chain-of-Thought:
- Problem decomposition
- Source evaluation (which info is reliable, contradictions?)
- Knowledge gaps
- Conclusion
- Output as
reasoning_trace→ prioritized section in merger prompt - Progress appears in Open WebUI
<think>panel
math_node_wrapper¶
- Active only if
mathtask in plan AND noprecision_toolstask present - SymPy-based (symbolic)
mcp_node¶
- All
precision_toolstasks in parallel viaasyncio.gather - HTTP POST to MCP server
/invoke
graph_rag_node¶
- Parallel to all other nodes after planner
- Term extraction from request (regex, no LLM call)
- 2-hop traversal in Neo4j
merger_node¶
- Calls
_dedup_by_category()— keeps only the highest-confidence expert result per category - Source priority: Reasoning trace > MCP > Knowledge graph > Experts > Web > Cache
- On cache hit: direct return, no LLM call
- After response (> 150 chars), publishes:
- ChromaDB: new cache document (sync)
- Valkey: response metadata (async task)
- Kafka
moe.requests: audit log (async task) - Kafka
moe.ingest: GraphRAG ingest job (async task)
critic_node¶
- Only active if plan contains
medical_consultorlegal_advisor - Second Judge LLM pass checks merger response for factual errors and dangerous statements
- Response
CONFIRMED→ response unchanged; error found → corrected version
State Schema (AgentState)¶
| Field | Type | Description |
|---|---|---|
input |
str | User query |
response_id |
str | chatcmpl-<uuid> for feedback tracking |
mode |
str | default | code | concise |
plan |
List[Dict] | Subtask plan |
expert_results |
Annotated[list, add] | Expert responses (fan-in) |
expert_models_used |
Annotated[list, add] | ["model::category", ...] |
web_research |
str | SearXNG result with source citations |
cached_facts |
str | Nearest cache hit content |
cache_hit |
bool | Cache short-circuit triggered |
math_result |
str | SymPy result |
mcp_result |
str | MCP tools result |
graph_context |
str | Neo4j context block |
final_response |
str | Final synthesized response |
prompt_tokens |
Annotated[int, add] | Accumulated prompt tokens across all LLM calls |
completion_tokens |
Annotated[int, add] | Accumulated completion tokens across all LLM calls |
chat_history |
List[Dict] | Conversation context (last 4 turns) for experts |
reasoning_trace |
str | CoT output from thinking_node |
6. Expert System¶
Two-Tier System¶
Experts are divided into tiers by model size:
| Tier | Threshold | Role |
|---|---|---|
| T1 | ≤ 20B parameters | Fast initial assessment |
| T2 | > 20B parameters | Premium escalation for non-high confidence |
T1 runs first. If at least one T1 result returns CONFIDENCE: high → T2 is skipped. Otherwise all T2 models run in parallel.
Configured Experts¶
| Category | Model 1 | Tier | Model 2 | Tier |
|---|---|---|---|---|
general |
gemma3:27b |
T2 | qwen3.5:35b |
T2 |
math |
phi4:14b |
T1 | qwq:32b |
T2 |
technical_support |
deepseek-coder-v2:16b |
T1 | devstral:24b |
T2 |
creative_writer |
gemma3:27b |
T2 | qwen3.5:35b |
T2 |
code_reviewer |
devstral:24b |
T2 | qwen3-coder:30b |
T2 |
medical_consult |
phi4:14b |
T1 | gemma3:27b |
T2 |
legal_advisor |
magistral:24b |
T2 | command-r:35b |
T2 |
translation |
translategemma:27b |
T2 | qwen3.5:35b |
T2 |
reasoning |
phi4:14b |
T1 | deepseek-r1:32b |
T2 |
Judge LLM: magistral:24b (planner, merger, thinking node, critic, GraphRAG extraction)
System Prompts¶
| Category | Role & Behavior |
|---|---|
general |
Versatile, fact-based, structured |
math |
Solution steps + LaTeX, back-verification |
technical_support |
IT engineer, executable steps, concrete commands |
creative_writer |
Vivid, stylistically confident, register matching context |
code_reviewer |
Senior SWE, bugs/security/performance, concrete code |
medical_consult |
Factual, guideline-based, always recommend doctor visit |
legal_advisor |
German law (§§), recommend individual legal consultation |
translation |
Professional, idiomatic, cultural nuances |
reasoning |
Chain-of-Thought, state assumptions, justified conclusions |
7. MCP Precision Tools Server¶
Port: 8003 · File: mcp_server/server.py
REST Endpoints¶
| Endpoint | Method | Description |
|---|---|---|
/health |
GET | Status + tool list |
/tools |
GET | Tool descriptions (for planner prompt) |
/invoke |
POST | {"tool": "name", "args": {...}} |
/mcp/sse |
GET | MCP SSE endpoint (Claude Desktop etc.) |
All 16 Tools¶
| Tool | Description |
|---|---|
calculate |
Exact arithmetic, formulas, percentages (safe AST evaluator) |
solve_equation |
Algebraic equations (SymPy) |
date_diff |
Exact date difference (days, years, months) |
date_add |
Date arithmetic (add/subtract days/months/years) |
day_of_week |
Day of week, calendar week, day of year |
unit_convert |
Physical units (pint): km/h→m/s, °F→°C, ... |
statistics_calc |
mean, median, stdev, variance, min, max, sum, count, mode |
hash_text |
MD5, SHA1, SHA224, SHA256, SHA384, SHA512 |
base64_codec |
Base64 encode / decode |
regex_extract |
Regex matching with flags (i, m, s) |
subnet_calc |
CIDR: network, broadcast, mask, host range |
text_analyze |
Words, characters, sentences, paragraphs, reading time |
prime_factorize |
Prime factorization via SymPy |
gcd_lcm |
GCD and LCM |
json_query |
JSON path queries (key.sub, array[0]) |
roman_numeral |
Arabic ↔ Roman (1–3999) |
8. GraphRAG & Knowledge Ontology¶
File: graph_rag/manager.py, graph_rag/ontology.py · DB: Neo4j 5
Base Ontology (loaded on startup, idempotent)¶
- 104 entities — Medical, Legal, Technical, Math/Science
- 100 relations — IS_A, TREATS, CAUSES, INTERACTS_WITH, CONTRAINDICATES, DEFINES, USES, IMPLEMENTS, DEPENDS_ON, EXTENDS, RELATED_TO, AFFECTS, etc.
Context Query¶
- Term extraction from request (regex, prefers nouns/proper nouns)
- Fuzzy search on
name+aliases_str(case-insensitive) - 2-hop traversal: direct + indirect relations
- Output as
[Knowledge Graph]text block in merger prompt
Background Ingest (via Kafka)¶
- Merger response →
moe.ingesttopic - Kafka consumer calls
extract_and_ingest() - Judge LLM extracts up to 8 triples (JSON format)
- Conflict check: TREATS vs. CAUSES/CONTRAINDICATES (and vice versa)
- New triples are stored with
r.verified=false
Feedback Integration¶
| Rating | Neo4j |
|---|---|
| 1–2 | r.flagged=true, r.verified=false |
| 4–5 | r.verified=true, r.flagged=false |
9. Kafka Event Streaming¶
Detailed documentation: Kafka
Quick Overview¶
| Topic | Producer | Consumer | Content |
|---|---|---|---|
moe.ingest |
merger_node |
_kafka_consumer_loop |
GraphRAG ingest job |
moe.requests |
merger_node |
_kafka_consumer_loop (logging) |
Audit log |
moe.feedback |
/v1/feedback |
— | Feedback events |
Kafka setup: confluentinc/cp-kafka:7.7.0, KRaft mode (no Zookeeper), port 9092, 7-day retention.
Graceful degradation: If Kafka fails, the system continues operating fully — HTTP responses, cache, Valkey unaffected.
10. Memory & Learning Mechanisms¶
Four Memory Levels¶
| Level | Storage | Grows from | TTL |
|---|---|---|---|
| Semantic cache | ChromaDB | Every merger response > 150 chars | unlimited |
| Knowledge graph | Neo4j | Kafka moe.ingest consumer |
unlimited |
| Expert performance | Valkey | POST /v1/feedback |
unlimited |
| LangGraph checkpoints | Valkey | Every request | unlimited |
Expert Performance Scoring¶
Key: moe:perf:{model}:{category}
Fields: total, positive, negative
Score = (positive + 1) / (total + 2) # Laplace smoothing
- Under 5 ratings: 0.5 (neutral)
- Score < 0.3 after ≥ 5 ratings → expert is skipped
Feedback Effect¶
| Rating | ChromaDB | Expert Score | Neo4j |
|---|---|---|---|
| 1–2 (negative) | flagged=True |
negative++ |
flagged=true |
| 3 (neutral) | — | — | — |
| 4–5 (positive) | — | positive++ |
verified=true |
11. OpenAI API & Streaming¶
Streaming (SSE) — fully OpenAI-compatible¶
Each chunk contains id, object, created, model, choices:
1. First chunk: delta={"role":"assistant","content":""}, finish_reason=null
2. Content chunks: delta={"content":"<50 chars>"}, finish_reason=null
3. Stop chunk: delta={}, finish_reason="stop"
4. Final: data: [DONE]
The stop chunk with finish_reason: "stop" is the signal for Open WebUI (and all OpenAI-compatible clients) to close the stream and free LLMs. Without this chunk, the client would wait indefinitely.
Non-Streaming¶
{
"id": "chatcmpl-<uuid>",
"object": "chat.completion",
"created": 1774783273,
"model": "moe-orchestrator",
"choices": [{"index": 0, "message": {"role": "assistant", "content": "..."}, "finish_reason": "stop"}],
"usage": {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0}
}
12. API Reference¶
Base URL: http://<host>:8002
Chat Completions¶
POST /v1/chat/completions
{"model":"moe-orchestrator","messages":[{"role":"user","content":"..."}],"stream":false}
Models¶
Feedback¶
Graph¶
13. Configuration & Environment Variables¶
| Variable | Default | Description |
|---|---|---|
INFERENCE_SERVERS |
— | JSON array of inference server configurations |
JUDGE_ENDPOINT |
— | Server for judge LLM (configured server name) |
JUDGE_MODEL |
gemma3:4b |
Judge model |
EXPERT_MODELS |
{} |
JSON: category → [{model, endpoint}] |
SEARXNG_URL |
— | SearXNG instance |
REDIS_URL |
redis://terra_cache:6379 |
Valkey connection (redis:// is the protocol) |
POSTGRES_CHECKPOINT_URL |
postgresql://langgraph:***@terra_checkpoints:5432/langgraph |
LangGraph checkpoint store |
CHROMA_HOST |
chromadb-vector |
ChromaDB host |
MCP_URL |
http://mcp-precision:8003 |
MCP server |
NEO4J_URI |
bolt://neo4j-knowledge:7687 |
Neo4j |
NEO4J_USER |
neo4j |
Neo4j user |
NEO4J_PASS |
moe-sovereign |
Neo4j password |
KAFKA_URL |
kafka://moe-kafka:9092 |
Kafka bootstrap |
CLUSTER_HARDWARE |
JSON object | GPU nodes, priorities, model assignments |
MAX_EXPERT_OUTPUT_CHARS |
2400 |
Output cap per expert |
CACHE_HIT_THRESHOLD |
0.15 |
Cosine distance threshold |
LOG_LEVEL |
INFO |
DEBUG / INFO / WARNING / ERROR |
14. Data Persistence¶
| Data | Storage | TTL | Valkey Key / ChromaDB Collection |
|---|---|---|---|
| LangGraph checkpoints | Valkey | unlimited | langgraph:* |
| Semantic cache | ChromaDB | unlimited | moe_fact_cache |
| Expert performance | Valkey | unlimited | moe:perf:{model}:{cat} |
| Response metadata | Valkey | 7 days | moe:response:{id} |
| Knowledge graph | Neo4j | unlimited | Volume /opt/moe-infra/neo4j-data |
| Kafka events | Kafka | 7 days / 512 MB | /opt/moe-infra/kafka-data |
15. Deployment¶
First Start¶
cd /opt/deployment/moe-infra
cp temp.env .env
# configure INFERENCE_SERVERS, SEARXNG_URL, EXPERT_MODELS
sudo mkdir -p /opt/moe-infra/{neo4j-data,neo4j-logs,redis-data,langgraph-checkpoints,chroma-data,kafka-data,agent-logs}
sudo docker compose up -d --build
sudo docker compose logs -f langgraph-orchestrator
After Code Changes¶
# Orchestrator
sudo docker compose build langgraph-app && sudo docker compose up -d langgraph-app
# MCP server
sudo docker compose build mcp-precision && sudo docker compose up -d mcp-precision
Startup Sequence (Orchestrator)¶
- Valkey client
- Kafka producer (12 attempts, 5–60s backoff)
- Load MCP tool descriptions
- Neo4j GraphRAG (6 attempts, 10–60s backoff)
- Start Kafka consumer as background task
- AsyncValkeySaver → compile LangGraph graph
- FastAPI on port 8000
Health Checks¶
curl http://localhost:8002/v1/models # orchestrator
curl http://localhost:8002/graph/stats # GraphRAG
curl http://localhost:8003/health # MCP server
sudo docker exec terra_cache valkey-cli ping # Valkey
sudo docker exec moe-kafka kafka-topics \
--bootstrap-server localhost:9092 --list # Kafka topics
# Neo4j Browser: http://localhost:7474
16. Project Structure¶
graph LR
ROOT[/opt/deployment/moe-infra/]
ROOT --> MAIN[main.py<br/>LangGraph orchestrator, FastAPI, nodes, Kafka]
ROOT --> MATH[math_node.py<br/>SymPy math node]
ROOT --> REQ[requirements.txt]
ROOT --> DF[Dockerfile]
ROOT --> DC[docker-compose.yml<br/>all services]
ROOT --> ENV[.env<br/>not in git]
ROOT --> GR[graph_rag/]
ROOT --> MCPD[mcp_server/]
ROOT --> ADMIN[admin_ui/<br/>Flask admin panel]
ROOT --> DOCS[docs/<br/>MkDocs source]
GR --> GRM[manager.py<br/>Neo4j client, query_context, extract_and_ingest]
GR --> GRO[ontology.py<br/>base ontology]
MCPD --> MCPS[server.py<br/>deterministic precision tools]
MCPD --> MCPDF[Dockerfile]
17. Quality Mechanisms¶
Confidence Scoring¶
All experts produce structured output with self-assessed confidence:
high = established expert knowledge · medium = nuances possible · low = data gaps, uncertainty
Automated Quality Assurance Chain¶
flowchart TD
EO[Expert output] --> FB{CONFIDENCE: low?}
FB -->|yes| RF[research_fallback<br/>web research with citations]
FB -->|no| TN
RF --> TN
TN[thinking_node<br/>active if >1 task OR low confidence<br/>4-step CoT: decomposition → sources → gaps → conclusion]
TN --> MN
MN[merger_node<br/>_dedup_by_category: best confidence per category wins<br/>Source priority: Reasoning > MCP > Graph > Experts > Web > Cache]
MN --> CN
CN{critic_node<br/>only medical_consult / legal_advisor<br/>factual errors? dangerous statements?}
CN -->|CONFIRMED| OUT([pass])
CN -->|error| FIX([corrected version])
Open WebUI Integration¶
<think>panel: Progress reports from all nodes viacontextvars.ContextVar→ SSE stream → "Thinking" panel- Internal requests:
_is_openwebui_internal()detects title/follow-up/autocomplete → fast path without pipeline - Token tracking: Accumulated
usagefields (all LLM calls) in every response
Output Modes¶
| Model ID | Mode | Behavior |
|---|---|---|
moe-orchestrator |
default |
Full responses with explanations |
moe-orchestrator-code |
code |
Source code only, no prose |
moe-orchestrator-concise |
concise |
Max 120 words |
Generated on 2026-03-29 — Version 2.1.0