Sovereign MoE — System Documentation¶

Last updated: 2026-03-29 — Version 2.1.0
Project directory: /opt/deployment/moe-infra

1. Overview¶

Sovereign MoE is a fully self-hosted Multi-Model LLM-System. Incoming requests are analyzed by an orchestrator and distributed to specialized LLM experts, external search tools, and precision calculation tools. The results are synthesized by a Judge LLM.

The system is OpenAI API-compatible (including correctly terminating streaming) and can be embedded as a drop-in replacement in clients like Open WebUI.

Core Principles¶

No cloud dependency — all LLMs run locally via Ollama on own GPU hardware
Specialization over generalization — the best available models are used per category
Exactness over LLM estimation — calculations, data, hashes are deterministic via MCP server
Learning through use — feedback flows back into expert performance scores and the knowledge graph
Persistent background processing — Kafka decouples the HTTP path from ingest and logging

2. Hardware & Infrastructure¶

The system supports multiple Ollama-based inference servers. Admins configure inference endpoints via Admin UI → Servers. Each server can host multiple models for different expert categories.

Inference server URLs and model assignments are configured via INFERENCE_SERVERS in .env or through the Admin UI dashboard.

VRAM Management¶

VRAM management is handled dynamically via the CLUSTER_HARDWARE JSON object in .env. Each GPU class receives a configurable priority. The orchestrator routes requests to inference servers based on availability and model assignments. Ollama queues requests internally per node. There are no static semaphores.

For details on cluster configuration: Ollama Cluster and LiteLLM Gateway.

3. System Architecture¶

flowchart TD
    CLIENT["☁ Client\n(Open WebUI, curl, SDK, Claude Code)"]
    NGINX["Nginx (host-native)\nTLS/Let's Encrypt\n→ :8002"]

    subgraph ORCH["LangGraph Orchestrator · Port 8002"]
        CACHE["🔍 cache_lookup\nChromaDB semantic search"]
        SROUTER["⚡ semantic_router\nPrototype matching"]
        PLANNER["🧠 planner\n(Judge-LLM)\nValkey plan cache TTL 30m"]
        WORKERS["👥 workers\nT1 + T2 expert models\nGPU semaphore managed"]
        RESEARCH["🌐 research\n(SearXNG)"]
        MATH["∑ math\n(SymPy)"]
        MCP_NODE["🔧 mcp\n(26 Precision Tools)"]
        GRAPHRAG["🗃 graph_rag\n(Neo4j 2-hop)\nValkey cache TTL 1h"]
        THINKING["💭 thinking\n(CoT, conditional)"]
        MERGER["⚖ merger\n(Judge-LLM)\nFast-Path for single expert"]
        CRITIC["🔎 critic\n(medical/legal only)"]
    end

    CLIENT -->|"HTTPS"| NGINX
    NGINX -->|"POST /v1/chat/completions"| CACHE
    CACHE -->|"Hit (< 0.15)"| CLIENT
    CACHE -->|"Miss"| SROUTER
    SROUTER -->|"Direct match"| WORKERS
    SROUTER -->|"No match"| PLANNER
    PLANNER --> WORKERS & RESEARCH & MATH & MCP_NODE & GRAPHRAG
    WORKERS & RESEARCH & MATH & MCP_NODE & GRAPHRAG --> THINKING
    THINKING --> MERGER
    MERGER -->|"Fast-Path"| CRITIC
    MERGER -->|"Full synthesis"| CRITIC
    CRITIC --> CLIENT

    MERGER --> CHROMADB[("ChromaDB :8001\nSemantic cache + prototypes")]
    MERGER --> REDIS[("Valkey :6379\nPlan/Graph/Perf cache\nActive request registry")]
    MERGER --> PG[("Postgres :5432\nterra_checkpoints\nLangGraph state")]
    MERGER -->|"moe.ingest + moe.requests"| KAFKA[("Kafka :9092\nKRaft mode")]
    KAFKA -->|"Consumer → extract_and_ingest"| NEO4J[("Neo4j :7687\nKnowledge graph")]
    MCP_NODE --> MCPSERVER["MCP Precision Tools\n:8003"]

4. Docker Services¶

sudo docker compose up -d           # start all services
sudo docker compose down            # stop
sudo docker compose logs -f <name>  # live logs

The full stack consists of 18 Docker services plus the host-level Nginx (not containerized). See Docker Services Reference for details on each service, and Webserver & Reverse Proxy for the external access architecture.

Service Overview¶

Container	Image / Build	Ports (Host→Container)	Function
`langgraph-orchestrator`	`./Dockerfile`	`8002:8000`	Core orchestrator, FastAPI, LangGraph
`mcp-precision`	`./mcp_server/Dockerfile`	`8003:8003`	26 deterministic precision tools
`neo4j-knowledge`	`neo4j` (pinned SHA)	`7474:7474`, `7687:7687`	Knowledge graph (GraphRAG + ontology)
`terra_cache`	`valkey/valkey` (pinned SHA)	`6379:6379`	Valkey — cache, scoring, session metadata
`terra_checkpoints`	`postgres:17-alpine`	(internal only)	Postgres — LangGraph `AsyncPostgresSaver`
`chromadb-vector`	`chromadb/chroma` (pinned SHA)	`8001:8000`	Vector database for semantic caching
`moe-kafka`	`confluentinc/cp-kafka:7.7.0`	`9092:9092`	Kafka event streaming (KRaft, no Zookeeper)
`docker-socket-proxy`	`tecnavia/docker-socket-proxy:0.3.0`	`2375:2375`	Read-only Docker API proxy for moe-admin
`moe-admin`	`./admin_ui/Dockerfile`	`8088:8088`	Admin UI: config, users, monitoring
`moe-prometheus`	`prom/prometheus` (pinned SHA)	`9090:9090`	Metrics scraper (90-day TSDB)
`moe-grafana`	`grafana/grafana` (pinned SHA)	`3001:3000`	Dashboards & visualization
`node-exporter`	`prom/node-exporter` (pinned SHA)	`9100:9100`	Host metrics (CPU, RAM, disk, net)
`cadvisor`	`gcr.io/cadvisor` (pinned SHA)	`9338:8080`	Container resource metrics
`moe-docs`	`squidfunk/mkdocs-material` (pinned SHA)	`8098:8000`	MkDocs documentation server
`moe-docs-sync`	custom Python	—	Periodic docs sync agent (every 15 min)
`moe-dozzle-init`	`python:3.12-alpine`	—	One-shot init: Dozzle bcrypt users.yml
`moe-dozzle`	`amir20/dozzle` (pinned SHA)	`9999:8080`	Container log viewer with auth
`moe-caddy`	`caddy` (pinned SHA)	`80:80`, `443:443`	Internal reverse proxy for docs/dozzle

Host Volumes¶

Path	Contents
`/opt/moe-infra/neo4j-data`	Neo4j data
`/opt/moe-infra/redis-data`	Valkey persistence (historical path name)
`/opt/moe-infra/langgraph-checkpoints`	Postgres data for `terra_checkpoints`
`/opt/moe-infra/chroma-data`	ChromaDB vector data
`/opt/moe-infra/kafka-data`	Kafka log segments
`/opt/moe-infra/agent-logs`	Orchestrator logs

5. LangGraph Pipeline¶

Pipeline Flow¶

flowchart TD
    START([START]) --> CACHE_LOOKUP
    CACHE_LOOKUP["cache_lookup_node"] -->|"cache_hit=True"| MERGER_NODE
    CACHE_LOOKUP -->|"cache_hit=False"| PLANNER_NODE["planner_node"]

    PLANNER_NODE --> EW["expert_worker"]
    PLANNER_NODE --> RN["research_node"]
    PLANNER_NODE --> MN["math_node_wrapper"]
    PLANNER_NODE --> MCPN["mcp_node"]
    PLANNER_NODE --> GRN["graph_rag_node"]

    EW --> RFN["research_fallback_node"]
    RN --> RFN
    MN --> RFN
    MCPN --> RFN
    GRN --> RFN

    RFN --> THINKING_NODE["thinking_node\n(CoT, conditionally active)"]
    THINKING_NODE --> MERGER_NODE["merger_node"]
    MERGER_NODE --> CRITIC_NODE["critic_node\n(fact-check, only medical/legal)"]
    CRITIC_NODE --> END([END])

Nodes in Detail¶

`cache_lookup_node`¶

Queries ChromaDB for 3 semantically most similar entries
Skips entries with flagged=True (negative feedback)
Distance < CACHE_HIT_THRESHOLD (0.15) → cache_hit=True → entire pipeline bypassed

`planner_node`¶

Judge LLM decomposes request into 1–4 subtasks as JSON array
_sanitize_plan() validates each entry: must be dict with task+category; strings, empty dicts, and unknown categories are discarded (logged)
Fallback to [{"task": input, "category": "general"}] on JSON parse error
Knows all expert categories, 16 MCP tools, and research option

Planner Rules (enforced in prompt): - precision_tools always beats math — never both for the same task - research only for genuinely external/current information needs - Never keywords or questions as tasks — always task descriptions

`expert_worker`¶

Filters tasks to LLM expert categories (all except precision_tools, research)
Two-Tier logic: T1 (≤20B) first — if CONFIDENCE: high → T2 skipped; otherwise T2 escalates
Sorts experts by performance score (best first); score < 0.3 → skipped
Injects chat_history (last 4 turns, max 3000 chars) into all expert messages
Output cap: MAX_EXPERT_OUTPUT_CHARS (2400 chars ≈ 600 tokens)
Returns expert_models_used (["model::category", ...])

`research_node`¶

Active only for research task in plan
Uses search_query from plan (optimized by planner)
Returns structured source citations via search.results() (title + URL numbered)

`research_fallback_node`¶

Runs after fan-in of all parallel nodes
For each CONFIDENCE: low expert result: targeted web research with citation tracking
Aggregates new results in web_research

`thinking_node`¶

Conditionally active: for plans with >1 task OR if at least one expert reports CONFIDENCE: low
Magistral:24b runs 4-step Chain-of-Thought:
Problem decomposition
Source evaluation (which info is reliable, contradictions?)
Knowledge gaps
Conclusion
Output as reasoning_trace → prioritized section in merger prompt
Progress appears in Open WebUI <think> panel

`math_node_wrapper`¶

Active only if math task in plan AND no precision_tools task present
SymPy-based (symbolic)

`mcp_node`¶

All precision_tools tasks in parallel via asyncio.gather
HTTP POST to MCP server /invoke

`graph_rag_node`¶

Parallel to all other nodes after planner
Term extraction from request (regex, no LLM call)
2-hop traversal in Neo4j

`merger_node`¶

Calls _dedup_by_category() — keeps only the highest-confidence expert result per category
Source priority: Reasoning trace > MCP > Knowledge graph > Experts > Web > Cache
On cache hit: direct return, no LLM call
After response (> 150 chars), publishes:
ChromaDB: new cache document (sync)
Valkey: response metadata (async task)
Kafka moe.requests: audit log (async task)
Kafka moe.ingest: GraphRAG ingest job (async task)

`critic_node`¶

Only active if plan contains medical_consult or legal_advisor
Second Judge LLM pass checks merger response for factual errors and dangerous statements
Response CONFIRMED → response unchanged; error found → corrected version

State Schema (`AgentState`)¶

Field	Type	Description
`input`	str	User query
`response_id`	str	`chatcmpl-<uuid>` for feedback tracking
`mode`	str	`default` \| `code` \| `concise`
`plan`	List[Dict]	Subtask plan
`expert_results`	Annotated[list, add]	Expert responses (fan-in)
`expert_models_used`	Annotated[list, add]	`["model::category", ...]`
`web_research`	str	SearXNG result with source citations
`cached_facts`	str	Nearest cache hit content
`cache_hit`	bool	Cache short-circuit triggered
`math_result`	str	SymPy result
`mcp_result`	str	MCP tools result
`graph_context`	str	Neo4j context block
`final_response`	str	Final synthesized response
`prompt_tokens`	Annotated[int, add]	Accumulated prompt tokens across all LLM calls
`completion_tokens`	Annotated[int, add]	Accumulated completion tokens across all LLM calls
`chat_history`	List[Dict]	Conversation context (last 4 turns) for experts
`reasoning_trace`	str	CoT output from `thinking_node`

6. Expert System¶

Two-Tier System¶

Experts are divided into tiers by model size:

Tier	Threshold	Role
T1	≤ 20B parameters	Fast initial assessment
T2	> 20B parameters	Premium escalation for non-high confidence

T1 runs first. If at least one T1 result returns CONFIDENCE: high → T2 is skipped. Otherwise all T2 models run in parallel.

Configured Experts¶

Category	Model 1	Tier	Model 2	Tier
`general`	`gemma3:27b`	T2	`qwen3.5:35b`	T2
`math`	`phi4:14b`	T1	`qwq:32b`	T2
`technical_support`	`deepseek-coder-v2:16b`	T1	`devstral:24b`	T2
`creative_writer`	`gemma3:27b`	T2	`qwen3.5:35b`	T2
`code_reviewer`	`devstral:24b`	T2	`qwen3-coder:30b`	T2
`medical_consult`	`phi4:14b`	T1	`gemma3:27b`	T2
`legal_advisor`	`magistral:24b`	T2	`command-r:35b`	T2
`translation`	`translategemma:27b`	T2	`qwen3.5:35b`	T2
`reasoning`	`phi4:14b`	T1	`deepseek-r1:32b`	T2

Judge LLM: magistral:24b (planner, merger, thinking node, critic, GraphRAG extraction)

System Prompts¶

Category	Role & Behavior
`general`	Versatile, fact-based, structured
`math`	Solution steps + LaTeX, back-verification
`technical_support`	IT engineer, executable steps, concrete commands
`creative_writer`	Vivid, stylistically confident, register matching context
`code_reviewer`	Senior SWE, bugs/security/performance, concrete code
`medical_consult`	Factual, guideline-based, always recommend doctor visit
`legal_advisor`	German law (§§), recommend individual legal consultation
`translation`	Professional, idiomatic, cultural nuances
`reasoning`	Chain-of-Thought, state assumptions, justified conclusions

7. MCP Precision Tools Server¶

Port: 8003 · File: mcp_server/server.py

REST Endpoints¶

Endpoint	Method	Description
`/health`	GET	Status + tool list
`/tools`	GET	Tool descriptions (for planner prompt)
`/invoke`	POST	`{"tool": "name", "args": {...}}`
`/mcp/sse`	GET	MCP SSE endpoint (Claude Desktop etc.)

All 16 Tools¶

Tool	Description
`calculate`	Exact arithmetic, formulas, percentages (safe AST evaluator)
`solve_equation`	Algebraic equations (SymPy)
`date_diff`	Exact date difference (days, years, months)
`date_add`	Date arithmetic (add/subtract days/months/years)
`day_of_week`	Day of week, calendar week, day of year
`unit_convert`	Physical units (pint): km/h→m/s, °F→°C, ...
`statistics_calc`	mean, median, stdev, variance, min, max, sum, count, mode
`hash_text`	MD5, SHA1, SHA224, SHA256, SHA384, SHA512
`base64_codec`	Base64 encode / decode
`regex_extract`	Regex matching with flags (i, m, s)
`subnet_calc`	CIDR: network, broadcast, mask, host range
`text_analyze`	Words, characters, sentences, paragraphs, reading time
`prime_factorize`	Prime factorization via SymPy
`gcd_lcm`	GCD and LCM
`json_query`	JSON path queries (key.sub, array[0])
`roman_numeral`	Arabic ↔ Roman (1–3999)

8. GraphRAG & Knowledge Ontology¶

File: graph_rag/manager.py, graph_rag/ontology.py · DB: Neo4j 5

Base Ontology (loaded on startup, idempotent)¶

104 entities — Medical, Legal, Technical, Math/Science
100 relations — IS_A, TREATS, CAUSES, INTERACTS_WITH, CONTRAINDICATES, DEFINES, USES, IMPLEMENTS, DEPENDS_ON, EXTENDS, RELATED_TO, AFFECTS, etc.

Context Query¶

Term extraction from request (regex, prefers nouns/proper nouns)
Fuzzy search on name + aliases_str (case-insensitive)
2-hop traversal: direct + indirect relations
Output as [Knowledge Graph] text block in merger prompt

Background Ingest (via Kafka)¶

Merger response → moe.ingest topic
Kafka consumer calls extract_and_ingest()
Judge LLM extracts up to 8 triples (JSON format)
Conflict check: TREATS vs. CAUSES/CONTRAINDICATES (and vice versa)
New triples are stored with r.verified=false

Feedback Integration¶

Rating	Neo4j
1–2	`r.flagged=true, r.verified=false`
4–5	`r.verified=true, r.flagged=false`

9. Kafka Event Streaming¶

Detailed documentation: Kafka

Quick Overview¶

Topic	Producer	Consumer	Content
`moe.ingest`	`merger_node`	`_kafka_consumer_loop`	GraphRAG ingest job
`moe.requests`	`merger_node`	`_kafka_consumer_loop` (logging)	Audit log
`moe.feedback`	`/v1/feedback`	—	Feedback events

Kafka setup: confluentinc/cp-kafka:7.7.0, KRaft mode (no Zookeeper), port 9092, 7-day retention.

Graceful degradation: If Kafka fails, the system continues operating fully — HTTP responses, cache, Valkey unaffected.

10. Memory & Learning Mechanisms¶

Four Memory Levels¶

Level	Storage	Grows from	TTL
Semantic cache	ChromaDB	Every merger response > 150 chars	unlimited
Knowledge graph	Neo4j	Kafka `moe.ingest` consumer	unlimited
Expert performance	Valkey	`POST /v1/feedback`	unlimited
LangGraph checkpoints	Valkey	Every request	unlimited

Expert Performance Scoring¶

Key:    moe:perf:{model}:{category}
Fields: total, positive, negative

Score = (positive + 1) / (total + 2)   # Laplace smoothing

Under 5 ratings: 0.5 (neutral)
Score < 0.3 after ≥ 5 ratings → expert is skipped

Feedback Effect¶

Rating	ChromaDB	Expert Score	Neo4j
1–2 (negative)	`flagged=True`	`negative++`	`flagged=true`
3 (neutral)	—	—	—
4–5 (positive)	—	`positive++`	`verified=true`

11. OpenAI API & Streaming¶

Streaming (SSE) — fully OpenAI-compatible¶

Each chunk contains id, object, created, model, choices:

1. First chunk:    delta={"role":"assistant","content":""},  finish_reason=null
2. Content chunks: delta={"content":"<50 chars>"},           finish_reason=null
3. Stop chunk:     delta={},                                  finish_reason="stop"
4. Final:          data: [DONE]

The stop chunk with finish_reason: "stop" is the signal for Open WebUI (and all OpenAI-compatible clients) to close the stream and free LLMs. Without this chunk, the client would wait indefinitely.

Non-Streaming¶

{
  "id":      "chatcmpl-<uuid>",
  "object":  "chat.completion",
  "created": 1774783273,
  "model":   "moe-orchestrator",
  "choices": [{"index": 0, "message": {"role": "assistant", "content": "..."}, "finish_reason": "stop"}],
  "usage":   {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0}
}

12. API Reference¶

Base URL: http://<host>:8002

Chat Completions¶

POST /v1/chat/completions
{"model":"moe-orchestrator","messages":[{"role":"user","content":"..."}],"stream":false}

Models¶

GET /v1/models

Feedback¶

POST /v1/feedback
{"response_id":"chatcmpl-...","rating":4}

Graph¶

GET /graph/stats
GET /graph/search?q=Ibuprofen&limit=10

13. Configuration & Environment Variables¶

Variable	Default	Description
`INFERENCE_SERVERS`	—	JSON array of inference server configurations
`JUDGE_ENDPOINT`	—	Server for judge LLM (configured server name)
`JUDGE_MODEL`	`gemma3:4b`	Judge model
`EXPERT_MODELS`	`{}`	JSON: category → `[{model, endpoint}]`
`SEARXNG_URL`	—	SearXNG instance
`REDIS_URL`	`redis://terra_cache:6379`	Valkey connection (`redis://` is the protocol)
`POSTGRES_CHECKPOINT_URL`	`postgresql://langgraph:***@terra_checkpoints:5432/langgraph`	LangGraph checkpoint store
`CHROMA_HOST`	`chromadb-vector`	ChromaDB host
`MCP_URL`	`http://mcp-precision:8003`	MCP server
`NEO4J_URI`	`bolt://neo4j-knowledge:7687`	Neo4j
`NEO4J_USER`	`neo4j`	Neo4j user
`NEO4J_PASS`	`moe-sovereign`	Neo4j password
`KAFKA_URL`	`kafka://moe-kafka:9092`	Kafka bootstrap
`CLUSTER_HARDWARE`	JSON object	GPU nodes, priorities, model assignments
`MAX_EXPERT_OUTPUT_CHARS`	`2400`	Output cap per expert
`CACHE_HIT_THRESHOLD`	`0.15`	Cosine distance threshold
`LOG_LEVEL`	`INFO`	DEBUG / INFO / WARNING / ERROR

14. Data Persistence¶

Data	Storage	TTL	Valkey Key / ChromaDB Collection
LangGraph checkpoints	Valkey	unlimited	`langgraph:*`
Semantic cache	ChromaDB	unlimited	`moe_fact_cache`
Expert performance	Valkey	unlimited	`moe:perf:{model}:{cat}`
Response metadata	Valkey	7 days	`moe:response:{id}`
Knowledge graph	Neo4j	unlimited	Volume `/opt/moe-infra/neo4j-data`
Kafka events	Kafka	7 days / 512 MB	`/opt/moe-infra/kafka-data`

15. Deployment¶

First Start¶

cd /opt/deployment/moe-infra
cp temp.env .env
# configure INFERENCE_SERVERS, SEARXNG_URL, EXPERT_MODELS
sudo mkdir -p /opt/moe-infra/{neo4j-data,neo4j-logs,redis-data,langgraph-checkpoints,chroma-data,kafka-data,agent-logs}
sudo docker compose up -d --build
sudo docker compose logs -f langgraph-orchestrator

After Code Changes¶

# Orchestrator
sudo docker compose build langgraph-app && sudo docker compose up -d langgraph-app

# MCP server
sudo docker compose build mcp-precision && sudo docker compose up -d mcp-precision

Startup Sequence (Orchestrator)¶

Valkey client
Kafka producer (12 attempts, 5–60s backoff)
Load MCP tool descriptions
Neo4j GraphRAG (6 attempts, 10–60s backoff)
Start Kafka consumer as background task
AsyncValkeySaver → compile LangGraph graph
FastAPI on port 8000

Health Checks¶

curl http://localhost:8002/v1/models          # orchestrator
curl http://localhost:8002/graph/stats        # GraphRAG
curl http://localhost:8003/health             # MCP server
sudo docker exec terra_cache valkey-cli ping  # Valkey
sudo docker exec moe-kafka kafka-topics \
  --bootstrap-server localhost:9092 --list    # Kafka topics
# Neo4j Browser: http://localhost:7474

16. Project Structure¶

graph LR
    ROOT[/opt/deployment/moe-infra/]

    ROOT --> MAIN[main.py<br/>LangGraph orchestrator, FastAPI, nodes, Kafka]
    ROOT --> MATH[math_node.py<br/>SymPy math node]
    ROOT --> REQ[requirements.txt]
    ROOT --> DF[Dockerfile]
    ROOT --> DC[docker-compose.yml<br/>all services]
    ROOT --> ENV[.env<br/>not in git]
    ROOT --> GR[graph_rag/]
    ROOT --> MCPD[mcp_server/]
    ROOT --> ADMIN[admin_ui/<br/>Flask admin panel]
    ROOT --> DOCS[docs/<br/>MkDocs source]

    GR --> GRM[manager.py<br/>Neo4j client, query_context, extract_and_ingest]
    GR --> GRO[ontology.py<br/>base ontology]

    MCPD --> MCPS[server.py<br/>deterministic precision tools]
    MCPD --> MCPDF[Dockerfile]

17. Quality Mechanisms¶

Confidence Scoring¶

All experts produce structured output with self-assessed confidence:

KEY_MESSAGE: [1-2 sentences]
CONFIDENCE: high | medium | low
DETAILS: [full response]

high = established expert knowledge · medium = nuances possible · low = data gaps, uncertainty

Automated Quality Assurance Chain¶

flowchart TD
    EO[Expert output] --> FB{CONFIDENCE: low?}
    FB -->|yes| RF[research_fallback<br/>web research with citations]
    FB -->|no| TN
    RF --> TN

    TN[thinking_node<br/>active if &gt;1 task OR low confidence<br/>4-step CoT: decomposition → sources → gaps → conclusion]
    TN --> MN

    MN[merger_node<br/>_dedup_by_category: best confidence per category wins<br/>Source priority: Reasoning &gt; MCP &gt; Graph &gt; Experts &gt; Web &gt; Cache]
    MN --> CN

    CN{critic_node<br/>only medical_consult / legal_advisor<br/>factual errors? dangerous statements?}
    CN -->|CONFIRMED| OUT([pass])
    CN -->|error| FIX([corrected version])

Open WebUI Integration¶

<think> panel: Progress reports from all nodes via contextvars.ContextVar → SSE stream → "Thinking" panel
Internal requests: _is_openwebui_internal() detects title/follow-up/autocomplete → fast path without pipeline
Token tracking: Accumulated usage fields (all LLM calls) in every response

Output Modes¶

Model ID	Mode	Behavior
`moe-orchestrator`	`default`	Full responses with explanations
`moe-orchestrator-code`	`code`	Source code only, no prose
`moe-orchestrator-concise`	`concise`	Max 120 words

Generated on 2026-03-29 — Version 2.1.0