Tool Stack Overview
Sovereign MoE combines several specialized open-source components into a coherent orchestration stack. Each component addresses a specific problem that a single, monolithic LLM system does not solve on its own.
Architecture Diagram
```mermaid
flowchart TD
    CLIENT["Client\n(Open WebUI · curl · SDK)"]

    subgraph ORCH["LangGraph Orchestrator · Port 8002"]
        direction TB
        CACHE["cache_lookup\n(ChromaDB Semantic)"]
        PLAN["planner\n(Judge-LLM)"]
        WORKERS["workers\n(Expert LLMs)"]
        RESEARCH["research\n(SearXNG)"]
        MATH["math\n(SymPy internal)"]
        MCP_N["mcp\n(Precision Tools)"]
        GRAPH_N["graph_rag\n(Neo4j)"]
        MERGER["merger\n(Judge-LLM)"]
        THINKING["thinking\n(CoT, conditional)"]
        CRITIC["critic\n(fact-check)"]
    end

    subgraph INFERENCE["Inference Servers (configured via Admin UI)"]
        direction LR
        SRV1["Inference Server 1\nOllama-compatible"]
        SRV2["Inference Server 2\noptional"]
    end

    subgraph PERSIST["Persistence Layer"]
        REDIS[("Valkey\nPort 6379\nScoring · Session Cache")]
        CHROMA[("ChromaDB\nPort 8001\nSemantic Cache")]
        NEO4J[("Neo4j\nPort 7687\nKnowledge Graph")]
    end

    subgraph STREAMING["Async Streaming"]
        KAFKA[("Kafka\nPort 9092\nmoe.ingest · moe.requests · moe.feedback")]
        KCONS["Kafka Consumer\n→ Neo4j Ingest"]
        KAFKA --> KCONS --> NEO4J
    end

    MCP_SERVER["MCP Precision Tools\nPort 8003\n16 deterministic tools"]
    SEARXNG["SearXNG\nPort 8888\nPrivate web search"]

    CLIENT -->|"POST /v1/chat/completions"| CACHE
    CACHE -->|"Hit"| CLIENT
    CACHE -->|"Miss"| PLAN
    PLAN --> WORKERS & RESEARCH & MATH & MCP_N & GRAPH_N
    WORKERS --> INFERENCE
    PLAN --> INFERENCE
    MERGER --> INFERENCE
    RESEARCH --> SEARXNG
    MCP_N --> MCP_SERVER
    GRAPH_N --> NEO4J
    WORKERS -->|"Confidence < threshold"| THINKING
    THINKING --> MERGER
    WORKERS & RESEARCH & MATH & MCP_N & GRAPH_N --> MERGER
    MERGER --> CRITIC --> CLIENT
    MERGER --> CHROMA
    MERGER --> REDIS
    MERGER -->|"moe.ingest + moe.requests"| KAFKA
```
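The request path in the diagram (cache lookup, planner, parallel fan-out, conditional thinking step, merger, critic) can be sketched in plain `asyncio`. This is an illustration of the control flow only, not the project's LangGraph code: `run_node`, `handle`, the canned results, and `CONFIDENCE_THRESHOLD` are hypothetical, and an exact-match dictionary stands in for the ChromaDB semantic cache.

```python
import asyncio

# Hypothetical value; the actual threshold is not stated in this document.
CONFIDENCE_THRESHOLD = 0.7


async def run_node(name: str, query: str) -> dict:
    """Stand-in for a real node: returns a canned result with a confidence."""
    await asyncio.sleep(0)  # yield control, as an I/O-bound node would
    confidence = 0.5 if name == "workers" else 1.0
    return {"node": name, "text": f"{name} result for {query!r}", "confidence": confidence}


async def handle(query: str, cache: dict) -> dict:
    # cache_lookup: a hit goes straight back to the client.
    if query in cache:
        return {"answer": cache[query], "path": ["cache_lookup"]}

    path = ["cache_lookup", "planner"]
    # planner: this sketch always selects every branch; a real planner would choose.
    branches = ["workers", "research", "math", "mcp", "graph_rag"]

    # Parallel fan-out to the selected branches.
    results = await asyncio.gather(*(run_node(b, query) for b in branches))
    path.extend(branches)

    # thinking: entered only when the workers' confidence is below the threshold.
    workers = next(r for r in results if r["node"] == "workers")
    if workers["confidence"] < CONFIDENCE_THRESHOLD:
        results.append(await run_node("thinking", query))
        path.append("thinking")

    # merger combines the branch outputs; critic is the last step before the client.
    merged = " | ".join(r["text"] for r in results)
    path.extend(["merger", "critic"])
    cache[query] = merged  # the merger also writes the answer to the cache
    return {"answer": merged, "path": path}
```

Calling `asyncio.run(handle("What is 2+2?", {}))` walks the full miss path; a second call with the same cache dictionary returns after `cache_lookup` alone.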
Component Overview
| Component | Role | Port | Documentation |
|---|---|---|---|
| LangGraph | Orchestration, parallel fan-out, state management | internal | langgraph.md |
| Ollama | Multi-node LLM inference | 11434 | ollama_cluster.md |
| Neo4j | Temporal GraphRAG, knowledge graph | 7687 | graphrag_neo4j.md |
| Valkey | Expert scoring, session cache | 6379 | — |
| ChromaDB | Semantic response cache | 8001 | — |
| Kafka | Async ingest buffer, audit log | 9092 | Kafka docs |
| SearXNG | Private web search (no Google tracking) | 8888 | — |
| MCP Server | 16 deterministic precision tools | 8003 | mcp_tools.md |
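When bringing the stack up, it can help to confirm that each listed port accepts TCP connections. The following is a minimal sketch using only the standard library; the port numbers come from the table above, while the `localhost` default and the helper names `is_reachable` and `check_stack` are assumptions for illustration. A successful connection only shows that something is listening, not that the service is healthy.

```python
import socket

# Ports as listed in the component table; LangGraph is omitted because it is internal.
PORTS = {
    "Ollama": 11434,
    "Neo4j": 7687,
    "Valkey": 6379,
    "ChromaDB": 8001,
    "Kafka": 9092,
    "SearXNG": 8888,
    "MCP Server": 8003,
}


def is_reachable(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_stack(host: str = "localhost") -> dict:
    """Map each component name to whether its port accepts connections."""
    return {name: is_reachable(host, port) for name, port in PORTS.items()}
```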
Design Principles
Determinism over LLM estimation: calculations, hashes, date operations, and network subnet calculations always run through the MCP server, never through a language model.
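To illustrate why these tasks suit deterministic code, here is a small sketch of three such operations written with the Python standard library. These functions are hypothetical examples and are not the MCP server's actual tools or signatures; they only show that the same input always yields the same exact output.

```python
import hashlib
import ipaddress
from datetime import date


def subnet_info(cidr: str) -> dict:
    """Compute network address, broadcast address, and size for an IPv4 CIDR block."""
    net = ipaddress.ip_network(cidr, strict=True)
    return {
        "network": str(net.network_address),
        "broadcast": str(net.broadcast_address),
        "addresses": net.num_addresses,
    }


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def days_between(start: str, end: str) -> int:
    """Number of days from start to end, both given as ISO dates (YYYY-MM-DD)."""
    return (date.fromisoformat(end) - date.fromisoformat(start)).days
```

For example, `subnet_info("10.0.0.0/30")` reports 4 addresses with broadcast `10.0.0.3`, and `days_between("2024-02-28", "2024-03-01")` returns 2 because 2024 is a leap year.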
Decoupling via Kafka: the HTTP response path and data persistence are completely separated. A Kafka outage does not block responses; it only interrupts later graph learning.
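One way to get this behavior is to hand the answer back first and publish in a background task that swallows broker errors. The sketch below assumes a hypothetical async `publish(topic, event)` callable rather than any specific Kafka client; `publish_safely`, `respond`, and the `dropped` list are illustrative names. Only the topic name `moe.ingest` comes from the diagram.

```python
import asyncio
from typing import Awaitable, Callable, List, Tuple

# Hypothetical publisher signature; a real one would wrap a Kafka producer.
Publisher = Callable[[str, dict], Awaitable[None]]


async def publish_safely(publish: Publisher, topic: str, event: dict, dropped: List[tuple]) -> None:
    """Publish an event; on failure, record it instead of raising."""
    try:
        await publish(topic, event)
    except Exception as exc:  # a broker outage must never reach the HTTP path
        dropped.append((topic, repr(exc)))


async def respond(answer: str, publish: Publisher, dropped: List[tuple]) -> Tuple[str, asyncio.Task]:
    """Return the answer immediately; persistence runs as a background task."""
    task = asyncio.create_task(
        publish_safely(publish, "moe.ingest", {"answer": answer}, dropped)
    )
    # The task is returned so the caller keeps a reference and it is not garbage-collected.
    return answer, task
```

With a publisher that raises `ConnectionError`, `respond` still returns the answer, and the failure appears only in `dropped` once the background task has run.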
Heterogeneous hardware: Ollama abstracts different GPU generations (from consumer cards to enterprise Tesla GPUs) behind a unified OpenAI-compatible API. Inference servers are configured via Admin UI → Servers, with priority routing weighted by availability.
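The exact routing rule is not specified here. As one possible reading, the sketch below filters out unavailable servers and then picks the one with the best priority. The `Server` fields, the "lower number wins" convention, and `pick_server` are all assumptions for illustration and do not describe the Admin UI's actual data model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Server:
    name: str
    priority: int    # assumption: a lower number means more preferred
    available: bool  # assumption: the result of some health check


def pick_server(servers: List[Server]) -> Optional[Server]:
    """Return the available server with the best priority, or None if none is up."""
    candidates = [s for s in servers if s.available]
    return min(candidates, key=lambda s: s.priority, default=None)
```

Given a preferred server that is down and a lower-priority one that is up, `pick_server` falls back to the one that is up; with no servers available it returns `None`.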
No vendor lock-in: all components are self-hosted. SearXNG instead of Google, Ollama instead of OpenAI, Neo4j Community instead of vector-based cloud services.