Enterprise Architecture Features
These features were inspired by an analysis of proprietary enterprise AI platforms (Palantir AIP, Databricks Mosaic AI, Glean) and adapted for the open-source, sovereignty-first architecture of MoE Sovereign.
Confidence Decay & Self-Healing Knowledge Graph
Problem
False associations from bad queries contaminate the Neo4j knowledge graph permanently.
For example, a query about "car wash automation" might create a relation
(CarWash)-[USES]->(Neo4j) that pollutes all future technical queries.
Solution: Trust Score
Every relation receives a computed trust score:
$$ \text{trust} = \text{confidence} \times w_{\text{source}} \times \text{decay} \times v_{\text{bonus}} $$
| Factor | Values | Description |
|---|---|---|
| confidence | 0.0 - 1.0 | Base confidence from the extracting LLM |
| w_source | ontology: 1.0, healer: 0.9, extracted: 0.6 | Source reliability weight |
| decay | max(0.3, 1 - days/365) | Temporal decay over one year |
| v_bonus | verified: 1.5x, else 1.0 | Bonus for human-verified relations |
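A minimal Python sketch of the formula and factor table above may help make the arithmetic concrete. The function name `trust_score`, its parameters, and the choice of the relation's creation date as the reference point for `days` are assumptions made for illustration, not the project's actual code.

```python
from datetime import date

# Source reliability weights, copied from the factor table.
SOURCE_WEIGHTS = {"ontology": 1.0, "healer": 0.9, "extracted": 0.6}


def trust_score(confidence: float, source: str, created: date,
                today: date, verified: bool = False) -> float:
    """Compute trust = confidence * w_source * decay * v_bonus.

    Assumption: `days` is counted from the relation's creation date.
    """
    days = (today - created).days
    decay = max(0.3, 1 - days / 365)       # linear decay, floored at 0.3
    v_bonus = 1.5 if verified else 1.0     # bonus for human-verified relations
    return confidence * SOURCE_WEIGHTS[source] * decay * v_bonus


if __name__ == "__main__":
    # A fresh, unverified, LLM-extracted relation with confidence 0.8.
    print(trust_score(0.8, "extracted", date(2024, 1, 1), date(2024, 1, 1)))
    # The same kind of relation with confidence 0.5, more than a year old.
    print(trust_score(0.5, "extracted", date(2023, 1, 1), date(2024, 6, 1)))
```

Under these assumptions, an unverified extracted relation with confidence 0.5 falls to 0.5 × 0.6 × 0.3 = 0.09 once the decay floor is reached, which is below the 0.2 cleanup threshold described in the next section.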
Automatic Cleanup
Relations with trust < 0.2 that are:
- Unverified (verified = false)
- Single-assertion (version = 1, never re-confirmed by a later query)

...are automatically deleted during:
- Phase 3 of Graph Linting (triggered via the Kafka moe.linting topic)
- Phase 0 of the Nightly Gap Healer (runs at 02:00 via systemd timer)
All deletions are logged to the moe.audit Kafka topic for auditability.
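The deletion rule can be read as a single predicate. The sketch below assumes that all three conditions (low trust, unverified, single-assertion) must hold together; the `Relation` dataclass and the function name are hypothetical and only illustrate the rule.

```python
from dataclasses import dataclass

TRUST_THRESHOLD = 0.2  # cleanup threshold stated in the text


@dataclass
class Relation:
    """Hypothetical in-memory view of a graph relation."""
    trust: float      # computed trust score
    verified: bool    # True if a human verified the relation
    version: int      # 1 means it was never re-confirmed by a later query


def is_cleanup_candidate(rel: Relation) -> bool:
    """True if the relation matches all three deletion conditions."""
    return (rel.trust < TRUST_THRESHOLD
            and not rel.verified
            and rel.version == 1)


if __name__ == "__main__":
    relations = [
        Relation(trust=0.09, verified=False, version=1),  # matches the rule
        Relation(trust=0.09, verified=True, version=1),   # kept: verified
        Relation(trust=0.09, verified=False, version=3),  # kept: re-confirmed
    ]
    print([is_cleanup_candidate(r) for r in relations])
```

Verified or re-confirmed relations survive regardless of how low their trust score drops, which is what keeps the cleanup from removing facts that have independent support.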
Prometheus Metrics
moe_linting_decay_deleted_total: Relations removed by confidence decay
Multi-Tenant RBAC (Graph-Level Isolation)
Problem
All users see all Neo4j entities. Multi-tenant deployments need data isolation without modifying the LLM pipeline.
Solution: Tenant-Filtered Queries
- Permission Type: graph_tenant (assigns tenant slugs to users via the Admin UI)
- Neo4j Property: tenant_id on Entity nodes (indexed for performance)
- Pipeline Propagation: tenant_ids extracted from user permissions, passed through AgentState
- Query Filtering: Cypher includes WHERE e.tenant_id IN $tenant_ids OR e.tenant_id IS NULL
Entities with tenant_id = NULL are shared/public and visible to all users.
New entities created during a request are tagged with the requesting user's primary tenant.
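The visibility rule can be mirrored in a few lines of Python. In this sketch only the WHERE clause is taken from the text; the surrounding MATCH/RETURN, the helper names, and the idea that the "primary" tenant is the first slug in the user's list are assumptions for illustration.

```python
from typing import Optional, Sequence

# Only the WHERE clause is quoted from the text; MATCH/RETURN are assumed.
TENANT_QUERY = (
    "MATCH (e:Entity) "
    "WHERE e.tenant_id IN $tenant_ids OR e.tenant_id IS NULL "
    "RETURN e"
)


def is_visible(entity_tenant: Optional[str], tenant_ids: Sequence[str]) -> bool:
    """Python mirror of the WHERE clause: shared (None) or owned by the user."""
    return entity_tenant is None or entity_tenant in tenant_ids


def tenant_for_new_entity(tenant_ids: Sequence[str]) -> Optional[str]:
    """Assumption: the primary tenant is the first slug in the user's list."""
    return tenant_ids[0] if tenant_ids else None


if __name__ == "__main__":
    user_tenants = ["acme-corp"]
    print(is_visible(None, user_tenants))         # shared entity
    print(is_visible("acme-corp", user_tenants))  # own tenant
    print(is_visible("internal", user_tenants))   # other tenant
```

Passing `tenant_ids` as a query parameter rather than interpolating it into the Cypher string keeps the filter independent of user-controlled input.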
Admin Configuration
Navigate to Users > select user > Permissions > add graph_tenant permission
with the tenant slug as resource ID (e.g., acme-corp, internal).
Inline Provenance Tags
Problem
Final answers have no source attribution. Users cannot verify which claims come from the knowledge graph vs. general LLM knowledge.
Solution: [REF:entity] Tags
The merger prompt includes a PROVENANCE_INSTRUCTION that marks knowledge-graph-derived
facts with [REF:entity_name] tags. These are:
- Extracted post-merger via regex
- Stripped from the user-visible content for clean output
- Returned as metadata.sources in the API response:
{
  "choices": [{"message": {"content": "The answer..."}}],
  "metadata": {
    "sources": [
      {"type": "neo4j", "label": "PostgreSQL"},
      {"type": "neo4j", "label": "TLS_1.3"}
    ]
  }
}
This is backward-compatible with the OpenAI chat completion format.
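The extract-and-strip step could look roughly like the following. The regular expression, the function name `extract_provenance`, and the whitespace cleanup are assumptions; the source only states that tags are extracted via regex, stripped from the content, and returned as `metadata.sources`.

```python
import re
from typing import Dict, List, Tuple

# Assumed tag pattern: [REF:<entity name>] with no closing bracket in the name.
REF_TAG = re.compile(r"\[REF:([^\]]+)\]")


def extract_provenance(text: str) -> Tuple[str, List[Dict[str, str]]]:
    """Return (clean_text, sources) for a merged answer containing REF tags."""
    labels: List[str] = []
    for label in REF_TAG.findall(text):
        label = label.strip()
        if label not in labels:          # de-duplicate, keep first-seen order
            labels.append(label)
    clean = REF_TAG.sub("", text)                        # drop the tags
    clean = re.sub(r"[ \t]+([.,;:!?])", r"\1", clean)    # no space before punctuation
    clean = re.sub(r"[ \t]{2,}", " ", clean).strip()     # collapse leftover gaps
    sources = [{"type": "neo4j", "label": label} for label in labels]
    return clean, sources


if __name__ == "__main__":
    answer = ("PostgreSQL supports TLS 1.3 [REF:PostgreSQL] [REF:TLS_1.3]. "
              "It is widely used.")
    content, sources = extract_provenance(answer)
    print(content)
    print(sources)
```

Because the tags never reach the `content` field, clients that only read the standard chat completion fields see an ordinary answer, while clients that understand `metadata.sources` get the attribution.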
Blast-Radius Estimation & Quarantine
Problem
A single erroneous triple can affect many future queries if it connects to densely-linked entities in the knowledge graph.
Solution: Pre-Write Impact Check
Before each triple is written to Neo4j, _estimate_blast_radius() counts how many
entities are reachable within 2 hops from both subject and object:
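OPTIONAL MATCH (a:Entity {name: $s})-[*1..2]-(n1:Entity)
WITH collect(DISTINCT n1) AS from_subject
OPTIONAL MATCH (b:Entity {name: $o})-[*1..2]-(n2:Entity)
WITH from_subject, collect(DISTINCT n2) AS from_object
UNWIND from_subject + from_object AS n
RETURN count(DISTINCT n) AS reach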
If reach > 20 (configurable via _BLAST_RADIUS_THRESHOLD):
- The triple is not written to Neo4j
- It is stored in a Valkey sorted set moe:quarantine (TTL 7 days)
- The Admin UI Quarantine page shows all quarantined triples
- An admin can Approve (write to Neo4j) or Reject (discard)
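To illustrate what the reach count measures, the sketch below reproduces the 2-hop check on an in-memory adjacency map instead of Neo4j. The graph representation and the function names are hypothetical, and the sketch excludes the start entity from its own reach, which is a simplification of the Cypher semantics.

```python
from collections import deque
from typing import Dict, Set

BLAST_RADIUS_THRESHOLD = 20  # default threshold stated in the text


def reachable_within(graph: Dict[str, Set[str]], start: str,
                     max_hops: int = 2) -> Set[str]:
    """Entities reachable from `start` in 1..max_hops undirected hops."""
    if start not in graph:
        return set()                      # unknown entity: nothing to affect
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue                      # do not expand beyond the hop limit
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, depth + 1))
    seen.discard(start)
    return seen


def blast_radius(graph: Dict[str, Set[str]], subject: str, obj: str) -> int:
    """Distinct entities within 2 hops of the subject or the object."""
    return len(reachable_within(graph, subject) | reachable_within(graph, obj))


def should_quarantine(graph: Dict[str, Set[str]], subject: str, obj: str,
                      threshold: int = BLAST_RADIUS_THRESHOLD) -> bool:
    return blast_radius(graph, subject, obj) > threshold


if __name__ == "__main__":
    # A hub entity linked to 25 others: any new triple touching it is held back.
    hub_graph = {"hub": {f"n{i}" for i in range(25)}}
    for i in range(25):
        hub_graph[f"n{i}"] = {"hub"}
    print(blast_radius(hub_graph, "hub", "new_entity"))
    print(should_quarantine(hub_graph, "hub", "new_entity"))
```

Triples that only touch sparsely connected or previously unknown entities pass straight through, so the quarantine queue stays focused on writes that could influence many later queries.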
Quarantine Page
Navigate to Quarantine in the Admin UI navigation bar. The page shows:
| Column | Description |
|---|---|
| Subject / Object | Entity names and types |
| Relation | Relation type (e.g., TREATS, USES) |
| Reach | Number of connected entities (blast radius) |
| Source Model | Which LLM generated the triple |
| Confidence | Extraction confidence score |
Prometheus Metrics
moe_quarantine_added_total: Triples sent to quarantine
Cache Performance Analysis
A controlled A/B test measured the actual impact of the L1 semantic cache (ChromaDB, cosine distance < 0.15):
| Query Type | Latency | Tokens | Result |
|---|---|---|---|
| Cold (first time) | 280.7s | 8,001 | Full pipeline |
| Exact repeat | 293.3s | 9,128 | No cache hit |
| Semantically similar | timeout | 0 | Pipeline stalled |
Finding: The L1 cache did not trigger for standard queries in this test. The most likely causes are (1) the cosine distance threshold of 0.15 is very strict, and (2) the cache primarily benefits repeated plan patterns (L2 Valkey cache, SHA-256 hash match) rather than full-response caching.
The accumulation effect observed in the benchmark runs (55% latency reduction between Run 1 and Run 2) is primarily driven by GraphRAG context enrichment and model warmth, not by the L1 semantic cache. The cache hierarchy adds value for specific patterns (identical queries within 30 minutes) but is not a general-purpose acceleration layer.
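The two matching rules mentioned in the finding can be sketched as follows to show how strict each one is. The function names are hypothetical, the embeddings are toy vectors, and what exactly is hashed for the L2 key is an assumption; only the 0.15 distance threshold and the use of SHA-256 come from the text.

```python
import hashlib
import math
from typing import Sequence

L1_MAX_DISTANCE = 0.15  # cosine distance threshold stated in the text


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def l1_hit(query_vec: Sequence[float], cached_vec: Sequence[float]) -> bool:
    """Semantic (L1) match: embeddings must be closer than the threshold."""
    return cosine_distance(query_vec, cached_vec) < L1_MAX_DISTANCE


def l2_key(plan_text: str) -> str:
    """Exact (L2) match key. Assumption: the plan text itself is hashed."""
    return hashlib.sha256(plan_text.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # Cosine similarity 0.8 means distance 0.2, which is still a miss.
    print(l1_hit([1.0, 0.0], [0.8, 0.6]))
    # A single extra character produces a different L2 key.
    print(l2_key("plan: search, merge") == l2_key("plan: search, merge "))
```

Both rules are narrow by design: the L1 check rejects embeddings that are similar but not near-identical, and the L2 check only matches byte-identical input.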
Floating Node Discovery
Problem
Pinning every expert model to a specific inference node is operationally rigid. When a node is rebooted or its VRAM is occupied by another model, the pinned expert fails even though the same model may already be warm on a different node.
Solution: Empty Endpoint = All Nodes
When an expert's endpoint field in a template is left empty (or set to ""),
the orchestrator enters floating mode: it queries every configured inference
server for availability of the requested model.
Node selection follows a 3-phase strategy:
| Phase | Check | Description |
|---|---|---|
| 0 | Sticky session | Valkey lookup for recent user-node affinity (see below) |
| 1 | Warm model | Ollama /api/ps (5s cache) — prefer nodes where the model is already loaded in VRAM |
| 2 | Load score | Among warm (or cold) candidates, pick the node with the lowest running / gpu_count ratio |
This means floating experts automatically migrate to whichever node currently has the model warm, without any admin intervention.
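The three phases can be sketched as a single selection function. The `Node` dataclass, the function name `select_node`, and the reading of `running` as the number of in-flight requests are assumptions for illustration; the warm flag stands in for the result of the `/api/ps` check, and every node is assumed to have at least one GPU.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical snapshot of one inference server."""
    name: str
    running: int     # assumed: requests currently in flight
    gpu_count: int   # assumed to be at least 1
    warm: bool       # model already loaded in VRAM


def select_node(nodes: List[Node], sticky: Optional[str] = None) -> Optional[Node]:
    """Pick a node using the sticky, warm, then load-score phases."""
    if not nodes:
        return None
    # Phase 0: reuse the sticky node if it is still among the candidates.
    for node in nodes:
        if node.name == sticky:
            return node
    # Phase 1: prefer warm nodes; fall back to all nodes if none is warm.
    warm = [node for node in nodes if node.warm]
    candidates = warm or nodes
    # Phase 2: lowest running / gpu_count ratio wins.
    return min(candidates, key=lambda node: node.running / node.gpu_count)


if __name__ == "__main__":
    cluster = [
        Node("node-a", running=1, gpu_count=2, warm=False),
        Node("node-b", running=3, gpu_count=4, warm=True),
        Node("node-c", running=1, gpu_count=4, warm=True),
    ]
    print(select_node(cluster).name)
    print(select_node(cluster, sticky="node-a").name)
```

In this example the lightly loaded cold node loses to a warm node, since avoiding a model load is treated as more valuable than a slightly lower load score.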
Sticky Sessions
Problem
Floating mode could bounce a user between nodes on every request, causing unnecessary model loads and cold-start latency.
Solution: Valkey-Backed User-Node Affinity
After each successful node selection, the orchestrator stores an affinity record:
| Parameter | Value |
|---|---|
| Key pattern | moe:sticky:{user_id}:{model_base} |
| Value | Node name (e.g., N04-RTX) |
| TTL | 300 seconds (5 minutes) |
On the next request, Phase 0 of _select_node() checks for a sticky hit before
evaluating warm/cold status. If the sticky node is still in the allowed endpoint
list, it is used immediately — skipping the more expensive /api/ps fan-out.
The 5-minute TTL balances affinity with adaptability: short interactive sessions stay on the same node, while idle users naturally lose their affinity and get re-routed to the currently optimal node.
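The affinity record behaves like a key with an expiry. The sketch below uses an in-memory dictionary as a stand-in for Valkey so that it runs without a server; the class `StickyStore`, its methods, and the example model name are hypothetical, while the key pattern and the 300-second TTL come from the table above.

```python
import time
from typing import Callable, Dict, Optional, Sequence, Tuple

STICKY_TTL_SECONDS = 300  # 5 minutes, as stated in the table


def sticky_key(user_id: str, model_base: str) -> str:
    """Build the key following the documented pattern."""
    return f"moe:sticky:{user_id}:{model_base}"


class StickyStore:
    """In-memory stand-in for a Valkey key with a TTL (illustration only)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._data: Dict[str, Tuple[str, float]] = {}

    def remember(self, user_id: str, model_base: str, node: str) -> None:
        """Store the affinity after a successful node selection."""
        expires_at = self._clock() + STICKY_TTL_SECONDS
        self._data[sticky_key(user_id, model_base)] = (node, expires_at)

    def lookup(self, user_id: str, model_base: str,
               allowed: Sequence[str]) -> Optional[str]:
        """Return the sticky node if it is unexpired and still allowed."""
        key = sticky_key(user_id, model_base)
        entry = self._data.get(key)
        if entry is None:
            return None
        node, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]          # emulate TTL expiry
            return None
        return node if node in allowed else None


if __name__ == "__main__":
    store = StickyStore()
    store.remember("user-1", "model-x", "N04-RTX")
    print(store.lookup("user-1", "model-x", allowed=["N04-RTX"]))
```

Checking the node against the allowed endpoint list at lookup time means a template change takes effect immediately, without waiting for the affinity record to expire.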
Model Registry
Problem
The warm-model check (/api/ps) requires an HTTP fan-out to every inference server.
At scale (many nodes, many requests), this creates unnecessary network traffic.
Solution: Valkey ZSET Registry with Heartbeat
A background task polls each inference server every 60 seconds and registers all currently loaded models in Valkey sorted sets:
ZADD moe:model_registry:{model_base} {timestamp} {node_name}
EXPIRE moe:model_registry:{model_base} 120
| Parameter | Value |
|---|---|
| Key pattern | moe:model_registry:{model_base} |
| Score | Unix timestamp of last heartbeat |
| Member | Node name |
| TTL | 120 seconds (2x heartbeat interval) |
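This answers "which nodes currently have model X warm?" with a single Valkey read instead of a live HTTP fan-out. The 120s TTL removes a model's key once no node has refreshed it for two heartbeat intervals, which covers the case where the last node serving the model goes offline or unloads it. Because the TTL applies to the whole key, the per-node heartbeat timestamp stored as the score is what distinguishes a fresh member from a stale one while other nodes keep the key alive.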
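The registry can be simulated in memory to show how heartbeat timestamps are written and read. The class `ModelRegistry` and its methods are hypothetical, and filtering members by score at read time is an assumption added for illustration (it corresponds to a score-range read on the sorted set); the key pattern and the 120-second window come from the table above.

```python
from typing import Dict, Iterable, List

REGISTRY_TTL_SECONDS = 120  # 2x the 60-second heartbeat interval


def registry_key(model_base: str) -> str:
    """Build the key following the documented pattern."""
    return f"moe:model_registry:{model_base}"


class ModelRegistry:
    """In-memory stand-in for the Valkey sorted sets (illustration only)."""

    def __init__(self) -> None:
        # key -> {member (node name): score (heartbeat timestamp)}
        self._zsets: Dict[str, Dict[str, float]] = {}

    def heartbeat(self, node: str, loaded_models: Iterable[str], now: float) -> None:
        """Record every model a node currently has loaded (like ZADD)."""
        for model_base in loaded_models:
            self._zsets.setdefault(registry_key(model_base), {})[node] = now

    def warm_nodes(self, model_base: str, now: float) -> List[str]:
        """Nodes whose last heartbeat for this model is within the TTL window."""
        members = self._zsets.get(registry_key(model_base), {})
        cutoff = now - REGISTRY_TTL_SECONDS
        return sorted(node for node, ts in members.items() if ts >= cutoff)


if __name__ == "__main__":
    registry = ModelRegistry()
    registry.heartbeat("N04-RTX", ["model-x"], now=1000.0)
    registry.heartbeat("node-b", ["model-x"], now=1060.0)
    print(registry.warm_nodes("model-x", now=1100.0))   # both heartbeats fresh
    print(registry.warm_nodes("model-x", now=1150.0))   # first heartbeat too old
```

Reading by score keeps the answer accurate even when one node stops reporting a model while another continues to refresh the same key.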
Validation
The four features listed below were validated in a benchmark run after deployment:
| Feature | Impact on Score | Impact on Latency |
|---|---|---|
| Confidence Decay | None (6.0/10 stable) | < 50ms per linting run |
| Tenant RBAC | None | < 10ms per query (indexed) |
| Provenance Tags | None | 0ms (prompt addition only) |
| Blast-Radius | None | < 50ms per ingest (2-hop query) |