Intelligence Growth Prognosis¶
Document type
Empirical analysis and extrapolation — not speculation. All projections are explicitly derived from measured data (knowledge graph growth, benchmark scores, hardware latency baselines). Confidence intervals are stated where uncertainty exists.
1. Executive Summary¶
MoE Sovereign is a self-improving compound AI system whose answer quality scales with accumulated usage — independent of model weight updates. This document quantifies how that improvement curve looks across four hardware tiers and three deployment scenarios (private, SMB, enterprise), and how the federation layer MoE Libris acts as a force multiplier when multiple nodes collaborate.
Core finding: Observed knowledge graph growth of ×46 entities and ×55 relations in 16 days of active use translates to a measurable quality uplift on recall-heavy tasks. On deterministic tasks, quality is hardware-independent and near-perfect from day one (MoE-Eval: 8.8). On synthesis and memory tasks, quality correlates directly with graph density — making the growth curve the primary driver of long-term value.
The key asymmetry vs. frontier models: Frontier models (GPT-4o, Claude 3.7, Gemini 2.0 Flash) have fixed knowledge baked in at training time and improve only when the vendor releases a new version. MoE Sovereign improves with every interaction through its causal learning loop, GraphRAG synthesis pipeline, and autonomous healing.
2. Empirical Baseline¶
2.1 Knowledge Graph Growth¶
The knowledge graph was initialised on 2026-03-29 with the v2.0.0 release, seeded with a base ontology:
| Date | Entities | Relations | ChromaDB Docs | Days active |
|---|---|---|---|---|
| 2026-03-29 (v2.0.0 launch) | 104 | 100 | — | 0 |
| 2026-04-14 (current) | 4,798 | 5,577 | 1,119 | 16 |
Growth rates (16-day period):
- Entities: ×46.1 (avg +3.0 entities/hour)
- Relations: ×55.8 (avg +3.5 relations/hour)
- Relation-to-entity ratio: 1.0 → 1.16 (graph density increasing)
The ratio growth confirms that the graph is not just accumulating isolated facts but increasingly connecting them — a prerequisite for multi-hop retrieval quality.
Measurement caveat
The 16-day window includes both active user sessions and autonomous background processes (graph linting, ontology gap healing, nightly research pipeline). The human-attributable fraction cannot be isolated from this data alone.
2.2 MoE-Eval Benchmark Scores¶
The MoE-Eval v1 suite (9 tests, 4 categories) was run on the production cluster with
the moe-reference template. Scoring: 40% deterministic + 60% LLM-judge.
| Category | Score | Interpretation |
|---|---|---|
| Precision / MCP (subnet calc) | 8.8 / 10 | Near-perfect; MCP tools fully substitute hallucination-prone LLM arithmetic |
| Precision / MCP (complex arithmetic) | 6.8 / 10 | Some unit-handling edge cases in MCP tool responses |
| Graph-State-Tracking Memory | ≥5.0 / 10 | Meets healthy deployment threshold; improves with graph density |
| Domain Routing | Target ≥7.0 | Expert classification working; measured at deployment baseline |
Interpreting the scores: The 8.8 on subnet calculation and 6.8 on complex arithmetic represent a current-state baseline. These scores are bounded primarily by MCP tool robustness (deterministic component) and LLM judge quality (semantic component), not by the knowledge graph. The compounding memory score, by contrast, will grow as the graph fills — this is the primary quality lever.
2.3 LLM Suitability Baseline (Hardware Latency)¶
21 local models were systematically tested for both Planner and Judge roles on the production 5-node heterogeneous GPU cluster. All models: Q4_K_M quantization where applicable. Timeout: 300s.
| Model | Params | Planner OK | Judge OK | Planner Latency | Judge Latency |
|---|---|---|---|---|---|
phi4:14b |
14B | Yes | Yes | 36.1s | 56.3s |
phi3:14b |
14B | Yes | Yes | 45.5s | 6.8s |
qwen2.5-coder:7b |
7B | Yes | Yes | 27.5s | 4.2s |
qwen2.5-coder:32b |
32B | Yes | Yes | 60.2s | 92.3s |
qwen3:32b |
32B | Yes | Yes | 83.0s | 34.1s |
qwen3-coder:30b |
30B | Yes | Yes | 128.9s | 20.0s |
solar-pro:22b |
22B | Yes | Yes | 104.0s | 2.7s |
vanta-research/atom-olmo3-7b |
7B | Yes | Yes | 33.8s | 1.0s |
sauerkrautlm-7b-hero |
7B | Yes | Yes | 169.2s | 31.6s |
translategemma:27b |
27B | Yes | Yes | 213.9s | 62.2s |
olmo2:13b |
13B | No | Yes | — | 1.7s |
samantha-mistral:7b |
7B | Yes | No | 25.7s | — |
| Unsuitable (8 models) | 7–32B | No | No | — | — |
Failure modes for unsuitable models:
qwen3.5:27b/35b,qwq:32b:<think>tags corrupt JSON output — thinking-mode models incompatible without prompt engineeringqwen2.5vl,qwen3-vl: Vision models, no structured text routingstarcoder2:15b,atom-astronomy:7b: No instruction following for orchestration tasks
Suitability summary: 50% both-role capable, 9% judge-only, 5% planner-only, 36% unsuitable.
3. MoE Libris — Federation as Force Multiplier¶
3.1 What MoE Libris Is¶
MoE Libris is a federated knowledge exchange hub designed for MoE Sovereign instances. Inspired by the Fediverse (ActivityPub, Mastodon), it enables bilateral, voluntary knowledge sharing between independent nodes — without central authority and without sacrificing data sovereignty.
The name draws from Latin liber: both "free" and "book." Knowledge flows freely between trusted nodes; each node maintains its own library.
graph TB
subgraph "Your MoE Sovereign Node"
A_KG["Neo4j Knowledge Graph"]
A_POLICY["Outbound Policy<br/>(domain filter, confidence floor)"]
A_SCRUB["Privacy Scrubber<br/>(PII, hostnames, paths)"]
A_SIGN["Bundle Signer"]
end
subgraph "MoE Libris Hub"
HUB_API["Hub API (FastAPI)"]
HUB_AUDIT["Pre-Audit Pipeline<br/>Stage 1: Syntax<br/>Stage 2: Heuristics (25+ patterns)<br/>Stage 3: LLM Triage (v1.1)"]
HUB_QUEUE["Admin Review Queue"]
HUB_GRAPH["Global Neo4j Graph"]
HUB_REG["Node Registry"]
end
subgraph "Peer Node"
B_KG["Neo4j Knowledge Graph"]
B_TRUST["Trust Floor Filter<br/>(default: confidence ≥ 0.5)"]
B_CONTRA["Contradiction Checker"]
end
A_KG --> A_POLICY --> A_SCRUB --> A_SIGN
A_SIGN -- "POST /push (JSON-LD bundle)" --> HUB_API
HUB_API --> HUB_AUDIT --> HUB_QUEUE
HUB_QUEUE -- "admin approved" --> HUB_GRAPH
HUB_GRAPH -- "GET /pull (delta)" --> B_TRUST
B_TRUST --> B_CONTRA --> B_KG
3.2 Technical Architecture¶
Stack: FastAPI + PostgreSQL + Neo4j 5 + Valkey
Knowledge bundle format (JSON-LD):
{
"@context": "https://moe-sovereign.org/knowledge/v1",
"origin_node_id": "node-a",
"pushed_at": "2026-04-15T10:00:00Z",
"entities": [
{"name": "Python", "type": "ProgrammingLanguage",
"domain": "code_reviewer", "description": "..."}
],
"relations": [
{"subject": "Python", "predicate": "IS_A",
"object": "ProgrammingLanguage", "confidence": 0.95,
"domain": "code_reviewer"}
]
}
Limits per bundle: 5,000 entities, 5,000 relations, 512-char field max.
Supported domains: general, code_reviewer, technical_support, creative_writer,
math, science, legal_advisor, medical_consult, reasoning, data_analyst, translation.
3.3 Three-Stage Pre-Audit Pipeline¶
Every inbound bundle passes three gates before entering the admin review queue:
Stage 1 — Syntax validation: - JSON-LD schema conformity - Predicate whitelist check (26 allowed predicates: IS_A, PART_OF, TREATS, CAUSES, INTERACTS_WITH, CONTRAINDICATES, DEFINES, REGULATES, USES, IMPLEMENTS, DEPENDS_ON, EXTENDS, RELATED_TO, EQUIVALENT_TO, AFFECTS, RUNS, NECESSITATES_PRESENCE, DEPENDS_ON_LOCATION, ENABLES_ACTION, HAS_PROPERTY, BELONGS_TO, CONTAINS, PRODUCES, REQUIRES, SUPPORTS, CONTRADICTS, SUPERSEDES) - Bundle size within limits
Stage 2 — Heuristic scanning (25+ patterns):
- PII detection: email addresses, phone numbers, social security patterns
- Secret detection: API keys (sk-, moe-sk-, lbk-, AWS...), private keys, JWT tokens, passwords
- Internal infrastructure: IP addresses, localhost references, internal hostnames and paths
- Result: automatic fail → reject without admin review
Stage 3 — LLM Triage (planned v1.1): - Semantic quality check and topic relevance scoring
3.4 Bilateral Trust Handshake¶
sequenceDiagram
participant N as Your Node
participant H as Libris Hub
participant A as Hub Admin
N->>H: POST /v1/federation/handshake (node_id, name, url, domains)
H->>A: New pending registration (admin dashboard)
A->>H: Accept + issue API key
H->>N: Callback with API key
N->>H: POST /v1/federation/confirm (your API key)
Note over N,H: Bilateral key exchange complete
N->>H: POST /v1/federation/push (authenticated with X-API-Key)
H->>N: GET /v1/federation/pull?since=... (delta sync)
3.5 Abuse Prevention¶
Valkey-backed strike system with a 24-hour sliding window:
| Event | Strike weight | Threshold | Effect |
|---|---|---|---|
| Syntax violation | 1× | Soft (3) | Rate-limit: 1 push/hour |
| Heuristic violation | 1× | Hard (10) | 24h block |
| Security violation | 3× | Hard (10) | 24h block (3 security events = immediate block) |
| Manual admin block | — | — | Indefinite until admin unblock |
3.6 Network Effect on the Growth Curve¶
The federation amplification effect is multiplicative, not additive. With N nodes participating:
- Knowledge accumulation rate: scales up to N× (each node's domain-specific knowledge becomes available to all peers)
- Cold-start acceleration: a newly joined node gains immediate access to the
global graph on its first
/pull, bypassing the initial slow accumulation phase - Domain specialisation: nodes with specific professional domains (legal, medical, code review) contribute concentrated expert knowledge; generalist nodes benefit disproportionately
Example: A freshly installed legal firm node (0 triples) joining a hub with 5 existing nodes that have accumulated 50,000 legal triples gains effectively months of solo learning in minutes — subject to admin approval and trust floor filters.
4. Hardware Tier Analysis¶
4.1 Tier Definitions¶
| Tier | Representative Hardware | VRAM | PCIe Gen | Max Model Size | Notes |
|---|---|---|---|---|---|
| Legacy | Tesla K80, Tesla M10, GT 1060 | 6–24 GB | Gen 2–3 | 7B Q4 | Shared/passthrough; limited bandwidth |
| Consumer | RTX 3060, RTX 4090 | 12–24 GB | Gen 4 | 7–14B Q4 | Best $/VRAM; high bandwidth |
| Semi-Pro | RTX 6000 Ada, A5000 | 24–48 GB ECC | Gen 4 | 14–32B Q4 | Workstation; ECC memory, 24/7 duty cycle |
| Enterprise | A100, H100 | 40–80 GB HBM2e/3 | NVLink | 70B FP16 | Data-center; parallel tensor ops |
The current production cluster spans Legacy and Consumer/Semi-Pro: - N07-GT: GT 1060 (6 GB) — Legacy - N06-M10 ×4, N11-M10 ×4: Tesla M10 (8 GB each) — Legacy enterprise (passthrough) - N09-M60: Tesla M60 (16 GB) — Legacy enterprise - N04-RTX: RTX series (24 GB) — Consumer/Semi-Pro
4.2 Inference Capacity per Tier¶
| Tier | 7B Q4 | 14B Q4 | 32B Q4 | Concurrent T2 experts | Notes |
|---|---|---|---|---|---|
| Legacy (6–8 GB VRAM) | Yes | No | No | 0 | T1 only; single-model throughput |
| Legacy (16–24 GB VRAM) | Yes | Yes | No | 1 | T1 primary; limited T2 |
| Consumer (12–24 GB) | Yes | Yes | Partial | 1–2 | Best cost/performance T1+T2 |
| Semi-Pro (24–48 GB) | Yes | Yes | Yes | 2–4 | T2 planner + judge simultaneous |
| Enterprise (40–80 GB) | Yes | Yes | Yes (FP16) | 4–8+ | Full stack; parallel expert fan-out |
4.3 Latency Implications by Tier¶
Empirical latency data from the production cluster (Q4_K_M, measured under real load):
| Model | Tier | Planner Latency | Judge Latency | Total round-trip (est.) |
|---|---|---|---|---|
qwen2.5-coder:7b |
Consumer (RTX) | 27.5s | 4.2s | ~45s |
phi4:14b |
Consumer (RTX) | 36.1s | 56.3s | ~120s |
atom-olmo3-7b |
Legacy (M10) | 33.8s | 1.0s | ~50s |
qwen3:32b |
Legacy (M60 pool) | 83.0s | 34.1s | ~180s |
solar-pro:22b |
Legacy (M60) | 104.0s | 2.7s | ~140s |
translategemma:27b |
Legacy (M10 pool) | 213.9s | 62.2s | ~330s |
Cache effect on latency
These latencies apply to cache misses. Semantic cache hits (ChromaDB cosine distance < 0.15) return in < 1s regardless of hardware tier. As the knowledge base grows, cache hit rates increase, and effective median response time improves significantly even without hardware upgrades.
4.4 Quality Independence from Hardware¶
Critical distinction: In MoE Sovereign, answer quality is primarily determined by:
- Model weights (hardware-independent — same Q4_K_M quantization on any GPU)
- Knowledge graph density (improves with usage, independent of hardware)
- Cache hit rate (improves with usage, speeds up retrieval)
Hardware affects latency, not quality. A Tesla M10 cluster running qwen2.5-coder:7b
produces the same answer quality as an RTX 4090 running the same model — just slower.
This is the key difference from pure inference-scaling approaches.
5. Intelligence Growth Model¶
5.1 Methodology¶
The growth model uses the empirically observed knowledge graph trajectory as its primary input. Two variables govern the prognosis:
- K(t): knowledge triple count at time t
- Q(K): answer quality (MoE-Eval compound score) as a function of K
The relationship Q(K) cannot be directly measured from a single data point — it requires benchmark runs at multiple K values. The prognosis below extrapolates from the observed growth rate and the current single-point measurement (K=5,577 relations → Q≈6.8–8.8 depending on category).
Conservative model: Linear quality improvement with K growth. Assumes diminishing returns — each additional triple adds less marginal quality as the graph fills with redundant information.
Optimistic model: Compound improvement. As graph density increases, multi-hop retrieval quality improves super-linearly because more synthesis paths become available. Federation accelerates K growth non-linearly.
5.2 Solo Instance — Single-Node Prognosis¶
Quality Score (MoE-Eval, 0-10)
10.0 ┤ ╭── optimistic
9.5 ┤ ╭─────╯
9.0 ┤ ╭─────╯
8.5 ┤ ╭──────────────╯ ╭── conservative
8.0 ┤ ╭─────╯ ╭──────────╯
7.5 ┤ ╭─────╯ ╭─────╯
7.0 ┤ ╭─────╯ ╭─────╯
6.8 ┤ ═══════════╯ ← current │
6.0 ┤ │
└─────────────────────────────────────────────────────────────
0 25k 50k 100k 200k 500k 1M
Knowledge Relations (cumulative)
| Milestone | Relations | Time estimate (solo) | Conservative Q | Optimistic Q |
|---|---|---|---|---|
| Current (2026-04-14) | 5,577 | Baseline | 6.8–8.8 | 6.8–8.8 |
| Phase 1 | 25,000 | ~3–4 months | 7.5 | 8.0 |
| Phase 2 | 100,000 | ~1–1.5 years | 8.0 | 8.8 |
| Phase 3 | 500,000 | ~4–6 years | 8.5 | 9.3 |
| Saturation | 1,000,000+ | Domain-dependent | 8.8 | 9.5 |
Extrapolation uncertainty
Time estimates assume the observed 3.5 relations/hour accumulation rate is sustained. Growth rate accelerates with more users and slows in low-activity periods. The quality curve shape is extrapolated — not measured at multiple K values.
5.3 Federation-Amplified Growth¶
With MoE Libris federation, the K growth rate accelerates proportionally to the number of active nodes sharing to the hub:
| Nodes in federation | Effective K growth rate | Phase 1 time to 25k relations |
|---|---|---|
| 1 (solo) | 3.5 rel/hr | ~3–4 months |
| 5 nodes | ~10–15 rel/hr (hub-filtered) | ~3–6 weeks |
| 20 nodes | ~30–50 rel/hr | ~1–2 weeks |
| 100 nodes | ~100–200 rel/hr | 2–4 days |
Important: Hub filtering (pre-audit + admin approval) acts as a quality gate. Raw node count does not translate directly to accumulation rate — only approved bundles contribute. The quality of approved triples may be higher than solo-generated ones because they survive cross-node and admin review.
5.4 Hardware Tier Impact on Growth Rate¶
Hardware does not affect triple quality — but it affects how many interactions (and thus how many graph-enriching LLM responses) can be processed per day.
| Hardware Tier | Avg. response time | Max requests/day | Relations/day (estimated) |
|---|---|---|---|
| Legacy (6–8 GB) | 120–330s | 260–720 | 50–150 |
| Consumer (12–24 GB) | 45–120s | 720–1,920 | 150–400 |
| Semi-Pro (24–48 GB) | 30–80s | 1,080–2,880 | 225–600 |
| Enterprise (40–80 GB) | 10–30s | 2,880–8,640 | 600–1,800 |
The accelerated growth on enterprise hardware is primarily driven by: 1. Higher request throughput → more interactions generating triples 2. Larger models (70B FP16) extract higher-quality triples from responses 3. Parallel expert fan-out processes more dimensions of each query
6. Frontier Model Comparison¶
6.1 Dimension Analysis¶
| Dimension | GPT-4o | Claude 3.7 Sonnet | Gemini 2.0 Flash | MoE Sovereign (now) | MoE Sovereign (+usage) |
|---|---|---|---|---|---|
| MMLU (general) | ~88% | ~90% | ~87% | ~72–78% (14–32B local) | Unchanged — model-weight dependent |
| Domain knowledge | Static (training cutoff) | Static | Static | Grows with usage | Compound growth |
| Task memory | None (stateless) | None | None | GraphRAG + Valkey | Improves with graph density |
| Multi-turn coherence | Token window only | Token window only | Token window only | Persistent Neo4j | Unlimited (graph-backed) |
| Privacy | Cloud-dependent | Cloud-dependent | Cloud-dependent | 100% local | Unchanged |
| Air-gap capable | No | No | No | Yes | Unchanged |
| Cost per request | $0.003–0.015 | $0.003–0.015 | $0.001–0.005 | ~$0.0001 (electricity) | Decreases (cache hits) |
| Data sovereignty | Vendor-controlled | Vendor-controlled | Vendor-controlled | Full ownership | Unchanged |
| Custom expert personas | Limited (system prompt) | Limited | Limited | 16 expert roles, configurable | Expandable |
| Hallucination on math | Moderate | Low | Moderate | Near-zero (MCP tools) | Unchanged |
| Self-improvement | None (needs new version) | None | None | Every interaction | Core feature |
6.2 The Closing Gap¶
The MMLU gap (~72% vs ~90%) reflects model weight capability — the local 7–32B models vs. undisclosed large frontier models. This gap narrows through two mechanisms:
-
Open-weight model progression (see §8 Technology Trends): Qwen3-235B, Llama 4 Scout, and comparable models demonstrate that open-weight models are converging on frontier capability at each parameter count class, roughly 12–18 months behind the closed frontier.
-
Graph-assisted context enrichment: For queries in the system's accumulated domain, retrieval-augmented generation effectively provides the model with "expert memory" unavailable to stateless frontier calls. On domain-specific tasks with well-populated graph coverage, the quality gap narrows substantially.
6.3 Where MoE Sovereign Wins Today¶
- Persistent domain expertise: After 16 days, the system holds 4,798 entities and 5,577 relations — custom organizational knowledge unavailable to any frontier model.
- Deterministic precision: Subnet calculations, unit conversions, complex arithmetic via MCP tools score 8.8 — matching or exceeding frontier models that guess.
- Zero marginal cost per request: Beyond infrastructure power draw, each cached response costs nothing. High-volume repeated queries become essentially free.
- Regulatory compliance: GDPR, HIPAA, and sector-specific data regulations are satisfied by design — no data leaves the deployment boundary.
7. Deployment Scenarios¶
7.1 Private User (1–3 GPU nodes, Legacy or Consumer tier)¶
Example configuration: - 2× RTX 3060 (12 GB each) or 1× RTX 4090 (24 GB) - 16–32 GB system RAM, 1 TB NVMe - Docker Compose single-host deployment
Day 1 (cold start):
- Benchmark: ~6.0–7.0 MoE-Eval (base ontology seeded, no accumulated knowledge)
- Response time: 45–120s per query
- Suitable models:
qwen2.5-coder:7b,phi4:14b(T1/T2 capable) - Cache hit rate: <5% (too few queries for semantic deduplication)
6 months:
- Conservative: 15,000–25,000 knowledge relations
- Optimistic: 30,000–60,000 (with overnight pipelines + autonomous healing)
- Response time: Median drops to 10–20s as cache hit rate reaches 30–50% for common query types
- Quality: Noticeably improved on domain queries; deterministic precision unchanged (already at ceiling)
2 years:
- 80,000–200,000+ relations (domain-saturated for a personal scope)
- Cache hit rate 60–80% for within-domain queries
- Effective cost vs. frontier: negative (no API fees; hardware amortized)
Best use cases: Personal productivity, private research, hobby projects, GDPR-sensitive individual professional work (legal, medical consultations).
7.2 SMB / KMU (5–15 GPU nodes, Consumer + Semi-Pro tier)¶
Example configuration:
- 6 nodes: 2× A5000 (24 GB), 4× RTX 4090 (24 GB)
- 192–256 GB cluster VRAM total
- Multi-user via moe-admin, 10–50 concurrent users
Key advantages over private tier:
- 3–5 T2 expert instances running simultaneously (parallel fan-out fully utilized)
qwen3:32borqwen3-coder:30bas T2 planner — significant quality uplift- MoE Libris federation with 3–10 industry peers (same sector): legal firms sharing case-law triples, medical practices sharing diagnostic knowledge graphs
- Dedicated GPU nodes per expert domain (code review always on RTX, translation on M10)
12 months with 5-node federation:
- Effective K ≈ 200,000–500,000 relations (shared + local)
- Quality on domain tasks: 8.5–9.0 (approaches frontier on specialized queries)
- Median response time for authenticated users: <30s with warm cache
ROI inflection point: At ~3,000 API calls/month, internal infrastructure costs less than frontier API fees at $0.005/call average.
7.3 Enterprise (20+ nodes, Semi-Pro + Enterprise tier + full federation)¶
Example configuration:
- 20+ nodes: A100 80 GB cluster (NVLink), H100 nodes for critical workloads
- 50+ federation peers across industry verticals
- Kubernetes deployment with enterprise overrides
Capabilities:
- 70B parameter models (Llama 4 Maverick, Qwen3-235B) running at FP16 quality
- 50+ federation peers → graph saturates domain knowledge within months
- Dedicated hardware per function: inference cluster, Neo4j cluster, ChromaDB cluster
- Fully auditable: every inference, every knowledge triple, every federation push is logged
12-month prognosis:
- Graph: 2,000,000–5,000,000 relations (org knowledge + 50-node federation)
- Quality on specialized enterprise domains: 9.0–9.5 — matching frontier on domain
- Response time: <10s median (enterprise hardware + cache saturation)
- Cache hit rate: 70–85% (high query repetition in corporate environments)
Frontier parity on: legal contract review, medical literature synthesis, code security audit, financial regulation compliance — areas where organizational knowledge outweighs general world knowledge.
8. Technology Trend Context¶
8.1 Open-Weight Model Trajectory¶
The capability gap between closed frontier and open-weight models is closing at a consistent rate:
| Year | Closed frontier | Open-weight best | Gap (MMLU) |
|---|---|---|---|
| 2023 | GPT-4 (~86%) | Llama 2 70B (~69%) | ~17 points |
| 2024 | GPT-4o (~88%) | Llama 3 70B (~82%) | ~6 points |
| 2025 | Claude 3.7 (~90%) | Qwen2.5-72B (~85%) | ~5 points |
| 2026 | GPT-5 / Gemini Ultra | Qwen3-235B, Llama 4 Maverick | ~3–5 points |
Implication for MoE Sovereign: As open-weight models improve, the system's local model pool upgrades automatically (Ollama pull) without infrastructure changes. The compound knowledge layer multiplies this improvement.
8.2 VRAM Democratization¶
Consumer VRAM has doubled every ~24 months over the past decade:
| Year | Consumer flagship | VRAM |
|---|---|---|
| 2020 | RTX 3090 | 24 GB |
| 2022 | RTX 4090 | 24 GB |
| 2024 | RTX 5090 | 32 GB |
| 2026 | RTX 6090 (projected) | 48 GB |
Implication: Models that require Enterprise tier today (32B FP16) will run on Consumer tier in 2–3 years. The hardware tier boundaries are shifting downward.
8.3 Quantization Improvements¶
- Q4_K_M (current default): 4-bit quantization with ~3% quality loss vs. FP16; runs 14B at ~8 GB VRAM, 32B at ~20 GB VRAM
- Q8_0 (near future): 8-bit quantization with ~1% quality loss; approaching FP16 quality at Q4 VRAM costs due to hardware-accelerated INT8
- GGUF optimizations: Ongoing community improvements reduce quantization error; the production cluster's Q4_K_M models will gain quality without re-download
8.4 MoE Architecture in Open Models¶
Mixtral 8×7B demonstrated that sparse multi-expert architecture achieves dense model quality at lower per-request cost. Qwen3-235B (MoE variant) follows this pattern. MoE Sovereign's multi-expert routing is architecturally aligned with this trend — the software architecture anticipates hardware MoE as a first-class accelerator.
9. Experience Reports¶
9.1 Development Arc (v1.0 → v2.3, January–April 2026)¶
The system evolved rapidly through three distinct phases:
Phase 1 — Single-Expert Baseline (v1.x)
Initial deployment with direct expert routing, no memory, no MCP tools. Quality was bounded by model capability alone. Feedback mechanism not yet implemented; all improvements required code changes.
Key learnings: Model-only quality ceilings are real. Without persistent memory, the same questions are answered identically on every request regardless of organizational context. Users noticed immediately.
Phase 2 — Infrastructure Expansion (v2.0, 2026-03-29)
Simultaneous introduction of: Kafka event streaming, Neo4j knowledge graph (GraphRAG), MCP precision tools server, and self-learning feedback loop. The combination was deliberately ambitious — all four components launched together.
Key learnings: Infrastructure complexity scales faster than individual components
suggest. Kafka KRaft mode initialization, Neo4j constraint creation, and LangGraph
checkpoint setup all required careful sequencing. The AsyncPostgresSaver.setup() call
must complete before Kafka consumers start — the LangGraph pipeline depends on
checkpoint availability.
Phase 3 — Quality Refinement (v2.1–v2.3)
Two-tier expert routing, critic nodes for safety-critical categories, CSRF hardening,
and internationalization. Expert model upgrades: qwq:32b replaces mathstral:7b for
the math expert; deepseek-r1:32b replaces magistral:24b for reasoning.
Key learnings: Thinking-mode models (<think>...</think> output) require prompt
engineering to strip reasoning traces from JSON-formatted outputs. This was discovered
during the LLM suitability benchmark — 36% of tested models failed specifically because
of unstripped reasoning output corrupting JSON parsing.
9.2 LXC Fresh-Install Debug Session (April 14, 2026)¶
A controlled test on a clean Debian 13 LXC container (Proxmox, privileged,
Docker CE) revealed six production-impacting bugs in install.sh. The session
demonstrates the compounding complexity of a heterogeneous multi-service stack.
Environment: Fresh Debian 13 (trixie) LXC, passwordless sudo, Docker CE installed by the one-line installer. All 18 containers started in sequence.
Bugs found and fixed:
| # | Service | Symptom | Root cause | Fix |
|---|---|---|---|---|
| 1 | moe-kafka |
Crash loop: Command [dub path /var/lib/kafka/data writable] FAILED |
kafka-data/ owned root; cp-kafka runs as uid=1000 (appuser) |
chown -R 1000:1000 kafka-data in install.sh |
| 2 | langgraph-orchestrator |
PermissionError: [Errno 13] Permission denied: '/app/.env' at startup |
chmod 600 .env; container runs as uid=1001 (moe); .env bind-mounted :ro |
chmod 644 .env — read-only mount is the security control, not file permissions |
| 3 | moe-prometheus |
Panic: error opening query log file: /prometheus/queries.active: permission denied |
prometheus-data/ owned root; Prometheus runs as uid=65534 (nobody) |
chown -R 65534:65534 prometheus-data |
| 3b | moe-grafana |
GF_PATHS_DATA='/var/lib/grafana' is not writable |
grafana/data and grafana/dashboards owned root; Grafana uid=472 |
chown -R 472:472 grafana/data grafana/dashboards |
| 4 | langgraph-orchestrator |
ValueError: invalid literal for int() with base 10: '2.0' at line 732 |
EVAL_CACHE_FLAG_THRESHOLD=2.0 in .env; main.py:732 calls int(os.getenv(...)) |
Changed to EVAL_CACHE_FLAG_THRESHOLD=2 |
| 5 | All services | Container always (unhealthy) despite working |
Dockerfile HEALTHCHECK hits /health → HTTP 404 (endpoint missing) |
Added GET /health endpoint to main.py |
Timeline: All bugs identified within 15 minutes of first container start by reading
docker logs per service. Fixes applied live, containers restarted individually.
Total time from first start to full healthy stack: ~85 minutes (including 70s Postgres
initialization and 79 MB ONNX model download for ChromaDB).
Non-fatal observations:
moe-kafkaKafka topicmoe.linting,moe.feedback,moe.ingest,moe.requestsnot found on startup — expected on fresh install;KAFKA_AUTO_CREATE_TOPICS_ENABLE=truecreates them on first message, no action required.- ONNX CPU affinity warnings in ChromaDB container:
pthread_setaffinity_np failed— LXC containers cannot pin CPU threads; ONNX model still works correctly. - PostgreSQL initialization 70s delay — normal for first
initdbrun on LXC storage.
Pattern identified: All four permission bugs share the same root cause — mkdir -p
creates directories as root, but the containers run as non-root UIDs defined in their
respective upstream images. This is a one-time initialization problem that compound
any time new data directories are added without corresponding chown calls.
Fix committed: Branch debug/lxc-install, published to GitHub, deployed on the
test LXC via remote git checkout + docker compose up -d.
9.3 Observations on Autonomous Healing¶
Graph density growth¶
The autonomous knowledge healing pipeline (gap_healer_templates.py v2)
demonstrates measurable effectiveness in graph quality metrics. The ratio of
relations per entity (graph density) has increased from 0.96 (v2.0.0 baseline)
to 1.16 (current) — indicating the healing pipeline is not just adding isolated
triples but creating connections between existing entities.
This density increase is significant: multi-hop graph traversal quality (used in GraphRAG retrieval) scales with connectivity, not just entity count. A graph with 5,000 entities at density 1.16 retrieves more relevant context for complex queries than the same graph at density 0.96.
Concurrency regression and fix (v2 rewrite)¶
The v1 implementation used a global asyncio.Semaphore(4) across all nodes,
which collapsed onto the single warmest inference node under sustained load.
Observed symptom: 300 hung tasks, zero throughput on remaining nodes, gap queue
monotonically growing from 802 to 1,057 without resolution.
The v2 rewrite introduced per-node Redis slot counters using atomic
ZPOPMAX claims and INCR/DECR operations on moe:healer:active:{node}
keys. Hardware concurrency caps per node class:
| Node class | Max slots |
|---|---|
| Tesla M60 | 1 |
| Tesla M10 | 3 |
| RTX 4090 | 4 |
| GT 1060 | 2 |
Progressive slot unlock: nodes start at 1 (cold) and gain +1 slot per 5 successful classifications up to the hardware cap. This prevents VRAM exhaustion on burst start while reaching full throughput on stable nodes.
Consequence for growth model: The v2 healer enables sustained parallel healing at 4–9× the rate of v1 under the same hardware. Gap closure velocity is now bounded by Neo4j write throughput (~120 MERGE/min), not inference concurrency. The growth curves in sections 4 and 5 remain valid; the time-to-density milestone improves proportionally with healer uptime.
10. Conclusion and Outlook¶
What is validated empirically¶
- Knowledge graph grows at ~3–3.5 entities/relations per hour under active solo use
- 16 days of use produced ×46 entities and ×55 relations — confirming compound growth
- 50% of tested open-weight models (21 tested) are capable orchestration components
- Deterministic quality via MCP tools is near-perfect and hardware-independent (8.8/10)
- The Docker Compose installer works on Debian 13 LXC with the fixes from
debug/lxc-install
What is extrapolated¶
- Quality improvement curves at higher K values — requires future benchmark runs at 25k, 100k, 500k relation milestones to validate or correct
- Federation network effects — no measured multi-node data yet; the N× estimate is theoretical and bounded by hub filtering rates
- Hardware tier quality independence — validated conceptually; systematic cross-tier benchmarking with identical models would confirm or refine this
Near-term milestones (suggested)¶
- Benchmark at 25k relations (expected: ~3–4 months at current solo rate) — rerun MoE-Eval v1 suite, compare to current 6.8–8.8 baseline to fit the quality curve
- First MoE Libris federation test — measure actual K growth acceleration with 2–3 controlled nodes to calibrate the network effect multiplier
- Cross-tier benchmark — run identical MoE-Eval suite on Legacy (M10) vs. Consumer (RTX 4090) hardware to confirm hardware-independence of quality scores
- Cache hit rate tracking — instrument ChromaDB hit/miss rate over time to validate the latency improvement projection
The compound advantage¶
Unlike frontier models which require billion-dollar training runs to improve, MoE Sovereign improves with every interaction through mechanisms that cost only compute time: causal learning, synthesis persistence, graph linting, and autonomous healing. The system's value proposition strengthens with usage — making the first year the hardest and every subsequent year cheaper and more capable.
MoE Libris extends this principle to collaborative scale: organizations that share domain knowledge through the federation hub amplify each other's learning curves while maintaining complete data sovereignty.
The efficiency graph is not a hockey stick — it is a slow ramp that steepens. The steepening is the point.