MoE-Eval Benchmark Suite¶

The MoE-Eval benchmark suite (benchmarks/) evaluates the orchestrator as a Compound AI System — not raw token throughput. It tests cognitive accuracy, expert routing, deterministic tool usage, and graph-based knowledge accumulation (GraphRAG).

Test categories¶

Category	Tests	What it measures
Precision / MCP	3	Deterministic calculations via MCP tools (subnet, math, dates) — things LLMs hallucinate
Graph-State-Tracking Memory	2	Multi-turn knowledge accumulation via GraphRAG SYNTHESIS_INSIGHT loop
Domain Routing	3	Planner correctly routes to legal/medical/code expert domains
Multi-Expert Synthesis	1	Parallel expert fan-out + merger quality for cross-domain questions

Quick start¶

# Set your API key
export MOE_API_KEY="moe-sk-..."

# Run all 9 tests with the balanced template
python benchmarks/runner.py

# Run with a specific template
MOE_TEMPLATE=moe-reference-8b-fast python benchmarks/runner.py

# Evaluate results (deterministic checks + LLM-as-a-Judge)
python benchmarks/evaluator.py

Scoring methodology¶

Each test case receives:

Deterministic score (0-10): keyword matching, numeric tolerance, or exact match
LLM judge score (0-10): the orchestrator itself rates the answer quality
Combined score: 0.4 × deterministic + 0.6 × LLM judge
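The weighting above can be written as a small helper. This is a minimal sketch: the function name `combined_score`, the range check, and the rounding are illustrative assumptions and are not taken from benchmarks/evaluator.py.

```python
def combined_score(deterministic: float, llm_judge: float) -> float:
    """Blend two 0-10 scores using the documented 0.4 / 0.6 weights."""
    for name, value in (("deterministic", deterministic), ("llm_judge", llm_judge)):
        if not 0.0 <= value <= 10.0:
            raise ValueError(f"{name} score must be in [0, 10], got {value}")
    # Rounding to two decimals is an assumption made for readable output.
    return round(0.4 * deterministic + 0.6 * llm_judge, 2)


if __name__ == "__main__":
    # A perfect deterministic match combined with a mediocre judge rating.
    print(combined_score(10.0, 5.0))
```

Because the judge carries the larger weight, a perfect keyword match alone cannot lift a test above 4.0.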

Example: MCP precision test¶

The subnet calculation test sends 172.20.128.0/19 and expects:

Subnet mask: 255.255.224.0
Broadcast: 172.20.159.255
Usable hosts: 8190

The MCP subnet_calc tool solves this deterministically. A standard LLM would likely hallucinate incorrect values — the benchmark measures whether the orchestrator correctly delegates to MCP.
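The expected values can be reproduced with Python's standard `ipaddress` module. This sketch only illustrates why the task is deterministic; it is not the implementation of the MCP subnet_calc tool, and the helper name `subnet_facts` is hypothetical. The "minus two" rule for usable hosts holds for prefixes up to /30.

```python
import ipaddress


def subnet_facts(cidr: str) -> dict:
    """Derive mask, broadcast address and usable host count for an IPv4 CIDR block."""
    net = ipaddress.ip_network(cidr, strict=True)
    return {
        "netmask": str(net.netmask),
        "broadcast": str(net.broadcast_address),
        # The network and broadcast addresses are not assignable to hosts.
        "usable_hosts": net.num_addresses - 2,
    }


if __name__ == "__main__":
    print(subnet_facts("172.20.128.0/19"))
```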

Example: Compounding memory test¶

A 3-turn session:

1. Inject: "Project Sovereign Shield uses the X7 protocol"
2. Inject: "X7 protocol uses TCP port 9977 with TLS 1.3"
3. Query: "What port do I need for Project Sovereign Shield?"

The system must synthesise both facts (which are novel and fictional — they cannot come from pretraining) and answer: "Port 9977 with TLS 1.3".
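The synthesis step amounts to a two-hop lookup across the injected facts. The sketch below models this with an in-memory list of triples; the relation names and helper functions are hypothetical, and the real system stores such facts in its knowledge graph rather than in a Python list.

```python
from typing import Optional

# Facts injected in turns 1 and 2, stored as (subject, relation, object) triples.
# The relation names are illustrative assumptions.
TRIPLES = [
    ("Project Sovereign Shield", "uses_protocol", "X7"),
    ("X7", "uses_port", "TCP 9977"),
    ("X7", "uses_encryption", "TLS 1.3"),
]


def lookup(subject: str, relation: str) -> Optional[str]:
    """Return the object of the first triple matching subject and relation."""
    for s, r, o in TRIPLES:
        if s == subject and r == relation:
            return o
    return None


def port_for_project(project: str) -> Optional[str]:
    """Answer turn 3 by chaining project -> protocol -> port and encryption."""
    protocol = lookup(project, "uses_protocol")
    if protocol is None:
        return None
    port = lookup(protocol, "uses_port")
    encryption = lookup(protocol, "uses_encryption")
    return f"{port} with {encryption}"


if __name__ == "__main__":
    print(port_for_project("Project Sovereign Shield"))
```

Neither fact alone answers the query, which is what makes the test a check of accumulation rather than recall.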

For details, see benchmarks/README.md in the repository.

LLM Role Suitability Study¶

Systematic evaluation of local LLMs for MoE orchestration roles. Each model was tested in two roles:

Planner: Can the model decompose a user query into structured subtasks with valid JSON output?
Judge: Can the model evaluate and merge expert outputs, assign a quality score, and produce a final synthesis?

Tests run on a 5-node heterogeneous GPU cluster (RTX 3060, GT 1060, Tesla M60, Tesla M10). Timeout: 300s. Quantization: Q4_K_M where applicable.

PoC hardware

The Tesla M10 and M60 nodes are proof-of-concept hardware. The latency data shows that these GPUs deliver functional answers. A direct latency comparison with consumer GPUs (RTX) and enterprise GPUs (H100) is still pending and is planned. Statements about production suitability can only be made after that comparison.

Results¶

Model	Params	Planner	Judge	Both	Planner Latency	Judge Latency	Notes
`olmo2:13b`	13B	Fail	Pass	Fail	41.6s	1.7s	Judge-only viable
`phi3:14b`	14B	Pass	Pass	Pass	45.5s	6.8s	Solid all-rounder
`phi3:medium`	14B	Pass	Pass	Pass	51.2s	6.9s
`phi4:14b`	14B	Pass	Pass	Pass	36.1s	56.3s	Best all-rounder
`qwen2.5-coder:7b`	7B	Pass	Pass	Pass	27.5s	4.2s	Fast, T1-capable
`qwen2.5-coder:32b`	32B	Pass	Pass	Pass	60.2s	92.3s
`qwen2.5vl:7b`	7B	Fail	Fail	Fail	300.1s	300.0s	Timeout
`qwen2.5vl:32b`	32B	Fail	Fail	Fail	81.0s	72.3s	Vision model, no text routing
`qwen3:32b`	32B	Pass	Pass	Pass	83.0s	34.1s
`qwen3-coder:30b`	30B	Pass	Pass	Pass	128.9s	20.0s
`qwen3-vl:8b`	8B	Fail	Pass	Fail	300.1s	229.4s	Timeout on planner
`qwen3.5:27b`	27B	Fail	Fail	Fail	300.1s	300.0s	Thinking tags break JSON
`qwen3.5:35b`	35B	Fail	Fail	Fail	300.1s	225.3s	Thinking tags break JSON
`qwq:32b`	32B	Fail	Fail	Fail	300.1s	300.1s	Timeout, excessive reasoning
`samantha-mistral:7b`	7B	Pass	Fail	Fail	25.7s	6.8s	Planner-only
`solar-pro:22b`	22B	Pass	Pass	Pass	104.0s	2.7s	Very fast judge
`sroecker/sauerkrautlm-7b-hero`	7B	Pass	Pass	Pass	169.2s	31.6s	German-tuned
`starcoder2:15b`	15B	Fail	Fail	Fail	92.3s	50.8s	No instruction following
`translategemma:27b`	27B	Pass	Pass	Pass	213.9s	62.2s
`vanta-research/atom-astronomy-7b`	7B	Fail	Fail	Fail	18.9s	4.3s	Domain-specific, no routing
`vanta-research/atom-olmo3-7b`	7B	Pass	Pass	Pass	33.8s	1.0s	Fast judge
`x/z-image-turbo`	—	Fail	Fail	Fail	0.1s	0.2s	Image-only model

Summary¶

Category	Count	Share
Both Planner + Judge suitable	11	50%
Planner only	1	5%
Judge only	2	9%
Not suitable	8	36%
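The summary counts follow directly from the Planner and Judge columns of the results table. The sketch below transcribes those two columns and tallies them; the category labels mirror the summary table, while the variable and function names are illustrative.

```python
from collections import Counter

# (planner_pass, judge_pass) per model, transcribed from the results table.
RESULTS = {
    "olmo2:13b": (False, True),
    "phi3:14b": (True, True),
    "phi3:medium": (True, True),
    "phi4:14b": (True, True),
    "qwen2.5-coder:7b": (True, True),
    "qwen2.5-coder:32b": (True, True),
    "qwen2.5vl:7b": (False, False),
    "qwen2.5vl:32b": (False, False),
    "qwen3:32b": (True, True),
    "qwen3-coder:30b": (True, True),
    "qwen3-vl:8b": (False, True),
    "qwen3.5:27b": (False, False),
    "qwen3.5:35b": (False, False),
    "qwq:32b": (False, False),
    "samantha-mistral:7b": (True, False),
    "solar-pro:22b": (True, True),
    "sroecker/sauerkrautlm-7b-hero": (True, True),
    "starcoder2:15b": (False, False),
    "translategemma:27b": (True, True),
    "vanta-research/atom-astronomy-7b": (False, False),
    "vanta-research/atom-olmo3-7b": (True, True),
    "x/z-image-turbo": (False, False),
}


def classify(planner_ok: bool, judge_ok: bool) -> str:
    """Map a (planner, judge) result pair to its summary category."""
    if planner_ok and judge_ok:
        return "both"
    if planner_ok:
        return "planner_only"
    if judge_ok:
        return "judge_only"
    return "not_suitable"


counts = Counter(classify(p, j) for p, j in RESULTS.values())
shares = {cat: round(100 * n / len(RESULTS)) for cat, n in counts.items()}

if __name__ == "__main__":
    print(counts)
    print(shares)
```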

Key Findings¶

phi4:14b is the best all-rounder: a fast planner (36.1s), reliable JSON output, and strong judge quality, at the cost of a comparatively slow judge pass (56.3s). It is used as the default Planner and Judge in production templates.
qwen2.5-coder:7b offers the best speed/quality ratio for T1 (fast) templates at only 27.5s planner latency.
Thinking-mode models (qwen3.5, qwq) systematically fail because their <think>...</think> tags corrupt the expected JSON output format.
Vision models (qwen2.5vl, qwen3-vl) are unsuitable for text routing but can serve as vision experts within a template.
Domain-specific models (starcoder2, atom-astronomy) lack instruction following for structured orchestration tasks.
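The thinking-tag failure mode is easy to reproduce: a strict JSON parser rejects any output that starts with a `<think>` block. The sketch below shows the failure and one possible pre-processing step. The JSON shape (`tasks`, `expert`) and the function `parse_plan` are hypothetical; the study does not state that the orchestrator applies such a filter, and stripping tags would not help with the timeouts reported for the same models.

```python
import json
import re

# Non-greedy match so several separate think blocks are each removed.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)

# Hypothetical planner output: reasoning text followed by the JSON plan.
RAW = '<think>The user asks about a contract.</think>\n{"tasks": [{"expert": "legal_advisor"}]}'


def parse_plan(raw: str) -> dict:
    """Drop <think>...</think> blocks, then parse the remainder as JSON."""
    cleaned = THINK_BLOCK.sub("", raw).strip()
    return json.loads(cleaned)


if __name__ == "__main__":
    try:
        json.loads(RAW)
    except json.JSONDecodeError as exc:
        print("raw output is not valid JSON:", exc.msg)
    print(parse_plan(RAW))
```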

Dataset¶

Full results are published on HuggingFace: h3rb3rn/moe-sovereign-benchmarks

Hardware Tier Implications¶

The LLM suitability study ran on a 5-node heterogeneous cluster spanning Legacy and Consumer GPU tiers. The latency data reflects real inference throughput on that mixed hardware — not theoretical peak performance.

Tier to Model Mapping¶

Hardware tier	VRAM	Max viable model	Roles available	Latency range
Legacy (GT 1060, Tesla M10)	6–8 GB	7B Q4	T1 experts (fast path)	20–170s
Legacy (Tesla M60)	16 GB	14B Q4	T1 + limited T2	36–104s
Consumer (RTX 3060–4090)	12–24 GB	7–14B Q4	T1 + T2 planner	27–60s
Semi-Pro (A5000, RTX 6000 Ada)	24–48 GB	32B Q4	Full T2 stack	60–130s
Enterprise (A100, H100)	40–80 GB	70B FP16	All roles, parallel	10–40s

Latency vs. Quality Trade-off¶

Observation: Hardware tier affects latency, not answer quality, for the same model. The same phi4:14b Q4_K_M weights produce answers of the same quality on a Tesla M10 and on an RTX 4090 (token-for-token identity additionally depends on sampling settings and GPU numerics). The RTX is simply faster.

Quality is determined by:

1. Model capability (weights, size, training quality): hardware-independent
2. Knowledge graph density (accumulated triples in Neo4j): improves with usage
3. Cache hit rate (semantic similarity in ChromaDB): improves with usage

Limitation: no complete latency comparison available

The observation above applies to answer quality, not to economic or practical production suitability. The decisive factor, namely how much slower Tesla M10/M60/K80 are than RTX consumer GPUs and H100/H200 enterprise hardware, has not yet been measured systematically. A planned comparison (K80 / RTX 3060–4090 / H100 via Google Colab with a 120B model) will close this gap. Until then, legacy-GPU results should be read as a proof of feasibility, not as a production recommendation.

The PoC measurements show that legacy clusters deliver correct answers at significantly higher latency. Whether this trade-off is acceptable for a given workload depends on its requirements (TTFT, throughput, operating costs); the pending comparison will quantify this.

Concurrent Expert Capacity¶

MoE Sovereign runs multiple expert workers in parallel for each request. The number of simultaneous experts is bounded by available VRAM:

Tier	Simultaneous T1 experts	Simultaneous T2 experts	Notes
Legacy (6–8 GB/node)	1 per node	0	Single-model GPU; pool across nodes
Consumer (24 GB)	3–4	1–2	Can run judge + planner simultaneously
Semi-Pro (48 GB)	6–8	2–4	Full T2 fan-out without queuing
Enterprise (80 GB)	10+	4–8	Parallel execution of all 16 expert roles possible
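As a rough cross-check of the T1 column, the number of resident models can be estimated by dividing VRAM by model size. The sketch below assumes a 5.7 GB footprint per expert, the upper bound this document quotes for the Q4_K_M 7–9B experts, and ignores KV cache and runtime overhead unless `reserve_gb` is set. The function name and the `reserve_gb` parameter are illustrative assumptions.

```python
import math


def max_resident_models(vram_gb: float, model_gb: float, reserve_gb: float = 0.0) -> int:
    """Estimate how many copies of a model of size model_gb fit into vram_gb."""
    usable = vram_gb - reserve_gb
    if usable <= 0 or model_gb <= 0:
        return 0
    return math.floor(usable / model_gb)


if __name__ == "__main__":
    for vram in (8, 24, 48, 80):
        print(vram, "GB ->", max_resident_models(vram, 5.7), "experts")
```

The estimate gives an upper bound per tier; the lower ends of the ranges in the table are consistent with reserving part of the VRAM for context.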

Practical cluster strategy: Mix tiers. Route T1 tasks (deterministic, fast) to Legacy nodes; route T2 tasks (planner, judge, merger) to Consumer/Semi-Pro nodes. The existing 5-node benchmark cluster uses exactly this pattern.

See Intelligence Growth Prognosis for projected quality curves at each hardware tier over time.

April 2026 — Dense-Graph Benchmark Campaign¶

This benchmark campaign was conducted on 2026-04-15 after extensive system operation had grown the Neo4j knowledge graph to a substantial density. The purpose: measure whether accumulated graph knowledge meaningfully improves Graph-State-Tracking Memory test scores compared to the earlier sparse-graph run.

Knowledge Graph State at Run Time¶

Metric	Value
Entity nodes	4,962
Synthesis nodes	391
Total nodes	5,353
Edges (relationships)	5,909
Avg. edges per entity	~1.19

This represents significant domain knowledge accumulated across legal, medical, technical, and scientific domains through production use.

New Per-Node Benchmark Templates¶

Four new templates were created alongside the existing reference template to maximise cluster utilisation. Each new template pins its experts to a distinct hardware tier, so all nodes run inference simultaneously during a parallel run.

Template	Planner	Judge	Expert Assignment	Hardware
`moe-reference-30b-balanced`	phi4:14b@N04-RTX	gpt-oss:20b@N04-RTX	Mix N04-RTX	RTX cluster (60 GB)
`moe-benchmark-n04-rtx`	phi4:14b@N04-RTX	qwen3-coder:30b@N04-RTX	All on N04-RTX	RTX cluster (60 GB)
`moe-benchmark-n07-n09`	phi4:14b@N07-GT	gpt-oss:20b@N09-M60	Split N07-GT / N09-M60	GT1060 + Tesla M60
`moe-benchmark-n06-m10`	phi4:14b@N06-M10-01	phi4:14b@N06-M10-02	Spread N06-M10-01…04	Tesla M10 × 4 (32 GB)
`moe-benchmark-n11-m10`	phi4:14b@N11-M10-01	phi4:14b@N11-M10-02	Spread N11-M10-01…04	Tesla M10 × 4 (32 GB)

All templates have enable_graphrag: true and enable_cache: false to ensure each test receives fresh GraphRAG context rather than a cached response.

Parallel Run Architecture¶

Tests were submitted concurrently: MOE_PARALLEL_TESTS=3 allows up to 3 single-turn tests per runner in parallel. With 5 template runners launched simultaneously this generates up to 15 concurrent API requests, keeping all GPU nodes loaded throughout the run.
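Bounded concurrency of this kind can be sketched with a thread pool whose size comes from `MOE_PARALLEL_TESTS`. This is an illustration only: `run_test` is a placeholder, the default of 3 is an assumption, and the actual logic in benchmarks/runner.py and benchmarks/run_all_parallel.sh may differ.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from typing import List


def run_test(test_id: str) -> str:
    """Placeholder for one API request; a real runner would call the orchestrator here."""
    return f"{test_id}: done"


def run_single_turn_tests(test_ids: List[str]) -> List[str]:
    """Run tests with at most MOE_PARALLEL_TESTS requests in flight."""
    max_parallel = int(os.environ.get("MOE_PARALLEL_TESTS", "3"))
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # map() returns results in input order even though execution overlaps.
        return list(pool.map(run_test, test_ids))


if __name__ == "__main__":
    print(run_single_turn_tests(["precision-mcp-subnet", "precision-mcp-math", "routing-legal"]))
```

With 5 such runners started at once, the upper bound on in-flight requests is 5 × 3 = 15.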

The runner script: benchmarks/run_all_parallel.sh

Results¶

Score Summary¶

Template	Precision	Compounding	Routing	Multi-Expert	Average
`ref-30b`	9.6	4.5	8.4	5.7	7.6
`n04-rtx`	7.0	0.0	4.6	6.1	4.5
`n07-n09`	6.0	0.0	7.8	0.0	4.6
`n06-m10`	1.9	4.2	5.3	0.0	3.3
`n11-m10`	3.5	1.8	5.3	1.9	3.6

Per-Test Detail¶

Test ID	Category	ref-30b	n04-rtx	n07-n09	n06-m10	n11-m10
precision-mcp-subnet	precision	8.8	8.8	8.8	0.0	1.2
precision-mcp-math	precision	10.0	4.0	7.4	5.8	0.0
precision-mcp-date	precision	10.0	8.2	1.8	0.0	9.4
compounding-memory-3turn	compounding	9.0	0.0	0.0	7.4	3.6
compounding-memory-5turn	compounding	0.0	0.0	0.0	0.9	0.0
routing-legal	routing	8.2	3.2	7.6	4.8	7.0
routing-medical	routing	8.6	7.2	7.2	2.7	1.1
routing-code-review	routing	8.4	3.3	8.7	8.4	7.8
multi-expert-synthesis	multi_expert	5.7	6.1	0.0	0.0	1.9
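The score summary can be reproduced from the per-test values. The sketch below does this for the ref-30b column; the variable names are illustrative and the rounding to one decimal is an assumption that matches the published figures.

```python
from statistics import mean

# ref-30b per-test scores from the per-test detail, grouped by category.
REF_30B = {
    "precision": [8.8, 10.0, 10.0],
    "compounding": [9.0, 0.0],
    "routing": [8.2, 8.6, 8.4],
    "multi_expert": [5.7],
}

category_avg = {cat: round(mean(scores), 1) for cat, scores in REF_30B.items()}

# The overall average is taken over all nine tests, not over the four categories.
all_scores = [score for scores in REF_30B.values() for score in scores]
overall = round(mean(all_scores), 1)
mean_of_category_means = mean(mean(scores) for scores in REF_30B.values())

if __name__ == "__main__":
    print(category_avg)
    print(overall, round(mean_of_category_means, 2))
```

The reported 7.6 matches the mean over all nine tests; averaging the four category means instead would give roughly 7.05, because the single multi-expert test would then weigh as much as the three precision tests together.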

Full Measurement Series (ref-30b template)¶

Date	Graph nodes	Precision	Compounding	Routing	Multi-Expert	Avg
Apr 10 run 1	~500	7.6	4.1	5.0	0.9	5.2
Apr 10 runs 2–4	~800	9.3	3.9	5.8	0.9	6.0
Apr 12	~2,000	8.3	4.4	7.6	5.1	6.8
Apr 15	5,353	9.6	4.5	8.4	5.7	7.6

Why Did the Score Change? Four Factors¶

Graph density (+2.4 pts, primary driver): Routing improved by 3.4 pts and multi-expert synthesis by 4.8 pts as the GraphRAG context grew richer with more domain triples.
M10 hardware split (structural break): M10 nodes were split from 4×8 GB combined blocks into separate 8 GB Ollama instances. The old 30b/70b M10 templates no longer function; the new per-node M10 templates use hermes3:8b and completed all 9/9 tests (avg 3.3–3.6), demonstrating that legacy M10 hardware can achieve full functional coverage (PoC). Latency and throughput relative to consumer/enterprise GPUs remain to be quantified.
Evaluation methodology correction: Earlier runs lacked deterministic scoring (det=0); from Apr 15 onward, keyword-match and numeric-tolerance scores are computed. This explains the routing-legal jump from 4.8 to 8.2.
Concurrency effect: n04-rtx averaged 4.5 (vs. 7.6 for ref-30b) while running simultaneously with 4 other templates (up to 15 concurrent requests); an isolated run would likely score higher.

Comparison: Before and After Graph Growth¶

Metric	April 12 run	April 15 run	Delta
Graph nodes at run time	~2,000 (est.)	5,353	+3,353
Graph edges at run time	~2,200 (est.)	5,909	+3,709
compounding-memory-3turn	8.2	9.0	+0.8
compounding-memory-5turn	0.6	0.0 (timeout)	-0.6
Average score (ref-30b)	6.8	7.6	+0.8

April 2026 — AIHUB Sovereign: Enterprise H200 Benchmark (9/9 Pass)¶

Run date: 2026-04-16. Template: moe-aihub-sovereign. Hardware: adesso AI Hub, NVIDIA H200 GPUs.

Template: `moe-aihub-sovereign`¶

Component	Model	Endpoint	Notes
Planner	gpt-oss-120b-sovereign	AIHUB	120B parameter reasoning model
Judge	gpt-oss-120b-sovereign	AIHUB	Same model, strong synthesis quality
code_reviewer	qwen-3.5-122b-sovereign	AIHUB	122B coding specialist
math	qwen-3.5-122b-sovereign	AIHUB	H200 VRAM allows full-precision
medical_consult	qwen-3.5-122b-sovereign	AIHUB	Domain coverage via scale
legal_advisor	qwen-3.5-122b-sovereign	AIHUB	German law via 122B capacity
reasoning	gpt-oss-120b-sovereign	AIHUB	Dedicated reasoning model
science	qwen-3.5-122b-sovereign	AIHUB	STEM via 122B
translation	qwen-3.5-122b-sovereign	AIHUB	Multilingual at scale
technical_support	qwen-3.5-122b-sovereign	AIHUB	Structured output

Results — MoE-Eval v1 (9 tests)¶

Test ID	Category	Duration	Tokens	Status
precision-mcp-subnet	precision	0.1s	0	PASS
precision-mcp-math	precision	0.1s	0	PASS
precision-mcp-date	precision	0.1s	0	PASS
compounding-memory-3turn	compounding	1,025s	7,797	PASS
compounding-memory-5turn	compounding	2,562s	19,561	PASS
routing-legal	routing	627s	3,005	PASS
routing-medical	routing	631s	3,236	PASS
routing-code-review	routing	0.1s	0	PASS
multi-expert-synthesis	multi_expert	0.0s	0	PASS

Score: 9/9 (100%) — Total duration: 4,219s (70 min). Total tokens: 33,599.

Key Findings (AIHUB vs. Local Cluster)¶

Perfect pass rate: First template to achieve 9/9 on MoE-Eval v1. The 120B+122B model pair resolves all routing, precision, and memory tasks without fallbacks.
MCP precision tests complete in <1s: The orchestrator delegates to deterministic MCP tools without spending LLM tokens (0 tokens per test), which is consistent with MCP routing being independent of LLM size.
Compounding memory scales with model capacity: 5-turn cross-domain synthesis (19,561 tokens) completed successfully. On local 7–14B models this test has a high failure rate due to context window limitations.
Latency trade-off: Remote AIHUB adds network overhead (~600s per complex routing test vs. ~80s on local N04-RTX). Throughput is lower, but quality is higher.

Enterprise Hardware Comparison¶

Metric	AIHUB H200 (120B+122B)	Local RTX cluster (phi4:14b)	Local M10 cluster (7–9B)
Pass rate / avg score	9/9 (100%)	7.6 / 10 avg	3.3–3.6 / 10 avg
Compounding 5-turn	PASS (19.5k tok)	0.0 (timeout)	0.9 / 10
Routing quality	3/3	2.7 / 3 avg	1.8 / 3 avg
Total duration	4,219s	~3,700s	~5,000s
Infrastructure	Cloud (H200 GPU)	5× RTX (80 GB total)	8× Tesla M10 (64 GB total)

April 2026 — moe-m10-8b-gremium: Full M10 Cluster Pass (9/9) — PoC¶

Run date: 2026-04-16. Proof-of-concept: first full functional pass on Tesla M10 hardware.

The moe-m10-8b-gremium template distributes 8 domain-specialist 7–9B models across Tesla M10 GPUs (8 GB VRAM each) with phi4:14b on N04-RTX as Planner/Judge.

Proof of feasibility

This run shows that 8× Tesla M10 (8 GB VRAM each) functionally pass all 9 benchmark test cases; it is not evidence of production suitability. The total runtime of 83 minutes (vs. ~70 min on H200) is not yet a fair comparison, because the pending latency comparison (K80 / RTX / H100) will determine the actual tokens/s and TTFT values for all tiers.

Results — MoE-Eval v1¶

Test ID	Category	Duration	Tokens	Status
precision-mcp-subnet	precision	201s	1,534	PASS
precision-mcp-math	precision	261s	1,966	PASS
precision-mcp-date	precision	125s	724	PASS
compounding-memory-3turn	compounding	894s	3,988	PASS
compounding-memory-5turn	compounding	2,242s	19,865	PASS
routing-legal	routing	890s	3,762	PASS
routing-medical	routing	948s	2,620	PASS
routing-code-review	routing	569s	4,629	PASS
multi-expert-synthesis	multi_expert	545s	5,840	PASS

Score: 9/9 (100%) — Total duration: 4,955s (83 min). Total tokens: 44,928.

This shows that Tesla M10 hardware, given a sufficiently large context window for the Planner/Judge (N04-RTX, 16K tokens), functionally handles all benchmark test cases. This is a proof of feasibility, not a production claim. A quantitative latency comparison with RTX and H100 hardware is pending.

April 2026 — moe-benchmark-n06-m10: Per-Node M10 Pass (9/9) — PoC¶

Run date: 2026-04-16. N06-M10 cluster with phi4:14b Planner/Judge. Proof of feasibility.

Test ID	Category	Duration	Tokens	Status
precision-mcp-subnet	precision	444s	727	PASS
precision-mcp-math	precision	589s	1,236	PASS
precision-mcp-date	precision	243s	427	PASS
compounding-memory-3turn	compounding	913s	2,833	PASS
compounding-memory-5turn	compounding	3,194s	12,350	PASS
routing-legal	routing	898s	2,810	PASS
routing-medical	routing	764s	1,667	PASS
routing-code-review	routing	653s	1,686	PASS
multi-expert-synthesis	multi_expert	452s	1,260	PASS

Score: 9/9 (100%) — Total duration: 6,210s (104 min). Total tokens: 24,996.

The total runtime of 104 minutes (vs. 70 min on H200 and ~83 min on the M10 gremium with an RTX planner) makes the latency differences clear. A systematic tokens/s comparison of all hardware tiers will follow in the planned latency comparison.

April 2026 — moe-m10-gremium-deep: Orchestrated 8-Expert Template¶

Status: Completed — 3 full epochs (April 19–20, 2026). Run ID: overnight_20260419-225041.

Motivation¶

The previous moe-m10-8b-gremium template failed due to GraphRAG context overflow on N07-GT (phi4:14b, 8,192-token window). Root cause: 5,353 graph nodes injected ~5,000 tokens into the planner prompt. Fix: move Planner + Judge to phi4:14b@N04-RTX (16,384-token window, Flash Attention enabled), and enforce that GraphRAG goes only to the Judge, never the Planner.

Template: `moe-m10-gremium-deep`¶

Component	Model	Node	Notes
Planner	phi4:14b	N04-RTX	16K context, Flash Attention, routing only — no GraphRAG
Judge	phi4:14b	N04-RTX	16K context, receives ≤12,000 chars GraphRAG
code_reviewer	qwen2.5-coder:7b	N06-M10-01	SOTA 7B coding (SWE-bench)
math	mathstral:7b	N06-M10-02	Purpose-built STEM/Math
medical_consult	meditron:7b	N06-M10-03	Fine-tuned PubMed + medical guidelines
legal_advisor	sroecker/sauerkrautlm-7b-hero	N06-M10-04	Best German-law 7B, 32K context
reasoning	qwen3:8b	N11-M10-01	SOTA reasoning <8B (2025-2026)
science	gemma2:9b	N11-M10-02	Strong STEM, 71.3 % MMLU
translation	qwen2.5:7b	N11-M10-03	Strong multilingual DE/EN/FR
technical_support	qwen2.5-coder:7b	N11-M10-04	Structured output, MCP tool-calling

Deep mode: GraphRAG enabled, web search enabled, MCP tools enabled, chain-of-thought thinking (force_think: true → agent_orchestrated pipeline), cache disabled for clean benchmark measurements.

Model Selection Rationale¶

All 8 expert models fit within 8 GB VRAM (Q4_K_M quantization, ≤ 5.7 GB). No CPU offloading. Models selected via benchmark research (April 2026):

Expert	Model	Key metric	Source
code_reviewer	qwen2.5-coder:7b	SWE-bench SOTA 7B	Alibaba / Qwen team
math	mathstral:7b	MATH benchmark SOTA 7B	Mistral AI
medical_consult	meditron:7b	MedQA > GPT-3.5	EPFL
legal_advisor	sauerkrautlm-7b-hero	Best German 7B, 32K	sroecker
reasoning	qwen3:8b	GPQA leader <8B	Alibaba
science	gemma2:9b	71.3 % MMLU	Google
translation	qwen2.5:7b	Best western-EU multilingual 7B	Alibaba
technical_support	qwen2.5-coder:7b	Structured output + tool-calling	Alibaba

Results — Overnight Stability Benchmark (3 Epochs)¶

Run: overnight_20260419-225041
Date: 2026-04-19 22:51 – 2026-04-20 09:49
Hardware: 8× Tesla M10 (N06/N11, 8 GB VRAM each) + N04-RTX (Planner/Judge)
Graph state: ~5,400+ ontology nodes (actively growing via Gap Healer during run)

Epoch Summary¶

Epoch	Duration	Status	RC	Avg Score	Total Tokens
E1	4h 11min (15,088s)	✅ Complete	0	6.53 / 10	43,410
E2	3h 5min (11,108s)	✅ Complete	0	5.78 / 10	43,509
E3	3h 36min (12,986s)	✅ Complete	0	6.03 / 10	50,255
3-Epoch Avg	3h 37min	—	—	6.11 / 10	45,725
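The duration figures in the epoch summary can be checked from the raw second counts. The helper `fmt` and the variable names below are illustrative; the second counts are those in the table.

```python
EPOCH_SECONDS = {"E1": 15_088, "E2": 11_108, "E3": 12_986}


def fmt(seconds: float) -> str:
    """Render seconds as 'Xh Ymin', truncating leftover seconds."""
    hours, rest = divmod(int(seconds), 3600)
    return f"{hours}h {rest // 60}min"


average = sum(EPOCH_SECONDS.values()) / len(EPOCH_SECONDS)

# Relative reduction in wall-clock time from E1 to E2.
speedup_e2 = 1 - EPOCH_SECONDS["E2"] / EPOCH_SECONDS["E1"]

if __name__ == "__main__":
    for name, secs in EPOCH_SECONDS.items():
        print(name, fmt(secs))
    print("avg", fmt(average))
    print(f"E2 vs E1: {speedup_e2:.1%} faster")
```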

Per-Test Results (All 3 Epochs)¶

Test	Category	E1	E2	E3	E1→E3
overnight-routing-code	Domain Routing	9.4	8.6	9.2	→
overnight-precision-math	Precision	10.0	7.4	8.0	↓
overnight-precision-subnet	Precision	7.9	7.3	7.9	→
overnight-routing-medical	Domain Routing	7.6	7.3	7.5	→
overnight-routing-legal	Domain Routing	7.9	6.7	6.7	↓
overnight-contradiction	Context/Memory	6.8	6.0	6.0	↓
overnight-healing-novel	Knowledge Healing	4.5	6.3	6.0	↑
overnight-synthesis-cross	Multi-Expert	4.8	4.8	5.4	↑
overnight-causal-carwash	Causal	5.4	6.2	4.8	↓
overnight-memory-10turn	Context/Memory	4.2	3.6	4.8	↑
overnight-causal-surgery	Causal	3.6	3.0	4.2	↑
overnight-memory-8turn	Context/Memory	6.3	2.2	1.8	↓↓

Category Performance (E1 → E3)¶

Category	E1 Avg	E3 Avg	Δ	Assessment
Domain Routing	8.30	7.80	−0.50	Stable high performance
Precision	8.95	7.95	−1.00	Minor regression, LLM judge calibration
Knowledge Healing	4.50	6.00	+1.50	Strongest improvement — graph density benefit
Multi-Expert	4.80	5.40	+0.60	Improving with context accumulation
Causal	4.50	4.50	±0.00	Stable
Context/Memory	5.77	4.20	−1.57	Critical — KV-cache overflow on 8-turn tests
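The category averages are plain means over the per-test scores of each category. The sketch below recomputes the E1 and E3 columns and the delta; the data is transcribed from the per-test table, while the function name is illustrative.

```python
from statistics import mean

# (category, E1 score, E3 score) per test, from the per-test results.
TESTS = [
    ("Domain Routing", 9.4, 9.2),
    ("Precision", 10.0, 8.0),
    ("Precision", 7.9, 7.9),
    ("Domain Routing", 7.6, 7.5),
    ("Domain Routing", 7.9, 6.7),
    ("Context/Memory", 6.8, 6.0),
    ("Knowledge Healing", 4.5, 6.0),
    ("Multi-Expert", 4.8, 5.4),
    ("Causal", 5.4, 4.8),
    ("Context/Memory", 4.2, 4.8),
    ("Causal", 3.6, 4.2),
    ("Context/Memory", 6.3, 1.8),
]


def category_deltas(tests):
    """Return {category: (E1 average, E3 average, delta)} rounded to 2 decimals."""
    grouped = {}
    for category, e1, e3 in tests:
        grouped.setdefault(category, []).append((e1, e3))
    summary = {}
    for category, pairs in grouped.items():
        e1_avg = mean(pair[0] for pair in pairs)
        e3_avg = mean(pair[1] for pair in pairs)
        summary[category] = (round(e1_avg, 2), round(e3_avg, 2), round(e3_avg - e1_avg, 2))
    return summary


if __name__ == "__main__":
    for category, row in category_deltas(TESTS).items():
        print(category, row)
```

The Context/Memory drop is driven almost entirely by the 8-turn test; the other two tests in that category change by less than one point.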

Key Findings¶

Epoch stability confirmed. Three consecutive runs with 0 failures (rc=0) on a heterogeneous 8-GPU M10 cluster. E2 was 25% faster than E1 (model warm-up), E3 slightly slower (graph growth).
memory-8turn structural failure (6.3 → 2.2 → 1.8). The 8-turn memory test with dense expert responses fills the phi4:14b Judge's 16,384-token context window. At turn 8, early conversation context is truncated. This is a configurable limit — increasing OLLAMA_CONTEXT_LENGTH to 32K on N04-RTX would resolve this. The 10-turn test actually recovered in E3 (4.8) because its per-turn responses are shorter in absolute token count.
Knowledge Healing improvement (+1.5 pts) confirms graph density benefit. The healing-novel test injects fictional ontology terms; the system's ability to recognise and integrate novel concepts improved as the Gap Healer processed 85+ ontology entries during the benchmark run.
Domain Routing is the most consistent high-scoring capability (7.5–8.3/10 across all 3 epochs). Code review, medical consultation, and legal routing outperform every category except Precision, which scored higher in E1 and E3 (8.95 and 7.95) but dropped to 7.35 in E2.
Epoch 4 was aborted after 7/12 scenarios (user-initiated stop). Partial results showed clear warm-up acceleration: precision-subnet took 143s (vs. ~201s in E1), precision-math 188s (vs. ~261s in E1), confirming that model caching provides 25–30% speedup from E2 onward.

Comparison: Native vs. Orchestrated M10¶

Mode	Template	Score	Notes
Native (per-GPU)	`moe-benchmark-n06-m10`	3.3 / 10	Single 7–8B model, no routing
Native (per-GPU)	`moe-benchmark-n11-m10`	3.6 / 10	Single 7–8B model, no routing
Orchestrated	`moe-m10-gremium-deep`	6.11 / 10	8 domain specialists + phi4:14b judge
Orchestrated	`moe-reference-30b-balanced`	7.6 / 10	phi4:14b + 30B judge on RTX
Orchestrated	`moe-aihub-sovereign`	9.0 / 10	120B+122B on H200 (9/9 pass)

The orchestration premium: 8× 7B specialists achieve 6.11/10 vs. 3.3–3.6/10 for a single 7B model — a +2.5 to +2.8 point gain from routing, synthesis, and domain specialisation alone. Total VRAM: 64 GB distributed across 8 nodes (8 GB each) + 24 GB RTX for Planner/Judge.

Comparison to Equivalent Public Models¶

The following comparison uses published benchmark scores for models in the 7–14B parameter class running in isolation (no orchestration, no retrieval, no tool use):

System	Architecture	Effective Size	MMLU	MT-Bench	MoE-Eval Est.	Notes
GPT-4o mini (API)	Single model	~8B (est.)	82 %	8.8	~7–8	Cloud API, no self-hosting
Llama 3.1 8B (single)	Single model	8B	73 %	8.2	~3.5–4.0	Strong general model
Qwen2.5 7B (single)	Single model	7B	74 %	8.4	~3.5–4.0	Strong multilingual
Gemma 2 9B (single)	Single model	9B	71 %	8.5	~3.5–4.0	STEM / science tasks
phi4:14b (single)	Single model	14B	84 %	9.1	~6–7	Best local 14B all-rounder
moe-m10-gremium-deep	8× specialist	8× 7–9B	—	—	6.11 (measured)	8 M10 GPUs, self-hosted
moe-reference-30b (ref)	Orchestrated	14B+30B	—	—	7.6 (measured)	RTX cluster

Benchmark methodology

MoE-Eval is an internal compound-AI benchmark — it tests orchestration quality, not raw model capability. Scores are not directly comparable to MMLU or MT-Bench. The "MoE-Eval Est." column for single models is extrapolated from the native M10 template results (3.3–3.6/10) and scaled by published MMLU relative scores. Treat as indicative, not authoritative.

Key insight: A self-hosted ensemble of 8 domain-specialist 7–9B models on legacy Tesla M10 hardware scores 6.11/10 (measured), within roughly one to two points of the ~7–8 estimated for cloud-hosted GPT-4o mini, while running fully air-gapped with zero data leaving the cluster. The cost delta: one-time hardware cost vs. per-token API fees.

April 2026 — M10-Gremium Evaluation: Can Graph Density Compensate for Small LLMs?¶

Archive — superseded: This template failed due to GraphRAG context overflow on N07-GT. Successor: moe-m10-gremium-deep with Planner/Judge on N04-RTX (see section above).

Test date: 2026-04-15. Research question: Does a dense knowledge graph (5,353 nodes) compensate for using only 7–9B models distributed across 8 Tesla M10 nodes (8 GB VRAM each)?

Template: `moe-m10-8b-gremium`¶

Component	Model	Node
Planner	phi4:14b	N07-GT (2× GT 1060, 12 GB total)
Judge	phi4:14b	N07-GT
code_reviewer	qwen2.5-coder:7b	N06-M10-01
math	mathstral:7b	N06-M10-02
medical_consult	meditron:7b	N06-M10-03
legal_advisor	sauerkrautlm-7b-hero	N06-M10-04
reasoning	qwen3:8b	N11-M10-01
science	gemma2:9b	N11-M10-02
translation	glm4:9b	N11-M10-03
data_analyst	qwen2.5:7b	N11-M10-04

Multi-Domain Challenge Prompt¶

A single-turn prompt (1,893 chars) spanning four domains requiring cross-expert synthesis: legal/compliance (DSGVO, EU AI Act), medical statistics (sensitivity/specificity, sample size), technical infrastructure (10 TB/day, 5-year archive with compression), and ML fundamentals (bias-variance, regularization, DICOM augmentation).

Deterministic scoring checks (7 items, total weight 10.5): 10 TB/day (2.0), 2.74 PB archive (2.0), Art. 9 DSGVO (1.5), EU AI Act high risk (1.5), AUROC/MCC metric (1.5), bias-variance (1.0), regularization (1.0).

Results¶

Template	det_score	Elapsed	Tokens in	Tokens out	Experts invoked	Planner retries
`moe-reference-30b-balanced`	6.67 / 10	528s	15,875	14,615	Multiple (N04-RTX + N09-M60)	0
`moe-m10-8b-gremium`	4.29 / 10	2,542s	31,926	8,172	1 (legal_advisor only)	2 failures

Deterministic Hit/Miss Detail¶

Check	ref-30b	m10-gremium
daily volume = 10 TB	✓	✓
5y archive ≈ 2.74 PB	✗ (computed ~14.5 PB)	✗
Art. 9 DSGVO	✗ (regex miss — cited as "Art. 9 § 2")	✗ (cited as "GDPR Article 9")
EU AI Act high risk	✓	✓
AUROC / MCC	✓	✗
bias-variance tradeoff	✓	✓
regularization technique	✓	✗
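The reported det_score values are consistent with a weighted hit fraction scaled to 10. The sketch below assumes exactly that formula, which reproduces both published numbers; the function name and data layout are illustrative and not taken from the evaluator.

```python
# (check, weight, hit for ref-30b, hit for m10-gremium), from the hit/miss detail.
CHECKS = [
    ("10 TB/day", 2.0, True, True),
    ("2.74 PB archive", 2.0, False, False),
    ("Art. 9 DSGVO", 1.5, False, False),
    ("EU AI Act high risk", 1.5, True, True),
    ("AUROC/MCC metric", 1.5, True, False),
    ("bias-variance", 1.0, True, True),
    ("regularization", 1.0, True, False),
]


def det_score(column: int) -> float:
    """Weighted share of passed checks, scaled to 0-10 (column 2 or 3 of CHECKS)."""
    total = sum(row[1] for row in CHECKS)
    earned = sum(row[1] for row in CHECKS if row[column])
    return round(10 * earned / total, 2)


if __name__ == "__main__":
    print("ref-30b:", det_score(2))
    print("m10-gremium:", det_score(3))
```

Under this formula the two regex misses on Art. 9 DSGVO cost each template 1.5 of 10.5 weight points, about 1.4 score points, even though both answers cited the article.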

Root-Cause Analysis¶

Critical failure: GraphRAG context overflow on N07-GT

With 5,353 graph nodes the GraphRAG retrieval injects ~5,000 tokens of triples into the planner prompt. phi4:14b on N07-GT has a context window of 8,192 tokens. The resulting prompt (system instruction + graph context + user query) saturates the window, causing phi4:14b to answer the question in prose rather than return the required JSON routing plan.

Planner attempt	Duration	Outcome
1	~11 min	Prose answer — "Planner parse error (attempt 1)"
2	~8 min	Prose answer — "Planner could not parse JSON — fallback"
3	~9 min	Valid JSON (partial — only `legal_advisor` routed)

After 3 attempts and 28 minutes, only the legal_advisor expert was dispatched. The sauerkrautlm-7b-hero model responded in critique/evaluation mode rather than providing direct answers, further degrading coverage.

Total overhead: 2,542s vs 528s for ref-30b — a 4.8× penalty from context overflow alone.

Key Findings¶

Graph density hurts small-context planners. At 5,353 nodes the GraphRAG injection volume exceeds phi4:14b's effective instruction-following capacity on an 8,192-token window. The planner model needs a context window of ≥ 16,384 tokens, or GraphRAG retrieval must be capped (e.g. top-k = 10 triples instead of exhaustive retrieval) when the planner is on legacy hardware.
M10 experts are viable in isolation — sauerkrautlm-7b-hero returned a coherent legal analysis within its domain. The weakness was routing (only 1 of 8 experts invoked) and response style (critique mode).
The knowledge graph does NOT compensate for context overflow. Graph density improves answer quality only when the planner can parse and route correctly. A failed planner negates all expert and graph benefits.
Mitigation: Either (a) pin the planner to a node with a larger context window (≥ 16 k tokens, e.g. N04-RTX with qwen2.5-coder:7b or phi4:14b at extended context), or (b) hard-cap GraphRAG retrieval depth for templates with legacy-hardware planners.
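Option (b) can be sketched as a cap on the retrieved triples by rank and, optionally, by an approximate token budget. Everything here is an assumption made for illustration: the `(score, text)` input shape, the `cap_triples` name, and the 4-characters-per-token estimate. It is not the orchestrator's retrieval API.

```python
from typing import List, Optional, Tuple


def cap_triples(
    scored_triples: List[Tuple[float, str]],
    top_k: int = 10,
    token_budget: Optional[int] = None,
    chars_per_token: int = 4,
) -> List[str]:
    """Keep the top_k highest-scoring triples, optionally within a token budget."""
    ranked = sorted(scored_triples, key=lambda item: item[0], reverse=True)[:top_k]
    if token_budget is None:
        return [text for _, text in ranked]
    kept, used = [], 0
    for _, text in ranked:
        # Rough size estimate; a real implementation would use the model's tokenizer.
        cost = max(1, len(text) // chars_per_token)
        if used + cost > token_budget:
            break
        kept.append(text)
        used += cost
    return kept


if __name__ == "__main__":
    triples = [
        (0.91, "Art. 9 DSGVO -> regulates -> health data"),
        (0.12, "DICOM -> is_a -> imaging standard"),
        (0.55, "EU AI Act -> classifies -> high-risk systems"),
    ]
    print(cap_triples(triples, top_k=2))
```

A budget expressed in tokens keeps the planner prompt below its window regardless of how large the graph grows, whereas a fixed top-k alone still depends on the length of each triple.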
