Reinforcement Learning Flywheel¶
Overview¶
MoE Sovereign uses a three-stage incremental reinforcement learning loop that improves routing quality and expert accuracy over time — without external training infrastructure, GPU allocation, or new services.
All three stages reuse existing components (Redis, Neo4j, PostgreSQL) and are individually toggleable via environment variables.
flowchart LR
subgraph "Stage 1: Observe"
A[API Request] --> B[Routing Decision]
B --> C[routing_telemetry\nPostgreSQL]
end
subgraph "Stage 2: Explore"
D[Expert Selection] --> E[Thompson Sampling\nBeta distribution]
E --> F[Redis moe:perf]
end
subgraph "Stage 3: Correct"
G[Judge Refinement] --> H[Correction Memory\nNeo4j :Correction]
H --> I[Expert Prompt\nInjection]
end
C -.->|offline analysis| E
F -.->|outcome signal| C
I -.->|fewer repeat mistakes| G
Stage 1: Routing Telemetry¶
Every API request produces a telemetry row in routing_telemetry (PostgreSQL)
capturing the full routing decision context and outcome.
Captured Fields¶
| Group | Fields |
|---|---|
| Request features | prompt_length, prompt_lang (de/en), complexity (trivial/moderate/complex), has_images, has_code |
| Routing decision | planner_plan (JSON), experts_used[], mcp_tools_used[], cache_hit, fast_path |
| Outcome | self_score (1-5, async), user_rating (1-5, from /v1/feedback), total_tokens, wall_clock_ms |
| Scoring snapshot | expert_scores (JSON — Thompson-sampled values at routing time) |
Integration¶
- Written async at request completion (fire-and-forget, never blocks inference).
- self_score updated after the async self-evaluation loop completes.
- user_rating updated when a user submits feedback via /v1/feedback.
Offline Analysis¶
The telemetry table enables queries like:
-- Experts with low user ratings
SELECT unnest(experts_used) AS expert, avg(user_rating) AS avg_rating
FROM routing_telemetry WHERE user_rating IS NOT NULL
GROUP BY expert ORDER BY avg_rating;
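-- Routing patterns that lead to corrections
-- (planner_plan is cast to text: PostgreSQL's json type has no equality operator for GROUP BY)
SELECT template_name, planner_plan::text AS plan_text, count(*) AS n
FROM routing_telemetry WHERE correction_applied = true
GROUP BY template_name, planner_plan::text ORDER BY n DESC;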
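For analysis outside the database, the first query can be mirrored in plain Python over exported rows. This is a sketch under the assumption that each row is a dict with `experts_used` and `user_rating` keys; the function name is hypothetical.

```python
from collections import defaultdict


def avg_rating_per_expert(rows):
    """Average user_rating per expert, lowest first (rows without a rating are skipped)."""
    totals = defaultdict(lambda: [0, 0])      # expert -> [sum of ratings, count]
    for row in rows:
        if row["user_rating"] is None:
            continue
        for expert in row["experts_used"]:    # plays the role of unnest()
            totals[expert][0] += row["user_rating"]
            totals[expert][1] += 1
    averages = {expert: s / n for expert, (s, n) in totals.items()}
    return sorted(averages.items(), key=lambda item: item[1])


rows = [
    {"experts_used": ["code", "math"], "user_rating": 2},
    {"experts_used": ["code"], "user_rating": 4},
    {"experts_used": ["math"], "user_rating": None},
]
ranking = avg_rating_per_expert(rows)
```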
Stage 2: Thompson Sampling¶
Replaces the static Laplace-smoothed expert scoring with stochastic Beta distribution sampling for natural exploration.
How It Works¶
Before (Laplace point estimate):
score = (positive + 1) / (total + 2)
Always returns the same score for the same data: pure exploitation.
After (Thompson Sampling):
α = positive + 1 (successes + prior)
β = (total - positive) + 1 (failures + prior)
score = random.betavariate(α, β)
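The two scoring methods can be compared side by side. The sketch below assumes the Laplace estimate is the usual add-one form, (positive + 1) / (total + 2), which is also the mean of the Beta distribution that Thompson Sampling draws from. The function names are hypothetical.

```python
import random


def laplace_score(positive: int, total: int) -> float:
    """Deterministic point estimate: the mean of Beta(positive + 1, failures + 1)."""
    return (positive + 1) / (total + 2)


def thompson_score(positive: int, total: int, rng: random.Random) -> float:
    """One random draw from the Beta distribution built from the same counters."""
    alpha = positive + 1              # successes + prior
    beta = (total - positive) + 1     # failures + prior
    return rng.betavariate(alpha, beta)


rng = random.Random(42)               # seeded only to make the demo repeatable
point = laplace_score(8, 10)          # always 0.75 for these counters
draws = [thompson_score(8, 10, rng) for _ in range(5)]  # differs on every call
```

Both functions read the same counters, so switching between them needs no data changes.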
Why This Is Better¶
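- Natural exploration: Under Laplace, an expert with 5/5 successes (6/7 ≈ 0.86) always ranks below one with 50/55 (51/57 ≈ 0.89). Under Thompson Sampling the wide Beta distribution of the 5/5 expert means it sometimes outscores the better-established expert and sometimes falls below it, so the less-tested expert gets a chance to prove itself on unfamiliar query types.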
- Convergence: As data accumulates, the Beta distribution narrows. After ~100 observations, Thompson Sampling and Laplace produce nearly identical rankings.
- Zero migration: Same Redis structure (moe:perf:{model}:{category} with positive/negative/total fields). No schema changes.
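The exploration and convergence claims can be checked with a short simulation. This is an illustrative sketch; the helper names and trial counts are arbitrary choices.

```python
import random
import statistics


def sample(positive: int, total: int, rng: random.Random) -> float:
    return rng.betavariate(positive + 1, (total - positive) + 1)


def win_rate(a, b, rng, trials=20_000):
    """Fraction of trials in which expert a's draw beats expert b's draw."""
    wins = sum(sample(*a, rng) > sample(*b, rng) for _ in range(trials))
    return wins / trials


def spread(positive, total, rng, trials=5_000):
    """Standard deviation of the sampled score for one expert."""
    return statistics.pstdev([sample(positive, total, rng) for _ in range(trials)])


rng = random.Random(0)
rate = win_rate((5, 5), (50, 55), rng)   # the 5/5 expert vs. the 50/55 expert
wide = spread(9, 10, rng)                # 10 observations
narrow = spread(90, 100, rng)            # 100 observations, same success rate
```

Analytically, the 5/5 expert wins roughly 47% of these comparisons, whereas its Laplace estimate would lose every time. The spread at 100 observations is about a third of the spread at 10, which is why rankings stabilize as data accumulates.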
Configuration¶
| Env var | Default | Effect |
|---|---|---|
| THOMPSON_SAMPLING_ENABLED | true | Set to false for instant rollback to Laplace |
| EXPERT_MIN_DATAPOINTS | 5 | Below this threshold: return 0.5 (neutral) regardless of method |
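How the two settings might gate the score is sketched below. The function name, the boolean parsing, and the add-one Laplace fallback are assumptions made for illustration; only the variable names and defaults come from the table.

```python
import os
import random


def expert_score(positive: int, total: int, rng=random, env=os.environ) -> float:
    # Too little data: neutral score regardless of method.
    min_points = int(env.get("EXPERT_MIN_DATAPOINTS", "5"))
    if total < min_points:
        return 0.5

    # Assumed parsing: anything other than "true" (case-insensitive) disables sampling.
    enabled = env.get("THOMPSON_SAMPLING_ENABLED", "true").strip().lower() == "true"
    if enabled:
        return rng.betavariate(positive + 1, (total - positive) + 1)

    # Rollback path: deterministic Laplace point estimate.
    return (positive + 1) / (total + 2)


neutral = expert_score(3, 3, env={})
rolled_back = expert_score(8, 10, env={"THOMPSON_SAMPLING_ENABLED": "false"})
sampled = expert_score(8, 10, env={})
```

Reading the environment on every call is what would make the rollback instant. A service that caches configuration at startup would need a restart instead.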
Monitoring¶
Prometheus histogram moe_thompson_sample tracks sampled score distribution.
Compare with the theoretical Laplace point estimates in Grafana to visualize
exploration breadth over time.
Stage 3: Correction Memory¶
Stores past expert corrections in Neo4j as :Correction nodes and injects
relevant corrections into expert prompts to prevent repeat mistakes.
Write Path¶
Corrections are created when:
- Judge refinement succeeds: improvement ratio ≥ 15% (configurable via JUDGE_REFINE_MIN_IMPROVEMENT). The original (wrong) and refined (correct) responses are stored as a correction pair.
- Self-correction detects a numerical mismatch: the wrong and corrected values are persisted.
- User negative feedback: when a subsequent positive interaction exists for the same topic, the correction pair is extracted.
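The judge-refinement trigger might look like the sketch below. The source does not define the improvement ratio, so the relative-gain formula here is an assumption, as are the function names and the convention that the variable holds a fraction (0.15) rather than a percentage.

```python
import os


def improvement_ratio(original_score: float, refined_score: float) -> float:
    """Relative gain of the refined answer over the original (assumed definition)."""
    if original_score <= 0:
        return 0.0                    # avoid division by zero; treat as no gain
    return (refined_score - original_score) / original_score


def should_store_correction(original_score, refined_score, env=os.environ) -> bool:
    # Assumed format: a fraction such as "0.15", defaulting to the documented 15%.
    threshold = float(env.get("JUDGE_REFINE_MIN_IMPROVEMENT", "0.15"))
    return improvement_ratio(original_score, refined_score) >= threshold


stored = should_store_correction(3.0, 4.0, env={})    # +33%: stored
skipped = should_store_correction(4.0, 4.2, env={})   # +5%: not stored
```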
Storage Schema (Neo4j)¶
(:Correction {
hash: TEXT, -- SHA256(prompt+wrong+correct), dedup key
prompt_pattern: TEXT, -- user query (max 500 chars)
wrong_summary: TEXT, -- what went wrong
correct_summary: TEXT, -- what the correction was
category: TEXT, -- expert category
source_model: TEXT, -- which model failed
correction_source: TEXT, -- 'judge_refinement' | 'self_correction' | 'user_feedback'
confidence: FLOAT, -- reliability score (0-1)
times_applied: INT, -- how often this correction prevented a repeat
tenant_id: TEXT -- RBAC isolation
})
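The dedup key and the prompt truncation can be sketched as follows. The schema only says SHA256(prompt+wrong+correct), so plain concatenation, UTF-8 encoding, and a hex digest are assumptions here. `build_correction` is a hypothetical helper that returns the node properties as a dict.

```python
import hashlib

MAX_PROMPT_CHARS = 500  # from the schema comment on prompt_pattern


def correction_hash(prompt: str, wrong: str, correct: str) -> str:
    """SHA-256 over the concatenated fields, as a 64-character hex string."""
    payload = (prompt + wrong + correct).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def build_correction(prompt, wrong, correct, category, source_model,
                     correction_source, tenant_id, confidence=0.5):
    return {
        "hash": correction_hash(prompt, wrong, correct),
        "prompt_pattern": prompt[:MAX_PROMPT_CHARS],
        "wrong_summary": wrong,
        "correct_summary": correct,
        "category": category,
        "source_model": source_model,
        "correction_source": correction_source,
        "confidence": confidence,     # placeholder default, not from the source
        "times_applied": 0,
        "tenant_id": tenant_id,
    }


node = build_correction("x" * 600, "2+2=5", "2+2=4", "math",
                        "model-a", "self_correction", "tenant-1")
```

Plain concatenation is ambiguous in principle, since "ab" + "c" and "a" + "bc" hash identically. Inserting a separator between fields would avoid that, at the cost of changing existing keys.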
Read Path¶
At expert invocation, the orchestrator queries Neo4j for corrections matching the current category and prompt similarity, then renders the matches in this format:
[CORRECTION MEMORY — avoid repeating these past errors]
- Wrong: {wrong_summary}
Correct: {correct_summary}
This is injected into the expert's system prompt before the user query.
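Rendering and injecting the block can be sketched as below. The header text and field names come from the template above. The function names, the `limit` parameter, the two-space indent on the Correct line, and appending the block to the end of the system prompt are assumptions.

```python
HEADER = "[CORRECTION MEMORY — avoid repeating these past errors]"


def render_corrections(corrections: list[dict], limit: int = 3) -> str:
    """Format up to `limit` corrections; returns an empty string if there are none."""
    if not corrections:
        return ""
    lines = [HEADER]
    for item in corrections[:limit]:
        lines.append(f"- Wrong: {item['wrong_summary']}")
        lines.append(f"  Correct: {item['correct_summary']}")
    return "\n".join(lines)


def inject(system_prompt: str, corrections: list[dict]) -> str:
    """Append the correction block to the system prompt, if there is one."""
    block = render_corrections(corrections)
    return f"{system_prompt}\n\n{block}" if block else system_prompt


prompt = inject(
    "You are the math expert.",
    [{"wrong_summary": "Rounded 2.675 to 2.68", "correct_summary": "Use 2.67 (banker's rounding)"}],
)
```

Capping the number of injected corrections keeps the system prompt from growing without bound as the memory fills up.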
Configuration¶
| Env var | Default | Effect |
|---|---|---|
| CORRECTION_MEMORY_ENABLED | true | Set to false to disable read and write |
Rollback¶
Each stage can be disabled independently, without a redeployment:
| Stage | Rollback | Impact |
|---|---|---|
| Telemetry | DROP TABLE routing_telemetry; | No runtime effect |
| Thompson Sampling | THOMPSON_SAMPLING_ENABLED=false | Instant revert to Laplace |
| Correction Memory | CORRECTION_MEMORY_ENABLED=false | Disables injection + storage |
Design Decision: Why Not Contextual Bandits or DPO?¶
An earlier proposal suggested Vowpal Wabbit contextual bandits for routing and DPO LoRA fine-tuning for expert models. We chose the incremental approach because:
- VW competes with the LLM planner. The planner already performs contextual routing — a parallel statistical router creates conflicting signals.
- DPO requires dedicated GPU time. Training on inference hardware evicts production models from VRAM. The local cluster has no spare capacity.
- Cold-start problem. VW needs thousands of samples per action. During cold-start, the system performs worse than deterministic routing.
- By our rough estimate, the incremental approach covers about 80% of the RL value with 10% of the complexity: Thompson Sampling provides exploration, telemetry provides analysis, and correction memory improves expert output through prompt injection rather than weight updates, all without new infrastructure.
DPO remains a future option once 2,000+ clean preference pairs accumulate per model, on separate training hardware.