Infrastructure & Hardware
Overview
The reference infrastructure consists of 7 GPU nodes running on repurposed enterprise and consumer hardware. It demonstrates that MoE Sovereign functions across a wide spectrum of hardware — from CPU-only inference with 7B models up to high-end enterprise GPUs. The Tesla M10, M60 and K80 nodes are Proof-of-Concept hardware: they show what is technically feasible, not what is recommended for production deployments. A systematic latency comparison across all hardware tiers is planned.
Reference Implementation
The nodes shown are Ollama instances. In an enterprise environment, these can be replaced by cloud API endpoints, dedicated GPU clusters, or cloud inference services.
Hardware Table
| Node | CPU | RAM | GPUs | Total VRAM | Notes |
|---|---|---|---|---|---|
| N1 | AMD Ryzen 5 5600G | 64 GB DDR4 | 3× RTX 2060 12 GB + 2× RTX 3060 12 GB | 60 GB | Consumer GPUs |
| N2 | Intel Core i5-4590 | 32 GB DDR3 | 1× Tesla M10 (4× 8 GB) | 32 GB | Legacy Enterprise |
| N3 | AMD Athlon II X2 270 | 16 GB DDR3 | 1× Tesla M10 (4× 8 GB) | 32 GB | Ultra-Legacy CPU |
| N4 | AMD EPYC Embedded 3151 | 128 GB DDR4 ECC | 3× Tesla M10 (96 GB) | 96 GB | HPC Server |
| N5 | AMD EPYC Embedded 3151 | 128 GB DDR4 ECC | 3× Tesla M10 (96 GB) | 96 GB | HPC Server (identical to N4) |
| N6 | AMD EPYC Embedded 3151 | 128 GB DDR4 ECC | 7× Tesla K80 (2× 12 GB) | 168 GB | Ollama37 fork (Kepler CC3.7) |
| EXP | Dell Wyse Thin Client | minimal | Tesla M10 (32 GB) via eGPU | 32 GB | Experiment: MiniPCI → PCIe x16 |
Total VRAM: 516 GB nominal (all nodes, including EXP)
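The total can be cross-checked against the table. The following is a minimal sketch that recomputes each node's VRAM from the GPU counts in the GPUs column; the dictionary name `NODE_VRAM_GB` is only illustrative and not part of the project.

```python
# Per-node VRAM in GB, recomputed from the "GPUs" column of the hardware table.
NODE_VRAM_GB = {
    "N1": 3 * 12 + 2 * 12,  # 3x RTX 2060 12 GB + 2x RTX 3060 12 GB
    "N2": 4 * 8,            # 1x Tesla M10 (4x 8 GB)
    "N3": 4 * 8,            # 1x Tesla M10 (4x 8 GB)
    "N4": 3 * 4 * 8,        # 3x Tesla M10
    "N5": 3 * 4 * 8,        # 3x Tesla M10
    "N6": 7 * 2 * 12,       # 7x Tesla K80 (2x 12 GB each)
    "EXP": 4 * 8,           # 1x Tesla M10 via eGPU
}

total_gb = sum(NODE_VRAM_GB.values())
print(f"Total VRAM: {total_gb} GB across {len(NODE_VRAM_GB)} nodes")
# prints: Total VRAM: 516 GB across 7 nodes
```

These are nominal board capacities; the VRAM actually usable for model weights on each node will be somewhat lower.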
Network Topology
External Access — Host-Level Nginx
The external access layer uses an Nginx instance running natively on the host OS (not in Docker). It terminates TLS via Let's Encrypt (certbot) and proxies requests to the Docker service ports. See Webserver & Reverse Proxy for full details.
graph TD
INET["🌐 Internet"]
subgraph HOST_OS["Host OS (native)"]
NGINX["Nginx\nTLS via Let's Encrypt\nPort 443"]
end
subgraph DOCKER_STACK["Docker Stack"]
ORCH["langgraph-orchestrator\n:8002"]
ADMIN["moe-admin\n:8088"]
DOCS_C["moe-caddy :80/:443\n→ moe-docs :8098\n→ moe-dozzle :9999"]
end
subgraph GPU_CLUSTER["GPU Cluster (separate nodes)"]
N1["N1: Consumer GPUs\n60 GB VRAM · Ollama :11434"]
N2["N2: Tesla M10\n32 GB VRAM · Ollama :11434"]
N3["N3: Tesla M10\n32 GB VRAM · Ollama :11434"]
N4["N4: 3× Tesla M10\n96 GB VRAM · Ollama :11434"]
N5["N5: 3× Tesla M10\n96 GB VRAM · Ollama :11434"]
N6["N6: 7× Tesla K80\n168 GB VRAM · Ollama37 :11434"]
end
INET -->|HTTPS| NGINX
NGINX -->|proxy_pass :8002| ORCH
NGINX -->|proxy_pass :8088| ADMIN
NGINX -->|proxy_pass :80/:443| DOCS_C
ORCH <-->|"HTTP/REST Ollama API"| N1
ORCH <-->|"HTTP/REST Ollama API"| N2
ORCH <-->|"HTTP/REST Ollama API"| N3
ORCH <-->|"HTTP/REST Ollama API"| N4
ORCH <-->|"HTTP/REST Ollama API"| N5
ORCH <-->|"HTTP/REST Ollama API"| N6
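The orchestrator-to-node edges in the diagram are plain HTTP calls against the Ollama API on port 11434. As a rough sketch of what one such call could look like, the snippet below builds (but does not send) a non-streaming request to Ollama's `/api/generate` endpoint. The host names, the `OLLAMA_NODES` mapping, the helper `build_generate_request`, and the model name are all assumptions for illustration; the orchestrator's actual client code is not shown in this document.

```python
import json
import urllib.request

# Hypothetical host names; the real node addresses are not given in this document.
OLLAMA_NODES = {f"N{i}": f"http://n{i}.example.internal:11434" for i in range(1, 7)}


def build_generate_request(node: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a non-streaming POST to Ollama's /api/generate."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url=f"{OLLAMA_NODES[node]}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# "example-model" is a placeholder, not a model named in this document.
req = build_generate_request("N4", "example-model", "Hello")
print(req.get_method(), req.full_url)
```

Passing the request to `urllib.request.urlopen(req, timeout=...)` would send it. Because N6 exposes the same API through the Ollama37 fork, the same request shape should apply to every node.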
Internal Container Dependency Graph
graph TD
KAFKA["moe-kafka\n:9092 KRaft"]
NEO4J["neo4j-knowledge\n:7687 Bolt · :7474 HTTP"]
MCP["mcp-precision\n:8003"]
CHROMA["chromadb-vector\n:8001"]
CACHE["terra_cache\nValkey :6379"]
PG["terra_checkpoints\nPostgres :5432"]
PROXY["docker-socket-proxy\n:2375 (read-only)"]
ORCH["langgraph-orchestrator\n:8002"]
ADMIN["moe-admin\n:8088"]
PROM["moe-prometheus\n:9090"]
GRAF["moe-grafana\n:3001"]
NODE["node-exporter\n:9100"]
CADV["cadvisor\n:9338"]
DOCS["moe-docs\n:8098"]
DOZZLE["moe-dozzle\n:9999"]
CADDY["moe-caddy\n:80/:443"]
SYNC["moe-docs-sync"]
%% orchestrator dependencies
ORCH -->|"redis://"| CACHE
ORCH -->|"postgresql://"| PG
ORCH -->|"http://"| CHROMA
ORCH -->|"bolt://"| NEO4J
ORCH -->|"kafka://"| KAFKA
ORCH -->|"http://"| MCP
KAFKA -.->|"moe.ingest consumer"| ORCH
%% admin dependencies
ADMIN -->|"http://"| ORCH
ADMIN -->|"http://"| PROM
ADMIN -->|"redis://"| CACHE
ADMIN -->|"postgresql://"| PG
ADMIN -->|"http:// Docker API"| PROXY
%% observability
PROM -->|"scrape :8002/metrics"| ORCH
PROM -->|"scrape :9100/metrics"| NODE
PROM -->|"scrape :9338/metrics"| CADV
GRAF -->|"datasource"| PROM
%% docs stack
CADDY -->|"reverse_proxy"| DOCS
CADDY -->|"reverse_proxy"| DOZZLE
SYNC -->|"HTTP ingest"| ORCH
style ORCH fill:#1e3a5f,color:#fff
style ADMIN fill:#2a1e5f,color:#fff
style KAFKA fill:#4a2a00,color:#fff
style NEO4J fill:#1e5f3a,color:#fff
style PROM fill:#5f2a00,color:#fff
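One way to read the graph above is as a start-up order: a service should come up after everything it points at. The sketch below derives such an order with Python's standard `graphlib` module (Python 3.9+). Treating the solid arrows as start-up dependencies is an assumption of this example, and the dotted `moe.ingest consumer` edge is left out because it describes data flow back into the orchestrator; the project's actual compose configuration may order services differently.

```python
from graphlib import TopologicalSorter

# Each service maps to the services it points at in the diagram (solid arrows only).
# Reading those arrows as start-up dependencies is an assumption of this sketch.
DEPENDS_ON = {
    "langgraph-orchestrator": {
        "terra_cache", "terra_checkpoints", "chromadb-vector",
        "neo4j-knowledge", "moe-kafka", "mcp-precision",
    },
    "moe-admin": {
        "langgraph-orchestrator", "moe-prometheus", "terra_cache",
        "terra_checkpoints", "docker-socket-proxy",
    },
    "moe-prometheus": {"langgraph-orchestrator", "node-exporter", "cadvisor"},
    "moe-grafana": {"moe-prometheus"},
    "moe-caddy": {"moe-docs", "moe-dozzle"},
    "moe-docs-sync": {"langgraph-orchestrator"},
}

# static_order() yields every service only after all of its dependencies.
start_order = list(TopologicalSorter(DEPENDS_ON).static_order())
for position, service in enumerate(start_order, start=1):
    print(f"{position:2d}. {service}")
```

The exact order among independent services is not fixed, but the data stores always precede the orchestrator, and the orchestrator always precedes `moe-admin` and `moe-prometheus`.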
Ollama vs. Ollama37
Standard Ollama
- Supports NVIDIA GPUs from Compute Capability 5.0 (Maxwell+)
- Tesla K80 (Kepler, CC 3.7): not supported
Ollama37 Fork
- Reactivates CUDA support for Compute Capability 3.7 (Kepler architecture)
- Enables full inference on Tesla K80 GPUs
- Same API as standard Ollama (drop-in)
- Node N6: 7× Tesla K80 = 168 GB VRAM
Feasibility Study
Tesla K80 GPUs are officially no longer supported by Ollama. The Ollama37 fork reactivates these cards and enables LLM inference on hardware that others treat as electronic waste — demonstrated as a PoC. How this compares in latency and throughput to consumer or enterprise GPUs remains to be quantified in the planned hardware comparison study.
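The split between the two runtimes comes down to CUDA compute capability. The following sketch encodes the rule stated above (standard Ollama from CC 5.0, Ollama37 for CC 3.7) for the GPU models in the hardware table. The compute-capability values are taken from NVIDIA's published specifications; the function name `pick_runtime` and the lookup table are hypothetical and not part of either project.

```python
from typing import Optional

# CUDA compute capability per GPU model (values from NVIDIA's published specs).
COMPUTE_CAPABILITY = {
    "RTX 2060": (7, 5),   # Turing
    "RTX 3060": (8, 6),   # Ampere
    "Tesla M10": (5, 0),  # Maxwell
    "Tesla K80": (3, 7),  # Kepler
}


def pick_runtime(gpu: str) -> Optional[str]:
    """Return which runtime covers the GPU under the rule in this section."""
    cc = COMPUTE_CAPABILITY[gpu]
    if cc >= (5, 0):      # tuple comparison: (major, minor)
        return "ollama"
    if cc == (3, 7):
        return "ollama37"
    return None           # outside the range covered by either runtime


for gpu in COMPUTE_CAPABILITY:
    print(f"{gpu}: {pick_runtime(gpu)}")
```

Under this rule only the K80 node (N6) needs the fork; every other node in the table can run standard Ollama.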
VRAM Management
The orchestrator manages model placement dynamically:
- VRAM inventory: Each node reports available VRAM via Ollama API
- Model routing: Large T2 models are preferentially placed on nodes with more VRAM
- Multi-GPU: The several GPUs exposed by Tesla M10 (4 per board) and K80 (2 per board) cards are treated as a single logical pool
- Failover: If a node fails, requests are automatically redirected to others
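The routing and failover rules above can be illustrated with a small placement function. This is a sketch under stated assumptions, not the orchestrator's implementation: the `Node` dataclass, the `place_model` helper, and the idea of choosing the healthy node with the most free VRAM are all hypothetical, and the free-VRAM figures are supplied by hand rather than queried from any node.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Node:
    name: str
    free_vram_gb: float   # hypothetical: as last reported by the node
    healthy: bool = True


def place_model(required_gb: float, nodes: Iterable[Node]) -> Optional[Node]:
    """Return the healthy node with the most free VRAM that fits the model, or None."""
    candidates = [n for n in nodes if n.healthy and n.free_vram_gb >= required_gb]
    return max(candidates, key=lambda n: n.free_vram_gb, default=None)


# Free VRAM set to the nominal totals from the hardware table, for illustration only.
cluster = [Node("N1", 60), Node("N4", 96), Node("N6", 168)]
print(place_model(40, cluster).name)   # N6: most free VRAM

cluster[2].healthy = False             # simulate N6 failing
print(place_model(40, cluster).name)   # N4: next-largest node that still fits
```

Returning `None` when nothing fits leaves the caller free to queue the request or fall back to a smaller model; which of these the orchestrator does is not specified here.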
Runtime Stack per Node
flowchart TD
A["Debian 13 (Bookworm)"] --> B[Docker CE]
B --> C["Ollama (or Ollama37 for N6)"]
C --> D[NVIDIA Container Toolkit]
D --> E[CUDA 12.x]