Infrastructure & Hardware

Overview

The reference infrastructure consists of seven GPU nodes running on repurposed enterprise and consumer hardware. It demonstrates that MoE Sovereign functions across a wide spectrum of hardware, from CPU-only inference with 7B models up to high-end enterprise GPUs. The Tesla M10 and K80 nodes are proof-of-concept hardware: they show what is technically feasible, not what is recommended for production deployments. A systematic latency comparison across all hardware tiers is planned.

Reference Implementation

The nodes shown are Ollama instances. In an enterprise environment, these can be replaced by cloud API endpoints, dedicated GPU clusters, or cloud inference services.
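Because every node (or its cloud replacement) exposes the same Ollama REST API, a client only needs the base URL. A minimal sketch using `/api/generate`; the hostname `n1.internal` and the model name are assumptions:

```python
import json
import urllib.request

# Assumption: internal hostname of node N1; any Ollama-compatible
# endpoint (including a cloud inference gateway) would work the same way.
OLLAMA_URL = "http://n1.internal:11434"

def build_generate_request(model: str, prompt: str) -> bytes:
    """Serialize a non-streaming request body for Ollama's /api/generate."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

def generate(model: str, prompt: str) -> str:
    """POST a generate request to the node and return the response text."""
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=build_generate_request(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

if __name__ == "__main__":
    print(generate("llama3", "Say hello."))
```

Swapping a self-hosted node for a cloud endpoint then amounts to changing `OLLAMA_URL`.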

Hardware Table

| Node | CPU | RAM | GPUs | Total VRAM | Notes |
|------|-----|-----|------|------------|-------|
| N1 | AMD Ryzen 5 5600G | 64 GB DDR4 | 3× RTX 2060 12 GB + 2× RTX 3060 12 GB | 60 GB | Consumer GPUs |
| N2 | Intel Core i5-4590 | 32 GB DDR3 | 1× Tesla M10 (4× 8 GB) | 32 GB | Legacy enterprise |
| N3 | AMD Athlon II X2 270 | 16 GB DDR3 | 1× Tesla M10 (4× 8 GB) | 32 GB | Ultra-legacy CPU |
| N4 | AMD EPYC Embedded 3151 | 128 GB DDR4 ECC | 3× Tesla M10 (96 GB) | 96 GB | HPC server |
| N5 | AMD EPYC Embedded 3151 | 128 GB DDR4 ECC | 3× Tesla M10 (96 GB) | 96 GB | HPC server (identical to N4) |
| N6 | AMD EPYC Embedded 3151 | 128 GB DDR4 ECC | 7× Tesla K80 (2× 12 GB) | 168 GB | Ollama37 fork (Kepler CC 3.7) |
| EXP | Dell Wyse thin client | minimal | Tesla M10 (32 GB) via eGPU | 32 GB | Experiment: Mini PCIe → PCIe x16 |

Total VRAM: 516 GB (all nodes)
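The total can be checked directly from the per-node figures in the table (the dictionary below is just a transcription of the table, not a live inventory):

```python
# Per-node VRAM in GB, transcribed from the hardware table above.
NODE_VRAM_GB = {
    "N1": 60,   # 3x RTX 2060 12 GB + 2x RTX 3060 12 GB
    "N2": 32,   # 1x Tesla M10 (4x 8 GB)
    "N3": 32,   # 1x Tesla M10 (4x 8 GB)
    "N4": 96,   # 3x Tesla M10
    "N5": 96,   # 3x Tesla M10
    "N6": 168,  # 7x Tesla K80 (2x 12 GB each)
    "EXP": 32,  # Tesla M10 via eGPU
}

total = sum(NODE_VRAM_GB.values())
print(f"Total VRAM: {total} GB")  # -> Total VRAM: 516 GB
```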

Network Topology

External Access — Host-Level Nginx

The external access layer uses an Nginx instance running natively on the host OS (not in Docker). It terminates TLS via Let's Encrypt (certbot) and proxies requests to the Docker service ports. See Webserver & Reverse Proxy for full details.
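A sketch of one such host-level server block; the server name, certificate paths, and location prefixes are assumptions, only the upstream ports come from the topology:

```nginx
server {
    listen 443 ssl;
    server_name moe.example.org;  # assumption: placeholder domain

    # Issued by certbot (Let's Encrypt)
    ssl_certificate     /etc/letsencrypt/live/moe.example.org/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/moe.example.org/privkey.pem;

    location /api/ {
        proxy_pass http://127.0.0.1:8002;   # langgraph-orchestrator
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }

    location /admin/ {
        proxy_pass http://127.0.0.1:8088;   # moe-admin
        proxy_set_header Host $host;
    }
}
```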

graph TD
    INET["🌐 Internet"]

    subgraph HOST_OS["Host OS (native)"]
        NGINX["Nginx\nTLS via Let's Encrypt\nPort 443"]
    end

    subgraph DOCKER_STACK["Docker Stack"]
        ORCH["langgraph-orchestrator\n:8002"]
        ADMIN["moe-admin\n:8088"]
        DOCS_C["moe-caddy :80/:443\n→ moe-docs :8098\n→ moe-dozzle :9999"]
    end

    subgraph GPU_CLUSTER["GPU Cluster (separate nodes)"]
        N1["N1: Consumer GPUs\n60 GB VRAM · Ollama :11434"]
        N2["N2: Tesla M10\n32 GB VRAM · Ollama :11434"]
        N3["N3: Tesla M10\n32 GB VRAM · Ollama :11434"]
        N4["N4: 3× Tesla M10\n96 GB VRAM · Ollama :11434"]
        N5["N5: 3× Tesla M10\n96 GB VRAM · Ollama :11434"]
        N6["N6: 7× Tesla K80\n168 GB VRAM · Ollama37 :11434"]
    end

    INET -->|HTTPS| NGINX
    NGINX -->|proxy_pass :8002| ORCH
    NGINX -->|proxy_pass :8088| ADMIN
    NGINX -->|proxy_pass :80/:443| DOCS_C
    ORCH <-->|"HTTP/REST Ollama API"| N1
    ORCH <-->|"HTTP/REST Ollama API"| N2
    ORCH <-->|"HTTP/REST Ollama API"| N3
    ORCH <-->|"HTTP/REST Ollama API"| N4
    ORCH <-->|"HTTP/REST Ollama API"| N5
    ORCH <-->|"HTTP/REST Ollama API"| N6

Internal Container Dependency Graph

graph TD
    KAFKA["moe-kafka\n:9092 KRaft"]
    NEO4J["neo4j-knowledge\n:7687 Bolt · :7474 HTTP"]
    MCP["mcp-precision\n:8003"]
    CHROMA["chromadb-vector\n:8001"]
    CACHE["terra_cache\nValkey :6379"]
    PG["terra_checkpoints\nPostgres :5432"]
    PROXY["docker-socket-proxy\n:2375 (read-only)"]

    ORCH["langgraph-orchestrator\n:8002"]
    ADMIN["moe-admin\n:8088"]
    PROM["moe-prometheus\n:9090"]
    GRAF["moe-grafana\n:3001"]
    NODE["node-exporter\n:9100"]
    CADV["cadvisor\n:9338"]
    DOCS["moe-docs\n:8098"]
    DOZZLE["moe-dozzle\n:9999"]
    CADDY["moe-caddy\n:80/:443"]
    SYNC["moe-docs-sync"]

    %% orchestrator dependencies
    ORCH -->|"redis://"| CACHE
    ORCH -->|"postgresql://"| PG
    ORCH -->|"http://"| CHROMA
    ORCH -->|"bolt://"| NEO4J
    ORCH -->|"kafka://"| KAFKA
    ORCH -->|"http://"| MCP
    KAFKA -.->|"moe.ingest consumer"| ORCH

    %% admin dependencies
    ADMIN -->|"http://"| ORCH
    ADMIN -->|"http://"| PROM
    ADMIN -->|"redis://"| CACHE
    ADMIN -->|"postgresql://"| PG
    ADMIN -->|"http:// Docker API"| PROXY

    %% observability
    PROM -->|"scrape :8002/metrics"| ORCH
    PROM -->|"scrape :9100/metrics"| NODE
    PROM -->|"scrape :9338/metrics"| CADV
    GRAF -->|"datasource"| PROM

    %% docs stack
    CADDY -->|"reverse_proxy"| DOCS
    CADDY -->|"reverse_proxy"| DOZZLE
    SYNC -->|"HTTP ingest"| ORCH

    style ORCH fill:#1e3a5f,color:#fff
    style ADMIN fill:#2a1e5f,color:#fff
    style KAFKA fill:#4a2a00,color:#fff
    style NEO4J fill:#1e5f3a,color:#fff
    style PROM fill:#5f2a00,color:#fff
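The Prometheus scrape edges in the diagram correspond to a configuration along these lines (job names and the 15s interval are assumptions; the targets and ports come from the graph):

```yaml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: orchestrator
    static_configs:
      - targets: ["langgraph-orchestrator:8002"]
  - job_name: node
    static_configs:
      - targets: ["node-exporter:9100"]
  - job_name: cadvisor
    static_configs:
      - targets: ["cadvisor:9338"]
```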

Ollama vs. Ollama37

Standard Ollama

  • Supports NVIDIA GPUs from Compute Capability 5.0 (Maxwell+)
  • Tesla K80 (Kepler, CC 3.7): not supported

Ollama37 Fork

  • Reactivates CUDA support for Compute Capability 3.7 (Kepler architecture)
  • Enables full inference on Tesla K80 GPUs
  • Same API as standard Ollama (drop-in)
  • Node N6: 7× Tesla K80 = 168 GB VRAM

Feasibility Study

Tesla K80 GPUs are no longer officially supported by Ollama. The Ollama37 fork reactivates these cards and enables LLM inference on hardware that is otherwise treated as electronic waste, demonstrated here as a proof of concept. How it compares in latency and throughput to consumer or enterprise GPUs remains to be quantified in the planned hardware comparison study.

VRAM Management

The orchestrator manages model placement dynamically:

  • VRAM inventory: Each node reports available VRAM via Ollama API
  • Model routing: Large T2 models are preferentially placed on nodes with more VRAM
  • Multi-GPU: Tesla M10 and K80 are treated as a single logical pool
  • Failover: If a node fails, requests are automatically redirected to others
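The routing and failover bullets can be sketched as a pure selection function. The node list and the "most free VRAM wins" policy follow the description above; the data structures, function names, and the idea of a static registry are assumptions (the real orchestrator builds its inventory from the Ollama API):

```python
from dataclasses import dataclass

@dataclass
class Node:
    """One inference node in the VRAM inventory (hypothetical registry entry)."""
    name: str
    vram_total_gb: int
    vram_used_gb: int = 0
    healthy: bool = True

    @property
    def vram_free_gb(self) -> int:
        return self.vram_total_gb - self.vram_used_gb

def route(nodes: list[Node], model_vram_gb: int) -> Node:
    """Pick the healthy node with the most free VRAM that fits the model.

    Failover falls out naturally: unhealthy nodes are filtered first, so
    requests shift to the remaining candidates. Raises RuntimeError when
    no node can host the model.
    """
    candidates = [
        n for n in nodes
        if n.healthy and n.vram_free_gb >= model_vram_gb
    ]
    if not candidates:
        raise RuntimeError("no node with sufficient free VRAM")
    return max(candidates, key=lambda n: n.vram_free_gb)
```

With the reference cluster, a large 40 GB model would land on N6 (168 GB); if N6 is marked unhealthy, the same call falls back to N4 or N5.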

Runtime Stack per Node

flowchart TD
    A["Debian 12 (Bookworm)"] --> B[Docker CE]
    B --> C[NVIDIA Container Toolkit]
    C --> D["Ollama (or Ollama37 for N6)"]
    D --> E["CUDA 12.x (CUDA 11.x on N6 — Kepler was dropped in CUDA 12)"]
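Bringing this stack up on a node roughly mirrors the standard Ollama Docker deployment (the container and volume names are assumptions; `--gpus all` requires the NVIDIA Container Toolkit to be installed on the host):

```shell
# Sketch: one node's Ollama container with GPU access
docker run -d --gpus all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama ollama/ollama
```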