Skip to content

Ollama Cluster — Multi-Node LLM Inference

What is Ollama?

Ollama is a local LLM inference server that loads models in GGUF format and exposes an OpenAI-compatible REST API. It manages model loading, VRAM allocation, and parallel requests internally.

Sovereign MoE runs Ollama on multiple inference nodes with heterogeneous GPU hardware, coordinated through LiteLLM as a gateway.

Why Multi-Node on Heterogeneous Hardware?

The cluster does not have a uniform GPU generation. It combines inference nodes with modern GPUs (Ampere/Ada Lovelace architecture, CUDA 8.6/8.9) and legacy GPU nodes with older Maxwell/Kepler architecture (CUDA 5.2/3.5).

The problem with legacy Kepler GPUs: Ollama by default requires CUDA ≥ 5.0 and uses newer CUDA APIs not available on Kepler (CUDA 3.5). The standard Ollama build fails on Kepler-based nodes.

Solution: Ollama37 fork — A community fork that supports CUDA 3.7 (Kepler architecture) and includes older cuBLAS paths. Legacy nodes run on this fork; modern nodes run standard upstream Ollama.

Quantization: q8_0

All models are operated at the q8_0 quantization level:

Quantization Bits per weight VRAM (70B model) Quality loss
f16 16 ~140 GB none (reference)
q8_0 8 ~70 GB < 0.5%
q4_0 4 ~35 GB 3–8%
q2_K 2 ~18 GB 15–25%

q8_0 is the optimal trade-off for the cluster: near-full model quality close to f16, but half the VRAM usage compared to float16. For legacy nodes with limited VRAM per GPU, smaller models (7B/14B) are used in q8_0.

Flash Attention

Flash Attention is an algorithmically more efficient attention kernel that reduces the quadratic VRAM requirement of standard attention for long sequences to linear.

Availability:

  • Modern GPUs (Ampere+): Flash Attention 2 fully supported
  • Maxwell-based nodes: Flash Attention 1 (slower fallback)
  • Kepler-based nodes: No Flash Attention — Ollama37 uses the classic cuBLAS attention path

Ollama activates Flash Attention automatically when the hardware supports it. No manual configuration required.

Configuration

# .env
# INFERENCE_SERVERS — JSON array of configured inference endpoints
# Configure via Admin UI → Configuration

# CLUSTER_HARDWARE — JSON object for generate_litellm_config.py
CLUSTER_HARDWARE='{
  "primary_node": {
    "url": "http://<inference-node-1>:11434",
    "priority": 10,
    "models": ["qwen2.5:72b", "deepseek-r1:70b", "mistral-nemo:12b"]
  },
  "secondary_node": {
    "url": "http://<inference-node-2>:11434",
    "priority": 40,
    "models": ["qwen2.5:14b", "mistral-nemo:12b"]
  }
}'

Priority: Lower number = higher preference by LiteLLM. The primary node (10) is selected significantly more often than secondary nodes (40).

Dynamic VRAM Management

VRAM management is no longer handled via static semaphores (asyncio.Semaphore(GPU_COUNT)). Instead, the orchestrator delegates resource allocation entirely to LiteLLM and Ollama:

  • Ollama manages VRAM per node internally and queues requests when needed
  • LiteLLM distributes via least-busy routing — nodes under load receive fewer new requests
  • Priority routing via weight values in config.yml (generated from CLUSTER_HARDWARE)

This design scales to any number of GPU nodes without changing the orchestrator code.

Model Inventory (example)

# Check available models on an inference node
curl http://<inference-node>:11434/api/tags

# Download a model
curl -X POST http://<inference-node>:11434/api/pull \
  -d '{"name": "qwen2.5:72b-instruct-q8_0"}'

Model assignment to expert categories: see docs/experts/index.md