Ollama Cluster: Multi-Node LLM Inference

What is Ollama?

Ollama is a local LLM inference server that loads models in GGUF format and exposes a REST API, including OpenAI-compatible endpoints. It manages model loading, VRAM allocation, and parallel requests internally.

Sovereign MoE runs Ollama on multiple inference nodes with heterogeneous GPU hardware, coordinated through LiteLLM as a gateway.

Why Multi-Node on Heterogeneous Hardware?

The cluster does not share a single GPU generation. It combines inference nodes with modern GPUs (Ampere/Ada Lovelace architecture, CUDA compute capability 8.6/8.9) and legacy nodes with older Maxwell/Kepler GPUs (compute capability 5.2/3.5).

The problem with legacy Kepler GPUs: Ollama by default requires compute capability ≥ 5.0 and uses newer CUDA APIs that are not available on Kepler (compute capability 3.5). The standard Ollama build therefore fails on Kepler-based nodes.

Solution: the Ollama37 fork. This community fork supports compute capability 3.7 (Kepler architecture) and includes older cuBLAS paths. Legacy nodes run on this fork; modern nodes run standard upstream Ollama.
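The build choice described above can be written as a small lookup. This is a minimal sketch, not code from the project: the function name `select_ollama_build` and the returned labels are hypothetical, and the two thresholds (5.0 for upstream Ollama, 3.7 for the Ollama37 fork) are simply the values stated in the text.

```python
from typing import Tuple

# Thresholds as stated in the text above; the labels are illustrative only.
UPSTREAM_MIN_CC: Tuple[int, int] = (5, 0)   # standard upstream Ollama
OLLAMA37_MIN_CC: Tuple[int, int] = (3, 7)   # Ollama37 community fork


def select_ollama_build(compute_capability: Tuple[int, int]) -> str:
    """Return which Ollama build a GPU with this compute capability needs.

    compute_capability is a (major, minor) tuple, e.g. (8, 6) for Ampere.
    """
    if compute_capability >= UPSTREAM_MIN_CC:
        return "upstream"
    if compute_capability >= OLLAMA37_MIN_CC:
        return "ollama37"
    return "unsupported"
```

With these assumptions, `select_ollama_build((8, 6))` and `select_ollama_build((5, 2))` both return `"upstream"`, while `select_ollama_build((3, 7))` returns `"ollama37"`.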

Quantization: q8_0

All models are operated at the q8_0 quantization level:

Quantization	Nominal bits per weight	VRAM for weights (70B model)	Quality loss
f16	16	~140 GB	none (reference)
q8_0	8	~70 GB	< 0.5%
q4_0	4	~35 GB	3–8%
q2_K	2	~18 GB	15–25%

q8_0 is the best trade-off for the cluster: model quality stays close to f16 while VRAM usage is halved. On legacy nodes with limited VRAM per GPU, smaller models (7B/14B) are run in q8_0.
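The VRAM column in the table follows from simple arithmetic: parameter count times bits per weight, divided by 8 bits per byte. The sketch below reproduces it; the function name `weight_vram_gb` is hypothetical, and it uses the nominal bit widths from the table. Real GGUF files also store per-block scale factors, and the KV cache needs additional memory, so actual usage is somewhat higher.

```python
def weight_vram_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GB (10^9 bytes).

    Ignores quantization block overhead and the KV cache.
    """
    return params_billions * bits_per_weight / 8.0


for name, bits in [("f16", 16), ("q8_0", 8), ("q4_0", 4), ("q2_K", 2)]:
    print(f"{name}: ~{weight_vram_gb(70, bits):.1f} GB")
# f16: ~140.0 GB
# q8_0: ~70.0 GB
# q4_0: ~35.0 GB
# q2_K: ~17.5 GB
```

The same function gives about 14 GB for a 14B model and 7 GB for a 7B model in q8_0, which is why those sizes are used on the legacy nodes.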

Flash Attention

Flash Attention is an exact attention algorithm that computes attention in tiles and never materializes the full attention matrix. This reduces the memory requirement of standard attention from quadratic to linear in the sequence length.
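The difference in scaling can be made concrete with a rough calculation. The sketch below is a back-of-the-envelope model, not a measurement: both function names are hypothetical, and the tile size of 128 and 2-byte elements are assumed values. It compares the memory needed to hold the full score matrix with the memory needed for one tile plus two running softmax statistics per row.

```python
def naive_scores_bytes(seq_len: int, n_heads: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to hold the full seq_len x seq_len score matrix for every head."""
    return n_heads * seq_len * seq_len * bytes_per_elem


def tiled_scores_bytes(seq_len: int, n_heads: int, block: int = 128,
                       bytes_per_elem: int = 2) -> int:
    """Bytes for one block x block score tile plus per-row running max and sum.

    This models the tiled approach only roughly (assumed block size).
    """
    tile = block * block * bytes_per_elem
    row_stats = 2 * seq_len * bytes_per_elem
    return n_heads * (tile + row_stats)


for n in (4096, 8192, 16384):
    print(n, naive_scores_bytes(n, 64), tiled_scores_bytes(n, 64))
```

Doubling the sequence length quadruples the first number, while the second grows by a fixed amount per added token.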

Availability:

Modern GPUs (Ampere+): Flash Attention 2 fully supported
Maxwell-based nodes: Flash Attention 1 (slower fallback)
Kepler-based nodes: No Flash Attention — Ollama37 uses the classic cuBLAS attention path

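When Flash Attention is enabled, Ollama uses it only on hardware that supports it. Depending on the Ollama version, it may have to be enabled explicitly by setting OLLAMA_FLASH_ATTENTION=1 in the server environment.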

Configuration

# .env
# INFERENCE_SERVERS — JSON array of configured inference endpoints
# Configure via Admin UI → Configuration

# CLUSTER_HARDWARE — JSON object for generate_litellm_config.py
CLUSTER_HARDWARE='{
  "primary_node": {
    "url": "http://<inference-node-1>:11434",
    "priority": 10,
    "models": ["qwen2.5:72b", "deepseek-r1:70b", "mistral-nemo:12b"]
  },
  "secondary_node": {
    "url": "http://<inference-node-2>:11434",
    "priority": 40,
    "models": ["qwen2.5:14b", "mistral-nemo:12b"]
  }
}'

Priority: a lower number means higher preference by LiteLLM. generate_litellm_config.py translates these priorities into routing weights in config.yml, so the primary node (10) is selected more often than the secondary node (40).
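To illustrate how such a configuration could be turned into LiteLLM deployments, here is a minimal sketch. It is not the actual generate_litellm_config.py: the function name `build_model_list` is hypothetical, and the priority-to-weight formula (`100 // priority`) is an assumption chosen only so that a lower priority number yields a larger weight. The entry layout (`model_name`, `litellm_params` with `model`, `api_base`, `weight`) follows LiteLLM's `model_list` schema.

```python
import json
from typing import Any, Dict, List


def build_model_list(cluster_hardware_json: str) -> List[Dict[str, Any]]:
    """Turn a CLUSTER_HARDWARE JSON string into LiteLLM-style model_list entries."""
    cluster = json.loads(cluster_hardware_json)
    entries: List[Dict[str, Any]] = []
    for node in cluster.values():
        # ASSUMPTION: weight is inversely proportional to priority.
        # The real script may use a different mapping.
        weight = max(1, 100 // int(node["priority"]))
        for model in node["models"]:
            entries.append({
                "model_name": model,
                "litellm_params": {
                    "model": f"ollama/{model}",   # LiteLLM provider prefix for Ollama
                    "api_base": node["url"],
                    "weight": weight,
                },
            })
    return entries
```

Called with the value of `CLUSTER_HARDWARE` shown above, this produces five entries. Under the assumed formula, `mistral-nemo:12b` gets weight 10 on the primary node and weight 2 on the secondary node.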

Dynamic VRAM Management

VRAM management is no longer handled via static semaphores (asyncio.Semaphore(GPU_COUNT)). Instead, the orchestrator delegates resource allocation entirely to LiteLLM and Ollama:

Ollama manages VRAM per node internally and queues requests when needed
LiteLLM distributes via least-busy routing — nodes under load receive fewer new requests
Priority routing via weight values in config.yml (generated from CLUSTER_HARDWARE)

This design scales to any number of GPU nodes without changing the orchestrator code.
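The idea behind least-busy routing can be sketched in a few lines. This is an illustration of the concept only, not LiteLLM's implementation: the function name `pick_least_busy` and the in-flight counter dictionary are hypothetical.

```python
from typing import Dict


def pick_least_busy(in_flight: Dict[str, int]) -> str:
    """Return the node with the fewest in-flight requests.

    Ties are broken by dictionary insertion order.
    """
    if not in_flight:
        raise ValueError("no inference nodes configured")
    return min(in_flight, key=in_flight.get)
```

Because the selection works on whatever nodes are present in the dictionary, adding a node only means adding an entry; no fixed `GPU_COUNT` is baked into the caller.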

Model Inventory (example)

# Check available models on an inference node
curl http://<inference-node>:11434/api/tags

# Download a model
curl -X POST http://<inference-node>:11434/api/pull \
  -d '{"name": "qwen2.5:72b-instruct-q8_0"}'
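The inventory check can also be scripted, for example to compare what each node actually serves against the configuration. The following is a minimal sketch using only the standard library; the function names are hypothetical, and it assumes the `/api/tags` response has the shape `{"models": [{"name": ...}, ...]}`.

```python
import json
import urllib.request
from typing import Any, Dict, List


def model_names(tags_response: Dict[str, Any]) -> List[str]:
    """Extract model names from a parsed /api/tags response."""
    return [m["name"] for m in tags_response.get("models", [])]


def fetch_model_names(base_url: str, timeout: float = 5.0) -> List[str]:
    """Query one inference node and return the names of its installed models."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return model_names(json.load(resp))
```

For example, `fetch_model_names("http://<inference-node>:11434")` returns the same model names that the first curl command lists.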

Model assignment to expert categories: see docs/experts/index.md