Skip to content

Best Practices: LLM Selection & Template Design

This guide is derived from empirical testing of 69 LLMs across 5 inference nodes, covering Planner suitability, Judge suitability, and Expert role performance.

LLM Selection for Pipeline Roles

Planner (Task Decomposition)

The Planner must output strictly valid JSON — no prose, no markdown fences, no thinking blocks. This eliminates a surprising number of models.

Tier Recommended Models Latency Notes
Best phi4:14b 27-36s Fastest reliable Planner. Consistent JSON output.
Best hermes3:8b 16s Ultra-fast, good for simple decompositions
Good gpt-oss:20b 38s Reliable, widely available
Good devstral-small-2:24b 45s Strong on code-related planning
Good nemotron-cascade-2:30b ~200s Excellent quality but slow
Avoid qwen3.5:35b FAIL Thinking mode produces <think> blocks, not JSON
Avoid deepseek-r1:32b P-only Chain-of-thought interferes with JSON output
Avoid starcoder2:15b FAIL Code completion model, no instruction following

Key insight: Models with "thinking" or "reasoning" modes (qwen3.5, deepseek-r1) tend to wrap their output in <think> tags, breaking JSON parsing. Disable thinking mode in the Planner prompt or use non-reasoning models.

Judge / Merger (Response Synthesis & Scoring)

The Judge must synthesize multiple expert responses AND produce structured output (scores, provenance tags). It needs strong instruction following.

Tier Recommended Models Latency Notes
Best phi4:14b 1.7-4.2s Extremely fast Judge responses
Best qwen3-coder:30b 1.7s Fast, code-aware synthesis
Good Qwen3-Coder-Next (80B) 2.6s Highest quality but large
Good devstral-small-2:24b 2.5s Good for code-focused synthesis
Good glm-4.7-flash 15s Strong general synthesis
Avoid gpt-oss:20b in pipeline Works in isolation but gets unloaded by Ollama TTL between expert calls
Avoid qwen3.5:35b FAIL Same thinking-mode issue as Planner

Critical finding: gpt-oss:20b passes isolated Judge tests (4.7s, valid JSON) but fails in the MoE pipeline because Ollama unloads it between expert inference calls. The solution: use sticky sessions or a dedicated Judge node.

Expert Models

Experts are more forgiving — they produce free-text responses, not structured JSON. Almost any instruction-following model works as an Expert.

Domain Recommended Why
Code Review devstral-small-2:24b SWE-bench 68%, code-focused
Code Generation qwen3-coder:30b 370 languages, strong tool calling
Reasoning deepseek-r1:32b Best chain-of-thought on consumer GPUs
Security Analysis devstral-small-2:24b CWEval-aware, OWASP coverage
Research gemma4:31b Strong general knowledge
Math phi4:14b + MCP tools MCP handles calculation, LLM extracts params
Legal gpt-oss:20b German law knowledge, Gesetze-im-Internet tools

Template Composition

T1/T2 Tier Strategy

  • T1 (Primary, ≤20B): Fast screening. Models that respond in <30s. Use phi4:14b, hermes3:8b, gpt-oss:20b.
  • T2 (Fallback, >20B): Deep analysis. Engaged only when T1 reports CONFIDENCE: low. Use devstral-small-2:24b, qwen3-coder:30b, deepseek-r1:32b.

Node Assignment

  • Pinned (model@node): For production templates. Guarantees VRAM availability.
  • Floating (model only): For elastic/low-priority workloads. System finds the best available node automatically.

Rule: Pin the Planner and Judge to fast nodes (RTX). Float T2 experts.

Service Toggles

Each template can disable pipeline components:

Toggle Default When to Disable
enable_cache true Testing, debugging (need fresh responses)
enable_graphrag true Privacy-sensitive queries (no knowledge persistence)
enable_web_research true Air-gapped environments, speed-critical tasks

Compliance Badge

Templates are automatically classified:

  • Local Only (green): All models on local infrastructure
  • Mixed (yellow): Some models on external APIs
  • External (red): Primarily external APIs

The CISO sees at a glance whether data leaves the network.

System Prompt Engineering

Planner Prompts

DO: - Demand JSON-only output explicitly - List valid categories - Provide format examples - Include PRECISION_TOOLS block for MCP routing

DON'T: - Allow free-text explanations - Use thinking/reasoning instructions - Request markdown formatting

Judge Prompts

DO: - Instruct to preserve code blocks verbatim - Require provenance tags [REF:entity] - Demand verification steps - Cite which expert provided each insight

DON'T: - Allow summarization of code - Skip security findings

Expert Prompts

DO: - Define the expert's domain boundary clearly - Require structured output (CONFIDENCE, GAPS, REFERRAL) - Include domain-specific methodology (OWASP for security, etc.) - End with language enforcement

DON'T: - Mix domains (security expert should NOT comment on style) - Allow the expert to refuse ("I cannot help with that")

CC Profile Best Practices

Profile Type Tool Model Thinking Max Tokens Use Case
Fast gemma4:31b off 4,096 Quick edits, simple questions
Balanced Qwen3-Coder-Next on 8,192 / 16K reasoning Daily development
Deep Qwen3-Coder-Next on 8,192 / 32K reasoning Architecture, security audits

Key: The tool_choice: required setting forces the model to always use tools when available. This is critical for Claude Code integration — without it, the model may generate prose instead of executing file edits.