Best Practices: LLM Selection & Template Design¶
This guide is derived from empirical testing of 69 LLMs across 5 inference nodes, covering Planner suitability, Judge suitability, and Expert role performance.
LLM Selection for Pipeline Roles¶
Planner (Task Decomposition)¶
The Planner must output strictly valid JSON — no prose, no markdown fences, no thinking blocks. This eliminates a surprising number of models.
| Tier | Model | Latency | Notes |
|---|---|---|---|
| Best | phi4:14b | 27-36s | Fastest reliable Planner; consistent JSON output |
| Best | hermes3:8b | 16s | Ultra-fast; good for simple decompositions |
| Good | gpt-oss:20b | 38s | Reliable, widely available |
| Good | devstral-small-2:24b | 45s | Strong on code-related planning |
| Good | nemotron-cascade-2:30b | ~200s | Excellent quality but slow |
| Avoid | qwen3.5:35b | FAIL | Thinking mode produces <think> blocks, not JSON |
| Avoid | deepseek-r1:32b | P-only | Chain-of-thought interferes with JSON output |
| Avoid | starcoder2:15b | FAIL | Code-completion model; no instruction following |
Key insight: Models with "thinking" or "reasoning" modes (qwen3.5, deepseek-r1)
tend to wrap their output in <think> tags, breaking JSON parsing. Disable thinking
mode in the Planner prompt or use non-reasoning models.
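When a reasoning model must be used anyway, the cleanup described above can be done in a small pre-parse step. The function below is an illustrative sketch, not the pipeline's actual parser:

```python
import json
import re

def parse_planner_output(raw: str) -> dict:
    """Strip <think> blocks and stray markdown fences, then parse JSON.

    Illustrative defensive parsing; the production Planner parser may differ.
    """
    # Drop any <think>...</think> reasoning block a thinking-mode model emits
    cleaned = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL)
    # Drop markdown code fences if the model added them despite instructions
    cleaned = re.sub(r"^```(?:json)?\s*$", "", cleaned.strip(), flags=re.MULTILINE)
    return json.loads(cleaned)
```

This keeps the strict-JSON contract intact even when a model occasionally ignores the "no fences" instruction.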
Judge / Merger (Response Synthesis & Scoring)¶
The Judge must synthesize multiple expert responses AND produce structured output (scores, provenance tags). It needs strong instruction following.
| Tier | Model | Latency | Notes |
|---|---|---|---|
| Best | phi4:14b | 1.7-4.2s | Extremely fast Judge responses |
| Best | qwen3-coder:30b | 1.7s | Fast, code-aware synthesis |
| Good | Qwen3-Coder-Next (80B) | 2.6s | Highest quality but large |
| Good | devstral-small-2:24b | 2.5s | Good for code-focused synthesis |
| Good | glm-4.7-flash | 15s | Strong general synthesis |
| Avoid | gpt-oss:20b (in pipeline) | — | Works in isolation but gets unloaded by Ollama's TTL between expert calls |
| Avoid | qwen3.5:35b | FAIL | Same thinking-mode issue as the Planner |
Critical finding: gpt-oss:20b passes isolated Judge tests (4.7s, valid JSON)
but fails in the MoE pipeline because Ollama unloads it between expert inference
calls. The solution: use sticky sessions or a dedicated Judge node.
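One way to implement the sticky-session workaround is Ollama's keep_alive request field, which controls how long a model stays loaded after a call (-1 means indefinitely). A minimal sketch, assuming a default local Ollama endpoint:

```python
import json
import urllib.request

def build_payload(model: str, prompt: str) -> dict:
    # keep_alive=-1 keeps the model resident in VRAM until explicitly
    # unloaded, so the Judge survives the gaps between expert calls
    return {"model": model, "prompt": prompt, "stream": False, "keep_alive": -1}

def generate(model: str, prompt: str, host: str = "http://localhost:11434") -> dict:
    """POST to Ollama's /api/generate with the pinned keep_alive payload."""
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(build_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

A dedicated Judge node achieves the same effect without pinning, at the cost of reserving that node's VRAM.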
Expert Models¶
Experts are more forgiving — they produce free-text responses, not structured JSON. Almost any instruction-following model works as an Expert.
| Domain | Recommended | Why |
|---|---|---|
| Code Review | devstral-small-2:24b | SWE-bench 68%, code-focused |
| Code Generation | qwen3-coder:30b | 370 languages, strong tool calling |
| Reasoning | deepseek-r1:32b | Best chain-of-thought on consumer GPUs |
| Security Analysis | devstral-small-2:24b | CWEval-aware, OWASP coverage |
| Research | gemma4:31b | Strong general knowledge |
| Math | phi4:14b + MCP tools | MCP handles calculation; the LLM extracts parameters |
| Legal | gpt-oss:20b | German law knowledge, Gesetze-im-Internet tools |
Template Composition¶
T1/T2 Tier Strategy¶
- T1 (Primary, ≤20B): Fast screening with models that respond in under 30s. Use phi4:14b, hermes3:8b, or gpt-oss:20b.
- T2 (Fallback, >20B): Deep analysis, engaged only when T1 reports CONFIDENCE: low. Use devstral-small-2:24b, qwen3-coder:30b, or deepseek-r1:32b.
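The escalation rule reduces to a confidence check on the T1 answer. The sketch below assumes experts emit the CONFIDENCE: line described under System Prompt Engineering; the parsing itself is illustrative:

```python
def pick_tier(t1_response: str) -> str:
    """Return 'T1' if the fast-tier answer can stand, 'T2' to escalate.

    Assumes the expert reports a 'CONFIDENCE: <level>' line (this guide's
    convention); not the pipeline's actual router code.
    """
    for line in t1_response.splitlines():
        if line.strip().upper().startswith("CONFIDENCE:"):
            level = line.split(":", 1)[1].strip().lower()
            return "T2" if level == "low" else "T1"
    return "T2"  # no confidence reported: escalate to be safe
```

Treating a missing CONFIDENCE line as low confidence keeps the fallback conservative rather than silently trusting a malformed T1 answer.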
Node Assignment¶
- Pinned (model@node): For production templates. Guarantees VRAM availability.
- Floating (model only): For elastic or low-priority workloads. The system finds the best available node automatically.
Rule: Pin the Planner and Judge to fast nodes (RTX). Float T2 experts.
Service Toggles¶
Each template can disable pipeline components:
| Toggle | Default | When to Disable |
|---|---|---|
| enable_cache | true | Testing and debugging (when fresh responses are needed) |
| enable_graphrag | true | Privacy-sensitive queries (no knowledge persistence) |
| enable_web_research | true | Air-gapped environments, speed-critical tasks |
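For example, a privacy-sensitive template for an air-gapped deployment would flip two of the defaults. The dict below is a hypothetical template shape for illustration, not the production schema:

```python
# Hypothetical template config showing the three toggles; the production
# template schema may name or nest these fields differently.
airgapped_review_template = {
    "name": "code-review-airgapped",   # example name
    "enable_cache": True,              # caching is harmless here
    "enable_graphrag": False,          # privacy-sensitive: no persistence
    "enable_web_research": False,      # air-gapped: no outbound requests
}
```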
Compliance Badge¶
Templates are automatically classified:
- Local Only (green): All models on local infrastructure
- Mixed (yellow): Some models on external APIs
- External (red): Primarily external APIs
The CISO sees at a glance whether data leaves the network.
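The classification reduces to counting where each model runs. A minimal sketch, with the location labels and the majority cutoff for "primarily external" assumed rather than taken from the implementation:

```python
def compliance_badge(model_locations: list[str]) -> str:
    """Classify a template from its models' locations ('local'/'external').

    Labels and the majority threshold are illustrative assumptions.
    """
    external = sum(1 for loc in model_locations if loc == "external")
    if external == 0:
        return "Local Only"                      # green
    if external > len(model_locations) / 2:
        return "External"                        # red: primarily external
    return "Mixed"                               # yellow
```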
System Prompt Engineering¶
Planner Prompts¶
DO:
- Demand JSON-only output explicitly
- List valid categories
- Provide format examples
- Include PRECISION_TOOLS block for MCP routing
DON'T:
- Allow free-text explanations
- Use thinking/reasoning instructions
- Request markdown formatting
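Put together, a Planner system prompt following these rules might look like the sketch below. The category list and the PRECISION_TOOLS wording are placeholders, not the production prompt:

```python
# Hypothetical Planner system prompt; categories and tool wording are examples.
PLANNER_SYSTEM_PROMPT = """You are a task planner. Output ONLY a single JSON object.
No prose, no markdown fences, no <think> blocks.

Valid categories: code_review, code_generation, reasoning, security, research.

Format example:
{"tasks": [{"category": "code_review", "description": "..."}]}

PRECISION_TOOLS: route calculations and date math to MCP tools; do not compute them yourself.
"""
```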
Judge Prompts¶
DO:
- Instruct to preserve code blocks verbatim
- Require provenance tags [REF:entity]
- Demand verification steps
- Cite which expert provided each insight
DON'T:
- Allow summarization of code
- Skip security findings
Expert Prompts¶
DO:
- Define the expert's domain boundary clearly
- Require structured output (CONFIDENCE, GAPS, REFERRAL)
- Include domain-specific methodology (OWASP for security, etc.)
- End with language enforcement
DON'T:
- Mix domains (a security expert should not comment on style)
- Allow the expert to refuse ("I cannot help with that")
CC Profile Best Practices¶
| Profile Type | Tool Model | Thinking | Max Tokens | Use Case |
|---|---|---|---|---|
| Fast | gemma4:31b | off | 4,096 | Quick edits, simple questions |
| Balanced | Qwen3-Coder-Next | on | 8,192 / 16K reasoning | Daily development |
| Deep | Qwen3-Coder-Next | on | 8,192 / 32K reasoning | Architecture, security audits |
Key: The tool_choice: required setting forces the model to always use tools
when available. This is critical for Claude Code integration — without it, the model
may generate prose instead of executing file edits.
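In an OpenAI-compatible chat request this looks like the sketch below. The edit_file tool and its schema are hypothetical examples; only the tool_choice field is the point:

```python
# Sketch of a chat-completions request body with tool use forced on.
# The edit_file tool schema is illustrative, not Claude Code's own.
request_body = {
    "model": "Qwen3-Coder-Next",
    "messages": [{"role": "user", "content": "Rename foo() to bar() in utils.py"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "edit_file",
            "description": "Apply a patch to a file",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {"type": "string"},
                    "patch": {"type": "string"},
                },
                "required": ["path", "patch"],
            },
        },
    }],
    "tool_choice": "required",  # never answer in prose when a tool is available
}
```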