Ollama lets you run models like Llama, Mistral, Qwen, and DeepSeek locally. With Ollama configured in Rumus, you can keep prompts entirely on your machine — no network round-trip, no per-token cost, no provider account.
Before you start
You need:
- Ollama installed and running on the same machine (or a reachable one). Get it from ollama.com/download.
- At least one model pulled locally (see the example after this list).
- Enough RAM and disk for the model you choose. A 7B model wants ~8 GB RAM; bigger models scale from there.
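A minimal example, assuming you want a small general-purpose model (llama3.2 here is just a placeholder for whatever you plan to run):

```bash
# Download a model into the local Ollama store
ollama pull llama3.2
```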
Add Ollama in Rumus
Make sure Ollama is running
By default Ollama listens on http://localhost:11434. Verify it's up with a request to the tags endpoint:
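```bash
# List the models Ollama has available locally
curl http://localhost:11434/api/tags
```

You should get JSON listing the models you’ve pulled.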
Pick the provider
Set Provider to Ollama. There’s no API key field — Ollama is unauthenticated by default.
Base URL
The default http://localhost:11434 works for local Ollama. Override it if you’ve set OLLAMA_HOST to bind to another port, or if Ollama is running on another machine on your LAN.
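For instance, if you've bound Ollama to a different local port (11500 below is just an illustration), the Base URL needs to match:

```bash
# Bind Ollama to a non-default local port (example port only)
OLLAMA_HOST=127.0.0.1:11500 ollama serve
```

Then set Base URL to http://localhost:11500 in Rumus.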
Enter the model ID
Type the exact model tag you pulled (e.g. llama3.2, qwen2.5-coder:32b, mistral). Rumus does not auto-fetch the list — model IDs come straight from ollama list.
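To copy the exact tag, check the NAME column of ollama list (the output below is illustrative; your IDs, sizes, and dates will differ):

```bash
$ ollama list
NAME                 ID              SIZE      MODIFIED
qwen2.5-coder:32b    0b7a4e3b0e4f    19 GB     2 days ago
llama3.2:latest      a80c4f17acd5    2.0 GB    3 hours ago
```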
Capabilities
On the Capabilities tab, mark only what your chosen model actually supports:
- Tool Calling — only some models (e.g. Llama 3.1+, Qwen 2.5) handle tools well; a quick way to check follows this list.
- Vision — only multimodal variants (e.g. llava, qwen2.5-vl).
- Prompt Cache — Ollama doesn’t support an explicit prompt cache API; leave this off.
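If you're not sure whether a model handles tools, one rough check is to send a throwaway tool definition to Ollama's /api/chat endpoint (the model name and tool here are placeholders); a tool-capable model should respond with a tool_calls entry instead of prose:

```bash
# Ask the model to use a dummy tool; a tool-capable model should return
# "tool_calls" in the response message rather than answering in plain text.
curl http://localhost:11434/api/chat -d '{
  "model": "qwen2.5-coder:7b",
  "stream": false,
  "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get current weather for a city",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }
  }]
}'
```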
Recommended models for Rumus
Rumus benefits from models that are good at tool use and code. Solid local choices:

| Model | Why |
|---|---|
| qwen2.5-coder:32b | Strong at code, supports tools — good agent driver if you have the RAM |
| qwen2.5-coder:7b | Smaller variant — runs comfortably on 16 GB RAM |
| llama3.2 | Fast generalist for chat-style queries |
| llava | Multimodal — useful for screenshots and diagrams |
Tips
- Keep a model warm. First-token latency on a cold model can be many seconds while Ollama loads weights into memory. Hit it with a quick prompt right before a session.
- Reachability across the LAN. Set OLLAMA_HOST=0.0.0.0:11434 on the Ollama host and point Rumus at http://<host-ip>:11434. Make sure the firewall allows it (see the sketch after this list).
- Tool calling quality varies wildly. If the agent stops mid-task or fails to invoke a tool, fall back to a model with documented tool-use support.
- Quantization matters. A 7B model at Q4 quant runs on far less RAM than the FP16 version with little quality loss — pick the tag that fits your hardware.
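A rough sketch of the warm-up and LAN tips (192.168.1.20 is a placeholder for the Ollama host's address):

```bash
# Warm the model so the first real prompt doesn't pay the cold-load cost
ollama run llama3.2 "ready?"

# On the Ollama host: listen on all interfaces so other machines can reach it
OLLAMA_HOST=0.0.0.0:11434 ollama serve

# From the machine running Rumus: confirm reachability, then use this
# address as the Base URL
curl http://192.168.1.20:11434/api/tags
```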
Troubleshooting
Connection refused / no response
Ollama isn’t running, or it’s bound to a different host/port. Run ollama serve (or restart the app) and verify with curl http://localhost:11434/api/tags.
404 model not found
The model ID doesn’t match anything in ollama list. Either pull it (ollama pull <name>) or correct the ID — tags are case-sensitive and include the size suffix (e.g. qwen2.5-coder:32b, not just qwen2.5-coder).
Very slow generation
Either the model is too large for available RAM (Ollama is offloading to disk), or there’s no GPU acceleration. Try a smaller model or a more aggressive quant.
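One way to check is ollama ps (available in recent Ollama releases), which shows what's loaded and how much of it is running on the GPU versus the CPU:

```bash
# Show loaded models and whether they're running on GPU, CPU, or split
ollama ps
```

If a large share of the model is on the CPU, it isn't fitting in GPU memory.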
Tool calls don't work
The model doesn’t support tools well. Switch to a model with documented tool support like Llama 3.1+ or Qwen 2.5.
Hit a snag we didn’t cover? Ask in the Rumus community.
Next steps
Other providers
Anthropic, OpenAI, Google, Z.AI, DeepSeek, Kimi, OpenAI-compatible.
OpenAI-compatible
For vLLM, LiteLLM, and other local servers that speak OpenAI’s API.