Ollama lets you run models like Llama, Mistral, Qwen, and DeepSeek locally. With Ollama configured in Rumus, you can keep prompts entirely on your machine — no network round-trip, no per-token cost, no provider account.

Before you start

You need:
  • Ollama installed and running on the same machine (or a reachable one). Get it from ollama.com/download.
  • At least one model pulled locally:
    ollama pull llama3.2
    ollama pull qwen2.5-coder
    
  • Enough RAM and disk for the model you choose. As a rough guide, a 7B model needs about 8 GB of RAM, and larger models need proportionally more.
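
The RAM guideline can be sanity-checked with back-of-the-envelope arithmetic. The sketch below estimates the memory taken by the weights alone as parameter count times bits per weight, divided by 8. The function name and the nominal 4-bit and 16-bit figures are illustrative assumptions; real usage is higher, because the runtime, the context cache, and the operating system also need room.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Nominal figures for a 7B model; actual quantized files differ somewhat.
print(weight_memory_gb(7, 4))   # 4-bit quantization
print(weight_memory_gb(7, 16))  # FP16
```

Under these assumptions a 7B model needs roughly 3.5 GB for 4-bit weights and 14 GB at FP16, which is consistent with a quantized 7B model fitting within about 8 GB of RAM.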

Add Ollama in Rumus

1. Make sure Ollama is running

By default Ollama listens on http://localhost:11434. Verify with:
curl http://localhost:11434/api/tags
You should get JSON listing the models you’ve pulled.
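
The same check can be scripted. The following is a minimal Python sketch, assuming the default address and the documented shape of the /api/tags response (a models array whose entries carry a name field). The helper names are illustrative and not part of Rumus or Ollama.

```python
import json
import urllib.request


def parse_model_names(tags_json: str) -> list[str]:
    """Extract model names from the JSON body returned by GET /api/tags."""
    payload = json.loads(tags_json)
    return [model["name"] for model in payload.get("models", [])]


def list_local_models(base_url: str = "http://localhost:11434") -> list[str]:
    """Fetch /api/tags from a running Ollama server and return the model names.

    Raises urllib.error.URLError (an OSError) if nothing is listening.
    """
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return parse_model_names(resp.read().decode("utf-8"))
```

Calling list_local_models() against a running server should return names such as llama3.2:latest; an exception here usually means Ollama is not running or is bound elsewhere.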

2. Open the model settings

Go to Settings → AI → Models and click Add Model.

3. Pick the provider

Set Provider to Ollama. There’s no API key field — Ollama is unauthenticated by default.

4. Base URL

The default, http://localhost:11434, works for a local Ollama. Change it if you’ve set OLLAMA_HOST to bind to a different address or port, or if Ollama runs on another machine on your LAN.
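
As a rough illustration of how an OLLAMA_HOST value maps to the Base URL field, the sketch below converts a bind address into a client-side URL. The helper name, the client_host parameter, and the 192.168.1.20 address in the usage lines are hypothetical, and IPv6 addresses are not handled.

```python
DEFAULT_PORT = "11434"


def base_url_from_ollama_host(ollama_host: str, client_host: str = "localhost") -> str:
    """Turn an OLLAMA_HOST value such as '0.0.0.0:8080' into a client base URL.

    client_host replaces wildcard bind addresses, which are not valid destinations.
    """
    value = ollama_host.strip()
    scheme = "http"
    if "://" in value:
        scheme, value = value.split("://", 1)
    value = value.rstrip("/")

    host, sep, port = value.rpartition(":")
    if not sep:  # no colon, so the whole value is a host name
        host, port = value, DEFAULT_PORT
    if host in ("", "0.0.0.0"):
        host = client_host
    return f"{scheme}://{host}:{port}"


print(base_url_from_ollama_host("0.0.0.0:8080"))
print(base_url_from_ollama_host("0.0.0.0:11434", client_host="192.168.1.20"))
```

The wildcard address 0.0.0.0 only makes sense on the server side, which is why the sketch swaps in a concrete host for the client.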

5. Enter the model ID

Type the exact model tag you pulled (e.g. llama3.2, qwen2.5-coder:32b, mistral). Rumus does not fetch the list automatically, so copy the name from the NAME column of ollama list.
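
Because the ID has to match exactly, it can help to check it against the installed names before saving. This sketch assumes that Ollama treats a bare name as the :latest tag (so llama3.2 matches llama3.2:latest); the function names are illustrative.

```python
def normalize_tag(model_id: str) -> str:
    """Append ':latest' when the final path segment carries no explicit tag."""
    last_segment = model_id.rsplit("/", 1)[-1]
    return model_id if ":" in last_segment else f"{model_id}:latest"


def is_installed(model_id: str, installed: list[str]) -> bool:
    """Return True if model_id names one of the installed models exactly."""
    wanted = normalize_tag(model_id)
    return wanted in {normalize_tag(name) for name in installed}


installed = ["llama3.2:latest", "qwen2.5-coder:32b"]
print(is_installed("llama3.2", installed))       # bare name resolves to :latest
print(is_installed("qwen2.5-coder", installed))  # only the :32b tag was pulled
```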

6. Capabilities

On the Capabilities tab, mark only what your chosen model actually supports:
  • Tool Calling: only some models (e.g. Llama 3.1+, Qwen 2.5) handle tools well.
  • Vision: only multimodal variants (e.g. llava, qwen2.5vl).
  • Prompt Cache: Ollama doesn’t support an explicit prompt cache API; leave this off.
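
Recent Ollama versions report a capabilities list in the response to POST /api/show, which can serve as a starting point for these checkboxes. The sketch below rests on two assumptions: that the field is present (older versions may omit it), and that the capability strings tools and vision correspond to the Rumus toggles of the same meaning. The function names are illustrative.

```python
import json
import urllib.request


def suggest_capabilities(show_response: dict) -> dict[str, bool]:
    """Map the 'capabilities' list from /api/show onto the three Rumus toggles."""
    caps = set(show_response.get("capabilities", []))
    return {
        "Tool Calling": "tools" in caps,
        "Vision": "vision" in caps,
        "Prompt Cache": False,  # per the guidance above, leave this off for Ollama
    }


def show_model(model_id: str, base_url: str = "http://localhost:11434") -> dict:
    """Call POST /api/show for one model and return the decoded JSON body."""
    request = urllib.request.Request(
        f"{base_url}/api/show",
        data=json.dumps({"model": model_id}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as resp:
        return json.load(resp)
```

With a server running, suggest_capabilities(show_model("llama3.2")) gives a first guess. A reported tools capability does not guarantee that the model uses tools reliably, so it is still worth testing in practice.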

7. Save

The model appears in the picker under Custom Models.

Rumus benefits from models that are good at tool use and code. Solid local choices:

Model               Why
qwen2.5-coder:32b   Strong at code and supports tools; a good agent driver if you have the RAM
qwen2.5-coder:7b    Smaller variant; runs comfortably on 16 GB RAM
llama3.2            Fast generalist for chat-style queries
llava               Multimodal; useful for screenshots and diagrams

For the full catalog see ollama.com/library.

Tips

  • Keep a model warm. First-token latency on a cold model can be many seconds while Ollama loads weights into memory. Hit it with a quick prompt right before a session.
  • Reachability across the LAN. Set OLLAMA_HOST=0.0.0.0:11434 on the Ollama host and point Rumus at http://<host-ip>:11434. Make sure the firewall allows it.
  • Tool calling quality varies wildly. If the agent stops mid-task or fails to invoke a tool, fall back to a model with documented tool-use support.
  • Quantization matters. A 7B model at Q4 quant runs on far less RAM than the FP16 version with little quality loss — pick the tag that fits your hardware.
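
The keep-warm tip can be automated. Ollama's /api/generate endpoint loads a model into memory when it receives a request without a prompt, and the keep_alive field controls how long the model stays loaded afterwards (the default is 5 minutes). The sketch below is a minimal example under those assumptions; the function names and the 30m value are illustrative choices.

```python
import json
import urllib.request


def build_warmup_request(
    model_id: str,
    keep_alive: str = "30m",
    base_url: str = "http://localhost:11434",
) -> urllib.request.Request:
    """Build a POST /api/generate request that loads the model without a prompt."""
    body = {"model": model_id, "keep_alive": keep_alive}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def warm_up(model_id: str, **kwargs: str) -> None:
    """Send the warm-up request; loading large models can take a while."""
    with urllib.request.urlopen(build_warmup_request(model_id, **kwargs), timeout=120) as resp:
        resp.read()
```

Running warm_up("llama3.2") shortly before a session should move the load time out of the first real prompt.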

Troubleshooting

  • Connection fails. Ollama isn’t running, or it’s bound to a different host/port. Run ollama serve (or restart the app) and verify with curl http://localhost:11434/api/tags.
  • Model not found. The model ID doesn’t match anything in ollama list. Either pull it (ollama pull <name>) or correct the ID. Tags are case-sensitive and must include any size suffix: qwen2.5-coder:32b is not the same as qwen2.5-coder, which resolves to the :latest tag.
  • Responses are very slow. Either the model is too large for available RAM (so the system is swapping to disk), or there’s no GPU acceleration. Try a smaller model or a more aggressive quant.
  • Tool calls fail or never happen. The model doesn’t support tools well. Switch to a model with documented tool support like Llama 3.1+ or Qwen 2.5.
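
The connection and model-ID checks above can be run in order from a small script. This is only a sketch: diagnose and its fetch_names parameter are hypothetical names, and fetch_names stands for any callable that returns the installed model names (for example, one that reads /api/tags) and raises OSError when the server cannot be reached.

```python
from typing import Callable


def diagnose(model_id: str, fetch_names: Callable[[], list[str]]) -> str:
    """Return a short hint for the two most common setup failures."""
    try:
        names = fetch_names()
    except OSError:
        # urllib raises URLError, an OSError subclass, when nothing is listening.
        return "unreachable: start Ollama or check the Base URL"

    # A bare name is assumed to resolve to the :latest tag.
    wanted = model_id if ":" in model_id else f"{model_id}:latest"
    if wanted not in names:
        return f"missing: run 'ollama pull {model_id}' or fix the model ID"
    return "ok"


print(diagnose("llama3.2", lambda: ["llama3.2:latest"]))
```

Slow responses and weak tool use are not covered here, since they depend on hardware and model choice rather than configuration.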
Hit a snag we didn’t cover? Ask in the Rumus community.

Next steps

Other providers

Anthropic, OpenAI, Google, Z.AI, DeepSeek, Kimi, OpenAI-compatible.

OpenAI-compatible

For vLLM, LiteLLM, and other local servers that speak OpenAI’s API.