Ollama

Ollama lets you run models like Llama, Mistral, Qwen, and DeepSeek locally. With Ollama configured in Rumus, you can keep prompts entirely on your machine — no network round-trip, no per-token cost, no provider account.

Before you start

You need:
  • Ollama installed and running on the same machine (or a reachable one). Get it from ollama.com/download.
  • At least one model pulled locally:
    ollama pull llama3.2
    ollama pull qwen2.5-coder
    
  • Enough RAM and disk for the model you choose. A 7B model wants ~8 GB RAM; bigger models scale from there.
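
To confirm what's pulled and how much disk each model takes, run ollama list (output shown below is illustrative):
  # check what's pulled and how much disk it uses
  ollama list
  # illustrative output:
  NAME                   ID              SIZE      MODIFIED
  llama3.2:latest        a80c4f17acd5    2.0 GB    2 days ago
  qwen2.5-coder:latest   2b0496514337    4.7 GB    3 hours ago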

Add Ollama in Rumus

1. Make sure Ollama is running

By default Ollama listens on http://localhost:11434. Verify with:
curl http://localhost:11434/api/tags
You should get JSON listing the models you’ve pulled.
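The shape to expect:
  # example response (abbreviated; values illustrative):
  {"models": [{"name": "llama3.2:latest", "size": 2019393189, ...}]}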

2. Open the model settings

Go to Settings → AI → Models and click Add Model.

3. Pick the provider

Set Provider to Ollama. There’s no API key field — Ollama is unauthenticated by default.

4. Set the base URL

The default http://localhost:11434 works for a local Ollama. Override it if you've set OLLAMA_HOST to bind a different port, or if Ollama runs on another machine on your LAN.
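
For example, to run Ollama on another machine or port and point Rumus at it (the IP and port below are placeholders):
  # on the Ollama host: bind all interfaces on a custom port
  OLLAMA_HOST=0.0.0.0:11500 ollama serve
  # then set the Rumus base URL to http://192.168.1.50:11500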

5. Enter the model ID

Type the exact model tag you pulled (e.g. llama3.2, qwen2.5-coder:32b, mistral). Rumus does not auto-fetch the list — model IDs come straight from ollama list.

6. Capabilities

On the Capabilities tab, mark only what your chosen model actually supports (a quick way to check follows this list):
  • Tool Calling — only some models (e.g. Llama 3.1+, Qwen 2.5) handle tools well.
  • Vision — only multimodal variants (e.g. llava, qwen2.5-vl).
  • Prompt Cache — Ollama doesn’t support an explicit prompt cache API; leave this off.
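
Newer Ollama builds can tell you what a model supports: ollama show prints a Capabilities section (the excerpt below is illustrative; older versions may omit it):
  ollama show llama3.2
  # illustrative excerpt:
  Capabilities
    completion
    tools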

7. Save

The model appears in the picker under Custom Models.

Recommended models

Rumus benefits from models that are good at tool use and code. Solid local choices:
Model               Why
qwen2.5-coder:32b   Strong at code, supports tools — good agent driver if you have the RAM
qwen2.5-coder:7b    Smaller variant — runs comfortably on 16 GB RAM
llama3.2            Fast generalist for chat-style queries
llava               Multimodal — useful for screenshots and diagrams
For the full catalog see ollama.com/library.

Tips

  • Keep a model warm. First-token latency on a cold model can be many seconds while Ollama loads weights into memory. Hit it with a quick prompt right before a session (see the example after this list).
  • Reachability across the LAN. Set OLLAMA_HOST=0.0.0.0:11434 on the Ollama host and point Rumus at http://<host-ip>:11434. Make sure the firewall allows it.
  • Tool calling quality varies wildly. If the agent stops mid-task or fails to invoke a tool, fall back to a model with documented tool-use support.
  • Quantization matters. A 7B model at Q4 quant runs on far less RAM than the FP16 version with little quality loss — pick the tag that fits your hardware.
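
One way to keep a model warm: calling /api/generate with a model name and no prompt loads the weights, and keep_alive sets how long they stay resident (the 30m below is just an example):
  # pre-load llama3.2 and keep it in memory for 30 minutes
  curl http://localhost:11434/api/generate -d '{"model": "llama3.2", "keep_alive": "30m"}'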

Troubleshooting

  • Rumus can't reach Ollama. Ollama isn't running, or it's bound to a different host/port. Run ollama serve (or restart the app) and verify with curl http://localhost:11434/api/tags.
  • The model isn't found. The model ID doesn't match anything in ollama list. Either pull it (ollama pull <name>) or correct the ID — tags are case-sensitive and include the size suffix (e.g. qwen2.5-coder:32b, not just qwen2.5-coder).
  • Responses are very slow. Either the model is too large for available RAM (Ollama is offloading to disk), or there's no GPU acceleration. Try a smaller model or a more aggressive quant.
  • The agent stalls or skips tools. The model doesn't support tools well. Switch to a model with documented tool support like Llama 3.1+ or Qwen 2.5.
Hit a snag we didn’t cover? Ask in the Rumus community.

Next steps

Other providers

Anthropic, OpenAI, Google, Z.AI, DeepSeek, Kimi, OpenAI-compatible.

OpenAI-compatible

For vLLM, LiteLLM, and other local servers that speak OpenAI’s API.