
New in llama.cpp: Model Management
llama.cpp server now ships with router mode, which lets you dynamically load, unload, and switch between multiple models without restarting.
Reminder: llama.cpp server is a lightweight, OpenAI-compatible HTTP server for running LLMs locally.
This feature was a popular request to bring Ollama-style model management to llama.cpp. It uses a multi-process architecture where each model runs in its own process, so if one model crashes, others remain unaffected.
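The isolation property is easy to picture with a toy sketch (plain Python for illustration, not llama.cpp's actual code): each "model" runs in its own child process, so killing one leaves the others running.

```python
import subprocess
import sys

# Toy sketch of the multi-process design: one child process per model.
def spawn_worker(name: str) -> subprocess.Popen:
    # A real worker would load and serve a model; this stand-in just idles.
    return subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])

workers = {name: spawn_worker(name) for name in ("model-a", "model-b")}

workers["model-a"].kill()                 # simulate one model crashing
workers["model-a"].wait()

other_alive = workers["model-b"].poll() is None
print(other_alive)                        # True: model-b is unaffected

workers["model-b"].kill()                 # cleanup
workers["model-b"].wait()
```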
Quick Start
Start the server in router mode by not specifying a model:

```shell
llama-server
```

This auto-discovers models from your llama.cpp cache (`LLAMA_CACHE` or `~/.cache/llama.cpp`). If you've previously downloaded models via `llama-server -hf user/model`, they'll be available automatically.
You can also point to a local directory of GGUF files:

```shell
llama-server --models-dir ./my-models
```

Features
- Auto-discovery: scans your llama.cpp cache (default) or a custom `--models-dir` folder for GGUF files
- On-demand loading: models load automatically when first requested
- LRU eviction: when you hit `--models-max` (default: 4), the least-recently-used model is unloaded
- Request routing: the `model` field in your request determines which model handles it
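The eviction policy is ordinary LRU; a small sketch shows the behavior (a generic `OrderedDict` illustration of the policy, not the server's actual bookkeeping):

```python
from collections import OrderedDict

MODELS_MAX = 2  # stand-in for --models-max (the server's default is 4)

loaded = OrderedDict()  # model name -> state, least recently used first

def touch(model: str) -> None:
    """Mark a model as most recently used, evicting the LRU one over the cap."""
    if model in loaded:
        loaded.move_to_end(model)        # already loaded: just refresh recency
    else:
        loaded[model] = "loaded"         # load on demand
        if len(loaded) > MODELS_MAX:
            evicted, _ = loaded.popitem(last=False)  # drop least-recently-used
            print(f"unloading {evicted}")

for request in ("a", "b", "a", "c"):     # "b" is least recent when "c" arrives
    touch(request)

print(list(loaded))  # ['a', 'c']
```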
Examples
Chat with a specific model
```shell
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ggml-org/gemma-3-4b-it-GGUF:Q4_K_M",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

On the first request, the server automatically loads the model into memory (loading time depends on model size). Subsequent requests to the same model skip that delay, since it's already loaded.
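If you'd rather call the endpoint from Python than curl, here is a minimal stdlib sketch (the helper names are mine; the endpoint and fields are the OpenAI-compatible ones shown above):

```python
import json
import urllib.request

def chat_payload(model: str, prompt: str) -> dict:
    """Build an OpenAI-style chat body; the model field picks the route."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def chat(model: str, prompt: str, base: str = "http://localhost:8080") -> str:
    req = urllib.request.Request(
        f"{base}/v1/chat/completions",
        data=json.dumps(chat_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# chat("ggml-org/gemma-3-4b-it-GGUF:Q4_K_M", "Hello!")  # needs a running server
```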
List available models
```shell
curl http://localhost:8080/models
```

Returns all discovered models with their status (`loaded`, `loading`, or `unloaded`).
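A quick way to see which models are resident, assuming a response shape like the sample below (the exact field names are an assumption here; check your server's actual output):

```python
import json

# Assumed response shape, for illustration only
sample = json.loads("""
{"data": [
  {"id": "gemma-3-4b", "status": "loaded"},
  {"id": "qwen-2.5-7b", "status": "unloaded"},
  {"id": "phi-4", "status": "loading"}
]}
""")

def by_status(models: dict, status: str) -> list:
    """Return the ids of models in the given state."""
    return [m["id"] for m in models["data"] if m["status"] == status]

print(by_status(sample, "loaded"))  # ['gemma-3-4b']
```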
Manually load a model
```shell
curl -X POST http://localhost:8080/models/load \
  -H "Content-Type: application/json" \
  -d '{"model": "my-model.gguf"}'
```

Unload a model to free VRAM
```shell
curl -X POST http://localhost:8080/models/unload \
  -H "Content-Type: application/json" \
  -d '{"model": "my-model.gguf"}'
```

Key Options
All model instances inherit settings from the router:
```shell
llama-server --models-dir ./models -c 8192 -ngl 99
```

All loaded models will use an 8192-token context and full GPU offload. You can also define per-model settings using presets:
```shell
llama-server --models-preset config.ini
```

```ini
[my-model]
model = /path/to/model.gguf
ctx-size = 65536
temp = 0.7
```

Also available in the Web UI
The built-in web UI also supports model switching. Just select a model from the dropdown and it loads automatically.
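The presets file shown earlier is plain INI, so one way to sanity-check a config before launching is Python's `configparser` (a generic sketch, not a llama.cpp tool):

```python
import configparser

preset_text = """
[my-model]
model = /path/to/model.gguf
ctx-size = 65536
temp = 0.7
"""

cfg = configparser.ConfigParser()
cfg.read_string(preset_text)

for section in cfg.sections():           # one section per model preset
    print(section, dict(cfg[section]))
```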
Join the Conversation
We hope this feature makes it easier to A/B test different model versions, run multi-tenant deployments, or simply switch models during development without restarting the server.
Have questions or feedback? Drop a comment below or open an issue on GitHub .
Community

Mmproj support?
- 4 replies
Supported via presets.ini, where you can specify the mmproj (and other long and short arguments) per model.
Awesome new feature! Can model selection be done on something other than requested model name? Like maybe specify the ranking in presets.ini, and then the highest ranked model that can satisfy the request will be the default. So maybe one model is best for short context, another (or the same with other settings) for when the context gets too long, and another when image input is required.

This is a good addition, thank you.
What is the best way to get `<think> </think>` and the tokens in between? The OpenAI library is removing them. I want to run llama-server in the console and talk to it using a Python library that does not remove the thinking tokens.
I checked llama-cpp-python, but it does not have that.
- 1 reply
By default, llama-server keeps the reasoning content in a `reasoning_content` field of the response, so you can get it from there. Otherwise, use the `--reasoning-format` flag with the `deepseek` value to get the raw tokens.
Now I can use llama.cpp all the time. A big thank you to the devs.
Is there currently a way to have a "default" model if the request doesn't specify? Could be the currently loaded model or a specific model. (Just noticed one of my apps broke because it's used to llama-server not requiring a model name.)
- 1 reply
This seems to work
```ini
[DEFAULT]
port = 8080
n-gpu-layers = -1
device = 0
flash-attn = on
chat-template = jinja
models-max = 4
```
Does it unload the current model if VRAM is full, to allow swapping to a new model?

Fun ideas: add a personal avatar and a P2P social network, also eMule-style P2P model storage.

Hey there! Just wanted to drop a quick note saying I'm really digging the new router mode in llama.cpp server. It's a game-changer for me, especially when I need to switch between different models. The auto-discovery of models and LRU eviction is pretty neat – no more manual updates or restarts needed. It's like having a dynamic model manager on-the-fly. And the request routing part? Brilliant! Makes my workflow with dmenu smoother. Check out the full experience and check out my dmenu launcher script on the project's GitHub: https://gitea.com/gnusupport/LLM-Helpers/src/branch/main/bin/rcd-llm-dmenu-launcher.sh
It's a win for sure.

thanks for the update! does it now behave like ollama?
Thank you so much for this, it's great!
I want to specifically pin models to a specific GPU (I have multiple) is that possible?