Engines
KITT supports multiple inference engines. Engines can run inside Docker containers or as native host processes, depending on the engine and environment.
Supported Engines
| Engine | Key | Docker Image | API Format | Default Port | Model Formats | Modes |
|---|---|---|---|---|---|---|
| vLLM | vllm | vllm/vllm-openai:latest | OpenAI /v1/completions | 8000 | safetensors, pytorch | docker, native |
| llama.cpp | llama_cpp | ghcr.io/ggerganov/llama.cpp:server | OpenAI /v1/completions | 8081 | gguf | docker, native |
| Ollama | ollama | ollama/ollama:latest | Ollama /api/generate | 11434 | gguf | docker, native |
| ExLlamaV2 | exllamav2 | kitt/exllamav2:latest | OpenAI /v1/completions | 8000 | exl2, gptq | docker |
| MLX | mlx | (native only) | OpenAI /v1/completions | 8000 | safetensors, mlx | native |
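As a rough illustration, the table can be mirrored in a small lookup structure. This is a sketch, not KITT's internal registry: the `ENGINES` dictionary and both helper names are hypothetical, and only the values are taken from the table. It also assumes each engine listens on its default port.

```python
# Hypothetical mirror of the Supported Engines table; not KITT's own registry.
ENGINES = {
    "vllm": {"port": 8000, "api_path": "/v1/completions",
             "formats": {"safetensors", "pytorch"}, "modes": {"docker", "native"}},
    "llama_cpp": {"port": 8081, "api_path": "/v1/completions",
                  "formats": {"gguf"}, "modes": {"docker", "native"}},
    "ollama": {"port": 11434, "api_path": "/api/generate",
               "formats": {"gguf"}, "modes": {"docker", "native"}},
    "exllamav2": {"port": 8000, "api_path": "/v1/completions",
                  "formats": {"exl2", "gptq"}, "modes": {"docker"}},
    "mlx": {"port": 8000, "api_path": "/v1/completions",
            "formats": {"safetensors", "mlx"}, "modes": {"native"}},
}


def engines_for_format(model_format):
    """Return the sorted engine keys whose table row lists model_format."""
    return sorted(key for key, row in ENGINES.items() if model_format in row["formats"])


def endpoint_url(engine_key, host="localhost"):
    """Build the completion URL for an engine running on its default port."""
    row = ENGINES[engine_key]
    return f"http://{host}:{row['port']}{row['api_path']}"
```

For example, `engines_for_format("gguf")` selects llama.cpp and Ollama, and `endpoint_url("ollama")` points at port 11434 with the `/api/generate` path.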
Listing Engines
Display all registered engines, their Docker images, pull status, and supported model formats:
The output shows Ready for engines whose Docker image is already pulled, and Not Pulled for those that still need to be fetched.
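One way such a pull status could be derived is by asking the Docker CLI whether the image exists locally. The sketch below is an assumption about how this might work, not KITT's actual implementation; `image_status` is a hypothetical helper built on `docker image inspect`, which exits with a non-zero status when the image is not present.

```python
import subprocess


def image_status(image, run=subprocess.run):
    """Return "Ready" if the image exists locally, otherwise "Not Pulled".

    `run` is injectable so the function can be exercised without Docker.
    """
    # `docker image inspect` exits non-zero when the image is missing.
    # It also fails when the daemon is unreachable, which this sketch
    # does not distinguish from a missing image.
    result = run(
        ["docker", "image", "inspect", image],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return "Ready" if result.returncode == 0 else "Not Pulled"
```

If the `docker` executable is not installed at all, `subprocess.run` raises `FileNotFoundError`, so a fuller version would handle that case separately.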
Checking Availability
Run diagnostics on a single engine to see whether its image is available, Docker is reachable, and which model formats it accepts:
If the image has not been pulled, the output includes a Fix: hint with the
exact setup command. If Docker itself is not running, you will see a link to the
Docker installation guide.
Pulling Images
Download the Docker image for an engine:
Use --dry-run to see the docker pull command without executing it:
Model Format Compatibility
By accepted model format, most engines fall into one of two groups (see the Supported Engines table for ExLlamaV2 and MLX):
- safetensors / pytorch -- vLLM loads models in their native
  Hugging Face format. Point the --model flag at a directory containing
  model.safetensors or pytorch_model.bin files (or at a Hugging Face repo ID
  for engines that support it).
- gguf -- llama.cpp and Ollama load quantized GGUF files. Point --model
  at a single .gguf file or a directory containing one. Ollama also accepts
  its own tag syntax (e.g. llama3.1:8b).
Attempting to load a safetensors model in llama.cpp (or a GGUF file in vLLM) will fail at container startup with a clear error message.
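A mismatch like this can be caught before any container starts by inspecting file names. The following is a minimal sketch under stated assumptions: `ACCEPTED_FORMATS`, `detect_format`, `model_files`, and `is_compatible` are hypothetical names, the detection relies only on file extensions, and Hugging Face repo IDs and Ollama tags are not handled.

```python
from pathlib import Path

# Formats per engine key, copied from the Supported Engines table.
ACCEPTED_FORMATS = {
    "vllm": {"safetensors", "pytorch"},
    "llama_cpp": {"gguf"},
    "ollama": {"gguf"},
}


def detect_format(file_names):
    """Guess the model format from file names; return None if unrecognized."""
    names = [Path(name).name for name in file_names]
    if any(n.endswith(".gguf") for n in names):
        return "gguf"
    if any(n.endswith(".safetensors") for n in names):
        return "safetensors"
    if any(n.startswith("pytorch_model") and n.endswith(".bin") for n in names):
        return "pytorch"
    return None


def model_files(model_path):
    """List candidate files: a directory's direct children, or the path itself."""
    path = Path(model_path)
    if path.is_dir():
        return [p.name for p in path.iterdir() if p.is_file()]
    return [path.name]


def is_compatible(engine_key, model_path):
    """Return True if the detected format is one the engine accepts."""
    return detect_format(model_files(model_path)) in ACCEPTED_FORMATS[engine_key]
```

A caller could run `is_compatible("llama_cpp", path)` first and report the mismatch directly instead of waiting for the engine to fail at startup.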
Custom Engine Configuration
Override engine defaults by writing a YAML config and passing it with
--config:
Engine config files live in configs/engines/ and follow this structure:
    # configs/engines/vllm.yaml
    name: vllm
    image: vllm/vllm-openai:latest
    port: 8000
    health_endpoint: /health
    env:
      VLLM_ATTENTION_BACKEND: FLASH_ATTN
    extra_args:
      - --max-model-len
      - "4096"
Each engine's built-in config is in configs/engines/<key>.yaml. You can copy
one of these as a starting point and adjust image tags, environment variables,
or extra CLI arguments passed to the engine server inside the container.
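To make the idea of overriding defaults concrete, here is a sketch of one plausible merge rule. The source does not specify how KITT combines a custom config with the built-in one, so the semantics are an assumption: `env` is merged key by key and every other field is replaced outright. `DEFAULTS`, `apply_overrides`, and `EXAMPLE_VAR` are hypothetical names used only for illustration.

```python
import copy

# Values taken from the example configs/engines/vllm.yaml shown above.
DEFAULTS = {
    "name": "vllm",
    "image": "vllm/vllm-openai:latest",
    "port": 8000,
    "health_endpoint": "/health",
    "env": {"VLLM_ATTENTION_BACKEND": "FLASH_ATTN"},
    "extra_args": ["--max-model-len", "4096"],
}


def apply_overrides(defaults, overrides):
    """Return a new config without mutating either input.

    Assumed rule: `env` entries are merged, all other keys are replaced.
    """
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if key == "env":
            merged["env"] = {**merged.get("env", {}), **value}
        else:
            merged[key] = copy.deepcopy(value)
    return merged
```

Under this rule, overriding `extra_args` with `["--max-model-len", "8192"]` replaces the whole argument list, while adding `EXAMPLE_VAR` to `env` keeps `VLLM_ATTENTION_BACKEND` in place.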
Engine Lifecycle
When you run kitt run, KITT automatically:
- Starts a Docker container with --gpus all and --network host.
- Mounts the model directory into the container.
- Polls the health endpoint with exponential backoff (up to 300 seconds).
- Runs benchmarks via the engine's HTTP API on localhost:<port>.
- Stops and removes the container when benchmarks finish.
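The health-polling step above can be sketched as follows. Only the 300-second limit comes from the text; the starting delay, the delay cap, and the names `http_ok` and `wait_until_healthy` are assumptions, and KITT's own loop may differ.

```python
import time
import urllib.request


def http_ok(url, timeout=5.0):
    """Return True if a GET request to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:  # covers URLError, HTTPError, refused connections, timeouts
        return False


def wait_until_healthy(url, deadline=300.0, first_delay=1.0, max_delay=30.0,
                       probe=http_ok, sleep=time.sleep, clock=time.monotonic):
    """Poll url until it is healthy or `deadline` seconds have elapsed.

    The wait doubles after each failed probe, capped at max_delay and
    never sleeping past the deadline. Returns True on success.
    """
    start = clock()
    delay = first_delay
    while True:
        if probe(url):
            return True
        remaining = deadline - (clock() - start)
        if remaining <= 0:
            return False
        sleep(min(delay, remaining))
        delay = min(delay * 2, max_delay)
```

With the example vLLM config, a call would look like `wait_until_healthy("http://localhost:8000/health")`. The `probe`, `sleep`, and `clock` parameters exist so the loop can be tested without a running server.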
Container names follow the pattern kitt-<timestamp> so they are easy to
identify in docker ps output.
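The container start and cleanup steps map onto ordinary Docker CLI invocations. The sketch below builds those argument lists from an engine config; it is an approximation rather than KITT's actual code. The timestamp format, the `/models` mount point, and the function names are assumptions, while `--gpus all`, `--network host`, and the `kitt-<timestamp>` pattern come from the text.

```python
import time


def container_name(now=None):
    """Build a kitt-<timestamp> name. The timestamp format is an assumption."""
    stamp = time.strftime("%Y%m%d-%H%M%S", time.localtime(now))
    return f"kitt-{stamp}"


def docker_run_argv(config, model_dir, name, mount_point="/models"):
    """Assemble a `docker run` command line from an engine config dict."""
    argv = [
        "docker", "run", "--detach",
        "--gpus", "all",
        "--network", "host",
        "--name", name,
        "--volume", f"{model_dir}:{mount_point}",  # mount point is hypothetical
    ]
    for key, value in sorted(config.get("env", {}).items()):
        argv += ["--env", f"{key}={value}"]
    argv.append(config["image"])
    # Arguments after the image name are passed to the container's entrypoint.
    argv += config.get("extra_args", [])
    return argv


def cleanup_argvs(name):
    """Commands that stop and then remove the named container."""
    return [["docker", "stop", name], ["docker", "rm", name]]
```

Each returned list can be passed to `subprocess.run(argv, check=True)`. Because the container uses host networking, no port mapping is needed to reach the engine on `localhost:<port>`.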