
Engines

KITT supports multiple inference engines. Engines can run inside Docker containers or as native host processes, depending on the engine and environment.

Supported Engines

| Engine    | Key        | Docker Image                         | API Format               | Default Port | Model Formats        | Modes          |
|-----------|------------|--------------------------------------|--------------------------|--------------|----------------------|----------------|
| vLLM      | vllm       | vllm/vllm-openai:latest              | OpenAI /v1/completions   | 8000         | safetensors, pytorch | docker, native |
| llama.cpp | llama_cpp  | ghcr.io/ggerganov/llama.cpp:server   | OpenAI /v1/completions   | 8081         | gguf                 | docker, native |
| Ollama    | ollama     | ollama/ollama:latest                 | Ollama /api/generate     | 11434        | gguf                 | docker, native |
| ExLlamaV2 | exllamav2  | kitt/exllamav2:latest                | OpenAI /v1/completions   | 8000         | exl2, gptq           | docker         |
| MLX       | mlx        | (native only)                        | OpenAI /v1/completions   | 8000         | safetensors, mlx     | native         |

Listing Engines

Display all registered engines, their Docker images, pull status, and supported model formats:

kitt engines list

The output shows Ready for engines whose Docker image is already pulled, and Not Pulled for those that still need to be fetched.
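
The exact layout varies by KITT version, but on a host where only the vLLM image has been pulled, the listing might look roughly like this (illustrative output, not a verbatim transcript):

kitt engines list

ENGINE      KEY         IMAGE                                STATUS      FORMATS
vLLM        vllm        vllm/vllm-openai:latest              Ready       safetensors, pytorch
llama.cpp   llama_cpp   ghcr.io/ggerganov/llama.cpp:server   Not Pulled  gguf
Ollama      ollama      ollama/ollama:latest                 Not Pulled  gguf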

Checking Availability

Run diagnostics on a single engine to see whether its image is available, whether Docker is reachable, and which model formats it accepts:

kitt engines check vllm
kitt engines check ollama

If the image has not been pulled, the output includes a Fix: hint with the exact setup command. If Docker itself is not running, you will see a link to the Docker installation guide.
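
For illustration, checking an engine whose image has not been pulled might report something along these lines (the wording is not guaranteed, but the Fix: line echoes the setup command covered in the next section):

kitt engines check vllm

Docker:        reachable
Image:         vllm/vllm-openai:latest (not pulled)
Model formats: safetensors, pytorch
Fix:           kitt engines setup vllm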

Pulling Images

Download the Docker image for an engine:

kitt engines setup vllm
kitt engines setup ollama

Use --dry-run to see the docker pull command without executing it:

kitt engines setup --dry-run llama_cpp
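
For llama_cpp, the printed command should resolve to a plain docker pull of the image listed in the table above:

docker pull ghcr.io/ggerganov/llama.cpp:server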

Model Format Compatibility

Most engines fall into one of two broad groups by the model formats they accept:

safetensors / pytorch -- vLLM loads models in their native HuggingFace format. Point the --model flag at a directory containing model.safetensors or pytorch_model.bin files (or a HuggingFace repo ID for engines that support it).

gguf -- llama.cpp and Ollama load quantized GGUF files. Point --model at a single .gguf file or a directory containing one. Ollama also accepts its own tag syntax (e.g. llama3.1:8b).

Attempting to load a safetensors model in llama.cpp (or a GGUF file in vLLM) will fail at container startup with a clear error message.
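
Putting this together, the -m/--model flag of kitt run (shown in the next section) points at whatever artifact the engine expects; the paths and tags below are placeholders:

# safetensors / pytorch directory for vLLM
kitt run -m /models/Meta-Llama-3-8B -e vllm

# single GGUF file for llama.cpp
kitt run -m /models/llama-3-8b-Q4_K_M.gguf -e llama_cpp

# Ollama tag syntax
kitt run -m llama3.1:8b -e ollama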

Custom Engine Configuration

Override engine defaults by writing a YAML config and passing it with --config:

kitt run -m /models/llama2-7b -e vllm --config ./my-engine.yaml

Engine config files live in configs/engines/ and follow this structure:

# configs/engines/vllm.yaml
name: vllm
image: vllm/vllm-openai:latest
port: 8000
health_endpoint: /health
env:
  VLLM_ATTENTION_BACKEND: FLASH_ATTN
extra_args:
  - --max-model-len
  - "4096"

Each engine's built-in config is in configs/engines/<key>.yaml. You can copy one of these as a starting point and adjust image tags, environment variables, or extra CLI arguments passed to the engine server inside the container.
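
A typical workflow, assuming your working directory contains configs/engines/, is to copy a built-in file, tweak it, and pass your copy to --config (the model path below is a placeholder):

# start from the built-in llama.cpp config
cp configs/engines/llama_cpp.yaml ./my-llama-cpp.yaml

# edit the image tag, env, or extra_args, then run with your copy
kitt run -m /models/llama-3-8b-Q4_K_M.gguf -e llama_cpp --config ./my-llama-cpp.yaml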

Engine Lifecycle

When you run kitt run, KITT automatically:

  1. Starts a Docker container with --gpus all and --network host.
  2. Mounts the model directory into the container.
  3. Polls the health endpoint with exponential backoff (up to 300 seconds).
  4. Runs benchmarks via the engine's HTTP API on localhost:<port>.
  5. Stops and removes the container when benchmarks finish.

Container names follow the pattern kitt-<timestamp> so they are easy to identify in docker ps output.
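
Conceptually, that lifecycle is close to the shell sketch below. This is only an illustration of the five steps, not KITT's actual implementation; the image, port, and health endpoint are the vLLM defaults from the tables above, and the model path is a placeholder.

# 1. start the engine container
NAME="kitt-$(date +%s)"
docker run -d --name "$NAME" --gpus all --network host \
  -v /models/llama2-7b:/model \
  vllm/vllm-openai:latest --model /model

# 2-3. the model directory is mounted at /model; poll the health
#      endpoint with exponential backoff, giving up after ~300 seconds
delay=1; waited=0
until curl -sf http://localhost:8000/health > /dev/null; do
  [ "$waited" -ge 300 ] && { echo "engine never became healthy" >&2; break; }
  sleep "$delay"; waited=$((waited + delay)); delay=$((delay * 2))
done

# 4. benchmarks drive the engine's HTTP API on localhost:<port>
curl -s http://localhost:8000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "/model", "prompt": "Hello", "max_tokens": 16}'

# 5. stop and remove the container when finished
docker stop "$NAME" && docker rm "$NAME"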