Architecture¶
KITT is built around a plugin architecture for inference engines and benchmarks. Docker containers are the execution environment for every built-in engine except MLX, which runs natively. This page describes the major components and how they fit together.
High-Level Overview¶
Host (KITT CLI)                        Docker Container
+-------------------+                  +------------------+
| kitt run          |    HTTP/JSON     | Engine Server    |
| engine.generate() | <==============> | API endpoint     |
| GPUMemoryTracker  |  localhost:PORT  | /health          |
+-------------------+                  | --gpus all       |
                                       | /models (mount)  |
                                       +------------------+
KITT runs on the host (or in its own container) and manages inference engine containers via the Docker CLI. All communication between KITT and engine containers happens over HTTP on localhost, using each engine's native API.
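The host-to-container exchange can be sketched with the standard library alone. This is a minimal sketch, not KITT's actual client code: the function names, the timeout defaults, and the max_tokens value are assumptions; only the /health path, the localhost:PORT address, and the OpenAI-style /v1/completions endpoint come from this page.

```python
import json
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_for_health(
    url: str,
    timeout_s: float = 60.0,
    interval_s: float = 1.0,
    probe: Callable[[str], bool] = http_ok,
) -> bool:
    """Poll `url` until `probe` succeeds or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


def build_completion_request(port: int, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style /v1/completions request."""
    body = json.dumps({"model": model, "prompt": prompt, "max_tokens": 64}).encode()
    return urllib.request.Request(
        f"http://localhost:{port}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

The probe parameter is injectable so the polling loop can be exercised without a running container. Sending the request would be a call to urllib.request.urlopen(request) once the health check has passed.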
Engine Plugin System¶
The engine system is built on three components:
InferenceEngine ABC¶
The abstract base class in engines/base.py defines the contract every engine must fulfill:
- initialize() -- Pull the Docker image, create and start the container, and wait for the health check to pass.
- generate(prompt, **kwargs) -- Send a generation request to the running engine via its HTTP API and return the result.
- cleanup() -- Stop and remove the Docker container.
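The contract can be pictured as an abstract base class with those three methods. The sketch below is illustrative only: the real signatures and return types in engines/base.py are not shown on this page, so the type hints are assumptions, and EchoEngine is a hypothetical in-process stand-in that starts no container.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict


class InferenceEngine(ABC):
    """Contract sketch: the three lifecycle methods listed above."""

    @abstractmethod
    def initialize(self) -> None:
        """Bring the engine up and block until it is healthy."""

    @abstractmethod
    def generate(self, prompt: str, **kwargs: Any) -> Any:
        """Send one generation request and return the result."""

    @abstractmethod
    def cleanup(self) -> None:
        """Tear the engine down and release its resources."""


class EchoEngine(InferenceEngine):
    """Hypothetical engine used only to exercise the contract."""

    def __init__(self) -> None:
        self.running = False

    def initialize(self) -> None:
        self.running = True

    def generate(self, prompt: str, **kwargs: Any) -> Dict[str, str]:
        if not self.running:
            raise RuntimeError("engine is not initialized")
        return {"text": prompt.upper()}

    def cleanup(self) -> None:
        self.running = False


engine = EchoEngine()
engine.initialize()
try:
    result = engine.generate("hello")
finally:
    # cleanup() runs even if generate() raises, so nothing is left behind.
    engine.cleanup()
```

Wrapping generate() in try/finally mirrors the lifecycle described above: whatever happens during a run, the container is stopped and removed.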
EngineRegistry¶
The registry in engines/registry.py maintains a mapping of engine names to their implementation classes. Engines register themselves using the @register_engine decorator.
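A decorator-based registry of this kind might look like the following. This is a sketch under assumptions: the names EngineRegistry and register_engine come from this page, but the decorator's argument (an engine name string), the register/get/names methods, and the duplicate-name check are hypothetical.

```python
from typing import Callable, Dict, List


class EngineRegistry:
    """Maps engine names to implementation classes (illustrative only)."""

    _engines: Dict[str, type] = {}

    @classmethod
    def register(cls, name: str, engine_cls: type) -> None:
        if name in cls._engines:
            raise ValueError(f"engine already registered: {name}")
        cls._engines[name] = engine_cls

    @classmethod
    def get(cls, name: str) -> type:
        try:
            return cls._engines[name]
        except KeyError:
            raise KeyError(f"unknown engine: {name}") from None

    @classmethod
    def names(cls) -> List[str]:
        return sorted(cls._engines)


def register_engine(name: str) -> Callable[[type], type]:
    """Class decorator that adds the decorated class to the registry."""

    def decorator(engine_cls: type) -> type:
        EngineRegistry.register(name, engine_cls)
        return engine_cls  # the class itself is returned unchanged

    return decorator


@register_engine("dummy")
class DummyEngine:
    """Hypothetical engine; a real one would subclass InferenceEngine."""
```

Because registration happens as a side effect of the class definition, importing an engine module is enough to make that engine available by name.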
Auto-Discovery¶
EngineRegistry.auto_discover() imports all built-in engine modules from the engines/ package. This is called at startup so that all engines are available without manual imports.
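One common way to implement this kind of discovery is to walk a package with pkgutil and import each submodule, which triggers any registration decorators it contains. The function below is a sketch of that technique, not KITT's implementation; the free-function form and its return value are assumptions.

```python
import importlib
import pkgutil
from types import ModuleType
from typing import List


def auto_discover(package: ModuleType) -> List[str]:
    """Import every direct submodule of `package` and return their names."""
    imported: List[str] = []
    prefix = package.__name__ + "."
    for info in pkgutil.iter_modules(package.__path__, prefix=prefix):
        # Importing the module runs its top-level code, including decorators.
        importlib.import_module(info.name)
        imported.append(info.name)
    return imported
```

Called once at startup with the engines package, a function like this removes the need for a hand-maintained import list.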
Built-in Engines¶
| Engine | Docker Image | API Style | Default Port |
|---|---|---|---|
| vLLM | vllm/vllm-openai:latest | OpenAI /v1/completions | 8000 |
| llama.cpp | ghcr.io/ggerganov/llama.cpp:server | OpenAI /v1/completions | 8081 |
| Ollama | ollama/ollama:latest | Ollama /api/generate | 11434 |
| ExLlamaV2 | kitt/exllamav2:latest | OpenAI /v1/completions | 8000 |
| MLX | (native only) | OpenAI /v1/completions | 8000 |
Docker Management¶
No Docker SDK¶
KITT deliberately avoids the Docker Python SDK. Instead, DockerManager in engines/docker_manager.py provides static methods that call the docker CLI via subprocess. This keeps the dependency footprint small and avoids version-pinning issues with the SDK.
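Calling the docker CLI through subprocess can be sketched as a thin static wrapper. The class name matches this page, but the method names run_cli and stop_and_remove, the injectable runner parameter, and the error handling are assumptions made for illustration.

```python
import subprocess
from typing import Callable, Sequence


class DockerManager:
    """Static helpers that shell out to the docker CLI (illustrative only)."""

    @staticmethod
    def run_cli(args: Sequence[str], runner: Callable = subprocess.run) -> str:
        """Run `docker <args>` and return stdout, raising on a non-zero exit."""
        result = runner(["docker", *args], capture_output=True, text=True, check=False)
        if result.returncode != 0:
            raise RuntimeError(f"docker {' '.join(args)} failed: {result.stderr.strip()}")
        return result.stdout.strip()

    @staticmethod
    def stop_and_remove(name: str, runner: Callable = subprocess.run) -> None:
        """Stop a container by name, then remove it."""
        DockerManager.run_cli(["stop", name], runner)
        DockerManager.run_cli(["rm", name], runner)
```

Passing the command as a list (rather than a shell string) avoids quoting problems, and the runner parameter lets tests substitute a fake for subprocess.run so no Docker daemon is needed.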
Container Naming¶
Containers are named kitt-{timestamp} to allow multiple concurrent runs without conflicts.
Network Mode¶
All engine containers use --network host so the engine server binds directly to localhost on the host. This avoids Docker's port-mapping overhead and simplifies connectivity.
GPU Access¶
Engine containers are started with --gpus all to expose all host GPUs to the inference server.
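Taken together, the naming, network, and GPU conventions above determine most of the docker run command line. The sketch below assembles such a command; the function names, the Unix-seconds timestamp format, the read-only flag on the /models mount, and the example model directory are assumptions, while --network host, --gpus all, and the kitt-{timestamp} pattern come from this page.

```python
import time
from typing import List, Optional, Sequence


def container_name(now: Optional[float] = None) -> str:
    """Return kitt-<timestamp>; the timestamp format here is an assumption."""
    ts = int(time.time() if now is None else now)
    return f"kitt-{ts}"


def build_run_args(
    image: str,
    name: str,
    model_dir: str,
    engine_args: Sequence[str] = (),
) -> List[str]:
    """Assemble a `docker run` argv for an engine container."""
    return [
        "docker", "run", "--detach",
        "--name", name,
        "--network", "host",                    # server binds to host localhost
        "--gpus", "all",                        # expose every host GPU
        "--volume", f"{model_dir}:/models:ro",  # read-only mount is an assumption
        image,
        *engine_args,
    ]


args = build_run_args(
    "vllm/vllm-openai:latest",
    container_name(1700000000),
    "/srv/models",  # hypothetical host directory
)
```

With host networking there are no -p port mappings in the command, so two engines that share a default port (for example 8000) cannot run at the same time unless one is reconfigured.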
Sibling Container Pattern¶
KITT can itself run inside a Docker container. In this mode, the Docker socket (/var/run/docker.sock) is mounted into the KITT container, allowing it to create and manage engine containers as siblings rather than nested containers. This avoids Docker-in-Docker complexity.
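The socket mount is the only Docker-specific part of launching KITT this way. A minimal sketch of the command line follows; the image name example/kitt:latest and the function name are hypothetical, and the KITT image is assumed to ship the docker CLI.

```python
from typing import List, Sequence

DOCKER_SOCKET = "/var/run/docker.sock"


def sibling_run_args(kitt_image: str, kitt_args: Sequence[str]) -> List[str]:
    """Build a `docker run` argv that hands the host Docker socket to KITT."""
    return [
        "docker", "run", "--rm",
        # Same path inside and outside, so the docker CLI in the KITT
        # container talks to the host daemon and starts sibling containers.
        "--volume", f"{DOCKER_SOCKET}:{DOCKER_SOCKET}",
        kitt_image,
        *kitt_args,
    ]


args = sibling_run_args("example/kitt:latest", ["kitt", "run"])
```

Because the engine containers are started by the host daemon, any volume paths KITT passes to them are resolved on the host, not inside the KITT container.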
Remote Agent Architecture¶
KITT supports deploying thin agents to remote GPU servers. The agent is a standalone Python package (kitt-agent) served directly from the KITT web server, ensuring version compatibility.
KITT Server                              GPU Server (Agent)
+--------------------+                   +--------------------+
| Web UI / REST API  |    Register/HB    | kitt-agent daemon  |
| /api/v1/agent/*    |  <=============>  | heartbeat thread   |
| agent_install.py   |    Docker cmds    | Docker orchestr.   |
| (serves tarball)   |  ==============>  | log streaming      |
+--------------------+                   +--------------------+
Agent installation flow¶
- User runs curl -fL <server>/api/v1/agent/install.sh | bash
- The script creates a venv, downloads the agent sdist from /api/v1/agent/package, and installs it
- kitt-agent init writes ~/.kitt/agent.yaml with server URL, token, name, and port
- kitt-agent start registers with the server, starts the heartbeat thread, and listens for commands
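The heartbeat thread started in the last step can be sketched with threading.Event, which gives both the periodic wait and a clean shutdown signal. This is not the code in heartbeat.py: the class name, the injectable send callable, and the 30-second default interval are assumptions.

```python
import logging
import threading
from typing import Callable

logger = logging.getLogger(__name__)


class Heartbeat(threading.Thread):
    """Call `send` every `interval_s` seconds until stop() is called."""

    def __init__(self, send: Callable[[], None], interval_s: float = 30.0) -> None:
        super().__init__(daemon=True)
        self._send = send
        self._interval_s = interval_s
        self._stop_event = threading.Event()

    def run(self) -> None:
        while not self._stop_event.is_set():
            try:
                self._send()
            except Exception:
                # A failed beat must not kill the thread; log and retry later.
                logger.exception("heartbeat failed")
            # wait() returns early as soon as stop() sets the event.
            self._stop_event.wait(self._interval_s)

    def stop(self) -> None:
        self._stop_event.set()
```

In the agent, send would be a function that posts the agent's status to the server; keeping it injectable makes the loop testable without a network.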
Agent command protocol¶
The server sends JSON commands to the agent's /api/commands endpoint:
| Command | Payload | Action |
|---|---|---|
| run_container | image, port, volumes, env, health_url | Pull image, start container, stream logs |
| run_test | model_path, engine_name, suite_name, benchmark_name | Resolve model, run kitt run, stream logs |
| stop_container | command_id | Stop a running container |
| check_docker | (none) | Verify Docker is available |
| cleanup_storage | model_path (optional) | Delete specific or all cached models |
The agent reports results back to the server at /api/v1/agents/{id}/results.
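On the agent side, an incoming command has to be parsed and checked before it is dispatched. The sketch below validates a message against the table above. The JSON envelope (a "command" string plus a "payload" object) is an assumption, since the wire format is not shown on this page; only the command names and payload field names are taken from the table.

```python
import json
from typing import Any, Dict, Set, Tuple

# Payload fields per command, copied from the table above.
ALLOWED_FIELDS: Dict[str, Set[str]] = {
    "run_container": {"image", "port", "volumes", "env", "health_url"},
    "run_test": {"model_path", "engine_name", "suite_name", "benchmark_name"},
    "stop_container": {"command_id"},
    "check_docker": set(),
    "cleanup_storage": {"model_path"},
}


def parse_command(raw: str) -> Tuple[str, Dict[str, Any]]:
    """Parse a JSON command and reject unknown commands or payload fields."""
    message = json.loads(raw)
    name = message.get("command")  # envelope key is an assumption
    if name not in ALLOWED_FIELDS:
        raise ValueError(f"unknown command: {name!r}")
    payload = message.get("payload") or {}
    unexpected = payload.keys() - ALLOWED_FIELDS[name]
    if unexpected:
        raise ValueError(f"unexpected fields for {name}: {sorted(unexpected)}")
    return name, payload


name, payload = parse_command(
    json.dumps({"command": "stop_container", "payload": {"command_id": "abc"}})
)
```

The check only rejects fields that are not in the table; which fields are mandatory for each command is not stated here, so the sketch does not enforce any.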
Model storage workflow¶
When executing run_test, the agent uses ModelStorageManager to resolve the model:
- Check if already in local model_storage_dir
- Mount NFS share if configured (model_share_source → model_share_mount)
- Copy model from share to local storage
- Run benchmark with local path
- Clean up local copy if auto_cleanup is enabled
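The check, copy, and cleanup steps above can be sketched with pathlib and shutil. This is not the ModelStorageManager implementation: the function names and parameters are hypothetical, and the NFS mount step is left out because it needs system privileges; the sketch assumes the share is already mounted at share_mount.

```python
import shutil
from pathlib import Path
from typing import Optional


def resolve_model(model_name: str, storage_dir: Path, share_mount: Optional[Path]) -> Path:
    """Return a local path for the model, copying it from the share if needed."""
    local = storage_dir / model_name
    if local.exists():
        return local  # step 1: already in local storage
    if share_mount is None:
        raise FileNotFoundError(f"{model_name} is not local and no share is configured")
    source = share_mount / model_name
    if not source.exists():
        raise FileNotFoundError(f"{model_name} not found on share {share_mount}")
    storage_dir.mkdir(parents=True, exist_ok=True)
    if source.is_dir():
        shutil.copytree(source, local)  # step 3: copy a model directory
    else:
        shutil.copy2(source, local)     # step 3: copy a single model file
    return local


def cleanup_model(local: Path, auto_cleanup: bool) -> None:
    """Delete the local copy when auto_cleanup is enabled (step 5)."""
    if not auto_cleanup:
        return
    if local.is_dir():
        shutil.rmtree(local)
    elif local.exists():
        local.unlink()
```

Copying to local storage before the run keeps benchmark numbers independent of NFS throughput, which is presumably why the workflow does not benchmark directly from the share.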
Standalone agent package¶
The agent package lives in agent-package/ at the repository root:
agent-package/
├── pyproject.toml           # Standalone package (kitt-agent)
└── src/kitt_agent/
    ├── cli.py               # Click CLI: init, start, status, update, stop, test, service, preflight
    ├── config.py            # Pydantic config models
    ├── daemon.py            # Flask mini-app receiving commands
    ├── docker_ops.py        # Docker container management
    ├── hardware.py          # Hardware detection with unified memory support
    ├── heartbeat.py         # Heartbeat thread with settings sync
    ├── log_streamer.py      # SSE log streaming
    ├── model_storage.py     # NFS mount, local copy, cleanup
    ├── preflight.py         # Prerequisite checks (Docker, GPU, disk, etc.)
    └── registration.py      # Server registration
Project Structure¶
src/kitt/
├── cli/ # Click commands (run, engines, test, results, compare, web, fingerprint, stack, agent, monitoring)
├── engines/ # Inference engine plugins (base ABC, registry, vllm, llama_cpp, ollama, exllamav2, mlx)
├── benchmarks/ # Benchmark plugins (base ABC, registry, performance/*, quality/*)
├── config/ # Pydantic models + YAML loader
├── hardware/ # System fingerprinting (GPU, CPU, RAM, storage, CUDA)
├── runners/ # Suite/single test runners + checkpoint recovery
├── collectors/ # GPU memory tracking, system metrics
├── reporters/ # JSON, Markdown, comparison output
├── git_ops/ # KARR legacy Git-backed storage
├── monitoring/ # Monitoring stack config, generator, deployer
├── stack/ # Composable Docker stack config + generator
├── security/ # TLS cert generation and config
├── web/ # Flask dashboard + REST API + blueprints + Devon iframe
└── utils/ # Compression, validation, versioning
agent-package/ # Standalone thin agent (installed on GPU servers)
└── src/kitt_agent/ # Self-contained agent daemon
Key Design Decisions¶
- Dataclasses are used for result types and internal data structures.
- Pydantic v2 is used for configuration validation (YAML configs are loaded and validated through Pydantic models).
- Click is the CLI framework; Rich provides tables, panels, and spinners for terminal output.
- Logging uses logging.getLogger(__name__) throughout all modules.
- Full type hints are required on all public methods.
Web Dashboard¶
The web UI is a Flask application (web/app.py) using TailwindCSS, HTMX, and Alpine.js. It registers page blueprints (Dashboard, Agents, Devon, Models, Campaigns, Quick Test, Results, Settings) and API v1 blueprints under /api/v1/.
Devon Tab¶
When DEVON_URL is set, the Devon tab embeds the Devon web UI in an iframe for integrated model management. A /api/v1/devon/status endpoint checks connectivity. The tab's visibility is controlled via the Settings page and persisted in the web_settings SQLite table.
Settings Persistence¶
UI preferences (such as Devon tab visibility) are stored in a web_settings key-value table managed by SettingsService. Settings are injected into all templates via a Flask context processor.
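A key-value settings store over SQLite needs very little code. The sketch below uses the table name web_settings and the class name SettingsService from this page; the column names, the method names, and the devon_tab_visible key are assumptions.

```python
import sqlite3
from typing import Optional


class SettingsService:
    """Key-value settings persisted in a web_settings table (illustrative only)."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS web_settings ("
            "key TEXT PRIMARY KEY, value TEXT NOT NULL)"
        )
        conn.commit()

    def get(self, key: str, default: Optional[str] = None) -> Optional[str]:
        row = self._conn.execute(
            "SELECT value FROM web_settings WHERE key = ?", (key,)
        ).fetchone()
        return row[0] if row else default

    def set(self, key: str, value: str) -> None:
        # The primary key on `key` makes this an insert-or-overwrite.
        self._conn.execute(
            "INSERT OR REPLACE INTO web_settings (key, value) VALUES (?, ?)",
            (key, value),
        )
        self._conn.commit()


settings = SettingsService(sqlite3.connect(":memory:"))
settings.set("devon_tab_visible", "true")  # hypothetical key name
```

In a Flask app, a function decorated with @app.context_processor could return these values as a dict so that every template sees them.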
Relationship to DEVON¶
KITT tests models; DEVON manages and stores them. The two projects share the same technical stack (Poetry, Click, Rich, Python 3.10+, plugin registry pattern). The KITT web dashboard embeds Devon's UI directly when DEVON_URL is configured. DEVON can also export model paths in a format that KITT consumes.
Next Steps¶
- Engine Lifecycle -- detailed container lifecycle and health checks
- Benchmark System -- how benchmarks are defined and executed
- Hardware Fingerprinting -- system identification for result organization