Architecture¶

KITT is built around a plugin architecture for inference engines and benchmarks, with Docker containers as the execution environment for all engines. This page describes the major components and how they fit together.

High-Level Overview¶

Host (KITT CLI)                         Docker Container
+-------------------+                  +------------------+
| kitt run           |   HTTP/JSON      | Engine Server    |
|   engine.generate()| <==============> |   API endpoint   |
|   GPUMemoryTracker |  localhost:PORT  |   /health        |
+-------------------+                  |   --gpus all     |
                                       |   /models (mount)|
                                       +------------------+

KITT runs on the host (or in its own container) and manages inference engine containers via the Docker CLI. All communication between KITT and engine containers happens over HTTP on localhost, using each engine's native API.

Engine Plugin System¶

The engine system is built on three components:

InferenceEngine ABC¶

The abstract base class in engines/base.py defines the contract every engine must fulfill:

initialize() -- Pull the Docker image, create and start the container, wait for the health check to pass.
generate(prompt, **kwargs) -- Send a generation request to the running engine via its HTTP API and return the result.
cleanup() -- Stop and remove the Docker container.

EngineRegistry¶

The registry in engines/registry.py maintains a mapping of engine names to their implementation classes. Engines register themselves using the @register_engine decorator:

@register_engine("vllm")
class VLLMEngine(InferenceEngine):
    ...

Auto-Discovery¶

EngineRegistry.auto_discover() imports all built-in engine modules from the engines/ package. This is called at startup so that all engines are available without manual imports.

Built-in Engines¶

Engine	Docker Image	API Style	Default Port
vLLM	`vllm/vllm-openai:latest`	OpenAI `/v1/completions`	8000
llama.cpp	`ghcr.io/ggerganov/llama.cpp:server`	OpenAI `/v1/completions`	8081
Ollama	`ollama/ollama:latest`	Ollama `/api/generate`	11434
ExLlamaV2	`kitt/exllamav2:latest`	OpenAI `/v1/completions`	8000
MLX	(native only)	OpenAI `/v1/completions`	8000

Docker Management¶

No Docker SDK¶

KITT deliberately avoids the Docker Python SDK. Instead, DockerManager in engines/docker_manager.py provides static methods that call the docker CLI via subprocess. This keeps the dependency footprint small and avoids version-pinning issues with the SDK.

Container Naming¶

Containers are named kitt-{timestamp} to allow multiple concurrent runs without conflicts.

Network Mode¶

All engine containers use --network host so the engine server binds directly to localhost on the host. This avoids Docker's port-mapping overhead and simplifies connectivity.

GPU Access¶

Engine containers are started with --gpus all to expose all host GPUs to the inference server.

Sibling Container Pattern¶

KITT can itself run inside a Docker container. In this mode, the Docker socket (/var/run/docker.sock) is mounted into the KITT container, allowing it to create and manage engine containers as siblings rather than nested containers. This avoids Docker-in-Docker complexity.

Remote Agent Architecture¶

KITT supports deploying thin agents to remote GPU servers. The agent is a standalone Python package (kitt-agent) served directly from the KITT web server, ensuring version compatibility.

KITT Server                              GPU Server (Agent)
+--------------------+                  +--------------------+
| Web UI / REST API  |  Register/HB    | kitt-agent daemon   |
|   /api/v1/agent/*  | <=============> |   heartbeat thread  |
|   agent_install.py |  Docker cmds    |   Docker orchestr.  |
|   (serves tarball) | ==============> |   log streaming     |
+--------------------+                  +--------------------+

Agent installation flow¶

User runs curl -fL <server>/api/v1/agent/install.sh | bash
The script creates a venv, downloads the agent sdist from /api/v1/agent/package, and installs it
kitt-agent init writes ~/.kitt/agent.yaml with server URL, token, name, and port
kitt-agent start registers with the server, starts the heartbeat thread, and listens for commands

Agent command protocol¶

The server sends JSON commands to the agent's /api/commands endpoint:

Command	Payload	Action
`run_container`	image, port, volumes, env, health_url	Pull image, start container, stream logs
`run_test`	model_path, engine_name, suite_name, benchmark_name	Resolve model, run `kitt run`, stream logs
`stop_container`	command_id	Stop a running container
`check_docker`	(none)	Verify Docker is available
`cleanup_storage`	model_path (optional)	Delete specific or all cached models

The agent reports results back to the server at /api/v1/agents/{id}/results.

Model storage workflow¶

When executing run_test, the agent uses ModelStorageManager to resolve the model:

Check if already in local model_storage_dir
Mount NFS share if configured (model_share_source → model_share_mount)
Copy model from share to local storage
Run benchmark with local path
Clean up local copy if auto_cleanup is enabled

Standalone agent package¶

The agent package lives in agent-package/ at the repository root:

agent-package/
├── pyproject.toml          # Standalone package (kitt-agent)
└── src/kitt_agent/
    ├── cli.py              # Click CLI: init, start, status, update, stop, test, service, preflight
    ├── config.py           # Pydantic config models
    ├── daemon.py           # Flask mini-app receiving commands
    ├── docker_ops.py       # Docker container management
    ├── hardware.py         # Hardware detection with unified memory support
    ├── heartbeat.py        # Heartbeat thread with settings sync
    ├── log_streamer.py     # SSE log streaming
    ├── model_storage.py    # NFS mount, local copy, cleanup
    ├── preflight.py        # Prerequisite checks (Docker, GPU, disk, etc.)
    └── registration.py     # Server registration

Project Structure¶

src/kitt/
├── cli/           # Click commands (run, engines, test, results, compare, web, fingerprint, stack, agent, monitoring)
├── engines/       # Inference engine plugins (base ABC, registry, vllm, llama_cpp, ollama, exllamav2, mlx)
├── benchmarks/    # Benchmark plugins (base ABC, registry, performance/*, quality/*)
├── config/        # Pydantic models + YAML loader
├── hardware/      # System fingerprinting (GPU, CPU, RAM, storage, CUDA)
├── runners/       # Suite/single test runners + checkpoint recovery
├── collectors/    # GPU memory tracking, system metrics
├── reporters/     # JSON, Markdown, comparison output
├── git_ops/       # KARR legacy Git-backed storage
├── monitoring/    # Monitoring stack config, generator, deployer
├── stack/         # Composable Docker stack config + generator
├── security/      # TLS cert generation and config
├── web/           # Flask dashboard + REST API + blueprints + Devon iframe
└── utils/         # Compression, validation, versioning

agent-package/     # Standalone thin agent (installed on GPU servers)
└── src/kitt_agent/ # Self-contained agent daemon

Key Design Decisions¶

Dataclasses are used for result types and internal data structures.
Pydantic v2 is used for configuration validation (YAML configs are loaded and validated through Pydantic models).
Click is the CLI framework; Rich provides tables, panels, and spinners for terminal output.
Logging uses logging.getLogger(__name__) throughout all modules.
Full type hints are required on all public methods.

Web Dashboard¶

The web UI is a Flask application (web/app.py) using TailwindCSS, HTMX, and Alpine.js. It registers page blueprints (Dashboard, Agents, Devon, Models, Campaigns, Quick Test, Results, Settings) and API v1 blueprints under /api/v1/.

Devon Tab¶

When DEVON_URL is set, the Devon tab embeds the Devon web UI in an iframe for integrated model management. A /api/v1/devon/status endpoint checks connectivity. The tab's visibility is controlled via the Settings page and persisted in the web_settings SQLite table.

Settings Persistence¶

UI preferences (such as Devon tab visibility) are stored in a web_settings key-value table managed by SettingsService. Settings are injected into all templates via a Flask context processor.

Relationship to DEVON¶

KITT tests models; DEVON manages and stores them. The two projects share the same technical stack (Poetry, Click, Rich, Python 3.10+, plugin registry pattern). The KITT web dashboard embeds Devon's UI directly when DEVON_URL is configured. DEVON can also export model paths in a format KITT consumes:

devon export --format kitt -o models.txt

Next Steps¶

Engine Lifecycle -- detailed container lifecycle and health checks
Benchmark System -- how benchmarks are defined and executed
Hardware Fingerprinting -- system identification for result organization