Benchmark System¶
KITT's benchmark system is built on a plugin architecture that supports both built-in and custom benchmarks. Benchmarks are registered via decorators, can be defined in Python or YAML, and support checkpoint recovery for long-running evaluations.
Benchmark ABC¶
The LLMBenchmark abstract base class in benchmarks/base.py defines the interface every benchmark must implement:
- run(engine, config) -- Public entry point. Handles setup, calls _execute(), and collects results. This method manages checkpoint loading and saving.
- _execute(engine, config) -- The actual benchmark logic. Subclasses override this to implement their specific evaluation.
All benchmarks receive an initialized engine instance and a configuration object, and return structured result data.
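The contract above can be sketched as a minimal Python class pair. Only the method names (run, _execute) come from the docs; the return shape and the EchoBenchmark subclass are illustrative assumptions, and the checkpoint handling inside run() is elided:

```python
from abc import ABC, abstractmethod


class LLMBenchmark(ABC):
    """Sketch of the benchmark base-class contract (method names from the docs)."""

    def run(self, engine, config):
        # Public entry point. In the real class this also loads/saves
        # checkpoints around the call to _execute().
        results = self._execute(engine, config)
        return {"benchmark": type(self).__name__, "results": results}

    @abstractmethod
    def _execute(self, engine, config):
        """Benchmark-specific logic, provided by each subclass."""


class EchoBenchmark(LLMBenchmark):
    """Hypothetical trivial subclass, for illustration only."""

    def _execute(self, engine, config):
        return {"ok": True}
```

A subclass only implements _execute(); callers always go through run(), which keeps checkpointing and result collection in one place.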
BenchmarkRegistry¶
The BenchmarkRegistry in benchmarks/registry.py maintains the mapping from benchmark names to their implementation classes. Registration uses the @register_benchmark decorator:
@register_benchmark("throughput")
class ThroughputBenchmark(LLMBenchmark):
    def _execute(self, engine, config):
        ...
At startup, all built-in benchmarks are auto-discovered and registered, similar to the engine plugin system.
Built-in Benchmarks¶
Performance Benchmarks¶
Performance benchmarks measure inference engine speed and resource usage. They do not evaluate output quality.
| Benchmark | What It Measures |
|---|---|
| throughput | Tokens per second at various batch sizes and sequence lengths |
| latency | Time-to-first-token (TTFT) and inter-token latency (ITL) |
| memory | Peak GPU memory usage under different loads |
| warmup_analysis | Performance difference between cold-start and warmed-up inference |
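As a rough illustration of what a throughput measurement does, here is a sketch that times batched generation and divides total output tokens by wall-clock time. The measure_throughput helper and the generate callable's signature are assumptions, not KITT's actual API:

```python
import time


def measure_throughput(generate, prompts, batch_size):
    """Hypothetical helper: total output tokens per second for one batch size.

    `generate` is assumed to take a list of prompts and return a list of
    token sequences, one per prompt.
    """
    start = time.perf_counter()
    total_tokens = 0
    for i in range(0, len(prompts), batch_size):
        batch = prompts[i:i + batch_size]
        outputs = generate(batch)  # engine call (assumed interface)
        total_tokens += sum(len(tokens) for tokens in outputs)
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed
```

A real throughput benchmark would repeat this across the configured batch sizes and sequence lengths and report each combination separately.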
Quality Benchmarks¶
Quality benchmarks evaluate the correctness and consistency of model outputs. These use standard academic evaluation datasets.
| Benchmark | Dataset | What It Measures |
|---|---|---|
| mmlu | MMLU | Multitask language understanding across 57 subjects |
| gsm8k | GSM8K | Grade-school math reasoning with chain-of-thought |
| truthfulqa | TruthfulQA | Resistance to generating common falsehoods |
| hellaswag | HellaSwag | Common-sense sentence completion |
YAML-Defined Benchmarks¶
Custom benchmarks can be defined in YAML without writing Python code. The YAMLBenchmark class in benchmarks/loader.py loads YAML files and creates benchmark instances at runtime.
name: custom_throughput
category: performance
base: throughput
config:
  batch_sizes: [1, 4, 8]
  sequence_lengths: [128, 512]
  num_iterations: 5
YAML benchmarks reference a base benchmark and override specific configuration values. This makes it easy to create variants of built-in benchmarks tuned for specific hardware or use cases.
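The override behavior amounts to a shallow merge of the YAML config over the base benchmark's defaults. A minimal sketch, assuming the YAML file has already been parsed into a dict (the build_yaml_benchmark helper and the base_defaults argument are invented names, not KITT's loader API):

```python
def build_yaml_benchmark(spec, base_defaults):
    """Hypothetical merge: the YAML spec's `config` keys win over the
    base benchmark's default configuration."""
    merged = {**base_defaults, **spec.get("config", {})}
    return spec["name"], spec["base"], merged
```

Keys present in the YAML override the base values; keys absent from the YAML fall through to the base benchmark's defaults.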
Checkpoint Recovery¶
Long-running benchmarks (especially quality evaluations that process thousands of dataset items) save checkpoints periodically so that interrupted runs can be resumed.
The CheckpointManager handles checkpoint persistence:
- Save interval -- Every 100 items processed.
- Checkpoint contents -- Completed items, partial results, progress index, configuration snapshot.
- Recovery -- On restart, if a matching checkpoint exists, the benchmark resumes from where it left off rather than starting over.
Checkpoints are stored locally and are identified by a combination of model, engine, and benchmark name.
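The save-every-N-items pattern can be sketched as below. The class name CheckpointManager and the 100-item interval come from the docs; the JSON file format, the filename scheme, and the method names are assumptions:

```python
import json
import os


class CheckpointManager:
    """Sketch: periodic JSON checkpoints keyed by model, engine, and benchmark."""

    def __init__(self, model, engine, benchmark, directory=".", interval=100):
        # Identity = model + engine + benchmark (separator scheme assumed).
        self.path = os.path.join(
            directory, f"{model}__{engine}__{benchmark}.ckpt.json")
        self.interval = interval

    def maybe_save(self, index, results):
        """Persist progress every `interval` processed items."""
        if (index + 1) % self.interval == 0:
            with open(self.path, "w") as f:
                json.dump({"progress": index + 1, "results": results}, f)

    def load(self):
        """Return the saved checkpoint dict, or None if none exists."""
        if os.path.exists(self.path):
            with open(self.path) as f:
                return json.load(f)
        return None
```

On restart, the benchmark calls load() first and skips the first `progress` items if a checkpoint is found.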
Test Suites¶
Suites group multiple benchmarks into a single run. KITT ships with three predefined suites:
| Suite | Purpose | Benchmarks | Runs per Benchmark |
|---|---|---|---|
| quick | Smoke test | throughput only | 1 |
| standard | Full evaluation | all quality + all performance | 3 |
| performance | Performance-focused | throughput, latency, memory, warmup | 3 |
Suite Configuration¶
Suites are defined in YAML configuration files. A suite config specifies which benchmarks to run, global settings, and optional per-test overrides:
name: standard
runs: 3
global_config:
  max_tokens: 256
  temperature: 0.0
benchmarks:
  - throughput
  - latency
  - memory
  - warmup_analysis
  - mmlu
  - gsm8k
  - truthfulqa
  - hellaswag
test_overrides:
  mmlu:
    max_tokens: 64
  gsm8k:
    max_tokens: 512
The global_config applies to all benchmarks unless overridden. The test_overrides section allows per-test configuration within the suite.
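The resulting precedence (test_overrides over global_config) reduces to a two-level dict merge. A minimal sketch with an invented helper name, assuming the suite YAML has already been parsed into a dict:

```python
def config_for(benchmark, suite):
    """Hypothetical helper: effective config for one benchmark in a suite.

    Starts from global_config, then applies that benchmark's entry in
    test_overrides, if any (key names from the docs).
    """
    cfg = dict(suite.get("global_config", {}))
    cfg.update(suite.get("test_overrides", {}).get(benchmark, {}))
    return cfg
```

With the suite above, mmlu would run with max_tokens 64 but still inherit temperature 0.0, while benchmarks without an override keep max_tokens 256.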
SuiteRunner¶
The SuiteRunner in runners/ orchestrates suite execution:
1. Load and validate the suite configuration.
2. For each benchmark in the suite, for each run iteration:
   - Initialize the engine (if not already running).
   - Execute the benchmark via run().
   - Collect results and GPU memory data.
3. Aggregate results across runs (mean, stddev, min, max).
4. Generate reports (JSON, Markdown).
5. Persist results through KARR.
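The aggregation step above (mean, stddev, min, max across run iterations) can be sketched with the standard library; the aggregate function name and result keys are illustrative, not KITT's actual API:

```python
import statistics


def aggregate(run_values):
    """Summarize one metric's values across repeated runs."""
    return {
        "mean": statistics.mean(run_values),
        # stdev needs at least two samples; a single run has no spread.
        "stddev": statistics.stdev(run_values) if len(run_values) > 1 else 0.0,
        "min": min(run_values),
        "max": max(run_values),
    }
```

For the standard suite (3 runs per benchmark), each reported metric would carry these four statistics rather than a single number.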
Next Steps¶
- Engine Lifecycle -- how engines are started before benchmarks run
- KARR (Results Storage) -- how benchmark results are stored