Benchmark System

KITT's benchmark system is built on a plugin architecture that supports both built-in and custom benchmarks. Benchmarks are registered via decorators, can be defined in Python or YAML, and support checkpoint recovery for long-running evaluations.

Benchmark ABC

The LLMBenchmark abstract base class in benchmarks/base.py defines the interface every benchmark must implement:

  • run(engine, config) -- Public entry point. Handles setup, calls _execute(), and collects results. This method manages checkpoint loading and saving.
  • _execute(engine, config) -- The actual benchmark logic. Subclasses override this to implement their specific evaluation.

All benchmarks receive an initialized engine instance and a configuration object, and return structured result data.
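
In outline, the interface looks something like the sketch below. This is a minimal illustration, not the actual contents of benchmarks/base.py; in particular, the real run() also manages checkpoint loading and saving, which the sketch only notes in a comment.

from abc import ABC, abstractmethod

class LLMBenchmark(ABC):
    # Sketch of the benchmark interface. The real run() additionally
    # handles checkpoint loading/saving around the benchmark logic.

    def run(self, engine, config):
        # Public entry point: setup, execution, result collection.
        results = self._execute(engine, config)
        return results

    @abstractmethod
    def _execute(self, engine, config):
        # Benchmark-specific evaluation logic; every subclass overrides this.
        ...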

BenchmarkRegistry

The BenchmarkRegistry in benchmarks/registry.py maintains the mapping from benchmark names to their implementation classes. Registration uses the @register_benchmark decorator:

@register_benchmark("throughput")
class ThroughputBenchmark(LLMBenchmark):
    def _execute(self, engine, config):
        ...

At startup, all built-in benchmarks are auto-discovered and registered, similar to the engine plugin system.
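
A name-to-class registry like this can be sketched in a few lines. The _REGISTRY dict and get_benchmark lookup below are assumptions about the shape of benchmarks/registry.py, not its documented API:

_REGISTRY = {}

def register_benchmark(name):
    # Decorator that maps a benchmark name to its implementing class.
    def decorator(cls):
        _REGISTRY[name] = cls
        return cls
    return decorator

def get_benchmark(name):
    # Look up a registered benchmark class by name.
    return _REGISTRY[name]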

Built-in Benchmarks

Performance Benchmarks

Performance benchmarks measure inference engine speed and resource usage. They do not evaluate output quality.

  • throughput -- Tokens per second at various batch sizes and sequence lengths.
  • latency -- Time-to-first-token (TTFT) and inter-token latency (ITL).
  • memory -- Peak GPU memory usage under different loads.
  • warmup_analysis -- Performance difference between cold-start and warmed-up inference.

Quality Benchmarks

Quality benchmarks evaluate the correctness and consistency of model outputs. These use standard academic evaluation datasets.

  • mmlu (MMLU) -- Multitask language understanding across 57 subjects.
  • gsm8k (GSM8K) -- Grade-school math reasoning with chain-of-thought.
  • truthfulqa (TruthfulQA) -- Resistance to generating common falsehoods.
  • hellaswag (HellaSwag) -- Common-sense sentence completion.

YAML-Defined Benchmarks

Custom benchmarks can be defined in YAML without writing Python code. The YAMLBenchmark class in benchmarks/loader.py loads YAML files and creates benchmark instances at runtime.

name: custom_throughput
category: performance
base: throughput
config:
  batch_sizes: [1, 4, 8]
  sequence_lengths: [128, 512]
  num_iterations: 5

YAML benchmarks reference a base benchmark and override specific configuration values. This makes it easy to create variants of built-in benchmarks tuned for specific hardware or use cases.
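
Conceptually, loading a YAML benchmark amounts to resolving the base class from the registry and layering the YAML config over it. A rough sketch follows; load_yaml_benchmark and the overrides attribute are illustrative names, not the loader's actual API:

import yaml

def load_yaml_benchmark(path, registry):
    # Build a benchmark instance from a YAML definition (illustrative).
    with open(path) as f:
        spec = yaml.safe_load(f)
    base_cls = registry[spec["base"]]             # resolve the base benchmark
    benchmark = base_cls()
    benchmark.overrides = spec.get("config", {})  # assumed attribute name
    return benchmark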

Create a new custom benchmark with:

kitt test new my-benchmark

Checkpoint Recovery

Long-running benchmarks (especially quality evaluations that process thousands of dataset items) save checkpoints periodically so that interrupted runs can be resumed.

The CheckpointManager handles checkpoint persistence:

  • Save interval -- Every 100 items processed.
  • Checkpoint contents -- Completed items, partial results, progress index, configuration snapshot.
  • Recovery -- On restart, if a matching checkpoint exists, the benchmark resumes from where it left off rather than starting over.

Checkpoints are stored locally and are identified by a combination of model, engine, and benchmark name.
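
Put together, a resumable evaluation loop might look like the sketch below. CheckpointManager's real method names are not specified here, so save(), load(), and the run_with_checkpoints helper are assumptions:

import json
import os

class CheckpointManager:
    # Illustrative sketch: checkpoints are stored locally as JSON and
    # keyed by model, engine, and benchmark name.
    def __init__(self, model, engine, benchmark, directory="checkpoints"):
        os.makedirs(directory, exist_ok=True)
        self.path = os.path.join(directory, f"{model}-{engine}-{benchmark}.json")

    def save(self, state):
        with open(self.path, "w") as f:
            json.dump(state, f)

    def load(self):
        if not os.path.exists(self.path):
            return None
        with open(self.path) as f:
            return json.load(f)

def run_with_checkpoints(manager, items, evaluate):
    # Resume from a saved checkpoint if one exists, then save every
    # 100 processed items so an interrupted run can pick up here.
    state = manager.load() or {"index": 0, "results": []}
    for i in range(state["index"], len(items)):
        state["results"].append(evaluate(items[i]))
        state["index"] = i + 1
        if state["index"] % 100 == 0:
            manager.save(state)
    return state["results"]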

Test Suites

Suites group multiple benchmarks into a single run. KITT ships with three predefined suites:

  • quick -- Smoke test; runs throughput only, with 1 run per benchmark.
  • standard -- Full evaluation; runs all quality and all performance benchmarks, with 3 runs per benchmark.
  • performance -- Performance-focused; runs throughput, latency, memory, and warmup_analysis, with 3 runs per benchmark.

Suite Configuration

Suites are defined in YAML configuration files. A suite config specifies which benchmarks to run, global settings, and optional per-test overrides:

name: standard
runs: 3
global_config:
  max_tokens: 256
  temperature: 0.0
benchmarks:
  - throughput
  - latency
  - memory
  - warmup_analysis
  - mmlu
  - gsm8k
  - truthfulqa
  - hellaswag
test_overrides:
  mmlu:
    max_tokens: 64
  gsm8k:
    max_tokens: 512

The global_config applies to all benchmarks unless overridden. The test_overrides section allows per-test configuration within the suite.
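
The merge is a straightforward dict layering: per-test overrides win over global settings. A sketch (effective_config is an illustrative name):

def effective_config(suite, benchmark_name):
    # Start from the global settings, then apply any per-test override.
    cfg = dict(suite.get("global_config", {}))
    cfg.update(suite.get("test_overrides", {}).get(benchmark_name, {}))
    return cfg

# For the standard suite above:
#   effective_config(suite, "mmlu")    -> {"max_tokens": 64,  "temperature": 0.0}
#   effective_config(suite, "latency") -> {"max_tokens": 256, "temperature": 0.0}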

SuiteRunner

The SuiteRunner in runners/ orchestrates suite execution:

  1. Load and validate the suite configuration.
  2. For each benchmark in the suite, for each run iteration:
    • Initialize the engine (if not already running).
    • Execute the benchmark via run().
    • Collect results and GPU memory data.
  3. Aggregate results across runs (mean, stddev, min, max), as sketched after this list.
  4. Generate reports (JSON, Markdown).
  5. Persist results through KARR.
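
Step 3's aggregation is plain descriptive statistics over the per-run values of each metric; something like:

from statistics import mean, stdev

def aggregate(values):
    # Summarize one metric across repeated runs (illustrative helper).
    return {
        "mean": mean(values),
        "stddev": stdev(values) if len(values) > 1 else 0.0,
        "min": min(values),
        "max": max(values),
    }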
