
# Benchmarks

KITT ships with quality and performance benchmarks organized into test suites. You can also define custom benchmarks in YAML.

## Test Suites

| Suite | Description | Benchmarks | Default Runs |
| --- | --- | --- | --- |
| `quick` | Smoke test | Throughput only | 1 |
| `standard` | Full evaluation | All quality + performance | 3 |
| `performance` | Performance-focused | Throughput, latency, memory, warmup | 3 |
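
A suite is selected with the `-s` flag of `kitt run` (covered in detail under Running Benchmarks below). For example:

```bash
# Run the performance-focused suite against llama3 via Ollama
kitt run -m llama3 -e ollama -s performance
```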

## Built-in Benchmarks

### Quality

| Benchmark | Description |
| --- | --- |
| MMLU | Massive Multitask Language Understanding, a broad knowledge evaluation |
| GSM8K | Grade-school math reasoning |
| TruthfulQA | Factual consistency evaluation |
| HellaSwag | Commonsense reasoning |

Quality benchmarks require the `datasets` extra (`poetry install -E datasets`).

### Performance

| Benchmark | Description |
| --- | --- |
| Throughput | Requests per second at various concurrency levels |
| Latency | Time-to-first-token and end-to-end response time |
| Memory | Peak VRAM and CPU memory usage during inference |
| Warmup Analysis | Performance stabilization over the initial requests |

## Running Benchmarks

Use `kitt run` to execute a test suite against a model and engine:

```bash
kitt run -m MODEL -e ENGINE -s SUITE -o OUTPUT
```

| Option | Short | Description |
| --- | --- | --- |
| `--model` | `-m` | Path to the model or a model identifier (required) |
| `--engine` | `-e` | Inference engine key (required) |
| `--suite` | `-s` | Test suite: `quick`, `standard`, or `performance` (default: `quick`) |
| `--output` | `-o` | Output directory for results |
| `--runs` | | Override the number of runs per benchmark |
| `--skip-warmup` | | Skip the warmup phase |
| `--config` | | Path to a custom engine configuration YAML |
| `--store-karr` | | Also store results in KARR's legacy Git-backed backend |

Examples:

```bash
# Quick throughput test with Ollama
kitt run -m llama3 -e ollama

# Full standard suite with vLLM
kitt run -m /models/llama2-7b -e vllm -s standard -o ./my-results

# Performance suite with legacy Git-backed KARR storage
kitt run -m /models/mistral-7b -e llama_cpp -s performance --store-karr

# Override run count
kitt run -m /models/qwen-7b -e ollama -s standard --runs 5
```
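
The remaining flags compose the same way. For instance, to supply a custom engine configuration and skip the warmup phase (the config file path here is illustrative):

```bash
# Custom engine configuration, warmup phase skipped
kitt run -m /models/llama2-7b -e vllm -s standard \
  --config ./vllm-tuned.yaml --skip-warmup
```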

## Output Artifacts

Each run produces the following files in the output directory:

| File | Description |
| --- | --- |
| `metrics.json` | Full benchmark metrics in JSON format |
| `hardware.json` | Detected hardware information |
| `config.json` | Configuration used for the run |
| `summary.md` | Human-readable Markdown summary |
| `outputs/` | Compressed benchmark outputs (chunked) |
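
A quick way to review a finished run is to read the Markdown summary, or to query `metrics.json` with a JSON tool such as `jq`. The directory below matches the earlier example; the exact key layout of `metrics.json` depends on the suite you ran:

```bash
# Human-readable overview of the run
cat ./my-results/summary.md

# Pretty-print the full metrics
jq . ./my-results/metrics.json
```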

## Custom Benchmarks

### Creating a New Benchmark

Generate a YAML template with `kitt test new`:

```bash
kitt test new my-eval
kitt test new my-eval --category performance
```

This creates a file at `configs/tests/quality/custom/my-eval.yaml` (or in the performance directory if you pass `--category performance`).

### YAML Benchmark Template

```yaml
name: my-eval
category: quality_custom
description: "My custom evaluation"
dataset:
  source: local
  path: ./data/my-eval.jsonl
prompts:
  template: "{question}"
  answer_key: "answer"
sampling:
  max_tokens: 256
  temperature: 0.0
scoring:
  method: exact_match
runs: 3
```

Edit the template to point at your dataset, adjust the prompt template, and configure scoring. KITT picks up custom YAML benchmarks automatically when they are placed under `configs/tests/`.
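
With the template above, each line of the JSONL dataset needs a `question` field (substituted into the prompt template) and an `answer` field (compared against the model output via exact match). A minimal example line for `./data/my-eval.jsonl`:

```json
{"question": "What is the capital of France?", "answer": "Paris"}
```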

## Listing Benchmarks

List all available benchmarks (built-in and custom):

```bash
kitt test list
```

Filter by category:

```bash
kitt test list -c performance
kitt test list -c quality
```

## Checkpoint Recovery

Long-running benchmarks save checkpoints every 100 items. If a run is interrupted, restarting with the same output directory resumes from the last checkpoint rather than starting over.
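
For example, re-running the interrupted command with the same `-o` directory picks up where it left off:

```bash
# Initial run, interrupted partway through
kitt run -m /models/llama2-7b -e vllm -s standard -o ./my-results

# Same command and output directory: resumes from the last checkpoint
kitt run -m /models/llama2-7b -e vllm -s standard -o ./my-results
```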