Custom Benchmark Configuration¶

Custom benchmarks allow you to define your own evaluation tests using YAML files. They are loaded at runtime by the YAMLBenchmark class from kitt.benchmarks.loader and validated against the TestConfig Pydantic model.

File location¶

Place custom benchmark YAML files in:

configs/tests/quality/custom/

KITT discovers them automatically when listing tests or running suites.

Schema¶

Field	Type	Required	Description
`name`	`str`	Yes	Unique benchmark identifier
`version`	`str`	No	Semantic version (default: `"1.0.0"`)
`category`	`str`	Yes	Must be `quality_custom` for custom benchmarks
`description`	`str`	No	Human-readable description
`warmup`	`WarmupConfig`	No	Warmup phase settings
`dataset`	`DatasetConfig`	No	Dataset source configuration
`prompts`	`PromptConfig`	No	Prompt template and few-shot settings
`sampling`	`SamplingParams`	No	Generation sampling parameters
`evaluation`	`EvaluationConfig`	No	Metrics and answer extraction
`runs`	`int`	No	Number of runs (default: `3`)

DatasetConfig¶

Field	Type	Default	Description
`source`	`str`	`null`	HuggingFace dataset ID
`local_path`	`str`	`null`	Path to a local dataset directory
`split`	`str`	`"test"`	Dataset split to use
`sample_size`	`int`	`null`	Number of samples (`null` = all)

PromptConfig¶

Field	Type	Default	Description
`template`	`str`	`""`	Prompt template with `{question}` placeholder
`few_shot`	`int`	`0`	Number of few-shot examples to prepend
`few_shot_source`	`str`	`"dev"`	Split to draw few-shot examples from

EvaluationConfig¶

Field	Type	Description
`metrics`	`list[str]`	Metric names to compute (e.g. `accuracy`)
`answer_extraction`	`dict`	Rules for extracting answers from output

Example¶

name: example_custom
version: "1.0.0"
category: quality_custom
description: "Example custom benchmark"

warmup:
  enabled: true
  iterations: 3
  log_warmup_times: true

dataset:
  local_path: ./datasets/custom/
  sample_size: 10

prompts:
  template: |
    Answer the following question concisely.

    Question: {question}
    Answer:

sampling:
  temperature: 0.0
  max_tokens: 256

evaluation:
  metrics:
    - accuracy

runs: 1

Creating a custom benchmark¶

Copy the example file to a new YAML in configs/tests/quality/custom/:

cp configs/tests/quality/custom/example_custom.yaml \
   configs/tests/quality/custom/my_test.yaml

Edit the file: set name, category: quality_custom, configure the dataset source (HuggingFace source or local_path), and customize the prompt template.
Verify the benchmark loads:
```
kitt test list
```

Run it as part of a suite or directly:

kitt run -m /path/to/model -e vllm -s my_suite