# GAIA Benchmark Integration

## What is already supported
QitOS has a working GAIA adapter and runnable agent pipeline:
- Adapter: `qitos/benchmark/gaia/adapter.py`
- Canonical conversion: GAIA row -> `Task`
- Runtime: the standard `Engine` loop (no benchmark-specific runtime fork)
- Example runner: `examples/real/open_deep_research_gaia_agent.py`
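To make the row-to-`Task` conversion concrete, here is a minimal sketch. The real logic lives in `qitos/benchmark/gaia/adapter.py`; the `Task` fields below are stand-ins, not the actual QitOS API, and the row keys follow the public GAIA metadata schema.

```python
# Illustrative sketch only: field names on Task are assumptions, not the
# actual QitOS Task type. Row keys follow the public GAIA metadata.
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Task:  # hypothetical stand-in for the QitOS Task type
    task_id: str
    prompt: str
    expected_answer: Optional[str] = None
    attachments: List[str] = field(default_factory=list)
    metadata: Dict[str, Any] = field(default_factory=dict)


def gaia_row_to_task(row: Dict[str, Any]) -> Task:
    """Convert one GAIA dataset row into a Task-like object.

    GAIA rows carry a question, an optional attached file, and (on the
    validation split) a reference answer.
    """
    attachments = [row["file_name"]] if row.get("file_name") else []
    return Task(
        task_id=row["task_id"],
        prompt=row["Question"],
        expected_answer=row.get("Final answer"),
        attachments=attachments,
        metadata={"level": row.get("Level")},
    )
```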
## Why this matters

You can evaluate agent designs with the same kernel used in product agents:

- the same `AgentModule` + `Engine`
- the same hooks/trace/`qita` inspection
- the same env/tool abstractions

This keeps research and engineering on one path.
## Quick commands

### Run one GAIA sample
```bash
python examples/real/open_deep_research_gaia_agent.py \
  --workspace ./qitos_gaia_workspace \
  --gaia-download-snapshot \
  --gaia-split validation \
  --gaia-index 0
```
### Run a full split
```bash
python examples/real/open_deep_research_gaia_agent.py \
  --workspace ./qitos_gaia_workspace \
  --gaia-download-snapshot \
  --gaia-split validation \
  --run-all --concurrency 2 --resume
```
### Run only a subset window
```bash
python examples/real/open_deep_research_gaia_agent.py \
  --workspace ./qitos_gaia_workspace \
  --gaia-download-snapshot \
  --gaia-split validation \
  --run-all --start-index 100 --limit 50 --resume
```
## Output artifacts
- Per-task answer file in task workspace
- Standard run traces (manifest/events)
- Aggregate benchmark JSONL (written to the workspace root unless `--output-jsonl` is set)
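The aggregate JSONL can also be post-processed directly. A minimal sketch, assuming each record carries `answer` and `expected` fields and a `results.jsonl` filename; the actual schema and path depend on how the runner was invoked, so check the written records first.

```python
# Minimal post-processing sketch for the aggregate benchmark JSONL.
# Field names ("answer", "expected") and the filename are assumptions,
# not the runner's documented output schema.
import json
from pathlib import Path

results_path = Path("./qitos_gaia_workspace/results.jsonl")  # hypothetical path

total = 0
exact_matches = 0
with results_path.open() as fh:
    for line in fh:
        record = json.loads(line)
        total += 1
        predicted = str(record.get("answer", "")).strip().lower()
        expected = str(record.get("expected", "")).strip().lower()
        exact_matches += int(predicted != "" and predicted == expected)

print(f"{exact_matches}/{total} exact matches")
```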
Then inspect with: