# GAIA Benchmark Integration

## What is already supported
QitOS has a working GAIA adapter and runnable agent pipeline:
- Adapter: `qitos/benchmark/gaia/adapter.py`
- Canonical conversion: GAIA row -> `Task`
- Runtime: the standard `Engine` loop (no benchmark-specific runtime fork)
- Example runner: `examples/real/open_deep_research_gaia_agent.py`
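To make the row-to-`Task` conversion concrete, here is a minimal sketch. The real logic lives in `qitos/benchmark/gaia/adapter.py`; the `Task` fields below are stand-ins, not the actual QitOS API, and the row keys follow the public GAIA metadata schema.

```python
# Illustrative sketch only: field names on Task are assumptions, not the
# actual QitOS Task type. Row keys follow the public GAIA metadata.
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Task:  # hypothetical stand-in for the QitOS Task type
    task_id: str
    prompt: str
    expected_answer: Optional[str] = None
    attachments: List[str] = field(default_factory=list)
    metadata: Dict[str, Any] = field(default_factory=dict)


def gaia_row_to_task(row: Dict[str, Any]) -> Task:
    """Convert one GAIA dataset row into a Task-like object.

    GAIA rows carry a question, an optional attached file, and (on the
    validation split) a reference answer.
    """
    attachments = [row["file_name"]] if row.get("file_name") else []
    return Task(
        task_id=row["task_id"],
        prompt=row["Question"],
        expected_answer=row.get("Final answer"),
        attachments=attachments,
        metadata={"level": row.get("Level")},
    )
```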
## Why this matters

You can evaluate agent designs with the same kernel used in product agents:

- the same `AgentModule` + `Engine`
- the same hooks/trace/`qita` inspection
- the same env/tool abstractions

This keeps research and engineering on one path.
## Quick commands

### Run one GAIA sample
```bash
python examples/real/open_deep_research_gaia_agent.py \
  --workspace ./qitos_gaia_workspace \
  --gaia-download-snapshot \
  --gaia-split validation \
  --gaia-index 0
```
### Run a full split
```bash
python examples/real/open_deep_research_gaia_agent.py \
  --workspace ./qitos_gaia_workspace \
  --gaia-download-snapshot \
  --gaia-split validation \
  --run-all --concurrency 2 --resume
```
### Run only a subset window
```bash
python examples/real/open_deep_research_gaia_agent.py \
  --workspace ./qitos_gaia_workspace \
  --gaia-download-snapshot \
  --gaia-split validation \
  --run-all --start-index 100 --limit 50 --resume
```
## Output artifacts
- Per-task answer file in task workspace
- Standard run traces (manifest/events)
- Aggregate benchmark JSONL (written to the workspace root unless `--output-jsonl` is set)
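The aggregate JSONL can also be post-processed directly. A minimal sketch, assuming each record carries `answer` and `expected` fields and a `results.jsonl` filename; the actual schema and path depend on how the runner was invoked, so check the written records first.

```python
# Minimal post-processing sketch for the aggregate benchmark JSONL.
# Field names ("answer", "expected") and the filename are assumptions,
# not the runner's documented output schema.
import json
from pathlib import Path

results_path = Path("./qitos_gaia_workspace/results.jsonl")  # hypothetical path

total = 0
exact_matches = 0
with results_path.open() as fh:
    for line in fh:
        record = json.loads(line)
        total += 1
        predicted = str(record.get("answer", "")).strip().lower()
        expected = str(record.get("expected", "")).strip().lower()
        exact_matches += int(predicted != "" and predicted == expected)

print(f"{exact_matches}/{total} exact matches")
```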
Then inspect with: