# Evaluate & Metric
## Separation of concerns
QitOS uses two layers:
- `qitos.evaluate`: task-level judgement for one trajectory.
- `qitos.metric`: benchmark-level aggregation over many runs.
This keeps custom success logic independent from reporting logic.
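In practice a full benchmark run is a pipeline across the two layers: judge each trajectory, collect one `MetricInput` row per run, then aggregate. Here is a minimal sketch of that flow, assuming your run artifacts expose `task`, `manifest`, `task_id`, `trial`, and a final `reward` (the `runs` iterable and its field names are placeholders, not part of the QitOS API):

```python
from qitos.evaluate import EvaluationContext, EvaluationSuite
from qitos.kit.evaluate import RuleBasedEvaluator
from qitos.kit.metric import SuccessRateMetric
from qitos.metric import MetricInput, MetricRegistry

# Layer 1: task-level judgement, one trajectory at a time.
suite = EvaluationSuite([RuleBasedEvaluator(min_reward=1.0)], mode="all")

rows = []
for run in runs:  # `runs` and its fields are placeholders for your own artifacts
    out = suite.evaluate(
        EvaluationContext(task=run.task, manifest=run.manifest, extras={"reward": run.reward})
    )
    rows.append(MetricInput(task_id=run.task_id, trial=run.trial, success=out.success))

# Layer 2: benchmark-level aggregation over all rows.
reports = MetricRegistry([SuccessRateMetric()]).compute_all(rows)
```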
## Core interfaces

### Evaluate
- `TrajectoryEvaluator`: base class for task-level judgement of a single trajectory.
- `EvaluationContext`: the inputs for one judgement: the task, its manifest, and free-form `extras`.
- `EvaluationResult`: the outcome, carrying at least `success` and `score`.
- `EvaluationSuite`: runs a list of evaluators and combines their verdicts according to a `mode`.
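Custom success logic plugs in by subclassing `TrajectoryEvaluator`. A minimal sketch, assuming the hook is a single `evaluate(context)` method returning an `EvaluationResult` (the method name and the `EvaluationResult` constructor shown here are assumptions; check the interface reference):

```python
from qitos.evaluate import EvaluationContext, EvaluationResult, TrajectoryEvaluator

class KeywordEvaluator(TrajectoryEvaluator):
    """Succeeds when the trajectory's final answer contains a keyword (illustrative only)."""

    def __init__(self, keyword: str):
        self.keyword = keyword

    # Assumed hook: the abstract method name may differ in qitos.evaluate.
    def evaluate(self, context: EvaluationContext) -> EvaluationResult:
        # `extras` is the free-form dict seen in the usage example below.
        answer = str(context.extras.get("answer", ""))
        hit = self.keyword in answer
        # Assumed constructor: a success flag plus a numeric score.
        return EvaluationResult(success=hit, score=1.0 if hit else 0.0)
```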
### Metric
- `Metric`: base class for one benchmark-level aggregation.
- `MetricInput`: one row per (task, trial) run.
- `MetricReport`: the computed result of a metric.
- `MetricRegistry`: computes a set of metrics over all rows at once.
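A custom aggregation subclasses `Metric`. As a minimal sketch, assuming the hook is a `compute(rows)` method and that `MetricReport` takes a name and a value (both the method name and the constructor are assumptions):

```python
from qitos.metric import Metric, MetricInput, MetricReport

class TrialCountMetric(Metric):
    """Counts how many (task, trial) rows the benchmark produced (illustrative only)."""

    # Assumed hook: the abstract method name and MetricReport fields may differ.
    def compute(self, rows: list[MetricInput]) -> MetricReport:
        return MetricReport(name="trial_count", value=len(rows))
```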
## Kit implementations

### Evaluators (`qitos.kit.evaluate`)
- `RuleBasedEvaluator`
- `DSLEvaluator`
- `ModelBasedEvaluator`
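These compose through `EvaluationSuite`. The sketch below pairs a reward threshold with a DSL check; only `RuleBasedEvaluator(min_reward=...)` is confirmed by the usage example later on, and the `DSLEvaluator` expression syntax shown is purely illustrative:

```python
from qitos.evaluate import EvaluationSuite
from qitos.kit.evaluate import DSLEvaluator, RuleBasedEvaluator

suite = EvaluationSuite(
    [
        RuleBasedEvaluator(min_reward=1.0),   # confirmed constructor
        DSLEvaluator("extras.reward >= 1.0"), # hypothetical expression syntax
    ],
    mode="all",  # assumed semantics: every evaluator must pass
)
```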
### Metrics (`qitos.kit.metric`)
- `SuccessRateMetric`
- `AverageRewardMetric`
- `RewardSuccessRateMetric` (success derived from reward ≈ 1)
- `RewardPassHatMetric` (τ-style pass^k series)
- `PassAtKMetric`
- `MeanStepsMetric`
- `StopReasonDistributionMetric`
- `CustomFieldMetric`
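For the two pass-style metrics, the common conventions are as follows (whether QitOS uses these exact estimators is an assumption). With $n$ trials per task of which $c$ succeed, pass@k is the unbiased estimator popularized by the Codex paper, while τ-bench's pass^k is the probability that $k$ sampled trials all succeed:

$$
\text{pass@}k = \mathbb{E}_{\text{tasks}}\!\left[1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\right],
\qquad
\text{pass}^k = \mathbb{E}_{\text{tasks}}\!\left[\frac{\binom{c}{k}}{\binom{n}{k}}\right]
$$

Both average over tasks, but pass@k rewards at least one success among $k$ trials while pass^k demands all $k$ succeed, so pass@k grows with $k$ and pass^k shrinks.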
## Minimal usage
```python
from qitos.evaluate import EvaluationContext, EvaluationSuite
from qitos.kit.evaluate import RuleBasedEvaluator

# One rule: the trajectory must reach a reward of at least 1.0.
suite = EvaluationSuite([RuleBasedEvaluator(min_reward=1.0)], mode="all")

# `task` and `manifest` come from your own run artifacts; extras carries auxiliary signals.
out = suite.evaluate(EvaluationContext(task=task, manifest=manifest, extras={"reward": 1.0}))
print(out.success, out.score)
```
```python
from qitos.metric import MetricInput, MetricRegistry
from qitos.kit.metric import SuccessRateMetric, PassAtKMetric

# One MetricInput row per (task, trial) run.
rows = [
    MetricInput(task_id="a", trial=0, success=True),
    MetricInput(task_id="a", trial=1, success=False),
]
reports = MetricRegistry([SuccessRateMetric(), PassAtKMetric(k=1)]).compute_all(rows)
```
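Rows group by `task_id`, so pass@k needs several trials per task. A sketch with two tasks and two trials each, assuming `PassAtKMetric` accepts any `k` up to the trial count:

```python
from qitos.metric import MetricInput, MetricRegistry
from qitos.kit.metric import PassAtKMetric

rows = [
    MetricInput(task_id="a", trial=0, success=True),
    MetricInput(task_id="a", trial=1, success=False),
    MetricInput(task_id="b", trial=0, success=False),
    MetricInput(task_id="b", trial=1, success=False),
]
# Under the usual estimator, with n = k = 2 a task scores 1 if any trial
# succeeded: task "a" scores 1, task "b" scores 0, giving pass@2 = 0.5.
reports = MetricRegistry([PassAtKMetric(k=2)]).compute_all(rows)
print(reports)
```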