Lab 3 - Upgrade PlanAct to Reflexion (30 min, with code)¶
Goal¶
Add structured self-critique with evidence grounding, then measure quality vs cost.
Part A: Define quality dimensions (5 min)¶
quality_axes = {
    "grounding": "claims must be supported by evidence",
    "completeness": "important aspects should not be missing",
    "conciseness": "avoid superfluous claims",
}
print(quality_axes)
Part B: Design reflection loop state (5 min)¶
from dataclasses import dataclass, field
from typing import Any, Dict, List

from qitos import StateSchema


@dataclass
class ReflexionState(StateSchema):
    target_url: str = ""
    page_text: str = ""
    draft_answer: str = ""
    reflections: List[Dict[str, Any]] = field(default_factory=list)
    max_reflections: int = 2
Part C: Implement structured reflexion (10 min)¶
C1. Output contract¶
REFLEXION_PROMPT = """Return valid JSON only:
{
  "answer": "...",
  "citations": [{"source": "source_text", "quote": "exact supporting quote"}],
  "critique": {
    "missing": ["..."],
    "superfluous": ["..."],
    "grounding": ["..."],
    "needs_revision": true
  }
}
"""
C2. Robust JSON parse + loop control¶
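import json
from typing import Any, Dict, Optional

from qitos import Decision


def reflect_once(llm, prompt: str) -> Optional[Dict[str, Any]]:
    """Call the LLM once and return the parsed JSON object, or None if parsing fails."""
    raw = llm([
        {"role": "system", "content": "Return valid JSON only."},
        {"role": "user", "content": prompt},
    ])
    text = str(raw).strip()
    candidates = [text]
    # Fallback: models often wrap the JSON in prose or code fences,
    # so also try the outermost {...} span.
    s, e = text.find("{"), text.rfind("}")
    if s >= 0 and e > s:
        candidates.append(text[s : e + 1])
    for candidate in candidates:
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        # The contract requires a JSON object, not a bare list or scalar.
        if isinstance(parsed, dict):
            return parsed
    return None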
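Once a payload parses, it can help to check it against the output contract from C1 and to confirm that every cited quote really occurs in the page text. The following is a minimal sketch and not part of qitos: `validate_payload` and `ungrounded_quotes` are hypothetical helper names, and the whitespace-insensitive but case-sensitive substring match is an assumed reading of "exact supporting quote".

```python
from typing import Any, Dict, List


def validate_payload(payload: Any) -> List[str]:
    """Return contract violations; an empty list means the shape is valid."""
    if not isinstance(payload, dict):
        return ["payload is not a JSON object"]
    problems: List[str] = []
    if not isinstance(payload.get("answer"), str):
        problems.append("'answer' must be a string")
    citations = payload.get("citations")
    if not isinstance(citations, list):
        problems.append("'citations' must be a list")
    else:
        for i, cite in enumerate(citations):
            ok = (
                isinstance(cite, dict)
                and isinstance(cite.get("source"), str)
                and isinstance(cite.get("quote"), str)
            )
            if not ok:
                problems.append(f"citations[{i}] needs string 'source' and 'quote'")
    critique = payload.get("critique")
    if not isinstance(critique, dict):
        problems.append("'critique' must be an object")
    else:
        for key in ("missing", "superfluous", "grounding"):
            if not isinstance(critique.get(key), list):
                problems.append(f"critique.{key} must be a list")
        if not isinstance(critique.get("needs_revision"), bool):
            problems.append("critique.needs_revision must be a boolean")
    return problems


def _normalize(text: str) -> str:
    """Collapse runs of whitespace so line wrapping does not break matching."""
    return " ".join(text.split())


def ungrounded_quotes(payload: Dict[str, Any], page_text: str) -> List[str]:
    """Return cited quotes that do not occur in page_text (assumed grounding test)."""
    haystack = _normalize(page_text)
    missing: List[str] = []
    citations = payload.get("citations")
    if not isinstance(citations, list):
        return missing
    for cite in citations:
        if isinstance(cite, dict) and isinstance(cite.get("quote"), str):
            quote = _normalize(cite["quote"])
            if not quote or quote not in haystack:
                missing.append(cite["quote"])
    return missing
```

A non-empty result from either helper could be treated as a reason to request another revision, or logged as a grounding failure for Part D.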
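class ReflexionAgent(...):  # base class elided in the lab text
    def decide(self, state: ReflexionState, observation: dict):
        # Include the source text and the previous draft so the critique
        # has evidence to ground against.
        prompt = (
            f"{REFLEXION_PROMPT}\n"
            f"Source text:\n{state.page_text}\n\n"
            f"Previous draft (may be empty):\n{state.draft_answer}"
        )
        payload = reflect_once(self.llm, prompt)
        if payload is None:
            return Decision.final("Failed to produce valid reflexion JSON output")
        state.draft_answer = str(payload.get("answer", "")).strip()
        state.reflections.append(payload)
        critique = payload.get("critique")
        needs_revision = isinstance(critique, dict) and bool(critique.get("needs_revision", False))
        # The payload is already appended, so "<" caps the loop at max_reflections passes.
        if needs_revision and len(state.reflections) < state.max_reflections:
            return Decision.wait("reflexion_revision_cycle")
        return Decision.final(state.draft_answer)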
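To check the loop control without a model or the qitos runtime, the revise-or-stop decision can be exercised in isolation. This is a rough sketch under stated assumptions: `run_reflexion_loop` and the stub functions are hypothetical names, the stubs return already-parsed payloads (with `None` standing in for a JSON parse failure), and the cap is assumed to count total passes, including the first draft.

```python
from typing import Any, Callable, Dict, List, Optional

Payload = Dict[str, Any]


def run_reflexion_loop(
    step: Callable[[str], Optional[Payload]],
    max_reflections: int = 2,
) -> Dict[str, Any]:
    """Drive the revise-or-stop decision outside any agent framework.

    `step` receives the current draft and returns a parsed payload,
    or None to simulate a JSON parse failure.
    """
    reflections: List[Payload] = []
    draft = ""
    while True:
        payload = step(draft)
        if payload is None:
            # Mirror the agent: stop immediately when no valid JSON came back.
            return {"status": "parse_error", "answer": draft, "passes": len(reflections) + 1}
        draft = str(payload.get("answer", "")).strip()
        reflections.append(payload)
        critique = payload.get("critique")
        needs_revision = isinstance(critique, dict) and critique.get("needs_revision") is True
        # Assumption: the cap counts total passes, including the first draft.
        if needs_revision and len(reflections) < max_reflections:
            continue
        return {"status": "ok", "answer": draft, "passes": len(reflections)}


def always_revise(draft: str) -> Payload:
    """Stub model that is never satisfied; only the cap can end the loop."""
    return {"answer": draft + "+", "critique": {"needs_revision": True}}


def satisfied_immediately(draft: str) -> Payload:
    """Stub model that accepts its first draft."""
    return {"answer": "done", "critique": {"needs_revision": False}}
```

Calling `run_reflexion_loop(always_revise, max_reflections=2)` stops after two passes even though the stub keeps asking for revisions, which is the property the cap is meant to guarantee.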
Part D: Run and evaluate (10 min)¶
Run the Reflexion agent on the same inputs as the PlanAct baseline, then compare:
- quality improvements (groundedness/completeness/conciseness)
- step/token cost
- new failure modes (often JSON formatting/parsing)
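The three measurements above can be collected per run with a small summary function. This is a sketch under assumptions, not a qitos facility: `summarize_run` is a hypothetical helper, the mapping from quality axes to critique fields is an assumed reading of the output contract, and the token figure is a whitespace word count used only as a rough proxy for real tokenizer counts.

```python
from typing import Any, Dict, List

# Assumed mapping from each quality axis to the critique field that reports on it.
AXIS_TO_CRITIQUE_FIELD = {
    "grounding": "grounding",
    "completeness": "missing",
    "conciseness": "superfluous",
}


def issue_counts(payload: Dict[str, Any]) -> Dict[str, int]:
    """Count the issues the model reported against each quality axis."""
    critique = payload.get("critique")
    if not isinstance(critique, dict):
        critique = {}
    counts: Dict[str, int] = {}
    for axis, field_name in AXIS_TO_CRITIQUE_FIELD.items():
        items = critique.get(field_name)
        counts[axis] = len(items) if isinstance(items, list) else 0
    return counts


def summarize_run(reflections: List[Dict[str, Any]], raw_outputs: List[str]) -> Dict[str, Any]:
    """Summarize one run.

    reflections: payloads that parsed successfully, in order.
    raw_outputs: every raw model output, including ones that failed to parse.
    """
    empty = {axis: 0 for axis in AXIS_TO_CRITIQUE_FIELD}
    return {
        "steps": len(raw_outputs),
        "parse_failures": len(raw_outputs) - len(reflections),
        # Rough proxy only: whitespace-separated words, not tokenizer tokens.
        "approx_output_tokens": sum(len(text.split()) for text in raw_outputs),
        "issues_first": issue_counts(reflections[0]) if reflections else empty,
        "issues_last": issue_counts(reflections[-1]) if reflections else empty,
    }
```

Comparing `issues_first` with `issues_last` gives a self-reported view of quality change; because the counts come from the model's own critique, they are best paired with an external check such as verifying cited quotes against the page text.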