Source Snapshot
- Origin: Introducing Rubrics: Build Agents that Evaluate and Correct Their Work
- Type: Product note
- One-line takeaway: Rubric-guided evaluation can turn explicit completion criteria into an automated generate, grade, correct, and re-grade loop, while iteration caps and human review remain necessary boundaries.
Garden Card
Use this note to assess whether rubric-guided self-correction can improve the reliability of bounded enterprise agent workflows without treating model-based grading as proof of correctness.
-
Core question: How can an agent evaluate its work against explicit acceptance criteria and correct deficiencies before returning a result?
-
Operational value: It can reduce manual inspection and reruns for tasks with verifiable completion criteria, such as test execution, required-section coverage, and forbidden-pattern detection.
-
Best connection: Agent Loop, Bounded Agent, and Enterprise Agent Governance.
1. Executive Summary
LangChain’s beta RubricMiddleware adds a dedicated grader sub-agent to a Deep Agents run, allowing generated work to be checked against explicit criteria and revised with criterion-specific feedback. The pattern is operationally useful where success can be verified through tests, validators, required content, or prohibited patterns. It is best suited to bounded, observable workflows: every retry adds cost and latency, and a grader that relies on model judgment alone can reproduce the working agent’s ambiguity or errors. Enterprise use should therefore combine precise rubrics, evidence-producing tools, iteration limits, failure-state handling, and human review for consequential decisions.
-
Main idea: Separate work generation from evaluation, then feed structured grading feedback back to the working agent until the criteria pass or the loop reaches a configured terminal state.
-
Why now: Longer agent runs accumulate ambiguity, tool errors, context pressure, and probabilistic variation, making first-pass completion increasingly unreliable.
-
Where it applies: Code generation with executable tests, structured reports with mandatory sections, compliance-oriented content checks, and other workflows with explicit and verifiable acceptance criteria.
Decision Signal
If I only remember one thing from this note, it should be:
A self-correction loop improves reliability only when “done” is defined as observable criteria and the grader can collect credible evidence.
2. Key Technical Terms
Use these terms consistently when designing or reviewing rubric-guided agent systems.
-
Rubric: An explicit checklist describing the conditions that a completed agent run must satisfy.
-
Grader sub-agent: A separate agent that evaluates the working agent’s transcript and outputs a verdict with feedback for each criterion.
-
RubricMiddleware: The beta Deep Agents component described by the source that runs grading before completion and triggers another iteration when criteria remain unsatisfied.
-
Tool-grounded evaluation: Evaluation based on evidence collected through tools, such as executing tests or validators, rather than relying only on language-model judgment.
-
Iteration cap: The configured maximum number of correction and re-grading cycles used to bound cost, latency, and runaway behavior.
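The terms Rubric and Tool-grounded evaluation can be made concrete with a small sketch. It assumes a rubric is a list of named checks, each returning a pass flag plus feedback. The names Criterion, requires_sections, forbids_pattern, and grade are hypothetical illustrations and are not part of the RubricMiddleware API.

```python
import re
from dataclasses import dataclass
from typing import Callable

# Hypothetical shape: a check takes the output text and returns (passed, feedback).
Check = Callable[[str], tuple[bool, str]]


@dataclass(frozen=True)
class Criterion:
    """One rubric item: a name plus a deterministic check."""
    name: str
    check: Check


def requires_sections(*headings: str) -> Check:
    """Pass only if every heading appears on a line of its own."""
    def check(text: str) -> tuple[bool, str]:
        lines = {line.strip().lower() for line in text.splitlines()}
        missing = [h for h in headings if h.lower() not in lines]
        if missing:
            return False, f"missing sections: {missing}"
        return True, "all sections present"
    return check


def forbids_pattern(pattern: str) -> Check:
    """Pass only if the regular expression does not match anywhere."""
    def check(text: str) -> tuple[bool, str]:
        match = re.search(pattern, text, flags=re.IGNORECASE)
        if match:
            return False, f"forbidden pattern found: {match.group(0)!r}"
        return True, "no forbidden pattern found"
    return check


def grade(text: str, rubric: list[Criterion]) -> dict[str, tuple[bool, str]]:
    """Evaluate every criterion and keep the feedback per criterion."""
    return {c.name: c.check(text) for c in rubric}


RUBRIC = [
    Criterion("has_required_sections", requires_sections("Summary", "Risks")),
    Criterion("no_placeholder_text", forbids_pattern(r"\bTODO\b|lorem ipsum")),
]

draft = "Summary\nThe rollout is on track.\nTODO: add risks"
results = grade(draft, RUBRIC)
for name, (passed, feedback) in results.items():
    print(name, "PASS" if passed else "FAIL", "-", feedback)
```

Both criteria fail for this draft, and each failure names what is missing or forbidden, which is the kind of evidence a grader can pass back instead of a bare retry request.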
3. Core Notes
3.1 Problem
Agents may produce plausible outputs that approach the requested result without satisfying every completion condition.
-
As task context grows, ambiguous instructions, tool misuse, non-deterministic failures, and accumulated context can reduce output quality.
-
Developers must otherwise inspect incomplete results, diagnose deficiencies, and manually rerun the task.
-
A general request to “try again” does not identify which acceptance criterion failed or what evidence would demonstrate completion.
3.2 Mechanism
RubricMiddleware places a grader loop around the base agent while keeping the working agent’s operating instructions separate from the evaluation criteria.
-
The application defines a grader model, grader system prompt, optional evidence-gathering tools, and max_iterations.
-
A rubric is supplied at invocation time. If no rubric is provided, the source states that the middleware does nothing.
-
Before the run completes, the grader evaluates the full transcript against each criterion and may call tools to gather evidence.
-
Failed criteria produce targeted feedback that is injected into the conversation, after which the working agent attempts a correction and is graded again.
-
The documented terminal states are satisfied, max_iterations_reached, failed, and grader_error.
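The control flow described above can be sketched as a plain loop. This is a minimal illustration under stated assumptions, not the RubricMiddleware implementation: run_with_rubric, RunResult, and the "ungraded" state are hypothetical names, and the failed state is left out because this note does not record what triggers it.

```python
from dataclasses import dataclass, field
from typing import Callable

# Per-criterion grading result: criterion name -> (passed, feedback).
Verdict = dict[str, tuple[bool, str]]


@dataclass
class RunResult:
    state: str
    output: str
    iterations: int
    feedback: list[str] = field(default_factory=list)


def run_with_rubric(
    work: Callable[[str, list[str]], str],
    grade: Callable[[str, list[str]], Verdict],
    task: str,
    rubric: list[str],
    max_iterations: int = 3,
) -> RunResult:
    """Generate, grade, and correct until the rubric passes or the cap is hit."""
    if not rubric:
        # Assumption: with no rubric the loop is skipped entirely.
        # "ungraded" is a label invented for this sketch.
        return RunResult("ungraded", work(task, []), 0)

    feedback: list[str] = []
    output = ""
    for iteration in range(1, max_iterations + 1):
        output = work(task, feedback)
        try:
            verdict = grade(output, rubric)
        except Exception as exc:
            # A grader failure is reported as its own state, not as a task failure.
            return RunResult("grader_error", output, iteration, [str(exc)])
        # Only failed criteria are fed back to the working agent.
        feedback = [f"{name}: {msg}" for name, (ok, msg) in verdict.items() if not ok]
        if not feedback:
            return RunResult("satisfied", output, iteration)
    return RunResult("max_iterations_reached", output, max_iterations, feedback)


# Stand-in worker and grader so the loop can be exercised without a model.
def demo_work(task: str, feedback: list[str]) -> str:
    return "v2: handles unhashable input" if feedback else "v1"


def demo_grade(output: str, rubric: list[str]) -> Verdict:
    ok = "unhashable" in output
    return {rubric[0]: (ok, "ok" if ok else "fails on unhashable input")}


result = run_with_rubric(demo_work, demo_grade, "write dedupe()", ["tests pass"])
print(result.state, result.iterations)
```

Keeping the cap and the grader-error path inside the loop means every run ends in a named state that downstream handling can branch on.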
3.3 Evidence
The source provides a code-generation example illustrating the mechanism, but it does not present a comparative benchmark or production-scale reliability study.
-
In the reported example, the first implementation failed a test involving unhashable input values. The grader identified the specific failing behavior, and the revised implementation passed the tests on the second iteration.
-
The grader can call a test-suite tool, allowing its verdict to use executable evidence rather than abstract reasoning alone.
-
Feedback is returned per criterion, giving the working agent a more precise correction target than an undifferentiated retry request.
-
These observations demonstrate the workflow, not a general guarantee that rubric-guided agents will be correct, cheaper, or faster across enterprise tasks.
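A grader that executes tests, rather than reasoning about the code, might look like the sketch below. The source's actual code is not reproduced in this note, so dedupe_v1, dedupe_v2, TESTS, and grade_with_tests are invented stand-ins chosen only to mirror the reported failure on unhashable input.

```python
from typing import Callable, Iterable


def dedupe_v1(items: Iterable) -> list:
    """First attempt: tracks seen items in a set, so unhashable items raise TypeError."""
    seen, out = set(), []
    for item in items:
        if item not in seen:
            seen.add(item)
            out.append(item)
    return out


def dedupe_v2(items: Iterable) -> list:
    """Revision: compares by equality, so hashability is not required."""
    out = []
    for item in items:
        if item not in out:
            out.append(item)
    return out


# Each test is one rubric criterion backed by executable evidence.
TESTS: dict[str, Callable[[Callable], bool]] = {
    "dedupes_hashable_values": lambda f: f([1, 2, 1]) == [1, 2],
    "handles_unhashable_values": lambda f: f([[1], [1], [2]]) == [[1], [2]],
}


def grade_with_tests(candidate: Callable) -> dict[str, tuple[bool, str]]:
    """Run every test and record a pass flag plus feedback per criterion."""
    results = {}
    for name, test in TESTS.items():
        try:
            passed = test(candidate)
            results[name] = (passed, "passed" if passed else "returned an unexpected value")
        except Exception as exc:
            # The exception text becomes the targeted feedback for this criterion.
            results[name] = (False, f"{type(exc).__name__}: {exc}")
    return results


first = grade_with_tests(dedupe_v1)
second = grade_with_tests(dedupe_v2)
print(first["handles_unhashable_values"])
print(second["handles_unhashable_values"])
```

The first verdict fails with a TypeError on exactly one criterion while the other still passes, so the correction target is specific; the second verdict passes both.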
3.4 Boundary
The pattern is strongest when criteria are specific, independently verifiable, and connected to trustworthy tools.
-
The source labels RubricMiddleware as beta and warns that its API may change, so production adoption requires version pinning, upgrade tests, and abstraction around the integration.
-
A grader using only model reasoning may accept a polished but incorrect result, reject a valid result, or share the working agent’s blind spots.
-
More iterations create additional model calls, tool executions, latency, and cost; iteration caps bound exposure but may return unfinished work.
-
Rubrics that are vague, conflicting, incomplete, or easy to satisfy superficially can optimize the agent toward the wrong outcome.
-
Human review remains necessary for safety-critical, regulatory, financial, legal, or operational decisions where passing a rubric is not sufficient authority to act.
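One way to enforce these boundaries is to route every terminal state through an explicit policy. The sketch below assumes a conservative rule: only a satisfied run on a non-consequential task is released automatically. TerminalState, Action, and route are hypothetical names, and the policy itself is an assumption, not guidance from the source.

```python
from enum import Enum


class TerminalState(str, Enum):
    SATISFIED = "satisfied"
    MAX_ITERATIONS_REACHED = "max_iterations_reached"
    FAILED = "failed"
    GRADER_ERROR = "grader_error"


class Action(str, Enum):
    RELEASE = "release"
    HUMAN_REVIEW = "human_review"


def route(state: str, consequential: bool) -> Action:
    """Decide what happens to a run's output once the loop has ended."""
    try:
        terminal = TerminalState(state)
    except ValueError:
        # Unknown states fail closed, for example after an API change in a beta component.
        return Action.HUMAN_REVIEW
    if terminal is TerminalState.SATISFIED and not consequential:
        return Action.RELEASE
    # Capped, failed, or errored runs, and any consequential decision, go to a person.
    return Action.HUMAN_REVIEW


print(route("satisfied", consequential=False).value)
print(route("satisfied", consequential=True).value)
print(route("max_iterations_reached", consequential=False).value)
```

Failing closed on unrecognized states keeps the policy safe if the set of terminal states changes between versions.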
4. Concept Map
Use these links to place rubric-guided correction within a governed agent architecture.
- Related domain: Claude Agent SDK Core Concepts
- Related platform: Core AI Platforms & Agents
- Related architecture: Bounded Agent
- Related source note: Agent Loop
flowchart LR
  A["Task and Rubric"] --> B["Working Agent"]
  B --> C["Grader Agent"]
  C --> D["Evidence Tools"]
  D --> C
  C -->|Criteria Failed| E["Targeted Feedback"]
  E --> B
  C -->|Satisfied| F["Completed Result"]
  C -->|Limit or Error| G["Human Review"]
Diagram labels stay in English for rendering consistency and easier reuse across published pages.
5. Quartz Publishing Notes
Check these before publishing the note.
-
Frontmatter uses only approved fields: title, publish, source, source_date, created, tags, permalink, and aliases.
-
Tags are broad and durable, with no more than three items.
-
permalink is the stable public entrypoint; aliases preserve old paths when folders move.
-
Internal links use Quartz / Obsidian wikilinks such as [[Wiki/ideas/AgentLoop|Agent Loop]].
-
Diagrams use fenced mermaid blocks.
-
Private or personal information has been removed.
Publish Boundary
Do not treat a grader verdict as independent proof of correctness. Publish and operationalize only claims supported by the cited source or reproducible evidence.
6. My Take
Rubrics are best understood as an executable quality-control contract for bounded agent work, not as a universal self-improvement mechanism.
-
What changed my thinking: The important architectural move is not simply adding retries; it is separating execution from evaluation and returning criterion-specific evidence to the execution loop.
-
What I may do next: Pilot the pattern on one low-risk workflow with executable validators, measure first-pass success, final success, iteration count, latency, cost, and escalation frequency, then decide whether broader adoption is justified.
-
What still needs verification: Independent benchmarks, production failure rates, grader-model selection guidance, concurrency behavior, rubric persistence, observability, and the operational consequences of each non-success terminal state.
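The pilot measurements listed above could be aggregated with something like the following sketch. RunRecord and summarize are hypothetical names, the field layout is an assumption about what a pilot would log, and the sample records are invented purely to exercise the function.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class RunRecord:
    state: str        # terminal state reported for the run
    iterations: int   # number of generate-and-grade cycles used
    latency_s: float
    cost_usd: float
    escalated: bool   # whether the run was handed to a human


def summarize(runs: list[RunRecord]) -> dict[str, float]:
    """Compute the pilot metrics over a batch of logged runs."""
    if not runs:
        raise ValueError("at least one run is required")
    total = len(runs)
    satisfied = [r for r in runs if r.state == "satisfied"]
    return {
        # First-pass success: satisfied without needing any correction cycle.
        "first_pass_success_rate": sum(r.iterations == 1 for r in satisfied) / total,
        "final_success_rate": len(satisfied) / total,
        "mean_iterations": mean(r.iterations for r in runs),
        "mean_latency_s": mean(r.latency_s for r in runs),
        "mean_cost_usd": mean(r.cost_usd for r in runs),
        "escalation_rate": sum(r.escalated for r in runs) / total,
    }


# Invented sample data, not measurements from the source.
sample = [
    RunRecord("satisfied", 1, 10.0, 0.25, False),
    RunRecord("satisfied", 2, 20.0, 0.50, False),
    RunRecord("max_iterations_reached", 3, 30.0, 0.75, True),
    RunRecord("satisfied", 1, 20.0, 0.50, False),
]
summary = summarize(sample)
for metric, value in summary.items():
    print(f"{metric}: {value}")
```

Comparing first-pass and final success rates shows how much the correction loop contributes, while mean iterations, latency, and cost show what that contribution costs.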
Reuse Path
Convert this note into a pilot design containing rubric authoring rules, evidence-tool requirements, retry budgets, terminal-state handling, observability metrics, and human-escalation criteria.
