Rubric-Guided Agents That Evaluate and Correct Their Work

Source Snapshot

Origin: Introducing Rubrics: Build Agents that Evaluate and Correct Their Work

Type: Product note

One-line takeaway: Rubric-guided evaluation can turn explicit completion criteria into an automated generate, grade, correct, and re-grade loop, while iteration caps and human review remain necessary boundaries.

Garden Card

Use this note to assess whether rubric-guided self-correction can improve the reliability of bounded enterprise agent workflows without treating model-based grading as proof of correctness.

Core question: How can an agent evaluate its work against explicit acceptance criteria and correct deficiencies before returning a result?
Operational value: It can reduce manual inspection and reruns for tasks with verifiable completion criteria, such as test execution, required-section coverage, and forbidden-pattern detection.
Best connection: Agent Loop, Bounded Agent, and Enterprise Agent Governance.

1. Executive Summary

LangChain’s beta RubricMiddleware adds a dedicated grader sub-agent to a Deep Agents run, so generated work can be checked against explicit criteria and revised using criterion-specific feedback. The pattern is operationally useful where success can be verified through tests, validators, required content, or prohibited patterns. It is best suited to bounded, observable workflows, because retries increase cost and latency and model-only grading can reproduce ambiguity or error. Enterprise use should therefore combine precise rubrics, evidence-producing tools, iteration limits, failure-state handling, and human review for consequential decisions.

Main idea: Separate work generation from evaluation, then feed structured grading feedback back to the working agent until the criteria pass or the loop reaches a configured terminal state.
Why now: Longer agent runs accumulate ambiguity, tool errors, context pressure, and probabilistic variation, making first-pass completion increasingly unreliable.
Where it applies: Code generation with executable tests, structured reports with mandatory sections, compliance-oriented content checks, and other workflows with explicit and verifiable acceptance criteria.

Decision Signal

If I only remember one thing from this note, it should be:

A self-correction loop improves reliability only when “done” is defined by observable criteria and the grader can collect credible evidence.

2. Key Technical Terms

Use these terms consistently when designing or reviewing rubric-guided agent systems.

Rubric: An explicit checklist describing the conditions that a completed agent run must satisfy.
Grader sub-agent: A separate agent that evaluates the working agent’s transcript and outputs a verdict with feedback for each criterion.
RubricMiddleware: The beta Deep Agents component described by the source that runs grading before completion and triggers another iteration when criteria remain unsatisfied.
Tool-grounded evaluation: Evaluation based on evidence collected through tools, such as executing tests or validators, rather than relying only on language-model judgment.
Iteration cap: The configured maximum number of correction and re-grading cycles used to bound cost, latency, and runaway behavior.
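
The terms above can be made concrete with a small data model. This is a minimal sketch for illustration, not the RubricMiddleware API: the class names Criterion, CriterionResult, and Verdict, and all of their fields, are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Criterion:
    """One observable completion condition (hypothetical structure)."""
    id: str
    description: str


@dataclass(frozen=True)
class CriterionResult:
    """The grader's finding for a single criterion."""
    criterion_id: str
    passed: bool
    feedback: str = ""  # what failed and what evidence was observed


@dataclass
class Verdict:
    """Per-criterion results for one grading pass."""
    results: list[CriterionResult] = field(default_factory=list)

    @property
    def satisfied(self) -> bool:
        # Complete only if at least one criterion was graded and all passed.
        return bool(self.results) and all(r.passed for r in self.results)

    def failed_feedback(self) -> list[str]:
        # One targeted message per failed criterion, for the working agent.
        return [f"{r.criterion_id}: {r.feedback}" for r in self.results if not r.passed]


rubric = [
    Criterion("tests_pass", "All unit tests pass"),
    Criterion("has_docstring", "Public functions have docstrings"),
]
verdict = Verdict([
    CriterionResult("tests_pass", False, "test_unhashable raised TypeError"),
    CriterionResult("has_docstring", True),
])
print(verdict.satisfied)          # False
print(verdict.failed_feedback())  # ['tests_pass: test_unhashable raised TypeError']
```

Keeping feedback attached to a criterion id is what lets a correction request name the exact condition that failed instead of asking for a generic retry.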

3. Core Notes

3.1 Problem

Agents may produce plausible outputs that approach the requested result without satisfying every completion condition.

As runs grow longer, ambiguous instructions, tool misuse, non-deterministic failures, and accumulated context can reduce output quality.
Without automated checking, developers must inspect incomplete results, diagnose deficiencies, and rerun the task manually.
A general request to “try again” does not identify which acceptance criterion failed or what evidence would demonstrate completion.

3.2 Mechanism

RubricMiddleware places a grader loop around the base agent while keeping the working agent’s operating instructions separate from the evaluation criteria.

The application defines a grader model, grader system prompt, optional evidence-gathering tools, and max_iterations.
A rubric is supplied at invocation time. If no rubric is provided, the source states that the middleware does nothing.
Before the run completes, the grader evaluates the full transcript against each criterion and may call tools to gather evidence.
Failed criteria produce targeted feedback that is injected into the conversation, after which the working agent attempts a correction and is graded again.
The documented terminal states are satisfied, max_iterations_reached, failed, and grader_error.
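
The control flow described above can be sketched in plain Python. This is an illustration under stated assumptions, not the middleware's implementation: run_with_rubric, toy_worker, and toy_grader are hypothetical names, max_iterations is assumed to count grading passes, and the failed state is left out because the source does not say what triggers it.

```python
from typing import Callable

# Terminal state names follow the ones the source documents.
SATISFIED = "satisfied"
MAX_ITERATIONS_REACHED = "max_iterations_reached"
GRADER_ERROR = "grader_error"


def run_with_rubric(
    work: Callable[[list[str]], str],
    grade: Callable[[str, list[str]], dict[str, str]],
    rubric: list[str],
    max_iterations: int = 3,
) -> tuple[str, str, int]:
    """Generate, grade, correct, and re-grade until a terminal state.

    `work` receives the latest feedback and returns a draft.
    `grade` returns {criterion: feedback} for failed criteria only.
    Returns (state, last_draft, grading_passes).
    """
    if not rubric:
        # The source says the middleware does nothing without a rubric;
        # "ungraded" is a hypothetical label for that pass-through case.
        return "ungraded", work([]), 0

    feedback: list[str] = []
    draft = ""
    for iteration in range(1, max_iterations + 1):
        draft = work(feedback)
        try:
            failures = grade(draft, rubric)
        except Exception:
            return GRADER_ERROR, draft, iteration
        if not failures:
            return SATISFIED, draft, iteration
        # Inject criterion-specific feedback for the next attempt.
        feedback = [f"{criterion}: {msg}" for criterion, msg in failures.items()]
    return MAX_ITERATIONS_REACHED, draft, max_iterations


def toy_worker(feedback: list[str]) -> str:
    # Stand-in for the working agent: adds the missing section once told.
    return "Summary\nRisks" if feedback else "Summary"


def toy_grader(draft: str, rubric: list[str]) -> dict[str, str]:
    # Stand-in for the grader: each criterion is a required section name.
    return {c: f"section '{c}' is missing" for c in rubric if c not in draft}


state, draft, passes = run_with_rubric(toy_worker, toy_grader, ["Summary", "Risks"])
print(state, passes)  # satisfied 2
```

The loop always exits through a named state, so the caller can decide what to do with unfinished work rather than receiving it silently.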

3.3 Evidence

The source provides a code-generation example illustrating the mechanism, but it does not present a comparative benchmark or production-scale reliability study.

In the reported example, the first implementation failed a test involving unhashable input values. The grader identified the specific failing behavior, and the revised implementation passed the tests on the second iteration.
The grader can call a test-suite tool, allowing its verdict to use executable evidence rather than abstract reasoning alone.
Feedback is returned per criterion, giving the working agent a more precise correction target than an undifferentiated retry request.
These observations demonstrate the workflow, not a general guarantee that rubric-guided agents will be correct, cheaper, or faster across enterprise tasks.
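
To show what tool-grounded evidence can look like, here is a minimal sketch loosely modeled on the reported failure with unhashable inputs. Everything in it is hypothetical: the source does not publish the function under test, so dedupe_v1, dedupe_v2, the test cases, and the run_test_suite tool are stand-ins chosen for illustration.

```python
from typing import Any, Callable, Iterable


def dedupe_v1(items: Iterable[Any]) -> list[Any]:
    # First attempt: tracks seen items in a set, so unhashable items raise TypeError.
    seen: set[Any] = set()
    out: list[Any] = []
    for item in items:
        if item not in seen:
            seen.add(item)
            out.append(item)
    return out


def dedupe_v2(items: Iterable[Any]) -> list[Any]:
    # Revised attempt: compares by equality, so unhashable items work (O(n^2)).
    out: list[Any] = []
    for item in items:
        if item not in out:
            out.append(item)
    return out


def run_test_suite(fn: Callable[[Iterable[Any]], list[Any]]) -> dict[str, str]:
    """Hypothetical evidence tool: run each case and report failures by name."""
    cases = {
        "hashable_values": ([1, 2, 1, 3], [1, 2, 3]),
        "unhashable_values": ([[1], [2], [1]], [[1], [2]]),
    }
    failures: dict[str, str] = {}
    for name, (given, expected) in cases.items():
        try:
            actual = fn(given)
        except Exception as exc:
            failures[name] = f"raised {type(exc).__name__}: {exc}"
            continue
        if actual != expected:
            failures[name] = f"expected {expected!r}, got {actual!r}"
    return failures


print(run_test_suite(dedupe_v1))  # one failure, keyed by 'unhashable_values'
print(run_test_suite(dedupe_v2))  # {}
```

The grader's verdict here rests on executed test cases, and the failure message names the case that broke, which is the kind of feedback a working agent can act on.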

3.4 Boundary

The pattern is strongest when criteria are specific, independently verifiable, and connected to trustworthy tools.

The source labels RubricMiddleware as beta and warns that its API may change, so production adoption requires version pinning, upgrade tests, and abstraction around the integration.
A grader using only model reasoning may accept a polished but incorrect result, reject a valid result, or share the working agent’s blind spots.
More iterations create additional model calls, tool executions, latency, and cost; iteration caps bound exposure but may return unfinished work.
Rubrics that are vague, conflicting, incomplete, or easy to satisfy superficially can optimize the agent toward the wrong outcome.
Human review remains necessary for safety-critical, regulatory, financial, legal, or operational decisions where passing a rubric is not sufficient authority to act.
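
The boundaries above imply an explicit policy for what happens after each terminal state. The following is a sketch of one possible policy, not a documented behavior: route_result and worst_case_model_calls are hypothetical helpers, the routing choices are illustrative, and the call-count bound assumes a fixed number of model calls per attempt, which real agent runs may exceed.

```python
def route_result(state: str, consequential: bool) -> str:
    """Map a terminal state to a next step (illustrative policy only)."""
    if state == "satisfied":
        # Passing the rubric is not sufficient authority for consequential actions.
        return "human_review" if consequential else "deliver"
    if state in {"max_iterations_reached", "failed", "grader_error"}:
        # Non-success states may carry unfinished work, so a person decides.
        return "human_review"
    raise ValueError(f"unknown terminal state: {state!r}")


def worst_case_model_calls(
    max_iterations: int,
    worker_calls_per_attempt: int = 1,
    grader_calls_per_pass: int = 1,
) -> int:
    # Upper bound on model calls if every iteration fails grading.
    # Assumes fixed per-iteration call counts; tool executions are not counted.
    return max_iterations * (worker_calls_per_attempt + grader_calls_per_pass)


print(route_result("satisfied", consequential=False))               # deliver
print(route_result("satisfied", consequential=True))                # human_review
print(route_result("max_iterations_reached", consequential=False))  # human_review
print(worst_case_model_calls(3))                                    # 6
```

Raising on an unknown state is deliberate: because the API is beta, a renamed or newly added terminal state should fail loudly instead of being delivered by default.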

4. Concept Map

Use these links to place rubric-guided correction within a governed agent architecture.

Related domain: Claude Agent SDK Core Concepts
Related platform: Core AI Platforms & Agents
Related architecture: Bounded Agent
Related source note: Agent Loop

flowchart LR
  A["Task and Rubric"] --> B["Working Agent"]
  B --> C["Grader Agent"]
  C --> D["Evidence Tools"]
  D --> C
  C -->|Criteria Failed| E["Targeted Feedback"]
  E --> B
  C -->|Satisfied| F["Completed Result"]
  C -->|Limit or Error| G["Human Review"]

Diagram labels stay in English for rendering consistency and easier reuse across published pages.

5. Quartz Publishing Notes

Check these before publishing the note.

Frontmatter uses only approved fields: title, publish, source, source_date, created, tags, permalink, and aliases.
Tags are broad and durable, with no more than three items.
permalink is the stable public entrypoint; aliases preserve old paths when folders move.
Internal links use Quartz / Obsidian wikilinks such as [[Agent Loop]].
Diagrams use fenced mermaid blocks.
Private or personal information has been removed.

Publish Boundary

Do not treat a grader verdict as independent proof of correctness. Publish and operationalize only claims supported by the cited source or reproducible evidence.

6. My Take

Rubrics are best understood as an executable quality-control contract for bounded agent work, not as a universal self-improvement mechanism.

What changed my thinking: The important architectural move is not simply adding retries; it is separating execution from evaluation and returning criterion-specific evidence to the execution loop.
What I may do next: Pilot the pattern on one low-risk workflow with executable validators, measure first-pass success, final success, iteration count, latency, cost, and escalation frequency, then decide whether broader adoption is justified.
What still needs verification: Independent benchmarks, production failure rates, grader-model selection guidance, concurrency behavior, rubric persistence, observability, and the operational consequences of each non-success terminal state.
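
The pilot measurements listed above could be computed from per-run logs along these lines. This is a sketch under assumptions: RunRecord is a hypothetical log schema, the sample records are made-up values for demonstration rather than results, and every non-satisfied run is assumed to be escalated.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class RunRecord:
    """One pilot run (hypothetical log schema)."""
    terminal_state: str
    iterations: int
    latency_s: float
    cost_usd: float


def summarize(runs: list[RunRecord]) -> dict[str, float]:
    """Aggregate pilot metrics over a list of run records."""
    if not runs:
        raise ValueError("no runs to summarize")
    n = len(runs)
    satisfied = [r for r in runs if r.terminal_state == "satisfied"]
    return {
        # Satisfied on the first grading pass, with no correction needed.
        "first_pass_success": sum(r.iterations == 1 for r in satisfied) / n,
        "final_success": len(satisfied) / n,
        "mean_iterations": mean(r.iterations for r in runs),
        "mean_latency_s": mean(r.latency_s for r in runs),
        "total_cost_usd": sum(r.cost_usd for r in runs),
        # Assumes every run that did not end satisfied goes to a person.
        "escalation_rate": (n - len(satisfied)) / n,
    }


# Illustrative records only; not measurements from any real system.
runs = [
    RunRecord("satisfied", 1, 10.0, 0.25),
    RunRecord("satisfied", 2, 20.0, 0.5),
    RunRecord("max_iterations_reached", 3, 30.0, 0.75),
    RunRecord("satisfied", 2, 20.0, 0.5),
]
summary = summarize(runs)
print(summary["first_pass_success"], summary["final_success"])  # 0.25 0.75
```

Comparing first-pass success with final success isolates what the correction loop adds, while iterations, latency, and cost show what it charges for that gain.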

Reuse Path

Convert this note into a pilot design containing rubric authoring rules, evidence-tool requirements, retry budgets, terminal-state handling, observability metrics, and human-escalation criteria.
