PluginBench
Audit score 70

agentic-eval

github/awesome-copilot

Patterns for self-improving AI agents through iterative evaluation and refinement loops.

What is agentic-eval?

Agentic evaluation enables agents to assess and improve their own outputs by implementing self-critique, reflection, and refinement cycles. Use this skill when building quality-critical generation systems, test-driven code refinement, or LLM-as-judge evaluation pipelines that require iterative improvement.

  • Implement self-critique and reflection loops for agent outputs
  • Build evaluator-optimizer pipelines with separate generation and evaluation components
  • Create test-driven refinement workflows for code generation
  • Design rubric-based and LLM-as-judge evaluation systems
  • Measure output quality against structured criteria with convergence detection
  • Integrate iterative improvement into code, reports, and analysis generation

How to install agentic-eval

npx skills add https://github.com/github/awesome-copilot --skill agentic-eval
Claude Code
Cursor
Windsurf
Cline

How to use agentic-eval

  1. Define clear evaluation criteria or a rubric for your task
  2. Choose an evaluation strategy (outcome-based, LLM-as-judge, or rubric-based)
  3. Implement a generate() function for initial output creation
  4. Implement an evaluate() function that returns structured JSON scores
  5. Implement an optimize() function that refines output based on feedback
  6. Wire up the refinement loop with iteration limits (typically 3-5 iterations)
  7. Add convergence detection to stop when scores plateau
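The steps above can be sketched as a single loop. This is a minimal illustration rather than the skill's own implementation: the name `refine`, the `min_improvement` parameter, and the choice to return the best-scoring draft are assumptions made for the example, and `generate`, `evaluate`, and `optimize` are whatever callables you supply.

```python
from typing import Callable


def refine(
    task: str,
    generate: Callable[[str], str],
    evaluate: Callable[[str, str], dict],
    optimize: Callable[[str, dict], str],
    score_threshold: float = 0.8,
    max_iterations: int = 3,
    min_improvement: float = 0.01,  # hypothetical knob for the plateau check
) -> tuple[str, list[float]]:
    """Run generate -> evaluate -> optimize until good enough, capped, or stalled."""
    output = generate(task)
    best_output, best_score = output, float("-inf")
    scores: list[float] = []

    for _ in range(max_iterations):
        evaluation = evaluate(output, task)
        score = float(evaluation["overall_score"])
        scores.append(score)

        # Compare against the best score seen so far before updating it.
        improved = score >= best_score + min_improvement
        if score > best_score:
            best_output, best_score = output, score

        if score >= score_threshold:
            break  # good enough
        if len(scores) > 1 and not improved:
            break  # scores have plateaued
        output = optimize(output, evaluation)

    # Return the best draft seen, since a later refinement can score lower.
    return best_output, scores
```

Returning the score list alongside the output makes it easy to see why the loop stopped: threshold reached, plateau, or iteration cap.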

Use cases

Good for
  • Refining generated code through automated test execution and error fixing
  • Evaluating reports or analysis against multiple quality dimensions with weighted scoring
  • Implementing self-critique loops where agents assess their own outputs and improve them
  • Building code generation systems that validate against test suites before returning results
  • Creating multi-turn refinement workflows for quality-critical content generation
Who it's for
  • AI agent developers building quality-critical systems
  • Code generation tool builders
  • Teams implementing test-driven development with agents
  • Developers creating self-improving or iterative generation pipelines

agentic-eval FAQ

How many iterations should I run?

Set max iterations to 3-5 to prevent infinite loops. Add convergence detection to stop early if output quality isn't improving between iterations.

What evaluation strategy should I use?

Use outcome-based evaluation for clear success criteria, LLM-as-judge for comparative ranking, or rubric-based scoring for multi-dimensional quality assessment. Rubric-based with weighted dimensions works well for complex outputs.

How do I prevent evaluation parse failures?

Use structured JSON output format for all evaluations. Always validate JSON parsing and handle failures gracefully with fallback logic.
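One way to handle parse failures is to wrap `json.loads` in a helper that tolerates extra text around the JSON and returns a fallback instead of raising. The sketch below is an assumption-laden example: `parse_evaluation` and the fallback shape (`overall_score` of 0.0 plus a `parse_error` flag) are hypothetical, chosen so that a failed parse reads as a failing evaluation and the loop tries again.

```python
import json
from typing import Optional


def parse_evaluation(raw: str, fallback: Optional[dict] = None) -> dict:
    """Parse an evaluator reply as a JSON object, or return a fallback dict."""
    if fallback is None:
        # Hypothetical fallback: score 0.0 so the caller treats it as a failure.
        fallback = {"overall_score": 0.0, "parse_error": True}

    # Models often wrap JSON in prose, so try the whole reply first and then
    # the span from the first "{" to the last "}".
    candidates = [raw]
    start, end = raw.find("{"), raw.rfind("}")
    if start != -1 and end > start:
        candidates.append(raw[start : end + 1])

    for candidate in candidates:
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed

    return dict(fallback)
```

Checking the `parse_error` flag in the caller lets you count or log failed evaluations instead of silently treating them as low scores.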

Can I use this for code generation?

Yes. The code-specific reflection pattern runs generated code against test suites, detects failures, and iteratively fixes errors until tests pass.

Should I log the refinement history?

Yes. Keep the full trajectory of all iterations for debugging, analysis, and understanding where improvements occurred.
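A trajectory log can be as small as a list of per-iteration records. The following is a sketch under assumptions: the `IterationRecord` and `Trajectory` names, their fields, and the JSON Lines export are illustrative choices, not part of the skill.

```python
import json
import time
from dataclasses import asdict, dataclass, field


@dataclass
class IterationRecord:
    iteration: int
    output: str
    evaluation: dict
    timestamp: float = field(default_factory=time.time)


@dataclass
class Trajectory:
    task: str
    records: list = field(default_factory=list)

    def log(self, output: str, evaluation: dict) -> None:
        """Append one iteration; iterations are numbered from 1."""
        self.records.append(IterationRecord(len(self.records) + 1, output, evaluation))

    def best(self) -> IterationRecord:
        """Return the highest-scoring record (raises ValueError if nothing was logged)."""
        return max(self.records, key=lambda r: r.evaluation.get("overall_score", 0.0))

    def to_jsonl(self) -> str:
        """Serialize the records as JSON Lines, one iteration per line."""
        return "\n".join(json.dumps(asdict(r)) for r in self.records)
```

Calling `log()` once per loop iteration and writing `to_jsonl()` to a file at the end gives a record you can diff across runs.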

Full instructions (SKILL.md)

Source of truth, from github/awesome-copilot.


name: agentic-eval
description: |
  Patterns and techniques for evaluating and improving AI agent outputs. Use this skill when:

  • Implementing self-critique and reflection loops
  • Building evaluator-optimizer pipelines for quality-critical generation
  • Creating test-driven code refinement workflows
  • Designing rubric-based or LLM-as-judge evaluation systems
  • Adding iterative improvement to agent outputs (code, reports, analysis)
  • Measuring and improving agent response quality

Agentic Evaluation Patterns

Patterns for self-improvement through iterative evaluation and refinement.

Overview

Evaluation patterns enable agents to assess and improve their own outputs, moving beyond single-shot generation to iterative refinement loops.

Generate → Evaluate → Critique → Refine → Output
    ↑                              │
    └──────────────────────────────┘

When to Use

  • Quality-critical generation: Code, reports, analysis requiring high accuracy
  • Tasks with clear evaluation criteria: Defined success metrics exist
  • Content requiring specific standards: Style guides, compliance, formatting

Pattern 1: Basic Reflection

Agent evaluates and improves its own output through self-critique.

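import json

# `llm` is assumed to be a helper you provide: it takes a prompt string and returns the model's reply text.

def reflect_and_refine(task: str, criteria: list[str], max_iterations: int = 3) -> str:
    """Generate with reflection loop."""
    output = llm(f"Complete this task:\n{task}")

    for _ in range(max_iterations):
        # Self-critique: ask for one verdict per criterion in a fixed JSON shape
        critique = llm(f"""
        Evaluate this output against criteria: {criteria}
        Output: {output}
        Return only JSON mapping each criterion to
        {{"status": "PASS" or "FAIL", "feedback": "..."}}.
        """)

        try:
            critique_data = json.loads(critique)
        except json.JSONDecodeError:
            continue  # Unparseable critique: use the next iteration to ask again

        all_pass = all(c["status"] == "PASS" for c in critique_data.values())
        if all_pass:
            return output

        # Refine using only the feedback for failed criteria
        failed = {k: v["feedback"] for k, v in critique_data.items() if v["status"] == "FAIL"}
        output = llm(f"Improve to address: {failed}\nOriginal: {output}")

    return output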

Key insight: Use structured JSON output for reliable parsing of critique results.
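Parsed JSON can still have the wrong shape, for example a skipped criterion or a lowercase status. A small normalization step, sketched below, makes the PASS/FAIL check safe. The function name `normalize_critique` and the rule that a missing verdict counts as FAIL are assumptions for this example.

```python
def normalize_critique(critique_data: object, criteria: list[str]) -> dict[str, dict[str, str]]:
    """Coerce a parsed critique into {criterion: {"status": ..., "feedback": ...}}."""
    data = critique_data if isinstance(critique_data, dict) else {}
    normalized = {}
    for criterion in criteria:
        entry = data.get(criterion)
        if not isinstance(entry, dict):
            # Assumption: a criterion the model skipped is treated as failed.
            entry = {"status": "FAIL", "feedback": "No verdict returned for this criterion."}
        status = str(entry.get("status", "")).strip().upper()
        normalized[criterion] = {
            # Anything other than an explicit PASS is treated as FAIL.
            "status": "PASS" if status == "PASS" else "FAIL",
            "feedback": str(entry.get("feedback", "")),
        }
    return normalized
```

Running the critique through this helper before the `all(...)` check means a malformed reply leads to another refinement pass rather than a `KeyError`.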


Pattern 2: Evaluator-Optimizer

Separate generation and evaluation into distinct components for clearer responsibilities.

import json

# `llm` is assumed to be a helper you provide that sends a prompt and returns the reply text.

class EvaluatorOptimizer:
    def __init__(self, score_threshold: float = 0.8):
        self.score_threshold = score_threshold

    def generate(self, task: str) -> str:
        return llm(f"Complete: {task}")

    def evaluate(self, output: str, task: str) -> dict:
        # The evaluator returns structured JSON so the score can be compared to the threshold
        return json.loads(llm(f"""
        Evaluate output for task: {task}
        Output: {output}
        Return JSON: {{"overall_score": 0-1, "dimensions": {{"accuracy": ..., "clarity": ...}}}}
        """))

    def optimize(self, output: str, feedback: dict) -> str:
        return llm(f"Improve based on feedback: {feedback}\nOutput: {output}")

    def run(self, task: str, max_iterations: int = 3) -> str:
        output = self.generate(task)
        for _ in range(max_iterations):
            evaluation = self.evaluate(output, task)
            # Stop as soon as the output is good enough
            if evaluation["overall_score"] >= self.score_threshold:
                break
            output = self.optimize(output, evaluation)
        return output

Pattern 3: Code-Specific Reflection

Test-driven refinement loop for code generation.

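# `llm` and `run_tests` are assumed helpers you provide. `run_tests(code, tests)` is
# expected to return a dict with a boolean "success" and an "error" message.

class CodeReflector:
    def reflect_and_fix(self, spec: str, max_iterations: int = 3) -> str:
        code = llm(f"Write Python code for: {spec}")
        tests = llm(f"Generate pytest tests for: {spec}\nCode: {code}")

        for _ in range(max_iterations):
            result = run_tests(code, tests)
            if result["success"]:
                return code
            # Feed the failure output back to the model and try again
            code = llm(f"Fix error: {result['error']}\nCode: {code}")
        return code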
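The pattern above leaves `run_tests` undefined. One possible sketch, assuming pytest is installed and that the generated tests import from a module named `solution`, writes both files to a temporary directory and runs them in a subprocess. The file names, the `runner` parameter, and the 2000-character error cap are hypothetical choices for illustration.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_tests(
    code: str,
    tests: str,
    runner: tuple = (sys.executable, "-m", "pytest", "-q"),  # assumes pytest is installed
    timeout: float = 60.0,
) -> dict:
    """Write code and tests to a temp directory, run them, and report the result."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical file names: the tests are assumed to import from `solution`.
        Path(tmp, "solution.py").write_text(code, encoding="utf-8")
        Path(tmp, "test_solution.py").write_text(tests, encoding="utf-8")
        try:
            proc = subprocess.run(
                [*runner, "test_solution.py"],
                cwd=tmp,
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return {"success": False, "error": f"Tests timed out after {timeout} seconds"}

    output = (proc.stdout + proc.stderr).strip()
    # Keep only the tail of the output so the fix prompt stays short.
    return {"success": proc.returncode == 0, "error": "" if proc.returncode == 0 else output[-2000:]}
```

Generated code is untrusted, so in practice this subprocess call would usually run inside a sandbox or container rather than directly on the host.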

Evaluation Strategies

Outcome-Based

Evaluate whether output achieves the expected result.

def evaluate_outcome(task: str, output: str, expected: str) -> str:
    return llm(f"Does output achieve expected outcome? Task: {task}, Expected: {expected}, Output: {output}")

LLM-as-Judge

Use LLM to compare and rank outputs.

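def llm_judge(output_a: str, output_b: str, criteria: str) -> str:
    # Include both candidates in the prompt so the judge can actually compare them
    return llm(
        f"Compare outputs A and B for {criteria}. Which is better and why?\n"
        f"Output A: {output_a}\nOutput B: {output_b}"
    )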
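To rank more than two outputs, pairwise verdicts can be tallied into a win count. The sketch below is one simple approach, not the skill's prescribed method: `rank_outputs` is a hypothetical name, and `judge` is assumed to be a callable that returns a reply starting with "A" or "B" (for example, a wrapper around an LLM call that is told to answer with a single letter).

```python
from itertools import combinations
from typing import Callable


def rank_outputs(outputs: list[str], judge: Callable[[str, str], str]) -> list[str]:
    """Rank outputs by the number of pairwise comparisons each one wins."""
    wins = [0] * len(outputs)
    for i, j in combinations(range(len(outputs)), 2):
        verdict = judge(outputs[i], outputs[j]).strip().upper()
        if verdict.startswith("A"):
            wins[i] += 1
        elif verdict.startswith("B"):
            wins[j] += 1
        # Any other reply counts as a tie and awards no win.

    # sorted() is stable, so tied outputs keep their original relative order.
    order = sorted(range(len(outputs)), key=lambda k: wins[k], reverse=True)
    return [outputs[k] for k in order]
```

This makes one judge call per pair, so the cost grows quadratically with the number of candidates.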

Rubric-Based

Score outputs against weighted dimensions.

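import json

RUBRIC = {
    "accuracy": {"weight": 0.4},
    "clarity": {"weight": 0.3},
    "completeness": {"weight": 0.3}
}

def evaluate_with_rubric(output: str, rubric: dict) -> float:
    # Ask for a JSON object so the ratings can be parsed reliably
    scores = json.loads(llm(
        f"Rate 1-5 for each dimension: {list(rubric.keys())}\n"
        f"Return only JSON mapping each dimension to its rating.\nOutput: {output}"
    ))
    # Weighted average of the 1-5 ratings, divided by 5 to normalize the result to at most 1.0
    return sum(scores[d] * rubric[d]["weight"] for d in rubric) / 5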
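A worked example of the weighted sum may help. The sketch below separates the arithmetic from the LLM call and adds two sanity checks; `weighted_score`, `max_rating`, and the validation rules are assumptions for illustration.

```python
RUBRIC = {
    "accuracy": {"weight": 0.4},
    "clarity": {"weight": 0.3},
    "completeness": {"weight": 0.3},
}


def weighted_score(scores: dict, rubric: dict, max_rating: int = 5) -> float:
    """Combine per-dimension ratings (1..max_rating) into one weighted score."""
    total_weight = sum(d["weight"] for d in rubric.values())
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError(f"Rubric weights sum to {total_weight}, expected 1.0")
    for dim in rubric:
        if not 1 <= scores[dim] <= max_rating:
            raise ValueError(f"Rating for {dim!r} is outside 1..{max_rating}")
    return sum(scores[d] * rubric[d]["weight"] for d in rubric) / max_rating


ratings = {"accuracy": 4, "clarity": 5, "completeness": 3}
# (4 * 0.4 + 5 * 0.3 + 3 * 0.3) / 5 = (1.6 + 1.5 + 0.9) / 5 = 0.8
print(round(weighted_score(ratings, RUBRIC), 3))
```

Because the lowest rating is 1, the normalized score ranges from 0.2 to 1.0 rather than from 0 to 1, which is worth keeping in mind when choosing a score threshold.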

Best Practices

| Practice | Rationale |
|----------|-----------|
| Clear criteria | Define specific, measurable evaluation criteria upfront |
| Iteration limits | Set max iterations (3-5) to prevent infinite loops |
| Convergence check | Stop if output score isn't improving between iterations |
| Log history | Keep full trajectory for debugging and analysis |
| Structured output | Use JSON for reliable parsing of evaluation results |

Quick Start Checklist

## Evaluation Implementation Checklist

### Setup
- [ ] Define evaluation criteria/rubric
- [ ] Set score threshold for "good enough"
- [ ] Configure max iterations (default: 3)

### Implementation
- [ ] Implement generate() function
- [ ] Implement evaluate() function with structured output
- [ ] Implement optimize() function
- [ ] Wire up the refinement loop

### Safety
- [ ] Add convergence detection
- [ ] Log all iterations for debugging
- [ ] Handle evaluation parse failures gracefully