Skill

Official

Review

Audit score 70

agentic-eval

github/awesome-copilot

Patterns for self-improving AI agents through iterative evaluation and refinement loops.

What is agentic-eval?

Agentic evaluation enables agents to assess and improve their own outputs by implementing self-critique, reflection, and refinement cycles. Use this skill when building quality-critical generation systems, test-driven code refinement, or LLM-as-judge evaluation pipelines that require iterative improvement.

Implement self-critique and reflection loops for agent outputs
Build evaluator-optimizer pipelines with separate generation and evaluation components
Create test-driven refinement workflows for code generation
Design rubric-based and LLM-as-judge evaluation systems
Measure output quality against structured criteria with convergence detection
Integrate iterative improvement into code, reports, and analysis generation

How to install agentic-eval

npx skills add https://github.com/github/awesome-copilot --skill agentic-eval

How to use agentic-eval

1. Define clear evaluation criteria or rubric for your task
2. Choose an evaluation strategy (outcome-based, LLM-as-judge, or rubric-based)
3. Implement a generate() function for initial output creation
4. Implement an evaluate() function that returns structured JSON scores
5. Implement an optimize() function that refines output based on feedback
6. Wire up the refinement loop with iteration limits (typically 3-5 iterations)
7. Add convergence detection to stop when scores plateau
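The steps above can be sketched as a single loop. This is a minimal illustration, not the skill's own implementation: the function name `refine` and the `min_improvement` parameter are hypothetical, the three callables are supplied by the caller, and `evaluate` is assumed to return a dict with an `overall_score` between 0 and 1.

```python
from typing import Callable


def refine(
    task: str,
    generate: Callable[[str], str],
    evaluate: Callable[[str, str], dict],
    optimize: Callable[[str, dict], str],
    threshold: float = 0.8,
    max_iterations: int = 3,
    min_improvement: float = 0.01,  # hypothetical plateau tolerance
) -> tuple[str, list[float]]:
    """Generate, then evaluate and optimize until good enough, stalled, or out of budget."""
    output = generate(task)
    scores: list[float] = []
    best_output, best_score = output, float("-inf")

    for _ in range(max_iterations):
        evaluation = evaluate(output, task)
        score = float(evaluation["overall_score"])
        scores.append(score)

        # Remember the best candidate so a bad refinement cannot make the result worse
        if score > best_score:
            best_output, best_score = output, score

        # Stop 1: the output is good enough
        if score >= threshold:
            break
        # Stop 2: convergence, the score did not improve meaningfully since the last round
        if len(scores) >= 2 and scores[-1] - scores[-2] < min_improvement:
            break

        output = optimize(output, evaluation)

    return best_output, scores
```

Returning the best-scoring candidate instead of the last one is a design choice of this sketch; the score list doubles as a minimal history for inspecting how a run progressed.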

Use cases

Good for

Refining generated code through automated test execution and error fixing
Evaluating reports or analysis against multiple quality dimensions with weighted scoring
Implementing self-critique loops where agents assess their own outputs and improve them
Building code generation systems that validate against test suites before returning results
Creating multi-turn refinement workflows for quality-critical content generation

Who it's for

AI agent developers building quality-critical systems
Code generation tool builders
Teams implementing test-driven development with agents
Developers creating self-improving or iterative generation pipelines

agentic-eval FAQ

How many iterations should I run?

Set max iterations to 3-5 to prevent infinite loops. Add convergence detection to stop early if output quality isn't improving between iterations.

What evaluation strategy should I use?

Use outcome-based evaluation for clear success criteria, LLM-as-judge for comparative ranking, or rubric-based scoring for multi-dimensional quality assessment. Rubric-based with weighted dimensions works well for complex outputs.

How do I prevent evaluation parse failures?

Use structured JSON output format for all evaluations. Always validate JSON parsing and handle failures gracefully with fallback logic.
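One way to handle this is a small parsing helper that never raises. The sketch below is illustrative only: `parse_evaluation` and the `FALLBACK` value are hypothetical names, and treating an unparseable reply as a score of 0.0 is an assumption you may want to change (for example, to re-ask the evaluator instead).

```python
import json
import re
from typing import Optional

# Hypothetical fallback: an unparseable evaluation counts as a failing score
FALLBACK = {"overall_score": 0.0, "parse_error": True}


def parse_evaluation(raw: str, fallback: Optional[dict] = None) -> dict:
    """Parse an evaluator reply into a dict, returning a fallback instead of raising."""
    fallback = dict(FALLBACK if fallback is None else fallback)

    candidates = [raw.strip()]
    # Models often wrap JSON in prose or a Markdown fence, so also try the
    # span from the first "{" to the last "}".
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match:
        candidates.append(match.group(0))

    for candidate in candidates:
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed

    return fallback
```

The `parse_error` flag lets the calling loop distinguish a genuinely low score from a reply that could not be read.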

Can I use this for code generation?

Yes. The code-specific reflection pattern runs generated code against test suites, detects failures, and iteratively fixes errors until tests pass.

Should I log the refinement history?

Yes. Keep the full trajectory of every iteration (output, score, and feedback) for debugging, analysis, and understanding where improvements occurred.
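A trajectory can be as simple as an append-only list of records. The sketch below is one possible shape; the class and field names (`Trajectory`, `IterationRecord`, `to_jsonl`) are hypothetical and not part of the skill.

```python
import json
import time
from dataclasses import asdict, dataclass, field


@dataclass
class IterationRecord:
    iteration: int
    output: str
    score: float
    feedback: dict
    timestamp: float = field(default_factory=time.time)


@dataclass
class Trajectory:
    task: str
    records: list[IterationRecord] = field(default_factory=list)

    def record(self, output: str, score: float, feedback: dict) -> None:
        """Append one iteration; the index is the number of records so far."""
        self.records.append(IterationRecord(len(self.records), output, score, feedback))

    def best(self) -> IterationRecord:
        """Return the highest-scoring iteration (raises ValueError if nothing was recorded)."""
        return max(self.records, key=lambda r: r.score)

    def to_jsonl(self) -> str:
        """Serialize one JSON object per line, convenient for log files."""
        return "\n".join(json.dumps(asdict(r)) for r in self.records)
```

Calling `record()` once per loop iteration and writing `to_jsonl()` to a file at the end is enough to see later which iteration produced the gain or the regression.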

Full instructions (SKILL.md)

Source of truth, from github/awesome-copilot.

name: agentic-eval
description: |
  Patterns and techniques for evaluating and improving AI agent outputs. Use this skill when:

Implementing self-critique and reflection loops
Building evaluator-optimizer pipelines for quality-critical generation
Creating test-driven code refinement workflows
Designing rubric-based or LLM-as-judge evaluation systems
Adding iterative improvement to agent outputs (code, reports, analysis)
Measuring and improving agent response quality

Agentic Evaluation Patterns

Patterns for self-improvement through iterative evaluation and refinement.

Overview

Evaluation patterns enable agents to assess and improve their own outputs, moving beyond single-shot generation to iterative refinement loops.

Generate → Evaluate → Critique → Refine → Output
    ↑                              │
    └──────────────────────────────┘

When to Use

Quality-critical generation: Code, reports, analysis requiring high accuracy
Tasks with clear evaluation criteria: Defined success metrics exist
Content requiring specific standards: Style guides, compliance, formatting

Pattern 1: Basic Reflection

Agent evaluates and improves its own output through self-critique.

import json

# Assumes `llm` is a user-supplied callable: prompt string in, completion string out.
def reflect_and_refine(task: str, criteria: list[str], max_iterations: int = 3) -> str:
    """Generate with reflection loop."""
    output = llm(f"Complete this task:\n{task}")

    for _ in range(max_iterations):
        # Self-critique: request the exact JSON shape the code below reads
        critique = llm(f"""
        Evaluate this output against criteria: {criteria}
        Output: {output}
        Return only JSON mapping each criterion to {{"status": "PASS" or "FAIL", "feedback": "..."}}.
        """)

        try:
            critique_data = json.loads(critique)
        except json.JSONDecodeError:
            continue  # Unparseable critique: spend this iteration and ask again

        all_pass = all(c["status"] == "PASS" for c in critique_data.values())
        if all_pass:
            return output

        # Refine based on critique
        failed = {k: v["feedback"] for k, v in critique_data.items() if v["status"] == "FAIL"}
        output = llm(f"Improve to address: {failed}\nOriginal: {output}")

    return output

Key insight: Use structured JSON output for reliable parsing of critique results.

Pattern 2: Evaluator-Optimizer

Separate generation and evaluation into distinct components for clearer responsibilities.

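import json

class EvaluatorOptimizer:
    """Keeps generation, evaluation, and optimization separate. Assumes a user-supplied `llm(prompt) -> str`."""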
    def __init__(self, score_threshold: float = 0.8):
        self.score_threshold = score_threshold
    
    def generate(self, task: str) -> str:
        return llm(f"Complete: {task}")
    
    def evaluate(self, output: str, task: str) -> dict:
        return json.loads(llm(f"""
        Evaluate output for task: {task}
        Output: {output}
        Return JSON: {{"overall_score": 0-1, "dimensions": {{"accuracy": ..., "clarity": ...}}}}
        """))
    
    def optimize(self, output: str, feedback: dict) -> str:
        return llm(f"Improve based on feedback: {feedback}\nOutput: {output}")
    
    def run(self, task: str, max_iterations: int = 3) -> str:
        output = self.generate(task)
        for _ in range(max_iterations):
            evaluation = self.evaluate(output, task)
            if evaluation["overall_score"] >= self.score_threshold:
                break
            output = self.optimize(output, evaluation)
        return output

Pattern 3: Code-Specific Reflection

Test-driven refinement loop for code generation.

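class CodeReflector:
    """Assumes user-supplied helpers: `llm(prompt) -> str` and
    `run_tests(code, tests) -> {"success": bool, "error": str}`."""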
    def reflect_and_fix(self, spec: str, max_iterations: int = 3) -> str:
        code = llm(f"Write Python code for: {spec}")
        tests = llm(f"Generate pytest tests for: {spec}\nCode: {code}")
        
        for _ in range(max_iterations):
            result = run_tests(code, tests)
            if result["success"]:
                return code
            code = llm(f"Fix error: {result['error']}\nCode: {code}")
        return code
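
The pattern above calls `run_tests` without defining it. One possible sketch, assuming pytest is installed and the generated tests import from a module named `solution` (both are assumptions, as are the file names and the `timeout` parameter):

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_tests(code: str, tests: str, timeout: float = 30.0) -> dict:
    """Write code and tests to a temp dir, run pytest there, and report the result."""
    with tempfile.TemporaryDirectory() as tmp:
        # Assumed layout: the tests do `from solution import ...`
        Path(tmp, "solution.py").write_text(code)
        Path(tmp, "test_solution.py").write_text(tests)
        try:
            proc = subprocess.run(
                [sys.executable, "-m", "pytest", "-q", "test_solution.py"],
                cwd=tmp,
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return {"success": False, "error": f"Tests timed out after {timeout} seconds"}

    # pytest exits with 0 only when tests were collected and all of them passed
    return {"success": proc.returncode == 0, "error": proc.stdout + proc.stderr}
```

A subprocess with a timeout guards against hangs but is not a sandbox; generated code still runs with the caller's permissions, so untrusted output is better executed in a container or similar isolated environment.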

Evaluation Strategies

Outcome-Based

Evaluate whether output achieves the expected result.

def evaluate_outcome(task: str, output: str, expected: str) -> str:
    return llm(f"Does output achieve expected outcome? Task: {task}, Expected: {expected}, Output: {output}")

LLM-as-Judge

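Use an LLM to compare and rank outputs.

def llm_judge(output_a: str, output_b: str, criteria: str) -> str:
    # Both candidates must appear in the prompt; otherwise the judge has nothing to compare
    return llm(
        f"Compare outputs A and B for {criteria}. Which is better and why?\n"
        f"Output A: {output_a}\n"
        f"Output B: {output_b}"
    )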

Rubric-Based

Score outputs against weighted dimensions.

RUBRIC = {
    "accuracy": {"weight": 0.4},
    "clarity": {"weight": 0.3},
    "completeness": {"weight": 0.3}
}

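def evaluate_with_rubric(output: str, rubric: dict) -> float:
    # Requires `import json`. The prompt asks for JSON explicitly, e.g. {"accuracy": 4, ...}
    scores = json.loads(llm(
        f"Rate 1-5 for each dimension: {list(rubric.keys())}\nOutput: {output}\n"
        "Return only JSON mapping each dimension to its integer score."
    ))
    # Weighted sum of 1-5 ratings, divided by 5 to normalize
    return sum(scores[d] * rubric[d]["weight"] for d in rubric) / 5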
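
To make the arithmetic concrete, the scoring step can be separated from the LLM call and checked by hand. This is a sketch under assumptions: `weighted_score` and its `scale_max` parameter are hypothetical names, and the weights are required to sum to 1. Because ratings start at 1, the normalized score ranges from 0.2 to 1.0, not from 0 to 1, which matters when choosing a threshold.

```python
RUBRIC = {
    "accuracy": {"weight": 0.4},
    "clarity": {"weight": 0.3},
    "completeness": {"weight": 0.3},
}


def weighted_score(scores: dict, rubric: dict, scale_max: int = 5) -> float:
    """Combine per-dimension ratings (1..scale_max) into one normalized score."""
    total_weight = sum(d["weight"] for d in rubric.values())
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError(f"Rubric weights must sum to 1, got {total_weight}")

    missing = [d for d in rubric if d not in scores]
    if missing:
        raise KeyError(f"Evaluator did not rate: {missing}")

    return sum(scores[d] * rubric[d]["weight"] for d in rubric) / scale_max


# Worked example: (4 * 0.4 + 5 * 0.3 + 3 * 0.3) / 5 = (1.6 + 1.5 + 0.9) / 5 = 0.8,
# up to floating-point rounding
example = weighted_score({"accuracy": 4, "clarity": 5, "completeness": 3}, RUBRIC)
```

Validating weights and missing dimensions before computing the sum turns a silently skewed score into an explicit error.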

Best Practices

| Practice | Rationale |
|---|---|
| Clear criteria | Define specific, measurable evaluation criteria upfront |
| Iteration limits | Set max iterations (3-5) to prevent infinite loops |
| Convergence check | Stop if output score isn't improving between iterations |
| Log history | Keep full trajectory for debugging and analysis |
| Structured output | Use JSON for reliable parsing of evaluation results |

Quick Start Checklist

## Evaluation Implementation Checklist

### Setup
- [ ] Define evaluation criteria/rubric
- [ ] Set score threshold for "good enough"
- [ ] Configure max iterations (default: 3)

### Implementation
- [ ] Implement generate() function
- [ ] Implement evaluate() function with structured output
- [ ] Implement optimize() function
- [ ] Wire up the refinement loop

### Safety
- [ ] Add convergence detection
- [ ] Log all iterations for debugging
- [ ] Handle evaluation parse failures gracefully
