agentic-eval
github/awesome-copilot
Patterns for self-improving AI agents through iterative evaluation and refinement loops.
What is agentic-eval?
Agentic evaluation enables agents to assess and improve their own outputs by implementing self-critique, reflection, and refinement cycles. Use this skill when building quality-critical generation systems, test-driven code refinement, or LLM-as-judge evaluation pipelines that require iterative improvement.
- Implement self-critique and reflection loops for agent outputs
- Build evaluator-optimizer pipelines with separate generation and evaluation components
- Create test-driven refinement workflows for code generation
- Design rubric-based and LLM-as-judge evaluation systems
- Measure output quality against structured criteria with convergence detection
- Integrate iterative improvement into code, reports, and analysis generation
How to install agentic-eval
npx skills add https://github.com/github/awesome-copilot --skill agentic-eval
How to use agentic-eval
1. Define clear evaluation criteria or a rubric for your task
2. Choose an evaluation strategy (outcome-based, LLM-as-judge, or rubric-based)
3. Implement a generate() function for initial output creation
4. Implement an evaluate() function that returns structured JSON scores
5. Implement an optimize() function that refines output based on feedback
6. Wire up the refinement loop with iteration limits (typically 3-5 iterations)
7. Add convergence detection to stop when scores plateau
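The steps above can be sketched as one loop. This is a minimal illustration, not the skill's own implementation: the function name `refine_until_converged` and the `min_improvement` parameter are hypothetical, and `evaluate` is assumed to return a dict with an `overall_score` key, as in the Evaluator-Optimizer pattern further down.

```python
from typing import Callable


def refine_until_converged(
    task: str,
    generate: Callable[[str], str],
    evaluate: Callable[[str, str], dict],
    optimize: Callable[[str, dict], str],
    max_iterations: int = 3,
    score_threshold: float = 0.8,
    min_improvement: float = 0.01,  # hypothetical plateau tolerance
) -> tuple[str, float]:
    """Return the best-scoring output seen and its score."""
    output = generate(task)
    best_output, best_score = output, float("-inf")
    for _ in range(max_iterations):
        evaluation = evaluate(output, task)
        score = float(evaluation["overall_score"])
        # Gain over the best score so far (infinite on the first pass)
        improvement = score - best_score
        if score > best_score:
            best_output, best_score = output, score
        if score >= score_threshold:
            break  # good enough
        if improvement < min_improvement:
            break  # scores have plateaued, so stop early
        output = optimize(output, evaluation)
    return best_output, best_score
```

Returning the best output rather than the last one guards against a final refinement that scores lower than an earlier attempt.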
Use cases
- Refining generated code through automated test execution and error fixing
- Evaluating reports or analysis against multiple quality dimensions with weighted scoring
- Implementing self-critique loops where agents assess their own outputs and improve them
- Building code generation systems that validate against test suites before returning results
- Creating multi-turn refinement workflows for quality-critical content generation
Who it's for
- AI agent developers building quality-critical systems
- Code generation tool builders
- Teams implementing test-driven development with agents
- Developers creating self-improving or iterative generation pipelines
agentic-eval FAQ
Set max iterations to 3-5 to prevent infinite loops. Add convergence detection to stop early if output quality isn't improving between iterations.
Use outcome-based evaluation for clear success criteria, LLM-as-judge for comparative ranking, or rubric-based scoring for multi-dimensional quality assessment. Rubric-based with weighted dimensions works well for complex outputs.
Use structured JSON output format for all evaluations. Always validate JSON parsing and handle failures gracefully with fallback logic.
Yes. The code-specific reflection pattern runs generated code against test suites, detects failures, and iteratively fixes errors until tests pass.
Yes. Keep full trajectory of all iterations for debugging, analysis, and understanding where improvements occurred.
Full instructions (SKILL.md)
Source of truth, from github/awesome-copilot.
name: agentic-eval
description: |
  Patterns and techniques for evaluating and improving AI agent outputs. Use this skill when:
  - Implementing self-critique and reflection loops
  - Building evaluator-optimizer pipelines for quality-critical generation
  - Creating test-driven code refinement workflows
  - Designing rubric-based or LLM-as-judge evaluation systems
  - Adding iterative improvement to agent outputs (code, reports, analysis)
  - Measuring and improving agent response quality
Agentic Evaluation Patterns
Patterns for self-improvement through iterative evaluation and refinement.
Overview
Evaluation patterns enable agents to assess and improve their own outputs, moving beyond single-shot generation to iterative refinement loops.
Generate → Evaluate → Critique → Refine → Output
    ↑                              │
    └──────────────────────────────┘
When to Use
- Quality-critical generation: Code, reports, analysis requiring high accuracy
- Tasks with clear evaluation criteria: Defined success metrics exist
- Content requiring specific standards: Style guides, compliance, formatting
Pattern 1: Basic Reflection
Agent evaluates and improves its own output through self-critique.
import json

# `llm` is assumed to be your own function that sends a prompt to a model and returns its text reply.
def reflect_and_refine(task: str, criteria: list[str], max_iterations: int = 3) -> str:
    """Generate with reflection loop."""
    output = llm(f"Complete this task:\n{task}")
    for _ in range(max_iterations):
        # Self-critique: request one status/feedback entry per criterion
        critique = llm(f"""
        Evaluate this output against criteria: {criteria}
        Output: {output}
        Rate each: PASS/FAIL with feedback as JSON, in the form
        {{"<criterion>": {{"status": "PASS" or "FAIL", "feedback": "..."}}}}
        """)
        critique_data = json.loads(critique)
        all_pass = all(c["status"] == "PASS" for c in critique_data.values())
        if all_pass:
            return output
        # Refine based on the feedback for failed criteria only
        failed = {k: v["feedback"] for k, v in critique_data.items() if v["status"] == "FAIL"}
        output = llm(f"Improve to address: {failed}\nOriginal: {output}")
    return output
Key insight: Use structured JSON output for reliable parsing of critique results.
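Model replies are not always clean JSON, so the `json.loads` call in Pattern 1 can raise. The sketch below shows one way to parse defensively. The helper name `parse_critique` is hypothetical, and it assumes the critique shape used in Pattern 1: each criterion maps to an object with `status` and `feedback`.

```python
import json
import re


def parse_critique(raw: str) -> dict:
    """Parse a critique reply; return {} when nothing usable is found."""
    # Take the outermost {...} span so surrounding prose or fences are ignored
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        return {}
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return {}
    if not isinstance(data, dict):
        return {}
    cleaned = {}
    for name, entry in data.items():
        # Keep only entries with a recognizable PASS/FAIL status
        if isinstance(entry, dict) and entry.get("status") in ("PASS", "FAIL"):
            cleaned[name] = {
                "status": entry["status"],
                "feedback": str(entry.get("feedback", "")),
            }
    return cleaned
```

Callers should treat an empty result as a failed evaluation rather than a pass: `all()` over an empty collection is `True`, so an unparsed critique would otherwise look like every criterion passed.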
Pattern 2: Evaluator-Optimizer
Separate generation and evaluation into distinct components for clearer responsibilities.
import json

class EvaluatorOptimizer:
    def __init__(self, score_threshold: float = 0.8):
        self.score_threshold = score_threshold

    def generate(self, task: str) -> str:
        return llm(f"Complete: {task}")

    def evaluate(self, output: str, task: str) -> dict:
        return json.loads(llm(f"""
        Evaluate output for task: {task}
        Output: {output}
        Return JSON: {{"overall_score": 0-1, "dimensions": {{"accuracy": ..., "clarity": ...}}}}
        """))

    def optimize(self, output: str, feedback: dict) -> str:
        return llm(f"Improve based on feedback: {feedback}\nOutput: {output}")

    def run(self, task: str, max_iterations: int = 3) -> str:
        output = self.generate(task)
        for _ in range(max_iterations):
            evaluation = self.evaluate(output, task)
            # Stop as soon as the output is good enough
            if evaluation["overall_score"] >= self.score_threshold:
                break
            output = self.optimize(output, evaluation)
        return output
Pattern 3: Code-Specific Reflection
Test-driven refinement loop for code generation.
class CodeReflector:
    def reflect_and_fix(self, spec: str, max_iterations: int = 3) -> str:
        code = llm(f"Write Python code for: {spec}")
        tests = llm(f"Generate pytest tests for: {spec}\nCode: {code}")
        for _ in range(max_iterations):
            # run_tests is assumed to return {"success": bool, "error": str}
            result = run_tests(code, tests)
            if result["success"]:
                return code
            code = llm(f"Fix error: {result['error']}\nCode: {code}")
        return code
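Pattern 3 calls `run_tests` without defining it. One possible shape is sketched below, under stated assumptions: it returns a dict with `success` and `error` keys, as the loop expects, and it swaps pytest for plain `assert` statements run in a fresh interpreter so that it depends only on the standard library. It is not a sandbox, so generated code should only be run this way in an environment you are prepared to lose.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_tests(code: str, tests: str, timeout: float = 30.0) -> dict:
    """Run code followed by tests in a subprocess; report success and stderr."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate_test.py"
        # Concatenate the candidate code and its assert-based tests
        script.write_text(f"{code}\n\n{tests}\n", encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                text=True,
                timeout=timeout,
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            return {"success": False, "error": f"timed out after {timeout} seconds"}
    # A non-zero exit code means an assertion or other exception was raised
    return {"success": proc.returncode == 0, "error": proc.stderr.strip()}
```

The traceback captured in `error` is what gets passed back to the model in the "Fix error" prompt, so keeping stderr intact matters more than summarizing it.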
Evaluation Strategies
Outcome-Based
Evaluate whether output achieves the expected result.
def evaluate_outcome(task: str, output: str, expected: str) -> str:
    return llm(f"Does output achieve expected outcome? Task: {task}, Expected: {expected}, Output: {output}")
LLM-as-Judge
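Use an LLM to compare and rank outputs.
def llm_judge(output_a: str, output_b: str, criteria: str) -> str:
    # Both candidates must appear in the prompt, or the judge has nothing to compare
    prompt = f"Compare outputs A and B for {criteria}. Which is better and why?\nA: {output_a}\nB: {output_b}"
    return llm(prompt)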
Rubric-Based
Score outputs against weighted dimensions.
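import json

RUBRIC = {
    "accuracy": {"weight": 0.4},
    "clarity": {"weight": 0.3},
    "completeness": {"weight": 0.3},
}

def evaluate_with_rubric(output: str, rubric: dict) -> float:
    # Ask for a JSON object mapping each dimension to a 1-5 rating
    scores = json.loads(llm(f"Rate 1-5 for each dimension: {list(rubric.keys())}. Return JSON mapping dimension to rating.\nOutput: {output}"))
    # Weighted sum divided by the top rating (5), so a perfect output scores 1.0
    return sum(scores[d] * rubric[d]["weight"] for d in rubric) / 5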
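To make the arithmetic concrete, here is a sketch of the weighted scoring step on its own, without the model call. The helper name `weighted_score` and its validation rules are assumptions added for illustration. It also divides by the total weight, so the result stays normalized if the weights do not sum to exactly 1.

```python
RUBRIC = {
    "accuracy": {"weight": 0.4},
    "clarity": {"weight": 0.3},
    "completeness": {"weight": 0.3},
}


def weighted_score(scores: dict, rubric: dict, scale_max: int = 5) -> float:
    """Combine per-dimension ratings (1..scale_max) into one normalized score."""
    missing = [d for d in rubric if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    total_weight = sum(spec["weight"] for spec in rubric.values())
    total = 0.0
    for dimension, spec in rubric.items():
        rating = float(scores[dimension])
        if not 1 <= rating <= scale_max:
            raise ValueError(f"{dimension} rating {rating} is outside 1-{scale_max}")
        total += rating * spec["weight"]
    return total / (total_weight * scale_max)


# (4 * 0.4 + 5 * 0.3 + 3 * 0.3) / 5 = 4.0 / 5, i.e. about 0.8
example = weighted_score({"accuracy": 4, "clarity": 5, "completeness": 3}, RUBRIC)
```

Because ratings start at 1, the normalized score ranges from 0.2 to 1.0 rather than from 0 to 1, which is worth keeping in mind when choosing a "good enough" threshold.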
Best Practices
| Practice | Rationale |
|---|---|
| Clear criteria | Define specific, measurable evaluation criteria upfront |
| Iteration limits | Set max iterations (3-5) to prevent infinite loops |
| Convergence check | Stop if output score isn't improving between iterations |
| Log history | Keep full trajectory for debugging and analysis |
| Structured output | Use JSON for reliable parsing of evaluation results |
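The "Log history" practice can be as light as a list of per-iteration records. The sketch below is one possible shape; the class names `IterationRecord` and `Trajectory` and their fields are hypothetical and not part of the skill.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class IterationRecord:
    iteration: int
    output: str
    score: float
    feedback: dict = field(default_factory=dict)


@dataclass
class Trajectory:
    task: str
    records: list[IterationRecord] = field(default_factory=list)

    def log(self, output: str, score: float, feedback: Optional[dict] = None) -> None:
        """Append one iteration, numbering from 0 in call order."""
        self.records.append(
            IterationRecord(len(self.records), output, score, feedback or {})
        )

    def best(self) -> IterationRecord:
        """Return the highest-scoring record (raises ValueError if nothing was logged)."""
        return max(self.records, key=lambda record: record.score)

    def to_json(self) -> str:
        """Serialize the full trajectory for debugging or offline analysis."""
        return json.dumps(asdict(self), indent=2)
```

Calling `log` once per pass through the refinement loop gives both a debugging trail and a way to return the best attempt instead of the last one.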
Quick Start Checklist
## Evaluation Implementation Checklist
### Setup
- [ ] Define evaluation criteria/rubric
- [ ] Set score threshold for "good enough"
- [ ] Configure max iterations (default: 3)
### Implementation
- [ ] Implement generate() function
- [ ] Implement evaluate() function with structured output
- [ ] Implement optimize() function
- [ ] Wire up the refinement loop
### Safety
- [ ] Add convergence detection
- [ ] Log all iterations for debugging
- [ ] Handle evaluation parse failures gracefully