llm-evaluation
wshobson/agents
A structured framework for evaluating LLM outputs with automated metrics, human review, and LLM-as-judge.
What is llm-evaluation?
This skill provides a framework and conceptual guide for evaluating LLM applications, covering automated metrics (BLEU, ROUGE, BERTScore, plus classification and retrieval metrics), human evaluation dimensions, and LLM-as-judge approaches. Use it when you need to systematically measure LLM quality, compare models or prompts, detect regressions, or build an evaluation pipeline.
- Defines categories of LLM evaluation: automated metrics, human evaluation, and LLM-as-judge
- Lists common text-generation, classification, and retrieval metrics
- Provides a Python EvaluationSuite/Metric scaffold to run metrics over test cases and aggregate scores
- Supports custom metric functions via Metric.custom()
- Points to references/details.md for deeper worked examples and patterns
How to install llm-evaluation
npx skills add https://github.com/wshobson/agents --skill llm-evaluation
- Python environment with numpy
- A model object exposing an async predict method
- Implementations of metric calculation functions (e.g., calculate_accuracy, calculate_bleu, calculate_bertscore) or equivalents
- Test case dataset with inputs and expected outputs/context
How to use llm-evaluation
- 1. Identify the evaluation type needed for your use case: automated metrics, human evaluation, or LLM-as-judge.
- 2. Define test cases as a list of dicts with input, expected output, and optional context.
- 3. Select or define Metric objects (e.g., Metric.accuracy(), Metric.bleu(), Metric.bertscore(), or Metric.custom() for bespoke checks).
- 4. Instantiate an EvaluationSuite with the chosen metrics.
- 5. Run suite.evaluate(model, test_cases) against your model to get aggregated and raw scores.
- 6. Consult references/details.md for deeper patterns and worked examples when the quick-start approach is insufficient.
Use cases
- Comparing two prompts or model versions on a fixed test set
- Detecting performance regressions before deploying a new model/prompt
- Building a RAG evaluation pipeline with MRR/NDCG/Precision@K metrics
- Setting up an LLM-as-judge pipeline to score outputs at scale
- Establishing a baseline evaluation suite for an LLM application
Who it's for
- ML/AI engineers building evaluation pipelines for LLM applications
- Teams comparing models or prompt versions before deployment
- Engineers establishing regression testing for AI systems
- Researchers benchmarking RAG or classification systems
llm-evaluation FAQ
It documents the LLM-as-judge approach conceptually (pointwise, pairwise, reference-based, reference-free) but the quick start code focuses on automated metrics; further worked examples are in references/details.md.
The skill references metrics like accuracy, BLEU, BERTScore, ROUGE, METEOR, perplexity, precision/recall/F1, AUC-ROC, MRR, NDCG, and Precision@K/Recall@K, plus a custom metric hook, but does not implement all of them in the shown code.
No, the SKILL.md shows generic Python code (EvaluationSuite/Metric classes) that works with any model exposing a predict method; no provider-specific integration is described.
Yes, the Metric.custom(name, fn) pattern lets you plug in a custom scoring function such as a groundedness checker.
Full instructions (SKILL.md)
Source of truth, from wshobson/agents.
name: llm-evaluation
description: Implement comprehensive evaluation strategies for LLM applications using automated metrics, human feedback, and benchmarking. Use when testing LLM performance, measuring AI application quality, or establishing evaluation frameworks.
LLM Evaluation
Master comprehensive evaluation strategies for LLM applications, from automated metrics to human evaluation and A/B testing.
When to Use This Skill
- Measuring LLM application performance systematically
- Comparing different models or prompts
- Detecting performance regressions before deployment
- Validating improvements from prompt changes
- Building confidence in production systems
- Establishing baselines and tracking progress over time
- Debugging unexpected model behavior
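One way to act on the regression-detection and baseline items above is to compare the aggregated scores of two evaluation runs. The sketch below is illustrative only: `find_regressions`, its `tolerance` parameter, and the assumption that higher scores are better for every metric are hypothetical and not part of the skill.

```python
def find_regressions(baseline, candidate, tolerance=0.01):
    """Return metrics whose candidate score fell below the baseline.

    Assumes both arguments map metric name -> aggregated score and that
    higher is better for every metric (an assumption of this sketch).
    A drop smaller than or equal to `tolerance` is treated as noise.
    """
    regressions = {}
    for name, base_score in baseline.items():
        if name not in candidate:
            continue  # metric not measured in the candidate run
        if candidate[name] < base_score - tolerance:
            regressions[name] = (base_score, candidate[name])
    return regressions


# Example: scores from a stored baseline run and a new prompt version
baseline_scores = {"accuracy": 0.90, "bleu": 0.40}
candidate_scores = {"accuracy": 0.85, "bleu": 0.405}
regressed = find_regressions(baseline_scores, candidate_scores)
```

In a deployment gate, a non-empty result could be used to block the release until the drop is investigated.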
Core Evaluation Types
1. Automated Metrics
Fast, repeatable, scalable evaluation using computed scores.
Text Generation:
- BLEU: N-gram overlap (translation)
- ROUGE: Recall-oriented (summarization)
- METEOR: Unigram matching with stemming and synonym support
- BERTScore: Embedding-based similarity
- Perplexity: Language model confidence
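To make the n-gram overlap idea behind BLEU and ROUGE concrete, here is a simplified sketch. It is not a full BLEU or ROUGE implementation (no brevity penalty, no multi-reference support, whitespace tokenization only), and the function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(prediction, reference, n=1):
    """BLEU-style clipped precision: share of predicted n-grams found in the reference."""
    pred = ngrams(prediction.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(pred.values())
    if total == 0:
        return 0.0
    # Each predicted n-gram counts at most as often as it occurs in the reference
    overlap = sum(min(count, ref[gram]) for gram, count in pred.items())
    return overlap / total


def ngram_recall(prediction, reference, n=1):
    """ROUGE-N-style recall: share of reference n-grams recovered by the prediction."""
    pred = ngrams(prediction.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, pred[gram]) for gram, count in ref.items())
    return overlap / total


prediction = "the cat sat on the mat"
reference = "the cat is on the mat"
unigram_precision = clipped_precision(prediction, reference, n=1)
bigram_precision = clipped_precision(prediction, reference, n=2)
unigram_recall = ngram_recall(prediction, reference, n=1)
```

For production use, an established metrics library is usually a safer choice than a hand-rolled version like this one.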
Classification:
- Accuracy: Percentage correct
- Precision/Recall/F1: Class-specific performance
- Confusion Matrix: Error patterns
- AUC-ROC: Ranking quality
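The classification metrics above can be computed directly from predicted and true labels. The following is a minimal sketch for a single positive class; the function names and the spam/ham labels are made up for illustration.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the true label."""
    if not labels:
        return 0.0
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def precision_recall_f1(predictions, labels, positive):
    """Precision, recall, and F1 for one class treated as the positive class."""
    pairs = list(zip(predictions, labels))
    tp = sum(p == positive and y == positive for p, y in pairs)
    fp = sum(p == positive and y != positive for p, y in pairs)
    fn = sum(p != positive and y == positive for p, y in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


predicted = ["spam", "spam", "ham", "ham", "spam"]
actual = ["spam", "ham", "ham", "spam", "spam"]
acc = accuracy(predicted, actual)
precision, recall, f1 = precision_recall_f1(predicted, actual, positive="spam")
```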
Retrieval (RAG):
- MRR: Mean Reciprocal Rank
- NDCG: Normalized Discounted Cumulative Gain
- Precision@K: Relevant in top K
- Recall@K: Coverage in top K
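As a rough illustration of the retrieval metrics listed above, the sketch below computes them for ranked document IDs with binary relevance. The function names and document IDs are hypothetical, and graded relevance (which NDCG also supports) is left out for brevity.

```python
import math


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries):
    """Average reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)


def precision_at_k(ranked_ids, relevant_ids, k):
    """Share of the top-k results that are relevant."""
    return sum(d in relevant_ids for d in ranked_ids[:k]) / k


def recall_at_k(ranked_ids, relevant_ids, k):
    """Share of all relevant documents that appear in the top k."""
    if not relevant_ids:
        return 0.0
    return sum(d in relevant_ids for d in ranked_ids[:k]) / len(relevant_ids)


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """NDCG with binary gains: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, d in enumerate(ranked_ids[:k], start=1)
        if d in relevant_ids
    )
    ideal_hits = min(k, len(relevant_ids))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
rr = reciprocal_rank(ranked, relevant)
p_at_2 = precision_at_k(ranked, relevant, k=2)
r_at_4 = recall_at_k(ranked, relevant, k=4)
ndcg = ndcg_at_k(ranked, relevant, k=4)
```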
2. Human Evaluation
Manual assessment for quality aspects difficult to automate.
Dimensions:
- Accuracy: Factual correctness
- Coherence: Logical flow
- Relevance: Answers the question
- Fluency: Natural language quality
- Safety: No harmful content
- Helpfulness: Useful to the user
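Human ratings along these dimensions still need to be aggregated before they can be compared across runs. The sketch below assumes each annotator scores a response on a 1-5 scale per dimension; the scale, the `aggregate_ratings` name, and the disagreement threshold are assumptions for illustration, not something the skill prescribes.

```python
from statistics import mean

DIMENSIONS = ("accuracy", "coherence", "relevance", "fluency", "safety", "helpfulness")


def aggregate_ratings(ratings, disagreement_threshold=2):
    """Summarize per-dimension scores from several annotators.

    `ratings` is a list of dicts (one per annotator) mapping a dimension
    name to a score. Dimensions nobody rated are omitted from the result.
    """
    summary = {}
    for dim in DIMENSIONS:
        scores = [r[dim] for r in ratings if dim in r]
        if not scores:
            continue
        spread = max(scores) - min(scores)
        summary[dim] = {
            "mean": mean(scores),
            "spread": spread,
            # Large spread suggests annotators disagree and the item needs review
            "needs_review": spread >= disagreement_threshold,
        }
    return summary


annotator_ratings = [
    {"accuracy": 5, "fluency": 4},
    {"accuracy": 3, "fluency": 4},
    {"accuracy": 4, "fluency": 5},
]
rating_summary = aggregate_ratings(annotator_ratings)
```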
3. LLM-as-Judge
Use stronger LLMs to evaluate weaker model outputs.
Approaches:
- Pointwise: Score individual responses
- Pairwise: Compare two responses
- Reference-based: Compare to gold standard
- Reference-free: Judge without ground truth
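The pairwise approach can be sketched as follows. The `judge` argument is a hypothetical callable that takes a prompt string and returns "A" or "B"; in practice it would wrap a call to a judge model. The prompt template and the order-swapping step (a common guard against position bias) are assumptions of this sketch rather than something the skill specifies.

```python
from typing import Callable

PAIRWISE_TEMPLATE = (
    "Question: {question}\n"
    "Response A: {a}\n"
    "Response B: {b}\n"
    "Which response answers the question better? Reply with A or B."
)


def pairwise_judge(question: str, response_1: str, response_2: str,
                   judge: Callable[[str], str]) -> str:
    """Compare two responses; return 'first', 'second', or 'tie'.

    The judge is asked twice with the responses in swapped positions.
    A winner is declared only when both verdicts agree on the same response.
    """
    verdict_1 = judge(PAIRWISE_TEMPLATE.format(
        question=question, a=response_1, b=response_2)).strip().upper()
    verdict_2 = judge(PAIRWISE_TEMPLATE.format(
        question=question, a=response_2, b=response_1)).strip().upper()
    if verdict_1 == "A" and verdict_2 == "B":
        return "first"
    if verdict_1 == "B" and verdict_2 == "A":
        return "second"
    return "tie"  # inconsistent verdicts, e.g. the judge always picks slot A


# Stub judge for demonstration: prefers whichever slot contains "Paris"
def stub_judge(prompt: str) -> str:
    return "A" if prompt.index("Paris") < prompt.index("Lyon") else "B"


winner = pairwise_judge("What is the capital of France?", "Paris", "Lyon", stub_judge)
biased = pairwise_judge("What is the capital of France?", "Paris", "Lyon", lambda p: "A")
```

Here `winner` reflects a consistent preference, while `biased` shows how a judge that always picks slot A collapses to a tie.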
Quick Start
import asyncio
from dataclasses import dataclass
from typing import Callable

import numpy as np

# calculate_accuracy, calculate_bleu, calculate_bertscore, and
# check_groundedness are not defined here; supply your own implementations.
# Each is called as fn(prediction=..., reference=..., context=...).


@dataclass
class Metric:
    name: str
    fn: Callable

    @staticmethod
    def accuracy():
        return Metric("accuracy", calculate_accuracy)

    @staticmethod
    def bleu():
        return Metric("bleu", calculate_bleu)

    @staticmethod
    def bertscore():
        return Metric("bertscore", calculate_bertscore)

    @staticmethod
    def custom(name: str, fn: Callable):
        return Metric(name, fn)


class EvaluationSuite:
    def __init__(self, metrics: list[Metric]):
        self.metrics = metrics

    async def evaluate(self, model, test_cases: list[dict]) -> dict:
        # One list of per-example scores for each metric
        results = {m.name: [] for m in self.metrics}
        for test in test_cases:
            prediction = await model.predict(test["input"])
            for metric in self.metrics:
                score = metric.fn(
                    prediction=prediction,
                    reference=test.get("expected"),
                    context=test.get("context"),
                )
                results[metric.name].append(score)
        return {
            "metrics": {k: np.mean(v) for k, v in results.items()},
            "raw_scores": results,
        }


# Usage
suite = EvaluationSuite([
    Metric.accuracy(),
    Metric.bleu(),
    Metric.bertscore(),
    Metric.custom("groundedness", check_groundedness),
])

test_cases = [
    {
        "input": "What is the capital of France?",
        "expected": "Paris",
        "context": "France is a country in Europe. Paris is its capital.",
    },
]


async def main():
    # your_model is any object exposing an async predict(input) method
    return await suite.evaluate(model=your_model, test_cases=test_cases)


# await is only valid inside a coroutine, so run the evaluation via asyncio
results = asyncio.run(main())
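The Quick Start references `calculate_accuracy` and `check_groundedness` without defining them. A minimal sketch of what such functions might look like follows, assuming the keyword signature the suite uses (`prediction`, `reference`, `context`). These implementations and the `StubModel` class are hypothetical stand-ins, not the skill's own code, and the groundedness check is a crude token-overlap heuristic.

```python
import asyncio


def calculate_accuracy(prediction, reference=None, context=None):
    """Exact match for one example (1.0 or 0.0), ignoring case and outer whitespace."""
    if reference is None:
        return 0.0
    return float(prediction.strip().lower() == reference.strip().lower())


def check_groundedness(prediction, reference=None, context=None):
    """Share of prediction tokens that also occur in the context (crude heuristic)."""
    if not context:
        return 0.0
    context_tokens = set(context.lower().replace(".", " ").split())
    tokens = prediction.lower().replace(".", " ").split()
    if not tokens:
        return 0.0
    return sum(t in context_tokens for t in tokens) / len(tokens)


class StubModel:
    """Stand-in model with the async predict method the suite expects."""

    def __init__(self, answers):
        self.answers = answers

    async def predict(self, prompt):
        return self.answers.get(prompt, "")


async def main():
    question = "What is the capital of France?"
    model = StubModel({question: "Paris"})
    prediction = await model.predict(question)
    return {
        "accuracy": calculate_accuracy(prediction=prediction, reference="Paris"),
        "groundedness": check_groundedness(
            prediction=prediction,
            context="France is a country in Europe. Paris is its capital.",
        ),
    }


scores = asyncio.run(main())
```

Functions with this signature can be passed to `Metric.custom(name, fn)`; a real groundedness check would likely need something stronger than token overlap.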
Detailed patterns and worked examples
Detailed pattern documentation lives in references/details.md. Read that file when the navigation tier above is insufficient.