PluginBench
Skill
Pass
Audit score 90

llm-evaluation

wshobson/agents

A structured framework for evaluating LLM outputs with automated metrics, human review, and LLM-as-judge.

What is llm-evaluation?

This skill provides a framework and conceptual guide for evaluating LLM applications. It covers automated metrics (BLEU, ROUGE, BERTScore, plus classification and retrieval metrics), human evaluation dimensions, and LLM-as-judge approaches. Use it when you need to systematically measure LLM quality, compare models or prompts, detect regressions, or build an evaluation pipeline.

  • Defines categories of LLM evaluation: automated metrics, human evaluation, and LLM-as-judge
  • Lists common text-generation, classification, and retrieval metrics
  • Provides a Python EvaluationSuite/Metric scaffold to run metrics over test cases and aggregate scores
  • Supports custom metric functions via Metric.custom()
  • Points to references/details.md for deeper worked examples and patterns

How to install llm-evaluation

npx skills add https://github.com/wshobson/agents --skill llm-evaluation
Prerequisites
  • Python environment with numpy
  • A model object exposing an async predict method
  • Implementations of metric calculation functions (e.g., calculate_accuracy, calculate_bleu, calculate_bertscore) or equivalents
  • Test case dataset with inputs and expected outputs/context

How to use llm-evaluation

  1. Identify the evaluation type needed for your use case: automated metrics, human evaluation, or LLM-as-judge.
  2. Define test cases as a list of dicts with input, expected output, and optional context.
  3. Select or define Metric objects (e.g., Metric.accuracy(), Metric.bleu(), Metric.bertscore(), or Metric.custom() for bespoke checks).
  4. Instantiate an EvaluationSuite with the chosen metrics.
  5. Run suite.evaluate(model, test_cases) against your model to get aggregated and raw scores.
  6. Consult references/details.md for deeper patterns and worked examples when the quick-start approach is insufficient.

Use cases

Good for
  • Comparing two prompts or model versions on a fixed test set
  • Detecting performance regressions before deploying a new model/prompt
  • Building a RAG evaluation pipeline with MRR/NDCG/Precision@K metrics
  • Setting up an LLM-as-judge pipeline to score outputs at scale
  • Establishing a baseline evaluation suite for an LLM application
Who it's for
  • ML/AI engineers building evaluation pipelines for LLM applications
  • Teams comparing models or prompt versions before deployment
  • Engineers establishing regression testing for AI systems
  • Researchers benchmarking RAG or classification systems
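
The regression-detection use case listed above can be sketched as a comparison of two aggregated score dicts, such as the "metrics" entry that EvaluationSuite.evaluate returns for a baseline run and a candidate run. This is a minimal sketch, not part of the skill: the find_regressions name and the tolerance parameter are hypothetical, and it assumes every metric is higher-is-better.

```python
def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.01,
) -> dict[str, float]:
    """Return {metric_name: drop} for metrics that fell by more than tolerance.

    Hypothetical helper; assumes higher scores are better for every metric.
    """
    regressions = {}
    for name, base_score in baseline.items():
        if name not in candidate:
            continue  # metric was not measured in the candidate run
        drop = base_score - candidate[name]
        if drop > tolerance:
            regressions[name] = drop
    return regressions


baseline_scores = {"accuracy": 0.90, "bleu": 0.40}
candidate_scores = {"accuracy": 0.85, "bleu": 0.41}
regressed = find_regressions(baseline_scores, candidate_scores)
print(sorted(regressed))
```

A deployment gate could fail the build whenever the returned dict is non-empty. Metrics where lower is better, such as perplexity, would need their sign flipped first.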

llm-evaluation FAQ

Does this skill provide an LLM-as-judge implementation?

It documents the LLM-as-judge approach conceptually (pointwise, pairwise, reference-based, reference-free), but the Quick Start code covers only automated metrics; further worked examples are in references/details.md.

What metrics are included out of the box?

The skill references metrics like accuracy, BLEU, BERTScore, ROUGE, METEOR, perplexity, precision/recall/F1, AUC-ROC, MRR, NDCG, and Precision@K/Recall@K, plus a custom metric hook. The shown code only wires up accuracy, BLEU, BERTScore, and the custom hook, and it leaves the underlying calculate_* functions for you to supply.

Is this tied to a specific LLM provider or framework?

No. SKILL.md shows generic Python code (EvaluationSuite/Metric classes) that works with any model exposing an async predict method; no provider-specific integration is described.

Can I add my own evaluation metric?

Yes, the Metric.custom(name, fn) pattern lets you plug in a custom scoring function such as a groundedness checker.

Full instructions (SKILL.md)

Source of truth, from wshobson/agents.


name: llm-evaluation
description: Implement comprehensive evaluation strategies for LLM applications using automated metrics, human feedback, and benchmarking. Use when testing LLM performance, measuring AI application quality, or establishing evaluation frameworks.

LLM Evaluation

Master comprehensive evaluation strategies for LLM applications, from automated metrics to human evaluation and A/B testing.

When to Use This Skill

  • Measuring LLM application performance systematically
  • Comparing different models or prompts
  • Detecting performance regressions before deployment
  • Validating improvements from prompt changes
  • Building confidence in production systems
  • Establishing baselines and tracking progress over time
  • Debugging unexpected model behavior

Core Evaluation Types

1. Automated Metrics

Fast, repeatable, scalable evaluation using computed scores.

Text Generation:

  • BLEU: N-gram overlap (translation)
  • ROUGE: Recall-oriented (summarization)
  • METEOR: Unigram matching with stemming and synonyms
  • BERTScore: Embedding-based similarity
  • Perplexity: How well a language model predicts the text (lower is better)
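
To make the n-gram overlap idea behind BLEU and ROUGE concrete, here is a minimal sketch using whitespace tokenization. It computes only the clipped n-gram precision that BLEU builds on and the n-gram recall that ROUGE-N builds on. It is not a full BLEU or ROUGE implementation (there is no brevity penalty and no averaging across n-gram orders), and the function names are illustrative rather than part of the skill.

```python
from collections import Counter


def ngrams(tokens: list[str], n: int) -> Counter:
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_precision(prediction: str, reference: str, n: int = 1) -> float:
    """Clipped n-gram precision: overlap divided by prediction n-grams."""
    pred = ngrams(prediction.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(pred.values())
    if total == 0:
        return 0.0
    # Counter intersection clips each n-gram count to its reference count.
    return sum((pred & ref).values()) / total


def ngram_recall(prediction: str, reference: str, n: int = 1) -> float:
    """N-gram recall: overlap divided by reference n-grams."""
    pred = ngrams(prediction.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    return sum((pred & ref).values()) / total


prediction = "the cat sat on the mat"
reference = "the cat is on the mat"
print(ngram_precision(prediction, reference, n=1))
print(ngram_recall(prediction, reference, n=2))
```

For real evaluations, an established library implementation is usually a safer choice than a hand-rolled score, since tokenization and smoothing details change the numbers.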

Classification:

  • Accuracy: Percentage correct
  • Precision/Recall/F1: Class-specific performance
  • Confusion Matrix: Error patterns
  • AUC-ROC: Ranking quality
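
As an illustration of the classification metrics above, the following sketch computes accuracy and per-class precision, recall, and F1 from parallel lists of predicted and true labels. The function names and the positive_label parameter are assumptions made for this example, not names defined by the skill.

```python
def accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that exactly match the true label."""
    if not references:
        return 0.0
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


def precision_recall_f1(
    predictions: list[str], references: list[str], positive_label: str
) -> tuple[float, float, float]:
    """Precision, recall, and F1 for one class treated as the positive class."""
    pairs = list(zip(predictions, references))
    tp = sum(p == positive_label and r == positive_label for p, r in pairs)
    fp = sum(p == positive_label and r != positive_label for p, r in pairs)
    fn = sum(p != positive_label and r == positive_label for p, r in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


predicted = ["spam", "ham", "spam", "spam"]
actual = ["spam", "spam", "ham", "spam"]
print(accuracy(predicted, actual))
print(precision_recall_f1(predicted, actual, positive_label="spam"))
```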

Retrieval (RAG):

  • MRR: Mean Reciprocal Rank
  • NDCG: Normalized Discounted Cumulative Gain
  • Precision@K: Fraction of the top K results that are relevant
  • Recall@K: Fraction of all relevant items found in the top K
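
The retrieval metrics above can be sketched as follows for the simple case of binary relevance, where each document is either relevant or not. The function names and the document IDs are illustrative assumptions. NDCG is shown in its binary form only; graded relevance would need per-document gain values.

```python
import math


def reciprocal_rank(ranked_ids: list[str], relevant_ids: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    if not queries:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)


def precision_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the top k results that are relevant."""
    if k <= 0:
        return 0.0
    return sum(d in relevant_ids for d in ranked_ids[:k]) / k


def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant_ids:
        return 0.0
    return sum(d in relevant_ids for d in ranked_ids[:k]) / len(relevant_ids)


def ndcg_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(reciprocal_rank(ranked, relevant))
print(precision_at_k(ranked, relevant, k=2), recall_at_k(ranked, relevant, k=4))
print(ndcg_at_k(ranked, relevant, k=4))
```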

2. Human Evaluation

Manual assessment for quality aspects difficult to automate.

Dimensions:

  • Accuracy: Factual correctness
  • Coherence: Logical flow
  • Relevance: Answers the question
  • Fluency: Natural language quality
  • Safety: No harmful content
  • Helpfulness: Useful to the user

3. LLM-as-Judge

Use a stronger LLM to score or compare the outputs of the model under evaluation.

Approaches:

  • Pointwise: Score individual responses
  • Pairwise: Compare two responses
  • Reference-based: Compare to gold standard
  • Reference-free: Judge without ground truth
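
A pointwise, reference-free judge could be sketched as below. Everything here is an assumption made for illustration: the prompt wording, the 1 to 5 scale, the "Score:" reply format, and the call_judge callable, which stands in for whatever client you use to call the judge model. The skill itself does not prescribe any of these.

```python
import re
from typing import Callable, Optional

# Hypothetical prompt template; adapt the rubric to your own quality criteria.
POINTWISE_TEMPLATE = (
    "You are grading an answer to a question.\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Rate the answer from 1 (poor) to 5 (excellent).\n"
    'Reply with the line "Score: <number>" followed by a one-sentence reason.'
)


def build_pointwise_prompt(question: str, answer: str) -> str:
    """Fill the judge prompt for a single response."""
    return POINTWISE_TEMPLATE.format(question=question, answer=answer)


def parse_score(judge_reply: str, low: int = 1, high: int = 5) -> Optional[int]:
    """Extract the score, or None if the reply is malformed or out of range."""
    match = re.search(r"Score:\s*(\d+)", judge_reply)
    if match is None:
        return None
    score = int(match.group(1))
    return score if low <= score <= high else None


def judge_pointwise(
    question: str, answer: str, call_judge: Callable[[str], str]
) -> Optional[int]:
    """Score one response; call_judge sends a prompt to the judge model."""
    return parse_score(call_judge(build_pointwise_prompt(question, answer)))


def fake_judge(prompt: str) -> str:
    """Canned reply so the sketch runs without a real judge model."""
    return "Score: 4\nCorrect and concise."


print(judge_pointwise("What is the capital of France?", "Paris", fake_judge))
```

Returning None for unparseable replies keeps malformed judge output out of the averages; callers can count those cases separately. A pairwise variant would show the judge two answers and ask which is better.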

Quick Start

from dataclasses import dataclass
from typing import Callable
import numpy as np

@dataclass
class Metric:
    name: str
    fn: Callable

    @staticmethod
    def accuracy():
        return Metric("accuracy", calculate_accuracy)

    @staticmethod
    def bleu():
        return Metric("bleu", calculate_bleu)

    @staticmethod
    def bertscore():
        return Metric("bertscore", calculate_bertscore)

    @staticmethod
    def custom(name: str, fn: Callable):
        return Metric(name, fn)

class EvaluationSuite:
    def __init__(self, metrics: list[Metric]):
        self.metrics = metrics

    async def evaluate(self, model, test_cases: list[dict]) -> dict:
        results = {m.name: [] for m in self.metrics}

        for test in test_cases:
            prediction = await model.predict(test["input"])

            for metric in self.metrics:
                score = metric.fn(
                    prediction=prediction,
                    reference=test.get("expected"),
                    context=test.get("context")
                )
                results[metric.name].append(score)

        return {
            "metrics": {k: np.mean(v) for k, v in results.items()},
            "raw_scores": results
        }

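# Usage
# Note: calculate_accuracy, calculate_bleu, calculate_bertscore,
# check_groundedness, and your_model are placeholders you must supply
# (see Prerequisites).
import asyncio

suite = EvaluationSuite([
    Metric.accuracy(),
    Metric.bleu(),
    Metric.bertscore(),
    Metric.custom("groundedness", check_groundedness)
])

test_cases = [
    {
        "input": "What is the capital of France?",
        "expected": "Paris",
        "context": "France is a country in Europe. Paris is its capital."
    },
]

async def main():
    # await is only valid inside a coroutine, so wrap the call in one
    results = await suite.evaluate(model=your_model, test_cases=test_cases)
    print(results["metrics"])

asyncio.run(main())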
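
The Quick Start references calculate_accuracy, check_groundedness, and a model object without defining them. The sketch below shows one possible shape for those pieces, matching the keyword arguments (prediction, reference, context) that EvaluationSuite.evaluate passes to each metric. The implementations are deliberately naive assumptions: exact-match accuracy, token-overlap groundedness, and a hypothetical LookupModel that answers from a dict instead of calling an LLM.

```python
import asyncio
from typing import Optional


def calculate_accuracy(
    prediction: str, reference: Optional[str], context: Optional[str] = None
) -> float:
    """Exact-match score for one test case: 1.0 if equal, else 0.0."""
    return float(prediction.strip().lower() == (reference or "").strip().lower())


def check_groundedness(
    prediction: str, reference: Optional[str] = None, context: Optional[str] = None
) -> float:
    """Naive groundedness: fraction of prediction tokens found in the context."""
    pred_tokens = prediction.lower().split()
    if not pred_tokens or not context:
        return 0.0
    # Treat periods as separators so "Paris." and "Paris" count as the same token.
    context_tokens = set(context.lower().replace(".", " ").split())
    return sum(t in context_tokens for t in pred_tokens) / len(pred_tokens)


class LookupModel:
    """Stand-in model exposing the async predict method the suite expects."""

    def __init__(self, answers: dict[str, str]):
        self.answers = answers

    async def predict(self, prompt: str) -> str:
        return self.answers.get(prompt, "")


question = "What is the capital of France?"
context = "France is a country in Europe. Paris is its capital."
model = LookupModel({question: "Paris"})

prediction = asyncio.run(model.predict(question))
print(calculate_accuracy(prediction=prediction, reference="Paris", context=context))
print(check_groundedness(prediction=prediction, reference="Paris", context=context))
```

Because every metric function accepts the same three keyword arguments, each can be passed to Metric.custom(name, fn) unchanged.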

Detailed patterns and worked examples

Detailed pattern documentation lives in references/details.md. Read that file when the overview above is insufficient.