Skill

Pass

Audit score 90

llm-evaluation

wshobson/agents

A structured framework for evaluating LLM outputs with automated metrics, human review, and LLM-as-judge.

What is llm-evaluation?

This skill provides a framework and conceptual guide for evaluating LLM applications, covering automated metrics (BLEU, ROUGE, BERTScore, classification, retrieval metrics), human evaluation dimensions, and LLM-as-judge approaches. Use it when you need to systematically measure LLM quality, compare models/prompts, detect regressions, or build an evaluation pipeline.

Defines categories of LLM evaluation: automated metrics, human evaluation, and LLM-as-judge
Lists common text-generation, classification, and retrieval metrics
Provides a Python EvaluationSuite/Metric scaffold to run metrics over test cases and aggregate scores
Supports custom metric functions via Metric.custom()
Points to references/details.md for deeper worked examples and patterns

How to install llm-evaluation

npx skills add https://github.com/wshobson/agents --skill llm-evaluation

Prerequisites

Python environment with numpy
A model object exposing an async predict method
Implementations of metric calculation functions (e.g., calculate_accuracy, calculate_bleu, calculate_bertscore) or equivalents
Test case dataset with inputs and expected outputs/context
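
As a rough illustration of the second and third prerequisites, the sketch below shows a stand-in model with an async predict method and one possible calculate_accuracy. Both are assumptions for illustration only: EchoModel and its canned answers are hypothetical, and case-insensitive exact match is just one reasonable definition of per-example accuracy, not the skill's own implementation. The keyword arguments (prediction, reference, context) mirror how the SKILL.md quick start calls metric functions.

```python
import asyncio
from typing import Optional


class EchoModel:
    """Hypothetical stand-in model: returns canned answers instead of calling an LLM."""

    def __init__(self, answers: dict):
        self.answers = answers

    async def predict(self, prompt: str) -> str:
        # A real model would await a network or inference call here.
        return self.answers.get(prompt, "")


def calculate_accuracy(
    prediction: str,
    reference: Optional[str] = None,
    context: Optional[str] = None,
) -> float:
    """Assumed definition: 1.0 on a case-insensitive exact match, else 0.0."""
    if reference is None:
        return 0.0
    return float(prediction.strip().lower() == reference.strip().lower())


model = EchoModel({"What is the capital of France?": "Paris"})
prediction = asyncio.run(model.predict("What is the capital of France?"))
score = calculate_accuracy(prediction=prediction, reference="paris")
```

Any object with a compatible async predict method could replace EchoModel, for example a thin wrapper around a provider SDK.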

How to use llm-evaluation

1. Identify the evaluation type needed for your use case: automated metrics, human evaluation, or LLM-as-judge.
2. Define test cases as a list of dicts with input, expected output, and optional context.
3. Select or define Metric objects (e.g., Metric.accuracy(), Metric.bleu(), Metric.bertscore(), or Metric.custom() for bespoke checks).
4. Instantiate an EvaluationSuite with the chosen metrics.
5. Run suite.evaluate(model, test_cases) against your model to get aggregated and raw scores.
6. Consult references/details.md for deeper patterns and worked examples when the quick-start approach is insufficient.

Use cases

Good for

Comparing two prompts or model versions on a fixed test set
Detecting performance regressions before deploying a new model/prompt
Building a RAG evaluation pipeline with MRR/NDCG/Precision@K metrics
Setting up an LLM-as-judge pipeline to score outputs at scale
Establishing a baseline evaluation suite for an LLM application

Who it's for

ML/AI engineers building evaluation pipelines for LLM applications
Teams comparing models or prompt versions before deployment
Engineers establishing regression testing for AI systems
Researchers benchmarking RAG or classification systems

llm-evaluation FAQ

Does this skill provide an LLM-as-judge implementation?

It documents the LLM-as-judge approach conceptually (pointwise, pairwise, reference-based, reference-free), but the quick start code covers only automated metrics. Further worked examples are in references/details.md.

What metrics are included out of the box?

The skill references metrics like accuracy, BLEU, BERTScore, ROUGE, METEOR, perplexity, precision/recall/F1, AUC-ROC, MRR, NDCG, and Precision@K/Recall@K, plus a custom metric hook, but does not implement all of them in the shown code.

Is this tied to a specific LLM provider or framework?

No. The SKILL.md shows generic Python code (EvaluationSuite/Metric classes) that works with any model exposing an async predict method; no provider-specific integration is described.

Can I add my own evaluation metric?

Yes, the Metric.custom(name, fn) pattern lets you plug in a custom scoring function such as a groundedness checker.
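
A minimal sketch of what such a custom scoring function might look like. The token-overlap heuristic here is an assumption chosen for brevity (the skill does not define check_groundedness), and the _tokens helper is hypothetical. The function accepts the same keyword arguments that the quick start passes to every metric.

```python
import re


def _tokens(text: str) -> set:
    """Lowercased alphanumeric tokens (hypothetical helper)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def check_groundedness(prediction: str, reference=None, context=None) -> float:
    """Crude proxy: fraction of distinct prediction tokens that appear in the context."""
    pred = _tokens(prediction)
    if not pred or not context:
        return 0.0
    return len(pred & _tokens(context)) / len(pred)


context = "France is a country in Europe. Paris is its capital."
grounded = check_groundedness(prediction="Paris is the capital", context=context)
ungrounded = check_groundedness(prediction="Berlin", context=context)
```

A function with this shape could then be registered as Metric.custom("groundedness", check_groundedness). Token overlap does not verify factual entailment, so a production check would likely need an NLI model or an LLM judge.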

Full instructions (SKILL.md)

Source of truth, from wshobson/agents.

name: llm-evaluation
description: Implement comprehensive evaluation strategies for LLM applications using automated metrics, human feedback, and benchmarking. Use when testing LLM performance, measuring AI application quality, or establishing evaluation frameworks.

LLM Evaluation

Master comprehensive evaluation strategies for LLM applications, from automated metrics to human evaluation and A/B testing.

When to Use This Skill

Measuring LLM application performance systematically
Comparing different models or prompts
Detecting performance regressions before deployment
Validating improvements from prompt changes
Building confidence in production systems
Establishing baselines and tracking progress over time
Debugging unexpected model behavior
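
One item in the list above, detecting regressions before deployment, can be sketched as a comparison of aggregated scores from two evaluation runs. The function name find_regressions, the 0.02 tolerance, and the example scores are all hypothetical. The sketch assumes every metric is higher-is-better, which does not hold for metrics such as perplexity.

```python
def find_regressions(baseline: dict, candidate: dict, tolerance: float = 0.02) -> dict:
    """Return {metric: delta} for metrics that dropped by more than `tolerance`."""
    drops = {}
    for name, base_score in baseline.items():
        if name not in candidate:
            continue  # metric missing from the candidate run; nothing to compare
        delta = candidate[name] - base_score
        if delta < -tolerance:
            drops[name] = delta
    return drops


# Illustrative numbers only, not measured results.
baseline = {"accuracy": 0.90, "bleu": 0.40}
candidate = {"accuracy": 0.85, "bleu": 0.41}
regressions = find_regressions(baseline, candidate)
```

In a CI job, a non-empty result could fail the build. With small test sets, a fixed tolerance may be noisy, so a statistical test over the raw per-example scores may be a better gate.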

Core Evaluation Types

1. Automated Metrics

Fast, repeatable, scalable evaluation using computed scores.

Text Generation:

BLEU: N-gram overlap (translation)
ROUGE: Recall-oriented (summarization)
METEOR: Unigram matching with stemming and synonyms
BERTScore: Embedding-based similarity
Perplexity: How well a language model predicts the text (lower is better)
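
To make the n-gram overlap idea concrete, here is a simplified sketch of unigram overlap. Its precision corresponds to the clipped unigram precision at the core of BLEU-1 (without the brevity penalty or higher-order n-grams), and its recall corresponds to ROUGE-1 recall. The function name and whitespace tokenization are assumptions. Real evaluations would normally use an established library rather than this.

```python
from collections import Counter


def unigram_overlap(prediction: str, reference: str) -> dict:
    """Clipped unigram precision, recall, and F1 using whitespace tokens."""
    pred = Counter(prediction.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token (clipping).
    overlap = sum((pred & ref).values())
    precision = overlap / sum(pred.values()) if pred else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


scores = unigram_overlap("the cat sat on the mat", "the cat is on the mat")
```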

Classification:

Accuracy: Percentage correct
Precision/Recall/F1: Class-specific performance
Confusion Matrix: Error patterns
AUC-ROC: Ranking quality
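
The classification metrics above can be sketched from raw counts. This is an illustrative implementation for a single positive class. The function name and the spam/ham labels are hypothetical, and libraries such as scikit-learn provide tested equivalents.

```python
def precision_recall_f1(y_true: list, y_pred: list, positive) -> dict:
    """Precision, recall, and F1 for one class, computed from TP/FP/FN counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def accuracy(y_true: list, y_pred: list) -> float:
    """Fraction of predictions that equal the true label."""
    if not y_true:
        return 0.0
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


y_true = ["spam", "spam", "ham", "ham", "spam"]
y_pred = ["spam", "ham", "ham", "spam", "spam"]
spam_scores = precision_recall_f1(y_true, y_pred, positive="spam")
overall_accuracy = accuracy(y_true, y_pred)
```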

Retrieval (RAG):

MRR: Mean Reciprocal Rank
NDCG: Normalized Discounted Cumulative Gain
Precision@K: Relevant in top K
Recall@K: Coverage in top K
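
The four retrieval metrics can be sketched for binary relevance as follows. The function names and document IDs are hypothetical. The NDCG variant shown uses a gain of 1 for relevant documents and a log2 rank discount, which is one common formulation, and k is assumed to be at least 1.

```python
import math


def reciprocal_rank(ranked: list, relevant: set) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries: list) -> float:
    """Average reciprocal rank over (ranked, relevant) pairs, one per query."""
    if not queries:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)


def precision_at_k(ranked: list, relevant: set, k: int) -> float:
    """Share of the top-k results that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k


def recall_at_k(ranked: list, relevant: set, k: int) -> float:
    """Share of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)


def ndcg_at_k(ranked: list, relevant: set, k: int) -> float:
    """Binary-gain NDCG: DCG of the ranking divided by DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
rr = reciprocal_rank(ranked, relevant)
mrr = mean_reciprocal_rank([(ranked, relevant), (["d9", "d8"], {"d9"})])
p_at_3 = precision_at_k(ranked, relevant, 3)
r_at_3 = recall_at_k(ranked, relevant, 3)
ndcg_at_4 = ndcg_at_k(ranked, relevant, 4)
```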

2. Human Evaluation

Manual assessment for quality aspects difficult to automate.

Dimensions:

Accuracy: Factual correctness
Coherence: Logical flow
Relevance: Answers the question
Fluency: Natural language quality
Safety: No harmful content
Helpfulness: Useful to the user
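
Human ratings along these dimensions still need aggregating. The sketch below averages per-dimension scores across raters. The 1-to-5 scale, the lowercase dimension keys, and the aggregate_ratings name are assumptions for illustration, since the skill does not prescribe a rating scale or schema.

```python
from statistics import mean

DIMENSIONS = ("accuracy", "coherence", "relevance", "fluency", "safety", "helpfulness")


def aggregate_ratings(ratings: list) -> dict:
    """Mean score per dimension across raters; dimensions nobody rated are omitted."""
    summary = {}
    for dim in DIMENSIONS:
        scores = [r[dim] for r in ratings if dim in r]
        if scores:
            summary[dim] = mean(scores)
    return summary


# Two hypothetical raters scoring one response on an assumed 1-5 scale.
ratings = [
    {"accuracy": 5, "coherence": 4, "relevance": 5, "fluency": 4, "safety": 5, "helpfulness": 4},
    {"accuracy": 4, "coherence": 4, "relevance": 3, "fluency": 5, "safety": 5, "helpfulness": 3},
]
summary = aggregate_ratings(ratings)
```

Averages hide disagreement between raters, so it is usually worth also reporting an agreement statistic alongside them.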

3. LLM-as-Judge

Use a capable LLM, typically stronger than the model under test, to score or compare outputs.

Approaches:

Pointwise: Score individual responses
Pairwise: Compare two responses
Reference-based: Compare to gold standard
Reference-free: Judge without ground truth
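
As an illustration of the pairwise approach, the sketch below asks a judge to compare two answers twice with the positions swapped, and only declares a winner when both orderings agree. The prompt wording, the pairwise_judge function, and the judge callable (any function that maps a prompt string to the judge model's text reply) are all hypothetical. fake_judge is a deterministic stand-in so the example runs without an API.

```python
from typing import Callable

PAIRWISE_TEMPLATE = (
    "You are comparing two answers to the same question.\n"
    "Question: {question}\n"
    "Answer A: {a}\n"
    "Answer B: {b}\n"
    "Reply with exactly one of: A, B, TIE."
)


def pairwise_judge(question: str, answer_1: str, answer_2: str,
                   judge: Callable[[str], str]) -> str:
    """Return 'first', 'second', or 'tie' after judging both orderings."""

    def ask(a: str, b: str) -> str:
        reply = judge(PAIRWISE_TEMPLATE.format(question=question, a=a, b=b))
        verdict = reply.strip().upper()
        return verdict if verdict in {"A", "B"} else "TIE"

    forward = ask(answer_1, answer_2)   # answer_1 shown as A
    backward = ask(answer_2, answer_1)  # answer_1 shown as B
    if forward == "A" and backward == "B":
        return "first"
    if forward == "B" and backward == "A":
        return "second"
    return "tie"  # inconsistent or tied verdicts


def fake_judge(prompt: str) -> str:
    """Stand-in judge: prefers whichever answer mentions 'Paris'."""
    a_line = next(line for line in prompt.splitlines() if line.startswith("Answer A:"))
    return "A" if "Paris" in a_line else "B"


question = "What is the capital of France?"
winner = pairwise_judge(question, "Paris", "Lyon", fake_judge)
swapped = pairwise_judge(question, "Lyon", "Paris", fake_judge)
biased = pairwise_judge(question, "Paris", "Lyon", lambda prompt: "A")
```

Swapping positions is a simple guard against a judge that favors whichever answer appears first. A judge that always answers "A" ends up with a tie instead of a spurious win.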

Quick Start

from dataclasses import dataclass
from typing import Callable
import numpy as np

@dataclass
class Metric:
    name: str
    fn: Callable

    @staticmethod
    def accuracy():
        return Metric("accuracy", calculate_accuracy)

    @staticmethod
    def bleu():
        return Metric("bleu", calculate_bleu)

    @staticmethod
    def bertscore():
        return Metric("bertscore", calculate_bertscore)

    @staticmethod
    def custom(name: str, fn: Callable):
        return Metric(name, fn)

class EvaluationSuite:
    def __init__(self, metrics: list[Metric]):
        self.metrics = metrics

    async def evaluate(self, model, test_cases: list[dict]) -> dict:
        results = {m.name: [] for m in self.metrics}

        for test in test_cases:
            prediction = await model.predict(test["input"])

            for metric in self.metrics:
                score = metric.fn(
                    prediction=prediction,
                    reference=test.get("expected"),
                    context=test.get("context")
                )
                results[metric.name].append(score)

        return {
            "metrics": {k: np.mean(v) for k, v in results.items()},
            "raw_scores": results
        }

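# Usage
# calculate_accuracy, calculate_bleu, calculate_bertscore, and check_groundedness
# are not defined here; supply your own implementations (see Prerequisites).
import asyncio

suite = EvaluationSuite([
    Metric.accuracy(),
    Metric.bleu(),
    Metric.bertscore(),
    Metric.custom("groundedness", check_groundedness)
])

test_cases = [
    {
        "input": "What is the capital of France?",
        "expected": "Paris",
        "context": "France is a country in Europe. Paris is its capital."
    },
]

async def main():
    # your_model is any object exposing an async predict(input) method
    results = await suite.evaluate(model=your_model, test_cases=test_cases)
    print(results["metrics"])

# Top-level await only works in notebooks and async REPLs; scripts need an event loop.
asyncio.run(main())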

Detailed patterns and worked examples

Detailed pattern documentation lives in references/details.md. Read that file when the overview above is insufficient.
