Audit score 70

minimal-run-and-audit

lllllllama/ai-paper-reproduction-skill

Execute and audit deep learning repo smoke tests with standardized evidence capture and a scientific changelog.

What is minimal-run-and-audit?

Minimal-run-and-audit is a rigor-focused skill for capturing execution evidence from documented inference, evaluation, or smoke test commands in deep learning repositories. Use it after a reproduction target and setup plan exist, when you need normalized outputs and audit trails without training orchestration or broad repo analysis.

  • Execute selected smoke tests, inference runs, or evaluation commands with full evidence capture
  • Generate standardized `repro_outputs/` files with execution results and metadata
  • Create `SCIENTIFIC_CHANGELOG.md` documenting any changes that alter evaluation, preprocessing, or metrics
  • Produce `COMPARABILITY_REPORT.md` comparing results against README and paper baselines
  • Write `PATCHES.md` when repository files are modified during execution
  • Distinguish between verified, partial, and blocked execution states

How to install minimal-run-and-audit

npx skills add https://github.com/lllllllama/ai-paper-reproduction-skill --skill minimal-run-and-audit
Prerequisites
  • Selected reproduction target and runnable command already identified
  • Environment and assets defined and available
  • Access to `references/reporting-policy.md` and `../../references/research-rigor-principles.md`
  • Python environment with `scripts/run_command.py` and `scripts/write_outputs.py` available
Claude Code
Cursor
Windsurf
Cline

How to use minimal-run-and-audit

  1. Identify the specific smoke test, inference, or evaluation command to execute
  2. Provide environment assumptions and any required asset paths
  3. Execute the command via the skill's run handler, capturing all stdout/stderr
  4. Normalize outputs into the `repro_outputs/` directory structure
  5. Document any repository file changes in `PATCHES.md` with scientific impact notes
  6. Generate `SCIENTIFIC_CHANGELOG.md` if execution altered evaluation, preprocessing, or metrics
  7. Create `COMPARABILITY_REPORT.md` mapping results against README and paper baselines
  8. Return an execution summary with verified/partial/blocked status
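
Steps 3 and 4 can be sketched roughly as follows. This is a minimal illustration, not the contents of the skill's `scripts/run_command.py`, whose interface is not shown here. The function name `run_and_capture`, the file names `stdout.log`, `stderr.log`, and `run.json`, and the metadata fields are all assumptions made for the example.

```python
import json
import subprocess
import time
from pathlib import Path


def _as_text(value):
    """Normalize captured output, which may be None, bytes, or str."""
    if value is None:
        return ""
    if isinstance(value, bytes):
        return value.decode("utf-8", errors="replace")
    return value


def run_and_capture(command, out_dir="repro_outputs", timeout=600):
    """Run one command and save stdout, stderr, and metadata under out_dir.

    Hypothetical helper: file names and metadata fields are illustrative only.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    started = time.time()
    try:
        proc = subprocess.run(command, capture_output=True, text=True, timeout=timeout)
        returncode, timed_out = proc.returncode, False
        stdout, stderr = proc.stdout, proc.stderr
    except subprocess.TimeoutExpired as exc:
        # A timeout has no exit code; keep whatever output was captured.
        returncode, timed_out = None, True
        stdout, stderr = _as_text(exc.stdout), _as_text(exc.stderr)

    (out / "stdout.log").write_text(stdout, encoding="utf-8")
    (out / "stderr.log").write_text(stderr, encoding="utf-8")

    meta = {
        "command": list(command),
        "returncode": returncode,
        "timed_out": timed_out,
        "duration_seconds": round(time.time() - started, 3),
    }
    (out / "run.json").write_text(json.dumps(meta, indent=2), encoding="utf-8")
    return meta
```

Passing the command as a list (for example `["python", "eval.py"]`) avoids shell quoting issues. A command that cannot be started at all raises `FileNotFoundError` in this sketch, which a caller would most likely report as a blocked run.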

Use cases

Good for
  • Running a documented inference command on a pre-trained checkpoint and capturing outputs for reproducibility verification
  • Executing a smoke test suite to validate environment setup without full training
  • Auditing evaluation metrics from a repository's standard benchmark command and comparing against paper claims
  • Capturing execution evidence when repository code has been patched, with clear scientific impact notes
  • Normalizing outputs from multiple short verification runs into a standardized report structure
Who it's for
  • ML researchers verifying paper reproduction fidelity
  • Reproducibility engineers auditing deep learning repositories
  • AI agents executing targeted verification commands with evidence trails
  • Teams needing standardized, auditable execution reports for scientific rigor

minimal-run-and-audit FAQ

Should I use this skill for training execution?

No. This skill is for smoke tests, inference runs, and evaluation commands only. Do not use it for training startup, resume, or long-running training state management.

What happens if the repository code needs to be modified to run?

Document all changes in `PATCHES.md` and clearly flag any modifications that alter scientific meaning (evaluation, preprocessing, metrics) in `SCIENTIFIC_CHANGELOG.md`. Do not hide or normalize risky code edits.
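
One way to keep the two files consistent is to record each patch once and derive both documents from the same records. The sketch below assumes a hypothetical `PatchRecord` structure and entry layout; the skill's actual `PATCHES.md` and `SCIENTIFIC_CHANGELOG.md` formats are defined by its reporting policy, not by this example.

```python
from dataclasses import dataclass

# Areas named in the answer above as altering scientific meaning.
SCIENTIFIC_AREAS = ("evaluation", "preprocessing", "metrics")


@dataclass
class PatchRecord:
    """Hypothetical record of one modified repository file."""
    path: str
    reason: str
    affects: tuple = ()  # areas the change touches, e.g. ("evaluation",)

    @property
    def alters_scientific_meaning(self):
        return any(area in SCIENTIFIC_AREAS for area in self.affects)


def render_patch_notes(records):
    """Return (patches_md, changelog_md) as text; the layout is an assumption."""
    patch_lines = []
    changelog_lines = []
    for rec in records:
        flag = "yes" if rec.alters_scientific_meaning else "no"
        patch_lines.append(f"- `{rec.path}`: {rec.reason} (scientific impact: {flag})")
        if rec.alters_scientific_meaning:
            areas = ", ".join(rec.affects)
            changelog_lines.append(f"- `{rec.path}` changes {areas}: {rec.reason}")
    return "\n".join(patch_lines), "\n".join(changelog_lines)
```

Every patch appears in the first output, and only flagged patches appear in the second, so a risky edit cannot be listed in one file and silently omitted from the other.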

Can this skill choose which reproduction target to run?

No. The reproduction target and runnable command must be selected beforehand. This skill only executes and audits the chosen command.

What should go in `COMPARABILITY_REPORT.md`?

Document how execution results compare against README documentation and paper baselines, noting any discrepancies or partial verification states.
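
As an illustration of the comparison itself, the sketch below checks one observed metric against named baselines. The function name `compare_to_baselines`, the row fields, and the absolute-tolerance rule are assumptions for this example; the skill does not specify how discrepancies should be quantified.

```python
def compare_to_baselines(observed, baselines, tolerance):
    """Compare one observed metric with each named baseline.

    Hypothetical helper: `baselines` maps a source label (such as "README"
    or "paper") to its reported value; `tolerance` is an absolute margin.
    """
    rows = []
    for source, reference in baselines.items():
        delta = observed - reference
        rows.append({
            "source": source,
            "reference": reference,
            "observed": observed,
            "delta": delta,
            "within_tolerance": abs(delta) <= tolerance,
        })
    return rows


# Placeholder numbers for illustration only, not results from any repository.
rows = compare_to_baselines(75.0, {"README": 75.5, "paper": 76.0}, tolerance=0.5)
for row in rows:
    print(row["source"], row["delta"], row["within_tolerance"])
```

Rows where `within_tolerance` is false are the discrepancies the report should call out explicitly.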

When should I use this versus the main reproduction skill?

Use this skill when you have a specific, documented command ready to execute and need evidence capture and reporting. Use the main skill for broader target selection, setup planning, and orchestration.

Full instructions (SKILL.md)

Source of truth, from lllllllama/ai-paper-reproduction-skill.


name: minimal-run-and-audit
description: Rigor Run skill for README-first deep learning repo reproduction. Use when the task is specifically to capture or normalize evidence from the selected smoke test or documented inference or evaluation command and write standardized repro_outputs/ files, including patch notes when repository files changed. Do not use for training execution, initial repo intake, generic environment setup, paper lookup, target selection, hidden scientific-meaning changes, or end-to-end orchestration by itself.

minimal-run-and-audit

Use this as the Rigor Run skill. The installed slug remains minimal-run-and-audit for compatibility.

Use the shared operating principles in ../../references/agent-operating-principles.md; this skill should make run evidence auditable without turning every command into a rigid protocol.

When to apply

  • After a reproduction target and setup plan exist.
  • When the main skill needs execution evidence and normalized outputs.
  • When a smoke test, documented inference run, documented evaluation run, or other short non-training verification is appropriate.
  • When the user already knows what command should be attempted and wants execution plus reporting only.

When not to apply

  • During initial repo scanning.
  • When the environment or assets are still too undefined for execution to be meaningful.
  • When the task is a literature lookup rather than repository execution.
  • When the user is still deciding which reproduction target should count as the main run.

Clear boundaries

  • This skill owns normalized reporting for an attempted command.
  • It may receive execution evidence from the main skill or a thin helper.
  • It does not choose the overall target on its own.
  • It does not perform broad paper analysis.
  • It does not own training startup, resume, or long-running training state.
  • It should not normalize risky code edits into acceptable practice.
  • It must not hide changes that alter evaluation, preprocessing, checkpoints, metrics, or other scientific meaning.

Input expectations

  • selected reproduction goal
  • runnable commands or smoke commands
  • environment and asset assumptions
  • optional patch metadata

Output expectations

  • execution result summary
  • standardized repro_outputs/ files
  • SCIENTIFIC_CHANGELOG.md for changed scientific meaning and evidence status
  • COMPARABILITY_REPORT.md for README/paper/baseline comparability
  • clear distinction between verified, partial, and blocked states
  • PATCHES.md when repo files changed
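
The three states could be derived mechanically from run evidence. The rule below is an assumption chosen for illustration (blocked when the command did not exit cleanly, partial when it ran but expected artifacts are missing, verified otherwise); the authoritative definitions belong to the skill's reporting policy. `RunStatus` and `classify_run` are hypothetical names.

```python
from enum import Enum


class RunStatus(Enum):
    VERIFIED = "verified"
    PARTIAL = "partial"
    BLOCKED = "blocked"


def classify_run(returncode, expected_artifacts, found_artifacts):
    """Map run evidence to a status using an assumed, illustrative rule."""
    # A non-zero exit code, or None for a run that never finished, is blocked.
    if returncode != 0:
        return RunStatus.BLOCKED
    missing = set(expected_artifacts) - set(found_artifacts)
    return RunStatus.PARTIAL if missing else RunStatus.VERIFIED
```

Keeping the rule in one function makes the distinction auditable: a reviewer can see exactly why a run was labeled partial rather than verified.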

Notes

Use references/reporting-policy.md, ../../references/research-rigor-principles.md, scripts/run_command.py, and scripts/write_outputs.py.