Audit score 70

ai-paper-reproduction

lllllllama/ai-paper-reproduction-skill

End-to-end README-first reproduction of AI paper repositories with auditable outputs and conservative patch rules.

What is ai-paper-reproduction?

Orchestrates minimal trustworthy reproduction of AI paper code repositories by reading the README first, selecting the smallest documented target (inference preferred over evaluation over training), coordinating setup and execution, and generating a standardized repro_outputs/ bundle with evidence and assumptions. Use this when you need a complete, auditable reproduction workflow that respects research safety and avoids silent protocol changes.

  • Reads README and repository signals to extract documented reproduction intent
  • Selects the smallest trustworthy reproduction target (documented inference preferred)
  • Coordinates environment setup, asset preparation, and conservative code patching
  • Executes trusted runs with smoke checks and captures process evidence
  • Records assumptions, deviations, and human decision points for auditability
  • Generates standardized repro_outputs/ bundle (SUMMARY.md, COMMANDS.md, LOG.md, status.json, PATCHES.md)

How to install ai-paper-reproduction

npx skills add https://github.com/lllllllama/ai-paper-reproduction-skill --skill ai-paper-reproduction
Prerequisites
  • A code repository with a README and documented commands or scripts
  • Git access to the repository
  • Environment tools (Python, conda, or equivalent) as specified in the README
Claude Code
Cursor
Windsurf
Cline

How to use ai-paper-reproduction

  1. Provide the repository URL or local path and clarify your reproduction goal (inference, evaluation, or training)
  2. The skill reads the README and scans the repository structure to extract documented commands
  3. Review the proposed minimum trustworthy target and confirm or adjust it
  4. The skill prepares the environment, applies any necessary conservative patches, and executes the target
  5. Review the generated repro_outputs/ bundle (SUMMARY.md, LOG.md, status.json) for evidence and assumptions
  6. Use COMMANDS.md to re-run or share the reproduction steps

Use cases

Good for
  • Reproduce a published AI model's inference or evaluation from its official repository
  • Verify a paper's documented training startup without running full training
  • Audit a repository for reproducibility gaps and generate an evidence trail
  • Coordinate multi-stage reproduction (intake → setup → execution → reporting) in one workflow
  • Create auditable snapshots of reproduction attempts with explicit assumptions and blockers
Who it's for
  • Researchers verifying published AI results
  • ML engineers reproducing baseline models
  • Auditors checking repository reproducibility
  • Teams needing standardized, auditable reproduction records
  • Agents coordinating multi-stage repository workflows

ai-paper-reproduction FAQ

When should I use this skill vs. a sub-skill like repo-intake-and-plan or minimal-run-and-audit?

Use this skill for end-to-end reproduction workflows that span multiple stages (intake + setup + execution + reporting). Use sub-skills only when the user explicitly asks for a single narrow phase that a sub-skill already covers cleanly.

What if the repository requires code changes to run?

The skill prefers no code changes. When a change is unavoidable, it tries the safest adjustments first: command-line arguments, environment variables, path fixes, and dependency version or dependency file fixes. High-risk changes (model architecture, loss functions, core logic) are avoided by default and escalated for explicit human review. All patches are recorded in PATCHES.md with justification.

Does this skill run full training by default?

No. The skill prefers documented inference, then evaluation, then training startup verification. Full training runs only when you explicitly ask for it. When a training command is the selected target, the skill captures startup evidence and then pauses for human confirmation before any broader training continues.

What does the repro_outputs/ bundle contain?

SUMMARY.md (high-value summary), COMMANDS.md (copyable commands), LOG.md (process evidence and assumptions), status.json (machine-readable state), and PATCHES.md (only if patches were applied). All outputs follow stable templates for auditability.

What if the README leaves a critical gap for reproduction?

The skill uses paper-context-resolver only for narrow reproduction-critical gaps that block the current target. It does not perform broad research assistance or paper summarization outside repository-grounded reproduction.

Full instructions (SKILL.md)

Source of truth, from lllllllama/ai-paper-reproduction-skill.


name: ai-paper-reproduction
description: Main orchestrator for README-first AI repo reproduction. Use when the user wants an end-to-end, minimal-trustworthy reproduction flow that reads the repository first, selects the smallest documented inference or evaluation target, coordinates intake, setup, trusted execution, optional trusted training, optional repository analysis, and optional paper-gap resolution, enforces conservative patch rules, records evidence, assumptions, deviations, and human decision points, and writes the standardized repro_outputs/ bundle. Do not use for paper summary, generic environment setup, isolated repo scanning, standalone command execution, silent protocol changes, or broad research assistance outside repository-grounded reproduction.

ai-paper-reproduction

Use when

  • The user wants the agent to reproduce an AI paper repository.
  • The target is a code repository with a README, scripts, configs, or documented commands.
  • The goal is a minimal trustworthy run, not unlimited experimentation.
  • The user needs standardized outputs that another human or model can audit quickly.
  • The task spans more than one stage, such as intake plus setup, or setup plus execution plus reporting.

Do not use when

  • The task is a general literature review or paper summary.
  • The task is to design a new model, benchmark suite, or training pipeline from scratch.
  • The repository is not centered on AI or does not expose a documented reproduction path.
  • The user primarily wants a deep code refactor rather than README-first reproduction.
  • The user is explicitly asking for only one narrow phase that a sub-skill already covers cleanly.
  • The user is explicitly authorizing exploratory branch-only experimentation instead of trusted reproduction.

Success criteria

  • README is treated as the primary source of reproduction intent.
  • A minimum trustworthy target is selected and justified.
  • Documented inference is preferred over evaluation, and evaluation is preferred over training.
  • Any repo edits remain conservative, explicit, and auditable.
  • Assumptions, protocol deviations, and human decision points are surfaced rather than hidden.
  • repro_outputs/ is generated with consistent structure and stable machine-readable fields.
  • Final user-facing explanation is short and follows the user's language when practical.

Interaction and usability policy

  • Keep the workflow simple enough for a new user to understand quickly.
  • Prefer short, concrete plans over exhaustive research.
  • Expose commands, assumptions, blockers, and evidence.
  • Avoid turning the skill into an opaque automation layer.
  • Preserve a low learning cost for both humans and downstream agents.

Language policy

  • Human-readable Markdown outputs should follow the user's language when it is clear.
  • If the user's language is unclear, default to concise English.
  • Machine-readable fields, filenames, keys, and enum values stay in stable English.
  • Paths, package names, CLI commands, config keys, and code identifiers remain unchanged.

See references/language-policy.md.

Reproduction policy

Core priority order:

  1. documented inference
  2. documented evaluation
  3. documented training startup or partial verification
  4. full training only when the user explicitly asks later
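
The priority order above can be read as a simple ranking over the commands a README documents. The sketch below is illustrative only: `DocumentedCommand`, `select_target`, and the kind labels are hypothetical names invented for this example, not part of the skill.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Lower rank = preferred. Mirrors the core priority order above.
# The label strings are assumptions made for this sketch.
PRIORITY = {
    "inference": 0,
    "evaluation": 1,
    "training_startup": 2,
    "full_training": 3,
}


@dataclass(frozen=True)
class DocumentedCommand:
    kind: str      # one of the PRIORITY keys
    command: str   # command text as it appears in the README


def select_target(
    commands: Iterable[DocumentedCommand],
    allow_full_training: bool = False,
) -> Optional[DocumentedCommand]:
    """Return the highest-priority documented command, or None if none qualifies.

    Full training is only a candidate when the caller explicitly opts in.
    """
    candidates = [
        c for c in commands
        if c.kind in PRIORITY
        and (c.kind != "full_training" or allow_full_training)
    ]
    return min(candidates, key=lambda c: PRIORITY[c.kind], default=None)
```

Returning `None` instead of falling back to full training keeps the "record a blocker rather than improvise" behavior explicit at the call site.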

Rules:

  • README-first: use repository files to clarify, not casually override, the README.
  • Aim for minimal trustworthy reproduction rather than maximum task coverage.
  • Treat smoke tests, startup verification, and early-step checks as valid training evidence when full training is not appropriate.
  • In trusted reproduction, a documented training command should first be checked through startup verification or a short monitoring window, then paused for explicit human confirmation before broader training continues.
  • In explicitly authorized explore-lane execution, the training record can continue without the trusted-lane confirmation pause, but it must stay isolated from trusted conclusions.
  • Record unresolved gaps rather than fabricating confidence.
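
One possible way to implement the "short monitoring window" for a training command is to launch it, wait a bounded time, and stop it if it is still running. This is a minimal sketch under assumptions: `verify_startup` and the state strings are hypothetical, and a process that survives the window is only weak startup evidence, not proof that training is correct.

```python
import subprocess
import sys


def verify_startup(argv, window_seconds=30.0):
    """Run a command for at most `window_seconds` and report what happened.

    Returns a dict with a hypothetical `state` label and the return code
    (None when the process was still running and had to be stopped).
    """
    proc = subprocess.Popen(
        argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    try:
        returncode = proc.wait(timeout=window_seconds)
    except subprocess.TimeoutExpired:
        # Still running after the window: stop it and hand control
        # back to the human before any broader training continues.
        proc.terminate()
        try:
            proc.wait(timeout=10)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()
        return {"state": "paused_for_confirmation", "returncode": None}
    state = "finished_in_window" if returncode == 0 else "failed_at_startup"
    return {"state": state, "returncode": returncode}


if __name__ == "__main__":
    # Stand-in for a documented training command.
    demo = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(verify_startup(demo, window_seconds=1.0))
```

A real run would capture stdout and stderr to a log file rather than discarding them, so the startup evidence can be cited later.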

Patch policy

  • Prefer no code changes.
  • Prefer safer adjustments first:
    • command-line arguments
    • environment variables
    • path fixes
    • dependency version fixes
    • dependency file fixes such as requirements.txt or environment.yml
  • Avoid changing:
    • model architecture
    • core inference semantics
    • core training logic
    • loss functions
    • experiment meaning
  • If repository files must change:
    • create a patch branch first using repro/YYYY-MM-DD-short-task
    • apply low-risk changes before medium-risk changes
    • avoid high-risk changes by default
    • commit only verified groups of changes
    • keep verified patch commits sparse, usually 0-2
    • use commit messages in the form repro: <scope> for documented <command>

See references/patch-policy.md.
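
The branch and commit naming conventions in the patch policy are mechanical enough to sketch. The helper names, the change-kind labels, and the "unclassified" tier below are assumptions made for illustration; the authoritative rules live in references/patch-policy.md.

```python
import re
from datetime import date

# Change kinds named in the patch policy; the label strings are hypothetical.
LOW_RISK = {
    "cli_argument", "environment_variable", "path_fix",
    "dependency_version", "dependency_file",
}
HIGH_RISK = {
    "model_architecture", "inference_semantics", "training_logic",
    "loss_function", "experiment_meaning",
}


def patch_branch_name(task, today=None):
    """Build a branch name of the form repro/YYYY-MM-DD-short-task."""
    today = today or date.today()
    slug = re.sub(r"[^a-z0-9]+", "-", task.lower()).strip("-")
    return f"repro/{today.isoformat()}-{slug}"


def commit_message(scope, command):
    """Build a message of the form 'repro: <scope> for documented <command>'."""
    return f"repro: {scope} for documented {command}"


def risk_tier(change_kind):
    """Classify a change; anything not listed is left for a human to judge."""
    if change_kind in LOW_RISK:
        return "low"
    if change_kind in HIGH_RISK:
        return "high"
    return "unclassified"
```

For example, `patch_branch_name("Fix data path", date(2024, 5, 1))` yields `repro/2024-05-01-fix-data-path`.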

Research safety boundary

  • Preserve experiment meaning over convenience.
  • Do not silently change dataset, split, checkpoint, preprocessing, metric, loss, or model semantics.
  • Distinguish direct evidence from inference and from user-approved decisions.
  • Prefer a recorded blocker over an unrecorded workaround.
  • Escalate for explicit human review before any change that could alter scientific meaning or reported conclusions.

See references/research-safety-principles.md.
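
A guard against silent protocol changes could compare the documented setup with what is about to run. This sketch assumes both are available as flat dictionaries, which is a simplification; `protocol_deviations` and the dictionary layout are hypothetical.

```python
# Fields the safety boundary says must not change silently.
PROTECTED_FIELDS = (
    "dataset", "split", "checkpoint", "preprocessing",
    "metric", "loss", "model",
)


def protocol_deviations(documented, actual):
    """List every protected field whose value differs from the documented one."""
    deviations = []
    for field in PROTECTED_FIELDS:
        if documented.get(field) != actual.get(field):
            deviations.append({
                "field": field,
                "documented": documented.get(field),
                "actual": actual.get(field),
            })
    return deviations


def needs_human_review(documented, actual):
    """True when any protected field would change without being recorded."""
    return bool(protocol_deviations(documented, actual))
```

Returning the full list of deviations, rather than a bare boolean, gives the log something concrete to record when the run is stopped for review.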

Workflow

  1. Read README and repo signals.
  2. Call repo-intake-and-plan to scan the repository and extract documented commands.
  3. Select the smallest trustworthy reproduction target.
  4. Call env-and-assets-bootstrap to prepare environment assumptions and asset paths.
  5. Call analyze-project only when repo structure, insertion points, or suspicious implementation patterns need a read-only pass before continuing.
  6. Run a conservative smoke check or documented inference or evaluation command with minimal-run-and-audit.
  7. If the selected trustworthy target is documented training startup, short-run verification, or resume, hand execution to run-train instead of minimal-run-and-audit.
  8. When training is selected inside trusted reproduction, let run-train capture the startup evidence first, then surface a human review checkpoint before any fuller training claim.
  9. Stop for human review if protocol meaning, model semantics, or result interpretation would otherwise be changed implicitly.
  10. Use paper-context-resolver only if README and repo files leave a narrow reproduction-critical gap that blocks the current target.
  11. Never auto-route into explore-code or explore-run; exploration requires explicit user authorization.
  12. Write the standardized outputs with evidence, assumptions, deviations, and next safe action.
  13. Give the user a short final note in the user's language.
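
The routing decisions in this workflow can be summarized in a few lines. The sketch below only models which sub-skill names appear in a plan; the function names and target-kind labels are hypothetical, and the list order follows the numbered steps as written rather than claiming when each helper is actually invoked.

```python
# Target kinds handed to run-train (step 7); labels are assumptions.
TRAINING_KINDS = {"training_startup", "short_run_verification", "resume"}
# Target kinds executed by minimal-run-and-audit (step 6).
AUDIT_KINDS = {"smoke_check", "inference", "evaluation"}
# Never auto-routed (step 11).
EXPLORE_SKILLS = {"explore-code", "explore-run"}


def route_executor(target_kind):
    """Pick the sub-skill that executes the selected target."""
    if target_kind in TRAINING_KINDS:
        return "run-train"
    if target_kind in AUDIT_KINDS:
        return "minimal-run-and-audit"
    raise ValueError(f"no trusted route for target kind: {target_kind!r}")


def plan_stages(target_kind, needs_analysis=False, has_blocking_gap=False):
    """Return the sub-skills a trusted run would involve, in listed order."""
    stages = ["repo-intake-and-plan", "env-and-assets-bootstrap"]
    if needs_analysis:
        stages.append("analyze-project")  # read-only pass, only when needed
    stages.append(route_executor(target_kind))
    if has_blocking_gap:
        stages.append("paper-context-resolver")  # narrow gaps only
    # Exploration requires explicit user authorization, so it never appears.
    assert not EXPLORE_SKILLS.intersection(stages)
    return stages
```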

Required outputs

Always target:

repro_outputs/
  SUMMARY.md
  COMMANDS.md
  LOG.md
  status.json
  PATCHES.md   # only if patches were applied

Use the templates under assets/ and the field rules in references/output-spec.md.
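
To make the bundle layout concrete, here is a rough sketch of a writer that creates the files listed above. It does not use the real templates under assets/, and the `status` keys in the demo are invented; the actual field rules are in references/output-spec.md.

```python
import json
import tempfile
from pathlib import Path


def write_bundle(root, summary, commands, log, status, patches=None):
    """Write the repro_outputs/ bundle under `root` and return the file names.

    PATCHES.md is written only when patch notes are supplied.
    """
    out = Path(root) / "repro_outputs"
    out.mkdir(parents=True, exist_ok=True)
    (out / "SUMMARY.md").write_text(summary, encoding="utf-8")
    (out / "COMMANDS.md").write_text("\n".join(commands) + "\n", encoding="utf-8")
    (out / "LOG.md").write_text(log, encoding="utf-8")
    # Sorted keys keep the machine-readable file stable across runs.
    (out / "status.json").write_text(
        json.dumps(status, indent=2, sort_keys=True) + "\n", encoding="utf-8"
    )
    if patches:
        (out / "PATCHES.md").write_text(patches, encoding="utf-8")
    return sorted(p.name for p in out.iterdir())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # The status keys here are placeholders, not the real schema.
        print(write_bundle(
            tmp,
            summary="Inference ran as documented.\n",
            commands=["python infer.py --config configs/demo.yaml"],
            log="- ran documented inference command\n",
            status={"target": "inference", "result": "success"},
        ))
```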

Reporting policy

  • Put the shortest high-value summary in SUMMARY.md.
  • Put copyable commands in COMMANDS.md.
  • Put process evidence, assumptions, failures, and decisions in LOG.md.
  • Put durable machine-readable state in status.json.
  • Put branch, commit, validation, and README-fidelity impact in PATCHES.md when needed.
  • Distinguish verified facts from inferred guesses.
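
One lightweight way to keep verified facts apart from inferred guesses in LOG.md is to tag each entry with its evidence kind. The tag names and the line format below are assumptions for illustration, not the skill's template.

```python
# Hypothetical evidence tags: observed directly, inferred, or decided by the user.
EVIDENCE_KINDS = ("verified", "inferred", "user_approved")


def render_log_entries(entries):
    """Render (kind, text) pairs as tagged Markdown list items."""
    lines = []
    for kind, text in entries:
        if kind not in EVIDENCE_KINDS:
            raise ValueError(f"unknown evidence kind: {kind!r}")
        lines.append(f"- [{kind}] {text}")
    return "\n".join(lines) + "\n"
```

Rejecting unknown tags, instead of silently accepting them, keeps an untagged claim from slipping into the log as if it were verified.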

Maintainability notes

  • Keep this skill narrow: README-first AI repo reproduction only.
  • Push specialized logic into sub-skills or helper scripts.
  • Prefer stable templates and simple schemas over ad hoc prose.
  • Keep machine-readable outputs backward compatible when possible.
  • Add new evidence sources only when they improve auditability without raising learning cost.
  • Treat repo-intake-and-plan and paper-context-resolver as narrow helpers, not primary public entrypoints.