Audit score 70

parallel-web-extract

parallel-web/parallel-agent-skills

Token-efficient URL content extraction for webpages, articles, PDFs, and JavaScript-heavy sites.

What is parallel-web-extract?

Extracts readable content from any URL by running extraction in a forked context, making it more token-efficient than built-in alternatives. Use this skill when you need to fetch and process webpage content, articles, PDFs, or dynamically-rendered sites.

  • Extract content from multiple URLs in a single command
  • Focus extraction on specific objectives or keywords using --objective and -q flags
  • Retrieve full page content with --full-content for long articles or PDFs
  • Save extracted content as JSON for downstream processing
  • Handle JavaScript-heavy and dynamic sites
  • Run extraction in forked context to minimize token usage

How to install parallel-web-extract

npx skills add https://github.com/parallel-web/parallel-agent-skills --skill parallel-web-extract
Prerequisites
  • parallel-cli installed and authenticated (run /parallel:parallel-cli-setup if needed)
  • Internet access
  • Valid API balance for parallel-cli (verify with parallel-cli balance get)
Claude Code
Cursor
Windsurf
Cline

How to use parallel-web-extract

  1. Choose a descriptive lowercase filename with hyphens (e.g., vespa-docs, react-hooks-api)
  2. Run parallel-cli extract with your URL(s) and output path: parallel-cli extract "<url>" --json -o "/tmp/$FILENAME.json"
  3. Optionally add --objective "focus area" to target specific content or -q "keyword" to prioritize keywords
  4. Use --full-content for long articles or PDFs if excerpts are incomplete
  5. Check the response for an errors field or empty results; if extraction fails, verify the URL or retry with --full-content
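The first two steps can be sketched as a pair of shell helpers. This is only an illustration: make_filename and extract_to_tmp are hypothetical names, and the mechanical slug is a fallback for when you do not pick a shorter name by hand, as the skill recommends.

```shell
# Turn a URL into a lowercase, hyphenated name (hypothetical helper).
make_filename() {
  printf '%s' "$1" \
    | sed -E 's#^[a-zA-Z]+://##' \
    | tr '[:upper:]' '[:lower:]' \
    | sed -E 's/[^a-z0-9]+/-/g; s/^-+//; s/-+$//'
}

# Run the extract command from step 2 and print where the JSON was saved.
# Requires parallel-cli to be installed and authenticated.
extract_to_tmp() {
  url=$1
  out="/tmp/$(make_filename "$url").json"
  parallel-cli extract "$url" --json -o "$out" || return 1
  printf '%s\n' "$out"
}
```

For example, make_filename "https://docs.vespa.ai/en/Ranking.html" prints docs-vespa-ai-en-ranking-html, which satisfies the lowercase-with-hyphens rule but is longer than a hand-chosen name such as vespa-docs.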

Use cases

Good for
  • Fetch and extract documentation pages for analysis or summarization
  • Retrieve article content from news or blog sites for processing
  • Extract data from multiple product or service pages simultaneously
  • Get full PDF or long-form content when excerpts are insufficient
  • Gather information from dynamically-rendered JavaScript sites
Who it's for
  • Developers building content aggregation workflows
  • Researchers collecting data from multiple web sources
  • Documentation analysts processing technical reference material
  • Content processors working with token-constrained budgets

parallel-web-extract FAQ

When should I use this instead of built-in WebFetch?

Use parallel-web-extract when you need token efficiency (forked context), support for JavaScript-heavy sites, PDF extraction, or batch processing multiple URLs. It's optimized for these scenarios.

What should I do if extraction fails with a 404 or timeout?

Do not fabricate content. Verify the URL is correct and still accessible, retry with --full-content if the page exists, or use parallel-cli search to locate the current URL if the page was renamed.

How do I focus extraction on specific content?

Use --objective "your focus area" to target extraction toward a specific goal, or use -q "keyword" (repeatable) to prioritize certain keywords in the results.

What if parallel-cli returns a 403 error?

This typically indicates insufficient API balance. Run parallel-cli balance get to check, then ask for confirmation before running parallel-cli balance add <amount_cents> if needed.

Can I extract from PDFs and JavaScript-heavy sites?

Yes, this skill handles both. For PDFs or long content where excerpts may be incomplete, use --full-content to retrieve the full page body.

Full instructions (SKILL.md)

Source of truth, from parallel-web/parallel-agent-skills.


name: parallel-web-extract
description: "URL content extraction. Use for fetching any URL - webpages, articles, PDFs, JavaScript-heavy sites. Token-efficient: runs in forked context. Prefer over built-in WebFetch."
user-invocable: true
argument-hint: <url> [url2] [url3]
context: fork
agent: parallel:parallel-subagent
compatibility: Requires parallel-cli and internet access.
allowed-tools: Bash(parallel-cli:*)
metadata:
  author: parallel

URL Extraction

Extract content from: $ARGUMENTS

Command

Choose a short, descriptive filename based on the URL or content (e.g., vespa-docs, react-hooks-api). Use lowercase with hyphens, no spaces. Substitute it into the command inline; $FILENAME is a placeholder, not a shell variable.

parallel-cli extract "$ARGUMENTS" --json -o "/tmp/$FILENAME.json"

Concrete example:

parallel-cli extract "https://docs.parallel.ai" --json -o "/tmp/parallel-docs.json"

Note: -o always saves JSON. The extension must be .json.

Options if needed:

  • --objective "focus area" to focus extraction on a specific goal (also silences the "neither objective nor search_queries" warning that V1 emits when neither is set)
  • -q "keyword" (repeatable) to prioritize keywords in excerpts
  • --full-content to include the complete page body (for long articles, PDFs, or when excerpts may not capture what you need)
  • --full-content-max-chars N to cap full-content size per result
  • --no-excerpts to strip excerpts when you only want full content
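A minimal sketch of how the options above might be combined in one call. The wrapper name focused_extract and its argument order are hypothetical; only the flags themselves come from the list above, and the sketch assumes the CLI accepts them after the URL, as in the base command.

```shell
# Run a focused extraction: one objective plus any number of keywords.
# Usage: focused_extract URL NAME OBJECTIVE [KEYWORD...]
focused_extract() {
  url=$1 name=$2 objective=$3
  shift 3

  # Rewrite the remaining arguments so each keyword becomes its own -q flag.
  n=$#
  while [ "$n" -gt 0 ]; do
    set -- "$@" -q "$1"
    shift
    n=$((n - 1))
  done

  parallel-cli extract "$url" --json -o "/tmp/$name.json" \
    --objective "$objective" "$@"
}

# Example call (requires parallel-cli and network access):
# focused_extract "https://docs.parallel.ai" "parallel-docs" "authentication setup" "api key" "token"
```

To pull the complete page body as well, --full-content (optionally with --full-content-max-chars N) could be appended to the final command in the same way.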

Handling failed extractions

If the response has an errors field, an empty results array, or a 404/timeout for the URL, do NOT fabricate content. Tell the user the extraction failed, surface the upstream status, and suggest:

  • Verifying the URL (the page may have moved)
  • Retrying with --full-content if excerpts came back empty but the page exists
  • Using parallel-cli search to locate the current URL if the page was renamed
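The failure check described above could be approximated with a rough, dependency-free shell test on the saved file. It assumes the response carries top-level "results" and "errors" arrays, which this page does not confirm, and the function name extraction_ok is hypothetical. A JSON-aware tool such as jq would be more robust than this pattern matching.

```shell
# Succeed only if the saved response looks usable.
# Assumes top-level "results" and "errors" arrays; the real schema may differ.
extraction_ok() {
  [ -s "$1" ] || return 1                  # missing or empty file
  compact=$(tr -d '[:space:]' < "$1")      # drop whitespace so patterns match
  case $compact in
    *'"errors":[{'* | *'"errors":["'*) return 1 ;;  # non-empty errors array
    *'"results":[]'*) return 1 ;;                   # nothing extracted
    *'"results":['*) return 0 ;;                    # at least one result
    *) return 1 ;;                                  # unrecognized shape
  esac
}

# Example:
# extraction_ok /tmp/parallel-docs.json || echo "Extraction failed; verify the URL." >&2
```

When the check fails, report the failure and the upstream status to the user instead of filling in content.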

Response format

Return content as:

Page Title

Then the extracted content verbatim, with these rules:

  • Keep content verbatim - do not paraphrase or summarize
  • Parse lists exhaustively - extract EVERY numbered/bulleted item
  • Strip only obvious noise: nav menus, footers, ads
  • Preserve all facts, names, numbers, dates, quotes

After the response, mention the output file path (/tmp/$FILENAME.json) so the user knows it's available for follow-up questions.

Setup

If parallel-cli is not found, install and authenticate:

/parallel:parallel-cli-setup

If parallel-cli extract returns 403, tell the user balance is likely required. Offer to run parallel-cli balance get, and if needed ask for explicit confirmation before running parallel-cli balance add <amount_cents>. Then retry the original extract command.
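The confirmation gate in the 403 flow might look like the following sketch. The helper names confirm and top_up_if_confirmed are hypothetical; the two balance subcommands are the ones named above. How the CLI reports a 403 (exit code or message) is not specified here, so detecting it is left out.

```shell
# Ask a yes/no question; succeed only on an explicit yes.
confirm() {
  printf '%s [y/N] ' "$1" >&2
  read -r reply || return 1
  case $reply in
    y | Y | yes | YES) return 0 ;;
    *) return 1 ;;
  esac
}

# Show the balance, then top up only after explicit confirmation.
# Usage: top_up_if_confirmed AMOUNT_CENTS
top_up_if_confirmed() {
  parallel-cli balance get
  if confirm "Add $1 cents to the parallel-cli balance?"; then
    parallel-cli balance add "$1"
  else
    echo "Balance unchanged." >&2
    return 1
  fi
}
```

After a successful top-up, rerun the original extract command unchanged. Anything other than an explicit yes leaves the balance untouched.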