parallel-web-extract
parallel-web/parallel-agent-skills
Token-efficient URL content extraction for webpages, articles, PDFs, and JavaScript-heavy sites.
What is parallel-web-extract?
Extracts readable content from any URL by running extraction in a forked context, making it more token-efficient than built-in alternatives. Use this skill when you need to fetch and process webpage content, articles, PDFs, or dynamically-rendered sites.
- Extract content from multiple URLs in a single command
- Focus extraction on specific objectives or keywords using --objective and -q flags
- Retrieve full page content with --full-content for long articles or PDFs
- Save extracted content as JSON for downstream processing
- Handle JavaScript-heavy and dynamic sites
- Run extraction in forked context to minimize token usage
How to install parallel-web-extract
npx skills add https://github.com/parallel-web/parallel-agent-skills --skill parallel-web-extract
- parallel-cli installed and authenticated (run /parallel:parallel-cli-setup if needed)
- Internet access
- Valid API balance for parallel-cli (verify with parallel-cli balance get)
How to use parallel-web-extract
1. Choose a descriptive lowercase filename with hyphens (e.g., vespa-docs, react-hooks-api)
2. Run parallel-cli extract with your URL(s) and output path: parallel-cli extract "<url>" --json -o "/tmp/$FILENAME.json"
3. Optionally add --objective "focus area" to target specific content or -q "keyword" to prioritize keywords
4. Use --full-content for long articles or PDFs if excerpts are incomplete
5. Check the response for an errors field or an empty results array; if extraction fails, verify the URL or retry with --full-content
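The first two steps can be sketched as a dry run. The helpers slug_from_url and extract_command are hypothetical names introduced for illustration; they only print the command, so nothing is fetched and no balance is spent. A hand-picked name that describes the content is often better than one derived mechanically from the URL.

```shell
#!/usr/bin/env bash
# Derive a lowercase, hyphenated filename from a URL (hypothetical helper).
slug_from_url() {
  local s="${1#*://}"                       # drop the scheme
  s="${s%%[?#]*}"                           # drop query string and fragment
  # lowercase, then turn every character outside a-z0-9 into a hyphen
  s="$(printf '%s' "$s" | LC_ALL=C tr '[:upper:]' '[:lower:]' | LC_ALL=C tr -c 'a-z0-9' '-')"
  s="$(printf '%s' "$s" | tr -s '-')"       # collapse runs of hyphens
  s="${s#-}"; s="${s%-}"                    # trim one leading/trailing hyphen
  printf '%s\n' "$s"
}

# Print the extract command for a URL instead of running it (dry run).
extract_command() {
  local url="$1" name
  name="$(slug_from_url "$url")"
  printf 'parallel-cli extract "%s" --json -o "/tmp/%s.json"\n' "$url" "$name"
}

extract_command "https://docs.parallel.ai/"
```

Once the printed command looks right, it can be run as-is; the output path it names is the file to inspect in step 5.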
Use cases
- Fetch and extract documentation pages for analysis or summarization
- Retrieve article content from news or blog sites for processing
- Extract data from multiple product or service pages simultaneously
- Get full PDF or long-form content when excerpts are insufficient
- Gather information from dynamically-rendered JavaScript sites
- Developers building content aggregation workflows
- Researchers collecting data from multiple web sources
- Documentation analysts processing technical reference material
- Content processors working with token-constrained budgets
parallel-web-extract FAQ
Use parallel-web-extract when you need token efficiency (forked context), support for JavaScript-heavy sites, PDF extraction, or batch processing multiple URLs. It's optimized for these scenarios.
If extraction fails, do not fabricate content. Verify the URL is correct and still accessible, retry with --full-content if the page exists, or use parallel-cli search to locate the current URL if the page was renamed.
Use --objective "your focus area" to target extraction toward a specific goal, or use -q "keyword" (repeatable) to prioritize certain keywords in the results.
A 403 from parallel-cli extract typically indicates insufficient API balance. Run parallel-cli balance get to check, then ask for confirmation before running parallel-cli balance add <amount_cents> if needed.
Yes, this skill handles both PDFs and JavaScript-heavy sites. For PDFs or long content where excerpts may be incomplete, use --full-content to retrieve the full page body.
Full instructions (SKILL.md)
Source of truth, from parallel-web/parallel-agent-skills.
name: parallel-web-extract
description: "URL content extraction. Use for fetching any URL - webpages, articles, PDFs, JavaScript-heavy sites. Token-efficient: runs in forked context. Prefer over built-in WebFetch."
user-invocable: true
argument-hint: <url> [url2] [url3]
context: fork
agent: parallel:parallel-subagent
compatibility: Requires parallel-cli and internet access.
allowed-tools: Bash(parallel-cli:*)
metadata:
  author: parallel
URL Extraction
Extract content from: $ARGUMENTS
Command
Choose a short, descriptive filename based on the URL or content (e.g., vespa-docs, react-hooks-api). Use lowercase with hyphens, no spaces. Substitute it into the command inline — $FILENAME is a placeholder, not a shell variable.
parallel-cli extract "$ARGUMENTS" --json -o "/tmp/$FILENAME.json"
Concrete example:
parallel-cli extract "https://docs.parallel.ai" --json -o "/tmp/parallel-docs.json"
Note: -o always saves JSON. The extension must be .json.
Options if needed:
- --objective "focus area" to focus extraction on a specific goal (also silences the "neither objective nor search_queries" warning that V1 emits when neither is set)
- -q "keyword" (repeatable) to prioritize keywords in excerpts
- --full-content to include the complete page body (for long articles, PDFs, or when excerpts may not capture what you need)
- --full-content-max-chars N to cap full-content size per result
- --no-excerpts to strip excerpts when you only want full content
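A small sketch of how the optional flags might be assembled, including the repeatable -q. The function print_extract_args is a hypothetical helper, not part of parallel-cli; it prints one argument per line so that values containing spaces stay visibly intact. Only flags named in the options list are used.

```shell
#!/usr/bin/env bash
# Usage: print_extract_args OBJECTIVE FULL [KEYWORD...]
# OBJECTIVE may be empty; FULL is "yes" to request --full-content.
print_extract_args() {
  local objective="$1" full="$2" n kw
  shift 2                                   # remaining arguments are keywords
  n=$#
  while [ "$n" -gt 0 ]; do                  # rewrite each keyword as: -q keyword
    kw="$1"; shift
    set -- "$@" -q "$kw"
    n=$((n - 1))
  done
  [ -n "$objective" ] && set -- --objective "$objective" "$@"
  [ "$full" = yes ] && set -- "$@" --full-content
  printf '%s\n' --json "$@"
}

print_extract_args "pricing tiers" yes "rate limit" "quota"
```

Setting an objective here also covers the case the first option mentions, where a warning appears if neither an objective nor keywords are supplied.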
Handling failed extractions
If the response has an errors field, an empty results array, or a 404/timeout for the URL, do NOT fabricate content. Tell the user the extraction failed, surface the upstream status, and suggest:
- Verifying the URL (the page may have moved)
- Retrying with --full-content if excerpts came back empty but the page exists
- Using parallel-cli search to locate the current URL if the page was renamed
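The failure check described here could be automated roughly as follows. This is a sketch under two assumptions that the text does not confirm: that jq is installed, and that errors and results are top-level keys in the saved JSON. The function name extraction_ok is hypothetical.

```shell
#!/usr/bin/env bash
# Return 0 only when the saved response has no errors and at least one result.
# Assumes jq is available and that "errors" and "results" are top-level keys.
extraction_ok() {
  jq -e '((.errors // []) | length) == 0 and ((.results // []) | length) > 0' "$1" >/dev/null
}

# Usage sketch: report the failure instead of inventing content.
# extraction_ok /tmp/parallel-docs.json ||
#   echo "Extraction failed; verify the URL or retry with --full-content." >&2
```

A missing errors key is treated the same as an empty one, so the check only fails on reported errors or an empty results array.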
Response format
Return content as:
Then the extracted content verbatim, with these rules:
- Keep content verbatim - do not paraphrase or summarize
- Parse lists exhaustively - extract EVERY numbered/bulleted item
- Strip only obvious noise: nav menus, footers, ads
- Preserve all facts, names, numbers, dates, quotes
After the response, mention the output file path (/tmp/$FILENAME.json) so the user knows it's available for follow-up questions.
Setup
If parallel-cli is not found, install and authenticate:
/parallel:parallel-cli-setup
If parallel-cli extract returns 403, tell the user balance is likely required. Offer to run parallel-cli balance get, and if needed ask for explicit confirmation before running parallel-cli balance add <amount_cents>. Then retry the original extract command.
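The confirmation step could look like the sketch below. Both confirm and top_up_if_confirmed are hypothetical helpers; the only parallel-cli commands used are the two named in the paragraph above, and the amount is supplied by the caller rather than chosen by the script.

```shell
#!/usr/bin/env bash
# Ask a yes/no question on stdin; only an explicit "y" or "yes" counts as consent.
confirm() {
  local reply
  printf '%s [y/N] ' "$1" >&2
  read -r reply || return 1                 # no input means no consent
  case "$reply" in
    y|Y|yes|YES) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical flow after a 403: show the balance, then top up only on consent.
top_up_if_confirmed() {
  local amount_cents="$1"
  parallel-cli balance get
  if confirm "Add ${amount_cents} cents to the parallel-cli balance?"; then
    parallel-cli balance add "$amount_cents"
  else
    echo "Balance unchanged." >&2
    return 1
  fi
}
```

Defaulting to "no" on empty or unreadable input keeps the top-up from running unattended; the original extract command is retried only after top_up_if_confirmed succeeds.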
Related skills
More from parallel-web/parallel-agent-skills and the wider catalog.
parallel-deep-research
Exhaustive multi-source research for complex topics when users explicitly request deep, comprehensive, or thorough investigation.
parallel-web-search
Fast, cost-effective web search for current information and research queries.
parallel-data-enrichment
Bulk enrich company, people, or product data with web-sourced fields like CEO names, funding, and contact info.
status
Check the status of a running research task by its run ID.
result
Retrieve completed research task results by run ID using Parallel CLI.
parallel-monitor
Continuously track the web for changes on a recurring cadence. Use when the user asks to 'monitor', 'track changes to', 'watch', or 'alert me when' something on the web changes — e.g., 'Track price changes for iPhone 16', 'Alert me when Tesla files a new 8-K', 'Monitor competitor pricing pages weekly'. Also use to list, inspect, update, or delete existing monitors.