Audit score 70

firecrawl-knowledge-base

firecrawl/firecrawl-workflows

Build organized, LLM-ready knowledge bases from web content using Firecrawl.

What is firecrawl-knowledge-base?

Scrapes and structures web content into markdown files optimized for RAG, fine-tuning, documentation mirrors, or local reference. Use when you need to convert URLs or topics into organized, machine-readable corpora with preserved code examples and metadata.

  • Scrapes documentation sites and web pages into clean markdown files
  • Generates RAG-ready chunks with manifest.json for semantic search
  • Creates training datasets in JSONL format with metadata
  • Organizes output by hostname and path following Firecrawl conventions
  • Preserves code examples, tables, and formatting during extraction
  • Supports parallel scraping across multiple sources or sections

How to install firecrawl-knowledge-base

npx skills add https://github.com/firecrawl/firecrawl-workflows --skill firecrawl-knowledge-base
Prerequisites
  • Firecrawl API key (set FIRECRAWL_API_KEY environment variable)
  • Source URL(s) or topic to scrape
  • Target output directory for knowledge base files
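As a minimal sketch of the first prerequisite, a script that drives the skill could check the key before starting. The helper name `get_firecrawl_api_key` is hypothetical; only the `FIRECRAWL_API_KEY` variable name comes from this page.

```python
import os


def get_firecrawl_api_key(env=None):
    """Return the Firecrawl API key from the environment, or raise a clear error.

    `env` defaults to os.environ; passing a dict makes the check easy to test.
    """
    env = os.environ if env is None else env
    key = env.get("FIRECRAWL_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "FIRECRAWL_API_KEY is not set; export it before running the skill."
        )
    return key
```

Failing early with a named variable tends to be easier to debug than a rejected request partway through a long scrape.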

How to use firecrawl-knowledge-base

  1. Provide the source URL, topic, or documentation site to scrape
  2. Specify your goal: reference docs, RAG chunks, training data, or docs mirror
  3. Choose depth level: quick (key pages), thorough (full coverage), or exhaustive (all linked content)
  4. Let the skill infer the structure, and answer its 1-3 clarifying questions if it asks
  5. Review the generated markdown files, chunks, and metadata in the .firecrawl/ directory
  6. Use output files directly in RAG systems, training pipelines, or as local documentation

Use cases

Good for
  • Mirror official documentation locally with table of contents and source tracking
  • Build RAG datasets from technical docs, tutorials, and community discussions
  • Create fine-tuning datasets from curated web content with training metadata
  • Generate topic-specific corpora from search results across multiple sources
  • Extract and organize API documentation into structured reference files
Who it's for
  • AI engineers building RAG systems
  • ML practitioners preparing training datasets
  • Documentation teams creating offline mirrors
  • Knowledge workers building local reference libraries
  • Developers integrating web content into agent workflows

firecrawl-knowledge-base FAQ

What output formats does this skill support?

Reference (markdown + sources.json), RAG (markdown + chunks + manifest.json), Training (JSONL + metadata), and Docs mirror (complete markdown with table of contents).
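One way to check a finished run against these formats is a small lookup of the named sidecar files per mode. This is a sketch, not part of the skill: the mode keys and the `missing_artifacts` helper are hypothetical, and only file names that this page states explicitly are listed (chunk files and the docs-mirror table of contents have no documented name, so they are left out).

```python
# Sidecar files named on this page, keyed by an illustrative mode label.
OUTPUT_MODES = {
    "reference": ["sources.json"],
    "rag": ["manifest.json"],
    "training": ["training-data.jsonl", "training-metadata.json"],
    "docs": [],  # the mirror's table-of-contents file name is not documented
}


def missing_artifacts(mode, present):
    """Return the expected sidecar files for `mode` that are not in `present`."""
    expected = OUTPUT_MODES[mode]  # raises KeyError for an unknown mode
    return [name for name in expected if name not in present]
```

For example, `missing_artifacts("training", {"training-data.jsonl"})` reports that `training-metadata.json` is still missing.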

How does it handle code examples and tables?

The skill preserves code formatting and table structure during extraction, removing only boilerplate navigation while keeping content-critical elements.

Can I scrape multiple sources in parallel?

Yes. The skill supports parallel work using sub-agents or task runners, with one worker per docs section, source type, or scraping phase.

What file structure does it create?

Files are organized under .firecrawl/<hostname>/<path>/ with index.md files, plus manifest.json (RAG), training-data.jsonl (training), or sources.json (reference).

Do I need to provide exact URLs or can I use topics?

Both work: provide specific URLs for targeted scraping, or topics for broader search-based corpus collection.

Full instructions (SKILL.md)

Source of truth, from firecrawl/firecrawl-workflows.


name: firecrawl-knowledge-base
description: Build a knowledge base from web content with Firecrawl. Use for local reference docs, RAG-ready chunks, fine-tuning datasets, documentation mirrors, topic corpora, or LLM-ready markdown organized from web sources.
license: ISC
metadata:
  author: firecrawl
  version: "0.1.0"
  homepage: https://www.firecrawl.dev
  source: https://github.com/firecrawl/firecrawl-workflows
inputs:
  - name: FIRECRAWL_API_KEY
    description: Firecrawl API key for hosted Firecrawl requests.
    required: true

Firecrawl Knowledge Base

Use this to turn URLs or topics into organized LLM-ready content.

Onboarding Interview

Infer the source, goal, depth, and output location from context. If the source and goal are clear, proceed immediately.

Ask 1-3 concise questions, and only if blocked: for example the source URL or topic, whether the output is reference, RAG, training, or docs, or the training format if training is requested.

Firecrawl Collection Plan

Use Firecrawl map for documentation sites, search for topic-based corpora, scrape pages into markdown, and preserve code examples and tables.
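The choice between map and search can be sketched as a simple test on the source string. The function below is a hypothetical illustration of that decision only; it does not call Firecrawl, and the skill may use richer signals than this.

```python
from urllib.parse import urlparse


def pick_collection_step(source):
    """Return "map" for an http(s) URL and "search" for a free-text topic."""
    parsed = urlparse(source.strip())
    # Treat the source as a site only if it has a web scheme and a host.
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "map"
    return "search"
```

Either way, the pages found are then scraped into markdown, as described above.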

For files, follow the Firecrawl download-style convention:

.firecrawl/
  <hostname>/
    <path>/
      index.md
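A minimal sketch of how a URL could map onto this layout follows. The helper `local_path_for` is hypothetical, and details the convention does not spell out here (query strings, file extensions, fragments) are simply ignored.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def local_path_for(url, root=".firecrawl"):
    """Map a page URL to <root>/<hostname>/<path>/index.md."""
    parsed = urlparse(url)
    # Drop empty, "." and ".." segments so the result stays under the root.
    segments = [s for s in parsed.path.split("/") if s and s not in (".", "..")]
    host = parsed.hostname or "unknown-host"
    return PurePosixPath(root, host, *segments, "index.md")
```

With this sketch, `https://docs.firecrawl.dev/features/scrape/` maps to `.firecrawl/docs.firecrawl.dev/features/scrape/index.md`.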

Parallel Work

If appropriate, use sub-agents or equivalent parallel task runners:

  • by docs section: one section per researcher
  • by source type: official docs, tutorials, community discussions, and references
  • by phase: source scraping, chunk generation, and manifest generation
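Outside an agent runtime, the same split can be approximated with a thread pool. This is a sketch under that assumption: `collect_section` is a hypothetical stand-in for one researcher's work and does no real scraping.

```python
from concurrent.futures import ThreadPoolExecutor


def collect_section(section):
    """Stand-in for one researcher; a real run would scrape the section here."""
    return {"section": section, "status": "collected"}


def collect_in_parallel(sections, max_workers=4):
    """Run one worker per section and return results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with sections.
        return list(pool.map(collect_section, sections))
```

Threads suit this workload because scraping is mostly waiting on the network; the same shape works with one task per source type or per phase.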

Output Modes

  • Reference: markdown files, index.md, and sources.json.
  • RAG: markdown files plus chunk files and manifest.json.
  • Training: scraped source files plus training-data.jsonl and training-metadata.json.
  • Docs mirror: complete markdown mirror with a table of contents.
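For the RAG mode, chunking and manifest generation might look like the sketch below. The chunking rule (pack paragraphs up to a character budget) and the manifest fields are assumptions made for illustration; the skill's actual chunk format and manifest.json schema are not specified here.

```python
def chunk_markdown(text, max_chars=1200):
    """Pack blank-line-separated paragraphs into chunks of about max_chars.

    A single paragraph longer than max_chars is kept whole rather than cut.
    """
    chunks, current = [], ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            chunks.append(current)  # budget exceeded: close the current chunk
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def build_manifest(source_url, chunks):
    """Describe the chunks in a dict (hypothetical schema) ready for json.dump."""
    return {
        "source": source_url,
        "chunk_count": len(chunks),
        "chunks": [{"id": i, "chars": len(c)} for i, c in enumerate(chunks)],
    }
```

Splitting on blank lines can cut through a fenced code block that contains blank lines, so a production chunker would need to treat fences as atomic to keep code examples intact.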

Final Deliverable

# Knowledge Base: [Source]

## Summary
[What was collected and why]

## Output Structure
[Files/directories created]

## Coverage
[Sections, source types, counts]

## Usage Notes
[How to use in RAG, docs, training, or agent context]

## Sources
[URLs collected]

## Rerun Inputs
workflow: firecrawl-knowledge-base
source: [url/topic]
goal: [reference/rag/train/docs]
depth: [quick/thorough/exhaustive]
output_dir: [.firecrawl/]
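
The Rerun Inputs lines are plain `key: value` pairs, so a later run could read them back with a few lines of code. This parser is a hypothetical sketch; it splits on the first colon only, so URL values survive intact.

```python
def parse_rerun_inputs(block):
    """Parse 'key: value' lines into a dict, splitting on the first colon."""
    inputs = {}
    for line in block.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():
            inputs[key.strip()] = value.strip()
    return inputs
```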

Quality Bar

  • Preserve code examples and formatting.
  • Remove boilerplate navigation where possible.
  • Include source URLs in frontmatter or metadata.
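
The last point can be sketched as a helper that prepends YAML frontmatter to each scraped page. The field names `source_url` and `title` are assumptions, not a documented schema; values are quoted with `json.dumps`, whose output is also a valid YAML double-quoted string for ordinary text.

```python
import json


def with_frontmatter(markdown, source_url, title=None):
    """Prepend a frontmatter block that records where the page came from."""
    lines = ["---", f"source_url: {json.dumps(source_url)}"]
    if title:
        lines.append(f"title: {json.dumps(title)}")
    lines.append("---")
    # Leave the body untouched apart from leading blank lines.
    return "\n".join(lines) + "\n\n" + markdown.lstrip("\n")
```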