Skill

Official

Audit score 70

firecrawl-knowledge-base

firecrawl/firecrawl-workflows

Build organized, LLM-ready knowledge bases from web content using Firecrawl.

What is firecrawl-knowledge-base?

Scrapes and structures web content into markdown files optimized for RAG, fine-tuning, documentation mirrors, or local reference. Use when you need to convert URLs or topics into organized, machine-readable corpora with preserved code examples and metadata.

Scrapes documentation sites and web pages into clean markdown files
Generates RAG-ready chunks with manifest.json for semantic search
Creates training datasets in JSONL format with metadata
Organizes output by hostname and path following Firecrawl conventions
Preserves code examples, tables, and formatting during extraction
Supports parallel scraping across multiple sources or sections

How to install firecrawl-knowledge-base

npx skills add https://github.com/firecrawl/firecrawl-workflows --skill firecrawl-knowledge-base

Prerequisites

Firecrawl API key (set FIRECRAWL_API_KEY environment variable)
Source URL(s) or topic to scrape
Target output directory for knowledge base files
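
A minimal pre-flight check for these prerequisites could look like the sketch below. The function name `check_prerequisites` and its `env` parameter are illustrative and not part of the skill; only the `FIRECRAWL_API_KEY` variable name comes from the listing above.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def check_prerequisites(output_dir: str, env: Optional[Mapping[str, str]] = None) -> Path:
    """Fail early if the API key is missing, then make sure the output directory exists."""
    env = os.environ if env is None else env
    if not env.get("FIRECRAWL_API_KEY"):
        raise RuntimeError("FIRECRAWL_API_KEY is not set")
    target = Path(output_dir)
    # Create the target directory (and any parents) if it is not there yet.
    target.mkdir(parents=True, exist_ok=True)
    return target
```

Calling `check_prerequisites(".firecrawl")` before a run surfaces a missing key immediately instead of partway through a scrape.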

How to use firecrawl-knowledge-base

1. Provide the source URL, topic, or documentation site to scrape
2. Specify your goal: reference docs, RAG chunks, training data, or docs mirror
3. Choose a depth level: quick (key pages), thorough (full coverage), or exhaustive (all linked content)
4. Let the skill infer the structure, or answer 1-3 clarifying questions if needed
5. Review the generated markdown files, chunks, and metadata in the .firecrawl/ directory
6. Use the output files directly in RAG systems, training pipelines, or as local documentation

Use cases

Good for

Mirror official documentation locally with table of contents and source tracking
Build RAG datasets from technical docs, tutorials, and community discussions
Create fine-tuning datasets from curated web content with training metadata
Generate topic-specific corpora from search results across multiple sources
Extract and organize API documentation into structured reference files

Who it's for

AI engineers building RAG systems
ML practitioners preparing training datasets
Documentation teams creating offline mirrors
Knowledge workers building local reference libraries
Developers integrating web content into agent workflows

firecrawl-knowledge-base FAQ

What output formats does this skill support?

Reference (markdown + sources.json), RAG (markdown + chunks + manifest.json), Training (JSONL + metadata), and Docs mirror (complete markdown with table of contents).

How does it handle code examples and tables?

The skill preserves code formatting and table structure during extraction, removing only boilerplate navigation while keeping content-critical elements.

Can I scrape multiple sources in parallel?

Yes, the skill supports parallel work using sub-agents or task runners—one per docs section, source type, or scraping phase.

What file structure does it create?

Files are organized under .firecrawl/<hostname>/<path>/ with index.md files, plus manifest.json (RAG), training-data.jsonl (training), or sources.json (reference).

Do I need to provide exact URLs or can I use topics?

Both work—provide specific URLs for targeted scraping or topics for broader search-based corpus collection.

Full instructions (SKILL.md)

Source of truth, from firecrawl/firecrawl-workflows.

name: firecrawl-knowledge-base
description: Build a knowledge base from web content with Firecrawl. Use for local reference docs, RAG-ready chunks, fine-tuning datasets, documentation mirrors, topic corpora, or LLM-ready markdown organized from web sources.
license: ISC
metadata:
  author: firecrawl
  version: "0.1.0"
  homepage: https://www.firecrawl.dev
  source: https://github.com/firecrawl/firecrawl-workflows
inputs:
  - name: FIRECRAWL_API_KEY
    description: Firecrawl API key for hosted Firecrawl requests.
    required: true

Firecrawl Knowledge Base

Use this to turn URLs or topics into organized LLM-ready content.

Onboarding Interview

Infer the source, goal, depth, and output location from context. If the source and goal are clear, proceed immediately.

Ask at most three concise questions, and only if blocked. Typical blockers are the source URL or topic, whether the output is reference, RAG, training, or docs, and the training format if training is requested.

Firecrawl Collection Plan

Use Firecrawl map for documentation sites, search for topic-based corpora, scrape pages into markdown, and preserve code examples and tables.
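
The choice between map and search can be sketched as a small routing function. This is an illustration under the assumption that any http(s) URL is treated as a documentation site and anything else as a topic; `pick_discovery_step` is a hypothetical helper, and no Firecrawl call is made here.

```python
from urllib.parse import urlparse


def pick_discovery_step(source: str) -> str:
    """Return "map" for a URL source and "search" for a free-text topic."""
    parsed = urlparse(source.strip())
    # Treat the source as a site only when it has an http(s) scheme and a host.
    is_url = parsed.scheme in ("http", "https") and bool(parsed.netloc)
    return "map" if is_url else "search"
```

Either branch is then followed by scraping the discovered pages into markdown.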

For files, follow the Firecrawl download-style convention:

.firecrawl/
  <hostname>/
    <path>/
      index.md
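
As a rough sketch of the layout above, a page URL can be mapped to its local file like this. The helper name `local_path_for` and the `unknown-host` fallback are assumptions for illustration; query strings and fragments are simply ignored.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def local_path_for(url: str, root: str = ".firecrawl") -> PurePosixPath:
    """Map a page URL to <root>/<hostname>/<path>/index.md."""
    parsed = urlparse(url)
    # Drop empty segments and dot segments so the result stays under the root.
    segments = [s for s in parsed.path.split("/") if s and s not in (".", "..")]
    host = parsed.hostname or "unknown-host"
    return PurePosixPath(root, host, *segments, "index.md")
```

For example, `local_path_for("https://example.com/guides/setup")` yields `.firecrawl/example.com/guides/setup/index.md`.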

Parallel Work

If appropriate, use sub-agents or equivalent parallel task runners:

by docs section: one researcher per section
by source type: official docs, tutorials, community discussions, and references
by phase: source scraping, chunk generation, and manifest generation
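
Where sub-agents are not available, the same split can be approximated with a thread pool. The sketch below assumes a caller-supplied `worker` function (a placeholder for whatever scrapes one section); `run_in_parallel` is a hypothetical name, not something the skill defines.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def run_in_parallel(
    sections: Iterable[str],
    worker: Callable[[str], str],
    max_workers: int = 4,
) -> Dict[str, str]:
    """Run one worker call per section and collect the results keyed by section."""
    section_list = list(sections)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with section_list.
        results = list(pool.map(worker, section_list))
    return dict(zip(section_list, results))
```

Threads suit this workload because each unit of work is dominated by network waits rather than CPU time.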

Output Modes

Reference: markdown files, index.md, and sources.json.
RAG: markdown files plus chunk files and manifest.json.
Training: scraped source files plus training-data.jsonl and training-metadata.json.
Docs mirror: complete markdown mirror with a table of contents.
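
For the RAG mode, one way to produce chunks without breaking code examples is to split only at blank lines that fall outside fenced code blocks. This is a sketch under stated assumptions: the skill does not specify a chunking algorithm or a manifest schema, so `chunk_markdown`, `build_manifest`, and the manifest field names are all hypothetical.

```python
from typing import Dict, List

FENCE = "`" * 3  # markdown code-fence marker


def chunk_markdown(markdown: str, max_chars: int = 1200) -> List[str]:
    """Split markdown at blank lines, never inside a fenced code block."""
    chunks: List[str] = []
    current: List[str] = []
    size = 0
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        current.append(line)
        size += len(line) + 1
        # A blank line outside a fence is a safe place to cut.
        if not in_fence and line.strip() == "" and size >= max_chars:
            text = "\n".join(current).strip()
            if text:
                chunks.append(text)
            current, size = [], 0
    tail = "\n".join(current).strip()
    if tail:
        chunks.append(tail)
    return chunks


def build_manifest(source_url: str, chunks: List[str]) -> Dict:
    """Describe the chunks of one source page (illustrative schema)."""
    return {
        "source_url": source_url,
        "chunk_count": len(chunks),
        "chunks": [{"id": i, "chars": len(c)} for i, c in enumerate(chunks)],
    }
```

A chunk can exceed `max_chars` when a single code block is longer than the limit; the sketch prefers an oversized chunk to a split code example.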

Final Deliverable

# Knowledge Base: [Source]

## Summary
[What was collected and why]

## Output Structure
[Files/directories created]

## Coverage
[Sections, source types, counts]

## Usage Notes
[How to use in RAG, docs, training, or agent context]

## Sources
[URLs collected]

## Rerun Inputs
workflow: firecrawl-knowledge-base
source: [url/topic]
goal: [reference/rag/train/docs]
depth: [quick/thorough/exhaustive]
output_dir: [.firecrawl/]
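
The rerun inputs at the end of the deliverable can be modelled as a small validated record. This sketch takes the allowed `goal` and `depth` values from the template above; the `RerunInputs` class and its `render` method are illustrative names only.

```python
from dataclasses import dataclass

GOALS = ("reference", "rag", "train", "docs")
DEPTHS = ("quick", "thorough", "exhaustive")


@dataclass(frozen=True)
class RerunInputs:
    source: str
    goal: str
    depth: str
    output_dir: str = ".firecrawl/"
    workflow: str = "firecrawl-knowledge-base"

    def __post_init__(self) -> None:
        # Reject values the template does not list.
        if self.goal not in GOALS:
            raise ValueError(f"goal must be one of {GOALS}")
        if self.depth not in DEPTHS:
            raise ValueError(f"depth must be one of {DEPTHS}")

    def render(self) -> str:
        """Format the record in the same key order as the template."""
        return "\n".join(
            [
                f"workflow: {self.workflow}",
                f"source: {self.source}",
                f"goal: {self.goal}",
                f"depth: {self.depth}",
                f"output_dir: {self.output_dir}",
            ]
        )
```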

Quality Bar

Preserve code examples and formatting.
Remove boilerplate navigation where possible.
Include source URLs in frontmatter or metadata.
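
Recording the source URL in frontmatter might look like the sketch below. The field names `source_url` and `title` are assumptions, since the skill does not prescribe a frontmatter schema; values are written as JSON strings, which are also valid YAML double-quoted scalars.

```python
import json


def with_source_frontmatter(markdown: str, source_url: str, title: str = "") -> str:
    """Prepend YAML frontmatter that records where the page came from."""
    lines = ["---", f"source_url: {json.dumps(source_url)}"]
    if title:
        lines.append(f"title: {json.dumps(title)}")
    lines.append("---")
    # Keep the page body unchanged apart from leading blank lines.
    return "\n".join(lines) + "\n\n" + markdown.lstrip("\n")
```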
