firecrawl-knowledge-base
firecrawl/firecrawl-workflows
Build organized, LLM-ready knowledge bases from web content using Firecrawl.
What is firecrawl-knowledge-base?
Scrapes and structures web content into markdown files optimized for RAG, fine-tuning, documentation mirrors, or local reference. Use when you need to convert URLs or topics into organized, machine-readable corpora with preserved code examples and metadata.
- Scrapes documentation sites and web pages into clean markdown files
- Generates RAG-ready chunks with manifest.json for semantic search
- Creates training datasets in JSONL format with metadata
- Organizes output by hostname and path following Firecrawl conventions
- Preserves code examples, tables, and formatting during extraction
- Supports parallel scraping across multiple sources or sections
How to install firecrawl-knowledge-base
npx skills add https://github.com/firecrawl/firecrawl-workflows --skill firecrawl-knowledge-base
- Firecrawl API key (set FIRECRAWL_API_KEY environment variable)
- Source URL(s) or topic to scrape
- Target output directory for knowledge base files
How to use firecrawl-knowledge-base
1. Provide the source URL, topic, or documentation site to scrape
2. Specify your goal: reference docs, RAG chunks, training data, or docs mirror
3. Choose depth level: quick (key pages), thorough (full coverage), or exhaustive (all linked content)
4. Let the skill infer structure or answer 1-3 clarifying questions if needed
5. Review the generated markdown files, chunks, and metadata in the .firecrawl/ directory
6. Use output files directly in RAG systems, training pipelines, or as local documentation
Use cases
- Mirror official documentation locally with table of contents and source tracking
- Build RAG datasets from technical docs, tutorials, and community discussions
- Create fine-tuning datasets from curated web content with training metadata
- Generate topic-specific corpora from search results across multiple sources
- Extract and organize API documentation into structured reference files
- AI engineers building RAG systems
- ML practitioners preparing training datasets
- Documentation teams creating offline mirrors
- Knowledge workers building local reference libraries
- Developers integrating web content into agent workflows
firecrawl-knowledge-base FAQ
Four output modes are available: Reference (markdown + sources.json), RAG (markdown + chunks + manifest.json), Training (JSONL + metadata), and Docs mirror (complete markdown with table of contents).
The skill preserves code formatting and table structure during extraction, removing only boilerplate navigation while keeping content-critical elements.
The skill supports parallel work using sub-agents or task runners: one per docs section, source type, or scraping phase.
Files are organized under .firecrawl/<hostname>/<path>/ with index.md files, plus manifest.json (RAG), training-data.jsonl (training), or sources.json (reference).
Both URLs and topics work as input: provide specific URLs for targeted scraping, or topics for broader search-based corpus collection.
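As an illustration of the Training mode's JSONL output, here is a minimal sketch of serializing records one JSON object per line. The record fields (`text`, `source_url`) and the helper name `to_jsonl` are assumptions made for this example; the skill does not document the schema of training-data.jsonl.

```python
import json


def to_jsonl(records: list[dict]) -> str:
    """Serialize records as JSON Lines: one JSON object per line."""
    # ensure_ascii=False keeps non-ASCII text readable in the output file.
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)


# Hypothetical records; real field names depend on the training format chosen.
records = [
    {"text": "Install with pip.", "source_url": "https://docs.example.com/install"},
    {"text": "Set the API key.", "source_url": "https://docs.example.com/config"},
]
jsonl = to_jsonl(records)
print(jsonl, end="")
```

Each line can be parsed independently with `json.loads`, which is what makes the format convenient for streaming into training pipelines.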
Full instructions (SKILL.md)
Source of truth, from firecrawl/firecrawl-workflows.
name: firecrawl-knowledge-base
description: Build a knowledge base from web content with Firecrawl. Use for local reference docs, RAG-ready chunks, fine-tuning datasets, documentation mirrors, topic corpora, or LLM-ready markdown organized from web sources.
license: ISC
metadata:
  author: firecrawl
  version: "0.1.0"
  homepage: https://www.firecrawl.dev
  source: https://github.com/firecrawl/firecrawl-workflows
inputs:
  - name: FIRECRAWL_API_KEY
    description: Firecrawl API key for hosted Firecrawl requests.
    required: true
Firecrawl Knowledge Base
Use this to turn URLs or topics into organized LLM-ready content.
Onboarding Interview
Infer the source, goal, depth, and output location from context. If the source and goal are clear, proceed immediately.
Ask at most 1-3 concise questions only if blocked, such as the source URL/topic, whether the output is reference/RAG/training/docs, or training format if training is requested.
Firecrawl Collection Plan
Use Firecrawl map for documentation sites, search for topic-based corpora, scrape pages into markdown, and preserve code examples and tables.
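The choice between map and search can be sketched as a small planning step. This is only an illustration under stated assumptions: `plan_collection` is a hypothetical helper, the step names are plain labels rather than Firecrawl API calls, and a source counts as a site only when it parses as an http(s) URL, so a bare hostname without a scheme would be treated as a topic.

```python
from urllib.parse import urlparse


def plan_collection(source: str) -> list[str]:
    """Return ordered step labels for a source (hypothetical helper)."""
    parsed = urlparse(source.strip())
    is_url = parsed.scheme in ("http", "https") and bool(parsed.netloc)
    # Documentation sites: discover pages with map.
    # Topics: discover pages with search.
    discover = "map" if is_url else "search"
    # Either way, the discovered pages are then scraped into markdown.
    return [discover, "scrape"]


print(plan_collection("https://docs.example.com"))
print(plan_collection("vector database indexing"))
```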
For files, follow the Firecrawl download-style convention:
.firecrawl/
  <hostname>/
    <path>/
      index.md
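One way to derive that location from a page URL is sketched below. The helper name `output_path` is hypothetical, and details such as lowercasing the hostname and dropping the port are assumptions of this example, not documented behavior of the skill.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def output_path(url: str, root: str = ".firecrawl") -> PurePosixPath:
    """Map a page URL to <root>/<hostname>/<path>/index.md."""
    parsed = urlparse(url)
    if not parsed.hostname:
        raise ValueError(f"not an absolute URL: {url!r}")
    # Drop empty and dot segments so trailing slashes and ".." cannot
    # change where the file lands.
    segments = [s for s in parsed.path.split("/") if s and s not in (".", "..")]
    return PurePosixPath(root, parsed.hostname, *segments, "index.md")


print(output_path("https://docs.example.com/guides/quickstart/"))
print(output_path("https://docs.example.com"))
```

Filtering out `..` segments keeps every output file under the root directory even when a scraped URL contains unusual path components.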
Parallel Work
If appropriate, use sub-agents or equivalent parallel task runners:
- one docs section per researcher
- official docs, tutorials, community discussions, and references by source type
- source scraping vs chunk generation vs manifest generation
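Where sub-agents are not available, the "one docs section per researcher" split can be approximated with a thread pool. The sketch below uses only the Python standard library; `scrape_section` is a placeholder that stands in for real scraping work, and the section names are invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def scrape_section(section: str) -> dict:
    """Placeholder worker: a real one would scrape every page in the section."""
    return {"section": section, "status": "done"}


def run_parallel(sections: list[str], max_workers: int = 4) -> list[dict]:
    """Run one worker per docs section and collect the results."""
    # pool.map preserves input order, so results line up with sections.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(scrape_section, sections))


results = run_parallel(["getting-started", "api-reference", "guides"])
print([r["section"] for r in results])
```

Threads suit this workload because scraping is dominated by network waits rather than CPU time.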
Output Modes
- Reference: markdown files, index.md, and sources.json.
- RAG: markdown files plus chunk files and manifest.json.
- Training: scraped source files plus training-data.jsonl and training-metadata.json.
- Docs mirror: complete markdown mirror with a table of contents.
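For the RAG mode, a minimal sketch of heading-based chunking with a manifest follows. It assumes chunks are split at markdown headings and never inside a fenced code block, which keeps code examples intact. The manifest fields (`source`, `chunk_count`, `chunks`) and both function names are assumptions of this example; the skill does not specify the manifest.json schema or a chunking strategy.

```python
import json
import re

FENCE = "`" * 3  # markdown code-fence marker


def chunk_markdown(markdown: str) -> list[dict]:
    """Split markdown at headings, never inside a fenced code block."""
    chunks, current, title, in_fence = [], [], "", False

    def flush() -> None:
        text = "\n".join(current).strip()
        if text:  # skip chunks that contain only blank lines
            chunks.append({"title": title, "text": text})

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        # Lines starting with '#' inside a fence are code comments, not headings.
        heading = None if in_fence else re.match(r"^#{1,6}\s+(.*)", line)
        if heading:
            flush()
            current, title = [], heading.group(1).strip()
        current.append(line)
    flush()
    return chunks


def build_manifest(source_url: str, chunks: list[dict]) -> dict:
    """Describe the chunks for later indexing (assumed schema)."""
    return {
        "source": source_url,
        "chunk_count": len(chunks),
        "chunks": [
            {"id": f"chunk-{i:04d}", "title": c["title"], "chars": len(c["text"])}
            for i, c in enumerate(chunks)
        ],
    }


sample = "\n".join([
    "# Install",
    "Run the installer.",
    FENCE + "bash",
    "# this comment is not a heading",
    "pip install example",
    FENCE,
    "## Configure",
    "Set the API key.",
])
chunks = chunk_markdown(sample)
manifest = build_manifest("https://docs.example.com/install", chunks)
print(json.dumps(manifest, indent=2))
```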
Final Deliverable
# Knowledge Base: [Source]
## Summary
[What was collected and why]
## Output Structure
[Files/directories created]
## Coverage
[Sections, source types, counts]
## Usage Notes
[How to use in RAG, docs, training, or agent context]
## Sources
[URLs collected]
## Rerun Inputs
workflow: firecrawl-knowledge-base
source: [url/topic]
goal: [reference/rag/train/docs]
depth: [quick/thorough/exhaustive]
output_dir: [.firecrawl/]
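If the rerun inputs are consumed by a script, they could be validated before a rerun starts. The sketch below is one possible shape, not part of the skill: the `RerunInputs` class is hypothetical, and the allowed values are taken from the bracketed options in the template above.

```python
from dataclasses import dataclass

GOALS = {"reference", "rag", "train", "docs"}
DEPTHS = {"quick", "thorough", "exhaustive"}


@dataclass(frozen=True)
class RerunInputs:
    """Rerun inputs with basic validation (hypothetical container)."""

    source: str
    goal: str
    depth: str
    output_dir: str = ".firecrawl/"
    workflow: str = "firecrawl-knowledge-base"

    def __post_init__(self) -> None:
        if not self.source.strip():
            raise ValueError("source must be a URL or topic")
        if self.goal not in GOALS:
            raise ValueError(f"goal must be one of {sorted(GOALS)}")
        if self.depth not in DEPTHS:
            raise ValueError(f"depth must be one of {sorted(DEPTHS)}")


inputs = RerunInputs(source="https://docs.example.com", goal="rag", depth="thorough")
print(inputs)
```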
Quality Bar
- Preserve code examples and formatting.
- Remove boilerplate navigation where possible.
- Include source URLs in frontmatter or metadata.
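The last point can be illustrated by prepending YAML frontmatter to each scraped markdown file. This is a sketch only: the field names `source_url` and `scraped_at` and the helper `with_frontmatter` are assumptions, since the skill does not prescribe a frontmatter schema.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def with_frontmatter(markdown: str, source_url: str,
                     scraped_at: Optional[datetime] = None) -> str:
    """Prepend YAML frontmatter that records where the content came from."""
    scraped_at = scraped_at or datetime.now(timezone.utc)
    # json.dumps emits double-quoted strings, which are also valid YAML scalars,
    # so URLs containing ':' or '#' cannot break the frontmatter.
    lines = [
        "---",
        f"source_url: {json.dumps(source_url)}",
        f"scraped_at: {json.dumps(scraped_at.isoformat())}",
        "---",
        "",
    ]
    return "\n".join(lines) + markdown


stamp = datetime(2024, 1, 1, tzinfo=timezone.utc)
page = with_frontmatter("# Title\n", "https://docs.example.com/a", stamp)
print(page)
```

The markdown body is appended unchanged, so code examples and formatting in the scraped content are preserved.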