Cookbook with Firecrawl #200 #206
Changes from all commits
750f3dd
32fb7a4
02db2ef
17d3bc8
df24bd9
5dff51a
6c01c0c
6a95ead
485a3ca
0c96b95
4ea0d28
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,9 @@ | ||
| # Example env for Firecrawl + Moss cookbook | ||
| # Copy to .env and fill in values before running the notebook. | ||
|
|
||
| # Moss credentials | ||
| MOSS_PROJECT_ID=your_moss_project_id | ||
| MOSS_PROJECT_KEY=your_moss_project_key | ||
|
|
||
| # Firecrawl API key | ||
| FIRECRAWL_API_KEY=your_firecrawl_api_key |
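
The three variables in this example env file have to be present before the notebook can talk to either service. A minimal sketch of how a setup cell might load and validate them is shown below. `load_settings`, `REQUIRED_VARS`, and the placeholder check are hypothetical names introduced for illustration, not part of this PR; `load_dotenv` is the loader from `python-dotenv`, which the README's install line already pulls in.

```python
import os

try:
    # python-dotenv is optional in this sketch; load_dotenv() reads a .env file if one exists.
    from dotenv import load_dotenv
    load_dotenv()
except ImportError:
    pass

# Variable names taken from the example env file above.
REQUIRED_VARS = ("MOSS_PROJECT_ID", "MOSS_PROJECT_KEY", "FIRECRAWL_API_KEY")


def _is_unset(value):
    """Treat empty values and untouched template placeholders as missing (assumption)."""
    return not value or value.startswith(("your_", "your-"))


def load_settings(environ=None):
    """Return the required settings as a dict, or raise naming every missing variable."""
    env = os.environ if environ is None else environ
    missing = [name for name in REQUIRED_VARS if _is_unset(env.get(name))]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

Failing early with the full list of missing names tends to be friendlier in a notebook than hitting an authentication error halfway through a crawl.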
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,97 @@ | ||
| # Firecrawl + Moss Cookbook Example | ||
|
|
||
| Use Firecrawl to turn one or more URLs into clean markdown, then index the results into Moss and query them semantically from a notebook. | ||
|
|
||
| > This is a cookbook example, not a packaged integration. Open [firecrawl_moss.ipynb](firecrawl_moss.ipynb) to follow the full URL-to-query pipeline. | ||
|
|
||
| ## Installation | ||
|
|
||
| ```bash | ||
| pip install firecrawl-py moss python-dotenv | ||
|
Contributor
Can you please add pyproject?
Author
@yatharthk2 Added pyproject and updated the README (removed Markdown Normalization from the architecture diagram).
||
| ``` | ||
|
|
||
| ## Setup | ||
|
|
||
| Set these environment variables in your shell or a `.env` file: | ||
|
|
||
| ```bash | ||
| FIRECRAWL_API_KEY=your-firecrawl-api-key | ||
| MOSS_PROJECT_ID=your-project-id | ||
| MOSS_PROJECT_KEY=your-project-key | ||
| ``` | ||
|
|
||
| ## Quick Start | ||
|
|
||
| 1. Open [firecrawl_moss.ipynb](firecrawl_moss.ipynb) in Jupyter or VS Code. | ||
| 2. Run the setup and helper cells. | ||
| 3. Set `urls` to the pages you want to ingest. | ||
| 4. Run `await build_and_query_knowledge_base(urls)` to crawl, index, and query the content. | ||
|
|
||
| ## Workflow | ||
|
|
||
| The notebook separates a one-time preparation phase from repeated querying: | ||
|
|
||
| 1. **Prepare** (one-time): Crawl URLs → normalize markdown → index into Moss | ||
| 2. **Query** (repeated): Run semantic queries against the indexed knowledge base without re-crawling | ||
|
|
||
| This design lets you crawl once, which can be slow or expensive, and then iterate on queries quickly. | ||
|
|
||
| ## Architecture | ||
|
|
||
| ``` | ||
| ┌─────────────┐ | ||
| │ URLs │ | ||
| └──────┬──────┘ | ||
| │ | ||
| | | ||
| │ | ||
| ┌──────▼─────────────────┐ | ||
| │ Crawled Pages │ | ||
| │ (raw HTML/markdown) │ | ||
| └──────┬─────────────────┘ | ||
| │ | ||
| | | ||
| │ | ||
| ┌──────▼─────────────────┐ | ||
| │ Markdown │ | ||
| │ (one DocumentInfo │ | ||
| │ per page) │ | ||
| └──────┬─────────────────┘ | ||
| │ | ||
| ├──> Moss Create Index | ||
| │ | ||
| ┌──────▼─────────────────┐ | ||
| │ Indexed Knowledge │ | ||
| │ Base (local or cloud) │ | ||
| └──────┬─────────────────┘ | ||
| │ | ||
| ├──> Semantic Query (reusable) | ||
| │ (no re-crawling needed) | ||
| │ | ||
| ┌──────▼─────────────────┐ | ||
| │ Top-K Results │ | ||
| │ (scored passages) │ | ||
| └─────────────────────────┘ | ||
| ``` | ||
|
|
||
| ## What the notebook does | ||
|
|
||
| ```python | ||
| from firecrawl import Firecrawl | ||
| from moss import DocumentInfo, MossClient, QueryOptions | ||
|
|
||
| job = Firecrawl(api_key=FIRECRAWL_API_KEY).crawl( | ||
| url="https://example.com", | ||
| limit=3, | ||
| scrape_options={"formats": ["markdown"]}, | ||
| ) | ||
|
|
||
| documents = [DocumentInfo(id="1", text=job.data[0].markdown, metadata={"source_url": "https://example.com"})] | ||
| await MossClient(MOSS_PROJECT_ID, MOSS_PROJECT_KEY).create_index("firecrawl-demo", documents) | ||
| ``` | ||
|
|
||
| ## Files | ||
|
|
||
| | File | Description | | ||
| |------|-------------| | ||
| | `firecrawl_moss.ipynb` | Notebook that crawls URLs, indexes markdown into Moss, and runs semantic search | | ||
🚩 AGENTS.md cookbook table not updated with firecrawl entry
The AGENTS.md file documents all cookbook integrations in a table under "Framework Cookbooks". This new `firecrawl/` cookbook is not listed there. While AGENTS.md is descriptive (documenting repo state for AI agents) rather than prescriptive (mandating rules), keeping it synchronized helps agents understand the repo. Other recently added cookbooks like `pydantic-ai/` and `langgraph/` are already in the table, suggesting it should be maintained.
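
Separately from the AGENTS.md note, the README's "one DocumentInfo per page" step could be sketched roughly as follows. This is an illustration under stated assumptions, not the notebook's code: `Page` and `Doc` are hypothetical stand-ins (the latter mirrors the `id`, `text`, and `metadata` fields the README passes to `DocumentInfo`), and deriving the ID from a hash of the URL is a choice made here so that re-running the preparation step yields the same IDs, whereas the README snippet uses a literal `id="1"`.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Page:
    """Hypothetical stand-in for one crawled page: its URL and markdown body."""
    url: str
    markdown: str


@dataclass
class Doc:
    """Hypothetical stand-in with the id/text/metadata fields shown in the README snippet."""
    id: str
    text: str
    metadata: dict = field(default_factory=dict)


def pages_to_docs(pages):
    """Build one Doc per page, skipping empty pages and repeated URLs."""
    docs = []
    seen_ids = set()
    for page in pages:
        text = (page.markdown or "").strip()
        if not text:
            continue  # nothing to index for this page
        # Stable ID derived from the URL, so repeated runs produce the same IDs.
        doc_id = hashlib.sha256(page.url.encode("utf-8")).hexdigest()[:16]
        if doc_id in seen_ids:
            continue  # the same URL was already converted
        seen_ids.add(doc_id)
        docs.append(Doc(id=doc_id, text=text, metadata={"source_url": page.url}))
    return docs
```

In the notebook, each resulting record would presumably be passed on as a `DocumentInfo` before the index is created; that mapping is left out here because it depends on the Moss SDK.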