Usage

This Python script, crawl.py, is designed to crawl a website and generate Markdown files containing the content of each page.

Here's a detailed explanation of how it works.

Usage

The script can be run in three ways:

With no arguments: python crawl.py
- This will print the usage instructions, showing the available options.
With the default sitemap: python crawl.py --sitemap
- This will use the default sitemap specified in the script to crawl the website.
With a custom URL: python crawl.py <URL>
- Replace <URL> with the URL of the website or specific page you want to crawl.
- If a URL is provided without the --sitemap flag, the script will crawl only that single page and save its content to a Markdown file.
- If the --sitemap flag is used with a URL, the script will use that URL as the starting point for crawling the entire website.

# Run with default sitemap:
python crawl.py --sitemap

# Specify custom sitemap:
python crawl.py --sitemap <URL>
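
The three invocation modes described above can be sketched with argparse. This is only an illustration of the dispatch logic, not the code in crawl.py: the names DEFAULT_SITEMAP, build_parser, and resolve_mode are hypothetical, and the placeholder default URL is an assumption.

```python
import argparse

# Hypothetical placeholder; the real default sitemap is defined inside crawl.py.
DEFAULT_SITEMAP = "https://example.com/sitemap.xml"


def build_parser():
    """Build a parser accepting an optional URL and an optional -s/--sitemap flag."""
    parser = argparse.ArgumentParser(
        prog="crawl.py",
        description="Crawl a single page or a whole sitemap into Markdown files.",
    )
    parser.add_argument("url", nargs="?", help="single page to crawl")
    parser.add_argument(
        "-s", "--sitemap",
        nargs="?",              # the flag may appear with or without a URL
        const=DEFAULT_SITEMAP,  # value used when the flag has no URL
        default=None,           # value used when the flag is absent
        metavar="URL",
        help="crawl every URL listed in this sitemap (default sitemap if omitted)",
    )
    return parser


def resolve_mode(argv):
    """Return (mode, target), where mode is 'usage', 'sitemap', or 'page'."""
    args = build_parser().parse_args(argv)
    if args.sitemap is not None:
        return "sitemap", args.sitemap
    if args.url:
        return "page", args.url
    return "usage", None
```

Calling resolve_mode(sys.argv[1:]) would then let the caller print the usage text, crawl one page, or crawl a full sitemap depending on the returned mode.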

How it Works

The script first retrieves the sitemap (either the default or the provided custom URL) and extracts all the URLs from it.
For each URL, the script fetches the HTML content of the corresponding web page.
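
As a rough sketch of the sitemap step, the snippet below pulls every <loc> entry out of a sitemap using only the standard library. It assumes the sitemap follows the sitemaps.org 0.9 schema; fetch_bytes and extract_urls are hypothetical helper names, and crawl.py itself may fetch and parse differently.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Namespace used by sitemaps that follow the sitemaps.org 0.9 schema.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def fetch_bytes(url, timeout=30):
    """Download a URL and return the raw response body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def extract_urls(sitemap_xml):
    """Return the text of every <loc> element in the sitemap, in document order."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.iter(f"{SITEMAP_NS}loc")
        if loc.text and loc.text.strip()
    ]
```

With these helpers, extract_urls(fetch_bytes(sitemap_url)) yields the list of page URLs to crawl. A sitemap index (a sitemap that lists other sitemaps) would return the nested sitemap URLs instead, which this sketch does not follow.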

The script then extracts the main content from the HTML, removing any unwanted elements like navigation menus, headers, footers, etc.
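
To illustrate the idea of dropping page chrome, here is a minimal standard-library sketch that keeps only the text found outside navigation, header, and footer elements. It is an assumption-laden stand-in: the script depends on crawl4ai for the real extraction, the tag list in SKIPPED_TAGS is a guess, and this version produces plain text rather than Markdown.

```python
from html.parser import HTMLParser

# Assumed set of elements treated as "unwanted"; the real list may differ.
SKIPPED_TAGS = {"nav", "header", "footer", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collect text that is not nested inside any skipped element."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # how many skipped elements are currently open
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self._chunks.append(text)

    def text(self):
        """Return the collected text, one blank line between chunks."""
        return "\n\n".join(self._chunks)


def extract_main_text(html):
    """Return the visible text of an HTML document without the skipped elements."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Tracking a depth counter instead of a boolean keeps the filter correct when skipped elements are nested, for example a nav inside a header.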

The extracted content is then saved to an individual Markdown file, with the filename based on the URL. For example, if the URL is https://example.com/page1, the Markdown file will be named page1.md.
Additionally, the script appends the content of each page to a combined Markdown file named combined.md.
The individual Markdown files are saved in the crawls/ directory, while the combined.md file is saved in the current working directory.
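
The saving step could look roughly like the sketch below. The crawls/ directory and combined.md names come from the description above; everything else is an assumption, including the helper names filename_for and save_page, the use of the last path segment as the filename, and the "index" fallback for URLs without a path.

```python
from pathlib import Path
from urllib.parse import urlparse


def filename_for(url):
    """Derive a Markdown filename from the last path segment of the URL."""
    path = urlparse(url).path.strip("/")
    # Assumption: URLs with an empty path fall back to "index".
    slug = path.split("/")[-1] if path else "index"
    return f"{slug}.md"


def save_page(url, markdown, base_dir=Path(".")):
    """Write one page to crawls/<name>.md and append it to combined.md."""
    out_dir = base_dir / "crawls"
    out_dir.mkdir(parents=True, exist_ok=True)

    page_file = out_dir / filename_for(url)
    page_file.write_text(markdown, encoding="utf-8")

    # Append mode keeps the content of earlier pages in the combined file.
    with (base_dir / "combined.md").open("a", encoding="utf-8") as combined:
        combined.write(markdown + "\n\n")
    return page_file
```

For https://example.com/page1 this writes crawls/page1.md. Note that two URLs ending in the same segment would overwrite each other under this naming scheme, and that combined.md keeps growing across runs unless it is deleted first.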

Code

This is the repo I'm replicating:

https://github.com/coleam00/ottomator-agents/blob/main/crawl4AI-agent/crawl4AI-examples/3-crawl_docs_FAST.py

Prepare System for execution

# first run

asdf install python latest
asdf local python latest
python -m pip install --upgrade pip

asdf plugin add rust
asdf install rust latest
asdf local rust latest

# must do this (set global to the same version as local)
asdf global rust latest

# note: the latest version is not always the right choice; pin a specific one
asdf install python 3.11.7
asdf global python 3.11.7
asdf local python 3.11.7
pip install --upgrade pip

# setup versions from .tool-versions
asdf install

# setup app
pip install -U crawl4ai

Examples

python crawl.py --sitemap "https://www.cnc24.com/sitemap.xml?lang=de"
python crawl.py -s https://www.techpilot.com/de/sitemap.xml
python crawl.py -s https://www.machiningdoctor.com/glossary-sitemap.xml
