michaelblaess/sitemap-generator

Sitemap Generator

Crawls websites and generates standards-compliant sitemap.xml files. Uses Playwright for JavaScript rendering or httpx for fast HTTP crawling.

Screenshots

[Screenshot: Main View]

[Screenshot: Sitemap Tree]

[Screenshot: Crawl History]

Installation

One-Liner (Standalone, no Python required)

Linux / macOS:

curl -fsSL https://raw.githubusercontent.com/michaelblaess/sitemap-generator/main/install.sh | bash

Windows (PowerShell):

irm https://raw.githubusercontent.com/michaelblaess/sitemap-generator/main/install.ps1 | iex

Usage

# Simple crawl (httpx mode, fast)
sitemap-generator https://example.com

# With JavaScript rendering (Playwright)
sitemap-generator https://example.com --render

# Save sitemap directly
sitemap-generator https://example.com --output sitemap.xml

# Limit crawl depth
sitemap-generator https://example.com --max-depth 5

# More concurrency
sitemap-generator https://example.com --concurrency 16

# Ignore robots.txt
sitemap-generator https://example.com --ignore-robots

# With cookies (e.g. for login)
sitemap-generator https://example.com --cookie session=abc123

CLI Parameters

| Parameter | Description | Default |
| --- | --- | --- |
| URL | Start URL of the website | - |
| --output, -o | Output path for sitemap.xml | sitemap_<host>_<timestamp>.xml |
| --max-depth, -d | Maximum crawl depth | 10 |
| --concurrency, -c | Parallel requests | 8 |
| --timeout, -t | Timeout per page (seconds) | 30 |
| --render | Render JavaScript with Playwright | off |
| --no-headless | Browser visible (debugging) | off |
| --ignore-robots | Ignore robots.txt | off |
| --user-agent | Custom User-Agent | Chrome 131 |
| --cookie | Set cookie (NAME=VALUE, multiple) | - |
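
As a rough illustration of how the parameters above fit together, the following sketch mirrors the table with Python's argparse. It is not the tool's actual parser. The function names `build_parser`, `parse_cookies` and `default_output` are hypothetical, and two details are assumptions: that `--cookie` is repeated once per cookie, and that the timestamp in the default file name looks like `YYYYMMDD_HHMMSS`.

```python
import argparse
from datetime import datetime
from urllib.parse import urlparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(prog="sitemap-generator")
    parser.add_argument("url", metavar="URL", help="Start URL of the website")
    parser.add_argument("--output", "-o", default=None, help="Output path for sitemap.xml")
    parser.add_argument("--max-depth", "-d", type=int, default=10)
    parser.add_argument("--concurrency", "-c", type=int, default=8)
    parser.add_argument("--timeout", "-t", type=int, default=30)
    parser.add_argument("--render", action="store_true")
    parser.add_argument("--no-headless", action="store_true")
    parser.add_argument("--ignore-robots", action="store_true")
    parser.add_argument("--user-agent", default=None)
    # Assumption: the flag is repeated, one NAME=VALUE pair per occurrence.
    parser.add_argument("--cookie", action="append", default=[], metavar="NAME=VALUE")
    return parser


def parse_cookies(pairs):
    """Turn ["session=abc123", ...] into {"session": "abc123", ...}."""
    cookies = {}
    for pair in pairs:
        # Split on the first "=" only, so values may contain "=" themselves.
        name, sep, value = pair.partition("=")
        if not sep or not name:
            raise ValueError(f"expected NAME=VALUE, got {pair!r}")
        cookies[name] = value
    return cookies


def default_output(url, now=None):
    """Build sitemap_<host>_<timestamp>.xml; the timestamp format is assumed."""
    host = urlparse(url).hostname or "unknown"
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return f"sitemap_{host}_{stamp}.xml"
```

Splitting each cookie on the first `=` only keeps values such as base64 tokens intact.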

Keyboard Shortcuts (TUI)

| Key | Function |
| --- | --- |
| c | Start crawl |
| x | Cancel crawl / JSON error report |
| m | Save sitemap |
| s | Settings |
| g | Export form report (JSON) |
| j | JIRA table to clipboard |
| e | Show errors only |
| b | Sitemap tree |
| f | Sitemap diff |
| d | Copy URL details |
| l | Toggle log |
| h | History |
| i | Info dialog |
| q | Quit |

To copy or export the log, right-click the log panel.

Features

  • Dual mode: httpx (fast, HTML only) or Playwright (JavaScript rendering)
  • robots.txt: Respected by default; use --ignore-robots to disable
  • Auto-split: Above 50,000 URLs, a sitemap index with partial sitemaps is generated automatically
  • Priority: Derived automatically from crawl depth (home page = 1.0)
  • lastmod: Taken from the HTTP Last-Modified header
  • URL normalization: URLs are normalized to avoid duplicates
  • Form detection: <form> tags are detected, marked in the table and exportable as JSON
  • Live TUI: Progress, statistics and URL details in real time
  • Resizable panels: Splitters to freely resize the URL table, log and stats panels
  • Log panel: Right-click context menu — copy, export to file, or hide
  • Settings dialog: Language, robots.txt, Playwright, concurrency, timeout and crawl depth — persisted across runs
  • Filter with history: Filter the URL table by URL/status; recent filter terms in a dropdown
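
The sitemap-related features above (normalization, depth-based priority, lastmod and auto-split) can be sketched with the standard library alone. This is an illustration under stated assumptions, not the project's code: all function names are hypothetical, the normalization rules and the priority formula (1.0 minus 0.1 per level, floored at 0.1) are guesses, and the partial file names `sitemap_<n>.xml` are made up. Only the 50,000-URL threshold and "home page = 1.0" come from the list above.

```python
from email.utils import parsedate_to_datetime
from urllib.parse import urlsplit, urlunsplit
from xml.sax.saxutils import escape

MAX_URLS_PER_SITEMAP = 50_000  # threshold named in the feature list
XMLNS = "http://www.sitemaps.org/schemas/sitemap/0.9"
HEADER = '<?xml version="1.0" encoding="UTF-8"?>'


def normalize_url(url: str) -> str:
    """Lowercase the host, drop fragment and default port (assumed rules)."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    default_port = {"http": 80, "https": 443}.get(scheme)
    if parts.port and parts.port != default_port:
        host = f"{host}:{parts.port}"
    return urlunsplit((scheme, host, parts.path or "/", parts.query, ""))


def priority_for_depth(depth: int) -> float:
    """Depth 0 (home page) is 1.0; the decay per level is an assumption."""
    return max(0.1, round(1.0 - 0.1 * depth, 1))


def lastmod_from_header(value):
    """Convert an HTTP Last-Modified value to YYYY-MM-DD, or None."""
    if not value:
        return None
    try:
        return parsedate_to_datetime(value).date().isoformat()
    except (TypeError, ValueError):  # unparseable header
        return None


def render_urlset(entries) -> str:
    """entries: iterable of (url, depth, last_modified_header)."""
    lines = [HEADER, f'<urlset xmlns="{XMLNS}">']
    for url, depth, header in entries:
        lines.append("  <url>")
        lines.append(f"    <loc>{escape(normalize_url(url))}</loc>")
        lastmod = lastmod_from_header(header)
        if lastmod:
            lines.append(f"    <lastmod>{lastmod}</lastmod>")
        lines.append(f"    <priority>{priority_for_depth(depth):.1f}</priority>")
        lines.append("  </url>")
    lines.append("</urlset>")
    return "\n".join(lines)


def build_sitemaps(entries, base_url, limit=MAX_URLS_PER_SITEMAP):
    """Return {filename: xml}; above `limit` URLs, add a sitemap index."""
    unique = {}
    for url, depth, header in entries:
        # Keep the first entry seen for each normalized URL.
        unique.setdefault(normalize_url(url), (url, depth, header))
    entries = list(unique.values())
    if len(entries) <= limit:
        return {"sitemap.xml": render_urlset(entries)}
    files = {}
    index = [HEADER, f'<sitemapindex xmlns="{XMLNS}">']
    for n, start in enumerate(range(0, len(entries), limit), start=1):
        name = f"sitemap_{n}.xml"
        files[name] = render_urlset(entries[start:start + limit])
        loc = escape(base_url.rstrip("/") + "/" + name)
        index.append(f"  <sitemap><loc>{loc}</loc></sitemap>")
    index.append("</sitemapindex>")
    files["sitemap.xml"] = "\n".join(index)
    return files
```

Passing a small `limit` (for example `limit=2`) makes the split easy to try out without generating 50,000 entries.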

Browser Strategy

  1. System Chrome preferred (faster startup, less memory)
  2. Bundled Chromium as fallback (included in standalone installation)

Privacy

Important: Crawling a website may be perceived as unusual traffic by the operator. Please note:

  • Inform the website operator before crawling, especially for large websites
  • Respect robots.txt (enabled by default)
  • Use reasonable concurrency and timeout values
  • This tool is intended for your own websites and authorized analyses
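
To illustrate what respecting robots.txt means in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. It does not show how this tool implements the check. `make_robots_checker` is a hypothetical helper, and the rules in the usage comment are example data.

```python
from urllib.robotparser import RobotFileParser


def make_robots_checker(robots_txt: str, user_agent: str = "*"):
    """Return (allowed, crawl_delay) for the given robots.txt text.

    `allowed(url)` tells whether `user_agent` may fetch `url`;
    `crawl_delay` is the Crawl-delay value for that agent, or None.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    def allowed(url: str) -> bool:
        return parser.can_fetch(user_agent, url)

    return allowed, parser.crawl_delay(user_agent)


# Example usage with made-up rules:
#   allowed, delay = make_robots_checker("User-agent: *\nDisallow: /admin/\n")
#   allowed("https://example.com/admin/users")  # blocked by the rule above
```

A crawler would fetch `/robots.txt` once per host, skip every URL for which `allowed` returns False, and wait between requests when a crawl delay is set.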

Development

Setup

git clone https://github.com/michaelblaess/sitemap-generator.git
cd sitemap-generator

# Windows
.\bootstrap.ps1

# Linux/macOS
./bootstrap.sh

Local Start

# Windows
.\run.ps1 https://example.com

# Linux/macOS
./run.sh https://example.com

Creating a Release

git tag vX.Y.Z
git push origin vX.Y.Z

GitHub Actions automatically builds executables for Windows, Linux and macOS.

License

Apache License 2.0 - see LICENSE
