GitHub - hubzero/a11y-catscan: Multi-Engine WCAG Compliance Crawler

Multi-engine accessibility scans that survive real crawls.

a11y-catscan crawls a website with Playwright and runs four accessibility engines — axe-core, Siteimprove Alfa, IBM Equal Access, and HTML_CodeSniffer — sharing one Chromium instance. Findings are deduped across engines, streamed to JSONL/HTML/JSON reports, and exposed as MCP tools so an LLM can analyze them directly.

Status: beta. The design is production-shaped and is currently being exercised in development; the session-recovery cycle and worker pool work end-to-end on multi-thousand-page authenticated crawls. Architecture and per-module design notes live in DESIGN.md. The site handbook is rendered to GitHub Pages from docs-src/; see the documentation index below.

What's shipped

Four scan engines. axe-core (Deque), Siteimprove Alfa (ACT-rules native), IBM Equal Access, HTML_CodeSniffer. Run one or combine them — --engine axe,alfa,ibm,htmlcs — all sharing one Chromium so a multi-engine scan isn't 4× the page loads. Each finding carries an engine attribution.
Cross-engine dedup. Findings sharing (selector, primary-tag, outcome) collapse into one entry with engines: {axe: ..., ibm: ...} and per-engine impact upgraded to the worst severity. EARL outcomes (failed / cantTell / passed / inapplicable) are the internal vocabulary.
Streaming reports. JSONL is written one page per line so memory stays flat across 5000-page crawls; HTML and the LLM-friendly markdown summary stream from disk on demand.
Sliding-window async crawler. N-worker pool with one Chromium, periodic browser restart for memory hygiene (restart_every), atomic state save (--resume), graceful shutdown on SIGTERM/SIGINT, on-demand snapshot via SIGUSR1.
Authenticated scans with mid-scan session recovery. A Python login plugin authenticates once, the saved session state shortcuts subsequent starts, and if the session expires mid-crawl the scanner drains workers, re-logs-in, bans detected logout-trap URLs, and resumes. Persistent re-login failure trips a circuit breaker so the crawl exits instead of looping.
Allowlist with engine + outcome filters. YAML allowlist suppresses known-acceptable findings by rule, URL, target, engine, and outcome — all AND'd. O(1) average lookup via a rule-id index.
MCP server. --mcp exposes scan_page / analyze_report / find_issues / check_page / compare_scans / manage_scans / lookup_wcag / list_engines as Claude Code tools. URL schemes are validated to http(s).
Diff and rescan workflows. --diff PREV.jsonl shows fixed/new/remaining findings; --rescan PREV.jsonl re-scans only pages that previously had issues; --violations-from / --incompletes-from extract specific URL sets from prior reports.
Group-by analysis. --group-by {rule, selector, color, reason, wcag, level, engine, bp} prints a sorted summary with per-group page counts and one example.
Niceness + OOM-resistance. Defaults to nice 10 and oom_score_adj=1000, so the scanner yields CPU to other processes and is the kernel OOM killer's first candidate, rather than starving production services on shared hosts.
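
The cross-engine dedup described in the list above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the field names (selector, primary_tag, outcome, impact, engine, rule), the tag values, and the impact ranking are all assumptions made for the example, and it keeps a single merged impact per entry rather than modelling per-engine detail.

```python
# Severity ranks, mildest to worst. Assumption: axe-style impact labels.
IMPACT_RANK = {"minor": 0, "moderate": 1, "serious": 2, "critical": 3}


def dedup_findings(findings):
    """Collapse findings that share (selector, primary_tag, outcome)."""
    merged = {}
    for f in findings:
        key = (f["selector"], f["primary_tag"], f["outcome"])
        entry = merged.setdefault(key, {
            "selector": f["selector"],
            "primary_tag": f["primary_tag"],
            "outcome": f["outcome"],
            "impact": f["impact"],
            "engines": {},
        })
        # Record which engine reported it, keyed by engine name.
        entry["engines"][f["engine"]] = f["rule"]
        # Upgrade the merged impact to the worst severity seen so far.
        if IMPACT_RANK[f["impact"]] > IMPACT_RANK[entry["impact"]]:
            entry["impact"] = f["impact"]
    return list(merged.values())


# Hypothetical findings: two engines flag the same element, a third flags another.
findings = [
    {"engine": "axe", "rule": "color-contrast", "selector": "#nav a",
     "primary_tag": "sc-1.4.3", "outcome": "failed", "impact": "serious"},
    {"engine": "ibm", "rule": "text_contrast_sufficient", "selector": "#nav a",
     "primary_tag": "sc-1.4.3", "outcome": "failed", "impact": "critical"},
    {"engine": "htmlcs", "rule": "H37", "selector": "img.logo",
     "primary_tag": "sc-1.1.1", "outcome": "failed", "impact": "moderate"},
]
merged = dedup_findings(findings)
for entry in merged:
    print(entry["selector"], entry["impact"], sorted(entry["engines"]))
```

Keying on the outcome as well as the selector and tag means a "failed" result from one engine never merges with a "cantTell" from another, so the two stay visible as separate entries.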

Quick start

Requires Python 3.12 and Node.js 18+.

pip install -e .              # installs playwright, pyyaml, mcp
playwright install chromium
npm install                   # bundles the four engines

Scan one URL:

./a11y-catscan.py --page https://example.com/

Crawl with all four engines and write an LLM-friendly report:

./a11y-catscan.py --engine all --max-pages 500 --llm \
    https://example.com/

Compare against last week's baseline:

./a11y-catscan.py --diff baseline.jsonl --max-pages 500 \
    https://example.com/
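
To make the fixed/new/remaining split concrete, here is a minimal sketch of how two JSONL reports could be compared. It is not the logic in report_diff.py; the record layout (one page per line with url and findings keys, each finding carrying rule and selector) is an assumption made for the example, and the real report schema may differ.

```python
import json


def finding_keys(jsonl_lines):
    """Collect (url, rule, selector) keys from one-page-per-line JSONL."""
    keys = set()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        page = json.loads(line)
        for f in page.get("findings", []):
            keys.add((page["url"], f["rule"], f["selector"]))
    return keys


def diff_reports(prev_lines, curr_lines):
    """Classify findings as fixed, new, or remaining between two scans."""
    prev, curr = finding_keys(prev_lines), finding_keys(curr_lines)
    return {
        "fixed": sorted(prev - curr),      # only in the baseline
        "new": sorted(curr - prev),        # only in the current scan
        "remaining": sorted(prev & curr),  # present in both
    }


# Hypothetical baseline and current records for a single page.
url = "https://example.com/"
baseline = [json.dumps({"url": url, "findings": [
    {"rule": "color-contrast", "selector": "#nav a"},
    {"rule": "image-alt", "selector": "img.logo"},
]})]
current = [json.dumps({"url": url, "findings": [
    {"rule": "color-contrast", "selector": "#nav a"},
    {"rule": "label", "selector": "#q"},
]})]

result = diff_reports(baseline, current)
for status, items in result.items():
    print(status, items)
```

Because both functions take any iterable of lines, an open file object can be passed directly and each report is read one page at a time.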

Full setup walkthrough in docs-src/getting-started.md.

Documentation

Site handbook (rendered to hubzero.github.io/a11y-catscan from these sources):

Topic	Source
Getting started — install, first scan, exit codes	`docs-src/getting-started.md`
Configuration — every YAML setting + CLI override	`docs-src/configuration.md`
Scan workflows — crawl, page, urls, rescan, diff, resume	`docs-src/scan-workflows.md`
Reports — JSON, JSONL, HTML, LLM markdown formats	`docs-src/reports.md`
Authentication — login plugin, session recovery, logout traps	`docs-src/authentication.md`
MCP server — tool surface for Claude Code	`docs-src/mcp.md`
Troubleshooting	`docs-src/troubleshooting.md`
FAQ	`docs-src/faq.md`

Internal references:

DESIGN.md — current-state design specification
CHANGELOG.md — date-organized log of changes

Engines

Engine	Flag	Type	License
axe-core (Deque)	`--engine axe`	Browser injection (default)	MPL-2.0
Siteimprove Alfa	`--engine alfa`	Node.js subprocess via CDP	MIT
IBM Equal Access	`--engine ibm`	Browser injection	Apache-2.0
HTML_CodeSniffer	`--engine htmlcs`	Browser injection	BSD-3

--engine all runs all four; engines that aren't listed are skipped. axe-core, IBM, and HTML_CodeSniffer inject JavaScript into the live page and run in-browser. Alfa's TypeScript engine runs as a Node.js subprocess and connects to the shared Chromium via CDP — no second page load.
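
As an illustration of the selection rule above, the sketch below expands an --engine value into the list of engines to run. The function name parse_engines, the fallback to axe for an empty value, and the error raised on an unknown name are assumptions for the example; the actual CLI may handle these cases differently.

```python
# Engine flags from the table above, in table order.
ENGINES = ("axe", "alfa", "ibm", "htmlcs")
DEFAULT_ENGINE = "axe"  # the table marks axe-core as the default


def parse_engines(spec=DEFAULT_ENGINE):
    """Expand a comma-separated --engine value into a list of engine names."""
    names = [n.strip().lower() for n in spec.split(",") if n.strip()]
    if not names:
        return [DEFAULT_ENGINE]  # assumption: empty value falls back to axe
    if "all" in names:
        return list(ENGINES)
    unknown = [n for n in names if n not in ENGINES]
    if unknown:
        # Assumption: unknown names are rejected rather than silently ignored.
        raise ValueError(f"unknown engine(s): {', '.join(unknown)}")
    # Engines not listed are skipped; duplicates collapse, table order is kept.
    return [e for e in ENGINES if e in names]


print(parse_engines("all"))
print(parse_engines("ibm,axe"))
print(parse_engines())
```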

Local development

The full test suite runs against the bundled fixtures:

pip install -e '.[dev]'
pytest                       # 368 tests, ~70s with browser
pytest -m "not browser"      # 285 fast tests, <10s

Coverage is configured in pyproject.toml; see tests/ for the layout (test_engine_normalizers.py, test_crawl_loop.py, test_mcp_tools.py, etc.).

License

MIT. See LICENSE.

Engine licenses: axe-core (MPL-2.0), Siteimprove Alfa (MIT), IBM Equal Access (Apache-2.0), HTML_CodeSniffer (BSD-3). The four engines are vendored via npm and ship under their own licenses; this repo wraps them.

