Skip to content

hubzero/a11y-catscan

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

137 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

a11y-catscan

Multi-engine accessibility scans that survive real crawls.

CI License: MIT docs Python 3.12+ Playwright 1.59 WCAG 2.2 MCP server status: beta

a11y-catscan crawls a website with Playwright and runs four accessibility engines — axe-core, Siteimprove Alfa, IBM Equal Access, and HTML_CodeSniffer — sharing one Chromium instance. Findings are deduped across engines, streamed to JSONL/HTML/JSON reports, and exposed as MCP tools so an LLM can analyze them directly.

Status: beta. Production-shaped, exercising in dev; recovery cycle and worker pool work end-to-end on multi-thousand-page authenticated crawls. Architecture and per-module design notes live in DESIGN.md. Site handbook is rendered to GitHub Pages from docs-src/; see the documentation index below.

What's shipped

  • Four scan engines. axe-core (Deque), Siteimprove Alfa (ACT-rules native), IBM Equal Access, HTML_CodeSniffer. Run one or combine them — --engine axe,alfa,ibm,htmlcs — all sharing one Chromium so a multi-engine scan isn't 4× the page loads. Each finding carries an engine attribution.
  • Cross-engine dedup. Findings sharing (selector, primary-tag, outcome) collapse into one entry with engines: {axe: ..., ibm: ...} and per-engine impact upgraded to the worst severity. EARL outcomes (failed / cantTell / passed / inapplicable) are the internal vocabulary.
  • Streaming reports. JSONL is written one page per line so memory stays flat across 5000-page crawls; HTML and the LLM-friendly markdown summary stream from disk on demand.
  • Sliding-window async crawler. N-worker pool with one Chromium, periodic browser restart for memory hygiene (restart_every), atomic state save (--resume), graceful shutdown on SIGTERM/SIGINT, on-demand snapshot via SIGUSR1.
  • Authenticated scans with mid-scan session recovery. A Python login plugin authenticates once, the saved session state shortcuts subsequent starts, and if the session expires mid-crawl the scanner drains workers, re-logs-in, bans detected logout-trap URLs, and resumes. Persistent re-login failure trips a circuit breaker so the crawl exits instead of looping.
  • Allowlist with engine + outcome filters. YAML allowlist suppresses known-acceptable findings by rule, URL, target, engine, and outcome — all AND'd. O(1) average lookup via a rule-id index.
  • MCP server. --mcp exposes scan_page / analyze_report / find_issues / check_page / compare_scans / manage_scans / lookup_wcag / list_engines as Claude Code tools. URL-scheme validated to http(s).
  • Diff and rescan workflows. --diff PREV.jsonl shows fixed/new/remaining findings; --rescan PREV.jsonl re-scans only pages that previously had issues; --violations-from / --incompletes-from extract specific URL sets from prior reports.
  • Group-by analysis. --group-by {rule, selector, color, reason, wcag, level, engine, bp} prints a sorted summary with per-group page counts and one example.
  • Niceness + OOM-resistance. Defaults to nice 10 and oom_score_adj=1000 so the scanner doesn't starve production services on shared hosts.

Quick start

Requires Python 3.12 and Node.js 18+.

pip install -e .              # installs playwright, pyyaml, mcp
playwright install chromium
npm install                   # bundles the four engines

Scan one URL:

./a11y-catscan.py --page https://example.com/

Crawl with all four engines, write LLM-friendly report:

./a11y-catscan.py --engine all --max-pages 500 --llm \
    https://example.com/

Compare against last week's baseline:

./a11y-catscan.py --diff baseline.jsonl --max-pages 500 \
    https://example.com/

Full setup walkthrough in docs-src/getting-started.md.

Documentation

Site handbook (rendered to hubzero.github.io/a11y-catscan from these sources):

Topic Source
Getting started — install, first scan, exit codes docs-src/getting-started.md
Configuration — every YAML setting + CLI override docs-src/configuration.md
Scan workflows — crawl, page, urls, rescan, diff, resume docs-src/scan-workflows.md
Reports — JSON, JSONL, HTML, LLM markdown formats docs-src/reports.md
Authentication — login plugin, session recovery, logout traps docs-src/authentication.md
MCP server — tool surface for Claude Code docs-src/mcp.md
Troubleshooting docs-src/troubleshooting.md
FAQ docs-src/faq.md

Internal references:

Engines

Engine Flag Type License
axe-core (Deque) --engine axe Browser injection (default) MPL-2.0
Siteimprove Alfa --engine alfa Node.js subprocess via CDP MIT
IBM Equal Access --engine ibm Browser injection Apache-2.0
HTML_CodeSniffer --engine htmlcs Browser injection BSD-3

--engine all runs all four; engines that aren't listed are skipped. axe-core, IBM, and HTML_CodeSniffer inject JavaScript into the live page and run in-browser. Alfa's TypeScript engine runs as a Node.js subprocess and connects to the shared Chromium via CDP — no second page load.

Local development

The full test suite runs against the bundled fixtures:

pip install -e '.[dev]'
pytest                       # 368 tests, ~70s with browser
pytest -m "not browser"      # 285 fast tests, <10s

Coverage is configured in pyproject.toml; see tests/ for the layout (test_engine_normalizers.py, test_crawl_loop.py, test_mcp_tools.py, etc.).

License

MIT. See LICENSE.

Engine licenses: axe-core (MPL-2.0), Siteimprove Alfa (MIT), IBM Equal Access (Apache-2.0), HTML_CodeSniffer (BSD-3). The four engines are vendored via npm and ship under their own licenses; this repo wraps them.

About

Multi-Engine WCAG Compliance Crawler

Resources

License

Contributing

Security policy

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors