Araneae is a link checker for documentation websites. Point it at one entry URL, and it crawls links that are safe for that site, checks each discovered target once, counts every link occurrence, and writes a JSON report. It also includes a small local web UI for triaging the report.
The primary audience is technical writers and docs maintainers who need to validate a published docs site or preview environment before release.
Araneae:
- Fetches the entry URL first.
- Parses HTML pages for <a href="..."> links.
- Crawls only links in the entry URL origin by default.
- Accepts additional exact origins with --allow-host.
- Optionally restricts crawling to a path prefix with --path-prefix.
- Can seed a local docs build with --local-root so orphaned HTML pages are checked.
- Counts duplicate link occurrences.
- Fetches each normalized target URL once, even if multiple fragment links point to it.
- Reports dead links, missing fragments, and non-200 HTTP responses.
- Records skipped out-of-scope links separately.
- Writes a stable JSON report and serves it in a local UI.
Araneae does not execute JavaScript, authenticate to private sites, crawl external sites by default, or check image/script/style assets in the first version.
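The "fetch once, count every occurrence" behavior in the list above can be sketched as follows. This is a minimal illustration, not Araneae's actual code: the helper names fetchURL and tally are hypothetical, and the only normalization shown is dropping the fragment, whereas the real normalization may do more.

```go
package main

import (
	"fmt"
	"net/url"
)

// fetchURL strips the fragment so that /install and /install#requirements
// share one fetch target. Hypothetical helper: fragment removal is the only
// normalization assumed here.
func fetchURL(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	u.Fragment = ""
	u.RawFragment = ""
	return u.String(), nil
}

// tally counts every link occurrence and collects the distinct fetch targets.
func tally(links []string) (map[string]int, map[string]bool, error) {
	occurrences := make(map[string]int)
	fetches := make(map[string]bool)
	for _, l := range links {
		f, err := fetchURL(l)
		if err != nil {
			return nil, nil, err
		}
		occurrences[l]++
		fetches[f] = true
	}
	return occurrences, fetches, nil
}

func main() {
	links := []string{
		"https://docs.example.com/install",
		"https://docs.example.com/install#requirements",
		"https://docs.example.com/install#requirements",
	}
	occ, fetches, err := tally(links)
	if err != nil {
		panic(err)
	}
	fmt.Println("distinct links:", len(occ), "fetch targets:", len(fetches))
}
```

Here the two fragment variants remain separate links with their own counts, but only one request would be needed for the page they share.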
For most technical writers, the recommended install path is a compiled release binary. Release binaries do not require Go or any other runtime dependency.
If you install from source or build Araneae yourself, you need Go. This repository currently targets Go 1.26.2.
- Official Go install instructions: go.dev/doc/install
- After installing Go, verify it is available:
go version
For Araneae 1.0 and later, download the binary for your operating system from the GitHub releases page.
- Download the archive for your platform.
- Unpack it.
- Move the araneae executable somewhere on your PATH.
- Verify the command is available:
araneae help
Install from source if you are comfortable with Go tooling or need the latest code from the repository.
go install ./cmd/araneae
go build -o araneae ./cmd/araneae
go run ./cmd/araneae scan https://docs.example.com/
Run a scan:
araneae scan https://docs.example.com/ --out report.json
Open the report in the local UI:
araneae serve report.json
The server prints the local URL it is serving. Use --addr to choose an address:
araneae serve report.json --addr 127.0.0.1:8080
Flags may appear before or after the positional argument.
araneae scan <entry-url> [flags]
Important flags:
- --out araneae-report.json: output report path.
- --max-pages 500: maximum number of same-scope fetch URLs to check, including the entry URL.
- --timeout 15s: per-request timeout.
- --concurrency 8: number of concurrent fetch workers.
- --max-requests-per-second 0: maximum request starts per second across all workers. 0 means unlimited.
- --allow-host https://www.example.com: additional exact origin that is safe to crawl. Can be repeated.
- --path-prefix /docs/: optional normalized path prefix that same-scope links must match.
- --local-root public: local static site root to seed the crawl with every .html/.htm page.
- --user-agent "araneae/0.1": HTTP user agent.
- --fail-on-dead: exit non-zero after writing the report if dead links are found.
- --fail-on-non-200: exit non-zero after writing the report if any non-200 links are found.
Examples:
araneae scan https://docs.example.com/ \
--out report.json \
--max-pages 1000 \
--concurrency 8 \
--max-requests-per-second 5
Allow a second exact origin:
araneae scan https://docs.example.com/ \
--allow-host https://www.example.com
Limit crawling to a docs subtree:
araneae scan https://example.com/docs/ \
--path-prefix /docs/
Check a local docs build for orphaned pages:
araneae scan http://localhost:8000/docs/ \
--local-root public/docs \
--path-prefix /docs/
--local-root treats the directory as being served at the entry URL path. It maps index.html to the directory URL; for example, guide/index.html becomes /docs/guide/.
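The --local-root mapping can be illustrated with a short sketch. The function name urlPathFor is hypothetical and this is not Araneae's actual code; it assumes file paths are given relative to the local root with forward slashes, and it handles only the index.html case described above.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// urlPathFor maps a file path relative to the local root onto a URL path
// under the entry URL path. Hypothetical helper for illustration only.
func urlPathFor(entryPath, rel string) string {
	if !strings.HasSuffix(entryPath, "/") {
		entryPath += "/"
	}
	rel = strings.TrimPrefix(rel, "/")
	if path.Base(rel) == "index.html" {
		dir := path.Dir(rel)
		if dir == "." {
			// index.html at the root maps to the entry path itself.
			return entryPath
		}
		// guide/index.html maps to the directory URL guide/.
		return entryPath + dir + "/"
	}
	return entryPath + rel
}

func main() {
	for _, rel := range []string{"index.html", "guide/index.html", "api/ref.html"} {
		fmt.Println(rel, "->", urlPathFor("/docs/", rel))
	}
}
```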
Use in CI:
araneae scan https://preview.example.com/docs/ \
--out report.json \
--fail-on-dead \
--fail-on-non-200
By default, Araneae crawls only the origin of the final entry URL after redirects. An origin is the combination of scheme, host, and port.
For example, this scan:
araneae scan https://docs.example.com/
will crawl https://docs.example.com/..., but it will not crawl:
- https://www.example.com/...
- https://api.example.com/...
- http://docs.example.com/...
Use --allow-host for additional safe origins. The match is exact by origin:
araneae scan https://docs.example.com/ \
--allow-host https://www.example.com
Use --path-prefix to keep the crawl inside a subtree:
araneae scan https://example.com/docs/ --path-prefix /docs/
Same-origin links outside the prefix are recorded under skipped_links with reason outside_path_prefix.
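The scope rules above (exact origin match, then an optional path prefix) can be sketched like this. It is an illustration under stated assumptions, not the crawler's real logic: originOf and skipReason are hypothetical names, the reason string external_origin is invented for the example (only outside_path_prefix appears in this document), and default ports are not normalized.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// originOf returns scheme://host[:port]. Simplification: default ports are
// not normalized, so https://a.example and https://a.example:443 differ here.
func originOf(u *url.URL) string {
	return strings.ToLower(u.Scheme) + "://" + strings.ToLower(u.Host)
}

// skipReason returns "" when a link is in scope, otherwise a skip reason.
// "external_origin" is a made-up label; "outside_path_prefix" is the reason
// named in the text above.
func skipReason(link string, allowed map[string]bool, prefix string) (string, error) {
	u, err := url.Parse(link)
	if err != nil {
		return "", err
	}
	if !allowed[originOf(u)] {
		return "external_origin", nil
	}
	if prefix != "" && !strings.HasPrefix(u.Path, prefix) {
		return "outside_path_prefix", nil
	}
	return "", nil
}

func main() {
	allowed := map[string]bool{"https://docs.example.com": true}
	links := []string{
		"https://docs.example.com/docs/install",
		"http://docs.example.com/docs/install",
		"https://docs.example.com/blog/",
	}
	for _, l := range links {
		reason, err := skipReason(l, allowed, "/docs/")
		if err != nil {
			panic(err)
		}
		if reason == "" {
			reason = "in scope"
		}
		fmt.Printf("%-40s %s\n", l, reason)
	}
}
```

Because the comparison is on the whole origin, a change of scheme alone (http versus https) is enough to put a link out of scope.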
The scan writes a JSON report. The top-level shape is:
{
"schema_version": 1,
"generated_at": "2026-05-28T15:04:46Z",
"entry_url": "https://docs.example.com/",
"scope": {
"origin": "https://docs.example.com",
"allowed_origins": [],
"same_site_policy": "exact_origin_with_allowlist",
"path_prefix": ""
},
"limits": {
"max_pages": 500,
"request_timeout_seconds": 15,
"max_concurrency": 8,
"max_requests_per_second": 0
},
"summary": {
"links_discovered": 5,
"link_occurrences": 6,
"fetches_attempted": 4,
"ok_links": 3,
"dead_links": 2,
"non_200_links": 1,
"skipped_links": 1,
"skipped_external_links": 1,
"truncated": false,
"unvisited_urls": 0
},
"links": [],
"fetches": [],
"skipped_links": []
}
Each links entry represents one normalized navigable URL. Fragment variants are separate links but share a fetch_url:
{
"url": "https://docs.example.com/install#requirements",
"fetch_url": "https://docs.example.com/install",
"count": 4,
"ok": false,
"dead": true,
"non_200": false,
"problem": "missing_fragment",
"status_code": 200,
"final_url": "https://docs.example.com/install",
"content_type": "text/html",
"error": "",
"sources": [
{
"page_url": "https://docs.example.com/",
"count": 2,
"texts": ["Requirements"]
}
]
}
Problem values include:
- http_status: a received HTTP status other than 200.
- network_error: DNS, connection, or other network failure.
- timeout: request timeout.
- tls_error: TLS/certificate failure.
- too_many_redirects: redirect limit exceeded.
- missing_fragment: linked fragment was not found on a 200 HTML page.
- parsing_error: HTML parsing failed.
dead is true for network failures, timeouts, TLS errors, HTTP 404/410, and missing fragments. non_200 is true for any received HTTP status other than 200.
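The relationship between problem, status_code, dead, and non_200 can be summarized in a small sketch. The function classify is hypothetical and assumes a status code of 0 means no HTTP response was received. The text above does not say how too_many_redirects and parsing_error are classified, so this sketch leaves them unflagged.

```go
package main

import "fmt"

// classify derives the dead and non_200 flags from a problem value and a
// status code. Hypothetical helper; status 0 is assumed to mean "no response".
func classify(problem string, status int) (dead, non200 bool) {
	switch problem {
	case "network_error", "timeout", "tls_error", "missing_fragment":
		dead = true
	}
	if status == 404 || status == 410 {
		dead = true
	}
	// non_200 applies only when a status was actually received.
	non200 = status != 0 && status != 200
	return dead, non200
}

func main() {
	cases := []struct {
		problem string
		status  int
	}{
		{"missing_fragment", 200},
		{"http_status", 404},
		{"http_status", 500},
		{"timeout", 0},
	}
	for _, c := range cases {
		dead, non200 := classify(c.problem, c.status)
		fmt.Printf("%-17s %3d dead=%v non_200=%v\n", c.problem, c.status, dead, non200)
	}
}
```

Note that a link can be dead without being non_200, as with a missing fragment on a page that returned 200, and non_200 without being dead, as with a 500 response.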
skipped_links contains links Araneae saw but did not crawl, such as external origins or same-origin links outside --path-prefix.
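For custom triage outside the UI, the report can be read with a few lines of Go. The following is a sketch that decodes only a handful of the fields shown in the examples above. The type and function names are hypothetical, and the embedded sample stands in for a real report.json.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Only the fields needed for this example are declared; unknown JSON fields
// are ignored by encoding/json.
type source struct {
	PageURL string `json:"page_url"`
	Count   int    `json:"count"`
}

type link struct {
	URL     string   `json:"url"`
	Dead    bool     `json:"dead"`
	Problem string   `json:"problem"`
	Sources []source `json:"sources"`
}

type report struct {
	Links []link `json:"links"`
}

// deadLinks decodes a report and returns the entries flagged as dead.
func deadLinks(data []byte) ([]link, error) {
	var r report
	if err := json.Unmarshal(data, &r); err != nil {
		return nil, err
	}
	var out []link
	for _, l := range r.Links {
		if l.Dead {
			out = append(out, l)
		}
	}
	return out, nil
}

// sample is a trimmed, illustrative report; a real one has more fields.
const sample = `{
  "schema_version": 1,
  "links": [
    {"url": "https://docs.example.com/install", "dead": false, "problem": ""},
    {"url": "https://docs.example.com/install#requirements", "dead": true,
     "problem": "missing_fragment",
     "sources": [{"page_url": "https://docs.example.com/", "count": 2}]}
  ]
}`

func main() {
	dead, err := deadLinks([]byte(sample))
	if err != nil {
		panic(err)
	}
	for _, l := range dead {
		fmt.Println(l.Problem, l.URL)
		for _, s := range l.Sources {
			fmt.Println("  linked from", s.PageURL)
		}
	}
}
```

To use it on a real report, replace the sample with the bytes returned by os.ReadFile for the path given to --out.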
The UI is served locally from the Go binary:
araneae serve report.json
It includes:
- Summary metrics.
- Problem links sorted by severity.
- All links table with status filters.
- Skipped links table.
- Search by link URL or source page.
- Sorting by count, status, URL, and source count.
- Link detail with sources, snippets, redirect chain, final URL, content type, and error details.
- Copy URL and copy source page actions when browser clipboard support is available.
The repository includes a small static site for manual checks:
cd examples/test-site
python3 -m http.server 8000
Then scan it:
araneae scan http://127.0.0.1:8000/index.html \
--out report.json \
--max-pages 20 \
--concurrency 4
See examples/test-site/README.md for details.
Run tests:
go test ./...
Run the crawler race test:
go test -race ./internal/crawl