
Crawl Tester

A powerful web crawler tool that extracts website content and converts it to Markdown format. Perfect for analyzing how search engines, AI agents, and bots perceive your website structure.

An example crawl of rdjarbeng.com produces output like the following (only the first few lines shown):

[ ![Site Logo](https://rdjarbeng.com/assets/images/logo-small.png) ](https://rdjarbeng.com/)
[Posts](https://rdjarbeng.com/posts) [Personal](https://rdjarbeng.com/personal) [ Gallery ](https://rdjarbeng.com/gallery) [About](https://rdjarbeng.com/about)
# Richard Djarbeng's Blog
Occasion Spotlight
# Ghana, an Introduction- All That Glitters is Still Gold
Celebrating the 69th Independence Day of Ghana. Explore the rich culture, history, and golden opportunities of the first sub-Saharan nation to gain independence. 
[Read the Post](https://rdjarbeng.com/personal/a-country-called-ghana/)
![Happy 69th Independence Day Ghana](https://rdjarbeng.com/assets/images/posts/happy_indepence_ghana_post_cover.png)
## Latest Posts
[Tags](https://rdjarbeng.com/tags/) | [Categories](https://rdjarbeng.com/categories/)
[ ![Flat vector illustration of AI parsing news reports for flood warnings](https://rdjarbeng.com/assets/images/posts/covers/google_flood_cover.jpg) How Google Turned 5 Million News Articles into a Flash Flood Warning System  13 March 2026 · 7 min read  ](https://rdjarbeng.com/how-google-turned-5-million-news-articles-into-a-flash-flood-warning-system/)
[ ![AI Bill vs Headcount Costs illustration](https://rdjarbeng.com/assets/images/posts/covers/ai_bill_headcount_cover.jpg) Why Your Monthly AI Bill Might Soon Rival Your Headcount Costs From A Personal Experience  6 March 2026 · 2 min read  ](https://rdjarbeng.com/why-your-monthly-ai-bill-might-soon-rival-your-headcount-costs-from-a-personal-experience/)
[ ![Illustration of a structural gate blocking a claw with text The OpenClaw Ban Wave](https://rdjarbeng.com/assets/images/google_bans_antigravity_v2.png) Google Bans Hundreds of Paying Antigravity Users for Using OpenClaw - Then Says "We Heard You"  2 March 2026 · 3 min read  ](https://rdjarbeng.com/google-bans-hundreds-of-paying-antigravity-users-for-using-openclaw-then-says-we-heard-you/)

📋 Overview

Crawl Tester is a Python-based web crawler that helps you understand how your website appears to search engines and AI agents. It crawls a given URL and outputs the extracted content as clean, well-formatted Markdown, allowing you to verify that your site structure, metadata, and content are properly accessible to bots and agents. It works on localhost sites as well as public ones.

Key Features

  • 🕷️ Async Web Crawling: Fast, non-blocking crawling using async/await
  • 📝 Markdown Output: Clean, readable Markdown conversion of web content
  • ⚙️ Configurable: Easily customize crawl behavior and output
  • 🤖 Bot-Friendly Analysis: Test how bots and search engines see your content
  • 🎯 Flexible URLs: Crawl any website and verify its structure
  • ⏱️ Timestamped Output: Automatic file naming with timestamps for organization

🚀 Quick Start

Prerequisites

  • Python 3.8+
  • pip (Python package manager)

Installation

  1. Clone the repository:
git clone https://github.com/yourusername/crawl-tester.git
cd crawl-tester
  2. Install dependencies:
pip install -r requirements.txt
  3. Install Playwright browsers (one-time setup):
playwright install

Basic Usage

Run the crawler on a website:

python main.py --output my_site

This will crawl http://rdjarbeng.com by default and save output to my_site_YYYYMMDD_HHMMSS.md.
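The timestamped naming described above can be sketched with a small stdlib helper; `timestamped_filename` is a hypothetical name for illustration, not necessarily what main.py calls it:

```python
from datetime import datetime

def timestamped_filename(prefix: str) -> str:
    # Append a YYYYMMDD_HHMMSS timestamp so repeated crawls
    # never overwrite earlier output files.
    return f"{prefix}_{datetime.now().strftime('%Y%m%d_%H%M%S')}.md"
```

For example, `timestamped_filename("my_site")` yields something like `my_site_20260313_142501.md`.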

Custom URL

To crawl a different URL, modify the url parameter in main.py:

result = await crawler.arun(url="https://your-website.com", crawler_config=config)
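Since the script already accepts an `--output` flag, the URL could be exposed the same way instead of editing the source. A minimal argparse sketch, assuming a hypothetical `--url` flag (not part of the current script; the `--output` default shown is also illustrative):

```python
import argparse

def parse_args(argv=None):
    # Hypothetical extension: expose the crawl URL as a CLI flag
    # alongside the existing --output prefix.
    parser = argparse.ArgumentParser(
        description="Crawl a site and save the output as Markdown"
    )
    parser.add_argument("--url", default="http://rdjarbeng.com",
                        help="URL to crawl")
    parser.add_argument("--output", default="crawl",
                        help="Output filename prefix")
    return parser.parse_args(argv)
```

With this in place, `python main.py --url https://your-website.com` would override the default without touching the code.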

📦 Dependencies

  • crawl4ai: Advanced web crawling library
  • playwright: Browser automation for rendering JavaScript-heavy sites
  • fastapi: (Optional) For building APIs around the crawler
  • uvicorn: (Optional) ASGI server for FastAPI

📚 Project Structure

crawl-tester/
├── main.py              # Main crawler script
├── requirements.txt     # Python dependencies
├── README.md           # This file
├── LICENSE             # MIT License
└── .gitignore         # Git ignore patterns

🔧 Configuration

CrawlerRunConfig Options

The crawler can be configured using CrawlerRunConfig(). Some useful options:

  • verbose: Enable detailed logging
  • wait_until: Wait for specific page load conditions
  • timeout: Set crawl timeout
  • Custom headers and user agents

Example:

config = CrawlerRunConfig(
    verbose=True,
    timeout=30
)
result = await crawler.arun(url="https://example.com", crawler_config=config)

💡 Use Cases

  • SEO Analysis: Verify how search engines see your content
  • Bot Testing: Check how AI agents perceive your site structure
  • Content Verification: Ensure metadata and structured data are properly exposed
  • Accessibility Audit: Verify semantic HTML is crawlable
  • Web Scraping: Extract and archive website content as Markdown
  • Monitoring: Track how bots index your site over time

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🐛 Troubleshooting

Playwright not installed

If you get errors related to Playwright browsers, run:

playwright install

Connection timeouts

Increase the timeout in CrawlerRunConfig:

config = CrawlerRunConfig(timeout=60)  # 60 seconds

JavaScript rendering

The crawler automatically handles JavaScript-rendered content through Playwright.

📮 Support

For issues, questions, or suggestions, please open an Issue.

🙏 Acknowledgments


Made with ❤️ by developers, for developers
