A simple Python utility to crawl and extract API documentation from websites. Creates a single readable text file containing all documentation content. This text file can then easily be used as context for AI tools to assist a developer with creating an app integration or other tasks.
- Install dependencies:

      pip install requests beautifulsoup4

- Clone repository:

      git clone [repository-url]
      cd api-crawler

- Edit crawler.py to set your target documentation URL:

      base_url = "https://docs.example.com/api/"  # Replace with API docs URL
      links_list

- Run crawler:

      python api-crawler/crawler.py

- Find extracted documentation in documentation.txt
- Crawls all pages under specified documentation URL
- Extracts readable text content
- Preserves page structure with clear section boundaries
- Includes source URLs for reference
- Rate-limited to be server-friendly
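The scoping and rate-limiting behavior listed above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code in crawler.py: the helper names `in_scope`, `resolve_links`, and `polite_pause`, and the one-second default delay, are assumptions made for the example.

```python
import time
from urllib.parse import urldefrag, urljoin


def in_scope(url: str, base_url: str) -> bool:
    """Return True when url sits under the documentation root (hypothetical helper)."""
    return url.startswith(base_url)


def resolve_links(page_url: str, hrefs: list[str], base_url: str) -> list[str]:
    """Turn raw href values into absolute, de-duplicated, in-scope URLs."""
    found: list[str] = []
    for href in hrefs:
        # Resolve relative links against the current page, then drop any #fragment
        # so that "page" and "page#section" count as the same URL.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        if in_scope(absolute, base_url) and absolute not in found:
            found.append(absolute)
    return found


def polite_pause(delay_seconds: float = 1.0) -> None:
    """Sleep between requests; the default delay is an assumed value."""
    time.sleep(delay_seconds)
```

A crawler built this way would call `resolve_links()` on the `href` values of each fetched page, queue the results, and call `polite_pause()` before each new request.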
The generated documentation.txt will contain sections formatted as:
================================================================================
PAGE: [Page Title]
URL: [Source URL]
================================================================================
[Page Content]
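Because every page shares this header layout, the file can be split back into individual pages, for example to pass only the relevant ones to an AI tool. The sketch below assumes the separator is exactly 80 `=` characters and that the header lines appear in the order shown; `format_section` and `parse_sections` are hypothetical names, not functions from the repository.

```python
SEPARATOR = "=" * 80  # assumed width of the separator line


def format_section(title: str, url: str, content: str) -> str:
    """Render one page in the layout shown above."""
    return f"{SEPARATOR}\nPAGE: {title}\nURL: {url}\n{SEPARATOR}\n{content}\n"


def parse_sections(text: str) -> list[dict]:
    """Split documentation text into dicts with title, url, and content keys."""
    lines = text.splitlines()
    pages = []
    body = None  # list collecting the lines of the page currently being read
    i = 0
    while i < len(lines):
        header = lines[i:i + 4]
        is_header = (
            len(header) == 4
            and header[0] == SEPARATOR
            and header[1].startswith("PAGE: ")
            and header[2].startswith("URL: ")
            and header[3] == SEPARATOR
        )
        if is_header:
            body = []
            pages.append({
                "title": header[1][len("PAGE: "):],
                "url": header[2][len("URL: "):],
                "body": body,
            })
            i += 4
            continue
        if body is not None:
            body.append(lines[i])
        i += 1
    return [
        {"title": p["title"], "url": p["url"], "content": "\n".join(p["body"]).strip()}
        for p in pages
    ]
```

Reading the file with `open("documentation.txt", encoding="utf-8").read()` and passing the result to `parse_sections()` would yield one dictionary per crawled page.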
Modify the get_page_text() function to adjust how content is extracted for a specific documentation structure.
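As one illustration of such an adjustment, the sketch below strips navigation and script elements and reads text only from the main content container. The actual signature of get_page_text() in crawler.py may differ; the `html` and `content_selector` parameters, the `NOISE_TAGS` list, and the `main` default are assumptions for this example.

```python
from bs4 import BeautifulSoup

# Elements that rarely carry documentation content (assumed list; adjust per site).
NOISE_TAGS = ["script", "style", "nav", "header", "footer", "aside"]


def get_page_text(html: str, content_selector: str = "main") -> str:
    """Extract readable text from one page, ignoring navigation and scripts."""
    soup = BeautifulSoup(html, "html.parser")

    # Remove noisy elements before extracting any text.
    for tag in soup(NOISE_TAGS):
        tag.decompose()

    # Prefer the main content container; fall back to <body>, then the whole page.
    root = soup.select_one(content_selector) or soup.body or soup
    return root.get_text(separator="\n", strip=True)
```

For a site that wraps its content in something like `<div class="docs-content">`, passing `content_selector="div.docs-content"` would restrict extraction to that container.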
Feel free to submit issues and pull requests for improvements.