An intelligent web scraping agent that uses multiple scraping methods and LangChain for content analysis.
- Multiple scraping methods:
  - Normal: Traditional web scraping using Playwright & BeautifulSoup
  - Firecrawl: Advanced scraping with JavaScript support
  - Crawl4AI: AI-powered content extraction
- Interactive UI using Gradio
- Content summarization using GPT-4
- Automatic fallback mechanism
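
The automatic fallback listed above could work roughly as follows. This is a minimal sketch, not the project's actual code: `scrape_with_fallback`, `failing_scraper`, and `working_scraper` are hypothetical names, and it assumes each scraping method can be wrapped as a function that takes a URL and either returns text or raises.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical scraper signature: takes a URL, returns page text or raises.
Scraper = Callable[[str], str]


def scrape_with_fallback(
    url: str, methods: Sequence[Tuple[str, Scraper]]
) -> Tuple[str, str]:
    """Try each (name, scraper) pair in order; return (name, content) of the first success."""
    errors: List[str] = []
    for name, scraper in methods:
        try:
            content = scraper(url)
        except Exception as exc:  # a real implementation might catch narrower errors
            errors.append(f"{name}: {exc}")
            continue
        if content and content.strip():
            return name, content
        # Treat empty output as a failure so the next method gets a chance.
        errors.append(f"{name}: empty content")
    raise RuntimeError("All scraping methods failed: " + "; ".join(errors))


# Stand-in scrapers for illustration only (not the real Normal/Firecrawl/Crawl4AI code).
def failing_scraper(url: str) -> str:
    raise ConnectionError("blocked")


def working_scraper(url: str) -> str:
    return f"<html>content of {url}</html>"


used, text = scrape_with_fallback(
    "https://example.com",
    [("normal", failing_scraper), ("firecrawl", working_scraper)],
)
print(f"Scraped with: {used}")
```

Passing the methods as an ordered list keeps the fallback order explicit and easy to change.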
- Clone the repository:
git clone https://github.com/yourusername/AI_ScrapeAgent.git
cd AI_ScrapeAgent
- Create a virtual environment:
python -m venv scrapvenv
source scrapvenv/bin/activate # On Windows: scrapvenv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
- Create a .env file with your API keys:
OPENAI_API_KEY=your_openai_key
FIRECRAWL_API_KEY=your_firecrawl_key
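
To catch a missing key before starting the agent, a small check like the one below may help. It is a sketch under stated assumptions: `parse_env` and `missing_keys` are hypothetical helpers, the parser handles only simple `KEY=value` lines, and the project itself may load `.env` differently (for example through a dotenv library).

```python
from typing import Dict, List, Mapping

# The two keys named in the .env example above.
REQUIRED_KEYS = ("OPENAI_API_KEY", "FIRECRAWL_API_KEY")


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=value lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(env: Mapping[str, str]) -> List[str]:
    """Return the required keys that are absent or empty in env."""
    return [key for key in REQUIRED_KEYS if not env.get(key)]


# Illustrative input; in practice the text would be read from the .env file.
sample = "OPENAI_API_KEY=abc\n# comment\nFIRECRAWL_API_KEY=\n"
env = parse_env(sample)
print(missing_keys(env))
```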
- Run the Gradio UI:
python gradio_ui.py
- Or use the command line interface:
python main.py

- agent.py: LangChain agent implementation
- scraper.py: Web scraping methods implementation
- tools.py: LangChain tools definition
- main.py: Command-line interface
- gradio_ui.py: Gradio web interface

- Python 3.8+
- See requirements.txt for the full list of dependencies