A Streamlit-based platform for collecting, exploring, and analyzing company data from Yellow Pages, enriched with website content extraction and an AI-powered chat assistant.
- Scrape company listings from Yellow Pages Indonesia
- Extract detailed company info and website content using NeuScraper
- Save and manage datasets
- Explore company data interactively
- Chat with a Google Gemini-powered AI assistant about any company
git clone https://github.com/dejanazul/caprae_capital_interview_pre-work
cd caprae_capital_interview_pre-work

It is recommended to use a virtual environment:
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Create a .env file in the root directory and add your Google Gemini API key:
GEMINI_API_KEY=your_gemini_api_key_here

Download the NeuScraper checkpoint:
git lfs install
git clone https://huggingface.co/Vincero/neural_scrapper_fixed

1️⃣ Open the deployment directory
cd NeuScraper/app

2️⃣ Fill in the neural scraper checkpoint path in app
args.model_path = "path/to/your/model/fixed_training_state_checkpoint.tar"

3️⃣ Deploy NeuScraper
uvicorn app:app --reload --host 0.0.0.0 --port 1688

In a new terminal, from the project root:
streamlit run app.py
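
If the app fails to start, it can help to confirm that the two prerequisites from the steps above are in place. The following is a minimal sketch and not part of the repository. It reads GEMINI_API_KEY from a .env file using only the standard library (the project itself may load the key differently, for example with python-dotenv) and checks that something is listening on port 1688, the port NeuScraper was started on. The names `read_env_file`, `port_is_open`, and `preflight` are hypothetical.

```python
import socket
from pathlib import Path


def read_env_file(path=".env"):
    """Parse simple KEY=VALUE lines from a .env file into a dict."""
    values = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def port_is_open(host="127.0.0.1", port=1688, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def preflight(env_path=".env", host="127.0.0.1", port=1688):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    key = read_env_file(env_path).get("GEMINI_API_KEY", "")
    if not key or key == "your_gemini_api_key_here":
        problems.append("GEMINI_API_KEY is missing or still the placeholder")
    if not port_is_open(host, port):
        problems.append(f"nothing is listening on {host}:{port}")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("WARNING:", problem)
```

Saved under any file name in the project root and run with `python`, the script prints one warning per problem and prints nothing when both checks pass. It only tests that the port accepts connections, not that the NeuScraper model loaded correctly.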