This project builds a web research assistant that answers questions by searching the internet, evaluating sources, and generating summaries with proper citations. The agent aims to provide accurate information while being transparent about its sources and avoiding hallucinations.
Frontend: Streamlit for the web interface
Search: SERP API for web search results
AI: Google's Gemini 1.5 Flash for content summarization
Web Scraping: BeautifulSoup for content extraction
Language: Python 3.8+
Install dependencies:
pip install -r requirements.txt
or
pip install streamlit requests beautifulsoup4 google-generativeai
Get API keys:
- Sign up for SERP API at serpapi.com
- Get a Gemini API key from Google AI Studio
Set up secrets: Create .streamlit/secrets.toml with:
SERP_API_KEY="Your api key here"
GEMINI_API_KEY="Your api key here"
Run the app:
streamlit run web_research_agent.py
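As a minimal sketch of how the two keys from `.streamlit/secrets.toml` might be validated before the app uses them: the helper name `load_api_keys` and the placeholder check are assumptions for illustration, not part of the project's code.

```python
from typing import Dict, Mapping

# Key names taken from the secrets.toml example above.
REQUIRED_KEYS = ("SERP_API_KEY", "GEMINI_API_KEY")
# The placeholder text used in the secrets.toml example.
PLACEHOLDER = "Your api key here"


def load_api_keys(secrets: Mapping[str, str]) -> Dict[str, str]:
    """Return the required API keys, failing early if one is unset.

    Hypothetical helper: `secrets` can be any mapping, for example
    Streamlit's `st.secrets` object or a plain dict in tests.
    """
    keys = {}
    for name in REQUIRED_KEYS:
        value = str(secrets[name]).strip() if name in secrets else ""
        if not value or value == PLACEHOLDER:
            raise ValueError(f"{name} is missing or still set to the placeholder")
        keys[name] = value
    return keys
```

Inside the Streamlit app this would presumably be called as `load_api_keys(st.secrets)`, since `st.secrets` exposes the values from `.streamlit/secrets.toml` as a mapping.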
The system follows a pipeline approach:
- Query Processing: The user's question is sent to SERP API as a search query.
- Content Extraction: The top 5 result URLs are fetched and their content is extracted.
- Source Evaluation: The quality of each source is scored.
- Answer Generation: The LLM generates an answer from the extracted content and the query.
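The four stages above could be wired together roughly as follows. This is a sketch under stated assumptions: the function name `run_pipeline` and the four stage callables are hypothetical stand-ins, and the real app may structure this differently.

```python
from typing import Callable, Dict, List

TOP_URLS = 5  # the pipeline analyzes the top 5 search results

Page = Dict[str, str]  # {"url": ..., "content": ...}


def run_pipeline(
    question: str,
    search: Callable[[str], List[str]],
    extract: Callable[[str], str],
    evaluate: Callable[[List[Page]], List[Page]],
    generate: Callable[[str, List[Page]], str],
) -> str:
    """Run query processing, extraction, evaluation, and generation in order.

    Each stage is passed in as a callable (an assumption made here so the
    stages can be swapped or faked in tests).
    """
    # 1. Query processing: send the question to the search stage.
    urls = search(question)[:TOP_URLS]

    # 2. Content extraction: fetch each URL, skipping pages with no content.
    pages: List[Page] = []
    for url in urls:
        text = extract(url)
        if text:
            pages.append({"url": url, "content": text})

    # 3. Source evaluation: score and filter the extracted pages.
    sources = evaluate(pages)

    # 4. Answer generation: hand the query and sources to the LLM stage.
    return generate(question, sources)
```

Passing the stages as arguments keeps the network calls (search, scraping, Gemini) out of the control flow, so the ordering logic can be exercised with simple fake functions.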
In more detail:
- SERP API finds relevant pages using Google's search engine.
- The main content of each page is extracted and limited to 5000 characters.
- Each source gets a score based on domain reputation and content length.
- Duplicate sources from the same domain are removed.
- The 3 highest-quality sources are picked for answer generation.
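The scoring, de-duplication, and top-3 selection described above might look something like the sketch below. The reputation weights in `TRUSTED_SUFFIXES` and the way reputation and length are combined are assumptions for illustration; the project's actual scoring formula is not specified here.

```python
from urllib.parse import urlparse

MAX_CHARS = 5000  # per-page content limit
TOP_N = 3         # number of sources kept for answer generation

# Hypothetical reputation weights; the real values are not documented.
TRUSTED_SUFFIXES = {".edu": 3.0, ".gov": 3.0, ".org": 2.0}


def domain_of(url):
    """Return the host of a URL, lowercased and without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def score_source(url, content):
    """Combine domain reputation with a 0..1 content-length score."""
    domain = domain_of(url)
    reputation = 1.0  # default weight for unlisted domains
    for suffix, weight in TRUSTED_SUFFIXES.items():
        if domain.endswith(suffix):
            reputation = weight
            break
    length_score = min(len(content), MAX_CHARS) / MAX_CHARS
    return reputation + length_score


def select_sources(pages, top_n=TOP_N):
    """Keep the best page per domain, then return the top_n by score.

    `pages` is a list of (url, content) tuples.
    """
    best = {}
    for url, content in pages:
        content = content[:MAX_CHARS]  # enforce the per-page limit
        score = score_source(url, content)
        domain = domain_of(url)
        # De-duplicate: only the highest-scoring page per domain survives.
        if domain not in best or score > best[domain]["score"]:
            best[domain] = {"url": url, "content": content, "score": score}
    ranked = sorted(best.values(), key=lambda s: s["score"], reverse=True)
    return ranked[:top_n]
```

De-duplicating before ranking means a single domain cannot occupy more than one of the three slots, which matches the intent of removing duplicate sources from the same domain.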
The main prompt to Gemini follows this structure:
You are a research assistant. Based on the following sources, provide a concise answer to the query. If the information is not available in the sources, say "I don't know". Always cite your sources using numbers like [1], [2], etc.
Query: {user_question}
Sources:
Answer:
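One way to fill in this prompt is sketched below. The `{sources}` placeholder and the `[n] url` layout for each source are assumptions (the structure above does not show how sources are formatted), and `build_prompt` and `format_sources` are hypothetical helper names.

```python
PROMPT_TEMPLATE = (
    "You are a research assistant. Based on the following sources, provide a "
    "concise answer to the query. If the information is not available in the "
    'sources, say "I don\'t know". Always cite your sources using numbers like '
    "[1], [2], etc.\n\n"
    "Query: {user_question}\n\n"
    "Sources:\n{sources}\n\n"  # {sources} is an assumed placeholder name
    "Answer:"
)


def format_sources(sources):
    """Number each source so the model can cite it as [1], [2], ...

    `sources` is a list of dicts with "url" and "content" keys
    (an assumed shape for this sketch).
    """
    blocks = []
    for index, source in enumerate(sources, start=1):
        blocks.append(f"[{index}] {source['url']}\n{source['content']}")
    return "\n\n".join(blocks)


def build_prompt(user_question, sources):
    """Fill the template with the user's question and the numbered sources."""
    return PROMPT_TEMPLATE.format(
        user_question=user_question,
        sources=format_sources(sources),
    )
```

With the `google-generativeai` package, the resulting string could then be sent with something like `genai.GenerativeModel("gemini-1.5-flash").generate_content(prompt)` after calling `genai.configure(api_key=...)`; the exact call site in the app is not shown here.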
Possible improvements:
- Implement proper chunking and vector storage for better context management
- Add user feedback to improve source quality assessment over time