This project is a search engine that covers web crawling, indexing, ranking, and query processing.
- The web crawler is responsible for collecting documents from the web.
- It starts from a list of URLs (the seed set) and downloads the documents those URLs identify.
- It then extracts hyperlinks from the downloaded documents and adds them to the list of URLs to be downloaded.
- Key features:
  - Avoids revisiting the same page.
  - Crawls only documents of specific types (HTML).
  - Maintains state so that an interrupted crawl can be resumed.
  - Honors robots.txt exclusions.
  - Provides a multithreaded implementation.
  - Crawls up to a specified number of pages.
  - Uses appropriate data structures to determine the order in which pages are visited.
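The crawl loop described above can be sketched roughly as follows. This is a minimal single-threaded illustration, not the project's actual implementation: the names `crawl`, `LinkExtractor`, and the injected `fetch` callable are assumptions made for the example, and a FIFO queue (breadth-first order) is only one possible choice of visit order.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute, fragment-free URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.append(absolute)


def crawl(seeds, fetch, max_pages):
    """Breadth-first crawl starting from the seed set.

    `fetch` is a hypothetical callable that returns the HTML of a URL,
    or None when the page is unavailable or is not HTML.
    """
    frontier = deque(seeds)   # FIFO queue: pages are visited in discovery order
    seen = set(seeds)         # every URL ever queued, so no page is revisited
    pages = {}                # url -> html for successfully downloaded pages

    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html

        parser = LinkExtractor(url)
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages
```

Passing `fetch` in as a parameter keeps the sketch testable without network access. A real `fetch` would also be the natural place to check robots.txt (for example with the standard library's `urllib.robotparser`) and to reject non-HTML content types, and persisting `frontier` and `seen` to disk would be one way to make a crawl resumable.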
- The indexer builds an index of the contents of downloaded HTML documents.
- Features:
  - Persists the index in secondary storage.
  - Supports fast retrieval for word-based queries.
  - Can be updated incrementally with newly crawled documents.
  - Stores the information needed for ranking and searching results.
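One common way to meet these requirements is an inverted index that maps each word to the documents and positions where it occurs. The sketch below is an assumption about how such an index could look, not the project's actual storage layer: the `InvertedIndex` class, its method names, and the use of a JSON file for persistence are all illustrative.

```python
import json
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class InvertedIndex:
    """Maps word -> {doc_id: [positions]}; doc_id is a string such as a URL."""

    def __init__(self):
        self.postings = defaultdict(dict)

    def add_document(self, doc_id, text):
        # Incremental update: only the postings of this document's words change.
        # Re-adding an existing doc_id would append duplicate positions.
        for position, word in enumerate(tokenize(text)):
            self.postings[word].setdefault(doc_id, []).append(position)

    def lookup(self, word):
        # A single dictionary access per query word.
        return self.postings.get(word.lower(), {})

    def save(self, path):
        with open(path, "w", encoding="utf-8") as f:
            json.dump(self.postings, f)

    @classmethod
    def load(cls, path):
        index = cls()
        with open(path, encoding="utf-8") as f:
            for word, docs in json.load(f).items():
                index.postings[word] = docs
        return index
```

Storing positions rather than bare document ids costs extra space, but it keeps the term counts needed for ranking and the word order needed for phrase matching. A database would likely replace the JSON file at any realistic scale.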
- The query processor handles search queries.
- Performs necessary preprocessing and searches the index for relevant documents.
- Retrieves documents containing words with shared stems from the search query.
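Stem-based matching can be illustrated with a small sketch. The suffix-stripping `naive_stem` below is a deliberately crude stand-in for a real stemmer such as Porter's, and the function names are hypothetical. Requiring every query stem to be present in a document is also an assumption; the project may combine query words differently.

```python
import re
from collections import defaultdict

SUFFIXES = ("ing", "ed", "es", "s")  # tried in this order


def naive_stem(word):
    """Strip one common suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Tokenize, lowercase, and stem; used for documents and queries alike."""
    return [naive_stem(token) for token in re.findall(r"[a-z0-9]+", text.lower())]


def build_stem_index(docs):
    """docs: doc_id -> text. Returns stem -> set of doc_ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for stem in preprocess(text):
            index[stem].add(doc_id)
    return index


def search(index, query):
    """Return the documents that contain every stem of the query."""
    stems = preprocess(query)
    if not stems:
        return set()
    result = set(index.get(stems[0], set()))
    for stem in stems[1:]:
        result &= index.get(stem, set())
    return result
```

Because documents and queries go through the same `preprocess` step, a query for "crawls" also matches pages that contain "crawling" or "crawled".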
- Supports phrase searching when the query is enclosed in quotation marks.
- Results must contain the words of the phrase in the same order.
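Order-preserving phrase matching is usually done with a positional index: a document matches when the phrase's words occur at consecutive positions. The following is a minimal sketch under that assumption; `build_positions` and `phrase_search` are illustrative names, and stemming is left out to keep the example short.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_positions(docs):
    """docs: doc_id -> text. Returns word -> {doc_id: [positions]}."""
    index = {}
    for doc_id, text in docs.items():
        for position, word in enumerate(tokenize(text)):
            index.setdefault(word, {}).setdefault(doc_id, []).append(position)
    return index


def phrase_search(index, phrase):
    """Return doc_ids in which the phrase's words appear consecutively, in order."""
    words = tokenize(phrase)  # surrounding quotation marks are dropped here
    if not words:
        return []
    results = []
    for doc_id, starts in index.get(words[0], {}).items():
        for start in starts:
            # Word number `offset` of the phrase must sit at position start + offset.
            if all(
                start + offset in index.get(word, {}).get(doc_id, ())
                for offset, word in enumerate(words[1:], start=1)
            ):
                results.append(doc_id)
                break
    return results
```

For example, with the documents "the quick brown fox" and "brown quick fox", the query `"quick brown"` matches only the first, even though both contain the two words.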
- The ranker orders documents by relevance and popularity.
- Calculates relevance from how the query words appear in each document, then aggregates the per-word scores into a single document score.
- Measures popularity using algorithms like PageRank.
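The two signals can be sketched as follows. TF-IDF is used here as one possible relevance measure, and PageRank is computed by power iteration. The damping factor of 0.85, the smoothed IDF, and the weighted sum in the hypothetical `rank_results` function are assumptions chosen for illustration, not the project's documented formula.

```python
import math


def pagerank(links, damping=0.85, iterations=50):
    """links: page -> list of pages it links to. Returns page -> rank (sums to 1)."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread over all pages.
        dangling = sum(rank[page] for page in pages if not links.get(page))
        new_rank = {page: (1 - damping) / n + damping * dangling / n for page in pages}
        for page, targets in links.items():
            for target in targets:
                new_rank[target] += damping * rank[page] / len(targets)
        rank = new_rank
    return rank


def tf_idf_scores(query_words, docs):
    """docs: doc_id -> list of tokens. Sums TF-IDF over all query words."""
    scores = {doc_id: 0.0 for doc_id in docs}
    for word in query_words:
        doc_freq = sum(1 for tokens in docs.values() if word in tokens)
        if doc_freq == 0:
            continue
        idf = math.log(1 + len(docs) / doc_freq)  # smoothed so it stays positive
        for doc_id, tokens in docs.items():
            if tokens:
                scores[doc_id] += tokens.count(word) / len(tokens) * idf
    return scores


def rank_results(query_words, docs, links, popularity_weight=0.5):
    """Order the documents that match the query by relevance plus weighted popularity."""
    relevance = tf_idf_scores(query_words, docs)
    popularity = pagerank(links)
    combined = {
        doc_id: relevance[doc_id] + popularity_weight * popularity.get(doc_id, 0.0)
        for doc_id in docs
        if relevance[doc_id] > 0
    }
    return sorted(combined, key=combined.get, reverse=True)
```

Since PageRank depends only on the link graph, it can be computed once after crawling and stored, leaving only the relevance part to be evaluated per query.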
- Implements a user-friendly web interface.
- Receives user queries and displays search results with snippets.
- Displays each result's page title, URL, and a relevant paragraph with the query words in bold.
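Snippet generation can be sketched on the server side as below. The function names are hypothetical, and choosing the paragraph with the most query-word hits is an assumed heuristic. The sketch emits `<b>` tags and escapes all other text, so the front end could insert the result as HTML.

```python
import html
import re


def highlight(text, words):
    """Wrap every token found in `words` in <b>...</b>, escaping all text."""
    parts = re.split(r"(\w+)", text)  # alternates separators and word tokens
    return "".join(
        f"<b>{html.escape(part)}</b>" if part.lower() in words else html.escape(part)
        for part in parts
    )


def make_snippet(paragraphs, query_words):
    """Pick the paragraph with the most query-word hits and bold those words."""
    words = {word.lower() for word in query_words}
    if not words or not paragraphs:
        return ""

    def hits(paragraph):
        return sum(1 for token in re.findall(r"\w+", paragraph.lower()) if token in words)

    best = max(paragraphs, key=hits)
    return highlight(best, words)
```

Highlighting token by token, rather than running a regex over already-escaped text, avoids accidentally bolding part of an HTML entity such as `&amp;`.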
- Clone the repository.
- Install required dependencies.
- Run the main application file.
- Access the React web interface.