Welcome to WebTrek, your new search engine built with Python and powered by ElasticSearch. WebTrek combines cutting-edge technology with intuitive design to revolutionize the way you explore the vast realm of the internet.
At its core, WebTrek utilizes ElasticSearch, a powerful and scalable search engine, to index and search through a vast array of web content rapidly and efficiently. Whether you're looking for articles, blog posts, product reviews, or any other type of online content, WebTrek provides lightning-fast search results tailored to your needs.
But WebTrek doesn't stop there. In addition to its robust search capabilities, it also features a sophisticated web scraper component. This allows WebTrek to autonomously traverse the web, collecting fresh data from various sources to ensure that its index remains up-to-date and comprehensive.
- ElasticSearch Installation: Ensure ElasticSearch is installed and running on your machine. Follow the installation instructions on the ElasticSearch download page.
- Generate API Key: Create an API key for performing operations on ElasticSearch from Python. You can follow the guide on generating an API key, or use Kibana's UI instead: install Kibana from the Kibana download page and refer to the API keys documentation for assistance.
- Update API Key: After generating the API key, copy the api_key value and update the "api_key" variable in elastic_logics.py as follows:
from elasticsearch import Elasticsearch

client = Elasticsearch(
    "https://localhost:9200",
    api_key="YOUR_API_KEY",
    verify_certs=False
)
Be sure to replace YOUR_API_KEY with the actual API key obtained in the previous step. Note that verify_certs=False disables TLS certificate verification and should only be used for local development.
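As a safer alternative to hardcoding the key in elastic_logics.py, you could read it from an environment variable. A minimal sketch (the variable name ELASTIC_API_KEY and the helper below are assumptions for illustration, not part of the current codebase):

```python
import os

def get_api_key(default: str = "YOUR_API_KEY") -> str:
    # Hypothetical helper: prefer the ELASTIC_API_KEY environment variable
    # so the real key never has to be committed to source control.
    return os.environ.get("ELASTIC_API_KEY", default)
```

The returned value can then be passed as the api_key argument when constructing the Elasticsearch client.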
- Install Dependencies: Install all dependencies using the command below:
pip install -r requirements.txt
By following these setup instructions, you'll be ready to use ElasticSearch with WebTrek.
- Download GoogleNews Vectors: Use the link to download the GoogleNews-vectors-negative300.bin file, which is required for the ranking part, and place it under the main directory.
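For context on how such vectors are typically used in ranking, a query's word vectors are averaged and documents are scored by cosine similarity. A toy sketch with made-up 3-dimensional vectors (the real project loads the 300-dimensional GoogleNews file; the exact loading and ranking code here is an assumption, not taken from the repo):

```python
import math

# Toy 3-d vectors standing in for the 300-d GoogleNews embeddings.
VECTORS = {
    "python":  [0.9, 0.1, 0.0],
    "snake":   [0.8, 0.2, 0.1],
    "flask":   [0.7, 0.0, 0.3],
    "cooking": [0.0, 0.9, 0.4],
}

def avg_vector(words):
    # Average the vectors of all known words; None if nothing is known.
    vecs = [VECTORS[w] for w in words if w in VECTORS]
    if not vecs:
        return None
    return [sum(component) / len(vecs) for component in zip(*vecs)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def rank(query_words, docs):
    # docs: {doc_id: list of words}; returns ids sorted by similarity to the query.
    q = avg_vector(query_words)
    scored = [(cosine(q, avg_vector(words)), doc_id) for doc_id, words in docs.items()]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]
```

With real embeddings the idea is identical, only the vectors come from the downloaded .bin file instead of a hardcoded dict.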
I have used two methods to collect data:
- Using the Scraper
- Using Common Crawl
Click the "Go to Scraper" button to access the Scraper component of WebTrek. Here you can enter the webpage or domain you wish to include in your search engine's index, and specify the path where temporary HTML files will be stored on your local machine.
Once you've provided this information, click the "Scrape" button to start the scraping process. WebTrek will gather the data from the specified webpage or domain and store it in your ElasticSearch node. Once completed, the scraped data is integrated into your search engine, ready to use.
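Under the hood, scraping largely amounts to fetching a page, stripping its markup, and indexing the remaining text. A minimal sketch of the text-extraction step using only the standard library (illustrative only; the actual scraper in this repo may use a different parser):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only outside skipped tags, dropping pure whitespace.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)
```

The extracted text is what would then be sent to ElasticSearch for indexing.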
- Go to the Common Crawl page using the link and choose a crawl of your choice.
- Download warc.paths.gz.
- To unzip the file, execute the following command:
gunzip warc.paths.gz
This command will extract the warc.paths file. You can open it using any text editor. The file contains paths to the WARC files. For example:
crawl-data/CC-MAIN-2024-10/segments/1707947473347.0/warc/CC-MAIN-20240220211055-20240221001055-00000.warc.gz
crawl-data/CC-MAIN-2024-10/segments/1707947473347.0/warc/CC-MAIN-20240220211055-20240221001055-00001.warc.gz
- Download these files using the command below:
wget https://data.commoncrawl.org/PATH_FROM_WARC.PATHS_FILE
Example:
wget https://data.commoncrawl.org/crawl-data/CC-MAIN-2024-10/segments/1707947473347.0/warc/CC-MAIN-20240220211055-20240221001055-00000.warc.gz
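The two steps above (read a path from warc.paths, prefix it with the Common Crawl data URL) can be sketched in Python, which is handy when you want to download many files programmatically:

```python
CC_BASE = "https://data.commoncrawl.org/"

def warc_urls(paths_text: str):
    # Turn each non-empty line of an unzipped warc.paths file into a full URL.
    return [CC_BASE + line.strip() for line in paths_text.splitlines() if line.strip()]
```

Each resulting URL can then be fetched with wget (as above) or from Python with urllib.request.urlretrieve.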
Before running common-crawl-prepare.py, ensure you update the following variables:
- Path to WARC File: update 'Path_To_File/FileName1.warc' to the actual path of the WARC file downloaded in the previous steps.
- Temporary HTML Files Storage Folder: update 'Path_To_Folder' to the folder where you want to temporarily store the HTML files.
warc_file_paths = ['Path_To_File/FileName1.warc']
folder_path = 'Path_To_Folder'
After updating these variables, proceed with executing common-crawl-prepare.py for seamless processing of Common Crawl data.
python3 common-crawl-prepare.py
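For a sense of what common-crawl-prepare.py has to deal with, a WARC file is a sequence of records, each starting with a "WARC/1.0" line followed by headers and a payload. A simplified header scanner for a toy, uncompressed blob (illustrative only; real code should use a proper WARC library such as warcio, since splitting on the version line can misfire if it appears inside a payload):

```python
def iter_warc_headers(data: bytes):
    # Yield a dict of headers for each record in a toy, uncompressed WARC blob.
    for chunk in data.split(b"WARC/1.0\r\n")[1:]:
        head = chunk.split(b"\r\n\r\n", 1)[0]
        headers = {}
        for line in head.decode("utf-8", "replace").splitlines():
            key, _, value = line.partition(": ")
            headers[key] = value
        yield headers
```

In practice the interesting records are those with WARC-Type: response, whose payloads contain the crawled HTML.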
Once the setup is done, start the backend server using the commands below:
cd flask-server
python3 server.py
Next, start the client; by default, the React app will run on localhost port 3000.
cd client/webtrek
npm install
npm start
I have already attached screenshots of the Main page and Scraper page above.