Welcome to WebTrek, your new search engine built with Python and powered by ElasticSearch. WebTrek combines cutting-edge technology with intuitive design to revolutionize the way you explore the vast realm of the internet.
At its core, WebTrek utilizes ElasticSearch, a powerful and scalable search engine, to index and search through a vast array of web content rapidly and efficiently. Whether you're looking for articles, blog posts, product reviews, or any other type of online content, WebTrek provides lightning-fast search results tailored to your needs.
But WebTrek doesn't stop there. In addition to its robust search capabilities, it also features a sophisticated web scraper component. This allows WebTrek to autonomously traverse the web, collecting fresh data from various sources to ensure that its index remains up-to-date and comprehensive.
- ElasticSearch Installation: Ensure ElasticSearch is installed and running on your machine. Follow the installation instructions on the ElasticSearch download page.
- Generate API Key: Create an API key for performing operations on ElasticSearch from Python. You can follow the guide on generating an API key, or use Kibana's UI instead: install Kibana from the Kibana download page and refer to the API keys documentation for assistance.
- Update API Key: After generating the API key, copy the api_key value and update the "api_key" variable in elastic_logics.py as follows:
from elasticsearch import Elasticsearch

client = Elasticsearch(
    "https://localhost:9200",
    api_key="YOUR_API_KEY",
    verify_certs=False
)
Be sure to replace YOUR_API_KEY with the actual API key obtained in the previous step. Note that verify_certs=False disables TLS certificate verification and should only be used for local development.
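As a safer alternative to hardcoding the key in elastic_logics.py, you could read it from an environment variable. A minimal sketch (the variable name ELASTIC_API_KEY and the helper below are assumptions for illustration, not part of the current codebase):

```python
import os

def get_api_key(default: str = "YOUR_API_KEY") -> str:
    # Hypothetical helper: prefer the ELASTIC_API_KEY environment variable
    # so the real key never has to be committed to source control.
    return os.environ.get("ELASTIC_API_KEY", default)
```

The returned value can then be passed as the api_key argument when constructing the Elasticsearch client.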
- Install Dependencies: Install all dependencies using the command below:
pip install -r requirements.txt
By following these setup instructions, you'll be ready to use ElasticSearch with WebTrek.
- Download GoogleNews Vectors: Use the link to download the GoogleNews-vectors-negative300.bin file, which is required for the ranking part, and place it under the main directory.
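For context on how such vectors are typically used in ranking, a query's word vectors are averaged and documents are scored by cosine similarity. A toy sketch with made-up 3-dimensional vectors (the real project loads the 300-dimensional GoogleNews file; the exact loading and ranking code here is an assumption, not taken from the repo):

```python
import math

# Toy 3-d vectors standing in for the 300-d GoogleNews embeddings.
VECTORS = {
    "python":  [0.9, 0.1, 0.0],
    "snake":   [0.8, 0.2, 0.1],
    "flask":   [0.7, 0.0, 0.3],
    "cooking": [0.0, 0.9, 0.4],
}

def avg_vector(words):
    # Average the vectors of all known words; None if nothing is known.
    vecs = [VECTORS[w] for w in words if w in VECTORS]
    if not vecs:
        return None
    return [sum(component) / len(vecs) for component in zip(*vecs)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def rank(query_words, docs):
    # docs: {doc_id: list of words}; returns ids sorted by similarity to the query.
    q = avg_vector(query_words)
    scored = [(cosine(q, avg_vector(words)), doc_id) for doc_id, words in docs.items()]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]
```

With real embeddings the idea is identical, only the vectors come from the downloaded .bin file instead of a hardcoded dict.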
I have used two methods to collect data:
- Using the Scraper
- Using Common Crawl
Click the "Go to Scraper" button to access the Scraper component of WebTrek. Here you can enter the webpage or domain you wish to include in your search engine's index, and specify the path where temporary HTML files will be stored on your local machine.
Once you've provided this information, click the "Scrape" button to start the scraping process. WebTrek will gather the data from the specified webpage or domain and store it in your ElasticSearch node. Once completed, the scraped data is integrated into your search engine, ready to use.
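Under the hood, scraping largely amounts to fetching a page, stripping its markup, and indexing the remaining text. A minimal sketch of the text-extraction step using only the standard library (illustrative only; the actual scraper in this repo may use a different parser):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only outside skipped tags, dropping pure whitespace.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)
```

The extracted text is what would then be sent to ElasticSearch for indexing.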
- Go to the Common Crawl page using the link and choose a crawl of your choice.
- Download warc.paths.gz.
- To unzip the file, execute the following command:
gunzip warc.paths.gz
This command will extract the warc.paths file. You can open it using any text editor. The file contains paths to the WARC files. For example:
crawl-data/CC-MAIN-2024-10/segments/1707947473347.0/warc/CC-MAIN-20240220211055-20240221001055-00000.warc.gz
crawl-data/CC-MAIN-2024-10/segments/1707947473347.0/warc/CC-MAIN-20240220211055-20240221001055-00001.warc.gz
- Download these files using the command below:
wget https://data.commoncrawl.org/PATH_FROM_WARC.PATHS_FILE
Example:
wget https://data.commoncrawl.org/crawl-data/CC-MAIN-2024-10/segments/1707947473347.0/warc/CC-MAIN-20240220211055-20240221001055-00000.warc.gz
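The two steps above (read a path from warc.paths, prefix it with the Common Crawl data URL) can be sketched in Python, which is handy when you want to download many files programmatically:

```python
CC_BASE = "https://data.commoncrawl.org/"

def warc_urls(paths_text: str):
    # Turn each non-empty line of an unzipped warc.paths file into a full URL.
    return [CC_BASE + line.strip() for line in paths_text.splitlines() if line.strip()]
```

Each resulting URL can then be fetched with wget (as above) or from Python with urllib.request.urlretrieve.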
Before running common-crawl-prepare.py, ensure you update the following variables:
- Path to WARC File: update 'Path_To_File/FileName1.warc' to the actual path of the WARC file downloaded in the previous steps.
- Temporary HTML Files Storage Folder: update 'Path_To_Folder' to the folder where you want to temporarily store the HTML files.
warc_file_paths = ['Path_To_File/FileName1.warc']
folder_path = 'Path_To_Folder'
After updating these variables, proceed with executing common-crawl-prepare.py for seamless processing of Common Crawl data.
python3 common-crawl-prepare.py
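For a sense of what common-crawl-prepare.py has to deal with, a WARC file is a sequence of records, each starting with a "WARC/1.0" line followed by headers and a payload. A simplified header scanner for a toy, uncompressed blob (illustrative only; real code should use a proper WARC library such as warcio, since splitting on the version line can misfire if it appears inside a payload):

```python
def iter_warc_headers(data: bytes):
    # Yield a dict of headers for each record in a toy, uncompressed WARC blob.
    for chunk in data.split(b"WARC/1.0\r\n")[1:]:
        head = chunk.split(b"\r\n\r\n", 1)[0]
        headers = {}
        for line in head.decode("utf-8", "replace").splitlines():
            key, _, value = line.partition(": ")
            headers[key] = value
        yield headers
```

In practice the interesting records are those with WARC-Type: response, whose payloads contain the crawled HTML.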
Once the setup is done, start the backend server using the commands below:
cd flask-server
python3 server.py
Next, start the client; by default, the React app will run on localhost port 3000.
cd client/webtrek
npm install
npm start
I have already attached screenshots of the Main page and Scraper page above.