This project, developed as an assignment for the Information Retrieval course, demonstrates the implementation of search engines using two distinct techniques: TF-IDF-based vectorization and embedding-based vectorization. Our goal is to showcase efficient and accurate document retrieval in response to user queries, highlighting the differences and advantages of each approach.
- Dual search engine implementation: TF-IDF-based and Word Embedding-based
- Query suggestion functionality
- Document clustering and topic detection
- Similar document retrieval
- Efficient offline processing and fast online querying
- Python: Primary programming language
- NumPy: For numerical computations
- Chroma DB: Vector database for efficient similarity search
- Gensim: For Word2Vec model implementation
- Scikit-learn: For TF-IDF vectorization and other machine learning utilities
- FastAPI: For creating the web API
- NLTK: For text processing and tokenization
- Antique: A non-factoid question answering dataset Link
- Wikipedia: A subset of Wikipedia articles Link
| Process | Description |
|---|---|
| Offline Process | 1. Load and preprocess documents<br>2. Create vocabulary<br>3. Compute TF-IDF matrix<br>4. Store TF-IDF matrix and vocabulary |
| Online Process | 1. Receive user query<br>2. Preprocess query<br>3. Convert query to TF-IDF vector<br>4. Compute similarity with document vectors<br>5. Rank and return top results |
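A minimal sketch of this pipeline with scikit-learn is shown below; the sample documents, query, and top-k value are illustrative placeholders, not the project's actual data or configuration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# --- Offline: build the vocabulary and the TF-IDF matrix ---
documents = [
    "how do cells divide during mitosis",
    "the history of the roman empire",
    "mitosis and meiosis are types of cell division",
]
vectorizer = TfidfVectorizer(stop_words="english")
doc_matrix = vectorizer.fit_transform(documents)   # shape: (n_docs, vocab_size)

# --- Online: vectorize the query and rank documents by cosine similarity ---
query = "cell division"
query_vector = vectorizer.transform([query])
scores = cosine_similarity(query_vector, doc_matrix).ravel()
for idx in scores.argsort()[::-1][:2]:             # top-2 results
    print(f"{scores[idx]:.3f}  {documents[idx]}")
```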
| Process | Description |
|---|---|
| Offline Process | 1. Load and preprocess documents<br>2. Train or load pre-trained Word2Vec model<br>3. Compute document embeddings<br>4. Store embeddings in Chroma DB |
| Online Process | 1. Receive user query<br>2. Preprocess query<br>3. Compute query embedding<br>4. Perform similarity search in Chroma DB<br>5. Rank and return top results |
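A minimal sketch of this pipeline using Gensim and Chroma DB follows; the sample documents, Word2Vec parameters, and collection name are illustrative assumptions (simple whitespace tokenization stands in for the project's NLTK preprocessing):

```python
import numpy as np
import chromadb
from gensim.models import Word2Vec

documents = [
    "how do cells divide during mitosis",
    "the history of the roman empire",
    "mitosis and meiosis are types of cell division",
]
tokenized = [doc.lower().split() for doc in documents]

# --- Offline: train Word2Vec, average word vectors per document, store in Chroma DB ---
model = Word2Vec(sentences=tokenized, vector_size=100, window=5, min_count=1)

def embed(tokens, model):
    """Average the Word2Vec vectors of the in-vocabulary tokens."""
    vectors = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(model.vector_size)

client = chromadb.Client()                            # in-memory client
collection = client.create_collection(
    name="docs",                                      # illustrative collection name
    metadata={"hnsw:space": "cosine"},                # use cosine distance for search
)
collection.add(
    ids=[str(i) for i in range(len(documents))],
    embeddings=[embed(toks, model).tolist() for toks in tokenized],
    documents=documents,
)

# --- Online: embed the query and retrieve the nearest documents ---
query_embedding = embed("cell division".lower().split(), model)
results = collection.query(query_embeddings=[query_embedding.tolist()], n_results=2)
print(results["documents"][0])
```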
The TF-IDF (Term Frequency-Inverse Document Frequency) approach involves:
- Creating a vocabulary from all documents
- Computing TF-IDF scores for each term in each document
- Representing documents and queries as TF-IDF vectors
- Using cosine similarity to find relevant documents
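In the standard formulation (the exact variant may differ; scikit-learn, for instance, adds smoothing and L2 normalization), the weight of term $t$ in document $d$ and the query–document similarity are:

$$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d)\cdot\log\frac{N}{\mathrm{df}(t)}, \qquad \mathrm{sim}(\vec{q}, \vec{d}) = \frac{\vec{q}\cdot\vec{d}}{\lVert\vec{q}\rVert\,\lVert\vec{d}\rVert}$$

where $N$ is the number of documents and $\mathrm{df}(t)$ is the number of documents containing $t$.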
The Word Embedding approach involves:
- Using pre-trained or custom-trained Word2Vec models
- Representing words as dense vectors
- Computing document embeddings by averaging word vectors
- Using vector similarity in embedding space to find relevant documents
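With the averaging scheme described above, a document embedding is the mean of its word vectors:

$$\vec{d} = \frac{1}{|d|}\sum_{w \in d}\vec{w}$$

where $\vec{w}$ is the Word2Vec vector of word $w$ and $|d|$ is the number of in-vocabulary tokens in the document. Queries are embedded the same way, so retrieval reduces to nearest-neighbor search in the embedding space.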
*(Screenshots: query suggestion and query result views)*
*(Screenshots: topic detection and similar documents views)*
| Metric | TF-IDF Based | Word Embedding Based |
|---|---|---|
| MAP | 54% | 70% |
| MRR | 63% | 80% |
The Word Embedding-based approach outperforms the TF-IDF-based approach on both Mean Average Precision (MAP) and Mean Reciprocal Rank (MRR).
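For reference, the standard definitions of these metrics over a query set $Q$ are:

$$\mathrm{MRR} = \frac{1}{|Q|}\sum_{i=1}^{|Q|}\frac{1}{\mathrm{rank}_i}, \qquad \mathrm{MAP} = \frac{1}{|Q|}\sum_{i=1}^{|Q|}\mathrm{AP}(q_i)$$

where $\mathrm{rank}_i$ is the rank of the first relevant document retrieved for query $q_i$, and $\mathrm{AP}(q_i)$ averages precision over the ranks of that query's relevant documents.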
Our system provides query suggestions based on:
- Processing the user's input query
- Generating word vectors using Word2Vec
- Finding similar terms using cosine similarity
- Ranking and presenting the top suggestions
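A minimal sketch of this idea using Gensim's `most_similar` is shown below; the model path and the `suggest` helper are illustrative assumptions, not the project's actual API:

```python
from gensim.models import Word2Vec

# Assumes a Word2Vec model trained on the corpus, as in the retrieval pipeline above.
# The file name is a placeholder.
model = Word2Vec.load("word2vec.model")

def suggest(query, model, top_n=5):
    """Return suggestion terms ranked by cosine similarity to the query's words."""
    tokens = [t for t in query.lower().split() if t in model.wv]
    if not tokens:
        return []
    # most_similar averages the given word vectors and returns the nearest vocabulary terms
    return [term for term, _ in model.wv.most_similar(positive=tokens, topn=top_n)]

print(suggest("cell division", model))
```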
We implement document clustering to group similar documents and identify topics:
- Using K-Means clustering algorithm
- Applying Latent Dirichlet Allocation (LDA) for topic modeling
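A minimal sketch of both steps with scikit-learn is given below; the sample documents and the cluster/topic counts are illustrative, not the project's configuration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation

documents = [
    "how do cells divide during mitosis",
    "mitosis and meiosis are types of cell division",
    "the history of the roman empire",
    "roman emperors and the fall of rome",
]

# K-Means on TF-IDF vectors groups similar documents into clusters
X = TfidfVectorizer(stop_words="english").fit_transform(documents)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster labels:", kmeans.labels_)

# LDA on raw term counts models each topic as a distribution over words
counter = CountVectorizer(stop_words="english").fit(documents)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counter.transform(documents))
terms = counter.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    print(f"topic {topic_idx}:", [terms[i] for i in topic.argsort()[::-1][:3]])
```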
[To be added in a future update]
For complete documentation of the project in Arabic, please refer to the following link:
- Implement more advanced embedding models (e.g., BERT, GPT)
- Enhance query suggestion with user interaction data
- Improve clustering algorithms for better topic detection
- Optimize performance for larger datasets
- Alaa Aldeen Zamel
- Anas Rish
- Anas Durra
- Mohammed Hadi Barakat
- Mohammed Fares Dabbas
This project is licensed under the MIT License - see the LICENSE file for details.