This Streamlit web application predicts whether two questions from Quora are duplicates, aiming to reduce redundancy and enhance user experience on Q&A platforms. By leveraging Natural Language Processing (NLP) techniques and machine learning models, the app provides real-time predictions on question similarity.
- User-Friendly Interface: Input two questions and receive instant feedback on their similarity.
- Preprocessing Pipeline: Includes text cleaning, stopword removal, and vectorization using Bag of Words (BoW).
- Machine Learning Model: Utilizes a trained model (e.g., Logistic Regression) to predict duplicate questions.
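The cleaning and stopword-removal steps listed above can be sketched roughly as follows. This is a minimal illustration, not the code in `helper.py`: the `preprocess` function, the regular expression, and the tiny `STOPWORDS` set are assumptions made for the example (the project ships its own list in `stopwords.pkl`).

```python
import re

# Hypothetical, deliberately tiny stopword set used only for this example;
# the real project loads its list from stopwords.pkl.
STOPWORDS = {"a", "an", "the", "is", "are", "what", "how", "to", "do", "i"}


def preprocess(text, stopwords=STOPWORDS):
    """Lowercase the text, drop punctuation, and remove stopwords."""
    text = text.lower()
    # Replace anything that is not a lowercase letter, digit, or whitespace.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Splitting on whitespace also collapses repeated spaces.
    tokens = [tok for tok in text.split() if tok not in stopwords]
    return " ".join(tokens)


cleaned = preprocess("What is the BEST way to learn Python?")
print(cleaned)
```

The cleaned string is what would then be passed to the vectorizer.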
├── app.py # Main Streamlit application
├── helper.py # Helper functions for preprocessing
├── model.pkl # Trained ML model
├── cv.pkl # CountVectorizer object
├── stopwords.pkl # List of stopwords
├── requirements.txt # Python dependencies
└── README.md # Project documentation
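As a rough sketch of how the three pickled artifacts in this layout might be loaded at startup, the helper below reads them into a dictionary. The `load_artifacts` name and the dictionary keys are hypothetical and not taken from `app.py`; only the file names come from the tree above.

```python
import pickle
from pathlib import Path

# File names come from the project layout; the dictionary keys are
# illustrative labels chosen for this sketch.
ARTIFACT_FILES = {
    "model": "model.pkl",
    "vectorizer": "cv.pkl",
    "stopwords": "stopwords.pkl",
}


def load_artifacts(base_dir="."):
    """Load every pickled artifact found under base_dir into a dict."""
    base = Path(base_dir)
    artifacts = {}
    for name, filename in ARTIFACT_FILES.items():
        with open(base / filename, "rb") as fh:
            artifacts[name] = pickle.load(fh)
    return artifacts
```

Only unpickle files you trust, since loading a pickle can execute arbitrary code.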
PowerShell on Windows is recommended for running Streamlit apps.
- Python 3.7 or higher
- pip (Python package installer)
- Clone the repository:
git clone https://github.com/vennelavarshini18/Quora-Duplicate-Question-Pairs-Detector.git
- Navigate to the project directory:
cd Quora-Duplicate-Question-Pairs-Detector
- Create a virtual environment (optional but recommended):
python -m venv venv
- Activate the virtual environment:
- Windows:
venv\Scripts\activate
- Mac/Linux:
source venv/bin/activate
- Install the required dependencies:
pip install -r requirements.txt
After installation, follow these steps to run the project:
- Start the application:
streamlit run app.py
- Open your browser and go to:
http://localhost:8501
(Streamlit's default; use the address displayed in your terminal if it differs)
- Vectorization: Bag of Words (BoW) using CountVectorizer
- Model: Logistic Regression trained on preprocessed question pairs
- Evaluation Metric: Accuracy score on a validation set
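The combination listed above (Bag of Words, Logistic Regression, validation accuracy) can be illustrated with a small scikit-learn sketch. The toy question pairs are invented for the example, and stacking the two BoW vectors side by side is an assumption: the README does not say how the two questions are combined into one feature vector.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Invented toy data: (question1, question2, is_duplicate).
pairs = [
    ("how do i learn python", "what is the best way to learn python", 1),
    ("how can i lose weight", "what is the best way to lose weight", 1),
    ("how do i cook rice", "what is the way to cook rice", 1),
    ("how do i start a blog", "how can i start blogging", 1),
    ("how do i learn python", "how can i lose weight", 0),
    ("what is the capital of france", "how do i cook rice", 0),
    ("how do i start a blog", "what is machine learning", 0),
    ("who wrote hamlet", "how do i fix a flat tire", 0),
]
q1 = [p[0] for p in pairs]
q2 = [p[1] for p in pairs]
y = [p[2] for p in pairs]

# Hold out a validation set before fitting anything.
q1_tr, q1_val, q2_tr, q2_val, y_tr, y_val = train_test_split(
    q1, q2, y, test_size=0.25, random_state=0, stratify=y
)

# Fit the BoW vocabulary on training questions only.
cv = CountVectorizer()
cv.fit(q1_tr + q2_tr)


def featurize(first, second):
    """Assumed pairing scheme: concatenate the two BoW vectors."""
    return hstack([cv.transform(first), cv.transform(second)]).tocsr()


model = LogisticRegression(max_iter=1000)
model.fit(featurize(q1_tr, q2_tr), y_tr)

val_accuracy = accuracy_score(y_val, model.predict(featurize(q1_val, q2_val)))
print(f"Validation accuracy: {val_accuracy:.2f}")
```

With only eight pairs the accuracy figure is meaningless; the point is the shape of the pipeline, which would be trained on the full preprocessed dataset in practice.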
Contributions are welcome! Please fork the repository and submit a pull request for any enhancements or bug fixes.
