This project implements a simple sentiment analysis service that predicts whether a piece of text has positive or negative sentiment. The model is trained on the IMDb movie reviews dataset and exposed through a REST API built with FastAPI.
The API allows developers to send text and receive a predicted sentiment along with a confidence score.
Sentiment-api/
│
├── app/
│   ├── __init__.py
│   ├── main.py          # FastAPI application
│   ├── model.py         # Model loading and prediction logic
│   └── schemas.py       # Pydantic request/response schemas
│
├── model/
│   └── sentiment_model.pkl   # Saved model files
│
├── train.py             # Training script
├── requirements.txt
└── README.md
Python version used: Python 3.9.6
Required libraries are listed in requirements.txt.
git clone https://github.com/nimsala1234/sentiment-analysis-api Sentiment-api
cd Sentiment-api
pip install -r requirements.txt
Run the training script: python train.py
This will:
• Download the IMDb dataset
• Train the TF-IDF + Logistic Regression model
• Evaluate performance
• Save the trained model to:
model/sentiment_model.pkl
Run the FastAPI server with: python -m uvicorn app.main:app --reload
The server will start at: http://127.0.0.1:8000
You can also access interactive API documentation at: http://127.0.0.1:8000/docs
Endpoint: GET /health
Response: { "status": "ok" }
Endpoint: POST /predict
Example using curl:
curl -X POST "http://127.0.0.1:8000/predict" \
  -H "Content-Type: application/json" \
  -d '{"text": "I absolutely loved this movie!"}'
Example response:
{
  "text": "I absolutely loved this movie!",
  "sentiment": "positive",
  "confidence": 0.93
}
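The same call can be made from Python. The following is a minimal sketch using only the standard library; the helper names `build_predict_request` and `predict` are hypothetical and not part of this project, and the base URL assumes the server is running locally as described above.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # local development address of the server


def build_predict_request(text: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST /predict request with a JSON body."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(text: str, base_url: str = BASE_URL, timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON response (the server must be running)."""
    request = build_predict_request(text, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `predict("I absolutely loved this movie!")` should return a dictionary with the `text`, `sentiment`, and `confidence` keys shown in the example response.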
Endpoint: POST /predict/batch
Example request:
{
  "texts": [
    "I loved this movie",
    "This film was terrible"
  ]
}
Example response:
{
  "results": [
    {
      "text": "I loved this movie",
      "sentiment": "positive",
      "confidence": 0.92
    },
    {
      "text": "This film was terrible",
      "sentiment": "negative",
      "confidence": 0.95
    }
  ]
}
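A client will usually want typed records rather than raw dictionaries. The sketch below parses a batch response with standard-library dataclasses; `Prediction` and `parse_batch_response` are hypothetical names for illustration and are not the Pydantic schemas in `app/schemas.py`.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    """One entry of the "results" list in a batch response."""
    text: str
    sentiment: str
    confidence: float


def parse_batch_response(raw: str) -> list[Prediction]:
    """Convert the JSON body of POST /predict/batch into Prediction records."""
    payload = json.loads(raw)
    return [
        Prediction(item["text"], item["sentiment"], float(item["confidence"]))
        for item in payload["results"]
    ]


# The example response from this README, used as sample input.
raw_response = """
{"results": [
  {"text": "I loved this movie", "sentiment": "positive", "confidence": 0.92},
  {"text": "This film was terrible", "sentiment": "negative", "confidence": 0.95}
]}
"""
predictions = parse_batch_response(raw_response)
negatives = [p.text for p in predictions if p.sentiment == "negative"]
```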
The sentiment classifier was trained using the IMDb movie reviews dataset, which contains 50,000 labeled reviews. Text data was preprocessed by removing HTML tags, URLs, and punctuation and by converting the text to lowercase.
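The preprocessing steps can be sketched with the standard library as follows. This is an illustration, not the code in `train.py`: the function name `clean_text`, the exact regular expressions, and the final whitespace normalization are assumptions.

```python
import re
import string

_TAG_RE = re.compile(r"<[^>]+>")                 # HTML tags such as <br />
_URL_RE = re.compile(r"https?://\S+|www\.\S+")   # http(s) and www-style URLs
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text: str) -> str:
    """Strip HTML tags, URLs, and ASCII punctuation, then lowercase the text."""
    text = _TAG_RE.sub(" ", text)        # remove tags first so their contents are not kept
    text = _URL_RE.sub(" ", text)        # remove URLs before punctuation is stripped
    text = text.lower()
    text = text.translate(_PUNCT_TABLE)  # drop ASCII punctuation characters
    return " ".join(text.split())        # assumption: also collapse repeated whitespace
```

URLs are removed before punctuation because stripping punctuation first would break the `://` and `.` characters that the URL pattern relies on.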
A TF-IDF vectorizer was used to convert text into numerical feature vectors that weight each term by its importance across documents. The classifier is Logistic Regression, which performs well on high-dimensional sparse text data and produces probability outputs that are easy to interpret.
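To illustrate how a logistic regression model turns feature weights into a probability, here is a small pure-Python sketch. The weights and feature values are made up for the example, `predict_sentiment` is a hypothetical helper, and reporting the probability of the predicted class as the confidence is an assumption about how the API derives its score.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function mapping a real-valued score to the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_sentiment(
    features: dict[str, float],
    weights: dict[str, float],
    bias: float = 0.0,
) -> tuple[str, float]:
    """Return (label, confidence) for a sparse feature vector.

    features maps terms to TF-IDF values; weights maps terms to learned coefficients.
    """
    score = bias + sum(weights.get(term, 0.0) * value for term, value in features.items())
    p_positive = sigmoid(score)
    if p_positive >= 0.5:
        return "positive", p_positive
    return "negative", 1.0 - p_positive


# Made-up coefficients: positive weights push toward "positive", negative toward "negative".
example_weights = {"loved": 2.0, "terrible": -2.5}
label, confidence = predict_sentiment({"loved": 1.0}, example_weights)
```

Each coefficient can be read directly: a term with a large positive weight raises the predicted probability of positive sentiment, which is what makes the model's outputs interpretable.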
To ensure robustness, 5-fold cross-validation was performed on the training data before final evaluation on the test set.
Cross-validation scores: [0.865 0.8694 0.8636 0.8742 0.8688]
Mean CV Accuracy: 0.8682
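The mean can be reproduced from the five fold scores. The snippet below is a small check using the standard library; formatting to four decimal places avoids floating-point noise in the printed value.

```python
from statistics import fmean

# Fold accuracies reported above.
cv_scores = [0.865, 0.8694, 0.8636, 0.8742, 0.8688]

mean_cv = fmean(cv_scores)
print(f"Mean CV accuracy: {mean_cv:.4f}")  # prints "Mean CV accuracy: 0.8682"
```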
With more time, I would experiment with transformer-based models such as BERT and perform hyperparameter tuning to further improve accuracy.
Accuracy: 0.8878
Precision: 0.8878
Recall: 0.8878
F1 Score: 0.8878
These results indicate that the model classifies about 89% of reviews correctly, which is consistent with the mean cross-validation accuracy of about 87%.
The IMDb dataset contains binary sentiment labels (positive and negative). Therefore, the model predicts these two classes. A neutral sentiment class could be added by training on a dataset that includes neutral labels.