This is the extended version of the original courseAnalyzer project.
It adds a full AI analysis layer on top of the existing Hadoop/PySpark/MapReduce pipeline:
| New feature | Script |
|---|---|
| SBERT sentiment (positive / neutral / negative) | `job_d_sbert_analysis.py` |
| Trend Score (0–10) | inside `job_d_sbert_analysis.py` |
| Reliability Score (0–10) | inside `job_d_sbert_analysis.py` |
| Enriched MongoDB loader | `load_analyzed_to_mongodb.py` |
| Pretty result viewer | `view_results.py` |
```
Raw CSVs (HDFS)
      │
      ▼
Job A – PySpark: clean courses → hdfs://.../processed/courses/
Job B – PySpark: clean reviews → hdfs://.../processed/reviews/
Job C – PySpark: clean labeled data → hdfs://.../processed/labeled/
      │
      ▼
MapReduce Join (mrjob) → hdfs://.../final/courses_with_reviews/
      │
      ▼
Job D – SBERT AI Analysis → hdfs://.../final/courses_analyzed/
  • SBERT sentence embeddings
  • Zero-shot sentiment classification (cosine similarity to anchor phrases)
  • Trend Score formula (subscribers, review density, sentiment, rating)
  • Reliability Score formula (volume, rating std-dev, sentiment entropy, SBERT confidence)
      │
      ▼
MongoDB course_analysis
  ├── courses_analyzed (full docs with per-review sentiments)
  └── course_scores (flat summary for dashboards)
```
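The zero-shot sentiment step in Job D classifies a review by cosine similarity between its embedding and one anchor phrase per label. The sketch below shows that logic; the anchor phrases and the toy bag-of-words embedder are illustrative stand-ins for the real SBERT (`all-MiniLM-L6-v2`) embeddings, not the script's actual values.

```python
# Sketch of zero-shot sentiment via cosine similarity to anchor phrases.
# A toy bag-of-words embedder stands in for SBERT so the example is
# self-contained; the real pipeline embeds with all-MiniLM-L6-v2.
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy embedding: word-count vector (stand-in for SBERT).
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

ANCHORS = {  # hypothetical anchor phrases, one per label
    "positive": "great excellent helpful loved clear",
    "neutral": "okay average fine decent nothing",
    "negative": "terrible boring waste confusing bad",
}

def classify(review: str) -> tuple[str, float]:
    # Label = anchor with the highest cosine similarity;
    # the similarity doubles as a confidence score.
    emb = embed(review)
    scores = {lbl: cosine(emb, embed(phr)) for lbl, phr in ANCHORS.items()}
    label = max(scores, key=scores.get)
    return label, scores[label]
```

With real SBERT the structure is identical: encode the review and the three anchors, take the arg-max of the cosine similarities.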
| Tool | Version | Notes |
|---|---|---|
| Docker Desktop | ≥ 4.x | Runs Hadoop + Spark + MongoDB |
| Docker Compose | ≥ 2.x | Bundled with Docker Desktop |
| Python | 3.9 – 3.11 | 3.12 has minor torch issues |
| pip | latest | `python -m pip install --upgrade pip` |
| Hadoop CLI (optional) | 3.2.x | Only needed if running `hadoop fs` locally; otherwise use `docker exec` |
A GPU is optional: SBERT runs on CPU. A GPU shortens analysis from minutes to seconds on large datasets, but is not required.
```bash
pip install -r requirements.txt
```

On a CPU-only machine, install the CPU-only PyTorch wheel first to save disk space:

```bash
pip install torch --index-url https://download.pytorch.org/whl/cpu
pip install -r requirements.txt
```

Start the cluster:

```bash
docker-compose up -d
```

Wait ~15 seconds for HDFS to format. Verify:
- HDFS Namenode UI : http://localhost:9870
- Spark Master UI : http://localhost:8080
- MongoDB : localhost:27017
```bash
docker exec -it namenode hadoop fs -mkdir -p /data/raw/courses/
docker exec -it namenode hadoop fs -mkdir -p /data/raw/reviews/
docker exec -it namenode hadoop fs -mkdir -p /data/raw/labeled/
docker exec -it namenode hadoop fs -mkdir -p /data/raw/generic/

# Courses
docker exec -it namenode hadoop fs -put /data/3.1-data-sheet-udemy-courses-business-courses.csv /data/raw/courses/
docker exec -it namenode hadoop fs -put /data/3.1-data-sheet-udemy-courses-design-courses.csv /data/raw/courses/
docker exec -it namenode hadoop fs -put /data/3.1-data-sheet-udemy-courses-music-courses.csv /data/raw/courses/
docker exec -it namenode hadoop fs -put /data/3.1-data-sheet-udemy-courses-web-development.csv /data/raw/courses/
docker exec -it namenode hadoop fs -put /data/Coursera.csv /data/raw/courses/
docker exec -it namenode hadoop fs -put /data/coursera_course_dataset_v2_no_null.csv /data/raw/courses/
docker exec -it namenode hadoop fs -put /data/coursera_course_dataset_v3.csv /data/raw/courses/
docker exec -it namenode hadoop fs -put /data/edx.csv /data/raw/courses/

# Reviews
docker exec -it namenode hadoop fs -put /data/coursera_reviews.csv /data/raw/reviews/
docker exec -it namenode hadoop fs -put /data/udemy_reviews.csv /data/raw/reviews/
docker exec -it namenode hadoop fs -put /data/ratemyprofessors.csv /data/raw/reviews/

# Labeled
docker exec -it namenode hadoop fs -put /data/coursera_reviews_label_3.csv /data/raw/labeled/
docker exec -it namenode hadoop fs -put /data/reviews_by_course.csv /data/raw/labeled/
docker exec -it namenode hadoop fs -put /data/sentiment_sentences_large.csv /data/raw/labeled/

# Generic
docker exec -it namenode hadoop fs -put "/data/Reviews (2).csv" /data/raw/generic/
```

The `docker-compose.yml` mounts `./data` into the namenode at `/data`, so the above paths refer to files inside your local `data/` folder.
```bash
python job_a_clean_courses.py
python job_b_clean_reviews.py
python job_c_clean_labeled.py
```

Run the MapReduce join:

```bash
python mapreduce_join.py -r hadoop \
    hdfs://namenode:9000/data/processed/courses/ \
    hdfs://namenode:9000/data/processed/reviews/ \
    --output-dir hdfs://namenode:9000/data/final/courses_with_reviews/
```

Then run the SBERT analysis:

```bash
python job_d_sbert_analysis.py
```

This will:

- Download `all-MiniLM-L6-v2` from HuggingFace (~80 MB, one-time)
- Pull MapReduce output from HDFS
- Run SBERT embeddings + sentiment classification
- Compute `trend_score` and `reliability_score`
- Push enriched JSONL back to HDFS
```bash
python load_to_mongodb.py
python load_analyzed_to_mongodb.py
```

This creates two collections:

- `courses_analyzed` – full documents including per-review sentiment labels
- `course_scores` – flat score summary for fast querying
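As a sketch of how a dashboard might query `course_scores` with pymongo: the field names come from the sample document shown later in this README, and the URI assumes the docker-compose MongoDB on `localhost:27017`. The threshold and limit are arbitrary examples.

```python
# Hypothetical dashboard query against course_scores (pymongo).
# Filter and projection are plain dicts in MongoDB query syntax.
QUERY = {"platform": "coursera", "reliability_score": {"$gte": 7.0}}
PROJECTION = {"_id": 0, "course_title": 1, "trend_score": 1, "reliability_score": 1}

def top_courses(limit: int = 10):
    # Lazy import so the module loads even without pymongo installed.
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    coll = client["course_analysis"]["course_scores"]
    # Highest trend first, reliable Coursera courses only.
    return list(coll.find(QUERY, PROJECTION).sort("trend_score", -1).limit(limit))
```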
```bash
# Top 10 by trend score (default)
python view_results.py

# Top 20 by reliability score, Coursera only
python view_results.py --top 20 --sort reliability_score --platform coursera

# Top 15 Udemy courses by trend
python view_results.py --top 15 --platform udemy
```

The Trend Score (0–10) measures how popular and momentum-driven a course is:
```
trend_score = 10 × (
      0.35 × log10(1 + subscribers) / log10(1_000_001)   # scale of audience
    + 0.20 × reviews_analyzed / max(subscribers, 1)      # engagement ratio
    + 0.25 × positive_review_ratio                       # sentiment signal
    + 0.20 × avg_rating / 5.0                            # quality signal
)
```
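The formula transcribes directly into Python; this is a sketch whose input names mirror the pseudo-formula above, assuming the values have already been aggregated per course.

```python
import math

def trend_score(subscribers: int, reviews_analyzed: int,
                positive_review_ratio: float, avg_rating: float) -> float:
    """Direct transcription of the Trend Score formula (0-10 scale)."""
    return 10 * (
        0.35 * math.log10(1 + subscribers) / math.log10(1_000_001)  # audience
        + 0.20 * reviews_analyzed / max(subscribers, 1)             # engagement
        + 0.25 * positive_review_ratio                              # sentiment
        + 0.20 * avg_rating / 5.0                                   # quality
    )
```

Plugging in the sample-document values (500 000 subscribers, 500 reviews analyzed, 0.84 positive ratio, 4.72 average rating) yields a score a little above 7.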
The Reliability Score (0–10) measures how trustworthy and consistent the review set is:

```
reliability_score = 10 × (
      0.30 × log(1 + review_count) / log(501)            # volume → credibility
    + 0.25 × (1 − std_dev(ratings) / 2.5)                # rating consistency
    + 0.25 × (1 − entropy(sentiments) / log2(3))         # sentiment agreement
    + 0.20 × avg_sbert_confidence                        # model confidence
)
```
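The same formula in runnable Python. Here `entropy` is taken as Shannon entropy in bits over the sentiment-label counts and the standard deviation as the population `pstdev`; these are assumptions, since the script's exact choices aren't shown above.

```python
import math
from collections import Counter
from statistics import pstdev

def reliability_score(ratings: list, sentiments: list,
                      avg_sbert_confidence: float) -> float:
    """Sketch of the Reliability Score formula (0-10 scale)."""
    counts = Counter(sentiments)
    total = sum(counts.values())
    # Shannon entropy (bits) of the sentiment label distribution;
    # 0 when all reviews agree, log2(3) when the three labels are uniform.
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return 10 * (
        0.30 * math.log(1 + len(ratings)) / math.log(501)   # volume
        + 0.25 * (1 - pstdev(ratings) / 2.5)                # consistency
        + 0.25 * (1 - entropy / math.log2(3))               # agreement
        + 0.20 * avg_sbert_confidence                       # confidence
    )
```

A unanimous, high-confidence review set scores high even at low volume; disagreement in either ratings or sentiment labels pulls the score down.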
```json
{
  "course_title": "Python for Everybody",
  "platform": "coursera",
  "rating": 4.8,
  "avg_rating": 4.72,
  "price": 0.0,
  "num_subscribers": 500000,
  "num_reviews": 12500,
  "review_count_analyzed": 500,
  "positive_ratio": 0.84,
  "trend_score": 8.37,
  "reliability_score": 7.92,
  "avg_sentiment_confidence": 0.71,
  "sentiment_distribution": {"positive": 420, "neutral": 55, "negative": 25},
  "analyzed_at": "2025-04-25T10:30:00Z"
}
```

Same as above, but includes a `"reviews"` array where each review has:
- `review_clean`: NLP-cleaned text
- `review_original`: original text
- `rating`: numeric rating
- `sentiment_label`: `"positive"` / `"neutral"` / `"negative"`
- `sentiment_score`: SBERT confidence float 0–1
| Issue | Fix |
|---|---|
| `hadoop` command not found locally | Run `python job_d_sbert_analysis.py` inside the namenode container, or install Hadoop 3.2.1 locally and set `HADOOP_HOME` |
| SBERT model download fails | Set the `HF_ENDPOINT` env var or pre-download: `python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('all-MiniLM-L6-v2')"` |
| OOM on large review sets | Reduce `MAX_REVIEWS` in `job_d_sbert_analysis.py` (default 500) |
| MongoDB connection refused | Confirm `docker-compose up -d` ran and port 27017 is free |
| PySpark can't reach HDFS | Ensure the `namenode` hostname resolves – add `127.0.0.1 namenode` to `/etc/hosts` |