pri-2711/courseAnalyzer_dataPreparation
Course Analyzer – Extended Pipeline (SBERT + Trend/Reliability Scores)

This is the extended version of the original courseAnalyzer project. It adds a full AI analysis layer on top of the existing Hadoop/PySpark/MapReduce pipeline:

| New feature | Script |
|---|---|
| SBERT sentiment (positive / neutral / negative) | job_d_sbert_analysis.py |
| Trend Score (0–10) | inside job_d_sbert_analysis.py |
| Reliability Score (0–10) | inside job_d_sbert_analysis.py |
| Enriched MongoDB loader | load_analyzed_to_mongodb.py |
| Pretty result viewer | view_results.py |

Architecture

Raw CSVs (HDFS)
      │
      ▼
Job A – PySpark: clean courses         → hdfs://.../processed/courses/
Job B – PySpark: clean reviews         → hdfs://.../processed/reviews/
Job C – PySpark: clean labeled data    → hdfs://.../processed/labeled/
      │
      ▼
MapReduce Join (mrjob)                  → hdfs://.../final/courses_with_reviews/
      │
      ▼
Job D – SBERT AI Analysis               → hdfs://.../final/courses_analyzed/
  • SBERT sentence embeddings
  • Zero-shot sentiment classification (cosine similarity to anchor phrases)
  • Trend Score  formula (subscribers, review density, sentiment, rating)
  • Reliability Score formula (volume, rating std-dev, sentiment entropy, SBERT confidence)
      │
      ▼
MongoDB  course_analysis
  ├── courses_analyzed      (full docs with per-review sentiments)
  └── course_scores         (flat summary for dashboards)
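The zero-shot sentiment step in Job D embeds each review and compares it against anchor phrases by cosine similarity, as the diagram notes. The mechanics can be sketched in plain Python; the toy 3-d vectors and anchor labels below are stand-ins (in the real job the vectors come from the all-MiniLM-L6-v2 SBERT model):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def classify(review_vec, anchor_vecs):
    """Return (label, score) for the anchor with the highest similarity."""
    scores = {label: cosine(review_vec, vec) for label, vec in anchor_vecs.items()}
    label = max(scores, key=scores.get)
    return label, scores[label]

# Toy 3-d embeddings standing in for SBERT vectors of anchor phrases
anchors = {
    "positive": [0.9, 0.1, 0.0],
    "neutral":  [0.1, 0.9, 0.1],
    "negative": [0.0, 0.1, 0.9],
}
label, score = classify([0.8, 0.2, 0.1], anchors)
```

The winning similarity doubles as the per-review confidence that later feeds the Reliability Score.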

Prerequisites

1 · Software

| Tool | Version | Notes |
|---|---|---|
| Docker Desktop | ≥ 4.x | Runs Hadoop + Spark + MongoDB |
| Docker Compose | ≥ 2.x | Bundled with Docker Desktop |
| Python | 3.9 – 3.11 | 3.12 has minor torch issues |
| pip | latest | python -m pip install --upgrade pip |
| Hadoop CLI (optional) | 3.2.x | Only needed if running hadoop fs locally; otherwise use docker exec |

GPU optional — SBERT runs on CPU. A GPU shortens analysis from ~minutes to seconds for large datasets, but is not required.

2 · Python packages

pip install -r requirements.txt

On a CPU-only machine, install the CPU-only PyTorch wheel first to save disk space:

pip install torch --index-url https://download.pytorch.org/whl/cpu
pip install -r requirements.txt

3 · Docker containers

docker-compose up -d

Wait ~15 seconds for HDFS to format, then verify with docker ps that all containers (namenode, MongoDB, etc.) are up and running.

Running the Pipeline

Step 1 – Upload raw CSVs to HDFS

docker exec -it namenode hadoop fs -mkdir -p /data/raw/courses/
docker exec -it namenode hadoop fs -mkdir -p /data/raw/reviews/
docker exec -it namenode hadoop fs -mkdir -p /data/raw/labeled/
docker exec -it namenode hadoop fs -mkdir -p /data/raw/generic/

# Courses
docker exec -it namenode hadoop fs -put /data/3.1-data-sheet-udemy-courses-business-courses.csv  /data/raw/courses/
docker exec -it namenode hadoop fs -put /data/3.1-data-sheet-udemy-courses-design-courses.csv    /data/raw/courses/
docker exec -it namenode hadoop fs -put /data/3.1-data-sheet-udemy-courses-music-courses.csv     /data/raw/courses/
docker exec -it namenode hadoop fs -put /data/3.1-data-sheet-udemy-courses-web-development.csv   /data/raw/courses/
docker exec -it namenode hadoop fs -put /data/Coursera.csv                                       /data/raw/courses/
docker exec -it namenode hadoop fs -put /data/coursera_course_dataset_v2_no_null.csv            /data/raw/courses/
docker exec -it namenode hadoop fs -put /data/coursera_course_dataset_v3.csv                    /data/raw/courses/
docker exec -it namenode hadoop fs -put /data/edx.csv                                           /data/raw/courses/

# Reviews
docker exec -it namenode hadoop fs -put /data/coursera_reviews.csv    /data/raw/reviews/
docker exec -it namenode hadoop fs -put /data/udemy_reviews.csv        /data/raw/reviews/
docker exec -it namenode hadoop fs -put /data/ratemyprofessors.csv     /data/raw/reviews/

# Labeled
docker exec -it namenode hadoop fs -put /data/coursera_reviews_label_3.csv         /data/raw/labeled/
docker exec -it namenode hadoop fs -put /data/reviews_by_course.csv                /data/raw/labeled/
docker exec -it namenode hadoop fs -put /data/sentiment_sentences_large.csv        /data/raw/labeled/

# Generic
docker exec -it namenode hadoop fs -put "/data/Reviews (2).csv" /data/raw/generic/

The docker-compose.yml mounts ./data into the namenode at /data, so the above paths refer to files inside your local data/ folder.

Step 2 – PySpark cleaning jobs

python job_a_clean_courses.py
python job_b_clean_reviews.py
python job_c_clean_labeled.py

Step 3 – MapReduce join

python mapreduce_join.py -r hadoop \
  hdfs://namenode:9000/data/processed/courses/ \
  hdfs://namenode:9000/data/processed/reviews/ \
  --output-dir hdfs://namenode:9000/data/final/courses_with_reviews/
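Conceptually this is a reduce-side join: both cleaned datasets are keyed on the course, and each reducer attaches a course's reviews to its record. A dependency-free sketch of that logic (the real mapreduce_join.py uses mrjob, and the field names here are illustrative assumptions):

```python
from collections import defaultdict

def reduce_side_join(courses, reviews):
    """Attach each course's reviews to its record.

    courses: dict mapping course_id -> course dict
    reviews: list of review dicts, each carrying a 'course_id' key
    """
    grouped = defaultdict(list)
    for r in reviews:
        grouped[r["course_id"]].append(r)
    # Courses with no reviews get an empty list rather than being dropped
    return {cid: {**c, "reviews": grouped.get(cid, [])}
            for cid, c in courses.items()}

joined = reduce_side_join(
    {"c1": {"title": "Python for Everybody"}, "c2": {"title": "Intro to Design"}},
    [{"course_id": "c1", "rating": 5, "text": "great"}],
)
```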

Step 4 – SBERT AI Analysis (new)

python job_d_sbert_analysis.py

This will:

  1. Download all-MiniLM-L6-v2 from HuggingFace (~80 MB, one-time)
  2. Pull MapReduce output from HDFS
  3. Run SBERT embeddings + sentiment classification
  4. Compute trend_score and reliability_score
  5. Push enriched JSONL back to HDFS

Step 5 – Load original collections to MongoDB

python load_to_mongodb.py

Step 6 – Load SBERT-analyzed data to MongoDB (new)

python load_analyzed_to_mongodb.py

Creates two collections:

  • courses_analyzed – full documents including per-review sentiment labels
  • course_scores – flat score summary for fast querying

Step 7 – View results (new)

# Top 10 by trend score (default)
python view_results.py

# Top 20 by reliability score, Coursera only
python view_results.py --top 20 --sort reliability_score --platform coursera

# Top 15 Udemy courses by trend
python view_results.py --top 15 --platform udemy

Score Formulas

Trend Score (0–10)

Measures how popular and momentum-driven a course is.

trend_score = 10 × (
    0.35 × log10(1 + subscribers) / log10(1_000_001)   # scale of audience
  + 0.20 × reviews_analyzed / max(subscribers, 1)       # engagement ratio
  + 0.25 × positive_review_ratio                        # sentiment signal
  + 0.20 × avg_rating / 5.0                             # quality signal
)
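The formula translates directly into Python. A sketch using the same weights and terms; the final clamp to 0–10 and the 2-decimal rounding are my own sanity guards, not part of the stated formula:

```python
import math

def trend_score(subscribers, reviews_analyzed, positive_ratio, avg_rating):
    """Trend Score (0-10): audience scale, engagement, sentiment, quality."""
    audience   = math.log10(1 + subscribers) / math.log10(1_000_001)
    engagement = reviews_analyzed / max(subscribers, 1)
    score = 10 * (0.35 * audience
                  + 0.20 * engagement
                  + 0.25 * positive_ratio
                  + 0.20 * avg_rating / 5.0)
    return round(min(max(score, 0.0), 10.0), 2)

# Hypothetical course: 120k subscribers, 400 analyzed reviews,
# 75% positive sentiment, 4.5 average rating
score = trend_score(120_000, 400, 0.75, 4.5)
```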

Reliability Score (0–10)

Measures how trustworthy and consistent the review set is.

reliability_score = 10 × (
    0.30 × log(1 + review_count) / log(501)             # volume → credibility
  + 0.25 × (1 − std_dev(ratings) / 2.5)                 # rating consistency
  + 0.25 × (1 − entropy(sentiments) / log2(3))          # sentiment agreement
  + 0.20 × avg_sbert_confidence                         # model confidence
)
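Likewise for reliability. In this sketch the std-dev ratio is capped at 1 so a wildly inconsistent rating set cannot drive the term negative (that cap is my assumption; the formula above leaves it unstated), and entropy is Shannon entropy over the three sentiment labels:

```python
import math
from statistics import pstdev

def reliability_score(ratings, sentiments, avg_confidence):
    """Reliability Score (0-10): volume, consistency, agreement, confidence.

    ratings: list of numeric ratings; sentiments: list of label strings;
    avg_confidence: mean SBERT similarity score in [0, 1].
    """
    volume = math.log(1 + len(ratings)) / math.log(501)
    consistency = 1 - min(pstdev(ratings) / 2.5, 1.0)
    # Shannon entropy (bits) of the sentiment label distribution
    counts = {s: sentiments.count(s) for s in set(sentiments)}
    probs = [c / len(sentiments) for c in counts.values()]
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    agreement = 1 - entropy / math.log2(3)
    score = 10 * (0.30 * volume + 0.25 * consistency
                  + 0.25 * agreement + 0.20 * avg_confidence)
    return round(min(max(score, 0.0), 10.0), 2)
```

A perfectly uniform review set of 500 identical ratings with unanimous sentiment and full model confidence maxes every term, so it scores 10.0.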

MongoDB Collections Reference

course_scores (flat, fast queries)

{
  "course_title":           "Python for Everybody",
  "platform":               "coursera",
  "rating":                 4.8,
  "avg_rating":             4.72,
  "price":                  0.0,
  "num_subscribers":        500000,
  "num_reviews":            12500,
  "review_count_analyzed":  500,
  "positive_ratio":         0.84,
  "trend_score":            8.37,
  "reliability_score":      7.92,
  "avg_sentiment_confidence": 0.71,
  "sentiment_distribution": {"positive": 420, "neutral": 55, "negative": 25},
  "analyzed_at":            "2025-04-25T10:30:00Z"
}

courses_analyzed (full, with per-review sentiments)

Same as above but includes a "reviews" array where each review has:

  • review_clean : NLP-cleaned text
  • review_original : original text
  • rating : numeric rating
  • sentiment_label : "positive" / "neutral" / "negative"
  • sentiment_score : SBERT confidence float 0–1

Troubleshooting

| Issue | Fix |
|---|---|
| hadoop command not found locally | Run python job_d_sbert_analysis.py inside the namenode container, or install Hadoop 3.2.1 locally and set HADOOP_HOME |
| SBERT model download fails | Set the HF_ENDPOINT env var, or pre-download: python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('all-MiniLM-L6-v2')" |
| OOM on large review sets | Reduce MAX_REVIEWS in job_d_sbert_analysis.py (default 500) |
| MongoDB connection refused | Confirm docker-compose up -d ran and port 27017 is free |
| PySpark can't reach HDFS | Ensure the namenode hostname resolves: add 127.0.0.1 namenode to /etc/hosts |

About

Prepares the dataset with big-data tooling (PySpark, HDFS, MapReduce) and analyzes it with SBERT.
