Sentio+

Overview

Sentio+ is an AI-powered decision-support platform that transforms large-scale, unstructured customer review data into actionable business insights using a Retrieval-Augmented Generation (RAG) architecture. It is designed as an internal intelligence tool for Product, CX, Strategy, and Leadership teams to understand why customers feel the way they do and what actions should be taken as a result.

Unlike traditional sentiment dashboards that stop at positive vs. negative classification, Sentio+ enables aspect-level reasoning over customer feedback, grounding every insight directly in real review evidence.

Core Problem Being Solved

Most sentiment analysis systems answer what customers feel, but fail to explain:

Why customers are dissatisfied
Which product aspects (e.g., usability, payments, performance, pricing) are driving ratings
What teams should fix first to improve outcomes

Sentio+ addresses this gap by combining:

Structured sentiment signals (ratings, categories, segments)
Semantic retrieval over raw review text
LLM-based synthesis grounded in retrieved evidence

This allows teams to ask business-critical questions such as:

"What usability issues are driving 1-star reviews in Finance apps?"
"How have payment-related complaints evolved over the last 6 months?"
"What features do users in the 'Everyone' content segment value most?"

The system translates unstructured feedback into decision-ready insights that inform product prioritization, roadmap planning, and customer experience improvements.

High-Level Architecture

Sentio+ follows a modular Retrieval-Augmented Generation (RAG) architecture designed for scalability, traceability, and business interpretability. The system is structured to ensure that every generated insight is grounded in real customer evidence and aligned with decision-making needs.

Architecture Flow

Data → Embeddings → Retrieval → LLM Reasoning → Business Insight

Design Principles

Evidence-first answers (no hallucinated insights)
Clear separation of preprocessing, retrieval, and generation
Metadata-aware retrieval for precise filtering and analysis

Tech Stack

Data & Preprocessing

Python for ETL and preprocessing logic
Pandas / NumPy for data cleaning and transformation
Jupyter Notebooks (/notebooks) for exploration, validation, and iterative development
KaggleHub for dataset retrieval

Vector Storage & Retrieval

ChromaDB for vector persistence and semantic search
Metadata-aware indexing (category, rating, date, segment)
Cosine similarity–based retrieval
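
As a rough illustration of what cosine similarity-based top-K retrieval computes, here is a minimal brute-force sketch in plain Python. The function names and the toy 3-dimensional vectors are assumptions made for this example; in the actual system ChromaDB performs the search internally with an index rather than a full scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=3):
    """Return (doc_id, score) pairs for the k most similar documents."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings"; real embedding vectors have hundreds of dimensions.
docs = {
    "r1": [1.0, 0.0, 0.0],
    "r2": [0.0, 1.0, 0.0],
    "r3": [1.0, 1.0, 0.0],
}
results = top_k([1.0, 0.0, 0.0], docs, k=2)
```

Because cosine similarity ignores vector length, a long review and a short review about the same topic can score equally well against a query.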

LLM & RAG Layer

AWS Bedrock LLMs for grounded text generation
Retrieval-Augmented Generation (RAG) with top-K semantic search
Two-step reasoning: retrieval → synthesis

Backend

Python RAG services for embedding, retrieval, and response assembly
API-based LLM invocation

Frontend

Next.js web application (/web)
Chat-style interface for natural language queries

Storage & Infrastructure

Amazon S3 for raw and processed dataset storage
Local execution for development with cloud-based model services

Data

Sentio+ is powered by the Google Play Store Reviews dataset.

Source: Kaggle – Google Play Market Reviews
Scale: ~1M reviews across ~500 app titles (subset used during prototyping)
Sampling: 50,000 reviews drawn from the full dataset
Key Fields:
- Review text
- Star rating (1–5)
- App category
- Review date
- Content rating (Everyone, Teen, etc.)

Purpose: Enable fine-grained analysis of customer sentiment, feature requests, and recurring pain points across app categories.

Key Innovation: Hybrid Stratified Signal Sampling

Rather than relying on naive random sampling, Sentio+ implements a hybrid stratified sampling strategy to maximize signal quality.

Breadth (Coverage): Reviews are balanced across all categories (Finance, Social, Productivity, etc.) and ratings (1-5 stars)
Depth (Signal Quality): Within each category/rating bucket, the sampler prioritizes:
- Long reviews (>150 characters) for detailed evidence
- Helpful reviews (high helpful_count) for peer-vetted insights
- Recent reviews (~60% from last 12 months) for current relevance

The resulting dataset treats each review as high-information testimony, avoiding low-signal noise such as one-word feedback ("Good app"). This gives downstream retrieval and LLM reasoning more detailed evidence to work with.
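
The sampling strategy described above could be sketched with Pandas roughly as follows. This is an illustrative sketch, not the project's actual ETL code: the column names (`category`, `rating`, `review_text`, `review_date`), the `per_bucket` parameter, and the exact ranking order are assumptions; only `helpful_count`, the 150-character threshold, and the ~60% recency target come from the description.

```python
import pandas as pd


def hybrid_stratified_sample(df, per_bucket, min_chars=150,
                             recent_share=0.6, now=None):
    """Pick up to `per_bucket` reviews from every (category, rating) bucket."""
    now = pd.Timestamp.now() if now is None else pd.Timestamp(now)
    cutoff = now - pd.DateOffset(months=12)
    n_recent = round(per_bucket * recent_share)

    # Depth: rank candidates so long, helpful, newer reviews come first.
    ranked = df.assign(
        is_long=df["review_text"].str.len() > min_chars,
        is_recent=df["review_date"] >= cutoff,
    ).sort_values(["is_long", "helpful_count", "review_date"], ascending=False)

    picked = []
    # Breadth: every category/rating bucket gets its own quota.
    for _, bucket in ranked.groupby(["category", "rating"], sort=False):
        recent = bucket[bucket["is_recent"]].head(n_recent)
        # Older reviews fill the remaining slots (more if recent ones are scarce).
        older = bucket[~bucket["is_recent"]].head(per_bucket - len(recent))
        picked.append(pd.concat([recent, older]))
    return pd.concat(picked).drop(columns=["is_long", "is_recent"])


long_text = "x" * 200  # stands in for a detailed review
reviews = pd.DataFrame({
    "review_id": list("abcdefgh"),
    "category": ["Finance"] * 7 + ["Social"],
    "rating": [1] * 7 + [5],
    "review_text": [long_text, "Good app", long_text, long_text,
                    long_text, "Bad", long_text, long_text],
    "helpful_count": [10, 50, 5, 1, 7, 0, 3, 2],
    "review_date": pd.to_datetime([
        "2024-09-01", "2024-10-01", "2024-06-01", "2024-03-01",
        "2022-01-01", "2021-01-01", "2020-05-01", "2024-11-01",
    ]),
})
sample = hybrid_stratified_sample(reviews, per_bucket=5, now="2025-01-01")
```

In this toy run the short "Good app" review is passed over even though it has the highest `helpful_count`, because length is ranked ahead of helpfulness. Whether length or helpfulness should dominate is a design choice the sketch makes arbitrarily.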

RAG Architecture

Phase 1: Preprocessing & Indexing

Merge review data (apps_reviews.csv) with app metadata (apps_info.csv)
Clean categories, filter for quality (length, helpfulness)
Create enriched text with context headers:

[APP: Google Wallet | CAT: Finance | RATING: 1/5 | DATE: 2024-09 | SEGMENT: Everyone] 
USER REVIEW: The payment gateway keeps timing out.

Load into ChromaDB with dual metadata (structured fields + enriched text)
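
A minimal sketch of how the enriched text and its accompanying metadata might be assembled before loading. The dictionary keys (`review_id`, `app`, `segment`, and so on) and both function names are hypothetical; only the header layout is taken from the example above.

```python
METADATA_FIELDS = ("app", "category", "rating", "date", "segment")


def build_enriched_text(review):
    """Prefix the raw review with a context header in the format shown above."""
    header = (
        f"[APP: {review['app']} | CAT: {review['category']} | "
        f"RATING: {review['rating']}/5 | DATE: {review['date']} | "
        f"SEGMENT: {review['segment']}]"
    )
    return f"{header}\nUSER REVIEW: {review['text']}"


def to_record(review):
    """Split a review into the (id, document, metadata) triple a vector store expects."""
    metadata = {field: review[field] for field in METADATA_FIELDS}
    return review["review_id"], build_enriched_text(review), metadata


review = {
    "review_id": "rev-001",  # hypothetical identifier
    "app": "Google Wallet",
    "category": "Finance",
    "rating": 1,
    "date": "2024-09",
    "segment": "Everyone",
    "text": "The payment gateway keeps timing out.",
}
record_id, document, metadata = to_record(review)
# With a ChromaDB collection, records like this would typically be loaded via
# collection.add(ids=[record_id], documents=[document], metadatas=[metadata]).
```

Keeping the same fields both in the header (so the embedding sees them) and in the metadata (so filters can use them) is one reading of "dual metadata".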

Phase 2: Query & Retrieval

User asks a natural language question
System performs semantic search in ChromaDB (with optional hard filters by category, date, rating)
Rerank results by helpfulness/recency
Retrieve top-K most relevant review chunks
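
The reranking step could look something like the sketch below, which blends semantic similarity with helpfulness and recency. The weights, the 180-day half-life, and the hit fields (`similarity`, `helpful_count`, `date`) are all assumptions chosen for illustration; the source does not specify how the signals are combined.

```python
import math
from datetime import date


def rerank(hits, today, k=5, w_sim=0.6, w_help=0.25, w_recency=0.15,
           half_life_days=180):
    """Reorder retrieved hits by a weighted mix of similarity, helpfulness, recency."""
    # Log-scale helpfulness so a few viral reviews do not dominate,
    # then normalize to [0, 1] within this batch of hits.
    max_log = max((math.log1p(h["helpful_count"]) for h in hits), default=0.0) or 1.0

    def score(hit):
        helpfulness = math.log1p(hit["helpful_count"]) / max_log
        age_days = (today - hit["date"]).days
        recency = 0.5 ** (age_days / half_life_days)  # halves every half_life_days
        return w_sim * hit["similarity"] + w_help * helpfulness + w_recency * recency

    return sorted(hits, key=score, reverse=True)[:k]


hits = [
    {"review_id": "rev-A", "similarity": 0.80, "helpful_count": 0, "date": date(2023, 1, 1)},
    {"review_id": "rev-B", "similarity": 0.78, "helpful_count": 40, "date": date(2024, 9, 1)},
    {"review_id": "rev-C", "similarity": 0.60, "helpful_count": 2, "date": date(2024, 10, 1)},
]
reranked = rerank(hits, today=date(2024, 11, 1), k=2)
```

Here the old, unvoted review with the best raw similarity drops out of the top 2, while a slightly less similar but recent and widely upvoted review moves to first place.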

Phase 3: Generation & Grounding

LLM synthesizes insights from retrieved reviews
Response includes citations linking back to specific review_ids
UI displays original review excerpts as evidence
Users can drill down to full review context
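
To make the grounding step concrete, the following sketch builds a prompt that exposes each review's `review_id` to the model and then checks that a generated answer cites only reviews that were actually retrieved. The prompt wording, function names, and bracketed citation format are assumptions for this example; the call to the LLM itself is omitted.

```python
import re


def build_grounded_prompt(question, reviews):
    """Assemble a prompt that restricts the model to the retrieved evidence."""
    evidence = "\n".join(f"[{r['review_id']}] {r['text']}" for r in reviews)
    return (
        "Answer the question using only the customer reviews below.\n"
        "Cite the review_id in square brackets after every claim.\n"
        "If the reviews do not contain the answer, say so.\n\n"
        f"Reviews:\n{evidence}\n\n"
        f"Question: {question}"
    )


def unsupported_citations(answer, reviews):
    """Return cited ids that do not correspond to any retrieved review."""
    cited = set(re.findall(r"\[([^\[\]]+)\]", answer))
    return cited - {r["review_id"] for r in reviews}


retrieved = [
    {"review_id": "rev-001", "text": "The payment gateway keeps timing out during checkout"},
    {"review_id": "rev-002", "text": "After the update, I can't log in anymore"},
]
prompt = build_grounded_prompt("Why are 1-star reviews increasing?", retrieved)

# A hand-written stand-in for a model response, with one bad citation.
answer = "Payments time out at checkout [rev-001]. Logins fail after the update [rev-999]."
bad = unsupported_citations(answer, retrieved)
```

A check like this only verifies that citations point at real retrieved reviews; it does not verify that the cited review actually supports the claim.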

Key Capabilities

Large-scale review ingestion via S3
Aspect-level sentiment reasoning
Metadata-aware semantic search
Evidence-grounded natural language insights
Trend detection across time and categories
Business-ready summaries for non-technical stakeholders

Example Walkthrough

User Question

"Why are 1-star reviews increasing for Finance apps in the last 6 months?"

Retrieval Step

Filters applied: category = Finance, rating = 1, date >= last 6 months
Top-K reviews retrieved mentioning payment failures, login issues, and crashes
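
The filters in this step could be expressed as a ChromaDB-style `where` clause, sketched below. The metadata field names and the helper functions are hypothetical. In particular, `date_key` (an integer in YYYYMM form) is an assumption made so that the date bound can be written as a numeric `$gte` comparison.

```python
from datetime import date


def months_ago(today, months):
    """First day of the month that lies `months` months before `today`."""
    index = today.year * 12 + (today.month - 1) - months
    return date(index // 12, index % 12 + 1, 1)


def build_where(category=None, rating=None, since=None):
    """Combine the optional hard filters into a single metadata filter."""
    clauses = []
    if category is not None:
        clauses.append({"category": {"$eq": category}})
    if rating is not None:
        clauses.append({"rating": {"$eq": rating}})
    if since is not None:
        # Hypothetical integer field, e.g. 202405 for May 2024.
        clauses.append({"date_key": {"$gte": since.year * 100 + since.month}})
    if not clauses:
        return None  # no filtering
    if len(clauses) == 1:
        return clauses[0]  # a lone condition does not need an $and wrapper
    return {"$and": clauses}


since = months_ago(date(2024, 11, 15), 6)
where = build_where(category="Finance", rating=1, since=since)
```

The resulting dictionary would then be passed alongside the question, for example as `collection.query(query_texts=[question], n_results=10, where=where)`.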

LLM Synthesis Output

"Recent 1-star reviews in Finance apps are primarily driven by payment gateway timeouts and authentication failures following recent updates. Multiple users report being unable to complete transactions, leading to trust and reliability concerns."

Evidence (Cited Reviews)

Review A (2024-09): "The payment gateway keeps timing out during checkout"
Review B (2024-10): "After the update, I can't log in anymore"

This ensures every insight is explainable and auditable, because each claim can be traced back to the reviews that support it.

Example Queries

"What are the most common reasons for 1-star reviews in Finance apps?"
"Which features are users requesting most in the last quarter?"
"How do Teen-rated app complaints differ from Everyone-rated apps?"
"What issues are driving churn-related feedback this year?"

Business Use Cases

Product Teams

Identify top recurring bugs and UX pain points
Prioritize features based on real customer impact

Strategy & Leadership

Detect systemic issues across product lines
Inform roadmap and investment decisions

Customer Experience (CX)

Understand root causes of negative sentiment
Track shifts in customer perception over time

Setup & Local Development

1. Clone the repository

git clone https://github.com/Carlomos7/sentio-plus.git

This project requires Python 3.13 or higher and Docker.

2. Set up Python environment and dependencies

3. Download dataset via KaggleHub

4. Run preprocessing notebooks/ETL scripts

5. Start ChromaDB and index embeddings

6. Launch backend services

7. Run Next.js frontend

Evaluation & Quality Considerations

While Sentio+ prioritizes decision support over raw accuracy metrics, quality is evaluated through:

Retrieval relevance: Are returned reviews actually answering the question?
Groundedness: Are all claims supported by cited evidence?
Business usefulness: Do insights translate into clear action items?
Consistency: Do similar queries yield stable themes over time?

Future work may introduce quantitative evaluation (e.g., retrieval precision@K, human-in-the-loop validation).
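
For reference, precision@K is the fraction of the top K retrieved reviews that a human judged relevant. A minimal sketch, with hypothetical review ids and relevance labels:

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved ids that appear in the relevant set."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    top = retrieved_ids[:k]
    hits = sum(1 for review_id in top if review_id in relevant_ids)
    # Dividing by k (not len(top)) penalizes returning fewer than k results.
    return hits / k


retrieved = ["r1", "r7", "r3", "r9", "r4"]  # ranked retrieval output
relevant = {"r1", "r3", "r4", "r8"}         # human-labeled relevant reviews
p_at_5 = precision_at_k(retrieved, relevant, k=5)
p_at_2 = precision_at_k(retrieved, relevant, k=2)
```

In this example three of the top five results are relevant, so precision@5 is 0.6, while precision@2 is 0.5.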

Project Positioning: Use Cases

Sentio+ is intentionally designed as a consulting-grade internal analytics tool, not a consumer chatbot. Its primary value lies in converting raw customer feedback into strategic, explainable insights that organizations can act on with confidence.

Product Prioritization: "What are the top 3 recurring bugs users want fixed in Finance apps?"
Competitive Analysis: "How do user complaints about subscription pricing compare across categories?"
Roadmap Planning: "What features are users requesting most in the last quarter?"
CX Improvements: "Why are 1-star reviews spiking for our top apps?"
Audience Insights: "What do Teen-rated app users complain about vs Everyone-rated apps?"

Contributors

This project was originally developed collaboratively in Google Colab by:

Carlos • Kyle • Chenchen • Jeffrey • Marlon • Hayden
