
Esabelle11/WhyBuddy


WhyBuddy — A Curious AI Learning Companion for Kids

An End-to-End Multimodal Educational AI System Powered by Gemma 4.


📌 Abstract

WhyBuddy is an AI-powered educational assistant for children under 8 years old. It turns “why” questions into structured learning experiences by combining natural language reasoning, knowledge retrieval, image generation, text-to-speech, and video synthesis in a single pipeline.

At its core, WhyBuddy uses Gemma 4 (E2B instruction-tuned model) for child-friendly reasoning and explanation generation, paired with Stable Diffusion (DreamShaper-7) for visual storytelling and Edge TTS for narration. The system outputs a complete learning artifact: a narrated educational video with step-by-step visual slides.

Reference: Hackathon Writeup

Video Introduction

Watch Demo


🧩 System Architecture Overview

WhyBuddy is built as a modular multimodal pipeline with six core layers:

1. 🧠 Reasoning Layer (Gemma 4)

  • Model: google/gemma-4-e2b-it (loaded via Kaggle Hub and the Hugging Face AutoProcessor)
  • Role: Converts raw child questions into structured explanations.
  • Generates:
    • TEACHER_ANSWER
    • IMAGE_PROMPTS (7–10 visual scene prompts)
  • Guardrails: Gemma is guided using a carefully engineered SYSTEM_PROMPT that enforces child-safe language, a structured output format, and a strict separation of explanations and visuals.

2. 🔍 Knowledge Retrieval Layer (Wikipedia Tool)

To improve factual grounding and reduce hallucinations, WhyBuddy integrates a lightweight retrieval system:

  • Uses wikipedia.search() + wikipedia.summary().
  • Activated dynamically via a router module.
  • Injects real-time factual context directly into the Gemma prompts.
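
The retrieval step above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names `fetch_wikipedia_context` and `build_grounded_prompt`, the three-result limit, and the prompt wording are all assumptions. The search and summary functions are passed in so the sketch also runs without network access; in practice they would be `wikipedia.search` and `wikipedia.summary` from the `wikipedia` package.

```python
def fetch_wikipedia_context(question, search_fn, summary_fn, sentences=3):
    """Return a short summary for the first usable search hit, or "" on failure.

    search_fn and summary_fn are expected to behave like wikipedia.search and
    wikipedia.summary (hypothetical wiring; the real project may differ).
    """
    try:
        titles = search_fn(question, results=3)
    except Exception:
        return ""  # retrieval is best-effort; fall back to no context
    for title in titles:
        try:
            return summary_fn(title, sentences=sentences)
        except Exception:
            # e.g. a disambiguation or missing-page error: try the next title
            continue
    return ""


def build_grounded_prompt(question, context):
    """Prepend retrieved facts to the question (prompt wording is an assumption)."""
    if not context:
        return question
    return f"Use these facts when answering:\n{context}\n\nQuestion: {question}"
```

With the real library installed, a call would look like `fetch_wikipedia_context(q, wikipedia.search, wikipedia.summary)`. Returning an empty string on any failure lets the pipeline continue without grounding rather than stopping.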

3. 🧭 Intent Router (Task Classification Module)

A simple but effective rule-based router classifies user input to dynamically modify the prompt structure before sending it to Gemma:

| Mode | Trigger Condition | Action / Prompt Modification |
| --- | --- | --- |
| Math Mode | Detects mathematical symbols (+, -, *, /, =) | Steers Gemma toward a step-by-step mathematical breakdown. |
| Knowledge Mode | Detects interrogative keywords (what, why, how, explain) | Activates the Wikipedia retrieval layer for grounding. |
| Tutor Mode | Default state | Maintains a conversational, interactive learning tone. |
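
A rule-based router of this kind might look like the sketch below. The function name `route`, the mode labels, and the order of the checks (math first, then knowledge) are assumptions made for illustration; the source does not state which rule wins when both match.

```python
import re

MATH_SYMBOLS = set("+-*/=")
KNOWLEDGE_KEYWORDS = {"what", "why", "how", "explain"}


def route(user_input: str) -> str:
    """Classify input as 'math', 'knowledge', or 'tutor' (hypothetical labels)."""
    # Math Mode: any arithmetic symbol in the text.
    if any(ch in MATH_SYMBOLS for ch in user_input):
        return "math"
    # Knowledge Mode: match whole words so "show" does not trigger "how".
    words = set(re.findall(r"[a-z]+", user_input.lower()))
    if words & KNOWLEDGE_KEYWORDS:
        return "knowledge"
    # Tutor Mode: default state.
    return "tutor"
```

One caveat of a symbol check this simple is that a hyphen in ordinary text (for example "well-known") also counts as a math symbol, so a production router would likely need a stricter pattern.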

4. 🎨 Visual Generation Layer (Stable Diffusion)

  • Model: DreamShaper-7
  • Purpose: Converts text prompts into engaging, cartoon-style learning visuals.
  • Pipeline: Each generated IMAGE_PROMPTS sentence is automatically prefixed with "Cartoon style of..." and fed into Stable Diffusion.
  • Output: 7–10 sequential, slide-based educational images per question to maintain concept-to-visual alignment.
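
As a rough sketch of the prompt-prefixing step, the helpers below build the Stable Diffusion prompts and run them one at a time. The names `build_sd_prompts` and `generate_slides` are hypothetical. `pipe` is assumed to behave like a diffusers text-to-image pipeline, that is, callable with a prompt and `num_inference_steps` and returning an object with an `images` list; the 25-step default follows the figure quoted in the performance notes of this README.

```python
def build_sd_prompts(image_prompts):
    """Prefix each non-empty scene prompt with the cartoon-style instruction."""
    return [f"Cartoon style of {p.strip()}" for p in image_prompts if p.strip()]


def generate_slides(pipe, image_prompts, steps=25):
    """Generate one image per prompt, sequentially, and return them in order.

    pipe is assumed to follow the diffusers pipeline calling convention.
    """
    images = []
    for prompt in build_sd_prompts(image_prompts):
        result = pipe(prompt, num_inference_steps=steps)
        images.append(result.images[0])
    return images
```

Keeping the loop sequential matches the controlled, one-image-at-a-time execution described elsewhere in this document and keeps the slide order aligned with the prompt order.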

5. 🔊 Audio Layer (Edge TTS)

  • Engine: edge-tts (en-US-GuyNeural)
  • Purpose: Converts the TEACHER_ANSWER into natural, human-like narration.
  • Sanitization: Includes text cleaning protocols such as emoji removal, unicode normalization, and punctuation filtering to ensure seamless audio playback.
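
The sanitization step could be approximated as below. This is a sketch under assumptions: the function name `clean_for_tts` and the exact set of punctuation kept are illustrative choices, not the project's actual rules.

```python
import re
import unicodedata

# Punctuation that helps the narrator pause naturally (an assumed whitelist).
ALLOWED_PUNCTUATION = set(".,!?'-")


def clean_for_tts(text: str) -> str:
    """Normalize unicode, drop emoji and stray symbols, collapse whitespace."""
    text = text.replace("\u2019", "'")          # curly apostrophe -> ASCII
    text = unicodedata.normalize("NFKC", text)  # fold ligatures, full-width forms
    kept = []
    for ch in text:
        category = unicodedata.category(ch)
        if category[0] in ("L", "N") or ch.isspace():
            kept.append(ch)                     # letters, digits, whitespace
        elif ch in ALLOWED_PUNCTUATION:
            kept.append(ch)
        # everything else (emoji, symbols, control characters) is dropped
    return re.sub(r"\s+", " ", "".join(kept)).strip()
```

The cleaned string would then be handed to the TTS engine with the `en-US-GuyNeural` voice named above.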

6. 🎬 Video Composition Layer (MoviePy)

WhyBuddy stitches the assets together into a full educational video:

  1. Images are generated via Stable Diffusion and converted to ImageClip elements.
  2. Each slide is assigned a fixed duration (3–4 seconds per slide).
  3. Audio generated by Edge TTS is overlaid.
  4. Final Output: 🎥 final_{uuid}.mp4
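
The timing and naming steps can be illustrated with plain Python, independent of any video library. The sketch assumes that the per-slide duration is the narration length divided evenly across slides and then clamped to the 3–4 second range; the source only states the range, so the even split is an assumption, as are the helper names and the use of `uuid4().hex` in the file name.

```python
import uuid

MIN_SLIDE_SECONDS = 3.0
MAX_SLIDE_SECONDS = 4.0


def slide_durations(audio_seconds: float, n_slides: int) -> list:
    """Spread the narration evenly over the slides, clamped to 3-4 s each."""
    if n_slides <= 0:
        return []
    per_slide = audio_seconds / n_slides
    per_slide = max(MIN_SLIDE_SECONDS, min(MAX_SLIDE_SECONDS, per_slide))
    return [per_slide] * n_slides


def output_filename() -> str:
    """Build a unique name following the final_{uuid}.mp4 pattern."""
    return f"final_{uuid.uuid4().hex}.mp4"
```

Each duration would then be applied to the matching ImageClip before concatenation. Note that with a hard 4-second cap, a long narration can outlast the slides, so a real implementation would need to decide whether to trim the audio or hold the last slide.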

🚀 Getting Started

Prerequisites

Make sure you have Python 3.8+ installed on your system.

Installation

  1. Clone the repository:
    git clone https://github.com/Esabelle11/WhyBuddy.git
    cd WhyBuddy
  2. Install dependencies (first run only):
    pip install -r requirements.txt
  3. Run the project with Streamlit:
    streamlit run app.py
  4. Stop the app by pressing Ctrl + C in the terminal.

🧠 How Gemma 4 Is Used (Core Innovation)

Gemma 4 serves as the central orchestration engine of WhyBuddy, driving three key elements:

🌟 Structured Educational Output

Gemma is instructed to always produce the same fixed, labeled blocks so that downstream stages can parse the output predictably:

TEACHER_ANSWER: [Child-friendly explanation text goes here]
IMAGE_PROMPTS: [Prompt 1 | Prompt 2 | Prompt 3...]
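
A parser for this two-block format might look like the following sketch. The function name `parse_gemma_output` and the fallback behavior (treat everything as the answer when the markers are missing) are assumptions; the project's own parser may handle malformed output differently.

```python
import re

_BLOCK_PATTERN = re.compile(
    r"TEACHER_ANSWER:\s*(.*?)\s*IMAGE_PROMPTS:\s*(.*)", flags=re.DOTALL
)


def parse_gemma_output(raw: str):
    """Split model output into (teacher_answer, list_of_image_prompts)."""
    match = _BLOCK_PATTERN.search(raw)
    if match:
        answer, prompts_blob = match.group(1), match.group(2)
    else:
        # Fallback: split on the second marker if present, else keep all text.
        answer, _, prompts_blob = raw.partition("IMAGE_PROMPTS:")
        answer = answer.replace("TEACHER_ANSWER:", "")
    # Prompts are pipe-separated; strip whitespace and optional brackets.
    prompts = [p.strip(" []\n\t") for p in prompts_blob.split("|")]
    prompts = [p for p in prompts if p]
    return answer.strip(" []\n\t"), prompts
```

For example, an output with three pipe-separated prompts yields a three-element list, while free-form text with no markers comes back as the answer with an empty prompt list, so later stages can detect the failure.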

🌟 Child-Adapted Reasoning

Through strict system prompt engineering, Gemma is instructed to:

  • Simplify vocabulary drastically.
  • Utilize relatable analogies (e.g., toys, food, school, animals).
  • Avoid technical jargon while remaining factually accurate.
  • Maintain an encouraging, safe, and warm tone.

🌟 Multi-Stage Depth Adaptation

WhyBuddy dynamically tracks follow-up questions using an internal state counter (chat_state["depth"]). This allows Gemma to scale its explanation depth fluidly:

  • Level 0: Normal, comprehensive child-friendly explanation.
  • Level 1: Simpler language with shorter sentence structures.
  • Level 2: Centered heavily around a single, real-world analogy.
  • Level 3: Ultra-simple, high-engagement toddler explanation.
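
The depth counter could be wired up along these lines. Only the `chat_state["depth"]` key and the four levels come from this README; the instruction strings, the helper names, and the choice to cap the counter at level 3 are assumptions for illustration.

```python
MAX_DEPTH = 3

# Instruction text per level (paraphrased from the list above; wording assumed).
DEPTH_INSTRUCTIONS = {
    0: "Give a normal, comprehensive child-friendly explanation.",
    1: "Use simpler language and shorter sentences.",
    2: "Explain using one real-world analogy.",
    3: "Give an ultra-simple explanation suitable for a toddler.",
}


def new_chat_state() -> dict:
    """Start a conversation at depth 0."""
    return {"depth": 0}


def register_follow_up(chat_state: dict) -> str:
    """Advance one level per follow-up, capped at MAX_DEPTH, and return the
    instruction to append to the prompt."""
    chat_state["depth"] = min(chat_state["depth"] + 1, MAX_DEPTH)
    return DEPTH_INSTRUCTIONS[chat_state["depth"]]
```

Capping the counter means repeated follow-ups keep returning the simplest explanation instead of raising a lookup error.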

⚙️ Key Engineering Challenges & Solutions

  1. Model Loading Complexity (Gemma + SD)
  • Problem: Multi-gigabyte models caused initialization timeouts and memory failures in constrained Kaggle notebook environments.
  • Solution: Implemented explicit Kaggle Hub downloading (kagglehub.model_download) paired with conditional fallback error handling and automated GPU/VRAM capability detection.
  2. Structured Output Parsing
  • Problem: LLMs can occasionally output unstructured or inconsistent text, breaking regex patterns in downstream multimedia layers.
  • Solution: Enforced strict JSON/marker formats via the SYSTEM_PROMPT, combined with a custom fallback parser that splits content on predefined delimiter boundaries.
  3. Multimedia Synchronization
  • Problem: Aligning variable-length audio narration with static generated visual slides.
  • Solution: Applied a fixed per-slide timing strategy (3–4 seconds per image scene) matched against the total narration length, then concatenated and exported the clips via MoviePy.
  4. Performance Bottlenecks
  • Problem: Stable Diffusion inference is slow, which delays each response to the user.
  • Solution: Reduced inference to 25 steps, used mixed-precision GPU acceleration (torch.float16), and kept a sequential, controlled execution pipeline.
  5. Child Safety & Content Control
  • Problem: Risk of generating complex, scary, or unsafe edge-case text or images.
  • Solution: Embedded strict safety rules directly within the SYSTEM_PROMPT, alongside automated emoji, unicode, and profanity sanitization filters.
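
The loading safeguards from the first item can be sketched as two small helpers. This is an illustration under assumptions: the helper names are hypothetical, the model handle is left as a parameter because the source does not give it, and the float32-on-CPU fallback is a common convention rather than something the README states. `kagglehub.model_download` and `torch.cuda.is_available` are the only library calls used.

```python
def pick_device_and_dtype():
    """Return ("cuda", "float16") when a GPU is usable, else ("cpu", "float32")."""
    try:
        import torch
    except ImportError:
        return "cpu", "float32"
    if torch.cuda.is_available():
        return "cuda", "float16"
    return "cpu", "float32"


def download_model(handle, downloader=None):
    """Download a model and return its local path, or None if that fails.

    downloader defaults to kagglehub.model_download; it is injectable so the
    fallback path can be exercised without network access.
    """
    try:
        if downloader is None:
            import kagglehub
            downloader = kagglehub.model_download
        return downloader(handle)
    except Exception as exc:
        print(f"Model download failed for {handle!r}: {exc}")
        return None
```

A caller can check for `None` and show a friendly error in the UI instead of crashing during startup.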

💡 Why These Technical Choices Are Correct

  • Wikipedia Retrieval: Reduces hallucination risk by grounding explanations in retrieved reference text.
  • Gemma 4: Strong instruction following and reliable structured output compared with other lightweight open-weights models.
  • Stable Diffusion (DreamShaper): Translates complex, abstract ideas into engaging visual learning cues suited to young children.
  • Edge TTS: A lightweight, fast API that requires no local compute while delivering natural-sounding narration.
  • MoviePy: Provides a reliable programmatic framework for stitching media together without a heavy external video rendering engine.

🚀 Final Output Capability

WhyBuddy produces a completely autonomous, AI-generated learning experience containing:

  • 📝 Explanation Text tailored to the child's exact age and curiosity level (Gemma 4).
  • 🎨 Visual Storybook Slides matching the narrative context (Stable Diffusion).
  • 🔊 Voice Narration with clear, friendly pronunciation (Edge TTS).
  • 🎬 An Educational Video File ready for playback (.mp4 via MoviePy).

🏁 Conclusion

WhyBuddy demonstrates how Gemma 4 can serve as the reasoning core of a complex, multimodal educational system. By connecting LLM reasoning, retrieval augmentation, image generation, speech synthesis, and video composition, WhyBuddy showcases full-stack AI system engineering for real-world interactive learning.
