WhyBuddy — A Curious AI Learning Companion for Kids

An End-to-End Multimodal Educational AI System Powered by Gemma 4.

📌 Abstract

WhyBuddy is an AI-powered educational assistant designed for children under 8 years old that transforms “why” questions into structured learning experiences. It combines natural language reasoning, knowledge retrieval, image generation, text-to-speech, and video synthesis into a unified pipeline.

At its core, WhyBuddy leverages Gemma 4 (E2B instruction-tuned model) for child-friendly reasoning and explanation generation, paired with Stable Diffusion (DreamShaper-7) for visual storytelling and Edge TTS for narration. The system outputs a complete learning artifact: a narrated educational video with step-by-step visual slides.

Reference: Hackathon Writeup

Video Introduction

🧩 System Architecture Overview

WhyBuddy is built as a modular multimodal pipeline with six core layers:

1. 🧠 Reasoning Layer (Gemma 4)

Model: google/gemma-4-e2b-it (Loaded via Kaggle Hub + HuggingFace AutoProcessor)
Role: Converts raw child questions into structured explanations.
Generates:
- TEACHER_ANSWER
- IMAGE_PROMPTS (7-10 visual scene prompts)
Guardrails: Gemma is guided using a carefully engineered SYSTEM_PROMPT that enforces child-safe language, a structured output format, and a strict separation of explanations and visuals.

2. 🔍 Knowledge Retrieval Layer (Wikipedia Tool)

To improve factual grounding and reduce hallucinations, WhyBuddy integrates a lightweight retrieval system:

Uses wikipedia.search() + wikipedia.summary().
Activated dynamically via a router module.
Injects real-time factual context directly into the Gemma prompts.
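The retrieval step could look roughly like the sketch below. It assumes the `wikipedia` PyPI package, which provides `wikipedia.search()` and `wikipedia.summary()`. The helper names `fetch_context` and `build_prompt`, the two-sentence limit, the prompt layout, and the injectable `search`/`summary` parameters are illustrative assumptions, not taken from the project code.

```python
from typing import Callable, List, Optional


def fetch_context(
    question: str,
    search: Optional[Callable[[str], List[str]]] = None,
    summary: Optional[Callable[..., str]] = None,
    sentences: int = 2,
) -> str:
    """Return a short factual snippet for the question, or "" if none is found."""
    if search is None or summary is None:
        import wikipedia  # imported lazily so the helper can be exercised offline

        search = search or wikipedia.search
        summary = summary or wikipedia.summary
    try:
        titles = search(question)
        if not titles:
            return ""
        # Use the top search hit and keep the summary short for a small model.
        return summary(titles[0], sentences=sentences)
    except Exception:
        # Disambiguation pages, missing pages, and network errors all degrade
        # to "no context" instead of stopping the pipeline.
        return ""


def build_prompt(question: str, context: str) -> str:
    """Prepend retrieved facts to the question (hypothetical prompt layout)."""
    if not context:
        return question
    return f"Context: {context}\n\nQuestion: {question}"
```

Returning an empty string on failure keeps the rest of the pipeline running when Wikipedia has no matching page; the question is then sent to Gemma without extra context.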

3. 🧭 Intent Router (Task Classification Module)

A simple but effective rule-based router classifies user input to dynamically modify the prompt structure before sending it to Gemma:

| Mode | Trigger Condition | Action / Prompt Modification |
| --- | --- | --- |
| Math Mode | Detects mathematical symbols (`+`, `-`, `*`, `/`, `=`) | Steers Gemma toward step-by-step mathematical breakdown. |
| Knowledge Mode | Detects interrogative keywords (`what`, `why`, `how`, `explain`) | Activates the Wikipedia retrieval layer for grounding. |
| Tutor Mode | Default state | Maintains a conversational, interactive learning tone. |
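A rule-based router of this kind can be sketched in a few lines. The function name `route`, the returned mode labels, and the order of the checks (math before knowledge) are assumptions made for illustration; the source does not say which mode wins when both conditions match.

```python
import re

MATH_SYMBOLS = set("+-*/=")
# Word boundaries stop "show" from matching the keyword "how".
KNOWLEDGE_WORDS = re.compile(r"\b(what|why|how|explain)\b", re.IGNORECASE)


def route(user_input: str) -> str:
    """Classify input as "math", "knowledge", or "tutor"."""
    # Assumption: math is checked first, so "What is 2 + 2?" is routed to math.
    if any(ch in MATH_SYMBOLS for ch in user_input):
        return "math"
    if KNOWLEDGE_WORDS.search(user_input):
        return "knowledge"
    return "tutor"
```

One caveat of pure symbol matching is that a hyphenated word such as "ice-cream" also triggers Math Mode, so a real router may want a stricter pattern such as digits on both sides of the operator.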

4. 🎨 Visual Generation Layer (Stable Diffusion)

Model: DreamShaper-7
Purpose: Converts text prompts into engaging, cartoon-style learning visuals.
Pipeline: Each generated IMAGE_PROMPTS sentence is automatically prefixed with "Cartoon style of..." and fed into Stable Diffusion.
Output: 7–10 sequential, slide-based educational images per question to maintain concept-to-visual alignment.
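The prompt preparation described above might be sketched as follows. The helper name `style_prompts` and the decision to cap the list at 10 entries are assumptions; only the "Cartoon style of" prefix and the 7–10 slide range come from the description.

```python
from typing import List

STYLE_PREFIX = "Cartoon style of"
MAX_SLIDES = 10  # upper end of the 7-10 slide range


def style_prompts(image_prompts: List[str]) -> List[str]:
    """Drop empty prompts, cap the count, and add the cartoon style prefix."""
    cleaned = [p.strip() for p in image_prompts if p.strip()]
    return [f"{STYLE_PREFIX} {p}" for p in cleaned[:MAX_SLIDES]]
```

Each returned string would then be passed to the Stable Diffusion pipeline as one slide prompt.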

5. 🔊 Audio Layer (Edge TTS)

Engine: edge-tts (en-US-GuyNeural)
Purpose: Converts the TEACHER_ANSWER into natural, human-like narration.
Sanitization: Includes text cleaning protocols such as emoji removal, unicode normalization, and punctuation filtering to ensure seamless audio playback.
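A minimal sketch of such a cleaning step is shown below, using only the standard library. The function name `clean_for_tts` and the exact set of punctuation that is kept are assumptions; the project's real filter may differ.

```python
import re
import unicodedata

# Punctuation that helps the voice pause naturally; everything else is dropped.
_ALLOWED_PUNCT = ".,!?'-"


def clean_for_tts(text: str) -> str:
    """Normalize unicode, drop emoji and stray symbols, collapse whitespace."""
    # NFKC folds compatibility characters (e.g. full-width letters) to plain forms.
    text = unicodedata.normalize("NFKC", text)
    # Keep contractions readable by mapping the curly apostrophe to a plain one.
    text = text.replace("\u2019", "'")
    kept = []
    for ch in text:
        if ch.isalnum() or ch.isspace() or ch in _ALLOWED_PUNCT:
            kept.append(ch)
        else:
            kept.append(" ")  # emoji, markdown markers, and other symbols
    return re.sub(r"\s+", " ", "".join(kept)).strip()
```

For example, `clean_for_tts("The sky is blue! 🌈✨")` returns `"The sky is blue!"`, which can then be handed to the TTS engine.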

6. 🎬 Video Composition Layer (MoviePy)

WhyBuddy stitches the assets together into a full educational video:

Images are generated via Stable Diffusion and converted to ImageClip elements.
Each slide is assigned a fixed duration (3–4 seconds per slide).
Audio generated by Edge TTS is overlaid.
Final Output: 🎥 final_{uuid}.mp4
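The timing and naming steps above can be illustrated with a small sketch. The source does not say how a value within the 3–4 second range is chosen, so spreading the narration length evenly over the slides and clamping the result to that range is an assumption, as are the helper names and the use of `uuid4().hex` for the file name.

```python
import uuid
from typing import List

MIN_SECONDS, MAX_SECONDS = 3.0, 4.0


def slide_durations(audio_seconds: float, n_slides: int) -> List[float]:
    """Spread the narration over the slides, clamped to 3-4 seconds each."""
    if n_slides <= 0:
        return []
    per_slide = audio_seconds / n_slides
    per_slide = max(MIN_SECONDS, min(MAX_SECONDS, per_slide))
    return [per_slide] * n_slides


def output_name() -> str:
    """Build a unique output file name of the form final_{uuid}.mp4."""
    return f"final_{uuid.uuid4().hex}.mp4"
```

With a 28-second narration and 8 slides, each slide would be shown for 3.5 seconds. When the narration is much longer or shorter than the slide total, the clamp means audio and video lengths will not match exactly, which a real implementation would have to handle.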

🚀 Getting Started

Prerequisites

Make sure you have Python 3.8+ installed on your system.

Installation

Clone the repository:
```
git clone https://github.com/Esabelle11/WhyBuddy.git
cd WhyBuddy
```

Install dependencies (first run only):
```
pip install -r requirements.txt
```
Run the project with Streamlit:
```
streamlit run app.py
```
To stop the app, press Ctrl+C in the terminal.

🧠 How Gemma 4 Is Used (Core Innovation)

Gemma 4 serves as the central orchestration engine of WhyBuddy, driving three key elements:

🌟 Structured Educational Output

Gemma is constrained to always produce two fixed, labeled blocks so that downstream parsing is predictable:

TEACHER_ANSWER: [Child-friendly explanation text goes here]
IMAGE_PROMPTS: [Prompt 1 | Prompt 2 | Prompt 3...]
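A parser for this two-block format might look like the sketch below. The function name `parse_response` and the handling of optional square brackets are assumptions; the `|` separator follows the template shown above.

```python
import re
from typing import List, Tuple

_PATTERN = re.compile(
    r"TEACHER_ANSWER:\s*(?P<answer>.*?)\s*IMAGE_PROMPTS:\s*(?P<prompts>.*)",
    re.DOTALL,
)


def parse_response(raw: str) -> Tuple[str, List[str]]:
    """Split a model response into the explanation and the list of image prompts."""
    match = _PATTERN.search(raw)
    if match is None:
        raise ValueError("response is missing TEACHER_ANSWER or IMAGE_PROMPTS")
    answer = match.group("answer").strip().strip("[]").strip()
    prompt_text = match.group("prompts").strip().strip("[]")
    prompts = [p.strip() for p in prompt_text.split("|") if p.strip()]
    return answer, prompts
```

Raising `ValueError` when a marker is missing gives the caller a clear signal to retry the generation or switch to a more lenient parser.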

🌟 Child-Adapted Reasoning

Through strict system prompt engineering, Gemma is forced to:

Simplify vocabulary drastically.
Utilize relatable analogies (e.g., toys, food, school, animals).
Avoid technical jargon while remaining factually accurate.
Maintain an encouraging, safe, and warm tone.

🌟 Multi-Stage Depth Adaptation

WhyBuddy dynamically tracks follow-up questions using an internal state counter (chat_state["depth"]). This allows Gemma to scale its explanation depth fluidly:

Level 0: Normal, comprehensive child-friendly explanation.
Level 1: Simpler language with shorter sentence structures.
Level 2: Centered heavily around a single, real-world analogy.
Level 3: Ultra-simple, high-engagement toddler explanation.
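The depth counter could be sketched as below. Only `chat_state["depth"]` and the four levels come from the description; the instruction wording, the cap at level 3, the reset on a new topic, and the `is_follow_up` flag are illustrative assumptions.

```python
from typing import Dict

# Hypothetical instruction text appended to the prompt for each depth level.
DEPTH_INSTRUCTIONS = {
    0: "Give a normal, complete child-friendly explanation.",
    1: "Use simpler words and shorter sentences.",
    2: "Explain using one real-world analogy.",
    3: "Give an ultra-simple explanation a toddler could follow.",
}
MAX_DEPTH = 3


def new_chat_state() -> Dict[str, int]:
    """Create the state for a fresh conversation."""
    return {"depth": 0}


def next_depth_instruction(chat_state: Dict[str, int], is_follow_up: bool) -> str:
    """Advance the depth on a follow-up, reset it on a new topic."""
    if is_follow_up:
        chat_state["depth"] = min(chat_state["depth"] + 1, MAX_DEPTH)
    else:
        chat_state["depth"] = 0
    return DEPTH_INSTRUCTIONS[chat_state["depth"]]
```

Capping the counter means a child who keeps asking "why?" stays at the simplest level instead of running past the defined range.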

⚙️ Key Engineering Challenges & Solutions

Model Loading Complexity (Gemma + SD)

Problem: Multi-gigabyte models caused initialization timeouts and memory failures in constrained Kaggle notebook environments.
Solution: Implemented explicit Kaggle Hub downloading (kagglehub.model_download) paired with conditional fallback error handling and automated GPU/VRAM capability detection.
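One way to structure the download-with-fallback and device detection is sketched here. `kagglehub.model_download` and `torch.cuda.is_available` are real calls; the helper names, the `fallback` callback, and the injectable `downloader` parameter are assumptions, and the model handle is left to the caller because the source does not give it.

```python
from typing import Callable, Optional


def pick_device() -> str:
    """Return "cuda" when a GPU is usable, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def download_model(
    handle: str,
    downloader: Optional[Callable[[str], str]] = None,
    fallback: Optional[Callable[[str], str]] = None,
) -> str:
    """Download a model and return its local path, trying a fallback on failure."""
    if downloader is None:
        import kagglehub  # imported lazily; only needed for real downloads

        downloader = kagglehub.model_download
    try:
        return downloader(handle)
    except Exception as err:
        if fallback is None:
            raise RuntimeError(f"could not download {handle}") from err
        # Hypothetical second source, e.g. a local cache or another hub.
        return fallback(handle)
```

Passing the downloader as a parameter keeps the error handling testable without network access or a multi-gigabyte download.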

Structured Output Parsing

Problem: LLMs can occasionally output unstructured or inconsistent text, breaking regex patterns in downstream multimedia layers.
Solution: Enforced strict JSON/Marker formats via the SYSTEM_PROMPT combined with a robust custom fallback parser that splits content cleanly on pre-defined delimiter boundaries.
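A lenient fallback of the kind described could be sketched as follows. It assumes the `TEACHER_ANSWER:` and `IMAGE_PROMPTS:` markers and treats both `|` and newlines as prompt delimiters; the function name `fallback_parse` and the choice to return an empty prompt list when the second marker is missing are illustrative.

```python
from typing import List, Tuple


def fallback_parse(raw: str) -> Tuple[str, List[str]]:
    """Best-effort split of a loosely formatted response; never raises."""
    head, marker, tail = raw.partition("IMAGE_PROMPTS:")
    answer = head.replace("TEACHER_ANSWER:", "").strip()
    if not marker:
        # No prompt block at all: treat the whole text as the explanation.
        return answer, []
    # Accept pipe-separated prompts as well as one prompt per line.
    pieces = tail.replace("\n", "|").split("|")
    prompts = [p.strip(" -*[]\t") for p in pieces]
    return answer, [p for p in prompts if p]
```

Because the function always returns something usable, the narration can still be produced when the image prompts are missing, and the caller can decide whether to skip the slides or regenerate.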

Multimedia Synchronization

Problem: Aligning dynamic audio narration length with static generated visual slides.
Solution: Applied a fixed per-slide timing strategy (3–4 seconds per image scene) synchronized against the total narrative length, cleanly concatenated and exported via MoviePy.

Performance Bottlenecks

Problem: Stable Diffusion inference scales poorly and slows down user experience loops.
Solution: Optimized generation by reducing inference to 25 steps, utilizing mixed-precision GPU acceleration (torch.float16), and maintaining a sequential, controlled execution pipeline.
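The settings above might translate into a sketch like the one below, using the Hugging Face `diffusers` library. The repository id `Lykon/dreamshaper-7`, the helper names, and the float32 fallback on CPU are assumptions; the 25 steps, float16 precision, and sequential generation come from the description.

```python
from typing import Any, Dict, List

MODEL_ID = "Lykon/dreamshaper-7"  # assumed Hugging Face repo id for DreamShaper-7


def generation_kwargs(steps: int = 25) -> Dict[str, Any]:
    """Per-call settings; fewer steps trade some detail for speed."""
    return {"num_inference_steps": steps}


def generate_images(prompts: List[str], steps: int = 25) -> list:
    """Generate one image per prompt, sequentially, on GPU when available."""
    # Heavy imports are kept inside the function so the module loads quickly.
    import torch
    from diffusers import StableDiffusionPipeline

    use_gpu = torch.cuda.is_available()
    # float16 halves memory use on GPU; CPUs generally need float32.
    dtype = torch.float16 if use_gpu else torch.float32
    pipe = StableDiffusionPipeline.from_pretrained(MODEL_ID, torch_dtype=dtype)
    pipe = pipe.to("cuda" if use_gpu else "cpu")

    images = []
    for prompt in prompts:  # sequential to keep peak VRAM predictable
        images.append(pipe(prompt, **generation_kwargs(steps)).images[0])
    return images
```

Loading the pipeline once and reusing it for all slides avoids paying the model-loading cost per image.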

Child Safety & Content Control

Problem: Risks of generating complex, scary, or unsafe edge-case text/images.
Solution: Embedded a rigid safety verification layer directly within the SYSTEM_PROMPT alongside automated emoji, unicode, and profanity sanitization filters.

💡 Why These Technical Choices Are Correct

Wikipedia Retrieval: Drastically reduces hallucination risk and anchors abstract definitions in verified reality.
Gemma 4: Exceptional instruction-following capabilities and structured parsing reliability compared to alternative lightweight open-weights models.
Stable Diffusion (DreamShaper): Successfully translates complex, abstract ideas into highly engaging visual learning cues optimized for early childhood psychology.
Edge TTS: Extremely lightweight, fast API requiring zero local resource overhead while delivering human-grade vocal inflection.
MoviePy: Provides a reliable programmatic multimedia stitching framework without requiring external heavy video rendering engines.

🚀 Final Output Capability

WhyBuddy produces a completely autonomous, AI-generated learning experience containing:

📝 Explanation Text tailored to the child's exact age and curiosity level (Gemma 4).
🎨 Visual Storybook Slides matching the narrative context (Stable Diffusion).
🔊 Voice Narration with clear, friendly pronunciation (Edge TTS).
🎬 An Educational Video File ready for playback (.mp4 via MoviePy).

🏁 Conclusion

WhyBuddy demonstrates how Gemma 4 can serve as the reliable reasoning core of a complex, multimodal educational system. By bridging the gap between LLM reasoning, retrieval augmentation, image generation, speech synthesis, and video composition, WhyBuddy showcases full-stack AI system engineering built for real-world interactive learning.
